Convert from HTML to XML with HTML Tidy

For few days I was involved with WSO2 Mashup Server 2.0 release documentation, giving a hand to the mashup team. Documentation is a painful task, but when comes to open source what matters mostly is documentation :D.
Last night I had to convert a bunch of html files (some Java Api Docs) to xml in-order to port into maven site. Formatting 30+ html files to xml !@#$%^&*@% :D. So I was googleing for a tool to automate the task. With few clicks here and there I found a nice article in Big Blue‘s developer works site, a tool called “Tidy“. When I tried to download and use I figure out that you can straight away apt-get the package and use. So,

sudo apt-get install tidy

and your box is now equiped with the tool, and can be accessed via the shell.

tidy -asxhtml -numeric  index.xml

but who wants to convert file by file when you have such a nice tool, so I spent few minutes in writing a tiny shell script to get the job done, the snippet is,

#!/bin/bash
for file in $(find $1 -type f -iname '*.html'); do
	myf=`echo $file | sed 's/html/xml/g'`
	tidy -asxhtml -numeric  $myf
done

All looked good, worked fine. However in my Api Docs I had, had few special tags, custom to our Mashup Apis (<imconfig>, <yahoo>, <mail:config>). Tidy gave error for these files since the tags are not recognized.

In such a case you can train Tidy for new tags, by adding few lines to the tidy configuration file. (/etc/tidy.config – You can also give your own config file at the prompt)

new-pre-tags: imconfig, yahoo, msn, aim, icq, jabber, username, password

There are whole bunch of tweeks you can do with tidy, [1], [2] and [3] are some useful links that you can read up when using the tool.

[1] : http://www.ibm.com/developerworks/library/x-tiptidy.html
[2] : http://tidy.sourceforge.net/
[3] : http://tidy.sourceforge.net/docs/tidy_man.html

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s