For the past few days I have been giving the Mashup team a hand with the WSO2 Mashup Server 2.0 release documentation. Documentation is a painful task, but when it comes to open source, documentation is what matters most :D.
Last night I had to convert a bunch of HTML files (some Java API docs) to XML in order to port them into the Maven site. Formatting 30+ HTML files to XML by hand !@#$%^&*@% :D. So I went googling for a tool to automate the task. A few clicks later I found a nice article on Big Blue's developerWorks site about a tool called “Tidy”. When I went to download it, I figured out that you can simply apt-get the package and use it. So,
sudo apt-get install tidy
and your box is now equipped with the tool, which you can run from the shell.
tidy -asxhtml -numeric < index.html > index.xml
But who wants to convert file by file when you have such a nice tool? So I spent a few minutes writing a tiny shell script to get the job done. The snippet is:
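# Convert every .html file under the directory given as $1
find "$1" -type f -iname '*.html' | while IFS= read -r file; do
    myf="${file%.*}.xml"    # swap only the extension, not every "html" in the path
    tidy -asxhtml -numeric < "$file" > "$myf"
done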
All looked good and worked fine. However, my API docs had a few special tags custom to our Mashup APIs (<imconfig>, <yahoo>, <mail:config>). Tidy reported errors for these files since it does not recognize the tags.
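To spot which files Tidy rejects in a batch run, the exit status can be checked per file: Tidy exits with 0 when all is well, 1 when there were warnings and 2 when there were errors. The following is only a rough sketch of that idea; the function name convert_tree and the "FAILED:" report format are my own inventions, not part of Tidy.

```shell
# Sketch: convert every .html file under a directory and report the
# files Tidy rejects. convert_tree is a hypothetical helper name.
# Variables are plain (global) to stay POSIX sh compatible.
convert_tree() {
    dir=$1
    failed=0
    # Keep the file list in a temp file so the loop runs in the current
    # shell and the failure counter survives the loop.
    list=$(mktemp)
    find "$dir" -type f -iname '*.html' > "$list"
    while IFS= read -r file; do
        out="${file%.*}.xml"
        status=0
        tidy -quiet -asxhtml -numeric < "$file" > "$out" || status=$?
        # Exit status 2 means Tidy hit errors (for example unknown tags).
        if [ "$status" -ge 2 ]; then
            echo "FAILED: $file" >&2
            failed=$((failed + 1))
        fi
    done < "$list"
    rm -f "$list"
    # Succeed only if no file failed.
    [ "$failed" -eq 0 ]
}
```

Calling convert_tree with the docs directory as its only argument prints one FAILED line per rejected file on stderr and returns non-zero if there was at least one.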
In such a case you can teach Tidy the new tags by adding a few lines to the Tidy configuration file (/etc/tidy.config; you can also pass your own config file on the command line with -config):
new-pre-tags: imconfig, yahoo, msn, aim, icq, jabber, username, password
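If you would rather not touch the system-wide file, a per-project config file works too. Here is a minimal sketch, assuming a file named mashup-tidy.conf (a name I made up for illustration). The output-xhtml and numeric-entities options are the config-file equivalents of the -asxhtml and -numeric switches used earlier.

```shell
# Sketch: write a project-local Tidy config instead of editing /etc.
# mashup-tidy.conf is a hypothetical file name.
# output-xhtml / numeric-entities mirror -asxhtml / -numeric;
# new-pre-tags declares the custom Mashup API tags.
cat > mashup-tidy.conf <<'EOF'
output-xhtml: yes
numeric-entities: yes
new-pre-tags: imconfig, yahoo, msn, aim, icq, jabber, username, password
EOF
```

The file can then be handed to Tidy with the -config option, for example: tidy -config mashup-tidy.conf < index.html > index.xml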
There are a whole bunch of tweaks you can do with Tidy. [1], [2] and [3] are some useful links to read up on when using the tool.
[1] : http://www.ibm.com/developerworks/library/x-tiptidy.html
[2] : http://tidy.sourceforge.net/
[3] : http://tidy.sourceforge.net/docs/tidy_man.html