Web Scraping & Parsing HTML to XML in JavaScript

Today I was working on a customer POC and happened to create a few Google gadgets to visualize selected data sets from *.gov.uk sites. The scenario I implemented combined inter-gadget communication with content search over data.gov.uk sites. I created three simple gadgets which communicate with each other; one acted as the controlling gadget and pushed the search parameters to the other two. The two content gadgets showed UK (1) primary school information and (2) electoral information. The pushed parameter was the postal code of different parts of the UK. The direct.gov.uk site has a form-based implementation of this search.

The requirements for the POC were simple, and we already had working samples of such a scenario in the WSO2 library.

  1. Show how one gadget can pass context to other gadgets
  2. Show how gadgets can harvest data in various formats (in my previous post I explained how to get data from RDF endpoints, which are also available on *.gov.uk sites)
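The first requirement can be pictured with a minimal sketch of the publish/subscribe idea behind it. This is not the Gadgets API: `createHub`, the `"postcode"` channel name and the two subscriber labels are hypothetical names used only for illustration, and a plain in-memory hub stands in for whatever messaging the gadget container provides.

```javascript
// Hypothetical in-memory hub standing in for the container's gadget-to-gadget messaging.
function createHub() {
  const subscribers = new Map(); // channel name -> array of callbacks

  return {
    // Register a callback that runs whenever a message is published on the channel.
    subscribe(channel, callback) {
      if (!subscribers.has(channel)) {
        subscribers.set(channel, []);
      }
      subscribers.get(channel).push(callback);
    },
    // Deliver the message to every callback registered on the channel.
    publish(channel, message) {
      (subscribers.get(channel) || []).forEach((callback) => callback(message));
    },
  };
}

const hub = createHub();
const received = [];

// The two content gadgets listen for a postal code (labels are illustrative).
hub.subscribe("postcode", (code) => received.push("schools:" + code));
hub.subscribe("postcode", (code) => received.push("elections:" + code));

// The controlling gadget pushes the search parameter to both of them.
hub.publish("postcode", "SE1 7DU");
```

In this sketch the controlling gadget knows nothing about the content gadgets; it only knows the channel name, which is what lets one parameter drive both views.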

The building block for the implementation was the search URL, which was quite straightforward. For all requests based on postal codes, the direct.gov site responded in the same manner (because of this important fact, automating the process became trivial). For instance, the URL for primary school information retrieval was:

http://local.direct.gov.uk/LDGRedirect/LocationSearch.do?LGSL=13&searchtype=1&LGIL=8&Style=&formsub=t&text=SE1+7DU

Here the "text" parameter changes according to the postal code. So far everything seemed straightforward. However, during implementation, while using the Gadgets API for content retrieval, I ran into problems parsing the text with JavaScript: gadgets.io.makeRequest supports HTML only as text, so the API returned the retrieved HTML document as a string, which made it very hard to process.
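As a small sketch of how that URL can be assembled for any postal code, the helper below keeps every query parameter from the example URL fixed and varies only "text". The function name `buildSchoolSearchUrl` and the constant `SEARCH_BASE` are hypothetical; the parameter values are copied from the URL above, and no claim is made about what LGSL or LGIL mean.

```javascript
// Base of the search URL, taken from the example above.
const SEARCH_BASE = "http://local.direct.gov.uk/LDGRedirect/LocationSearch.do";

// Build the primary school search URL for a given postal code.
// Only the "text" parameter varies; the rest mirror the example URL.
function buildSchoolSearchUrl(postcode) {
  const params = new URLSearchParams({
    LGSL: "13",
    searchtype: "1",
    LGIL: "8",
    Style: "",
    formsub: "t",
    text: postcode.trim(),
  });
  // URLSearchParams uses form encoding, so the space in a postcode becomes "+".
  return SEARCH_BASE + "?" + params.toString();
}

const exampleUrl = buildSchoolSearchUrl("SE1 7DU");
```

Form encoding is used here because the example URL itself encodes the space in the postcode as "+".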

With some thinking and advice, I brought the Mashup Server into the picture and used it to retrieve the data from the gov site and return the result in XML format. With the Mashup Server, web scraping turned out to be a piece of cake. We created a simple mashup using the Scraper host object and captured the result set from the search result page. The mashup code is as follows:

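function search(searchUrl) {
	// The scraper configuration is an inline XML (E4X) literal that interpolates
	// {searchUrl}; its markup is not reproduced here.
	var scraper = new Scraper(
		    {searchUrl}
	);
	return new XMLList(scraper.response);
}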

Finally, the two gadgets made service calls to the mashup service and retrieved the data as an XML object, which made the data processing painless. The final version on the Gadget Server looked quite appealing.

[Image: WSO2 Gadget Server with UK gov data. Caption: Gadget Server look - in the end]

Special thanks go to Ruchira for helping me out with the mashup service 🙂 You can download the gadget code and the mashup service and try the scenario yourself.
