On March 19th, 2011, I attended a conference called The Big Clean here in Prague. It was organised in two cities - Prague (Czech Republic) and Jyväskylä (Finland) - as a pre-event of a global event to be organised later this year.
The Big Clean is a gathering of citizens in cities around the world to access, clean, organise & re-use local government data.
It was a very refreshing day, with quite a lot of new things for me, and I also got to scrape, clean and visualize data with Drupal and ScraperWiki.
The conference had two tracks: in the main room there were talks about useful tools such as ScraperWiki, Google Refine and Google Fusion Tables, the importance of data-driven journalism, and other data-related topics, while a workshop was held in the second room. I was in the main room the whole time, but still managed to write my own scraper (my first Python script ever! ;), feed the data into Drupal and visualize it.
The government data that is not that accessible
We all like APIs and structured, clean data that can be easily parsed; however, governments (at least in the Czech Republic) are quite slow and not very innovative in this area. The data is sitting in a database somewhere - how hard can it be to make it available in some format other than DOC, an HTML website, XLS, etc.? *sigh* That's why there are scraping tools such as ScraperWiki (http://scraperwiki.org).
The data I was working with
The task was to choose some "inaccessible" data and scrape it. It didn't take me long to find some while browsing the website of the Ministry of the Interior of the Czech Republic. I found the Public Collections page, which has information about every collection ever organised in the Czech Republic. I thought it could be quite interesting, especially for the still-running collections, to display this data in one place and, later, on a map. Maybe people are looking to make a donation in their area but don't know about these collections at all.
Give me the data
On the page with public collections there is a "Display all" option, which takes you to a paged list of collections (15 per page) with just a title and a link to each collection's details. Because it was a form that got us here, the URL remained the same, and yet it didn't provide any useful data. We need the detail page - clicking the link to get to it was what I was after. The URL contains a unique id for the collection - http://aplikace.mvcr.cz/seznam-verejnych-sbirek/Detail.aspx?id=2581 - easy enough to replace to get to any collection we want. But wait! It's HTML! There is no option to download the data in any other format.
Scraping HTML data is possible even with Drupal (the Feeds XPath Parser module), but I do like how ScraperWiki works. Basically, if your scraper breaks down (for any reason, most likely because the source changed its output), anyone can pick it up and fix it (the wiki approach), so any other application that depends on the scraper keeps working, and the data stays available globally.
So I wrote my first Python script (I'm not saying it's perfect or finished!).
Every time it runs, it loops over 500 records, requesting the pages with id=$i. I was lucky that each row of the data had a unique id, so it was quite easy to get the data out through selectors like these:
<td style="width: 50%; margin: 0; padding: 0; padding-left: 15px;">
<span id="ctl00_Application_LBL_OSOBA">Název právnické osoby</span>:
<strong><span id="ctl00_Application_LBL_OSOBA_VALUE">"Atma Do"</span></strong>
Simple as that, you can say "I want the value of the element with id ctl00_Application_LBL_OSOBA_VALUE", which is the id of the element holding the value. In PHP, the same thing can be done with DOMXPath or Simple HTML DOM. One other challenge: because there are around 4,100+ pages, the run took too much time and usually expired after about 800 pages, and the script stopped. But you can use internal variables in ScraperWiki to store the last id that was scraped, so when the scraper runs again, it starts from where it finished.
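To illustrate the "grab the value of an element by its id" approach, here is a minimal sketch using only the Python standard library. My actual scraper ran on ScraperWiki (whose own helper library handled saving rows and persisting variables between runs); the parser class and function names below are just illustrative.

```python
# Minimal sketch: pull the text content of elements by their id attribute,
# which is how the collection detail pages were scraped. Stdlib only;
# the real script used ScraperWiki's helpers around this same idea.
from html.parser import HTMLParser


class IdValueParser(HTMLParser):
    """Collects the text content of elements whose id we ask for."""

    def __init__(self, wanted_ids):
        super().__init__()
        self.wanted = set(wanted_ids)
        self.values = {}
        self._current = None  # id of the element we are currently inside

    def handle_starttag(self, tag, attrs):
        attr_id = dict(attrs).get("id")
        if attr_id in self.wanted:
            self._current = attr_id
            self.values[attr_id] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.values[self._current] += data

    def handle_endtag(self, tag):
        self._current = None


def extract_values(html, wanted_ids):
    """Return {id: text} for every wanted id found in the HTML."""
    parser = IdValueParser(wanted_ids)
    parser.feed(html)
    return parser.values


sample = '<strong><span id="ctl00_Application_LBL_OSOBA_VALUE">"Atma Do"</span></strong>'
print(extract_values(sample, ["ctl00_Application_LBL_OSOBA_VALUE"]))
```

The resume-after-timeout pattern mentioned above then just wraps this in a loop: read the last scraped id from a persisted variable at the start of the run, and save the current id after each page.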
So we have just made the dataset available and open to everyone out there, not just to our own application.
Feeding the data into Drupal
The other idea I had was to build a public website that visualizes the dataset on a map and lets everyone browse and search through it too.
The Feeds module is the ultimate tool for getting data into your Drupal installation from almost any source (depending on the parsers available to you). ScraperWiki lets you access your scraped data as JSON or CSV (http://scraperwiki.com/api/1.0/explore/scraperwiki.datastore.sqlite?name...) or download the whole dataset as SQLite. With a CSV or JSON call you can request an exact number of rows, so again you can fetch 500 rows starting from X, map them to your node fields, and then do whatever you like with them in Drupal.
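The "500 rows starting from X" idea is plain limit/offset paging. Here is a small sketch of building such paged requests; the base URL and parameter names are illustrative placeholders, not the exact ScraperWiki API (see the link above for the real endpoint).

```python
# Sketch of limit/offset paging against a JSON datastore API.
# The base URL and query parameter names here are assumptions for
# illustration; consult the actual API explorer for the real ones.
from urllib.parse import urlencode


def page_url(base, limit, offset):
    """Build a URL requesting `limit` rows starting at `offset`."""
    return base + "?" + urlencode({"format": "json", "limit": limit, "offset": offset})


# Walk the dataset 500 rows at a time, as the Feeds importer would:
for offset in range(0, 1500, 500):
    url = page_url("http://example.com/api/datastore", 500, offset)
    print(url)
    # Here you would fetch `url` (e.g. urllib.request.urlopen), parse the
    # JSON body, and map each row onto Drupal node fields via Feeds.
```

In Drupal, the Feeds importer simply points at each such URL in turn, so even a large dataset can be imported without hitting request timeouts.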
I will cover Feeds, Location, GMap and the other Drupal modules I've used in the next part.
Now go ahead, find some government data that is worth scraping, and make it available. :) Feel free to share your scrapers here in the comments.
The Drupal website with public collections can be found here: http://verejnesbirky.dev.atomicant.co.uk