Last week was a busy one for me. I presented on data journalism at the inaugural meeting of the Ottawa chapter of Hacks & Hackers and also did two sessions at the Canadian Association of Journalists convention in Ottawa.
I promised I would post some of the resources. First off, here’s the PowerPoint deck from my talk at Hacks & Hackers. The deck is pretty bare-bones, but it gives you the broad strokes. This guy took great notes of the event and made me sound far more articulate than I really was.
The CAJ session on web-scraping drew a lot of follow-up inquiries. The deck is here.
Not included is the clip from The Social Network I played to begin. It was a scene with the Mark Zuckerberg character web-scraping all the house face book pages at Harvard when he was a student. He uses Wget to extract pictures of other students and, importantly, discusses how each house web page is configured differently, requiring a different approach to scraping it. As far as I know, it’s the only web-scraping scene ever in a film nominated for Best Picture.
And here are the Python scripts I showed that scrape the federal government’s Orders in Council database and the Economic Action Plan website. Feel free to mash these scripts up however you like.
Before you dig in, however, I recommend learning a bit about Python. The O’Reilly books on Python are a good place to start. I bought the one with the giant rat on the cover. Most Chapters or Indigo stores will have these.
I don’t use Ruby, but it might work better for some people. Perl and PHP can also be used for scraping, but if you’re going to learn to code from scratch, choose a modern language.
The two Python scripts I have posted use functions, which are like little subroutines that run within the script. When you’re trying to understand how the scripts work, try reading from the bottom up.
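To show what I mean, here’s a hypothetical skeleton of that structure (this is a made-up sketch, not one of the posted scripts, and the URL and HTML are placeholders):

```python
import re

def fetch_page(url):
    """Retrieve the raw HTML for one page (stubbed out here)."""
    return "<td>Example record</td>"

def extract_records(html):
    """Pull the good stuff out of the HTML with a regular expression."""
    return re.findall(r"<td>(.*?)</td>", html)

def main():
    html = fetch_page("http://example.com/records")
    for record in extract_records(html):
        print(record)

# Start reading here and work upward: main() calls fetch_page(),
# then hands the HTML to extract_records().
main()
```

The functions at the top do the work, but the last line is what kicks everything off, which is why reading bottom-up shows you the overall flow first.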
Also, the functions that extract the good stuff from the HTML use regular expressions to parse the data. Many scraper writers prefer a Python module called Beautiful Soup. It makes parsing easier, but I haven’t learned it yet, and I like the fine-grained control RegEx gives.
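Here’s a minimal taste of the RegEx approach. The HTML snippet is invented for the example, but it’s the kind of table row a government database might serve up:

```python
import re

# Made-up snippet of the kind of HTML a scraper has to chew through.
html = """
<tr><td>2011-0123</td><td>Order approving the appointment</td></tr>
<tr><td>2011-0124</td><td>Order amending the schedule</td></tr>
"""

# One expression grabs both cells of each row. The (.*?) groups are
# non-greedy, so each one stops at the first closing tag it hits.
pattern = re.compile(r"<tr><td>(.*?)</td><td>(.*?)</td></tr>")

rows = pattern.findall(html)
for number, title in rows:
    print(number, "-", title)
```

The payoff of RegEx is exactly that control: you decide character by character what counts as a match, which matters when a site’s HTML is sloppy.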
The scrape routines use Python’s built-in urllib and urllib2 modules to retrieve web pages. You may find you want to use another module called mechanize.
The Firefox plug-in for monitoring traffic between your browser and the web-server you are trying to scrape is called Firebug. It’s really handy for figuring out how the server is handling page requests.
All the documentation for this stuff is online and the software is all open-source.
Fair warning: If you are new to programming, like I am, scraping isn’t easy. To do it well, you need to learn to write computer programs. It takes a lot of trial and error, and every website you scrape is slightly different. Some pages are a cakewalk. But others, like those with .aspx extensions on their pages, can be a huge pain in the ass, requiring you to figure out cookie handling and other nightmares.
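To give you a taste of the cookie side of that nightmare: Python’s standard library can carry session cookies for you between requests (the module is cookielib in Python 2 and http.cookiejar in Python 3). This is just a sketch, not code from my scripts, and example.com stands in for a real .aspx site:

```python
import http.cookiejar
import urllib.request

# Many .aspx sites won't serve results unless you carry session cookies
# from one request to the next. A CookieJar wired into an opener stores
# whatever Set-Cookie headers the server sends back and replays them.
# (In Python 2 the module is called cookielib.)
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar)
)

# Every request made through this opener now sends and stores cookies:
#   response = opener.open("http://example.com/results.aspx")
# The jar starts out empty until a server actually sets something.
print(len(jar))
```

Once the opener exists, you make all your requests through it instead of calling urlopen directly, and the session survives from page to page.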
If you are a journalist with questions about web-scraping and need some help, drop me an email and I can try to talk you through it. And if you want me to come to speak to your group about scraping for journalists or data journalism, let me know. Scrape it to the man.