Friday, 6th May 2011
Today I wrote a small Python program that is likely to be of interest to almost no one else. Still, I like it because it uses modules I've not used much before and has the potential to be very useful, if only I could think of a suitable project. The program goes to Victoria's blog and finds out what she was doing this time last year by parsing the relevant page in her Today section.
The first step is to get the current date and subtract one from the year. The datetime module makes this very straightforward.
    from datetime import date

    today = date.today()
    last_year = today.replace(year=today.year - 1)
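One edge case worth noting: `replace` raises `ValueError` when today is 29 February, because the previous year has no such day. Here is a minimal sketch of one way around that, assuming that falling back to 28 February is acceptable. The helper name `one_year_before` is my own invention and not part of the original program.

```python
from datetime import date


def one_year_before(day):
    """Return the same calendar date one year earlier.

    Hypothetical helper: falls back to 28 February when `day` is
    29 February, since the year before a leap year is never a leap year.
    """
    try:
        return day.replace(year=day.year - 1)
    except ValueError:
        # 29 February is the usual cause of the failure; use the 28th instead.
        return day.replace(year=day.year - 1, day=28)


print(one_year_before(date(2011, 5, 6)))   # 2010-05-06
print(one_year_before(date(2012, 2, 29)))  # 2011-02-28
```

For any other day of the year this behaves exactly like the plain `replace` call.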
The next step is to visit the relevant URL and get the HTML. Naturally, Victoria has given the pages of her blog sensible URLs, which means we can plug the date directly into the URL.
    import urllib

    last_year_url = "http://blog.victoriac.net/today/%s" % last_year
    sock = urllib.urlopen(last_year_url)
    HTML = sock.read()
    sock.close()
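The snippet above uses the Python 2 `urllib` API. Under Python 3, `urlopen` lives in `urllib.request` and returns bytes, so a rough equivalent might look like the sketch below. It assumes the same URL scheme and that the page is UTF-8 encoded; the names `build_today_url` and `fetch_html` are mine, purely for illustration.

```python
from datetime import date
from urllib.request import urlopen

BASE_URL = "http://blog.victoriac.net/today/%s"


def build_today_url(day):
    # A date formats as ISO YYYY-MM-DD, which is what gets plugged into the URL.
    return BASE_URL % day.isoformat()


def fetch_html(url):
    # The context manager closes the connection even if read() fails.
    with urlopen(url) as response:
        # Assumption: the page is UTF-8; undecodable bytes are replaced.
        return response.read().decode("utf-8", errors="replace")


print(build_today_url(date(2010, 5, 6)))
# http://blog.victoriac.net/today/2010-05-06
```

Calling `fetch_html(build_today_url(last_year))` would then stand in for the `sock.read()` step.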
The final step is to parse the HTML, which is a bit more complex. There are various ways to do this, but the most obvious is to use the HTMLParser module. This module is unlike others I've used in that it works by providing a class that you inherit from, overriding the methods you want.
    from HTMLParser import HTMLParser

    class MyParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.isHeader = False    # True while we are inside an <h1> element
            self.today_string = ''   # accumulates the text of the <h1>

        def parse(self, html):
            # Takes the HTML text itself, not a URL.
            self.feed(html)
            self.close()

        def handle_starttag(self, tag, attrs):
            if tag == 'h1':
                self.isHeader = True

        def handle_data(self, data):
            if self.isHeader and data:
                self.today_string += data

        def handle_endtag(self, tag):
            if tag == 'h1':
                self.isHeader = False
The methods we override are those that handle the start tags, the data in between and the end tags. When you call the feed() method, the handle_starttag, handle_data and handle_endtag methods will be called when the relevant parts of the HTML are found. To get the information we want, we just need to find when the <h1> tag starts and record all the data until the </h1> end tag is reached. I record the data in a string called today_string.
Finally we create an instance of our parser and send the HTML to it.
    parser = MyParser()
    parser.parse(HTML)
    print parser.today_string
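For anyone running Python 3, where the module is `html.parser` and `print` is a function, here is a sketch of the same idea. It is fed a made-up HTML snippet rather than the real page, so the sample markup and the class name `HeaderParser` are invented for illustration only.

```python
from html.parser import HTMLParser


class HeaderParser(HTMLParser):
    """Collect the text found inside <h1> elements."""

    def __init__(self):
        super().__init__()
        self.in_header = False   # True between <h1> and </h1>
        self.today_string = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_header = True

    def handle_data(self, data):
        if self.in_header:
            self.today_string += data

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_header = False


# Made-up sample page, not the real blog markup.
sample_html = (
    "<html><body><h1>Voted in the <em>election</em></h1>"
    "<p>More text</p></body></html>"
)

parser = HeaderParser()
parser.feed(sample_html)
parser.close()
print(parser.today_string)  # Voted in the election
```

Because the flag stays set until `</h1>` is seen, text inside nested tags such as `<em>` is captured too, while the paragraph after the heading is ignored.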
It turns out that on this day last year Victoria voted in the UK General Election. I did too and voted for Dr. Evan Harris (who narrowly lost). By coincidence, I named this program Retrospectroscope after seeing Dr. Evan Harris mention it on Twitter. It seemed apt somehow.