Retrospectroscope

Today I wrote a small Python program that is likely to be of interest to almost no one else. Still, I like it because it uses modules I've not used much before and has the potential to be very useful, if only I could think of a suitable project. The program goes to Victoria's blog and finds out what she was doing this time last year by parsing the relevant page in her Today section.

datetime

The first step is to get the current date and subtract one from the year. The datetime module makes this very straightforward.

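from datetime import date

# Note: replace() raises ValueError if today is 29 February,
# because that day does not exist in the previous year.
today = date.today()
last_year = today.replace(year=today.year - 1)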
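
One wrinkle with subtracting a year this way is 29 February, which has no counterpart in the year before. Here is a minimal sketch of one way to cope, assuming we are happy to fall back to 28 February. The function name same_day_last_year is my own invention for illustration and is not part of the datetime module.

```python
from datetime import date


def same_day_last_year(day):
    """Return the same calendar day one year earlier.

    Hypothetical helper: falls back to 28 February when `day` is
    29 February, since that date does not exist in the previous year.
    """
    try:
        return day.replace(year=day.year - 1)
    except ValueError:
        # Only reachable for 29 February of a leap year.
        return day.replace(year=day.year - 1, day=28)


last_year = same_day_last_year(date.today())
```

Whether 28 February or 1 March is the better fallback is a matter of taste. For looking up a blog page, either seems defensible.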

urllib

The next step is to visit the relevant URL and get the HTML. Naturally, Victoria has given the pages of her blog sensible URLs, which means we can plug the date directly into the URL.

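import urllib  # Python 2; in Python 3, urlopen lives in urllib.request

# Formatting a date with %s gives its ISO form, YYYY-MM-DD.
last_year_url = "http://blog.victoriac.net/today/%s" % last_year
sock = urllib.urlopen(last_year_url)
HTML = sock.read()
sock.close()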
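
For anyone on Python 3, the same step might look something like the sketch below. The helper names build_url and fetch_html are hypothetical, and decoding the page as UTF-8 is an assumption on my part rather than something I have checked against the blog.

```python
from datetime import date
from urllib.request import urlopen

BASE_URL = "http://blog.victoriac.net/today/"


def build_url(day):
    """Append the ISO date (YYYY-MM-DD) to the base URL."""
    return BASE_URL + day.isoformat()


def fetch_html(url):
    """Download a page and return it as text.

    Assumes the page is UTF-8; undecodable bytes are replaced
    rather than raising an error.
    """
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling build_url(date(2010, 5, 6)) gives "http://blog.victoriac.net/today/2010-05-06". The with block closes the connection for us, so there is no separate close() call.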

HTMLParser

The final step is to parse the HTML, which is a bit more complex. There are various ways to do this, but the most obvious is to use the HTMLParser module. This module is unlike others I've used in that it provides a class that you inherit from, overriding the methods you want.

from HTMLParser import HTMLParser

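class MyParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.isHeader = False   # True while we are inside an <h1> element
        self.today_string = ''  # accumulated text of the <h1> element

    def parse(self, html):
        # Feed the whole document to the parser, then flush any buffered data.
        self.feed(html)
        self.close()

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            self.isHeader = True

    def handle_data(self, data):
        # Only keep text that appears between <h1> and </h1>.
        if self.isHeader and data:
            self.today_string += data

    def handle_endtag(self, tag):
        if tag == 'h1':
            self.isHeader = False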

The methods we override are those that handle the start tags, the data in between and the end tags. When you call the feed() method, handle_starttag, handle_data and handle_endtag are called as the relevant parts of the HTML are found. To get the information we want, we just need to notice when the <h1> tag starts and record all the data until the </h1> end tag is reached. I record the data in a string called self.today_string.
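
To see the order in which these handlers fire, here is a small sketch that simply logs every call. It is written for Python 3, where the class lives in html.parser. The TracingParser class and the sample HTML string are made up for illustration and have nothing to do with Victoria's actual pages.

```python
from html.parser import HTMLParser


class TracingParser(HTMLParser):
    """Hypothetical parser that records each handler call in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


tracer = TracingParser()
tracer.feed('<h1>Hello <em>world</em></h1><p>Bye</p>')
tracer.close()

for event in tracer.events:
    print(event)
```

Notice that the text inside the <em> element arrives as its own handle_data call while we are still inside the <h1>. That is why the parser above appends to a string rather than assigning to it.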

Finally we create an instance of our parser and send the HTML to it.

parser = MyParser()
parser.parse(HTML)
print parser.today_string

It turns out that on this day last year Victoria voted in the UK General Election. I did too and voted for Dr. Evan Harris (who narrowly lost). By coincidence, I named this program Retrospectroscope after seeing Dr. Evan Harris mention it on Twitter. It seemed apt somehow.
