Sunday, 30th August 2009
Mining the Guardian Data Store
After going to Open Tech in July, I was inspired to attempt something useful with the large volumes of data that various organisations are making available on the web. I decided that the Guardian Data Store was as good a place as any to begin. The Guardian Data Store contains numerous spreadsheets (as Google documents), each holding the raw data behind a story in the paper. You can then do what you will with this data. For example, many people have created their own visualisations (e.g. graphs) of the data and uploaded them to a specific group on Flickr.
I find searching for ways to display large quantities of data in a clear and informative manner pretty interesting, and I have recently been honing my Adobe Illustrator skills as part of my thesis-writing. However, I thought it might be more fun to combine the data from various different stories to see if I could find something novel in the available information (then I could think of a pretty way to display it). Firstly, I needed to identify a common variable between a group of datasets so I could compare them. Countries seemed like a good starting place, as many of the spreadsheets contain information relating to some, if not all, countries. Other options would be counties of the UK, which turn up quite a lot, and time, generally in years. Time data can also be nicely plotted at Timetric, but though I have an account, I have yet to play about there.
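To make the idea of a common variable concrete, here is a minimal Python sketch of lining up two datasets by country. The dataset names and all the numbers are placeholders I have invented for illustration, not values from the Data Store, and the helper `align_on_key` is hypothetical.

```python
# Hypothetical toy data: each dataset maps a country name to one measurement.
# Names and values are placeholders, not real figures from any spreadsheet.
dataset_a = {"France": 10.0, "Japan": 20.0, "Brazil": 30.0, "Kenya": 40.0}
dataset_b = {"France": 1.5, "Japan": 2.5, "Kenya": 3.5, "Peru": 4.5}


def align_on_key(a, b):
    """Return (key, a_value, b_value) rows for keys present in both datasets."""
    shared = sorted(set(a) & set(b))  # only countries that appear in both
    return [(key, a[key], b[key]) for key in shared]


rows = align_on_key(dataset_a, dataset_b)
for country, x, y in rows:
    print(country, x, y)
```

Only countries that appear in both datasets survive the alignment, which matters because many spreadsheets cover some countries but not all. In practice the country names would probably also need normalising (for example "UK" versus "United Kingdom") before they match up.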
Another reason for using country data was that at the Guardian Open Tech talk, one speaker mentioned that someone had combined data about drug use in various countries with the happiness of those countries and found a small positive correlation. I thought that looking for similar, unexpected correlations might be fun. Rather than thinking of different variables to compare between countries, I figured why not compare every variable with every other variable? There is of course a very good reason not to, but I wasn’t going to let that stop me.
The reason not to make such blanket comparisons is that some variables will be correlated with one another purely by chance. Although it’s quite unlikely that any two given independent variables will appear correlated, if you make hundreds of comparisons, as I was intending, the chances are that some of them will show a correlation by coincidence alone. I think Richard Dawkins mentioned in one of his books a study which found that Israeli fighter pilots were significantly more likely to have daughters than sons. Since no one has provided a good scientific reason for this phenomenon, the correlation is thought to be simply random, and the question is why anyone was comparing these two variables in the first place. So with that in mind, any correlations I find ought to be taken with a large grain of salt.
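The size of the problem can be illustrated with a small simulation. This is only a sketch under assumptions of my own: the variables are independent random numbers rather than real data, the figures of 30 variables and 20 countries are arbitrary, and the cut-off of |r| >= 0.5 is an arbitrary threshold rather than a proper significance test. The formula 1 - (1 - alpha)^m for the chance of at least one false positive also assumes the m comparisons are independent, which pairwise comparisons are not strictly, so treat it as a rough guide.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_chance_correlations(n_variables, n_countries, threshold, seed=0):
    """Compare every pair of independent random variables.

    Returns (number of pairs compared, number with |r| >= threshold).
    """
    rng = random.Random(seed)
    columns = [
        [rng.gauss(0, 1) for _ in range(n_countries)]
        for _ in range(n_variables)
    ]
    pairs = list(itertools.combinations(range(n_variables), 2))
    hits = [
        (i, j) for i, j in pairs
        if abs(pearson(columns[i], columns[j])) >= threshold
    ]
    return len(pairs), len(hits)


def chance_of_any_false_positive(n_comparisons, alpha=0.05):
    """Probability of at least one false positive, assuming independent tests."""
    return 1 - (1 - alpha) ** n_comparisons


n_pairs, n_hits = count_chance_correlations(
    n_variables=30, n_countries=20, threshold=0.5
)
print(n_pairs, n_hits, chance_of_any_false_positive(n_pairs))
```

With 30 variables there are already 435 pairs to compare, so at the usual 5% level roughly 22 "significant" results would be expected even if nothing were really related. Any pairs the simulation flags are correlated by coincidence alone, since the numbers were generated independently.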
Since this blog entry is becoming a little large, I’ll split up what I planned to write. The next two entries will therefore be about what I learnt from this project. The first will cover which measurements are correlated with which others. The second, and more important, will cover what I learnt in attempting to write a program to deal with the data from the Guardian Data Store.