Tuesday, 1st September 2009
Some difficulties mining
The first problem I had in using the Guardian Data Store was in collecting together data from different stories. Perhaps naively or perhaps because I was confused by the release of the Guardian API, which is actually unrelated, I though getting all the data relating to countries would be fairly straight forward. However, as far as I can tell, there is no easy way of getting hold of multiple datasets; I had to open the spreadsheets one-by-one and copy the information into text files. It’s possible that the Google Docs API could be used to collect the information, but I don’t know how. Once I’d got several text files, each of one Guardian spreadsheet with some data relating to several countries, I put them in a folder and made a Python script to collect all the information together in some useful format.
I came across the second problem, when I noticed I’d ended up with a lot more countries than I though existed. It turns out that the various spreadsheet at the Guardian Data Store do not use consistent name for countries, so the US for example, might be referred to as US, USA, America, United States or United States of America. Another problem was the occasional misspelling, which I supposed I should have expected from the Guardian. On the upside I learnt a bit more about thevarious countries. For example, I didn’t know that as well as there being a Democratic Republic of Congo (also called Congo, DRC), there is also a Republic of Congo (aka Congo, Rep.), right next door. Similarly, I didn’t know that South Korea (aka S. Korea) is officially called the Republic of Korea, while North Korea (aka N. Korea) is officially the Democratic People’s Republic of Korea. It’s quite surprising how many countries have the word democratic in their name, especially when they are conspicuously non-democratic.
I considered writing a Python program with a Tkinter interface to help combine the data from countries, but I haven’t yet managed to do that.
Instead, I found it quicker to output the combined data as an XML file, showing all the countries, each containing their respective measurements. Then I could spot where countries had been duplicated and copy the measurements across. I can’t be sure that I’ve spotted all the duplications, but I’ve managed to reduce the number of countries names from over 320 to about 280. I don’t know why the Guardian can’t use a consistent nomenclature, but I suppose the spreadsheets are made by different people and collected from different sources. Still, a single list of countries with the various variable plugged into it would have made my life a lot easier and is surely possible – it would also save journalists from having to write the names of countries out.
[Update: it seems that some of the data now includes three letter codes for countries.]
I have yet to make a good way of dealing with new spreadsheets as they come in, but I suppose it would be good to have a program that highlights when an apparently new country has been discovered, and allows you to pick a country from the current list to combine it with if necessary. Then I can make an updated XML file. I’ll make the XML file that I have at the moment available once I work out how to and have made sure what I’ve written is actually proper XML. At the moment is seems to be broken by the country Curaçao – I don’t think it can deal with odd letters. One benefit I have got from this project is that I have learnt a bit about XML and have been working out how to parse it (I know there are parses out there already, but I’d like to make my own).
I actually have two XML files: in one the countries are the highest level data and contain a list of measurements with their values, in the other, the measurements are the highest level data and contain a list of countries. The latter is more similar to the data is combined and is more useful to my program, but the former allows me to combine countries more easily. I often find that I don’t know which way to store this kind of data, and wonder whether it isn’t most efficient to store it both ways for quick comparisons. Maybe there is some way around this problem I’m unaware of.
The final difficulty (for now) that I had on this project was with R. R is a very powerful statistical program that work through the command line. As such it can be called by other programs quite easily. Part of the reason for my embarking on this project was to brush up on my R, and to see if I could get a Python program to call R. The idea was to use R to calculate correlations, covariances, clustering and whatever else took my fancy, and also to draw scatter plots of the data, so I wouldn’t have to draw the graphs myself.
As a result I could use PythonWin, which is quite a nice IDE (integrated development environment), which actually keeps the graphs I draw using R, unlike running Python from the command line. All in all, it was a pain to have to download these bits and pieces, because it means if anyone else wants to use my program they’ll have to download them too. Also, Python is still not in my registry, despite my efforts to fix it, and every time I import R using Python, it says it can’t find R in the registry and has to find it some other way. Anyway, rant over. I may end up not using R though and instead calculate covariances and draw graphs myself.
My progress with the Guardian Data Analyser app has now halted due to a problem in how to store the data and relationship between data points. I have hit a similar problem with my Chinese Reader App with how I should store words and their relationships. The Guardian app uses XML, which might be OK, but I think storing the data in a database might be more useful. Maybe, with the release of government data in a SPARQL database will encourage the Guardian to do the same. Either way, I think I need to learn how to use SPARQL, or maybe MySQL soon.