Tuesday, 8th December 2009
Choropleth maps of Africa
A few weeks ago I was delighted to see a guide on how to create a choropleth map. The article not only uses Python, but was also perfectly timed with my exploration into SVG files and how they work. It also coincided with my re-reading of my thesis in preparation for my viva. As I was trying to find the latest statistics on the incidence of human African trypanosomiasis (sleeping sickness), I thought I’d use the data to colour a map of Africa.
First I had to get hold of the data. The WHO website has the numbers of new infections reported for two strains of trypanosomes, Trypanosoma brucei rhodesiense and Trypanosoma brucei gambiense. However, the data is in tables in PDFs, which makes it a pain to extract. One PDF has the data from 1990 to 2004, another has data from 1997 to 2006, and there are occasional contradictions where they overlap. There is older data too, but I haven’t yet copied that into a usable format.
Once I had some data in a text file, I had to get a blank SVG map of Africa. The map I used is from Wikimedia. I also experimented with a more detailed map that had all the districts within each country, but that was too detailed (though the file was organised in an easier-to-use way). I had to make quite a few changes to the map before I could colour it. The first problem was that each country was given a two-letter code, which I needed to convert into a full name. Two-letter codes avoid the problem of having multiple names for countries (e.g. do you use ‘the Congo’ or ‘the Republic of the Congo’? Neither of which is ‘the Democratic Republic of the Congo’). However, the data from the WHO (and nearly everywhere else) uses the full country names. Besides which, it’s much easier to use the image when the countries have names. Converting the codes to names improved my geography no end. For example, I hadn’t even heard of Benin or Lesotho before, let alone been able to place them on a map. (Lesotho is particularly interesting as it is an enclave, completely surrounded by South Africa.)
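The code-to-name conversion can be sketched as a simple lookup table. This is only an illustration: it assumes the map's two-letter codes follow ISO 3166-1 alpha-2, and both `CODE_TO_NAME` and `country_name` are hypothetical names, not part of the original program.

```python
# Assumption: the map's two-letter codes follow ISO 3166-1 alpha-2.
# Only a handful of countries are listed here for illustration.
CODE_TO_NAME = {
    "BJ": "Benin",
    "CD": "Democratic Republic of the Congo",
    "CG": "Republic of the Congo",
    "LS": "Lesotho",
    "ZA": "South Africa",
}


def country_name(code: str) -> str:
    """Look up a full country name, ignoring case and surrounding whitespace."""
    key = code.strip().upper()
    try:
        return CODE_TO_NAME[key]
    except KeyError:
        # Fail loudly so that a missing code is noticed rather than left uncoloured.
        raise KeyError(f"No name recorded for country code {code!r}") from None


print(country_name("ls"))  # Lesotho
```

Keeping the two Congos as separate entries keyed by code avoids the naming ambiguity; the full names would still need to match whatever spelling the data file uses.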
I had to make further changes due to using Beautiful Soup, as the guide instructed, to analyse the XML. In retrospect, it may have been simpler to write my own simple XML parser or to simply colour the countries using CSS. However, I’m quite glad I tried, as I might be able to use Beautiful Soup for parsing some other XML. The problem with Beautiful Soup is that it can’t handle self-closing tags. It also appears not to handle tags within tags of the same name. This might be sensible for XML, but the image contains groups of groups, which I had to separate. One reason for having groups of groups was the islands that make up a single country, which I will probably remove as they can’t be seen anyway. Another reason for having nested groups is that the whole continent formed a group, which was transformed so as to be in the centre of the image. I’m not sure why the image was constructed this way and not just drawn in the centre in the first place; I will try to alter all the coordinates so that the transformation is not required. My final annoyance with Beautiful Soup is that it insists on rewriting viewBox as viewbox, which Chrome (which may be to blame here) is unable to understand. As a result, every time I create a map, I have to edit one letter in a text editor before I can view it.
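The one-letter edit could probably be automated as a post-processing step on the generated SVG text. The following is a minimal sketch rather than a fix inside Beautiful Soup itself; `restore_viewbox` is a hypothetical helper name, and the sample SVG string is made up.

```python
import re


def restore_viewbox(svg_text: str) -> str:
    """Restore the camelCase 'viewBox' attribute name after it has been lowercased."""
    # Only rewrite 'viewbox' when it is directly followed by '=' (optionally with
    # whitespace), so that the word appearing elsewhere in the file is left alone.
    return re.sub(r"\bviewbox(\s*=)", r"viewBox\1", svg_text)


# Made-up example of what the lowercased output might look like.
lowercased = '<svg viewbox="0 0 100 100" width="100"></svg>'
print(restore_viewbox(lowercased))
```

Running the output of the map-colouring script through a function like this before writing the file would remove the need to open a text editor each time.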
I’m pretty pleased with how the maps have turned out (and I like the colour scheme), though there are still some points that need improving. The main problem concerns how the data is split into groups. In the first incarnation of my program, I chose the range of values for each group, but after switching between data sets repeatedly I wanted a way to determine the range automatically. As you can see, this leads to a slightly odd (though arguably more valid) split of one group every 49 cases for the first map, and one group every 1604.6 for the second. Rounding to an integer would be a start, but I might see if I can also round to a ‘sensible’ number, such as 50 for the first map and maybe 2000 for the second. The second map highlights another problem, which is that all the countries except the Democratic Republic of the Congo (DRC) have the same colour. This is because the DRC had over 7 times as many cases of T. b. gambiense infection as the country with the next highest number of cases. To solve this, I could either use a log scale or make the final category, say, ‘2000+’, though the latter might underplay the seriousness of the problem in the DRC. It might also make sense to have a single category for zero cases, which in this case, I think, should be coloured grey, as the WHO only provided data for countries that had had at least one case in the years they were recording.
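One possible way to get ‘sensible’ group widths is to round the automatically calculated width up to 1, 2 or 5 times a power of ten. The sketch below is only one such scheme, not what the program currently does; `nice_step` and `category` are hypothetical names, and the category function also folds in the ideas of a separate zero category and an open-ended top group.

```python
import math


def nice_step(raw_step: float) -> float:
    """Round a positive group width up to 1, 2 or 5 times a power of ten."""
    if raw_step <= 0:
        raise ValueError("raw_step must be positive")
    magnitude = 10 ** math.floor(math.log10(raw_step))
    for multiplier in (1, 2, 5, 10):
        candidate = multiplier * magnitude
        if candidate >= raw_step:
            return candidate
    return 10 * magnitude  # fallback; the loop normally returns first


def category(cases: float, step: float, n_groups: int) -> int:
    """Return 0 for no cases, otherwise a group index from 1 to n_groups."""
    if cases <= 0:
        return 0  # separate 'no reported cases' group, e.g. coloured grey
    # Anything beyond the last boundary falls into the final, open-ended group.
    return min(n_groups, math.ceil(cases / step))


print(nice_step(49))      # 50
print(nice_step(1604.6))  # 2000
```

With the two widths mentioned above, this rounds 49 to 50 and 1604.6 to 2000. It does nothing about the DRC dominating the scale; a log scale would need the group boundaries to be calculated on `math.log10(cases)` instead.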
Another option is to use a continuous scale to colour countries (or rather a scale with so many categories that it appears continuous). This is how a similar map is coloured on the Wikipedia page on trypanosomiasis. The map (which, it turns out, was made by someone I know) shows deaths per 100,000, which is perhaps a better metric (and one I now have the data for). Converting the numbers into a percentage of population would also be a better way of illustrating the problem.
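A continuous scale and a per-100,000 rate can both be sketched in a few lines. This is an illustration under assumptions: the two end colours are arbitrary choices rather than the colour scheme used for the maps here, the function names are hypothetical, and the figures in the example are invented.

```python
def per_100k(cases: int, population: int) -> float:
    """Convert a raw count into a rate per 100,000 people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * 100_000 / population


def blend_colour(fraction: float,
                 low=(255, 255, 204),
                 high=(128, 0, 38)) -> str:
    """Linearly interpolate between two RGB colours and return a hex string."""
    # Clamp so that out-of-range values still map to one of the end colours.
    fraction = max(0.0, min(1.0, fraction))
    channels = [round(a + (b - a) * fraction) for a, b in zip(low, high)]
    return "#{:02x}{:02x}{:02x}".format(*channels)


# Invented figures, purely to show the calls.
rate = per_100k(cases=50, population=1_000_000)
highest_rate = 20.0  # hypothetical maximum across all countries
print(rate, blend_colour(rate / highest_rate))
```

Dividing each country's rate by the highest rate gives the fraction to pass to `blend_colour`; the resulting hex string can be used directly as an SVG fill.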
Now that I have a reasonable way to colour maps, I can use any data about countries. As it happens, I have been collecting data about countries from the Guardian Data Store and other open sources. Below are a couple of examples of maps made using some of those data (cases of AIDS and hunger). I need to sort out the scales so that they’re more sensible.