Tuesday, 13th October 2009
Some basic clustering
In anticipation of having some data to analyse, and due to reading An Introduction to Bioinformatic Algorithms and Information Theory, Inference, and Learning Algorithms, I decided to create a program that can cluster data. The program uses either the hard or soft k-means method, because they are mentioned in both books, though the Information Theory book has more complex algorithms that I might try, if I can understand them.
To help me understand how the clustering algorithm works I’ve written a program (in Python and Tkinter, yet again) that creates a number of random clusters of two-dimensional (to make it easy to display) data, and shows how they algorithm clusters them. Each of the underlying clusters is given a random (x, y) coordinate, a random distance, and a user-defined number of associated data points. Each data point is then given a random (x, y) coordinate that is a random distance (uniformly distributed between 0 and the cluster’s distance value) from its clusters (x, y) coordinate. In a later version I would like to test clusters that are not uniformly distributed and are not circular. The job of the clustering algorithm is to determine which data point is associated with which cluster
The hard k-means clustering algorithm works by predicting the (x, y) coordinates of the true clusters and assigning each data point to the nearest predicted cluster. The predicted clusters are first given random (x, y) coordinates, then each data point is assigned to the closest one, and the (x, y) coordinates of each predicted cluster is updated to be the mean of each of its data points’ (x, y) coordinates. If a cluster has no associated data points (that is, each data point is closer to another predicted cluster), then I give it another random (x, y) coordinate. The process iterates until no changes in the clusters occurs. The soft k-means clustering algorithm works in a similar way, except that instead of assigning each data point to the nearest cluster, each cluster is given a ‘responsibility’ for each data point. The responsibility is a value between 0 and 1, and is a function of the predicted cluster’s distance from the data point.
The algorithms seem to work pretty well, however, as you can see above, they are far from perfect. The program draws a line between each data point and its associated cluster. The line is coloured depending on which of the underlying clusters the data actually comes from in. In the example below, the black cluster is predicted perfectly, whereas the blue cluster is split in half. I think the algorithms fail particularly when the data is from long thin clusters. They can also sometimes end up stuck with predicting two clusters where there is one if the data is very spread out. Repeating the process with randomly initialised cluster generally allows this phenomenon to be overcome (although repeatedly trying to cluster the data shown below, always seems to end up with a cluster joining the yellow data to some of the blue data).