Tuesday, 16th February 2010
I've updated my tree-drawing program to create neighbour-joining trees and show branch lengths proportional to the distance between words, although, as you can see there is not a huge difference between branch lengths. I would like to change the way to tree is drawn so the branches spread out from the centre and distance are given by the total branch length, not just the distance along the x-axis as they are now.
Before I updated the tree I had to solve a problem with the way my dictionary searched for words in text. Previously it was unable to identify that 爱丽丝 was a word (despite being in my dictionary), because it looked character-by-character, and since 爱丽 isn't in the dictionary, it concluded that 爱 was a single character and stopped checking. The new dictionary was inspired by reading about suffix trees in An Introduction to Bioinformatic Algorithms and while it takes a bit longer to initialise, it is much faster to search and has radically reduced the time taken to analyse texts. It's interesting how many bioinformatic algorithms can be applied to linguistic analysis, but perhaps not surprising given that both are concerned with finding patterns in strings of letters. The algorithms are particularly useful with Chinese text since there are no spaces between words.
One questions that occurred to me and which I should now be able to answer is: do words that have the same or a similar pronunciation tend to occupy different part of a sentence to avoid confusion?
The tree is based on 5141 lines of text, consisting of 92 800 words and 132 000 characters; it is built from the 34 words that have a frequency greater than 0.35% in the text.