Monday, 8th February 2010
Defining the distance between characters
As part of my analysis of Chinese texts, I have been attempting to come up with a way of measuring the similarity between Chinese words. The ultimate aim of this would be to identify synonyms, such as 高兴 and 快乐, although initially I’d like to see whether I can predict whether an unknown character in a sentence is a verb, noun, adjective, etc. I could then use this type of analysis in my contextual dictionary to predict, say, in the sentence 我今天花了很多钱, whether 花 is being used as a verb meaning ‘to spend’ (which it is), or as a noun meaning ‘flower’.
Predicting what role a word has in a sentence will require identifying its position in the sentence and seeing what other words appear in this type of position. My text analyser currently has some basic positional information for all characters in the texts I’ve collected, specifically the frequency of the characters that precede and follow each one. I wanted to see whether I could use this information to cluster characters. In order to cluster characters, I first need to define the distance between any two characters. I decided to measure this as the sum of the absolute differences in percentage frequency over all preceding and following characters.
For example, if we just look at following characters, and have:
- Character A, which is followed by 我 60% of the time and 你 40% of the time
- Character B, which is followed by 我 40% of the time, 你 40% and 他 20% of the time
The difference between characters A and B will be |0.6 – 0.4| + |0.4 – 0.4| + |0 – 0.2|, or 0.4. There are a number of problems with this method, but it’s a quick and dirty way of getting a measure of similarity between characters. One problem is that it doesn’t take into account the fact that there will be greater uncertainty in the frequencies of the characters that precede and follow a rare character than for a more common one.
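The worked example above can be reproduced with a few lines of Python. This is only a minimal sketch, not the code from my text analyser: the function name `distance` and the dictionary layout (neighbouring character mapped to its fraction of occurrences) are assumptions made for illustration.

```python
def distance(freq_a, freq_b):
    """Sum of absolute differences between two frequency profiles.

    Each profile maps a neighbouring character to the fraction of the
    time it appears; a character missing from a profile counts as 0.
    """
    neighbours = set(freq_a) | set(freq_b)
    return sum(abs(freq_a.get(c, 0.0) - freq_b.get(c, 0.0)) for c in neighbours)


# Hypothetical profiles for characters A and B from the example above.
followers_a = {"我": 0.6, "你": 0.4}
followers_b = {"我": 0.4, "你": 0.4, "他": 0.2}

d = distance(followers_a, followers_b)
print(round(d, 2))
```

Applying the same function to the preceding-character profiles and adding the two results gives the full distance. Two characters that share no neighbours at all in one direction score 2 for that direction, which is the maximum when each profile sums to 1.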
By this measure, 呸 and 噢 are the closest two characters in the corpus of text I’ve collected so far.
The distance between 呸 and 噢 is zero because neither is ever preceded or followed by any other character. This may therefore seem a useless comparison, but actually the analysis has identified two similar words: both are interjections, so they occur by themselves, as in “呸!”. However, I’m not certain that I’d call them the most similar Chinese words.
If we restrict the search to more common words (those that appear at least five times in the text), then we find that 寸 (meaning ‘inch’) and 尺 (meaning ‘foot’) are the most similar. The reason for their similarity is that they occur in Alice in Wonderland, almost always in the phrase 英寸高 or 英尺高. So at least the distance function seems reasonable. The most distant characters according to this analysis are 即 and 般, though I’m not quite sure what this means. I suppose it means that these two words are the least able to be exchanged.
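The search for the closest pair among the more common characters could be sketched roughly as follows. Everything here is an assumption for illustration: the function names, the toy `sentences` list (which stands in for the real corpus), and the choice to divide neighbour counts by the number of times the character occurs.

```python
from collections import Counter, defaultdict
from itertools import combinations


def neighbour_counts(sentences):
    """Count, for every character, which characters precede and follow it."""
    before, after, totals = defaultdict(Counter), defaultdict(Counter), Counter()
    for s in sentences:
        for i, ch in enumerate(s):
            totals[ch] += 1
            if i > 0:
                before[ch][s[i - 1]] += 1
            if i < len(s) - 1:
                after[ch][s[i + 1]] += 1
    return before, after, totals


def l1(p, q):
    """Sum of absolute differences between two frequency profiles."""
    return sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in set(p) | set(q))


def closest_pair(sentences, min_count=5):
    """Return the pair of sufficiently common characters with the smallest distance."""
    before, after, totals = neighbour_counts(sentences)

    def profile(counts, ch):
        # Assumption: frequency = neighbour count / occurrences of ch.
        return {c: n / totals[ch] for c, n in counts[ch].items()}

    def dist(a, b):
        return (l1(profile(before, a), profile(before, b))
                + l1(profile(after, a), profile(after, b)))

    common = sorted(c for c, n in totals.items() if n >= min_count)
    return min(combinations(common, 2), key=lambda pair: dist(*pair))


# Toy stand-in for the corpus; the threshold is lowered to suit its size.
sentences = ["三英寸高", "五英尺高", "两英寸高", "六英尺高"]
pair = closest_pair(sentences, min_count=2)
print(pair)
```

Comparing every pair is quadratic in the number of characters, which is fine for a sketch but would want some thought on a large corpus. Replacing `min` with `max` gives the most distant pair instead.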