Chinese Analysis

The idea for this project was born while I was working on my Chinese Reader. I was thinking about how to store words in a dictionary so that I could quickly identify compound characters in a text. I realised that by analysing the relationships between words in a wide range of texts, I might be able to achieve several goals at once.

In essence, what I want to create is an artificial intelligence that can learn to comprehend Chinese. This should be an interesting problem, especially given that I'm far from fully understanding the intricacies of Chinese myself. At the very least, I hope that in attempting to build such an AI, I will learn more about how Chinese sentences are organised.

Because I was reading An Introduction to Bioinformatic Algorithms whilst I was thinking about these problems, I ended up applying various bioinformatic algorithms to Chinese with some success. I think bioinformatic algorithms are ideal since they are generally used to identify patterns in, or similarities between, sequences of symbols.

Character frequency analysis

As a start to a dictionary that might one day learn Chinese, I created a little program that analyses the frequency of characters in a collection of texts. Currently, I have over 50,000 characters' worth of text, which includes about 2000 different characters. My analyser program started off simply counting how often each of the characters occurs in these sample texts. This should help me identify which characters are most common and so the best to learn. The program found that the most common character in the texts I've collected is 的, which makes up 3.1% of the characters. According to A Key to Chinese Speech and Writing, 的 is the most common character in Chinese and makes up about 4% of text; I'm not sure why it is less common in my texts, but I suspect it's because 的 occurs more often in more complex sentences.
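
The counting step is simple; here's a minimal sketch in Python (the CJK range test is a simplification that ignores punctuation, Latin letters and the rarer extension blocks):

```python
from collections import Counter

def char_frequencies(texts):
    """Count how often each hanzi occurs across a collection of texts."""
    counts = Counter()
    for text in texts:
        # Keep only characters in the basic CJK Unified Ideographs block
        counts.update(c for c in text if '\u4e00' <= c <= '\u9fff')
    return counts

texts = ["我是学生。", "他是我的老师。"]
counts = char_frequencies(texts)
total = sum(counts.values())
for char, n in counts.most_common(5):
    print(f"{char}: {n / total:.1%}")
```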

In my text, the most common characters are 的, 了, 我, 一, 是, 不, 这, and 他. 

A Key to Chinese Speech and Writing states that the most common characters are 的, 一, 了, 是, 不, 我, 在, 有 and 人, in that order. My analysis may be skewed by Alice in Wonderland (from nciku, and a translation I bought in China), which currently makes up the bulk of my text. The result is a glut of 'Ailisi', Red Queens and Cheshire cats. The other sentences are from smart.fm courses and various study books, so are generally quite simple, which is probably why the pronouns and 这 are more common than normal.

I then expanded the analysis to count the frequency of characters before and after each of those in the text. I hope this will help me to learn how characters are most commonly used. For example, I first learnt that 发 means 'hair' or 'to emit'; my program shows that ~45% of the time, 发 is part of the word 发现 or 发生 (meaning 'to discover' and 'to occur' respectively). I think that learning these meanings is probably a better strategy than trying to learn what the character means by itself.
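
Extending this to neighbours is just a matter of counting pairs. A sketch, showing only the following characters (the preceding ones are counted the same way):

```python
from collections import Counter, defaultdict

def neighbour_frequencies(texts):
    """For each hanzi, count which characters appear immediately after it."""
    following = defaultdict(Counter)
    for text in texts:
        chars = [c for c in text if '\u4e00' <= c <= '\u9fff']
        for a, b in zip(chars, chars[1:]):
            following[a][b] += 1
    return following

following = neighbour_frequencies(["我发现了一个问题", "发生了什么事"])
total = sum(following['发'].values())
for char, n in following['发'].most_common():
    print(f"发{char}: {n / total:.0%}")
```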

Text analyser 1

One difficulty at the moment is that it's not possible to know whether the character in question actually forms a word with the previous or following characters. However, if the combination of characters is very common, then it may not really matter. For example, 我 is followed by 的 about 10% of the time, and it probably doesn't matter whether you consider 我的 as a word meaning 'my' or a common grammatical construction of 'me' + 'possessive'. The next step in my program will be to combine characters into words (probably using a dictionary to check whether they are really words, or perhaps just joining frequently occurring pairs of characters). I can then see which words are the most common and which words border them. However, I foresee a potential problem with overlapping words, and words made up of more than two characters.

Another feature that I'd like to add is the ability to group words into grammatical groups, such as pronouns, verbs, adjectives etc. (I hope to add this information to my Chinese reader too). I should then be able to identify grammatical patterns, such as which words tend to precede adjectives (很 and 真, for example). It would also identify patterns such as 们 always following a pronoun, though I'm not sure how to deal with this if 我们, 他们, 你们 etc. are viewed as single words. Maybe I'll just look at sentences as both collections of single hanzi and as collections of words.

Word frequency analysis

In the previous update, I wrote about an analysis of single characters in Chinese text. I have now updated my program to analyse words instead of individual characters by searching the open-source Chinese-English dictionary, CC-CEDICT, and checking whether combinations of characters form words. This slows the program significantly and leads to a problem with names, most of which aren't in the dictionary. However, speed isn't too much of an issue yet, and I've created a way to output the information, so that once the analyser has been run, I can save the results and quickly load them in other programs that make use of them. I think I'll get around the problem of names by having a dictionary of my own that the program checks first. This should also speed up analysis, as this dictionary will contain the common words and will be much smaller, thus quicker to search.
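
For the dictionary look-up, a greedy longest-match is probably the simplest approach; a sketch (the loading of CC-CEDICT is elided, and the four-entry dictionary here is just for illustration):

```python
def segment(text, dictionary, max_len=4):
    """Greedy longest-match segmentation: at each position take the
    longest dictionary entry, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for length in range(max_len, 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += len(candidate)
                break
    return words

dictionary = {"我们", "发现", "问题", "一个"}
print(segment("我们发现一个问题", dictionary))
# ['我们', '发现', '一个', '问题']
```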

I have run the program again, on a slightly larger dataset (just over 50,000 words, though I haven't counted how many individual characters this represents) as the number of texts I have slowly increases. The analysis doesn't differ too much from before, only it removes some of the redundancy of, for example, having 什 and 么 separate. It also reduces the frequency of certain words, such as 我 and 的, as 我的, 我们 and 他的 are counted as separate words. I think the main benefit will come when I add word types (noun, verb, adjective etc.).

Below is a screenshot of the app with 发现 selected. In the lists you can see various other compound words, such as 不知道. You can also see that I've added the pinyin and meaning of the selected word, although I need to tidy up how this is displayed. I'd also like to add a search bar, so you can find specific words (it took me a while to find 发现). I've tried to add a better font (cyberbit.ttf), but for some reason it has only replaced some characters (characters for which there is an alternative traditional form, I think). In fact, the same happens whichever font I use in Tkinter. Something to sort out later. I have at least sorted the problem in Pygame, but more on that later.

Screen shot of my word analyser

You can see that the app also now contains a Generate button and a box of gibberish. I'll explain the meaning of this in the next update.

Making Markov chains

The five most frequent words following 这些 and so on

The most recent function I've added to my app is the ability to generate random sentences using the statistical patterns seen in the text. The program makes a note of the frequency of words that start clauses, picks one based on these probabilities, then picks the next word based on the frequencies of words that follow it, and so on. Basically, the program constructs a Markov chain and uses this to generate a sentence; it's what some spammers do to try and get around email filters. This is what can be seen at the bottom of the app shown in the previous update.
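
In outline, the generator looks something like this (a simplified sketch; the real program works on the word lists produced by the analyser):

```python
import random
from collections import Counter, defaultdict

def build_chain(sentences):
    """Record which words start clauses and which words follow which."""
    starts, transitions = Counter(), defaultdict(Counter)
    for words in sentences:
        starts[words[0]] += 1
        for a, b in zip(words, words[1:]):
            transitions[a][b] += 1
    return starts, transitions

def weighted_choice(counter):
    """Pick a key with probability proportional to its count."""
    words, weights = zip(*counter.items())
    return random.choices(words, weights)[0]

def generate(starts, transitions, max_len=10):
    """Walk the chain from a weighted starting word until a dead end."""
    word = weighted_choice(starts)
    sentence = [word]
    while word in transitions and len(sentence) < max_len:
        word = weighted_choice(transitions[word])
        sentence.append(word)
    return ''.join(sentence)

sentences = [["我", "很", "好"], ["我", "喜欢", "你"], ["他", "很", "高"]]
starts, transitions = build_chain(sentences)
print(generate(starts, transitions))
```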

Here are some example sentences, which range from the passable (e.g. sentence 1: “French people are also good”) to the gibberish:

  1. 法国人也很好
  2. 我要爱护地球环境很友好
  3. 这个问题
  4. 小王月文
  5. 她的机会来了电话号码
  6. 音乐家没有时间延长了
  7. 他们有五十六岁
  8. 长城
  9. 她想三月去逛街
  10. 这里出车祸了

After that, I decided I wanted a better, more visual way to view the network. I tried sketching out a few networks, but the problem is that things rapidly get messy and it's difficult to get the layout right. This is a problem I've had before when trying to view a network I generated for a Go AI (which I'll have to write about some time). To solve this problem (for the Go AI network too), I got the layout to solve itself spontaneously by creating a physics simulation. The basic program is adapted from my molecular dynamics simulation described here. Basically, it works by having every particle repel every other particle and having every bond, or interaction between particles, contract to a certain length. Once the parameters have been played with, things are left to bounce about, and after a short period of time everything is nicely spread out. Sometimes there can be problems if there are lots of loops in the network, but giving things a shake normally sorts them out. Shaking the network slightly makes it move in a pleasingly physical, jelly-like way.
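
The actual simulation runs in Pygame, but the core update rule is simple enough to sketch: every pair of particles repels, and every bond acts as a spring pulling towards a rest length (the parameter values here are arbitrary):

```python
import math, random

def layout(nodes, edges, steps=300, repulsion=0.05, spring=0.01, rest_len=1.0):
    """Toy force-directed layout: nodes repel each other and edges
    pull their endpoints towards a fixed rest length."""
    pos = {n: [random.random(), random.random()] for n in nodes}
    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    continue
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-6
                # repulsion falls off with the square of the distance
                f = repulsion / (d * d)
                force[a][0] += f * dx / d
                force[a][1] += f * dy / d
        for a, b in edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            d = math.hypot(dx, dy) or 1e-6
            # spring pulls (or pushes) towards the rest length
            f = spring * (d - rest_len)
            force[a][0] += f * dx / d
            force[a][1] += f * dy / d
            force[b][0] -= f * dx / d
            force[b][1] -= f * dy / d
        for n in nodes:
            pos[n][0] += force[n][0]
            pos[n][1] += force[n][1]
    return pos

nodes = ["这些", "人", "很", "好"]
edges = [("这些", "人"), ("人", "很"), ("很", "好")]
print(layout(nodes, edges))
```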

The graph is centred on the word 这些, which happens to be the 32nd most common starting word in my analysis. Then I looked at the five most frequent words following it, and the five most frequent words following each of them (for 水平 and 花 there were fewer than five following words). The graph is built up with a recursive function, which I'm quite pleased with because I so rarely use recursion in my programs. It means I can increase the depth of the graph easily, but the image rapidly fills up and the program slows.
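
The recursive construction is roughly this (assuming a follower-frequency table like the one sketched earlier, here applied to words rather than single characters):

```python
def build_graph(word, following, depth, top_n=5):
    """Recursively collect the top_n most frequent followers of a word,
    down to the given depth. Returns nested (word, children) tuples."""
    if depth == 0 or word not in following:
        return (word, [])
    children = [build_graph(w, following, depth - 1, top_n)
                for w, _ in following[word].most_common(top_n)]
    return (word, children)

# 'following' maps each word to a Counter of the words that follow it
tree = build_graph("这些", following, depth=2)
```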

I still need to add the ability to create cross links. For example, there are actually five bubbles containing the word 很 in the image, and these should be replaced by a single one with five inputs. It could be argued that the image is clearer with separate bubbles, which may turn out to be the case. However, it took me quite a long time to realise that there are five 很 (and two 的) bubbles in the current image, and it's quite an interesting point. I think the reason for all the 很s is that 这些 is likely to be followed by a noun (the image shows us that the five most common words following 这些 are all nouns), and nouns are often followed by 很. This also highlights why it will be useful to add the word categories (such as noun). The image also highlights a problem with this approach: 了 frequently follows 花, which is because 花 can be a verb as well as a noun (e.g. in the sentence 我今天花了很多钱, which is in my analysis). Any attempt to categorise words will have to deal with words that have multiple functions.

I've found a similar sort of analysis of The Iliad here. Maybe I'll create a Java applet so people can interact with the networks.

Defining the distance between characters

As part of my analysis of Chinese texts, I have been attempting to come up with a way of measuring the similarity between Chinese words. The ultimate aim of this would be to identify synonyms, such as 高兴 and 快乐, although initially I'd like to see whether I could predict whether an unknown character in a sentence was a verb, noun, adjective etc. I could then use this type of analysis in my contextual dictionary to predict, say, in the sentence 我今天花了很多钱, whether 花 was being used as a verb meaning 'to spend' (which it is), or as a noun meaning 'flower'.

Predicting what role a word has in a sentence will require identifying its position in the sentence and seeing what other words appear in this type of position. My text analyser currently has some basic positional information for all characters in the texts I've collected, specifically the frequency of the characters preceding and following each one. I wanted to see whether I could use this information to cluster characters. In order to cluster characters, I first need to define the distance between any two characters. I decided to measure this as the difference in percentage frequency for all preceding and following characters.

For example, if we just look at following characters, and have:

  • Character A, which is followed by 我 60% of the time and 你 40% of the time
  • Character B, which is followed by 我 40% of the time, 你 40% and 他 20% of the time

The difference between characters A and B will be |0.6 – 0.4| + |0.4 – 0.4| + |0 – 0.2|, or 0.4. There are a number of problems with this method, but it's a quick and dirty way of getting a measure of similarity between characters. One problem is that it doesn't take into account the greater uncertainty in the frequencies of characters that precede and follow a rare character, compared to a more common one.
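
A minimal version of the measure, reproducing the worked example above:

```python
def distance(freq_a, freq_b):
    """Sum of absolute differences in the fractional frequency of
    every character that follows A or B (an L1 distance)."""
    total_a = sum(freq_a.values()) or 1
    total_b = sum(freq_b.values()) or 1
    chars = set(freq_a) | set(freq_b)
    return sum(abs(freq_a.get(c, 0) / total_a - freq_b.get(c, 0) / total_b)
               for c in chars)

a = {"我": 6, "你": 4}           # character A's followers
b = {"我": 4, "你": 4, "他": 2}  # character B's followers
print(distance(a, b))  # |0.6-0.4| + |0.4-0.4| + |0-0.2| = 0.4
```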

By this measure, 呸 and 噢 are the closest two characters in the corpus of text I’ve collected so far.

The distance between 呸 and 噢 is zero because neither is ever preceded or followed by another character. This may therefore seem a useless comparison, but actually the analysis has identified two similar words: both are interjections, so occur by themselves, as in "呸!". However, I'm not certain that I'd call them the most similar Chinese words.

If we restrict the search to more common words (those that appear at least five times in the text), then we find that 寸 (meaning 'inch') and 尺 (meaning 'foot') are the most similar. The reason for their similarity is that they occur in Alice in Wonderland, almost always in the phrase 英寸高 or 英尺高. So at least the distance function seems reasonable. The most distant characters according to this analysis are 即 and 般, though I'm not quite sure what this means. I suppose it means that these two words are the least interchangeable.

Visualising character similarities

Cloud of clustered hanzi

In the previous update, I defined the distance between two characters. Using this metric, I could create a matrix of distances between a selection of characters. The problem was then how to display this information. My first thought was to create a diagram similar to a phylogenetic tree, which would have the benefit of forcing me to learn how these trees are generated. I have finally created a program that can build and draw a tree based on a simple hierarchical clustering algorithm. However, while I was struggling with the inevitable recursion these trees require, I hit upon a much simpler method of displaying the relatedness of characters: make them organise themselves on the screen.

To get the characters to spontaneously arrange themselves, I used a particle simulation that I've used to display networks. Between each pair of characters, I created an interaction (essentially a spring) with a length proportional to the square of the distance between the characters. Then I tweaked the parameters (such as the interaction strength and the scale of interaction lengths) and let the simulation run. The characters jiggled and jostled and eventually arranged themselves in what appears to be quite a sensible fashion (see video below).

The characters included are those that make up >0.5% of single characters in the texts I have. Between them, these 31 characters make up just over 31% of characters in the text; the size of the characters gives an approximate indication of their relative frequencies. Since every character interacts with every other, displaying the interactions makes it hard to see what's happening. I also prefer not to display the circles that represent the particles, as in the screenshot. In the video I toggle between the interaction and particle displays. I also experimented with creating an icon for the program, but that's by-the-by and didn't work very well.

The characters form consistent and logical clusters. The most obvious cluster is the pronouns (我, 他, 你 and 她), in the centre of the screen. These words are clearly very related and we would expect them to occur in the same parts of a sentence. Other than that, it's reassuring to see the adjectives 大 and 小, and the prepositions 上 and 下, near one another. You can also see that the verbs are nearly all on the right of the image (one exception is 爱, which is separate because the vast majority of its occurrences are in the name 爱丽丝). The characters 个, 么 and 地 are on the outskirts and appear to represent characters that appear in unique parts of a sentence. I wonder if it might be easier to see patterns if I coloured characters based on their grammatical function (though deciding how to categorise characters isn't always easy).

Below is a video of the cloud once I had solved the problem with 爱丽丝. It shows the jelly-like network of characters arranging themselves. The cluster of pronouns is very robust, as can be seen when I move the 我 particle. The position of 地, on the other hand, is more flexible; it sits on the outside of the cluster.

Building hanzi trees

Tree showing the similarity of hanzi

In the previous update, I showed how a physical simulation could be used to cluster hanzi. The clustering seen with that approach can be recapitulated using a more rigid and rigorous method. I have finally got a basic tree-generating and -drawing program to work. It uses a very simple algorithm: first it finds the smallest distance in the distance matrix, then it joins those hanzi into a cluster whose distance from the other hanzi is the average of the distances to each of its components. When new branches are added to a cluster, clusters are arranged to minimise the distance between neighbouring hanzi on the two branches (it's hard to explain and even harder to code). The algorithm is based on an explanation in An Introduction to Bioinformatic Algorithms, which goes on to explain more complex methods. It also explains a lot of algorithms for searching sequences for patterns, which I might try to use on Chinese.
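
Stripped of the drawing code, the clustering step looks roughly like this (a naive sketch; the three-character distance matrix is invented for illustration):

```python
def hierarchical_cluster(items, dist):
    """Naive average-linkage clustering. `dist` maps frozenset pairs to
    distances; clusters are built up as nested tuples."""
    dist = dict(dist)
    clusters = list(items)
    while len(clusters) > 1:
        # find the closest pair of clusters
        a, b = min(((a, b) for i, a in enumerate(clusters)
                    for b in clusters[i + 1:]),
                   key=lambda p: dist[frozenset(p)])
        merged = (a, b)
        clusters.remove(a)
        clusters.remove(b)
        # the new cluster's distance is the average of its members'
        for c in clusters:
            dist[frozenset((merged, c))] = (dist[frozenset((a, c))] +
                                            dist[frozenset((b, c))]) / 2
        clusters.append(merged)
    return clusters[0]

items = ["我", "他", "很"]
dist = {frozenset(("我", "他")): 0.2,
        frozenset(("我", "很")): 0.8,
        frozenset(("他", "很")): 0.9}
print(hierarchical_cluster(items, dist))
# ('很', ('我', '他')) – the two pronouns pair off first
```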

The tree shown is based on the same selection of characters as used before. Like the physical clustering, the tree shows how the pronouns, adjectives, prepositions and verbs form distinct clusters (I've excluded 着 from the verbs because it is more often used as a verb ending, but it forms an out-shoot of the verb group, perhaps because of its dual role). The characters 个, 么 and 地 again form out-groups. Interestingly, 子 and 们 also cluster, and both are used as noun suffixes. Also, 了 and 的 are found within the verb group, which makes sense as they often suffix verbs, so will be followed by the same sort of words that follow verbs (e.g. you can say both 我喝一杯茶 and 我喝了一杯茶). It's also interesting to see how the verb group breaks down: the verbs of movement, 去 with 来; the verbs of being, 是 with 在; and the verbs of sensing, 看 with 说, though maybe this is coincidental. We can find explanations for 一 and 这 forming a group (they are often followed by measure words and form the subject of a sentence), and for 就, 不 and 很 (adverbs that come before verbs).

In conclusion, the simple distance function I created seems to give a reasonable measure of the relationship between hanzi.

This distance function might therefore be used to predict the role of a word in a sentence. I think the approach could be improved when I have a measure of how often a character follows verbs in general. I wonder now whether I could avoid having to class the words as verbs, nouns etc., and instead get the program to build its own categories and then feed this information back into the system to improve it. For example, building this tree, the program could, with some degree of certainty, decide that 我, 他, 你 and 她 form a group and then calculate how often other characters are preceded and followed by this group.

Expanding the tree

If I extend the tree to contain all characters with a frequency of more than 0.04%, then the new characters fit nicely into the same groups. The tree can be split into two major groups, which I call the noun-type and verb-type groups. The noun-type group (in blue below) contains the nouns and words connected with nouns, such as the suffixes 们 and 子; the determiners, 那 and 这; the prepositions, 上 and 下; and the adjectives. The verb-type group (in green below) contains the verbs and adverbs. In addition, there are some outliers which are found in very specific contexts. For example, 么 is only found after 什, 这, 那 or 怎; 丽 and 丝 are almost exclusively found in the name Alice (in my texts at least).

Tree of hanzi with >0.04% frequency

N.B. I made a small change to the way the tree is displayed such that the branch lengths are proportional to the distances between characters, although the tree is still ultrametric (so all the leaves line up).

Predicting names

In all the analysis that I've written about, one type of word that consistently skews results and could lead to flawed conclusions is names. For example, in the previous tree 爱 has an unusual position due to the fact that it is most commonly found (in my texts) in the name 爱丽丝 (Alice). Often when reading Chinese text, I will struggle with an unknown word only to find that it is a name (though I'm getting better at spotting names now). I had thought to create a list of names for the analyser to check, but that means I would have to read all the texts first to find the names myself. I would therefore like to have my Chinese reader predict which words, in a given text, are likely to be names.

There are several clues that should help identify names (a rough sketch of how some might be scored follows the list):

  1. Characters appear more frequently than you would expect. For example, though 爱 is a relatively common verb (especially in beginner's texts for some reason), in Alice in Wonderland, it is the 10th most common character, making up 1.4% of characters, compared to 0.14% in all other texts.
  2. Characters consistently appear with other unusual characters. For example, in my analysis, 94% of occurrences of 丽 are preceded by 爱 and followed by 丝.
  3. Words appear before titles, such as 先生 or 太太. For example, 丁丁 is followed by 先生 43% of the time. Also, words after 小 or 老 could be surnames.
  4. In fact, rather than create a list of specific words associated with names, my analysis should be able to identify such words. For example, 丝 is followed by either 说 or 想 24% of the time, so is unlikely to mean silk in these circumstances.
  5. Finally, names often consist of phonetic characters, such as 克, 巴, 特 or 尔.
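
As a very rough illustration, the first two clues might be combined into a single score like this (nothing here is implemented yet; the function and its inputs are hypothetical):

```python
def name_score(char, freq_in_text, freq_overall, following):
    """Hypothetical score for clues 1 and 2: a character is suspicious if
    it is much more frequent in this text than overall, and if a single
    neighbour accounts for most of its occurrences."""
    ratio = freq_in_text / max(freq_overall, 1e-6)          # clue 1
    total = sum(following.values()) or 1
    top_share = max(following.values(), default=0) / total  # clue 2
    return ratio * top_share

# e.g. 爱 in Alice in Wonderland: 1.4% of characters vs 0.14% overall,
# with 丽 following it in most occurrences
print(name_score("爱", 0.014, 0.0014, {"丽": 90, "情": 10}))
```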

Neighbour joining

Neighbour-joining tree

I've updated my tree-drawing program to create neighbour-joining trees and show branch lengths proportional to the distance between words, although, as you can see, there is not a huge difference between the branch lengths. I would like to change the way the tree is drawn so that the branches spread out from the centre and distances are given by the total branch length, not just the distance along the x-axis as they are now.
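
For reference, the textbook neighbour-joining loop, stripped down to topology only (my program also computes the branch lengths, which this sketch omits):

```python
def neighbour_join(names, d):
    """Textbook neighbour joining, topology only. The distance dict d
    must contain entries for both (i, j) and (j, i)."""
    nodes = list(names)
    d = dict(d)
    while len(nodes) > 2:
        n = len(nodes)
        r = {i: sum(d[(i, k)] for k in nodes if k != i) for i in nodes}
        # Q criterion: join the pair minimising (n-2)d(i,j) - r(i) - r(j)
        i, j = min(((i, j) for a, i in enumerate(nodes)
                    for j in nodes[a + 1:]),
                   key=lambda p: (n - 2) * d[p] - r[p[0]] - r[p[1]])
        u = (i, j)
        nodes.remove(i)
        nodes.remove(j)
        for k in nodes:
            # distance from the new node to each remaining node
            d[(u, k)] = d[(k, u)] = (d[(i, k)] + d[(j, k)] - d[(i, j)]) / 2
        nodes.append(u)
    return tuple(nodes)

pairs = {("A", "B"): 2, ("A", "C"): 4, ("A", "D"): 4,
         ("B", "C"): 4, ("B", "D"): 4, ("C", "D"): 2}
d = {**pairs, **{(j, i): v for (i, j), v in pairs.items()}}
print(neighbour_join(["A", "B", "C", "D"], d))
# (('A', 'B'), ('C', 'D'))
```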

Before I updated the tree I had to solve a problem with the way my dictionary searched for words in text. Previously it was unable to identify that 爱丽丝 was a word (despite being in my dictionary), because it looked character-by-character, and since 爱丽 isn't in the dictionary, it concluded that 爱 was a single character and stopped checking. The new dictionary was inspired by reading about suffix trees in An Introduction to Bioinformatic Algorithms and while it takes a bit longer to initialise, it is much faster to search and has radically reduced the time taken to analyse texts. It's interesting how many bioinformatic algorithms can be applied to linguistic analysis, but perhaps not surprising given that both are concerned with finding patterns in strings of letters. The algorithms are particularly useful with Chinese text since there are no spaces between words.
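
The book describes suffix trees, but the fix here only needs a prefix trie: the search no longer gives up just because an intermediate string like 爱丽 isn't itself a word. A minimal sketch:

```python
class Trie:
    """Prefix-tree dictionary for longest-match word look-up."""
    def __init__(self, words):
        self.root = {}
        for word in words:
            node = self.root
            for char in word:
                node = node.setdefault(char, {})
            node['$'] = True  # marks the end of a complete word

    def longest_match(self, text, start=0):
        """Return the longest dictionary word starting at `start`,
        falling back to a single character."""
        node, best = self.root, start + 1
        for i in range(start, len(text)):
            if text[i] not in node:
                break
            node = node[text[i]]
            if '$' in node:
                best = i + 1
        return text[start:best]

trie = Trie(["爱", "爱丽丝", "丝", "说"])
print(trie.longest_match("爱丽丝说"))  # 爱丽丝, not just 爱
```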

One question that occurred to me, and which I should now be able to answer, is: do words that have the same or a similar pronunciation tend to occupy different parts of a sentence to avoid confusion?

The tree is based on 5141 lines of text, consisting of 92 800 words and 132 000 characters; it is built from the 34 words that have a frequency greater than 0.35% in the text.


Rare pinyin

I recently started work on a program to add nicely formatted pinyin to Chinese text. In order to test whether it correctly added tone marks to all possible pinyin, I worked through a table of pinyin and was surprised by some of the valid pinyin that I'd not come across before.
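
For reference, the standard placement rule (mark 'a' or 'e' if present, the 'o' of 'ou', otherwise the final vowel) fits in a few lines of Python; the numbered-pinyin input format is just for illustration:

```python
import unicodedata

# combining diacritics for tones 1-4 (tone 5 is unmarked)
TONE_MARKS = {'1': '\u0304', '2': '\u0301', '3': '\u030c', '4': '\u0300'}

def add_tone_mark(syllable):
    """Convert numbered pinyin (e.g. 'hao3') to marked pinyin ('hǎo')."""
    tone, letters = syllable[-1], syllable[:-1]
    if tone not in TONE_MARKS:
        return syllable
    lower = letters.lower()
    for pattern in ('a', 'e', 'ou'):
        pos = lower.find(pattern)
        if pos != -1:
            break
    else:
        # no a/e/ou: mark the last vowel (covers 'iu', 'ui', 'ü' etc.)
        pos = max(lower.rfind(v) for v in 'iouü')
    marked = letters[:pos + 1] + TONE_MARKS[tone] + letters[pos + 1:]
    return unicodedata.normalize('NFC', marked)

for p in ('ma1', 'hao3', 'gei3', 'shei2', 'dun4', 'lü4'):
    print(add_tone_mark(p))
# mā hǎo gěi shéi dùn lǜ
```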

Pinyin ending in -en

Initials: d, t, n, l

There are many words ending in -en, but there are gaps in my table corresponding to ten and len. There was an entry for den, although there was no such word on MDBG. According to cojak.org, 扽 (to move, shake) can be pronounced dèn, but dùn is the normal reading. Similarly, 参 (to participate) can be pronounced dēn (also sān, shēn, cēn or sǎn), but is normally cān, which is how I learnt it. There was also an entry for nen, for which there is a single word on MDBG: 嫩 (nèn; tender or inexperienced).

Initials: z, c, s

Other than the very common word 怎 (zěn; how), there is only one other zen word on MDBG: 谮 (zèn; to slander). As I mentioned above, 参 can be pronounced cēn according to cojak.org, but MDBG lists just three cen words: 岑 (cén; small hill), 涔 (cén; overflow, rainwater) and 嵾 (cēn; uneven). Finally, I had known one sen word: 森 (sēn; forest), and it seems there is only one other: 椮 (sēn; lush growth).

Pinyin ending in -ei

There is a single character pronounced ei: 诶, which means "hey" and can have any tone. According to MDBG, the meanings are: ēi - to call someone; éi - to express surprise; ěi - to express disagreement; and èi - to express agreement. It is debatable to what extent these are really words.

Initials: g, k, h

When I was first learning Chinese, I noticed that, while 给 (gěi; to give) is very common, it is the only Chinese word pronounced gei. There is also only one word pronounced kei: 克 (kēi; to scold, beat; more commonly pronounced kè and meaning gram or to restrain). There are two words pronounced hei: 黑 (hēi; black) and 嘿 (hēi; hey), which I suspect is more modern.

Initials: zh, ch, sh

Like 给, both 这 (zhèi or zhè; this, here) and 谁 (shéi; who) are very common words with unique pronunciations. There is no word pronounced chei.

Initials: z, c, s

There are no words pronounced cei or sei, but there is one pronounced zei: 贼 (zéi; thief, deceitful).

Initials: d, t, n, l

Again, like 给, 得 (děi; must, ought to) is common and unique. Like 这, 哪 (nǎ; which) has an alternative ei-pronunciation: něi. There are two other nei words, both quite common: 内 (nèi; inside) and 馁 (něi; hungry). There are many words pronounced lei but none pronounced tei.

Conclusion

This is by no means a definitive look at pinyin frequencies; I know I have missed several rare sounds [EDIT: such as miù (谬, meaning erroneous or absurd)]. At some point I'd like to get a full set of counts for all the different sounds for all the words in the MDBG dictionary. I suspect that words with the initials b, p, m or f are the most frequent, whilst words starting with d, t, n or l are probably the least frequent.

I don't have any explanation for the distribution of sounds. In several cases the rare pinyin are associated with a common word. I wondered if this was to reduce the chance of confusion by making the most common words the most distinct. However, if this were the case, then you would expect the most common verb, 是, to have a rare pronunciation, rather than shì, which I think is the most common.

I wonder if there is some connection with the fact that, in most languages, irregular verbs are most likely to be common verbs (to be, to go, to have etc.); verbs used less often follow simple rules for the past tense etc., because people are less likely to remember irregularities if they rarely come across them. Maybe there used to be a different distribution of sounds in Chinese, but they have shifted over time, leaving only common words with the more unusual sounds. But that's only a vague hypothesis and I have no evidence for it.