Character frequency analysis

As a start to a dictionary that might one day learn Chinese, I created a little program that analyses the frequency of characters in a collection of texts. Currently, I have over 50,000 characters' worth of text, which includes about 2,000 different characters. My analyser program started off simply counting how often each character occurs in these sample texts. This should help me identify which characters are most common and so best to learn. The program found that the most common character in the texts I've collected is 的, which makes up 3.1% of the characters. According to A Key to Chinese Speech and Writing, 的 is the most common character in Chinese and makes up about 4% of text; I'm not sure why it is less common in my text, but I suspect it's because 的 occurs more often in more complex sentences.
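(For the curious, the counting itself only takes a few lines of Python. This is a minimal sketch, assuming the corpus is a set of UTF-8 text files; the file names are placeholders, not my actual files.)

from collections import Counter
from pathlib import Path

def char_frequencies(paths):
    """Count how often each Chinese character occurs across the texts."""
    counts = Counter()
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        # Keep only characters in the main CJK Unified Ideographs block,
        # skipping punctuation, Latin letters and whitespace.
        counts.update(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
    return counts

counts = char_frequencies(["alice.txt", "smartfm.txt"])  # placeholder names
total = sum(counts.values())
for char, n in counts.most_common(10):
    print(f"{char}: {n} ({100 * n / total:.1f}%)")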

In my text, the most common characters are 的, 了, 我, 一, 是, 不, 这, and 他. 

A Key to Chinese Speech and Writing states that the most common characters are 的, 一, 了, 是, 不, 我, 在, 有 and 人, in that order. My analysis may be skewed by Alice in Wonderland (from nciku, and a translation I bought in China), which currently makes up the bulk of my text. The result is a glut of 'Ailisi', Red Queens and Cheshire Cats. The other sentences are from smart.fm courses and various study books, so are generally quite simple, which is probably why the pronouns and 这 are more common than normal.

I then expanded the analysis to count the frequency of characters before and after each of those in the text. I hope this will help me to learn how characters are most commonly used. For example, I first learnt that 发 means 'hair' or 'to emit'; my program shows that ~45% of the time, 发 is part of the word 发现 or 发生 (meaning 'to discover' and 'to occur' respectively). I think that learning these meanings is probably a better strategy than trying to learn what the character means by itself.
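(Again, the counting is simple. Here's a sketch of the neighbour count, assuming text holds the whole corpus as one string; the count for preceding characters is the same with the pair reversed.)

from collections import Counter, defaultdict

def is_hanzi(ch):
    return "\u4e00" <= ch <= "\u9fff"

def following_frequencies(text):
    """For each character, count which characters immediately follow it."""
    following = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        if is_hanzi(a) and is_hanzi(b):
            following[a][b] += 1
    return following

following = following_frequencies(text)
after_fa = following["发"]
total = sum(after_fa.values())
for ch, n in after_fa.most_common(5):
    print(f"发{ch}: {100 * n / total:.1f}%")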

[Screenshot: text analyser 1]

One difficulty at the moment is that it's not possible to know whether the character in question actually forms a word with the previous or following characters. However, if the combination of characters is very common, then it may not really matter. For example, 我 is followed by 的 about 10% of the time, and it probably doesn't matter whether you consider 我的 as a word meaning 'my' or a common grammatical construction of 'me' + 'possessive'. The next step in my program will be to combine characters into words (probably using a dictionary to check whether they are really words, or perhaps just joining frequently occurring pairs of characters). I can then see which words are the most common and which words border them. However, I foresee a potential problem with overlapping words, or words made up of more than two characters.
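(A sketch of the pair-joining option, without the dictionary check: it greedily merges the most frequent adjacent pair into a single token until no pair is frequent enough. The threshold of 50 is an arbitrary assumption. It doesn't solve the overlap problem, but merging repeatedly does let words grow beyond two characters.)

from collections import Counter

def merge_frequent_pairs(tokens, min_count=50):
    """Greedily join the most frequent adjacent pair until none is common."""
    while True:
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            return tokens
        (a, b), n = pairs.most_common(1)[0]
        if n < min_count:
            return tokens
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)  # join the pair into one token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged

words = merge_frequent_pairs(list(text))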

Another feature that I'd like to add is the ability to group words into grammatical groups, such as pronouns, verbs, adjectives etc. (I hope to add this information to my Chinese reader too). I should then be able to identify grammatical patterns, such as which words tend to precede adjectives (很 and 真, for example). It would also identify patterns such as 们 always following a pronoun, though I'm not sure how to deal with this if 我们, 他们, 你们 etc. are viewed as single words. Maybe I'll just look at sentences both as collections of single hanzi and as collections of words.
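(As a toy illustration of what I have in mind — the tag dictionary here is hand-made and purely hypothetical; a real version would need tags for thousands of words.)

from collections import Counter

# Hypothetical entries; a real version would load tags from a dictionary.
tags = {"很": "adverb", "真": "adverb", "好": "adjective",
        "大": "adjective", "高": "adjective", "我": "pronoun"}

before_adjective = Counter()
for a, b in zip(text, text[1:]):
    if tags.get(b) == "adjective":
        before_adjective[a] += 1

print(before_adjective.most_common(5))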

Comments

First off, amazing website. I think you and I are on the same wavelength -- but you are doing a very nice job of externalizing.

I admire your goal of generating a lexicographical resource (dictionary) from a corpus using unsupervised machine learning; this is a strong interest of mine too, although I've only taken minor steps in this direction. I recommend this book on statistical NLP; although it focuses on English, the concepts are totally applicable to Chinese. (Plus there is much fun in adapting to other languages!)

---

I saw your Chinese distance mapping simulation -- the "cloud" one. 

I tried this myself for English but was disappointed with the results: it seems that 2D Euclidean space is just not complex enough to permit a nice embedding of the word similarity function. Instead, you can try using a dendrogram. I'd love to see this for Chinese. Here is a link to what I got from my program: http://quasiphysics.wordpress.com/2011/08/17/clustering-jane-austen/
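Roughly, in Python (scipy + matplotlib), using co-occurrence vectors as a crude stand-in for a similarity function; text is assumed to be your corpus as one string:

from collections import Counter
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Represent each of the 50 most frequent characters by how often it is
# immediately followed by each of the others (a crude similarity feature).
chars = [c for c, _ in Counter(text).most_common(50)]
index = {c: i for i, c in enumerate(chars)}
vectors = [[0] * len(chars) for _ in chars]
for a, b in zip(text, text[1:]):
    if a in index and b in index:
        vectors[index[a]][index[b]] += 1

Z = linkage(vectors, method="average", metric="cosine")
dendrogram(Z, labels=chars)
plt.show()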

---

You wrote: 

One difficulty at the moment is that it's not possible to know whether the character in question actually forms a word with the previous or following characters.

I believe that you should be able to determine whether bigrams form words just by looking at their relative distribution. After all, what are words but collocations anyway?

You can do something like this:

S = P[AB] / (P[A] * P[B])^alpha

Start with alpha = 1 and see if large S correlates with "AB" being a word. IIRC, values of alpha other than unity may be helpful. Or you might use some other denominator f(P[A], P[B]).
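In Python it might look like this (text is your raw corpus string; min_pair just filters out rare, noisy pairs):

from collections import Counter

def association_scores(text, alpha=1.0, min_pair=5):
    """Score adjacent pairs by S = P[AB] / (P[A] * P[B])**alpha."""
    chars = Counter(text)
    pairs = Counter(zip(text, text[1:]))
    n = sum(chars.values())
    scores = {}
    for (a, b), count in pairs.items():
        if count < min_pair:
            continue  # skip rare pairs: their probability estimates are unreliable
        p_ab = count / (n - 1)
        p_a, p_b = chars[a] / n, chars[b] / n
        scores[a + b] = p_ab / (p_a * p_b) ** alpha
    return scores

for pair, s in sorted(association_scores(text).items(),
                      key=lambda kv: -kv[1])[:20]:
    print(pair, round(s, 1))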

---

Ok, that's all I have time for now, but I look forward to reading more of your blog/site.

Talk to you soon,

Chris


Thanks for the comment.

Having had a brief look at your website, I also think that we're on the same wavelength. I also suspect you're much better than me at both Chinese and maths. It's slightly embarrassing to read over these old posts; at the time of writing them, I'd not even heard of NLP. It was quite gratifying when I started an online NLP course (on Coursera) to find that I had been thinking along the correct lines (but annoying that it had already been done, and with a lot more rigour).

I take your point about words being collocations, although your approach seems to assume that characters would otherwise be randomly distributed. I don't see how it would distinguish between words (say 他们) and grammatical patterns (say 他是). On the other hand, maybe it doesn't matter, since one thing I was interested in was seeing whether I could predict grammatical patterns, and it is sometimes hard to say what is and is not a word (you could argue that 他们 is not a word but a grammatical pattern; 他的 might be a better example).

Your comment has inspired me to look back at this project and maybe try some more things. I've learnt quite a bit more bioinformatics in the meantime, and have always thought bioinformatic algorithms are ideal for language analysis, since they are all about comparing sequences of symbols.

I look forward to reading your site and hearing from you,

Peter

I think these ideas are amazing, and you may be on to something with how you analyze the language to find patterns and an understanding of how it comes together. It may be the key to unlocking a better interpretation program. Right now, I find that many interpretation apps/programs/sites are only fair at translating Chinese. It would be interesting to see whether something better could be done using some kind of analyzing AI, which could also help teach us the intricacies of the language and how to construct our sentences in a more natural way.

Have you done anything in the way of a public release of this software?  If so, where can I download it?
