Chinese Analysis
This project came about while I was working on my Chinese Reader. I was thinking about how to store words in a dictionary, and I realised that by analysing the relationships between words in Chinese, I should be able to achieve several goals:
- To predict the correct pronunciation or meaning of a word that has more than one, based on context. For example, when is 地 pronounced dì and when is it pronounced de? When is 花 a noun meaning 'flower', and when is it a verb meaning 'to spend'?
- To identify grammatical patterns, such as Verb-不-Verb, 越来越-Adjective, 连...都.
- To identify patterns in words, for example, 人 often follows a country and means a person from that country.
- To predict when an unknown string of characters is a name.
- To determine which verbs are associated with a given noun and vice versa. For example, if I know the word for dream is 梦 but don't know how to say 'I had a dream', then looking up 'had' or 'have' in the dictionary is unlikely to be helpful. However, searching a corpus of text for verbs associated with the noun 'dream' should tell me that in Chinese you can make (做) or see (看) dreams.
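The last goal, finding which words tend to accompany a given noun, can be sketched as a simple co-occurrence count. This is only a minimal illustration, not the project's actual method: the function name `neighbours`, the `window` parameter, and the four-sentence `toy_corpus` are all made up for the example, and it counts single characters next to the target rather than segmented, part-of-speech-tagged words.

```python
from collections import Counter

# Hypothetical toy corpus; a real corpus would be far larger.
toy_corpus = [
    "我昨天做梦了",
    "他做了一个梦",
    "她常常做梦",
    "我梦见你了",
]

def neighbours(corpus, target, window=1):
    """Count characters found within `window` positions of each
    occurrence of `target` (on either side) across all sentences."""
    counts = Counter()
    for sentence in corpus:
        start = sentence.find(target)
        while start != -1:
            end = start + len(target)
            # Characters immediately before and after the target.
            counts.update(sentence[max(0, start - window):start])
            counts.update(sentence[end:end + window])
            start = sentence.find(target, start + 1)
    return counts

if __name__ == "__main__":
    # Show the characters that most often sit next to 梦 in the toy corpus.
    print(neighbours(toy_corpus, "梦").most_common(3))
```

Even on this tiny sample, 做 is the most frequent neighbour of 梦. Telling verbs apart from other neighbours, such as the measure word 个, would need word segmentation and part-of-speech information, which this sketch deliberately leaves out.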
In essence, what I want to create is an artificial intelligence that can learn to comprehend Chinese. This should be an interesting problem, especially given that I myself am far from fully understanding the intricacies of Chinese. At the very least, I hope that in attempting to build such an AI, I will learn more about how Chinese sentences are organised.
Because I was reading An Introduction to Bioinformatics Algorithms whilst I was thinking about these problems, I ended up applying various bioinformatics algorithms to Chinese, with some success. I think these algorithms are ideal, since they are generally used to identify patterns in, or similarities between, sequences of symbols.
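The text above does not say which algorithms were applied, so the following is only one illustration of the general idea. The longest common subsequence, a classic dynamic-programming technique for comparing sequences, treats two sentences as strings of symbols and recovers what they share. The function name and the example sentences are my own assumptions.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of strings a and b."""
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, char_a in enumerate(a, start=1):
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover the characters.
    i, j = len(a), len(b)
    shared = []
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            shared.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(shared))

if __name__ == "__main__":
    # Two sentences that share the 越来越 pattern.
    print(longest_common_subsequence("他越来越高了", "天气越来越冷"))
```

For these two sentences the shared subsequence is 越来越, which is exactly the kind of grammatical pattern listed among the goals. With real text, unrelated sentences will also share common characters by chance, so the comparison would need to be repeated over many sentence pairs before a pattern could be trusted.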