Saturday, 24th March 2012
I recently started work on a program to add nicely formatted pinyin to Chinese text. To test whether it correctly added tone marks to all possible pinyin, I worked through a table of pinyin and was surprised by some of the valid syllables, which I had not come across before.
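A minimal sketch of this kind of tone-mark conversion, not the program itself, might look like the following in Python. It assumes the input is numbered pinyin (for example gei3), that ü may be typed as v or u:, and that the usual placement convention applies: the mark goes on a or e if present, on the o of ou, and otherwise on the last vowel. The names mark_syllable and numbered_to_marked are made up for illustration.

```python
import re

# Tone-marked forms of each vowel, indexed by tone number minus one.
TONE_MARKS = {
    "a": "āáǎà",
    "e": "ēéěè",
    "i": "īíǐì",
    "o": "ōóǒò",
    "u": "ūúǔù",
    "ü": "ǖǘǚǜ",
}


def mark_syllable(syllable: str, tone: int) -> str:
    """Return one pinyin syllable with its tone mark added.

    Assumption: tone 5 (neutral) or any other number leaves the syllable unmarked.
    """
    # Normalise the two common ASCII spellings of ü.
    syllable = syllable.replace("u:", "ü").replace("v", "ü")
    if tone not in (1, 2, 3, 4):
        return syllable

    lower = syllable.lower()
    if "a" in lower:
        idx = lower.index("a")
    elif "e" in lower:
        idx = lower.index("e")
    elif "ou" in lower:
        idx = lower.index("o")
    else:
        vowels = [i for i, ch in enumerate(lower) if ch in TONE_MARKS]
        if not vowels:
            return syllable  # nothing to mark, e.g. an interjection like "hm"
        idx = vowels[-1]

    marked = TONE_MARKS[lower[idx]][tone - 1]
    if syllable[idx].isupper():
        marked = marked.upper()
    return syllable[:idx] + marked + syllable[idx + 1:]


def numbered_to_marked(text: str) -> str:
    """Replace every letters-plus-digit token (such as 'zen3') in text."""
    pattern = re.compile(r"([A-Za-zü:]+)([1-5])")
    return pattern.sub(lambda m: mark_syllable(m.group(1), int(m.group(2))), text)


if __name__ == "__main__":
    print(numbered_to_marked("gei3 nen4 miu4 shei2 lv4"))
```

Working through a full table of syllables with a function like this is one way to spot unfamiliar but valid pinyin.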
Pinyin ending in -en
Initials: d, t, n, l
There are many words ending in -en, but there are gaps in my table corresponding to MDBG. According to cojak.org, 扽 (to move, shake) can be pronounced , but is the normal reading. Similarly, 参 (to participate) can be pronounced (also , , or ), but is normally , which is how I learnt it. There was also an entry for nen, for which there is a single word on MDBG: 嫩 (nèn; tender or inexperienced).and . There was an entry for , although there was no such word on
Initials: z, c, s
Other than the very common word 怎 (zěn; how), there is only one other zen word on MDBG: 谮 (zèn; to slander). As I mentioned above, 参 can be pronounced cēn according to cojak.org, but MDBG lists just three cen words: 岑 (cén; small hill), 涔 (cén; overflow, rainwater) and 嵾 (cēn; uneven). Finally, I had known one sen word, 森 (sēn; forest), and it seems there is only one other: 椮 (sēn; lush growth).
Pinyin ending in -ei
There is a single character pronounced ei: 诶, which means "hey" and can have any tone. According to MDBG, the meanings are: ēi, to call someone; éi, to express surprise; ěi, to express disagreement; and èi, to express agreement. It is debatable to what extent these are really words.
Initials: g, k, h
When I was first learning Chinese, I noticed that while 给 (gěi; to give) is very common, it is the only Chinese word pronounced gei. There is also only one word pronounced kei: 克 (kēi; to scold, beat; more commonly pronounced kè and meaning gram or to restrain). There are two words pronounced hei: 黑 (hēi; black) and 嘿 (hēi; hey), which I suspect is more modern.
Initials: zh, ch, sh
Like 给, both 这 (zhèi or zhè; this, here) and 谁 (shéi; who) are very common words with unique pronunciations. There is no word pronounced chei.
Initials: z, c, s
There are no words pronounced cei or sei, but there is one pronounced zei: 贼 (zéi; thief, deceitful).
Initials: d, t, n, l
Again, like 给, 得 (děi; must, ought to) is common and unique. Like 这, 哪 (nǎ; which) has an alternative ei-pronunciation: něi. There are two other nei words, both quite common: 内 (nèi; inside) and 馁 (něi; hungry). There are many words pronounced lei but none pronounced tei.
This is by no means a definitive look at pinyin frequencies; I know I have missed several rare sounds [EDIT: such as miù (谬, meaning to deceive)]. At some point I'd like to get a full set of counts for all the different sounds for all the words in the MDBG dictionary. I suspect that words with the initials b, p, m or f are the most frequent, whilst words starting with d, t, n or l are probably the least frequent.
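A rough sketch of how such a count could be made is shown below. It assumes the dictionary is available as a text file in the CC-CEDICT line format (traditional form, simplified form, numbered pinyin in square brackets, then definitions between slashes). The sample lines are hand-written for illustration and are not copied from any dictionary file, and the function name count_syllables is invented. Note that it counts syllable occurrences with tones ignored, which is only one of several ways to define "counts for all the different sounds".

```python
import re
from collections import Counter

# Matches "TRAD SIMP [pin1 yin1] ..." and captures the bracketed pinyin.
ENTRY = re.compile(r"^\S+ \S+ \[([^\]]+)\]")


def count_syllables(lines):
    """Count toneless pinyin syllables over CC-CEDICT-style lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # comment/header lines
        match = ENTRY.match(line)
        if match is None:
            continue  # skip anything that does not look like an entry
        for token in match.group(1).split():
            # Drop the tone digit and normalise "u:" to "ü".
            base = token.rstrip("12345").lower().replace("u:", "ü")
            if base.isalpha():  # ignore punctuation tokens
                counts[base] += 1
    return counts


# Hand-written sample entries in the assumed format.
SAMPLE = [
    "# sample data",
    "給 给 [gei3] /to give/",
    "黑 黑 [hei1] /black/",
    "嘿 嘿 [hei1] /hey/",
    "內 内 [nei4] /inside/",
]

if __name__ == "__main__":
    for syllable, n in count_syllables(SAMPLE).most_common():
        print(syllable, n)
```

Grouping the resulting counts by initial would then give a direct check on the guess about which initials are most and least frequent.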
I don't have any explanation for the distribution of sounds. In several cases the rare pinyin are associated with a common word. I wondered if this was to reduce the chance of confusion by making the most common words the most distinctive. However, if this were the case, you would expect the most common verb, 是, to have a rare pronunciation, rather than shì, which I think is the most common.
I wonder if there is some connection with the fact that in most languages, irregular verbs are most likely to be common verbs (to be, to go, to have, etc.); verbs used less often follow simple rules for the past tense and so on, because people are less likely to remember irregularities if they rarely come across them. Maybe there used to be a different distribution of sounds in Chinese, but they have shifted over time, leaving only common words with the more unusual sounds. But that's only a vague hypothesis and I have no evidence for it.