Basic analysis

22 Jan 2019

Introduction

As a sanity check for my word counts, I did some basic frequency analysis. This sort of thing has been done a lot, so I can check my results against what's expected. It's also quite useful for word puzzles and cracking simple ciphers.

Word frequencies

Here are the ten most common words according to my list. If you compare it to the word list on Wikipedia, you'll notice that my list appears to be missing two verbs, "be" and "have". That's because the Wikipedia list uses lemmas, meaning it combines all the different forms of a word. So the words "be", "is", "are", "am", "was", "were", etc. are all combined into a single root word: "be". Likewise, "have" is combined with "has", "had", etc.

Other than that, my list is in pretty good agreement about the top ten.
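To make the lemma effect concrete, here is a minimal sketch of how merging word forms can change a ranking. The counts and the LEMMAS table are made up for illustration; they are not my real data, and a proper lemma list would cover far more forms.

```python
from collections import Counter

# Hypothetical toy counts; the real list has far more entries.
word_counts = Counter({
    "the": 50, "of": 30, "is": 12, "was": 9,
    "be": 7, "are": 6, "have": 5, "had": 4,
})

# Hypothetical, hand-written lemma table covering only a few forms.
LEMMAS = {
    "is": "be", "are": "be", "am": "be", "was": "be", "were": "be",
    "has": "have", "had": "have",
}

def merge_lemmas(counts, lemmas):
    """Add each word's count to its lemma (or to itself if it has none)."""
    merged = Counter()
    for word, n in counts.items():
        merged[lemmas.get(word, word)] += n
    return merged

lemma_counts = merge_lemmas(word_counts, LEMMAS)
print(word_counts.most_common(3))   # ranking by surface form
print(lemma_counts.most_common(3))  # ranking after merging forms
```

In this toy data "be" on its own ranks well below "of", but once "is", "was" and "are" are folded in it moves up to second place, which is the same effect that puts "be" and "have" in a lemma-based top ten.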

Word lengths

I also had a look at the distribution of word lengths. The peak (mode) is at three letters, the median is four letters, and the mean is 4.5 letters. Surprisingly, given that there are only two one-letter words, they make up nearly 4% of words in the corpus. As the word frequency graph above shows, "a" makes up over 2% of words and "I" just over 1%.

The longest word is the 21-letter "counterdemonstrations", which appears 5 times in the 400+ million word corpus.
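For anyone wanting to reproduce these figures, here is a rough sketch of the calculation, assuming the word counts are held in a dictionary mapping each word to the number of times it appears. The function name and the toy data are my own inventions for the example. Each word length is weighted by how often the word occurs, so the statistics describe words in running text rather than entries in the word list.

```python
from collections import Counter

def length_stats(word_counts):
    """Return (mode, median, mean) of word length, weighted by word count."""
    by_length = Counter()
    for word, n in word_counts.items():
        by_length[len(word)] += n

    total = sum(by_length.values())
    mode = max(by_length, key=by_length.get)
    mean = sum(length * n for length, n in by_length.items()) / total

    # Median: walk the lengths in order until half the tokens are covered.
    # For an even total this picks the lower of the two middle values.
    running = 0
    median = None
    for length in sorted(by_length):
        running += by_length[length]
        if running * 2 >= total:
            median = length
            break
    return mode, median, mean

# Hypothetical toy counts, not the real corpus.
toy_counts = {"a": 4, "the": 10, "word": 3, "count": 3}
mode, median, mean = length_stats(toy_counts)
print(mode, median, mean)
```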

Letter frequencies

Finally, I took a look at letter frequencies, which is also something that has been well studied. My counts are in pretty good agreement with the accepted order of letters, with just two pairs swapped: r and h, and c and u.

There is a noticeable jump to the runts of the alphabet: x, j, q and z.
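A letter count along these lines can be sketched in a few lines. As before, this assumes a dictionary of word counts, and the function name and toy data are hypothetical. Every letter in a word is counted once for each time the word occurs.

```python
from collections import Counter

def letter_frequencies(word_counts):
    """Return {letter: share of all letters}, most common letter first."""
    letters = Counter()
    for word, n in word_counts.items():
        for ch in word.lower():
            if ch.isalpha():  # skip apostrophes, hyphens, digits
                letters[ch] += n
    total = sum(letters.values())
    return {ch: count / total for ch, count in letters.most_common()}

# Hypothetical toy counts, not the real corpus.
freqs = letter_frequencies({"the": 3, "tea": 1})
for ch, share in freqs.items():
    print(f"{ch}: {share:.1%}")
```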

Starting and ending letters

Frequency with which letters start a word

Frequency with which letters end a word

Given that e is the most common letter, it's a bit surprising that it's only the 16th most common letter to start a word. On the other hand, nearly one in five words ends in e. Conversely, a and i are common at the start of words, but much less likely to end a word.

To analyse this a bit further, I looked at the probability that a letter will start a word relative to the probability that it ends a word.

The graph shows a 100% starting rate for q because no words in my list end in q. j and v both have values of 99.9%, with just taj, raj, hajj, haj, hadj, swaraj, lev, rev, shiv, dev, moshav, tav and vav being the rare exceptions. I'm surprised b and c are so high (particularly c, given words like comic, magic and music).

At the other end of the spectrum is x, with a value of 0.6%. x starts very few words, but ends many common ones (six, tax, sex, box, complex, mix, fix, index, wax and fox are the top ten). As I've already noted, e is at this end too, as are n, y, d and r, which are all parts of common word endings.
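The percentages above are consistent with a simple ratio: the number of times a letter starts a word divided by the number of times it either starts or ends one. The sketch below works on that assumption and also assumes each word is weighted by its count in the corpus. The function name and toy data are hypothetical.

```python
from collections import Counter

def start_rates(word_counts):
    """Return {letter: starts / (starts + ends)}, weighted by word count."""
    starts, ends = Counter(), Counter()
    for word, n in word_counts.items():
        if not word:
            continue
        starts[word[0]] += n
        ends[word[-1]] += n

    rates = {}
    for ch in set(starts) | set(ends):
        # A Counter returns 0 for letters it has never seen, so a letter
        # that never ends a word gets 1.0 and one that never starts gets 0.0.
        rates[ch] = starts[ch] / (starts[ch] + ends[ch])
    return rates

# Hypothetical toy counts, not the real corpus.
rates = start_rates({"queen": 2, "six": 3, "xray": 1, "raj": 1})
for ch in sorted(rates, key=rates.get, reverse=True):
    print(f"{ch}: {rates[ch]:.0%}")
```

Note that a one-letter word such as "a" counts as both a start and an end here, which pulls that letter's value towards 50%.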

Analysing English