What is a word?

19 Jan 2019 Code on Github

Introduction

For most of these analyses, I've been using a list of word counts from contemporary American English sources, which I got from http://www.wordfrequency.info (it seems the list is no longer available there): Davies, Mark. (2011) Word frequency data from the Corpus of Contemporary American English (COCA).

When is a word not a word?

The COCA list was scraped from various sources including scientific books and internet forums. It includes 365,749 "words" which appear a total of 404 million times. The words are defined as any sequence of non-number characters found in the text. Which of these sets of characters are words is a matter for debate and depends on context.

For my projects I was interested in the common patterns of letters in words, so I wanted to filter out most of the word types in the list below, with the possible exception of some jargon words and proper nouns.

abbreviations, e.g. Dr and co.
acronyms, e.g. NATO
contractions, e.g. 's and 'll
hyphenated words, e.g. long-term
initials, e.g. USA
jargon, e.g. theobromine
proper nouns, e.g. Obama
slang, e.g. lol and qwq
symbols, e.g. XII
typos, e.g. teh and abd

Notice that contractions appear as separate words, so you'll is not in the list, but 'll is. This surprised me, but it makes sense, since 'll is the actual word.

Filtering the word list

To filter out typos and the more obscure proper nouns and jargon, I used the "CROSSWD" list (found here) from Project Gutenberg's Moby Word II project. This is a list of 113,809 words permitted in crossword games such as Scrabble. I was originally going to use the COMMON word list, which contains words found in two or more published dictionaries, but it doesn't include inflected forms, so it excluded words like "says" or "going".

The only change I made to the CROSSWD list was to add the words "a" and "I" to the list as these are not allowed in Scrabble or crosswords. I did wonder whether the list would be limited to words that fit on a Scrabble board, but it includes counterdemonstrations, hyperaggressivenesses and microminiaturizations, which are all 21 letters long (and shouldn't be single words in my view, but that's by-the-by).
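As a rough illustration of that step, here is a minimal Python sketch that adds the two missing words to a word set and finds its longest entries. The function names and the tiny sample set are made up for this example, and it assumes everything is lowercased (so "I" is stored as "i"); the real CROSSWD file would be loaded from disk instead.

```python
def extend_word_list(words, extra=("a", "i")):
    """Return the word set plus the extra words (assumed lowercase)."""
    return set(words) | set(extra)


def longest_words(words):
    """Return the maximum word length and the words of that length, sorted."""
    longest = max(len(word) for word in words)
    return longest, sorted(word for word in words if len(word) == longest)


# Hypothetical sample standing in for the full CROSSWD list.
crosswd_sample = {
    "says",
    "going",
    "counterdemonstrations",
    "hyperaggressivenesses",
    "microminiaturizations",
}

allowed = extend_word_list(crosswd_sample)
length, examples = longest_words(allowed)
print(length, examples)
```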

Removing any word that doesn't appear in the CROSSWD list reduces the number of words to 69,306 (less than 20% of the original number). At the same time, the total number of word occurrences only drops to 370 million (roughly 92% of the original). So while we're greatly reducing the number of distinct words, we're mainly getting rid of infrequent ones.
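The filtering itself can be sketched in a few lines of Python. This is only an illustration under assumptions: the counts are held in a dict mapping each word to its number of occurrences, the allowed words are a set, and the toy numbers below are invented, not taken from COCA.

```python
def filter_counts(counts, allowed):
    """Keep only the entries whose word appears in the allowed set."""
    return {word: n for word, n in counts.items() if word in allowed}


def retained_shares(counts, kept):
    """Return the share of distinct words kept and the share of occurrences kept."""
    word_share = len(kept) / len(counts)
    occurrence_share = sum(kept.values()) / sum(counts.values())
    return word_share, occurrence_share


# Invented toy data: word -> number of occurrences.
toy_counts = {"the": 1000, "a": 500, "going": 120, "'ll": 80, "obama": 40, "teh": 3}
toy_allowed = {"the", "a", "going", "says"}

kept = filter_counts(toy_counts, toy_allowed)
word_share, occurrence_share = retained_shares(toy_counts, kept)
print(f"{word_share:.0%} of words kept, {occurrence_share:.0%} of occurrences kept")
```

Even in this toy case, half of the distinct words are dropped while most occurrences survive, because the dropped words are the rare ones.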

Words in COCA that are not in CROSSWD

's
're
'm
've
'd
'll
mr
american
u.s.
ca

Words in CROSSWD that are not in COCA

camisades
counterprotests
gabbroid
overtreats
powters
osculated
shopboy
outseen
overcropped
norias

The first list is sorted by word frequency and shows that we are successfully filtering out contractions and initials. It does seem weird not to allow the word "American", but then it is a proper noun. The second list is not sorted by frequency, since I don't have frequencies for these words: they are so rare that they didn't appear even once in the corpus of about 400 million words.
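Both comparison lists are essentially set differences. A minimal sketch, again assuming a dict of counts and a set of allowed words (the names and toy data are hypothetical); the second function sorts alphabetically only to make the output reproducible, which is a choice made for this example:

```python
def only_in_corpus(counts, allowed, n=10):
    """Most frequent corpus words that are missing from the allowed set."""
    missing = [word for word in counts if word not in allowed]
    return sorted(missing, key=lambda word: counts[word], reverse=True)[:n]


def only_in_word_list(counts, allowed, n=10):
    """Allowed words that never occur in the corpus (no frequency to sort by)."""
    return sorted(allowed - set(counts))[:n]


# Invented toy data: word -> number of occurrences.
toy_counts = {"the": 1000, "'s": 900, "mr": 300, "american": 200, "going": 120}
toy_allowed = {"the", "going", "norias", "gabbroid"}

print(only_in_corpus(toy_counts, toy_allowed))
print(only_in_word_list(toy_counts, toy_allowed))
```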

The final list still contains some proper nouns, jargon and initials, like guldens, euglena and khi, but I think it's good enough. Those words occur at a very small frequency, so shouldn't have too much effect.

Analysing English