As I've previously written, I'm an avid consumer of videos and exercises at the Khan Academy. At the current count it contains over 2400 videos explaining various school and college-level topics, from basic addition to linear algebra, finance, biology and chemistry. After watching some 400 videos, I found myself idly wondering how often Sal Khan uses certain phrases, such as "now I'll arbitrarily change colour", "I made a mistake there" and "and he has a hat". So, I thought it might be interesting to analyse the transcripts of his videos. While looking to see how I could write some exercises for Khan here, I found what I hadn't realised I wanted: the subtitles for a number of his videos (1067 in 16 subjects, representing 200 hours' worth of talking) as text files.
There's no particular aim with this project, and it is somewhat pointless, but I'm quite interested in text analysis (or natural language processing, to give it its fancy title), and it may turn out to be helpful for analysing Chinese text, which is another project I'm working on. I've also found that there is a significant crossover between text analysis and bioinformatics (which is essentially analysing strings of characters).
I've started with some very basic analysis of the subtitle text. Firstly, the number of sentences per video for each of the different topics, which should give a rough measure of the relative length of videos in each topic. This information is actually available here, but I've recalculated it to test my SRT (SubRip file format) parsing and sentence-splitting (not as easy as you might imagine if you want to avoid splitting decimal numbers).
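To illustrate the decimal-number problem, here is a minimal sketch of SRT parsing and sentence splitting. It is not the parser used for the charts in this post; the names `srt_to_text` and `split_sentences` are hypothetical, and it assumes well-formed SRT cues (a number line, a timestamp line, then caption text).

```python
import re

# Matches an SRT timestamp line such as "00:00:01,000 --> 00:00:04,000".
TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

# Split only on whitespace that follows ., ! or ?.
# The "." in "3.5" is followed by a digit, not whitespace, so it is never a split point.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def srt_to_text(srt):
    """Drop blank lines, cue numbers and timestamp lines; join the caption text."""
    kept = []
    for line in srt.splitlines():
        line = line.strip()
        # Caveat: a caption line consisting only of digits would also be dropped.
        if not line or line.isdigit() or TIMESTAMP.match(line):
            continue
        kept.append(line)
    return " ".join(kept)


def split_sentences(text):
    """Return the sentences in text, keeping their closing punctuation."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]
```

Abbreviations such as "e.g. this" would still be split wrongly by this rule, so a real splitter needs a few more special cases.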
Below is a bar chart of the data created with my Python DrawSVG module (with a manually-added tooltip). Mouseover a bar to see which subject it represents. I drew the graph with a black background to mimic the Khan video style, but I couldn't bring myself to use garish colours.
I was initially quite surprised to see that Developmental Math - the first bar - is so much shorter than the others, but I've looked at the videos and they are all very short, generally answering a single, simple maths question. The top three subjects are Linear Algebra, Chemistry and Biology, which makes sense as they are all relatively advanced topics with long videos. It's a bit odd that Organic Chemistry has noticeably fewer sentences and Arithmetic has a relatively high number of sentences per video. I should add some error bars so I can make better comparisons.
So sentences per video gives us a very rough measure of subject complexity, but sentence length is perhaps more informative. Below is a chart of the mean number of words per sentence for each subject. For this analysis words such as "we're" are counted as a single word, but hyphenated words are counted as two words. The order of subjects is the same as in the graph above.
This shows us that, although Arithmetic tends to have many sentences per video (~155), the sentences tend to be quite short (<10 words per sentence). Conversely, although Geometry has relatively few sentences per video (~115), the sentences tend to be quite long (~15 words per sentence). This gives us a bit more insight into the different subjects, but doesn't really tell us anything interesting.
There are a number of other figures we could look at, such as the number of words per subject, but I'm not sure how useful that is (if you want the answer, multiply the values from the previous two graphs). However, I have counted the total number of words in all subtitles and found 1,865,687, which will be useful for later analyses.
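The counting rule described above (contractions count as one word, hyphenated words as two) can be sketched with a single regular expression. This is an assumed implementation for illustration only; `WORD`, `words` and `mean_words_per_sentence` are hypothetical names.

```python
import re

# Runs of letters or digits, optionally joined by internal apostrophes.
# "we're" is one token; "three-dimensional" gives two, because "-" is not matched.
WORD = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*")


def words(sentence):
    """Return the word tokens in a sentence."""
    return WORD.findall(sentence)


def mean_words_per_sentence(sentences):
    """Mean number of word tokens per sentence (0.0 for an empty list)."""
    if not sentences:
        return 0.0
    return sum(len(words(s)) for s in sentences) / len(sentences)
```

Note that this pattern also breaks a decimal such as "3.5" into two tokens, which may or may not match how the figures in this post were counted.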
I've written a program that extracts all the individual words from the subtitle text I have and counts the frequency of each. By my count there are 1,798,565 words, which works out at an average speaking speed of ~150 words per minute. According to Wikipedia, the recommended rate of speech for audiobooks is 150-160 words per minute, so this seems reasonable.
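As a rough sketch of this step (not the actual program), word frequencies can be counted with `collections.Counter`, and the speaking speed follows from the word count and the roughly 200 hours of video mentioned earlier. The helper names and the tokenising pattern are assumptions.

```python
import re
from collections import Counter

# Lower-case tokens; internal apostrophes are kept so "let's" stays whole.
WORD = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*")


def word_frequencies(text):
    """Count how often each word occurs, ignoring case."""
    return Counter(WORD.findall(text.lower()))


def words_per_minute(total_words, hours):
    """Average speaking rate over the given number of hours."""
    return total_words / (hours * 60)


counts = word_frequencies("So this is the same thing as the other thing")
print(counts.most_common(3))

# Figures from this post: 1,798,565 words over about 200 hours.
print(round(words_per_minute(1_798_565, 200)))  # 150
```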
The most frequent words Sal uses on the Khan Academy are not particularly interesting; they are pretty much what you would expect from normal English speech. The top ten are: 'the', 'to', 'is', 'of', 'this', 'and', 'so', 'a', 'that' and 'we'. The top ten most common words in English are generally considered to be 'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have' and 'I' (note that these are lemmas, so 'be' includes all forms of the verb, e.g. 'is', 'are', 'were' etc.).
One interesting point I noticed in looking at the top ten most frequent words is that Sal uses the word 'we' more often than normal (10th rather than 27th). This makes sense to me as Sal's manner of speaking is generally very inclusive. So I thought I'd look at how frequently Sal uses each pronoun; here's a table of the results:
I was trying to think of an interesting set of words to consider and struck upon colours after watching yet another Khan video. This somewhat reflects Sal's choice of colour when writing (which gives me an idea for more analysis), although obviously he will mention colours for other reasons. The number of times each of the colours I checked (including two spellings of gray/grey) appears is shown below:
One of the questions I wanted to answer when I started this project was: What are the longest words Sal uses? And here's the answer.
When I first looked for the longest words I found that in several subjects, the longest word was [unintelligible], which was annoying (especially as two of the 'letters' are brackets, which I hadn't stripped out). I also found that the longest word in the Biology playlist was 'phosphoglyceraldehydees', which is a misspelling. After solving all those problems, I ran my search twice, once counting hyphenated words as single words. The table below shows the longest words for each playlist and the number of letters they have (the number in parentheses is for when hyphenated words are included).
As one might predict, the longest words are all chemical names, and including hyphens allows for longer chemical names (I'm not sure that sodium-potassium-chlorine should be hyphenated). Another point to note is that the most common long words are 'straightforward', 'counterclockwise' and 'three-dimensional'. There are a couple of interesting hyphenated phrases, the most interesting of which is 'straight-from-the-udder'; I must find out what that's about.
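The two searches can be sketched as follows. This is an illustrative version under my own assumptions rather than the code behind the table: it strips bracketed caption notes such as [unintelligible] first, and the function name `longest_word` and its `allow_hyphens` flag are hypothetical.

```python
import re

PLAIN = re.compile(r"[A-Za-z]+")                     # hyphens split words
HYPHENATED = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")  # hyphens join words
BRACKETED = re.compile(r"\[[^\]]*\]")                # e.g. "[unintelligible]"


def longest_word(text, allow_hyphens=False):
    """Return the longest word in text, or "" if there are no words.

    With allow_hyphens=True a hyphenated compound counts as one word,
    and its length includes the hyphens themselves.
    """
    text = BRACKETED.sub(" ", text)
    pattern = HYPHENATED if allow_hyphens else PLAIN
    return max(pattern.findall(text), key=len, default="")


sample = "This is [unintelligible] pretty straightforward in three-dimensional space"
print(longest_word(sample))                      # straightforward
print(longest_word(sample, allow_hyphens=True))  # three-dimensional
```

Because the patterns only match letters, a name containing digits and commas, such as a numbered chemical name, would be broken into pieces by this sketch.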
Of course, after this analysis I had to watch the video with the longest words. In doing so, I actually found a longer word: 5-ethoxy-1,2-dimethylcyclohexanamine. Sal actually says "This was probably one of the longest words we've used in naming", and it would have been except that in the subtitles, there is no hyphen between the ethoxy and the 1,2, perhaps because the word is split over two lines.
I've started to look at the frequency at which Sal uses words. Unsurprisingly, the word he uses most commonly is 'the', which makes up 4.6% of his words and seems pretty standard. To find out what is standard, I'm using a list of ~500,000 word counts from contemporary American English sources [1], which I got from http://www.wordfrequency.info. According to that list, the word 'the' should actually make up 5.6% of words, so it seems Sal under-uses it by quite a margin.
Before getting too much into an in-depth analysis of which words Sal uses, I thought it would be interesting to see which words he never uses. Clearly there will be many words he has never used in the ~1.8 million words in the subtitles, so I narrowed my search to words that you would expect, based on the "normal" word counts, to appear > 300 times in 1.8 million words.
The most frequent "missing words" are all apostrophe fragments, such as 's, 'll and 've, which don't show up in my counts because of how I've defined words (so maybe I should redefine them to be consistent). Below are some of the most frequent words that Sal never says in the videos in my analysis.
The most frequent real word that Sal never says in the subtitles I have is "American", which I admit I found quite surprising. However, I double-checked, and it seems to be true. He uses the words 'Americans' and 'German-American', but never plain 'American'. I actually find this quite pleasing, as the lessons are supposed to be for anyone in the world, and maths and the sciences should be the same everywhere. If the Finance and History lessons were included then Sal would undoubtedly use the word 'American'. Also in the top ten "missing words" is the word 'America'.
Many of the missing words are also highly context-specific, such as 'Bush', 'York', 'economic' and 'police'. However, one word that didn't follow this pattern was 'among', as it is a fairly everyday word. But I searched the captions, and indeed, Sal only ever uses the word 'amongst' (14 times as it happens), which is a valid alternative, though sometimes considered old-fashioned. This raises the possibility of identifying Sal-impersonators. Presumably it should be possible to generate a word-frequency fingerprint for everyone and see how likely it is that a given piece of text was written by him or her. I'm definitely curious to see what my fingerprint would look like.
Another word that I found curious was 'qwq', which I initially took as a mistake in the table as I've never seen the "word" before. But according to Urban Dictionary, qwq is analogous to lol. Thank goodness Sal doesn't use either. It does make me wonder where this list of contemporary American English comes from. I suspect it involves a lot of text from the web (which makes sense as it's easy to gather). Other slightly confusing words are 'san' (the Japanese name suffix maybe), and 'wo', which I guess is an exclamation.
[1] Davies, Mark. (2011) Word frequency data from the Corpus of Contemporary American English (COCA).
I've skipped looking at the most frequent words Sal uses for now, and jumped straight to looking at the most frequent combinations of words.
Bigrams are combinations of two words. The five most common bigrams Sal uses are (with counts in parentheses):
Before running my analysis, I had assumed all the most frequent bigrams would be common English word combinations, but actually, I think only "this is" and "of the" are common in most text (though I'd like to check this). The bigram "going to" is probably quite common in English, but I suspect not as common as here. This must be because Sal frequently introduces what he is about to do before doing it. The bigrams "equal to" and "is equal" are probably relatively rare in English outside of mathematical discussions.
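Counting n-grams is a small extension of counting words: slide a window of n words along the text and count each window. The sketch below is one common way to do it, not necessarily how the counts in this post were produced. `ngram_counts` is a hypothetical name, and it treats the whole token list as one stream, so n-grams may span sentence boundaries.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens, joined with spaces."""
    # zip over n staggered copies of the list gives each window of n tokens.
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)


tokens = "this is going to be equal to this is going to".split()
print(ngram_counts(tokens, 2).most_common(3))
print(ngram_counts(tokens, 3).most_common(3))
```

The same function covers bigrams, trigrams and longer fragments simply by changing `n`.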
When we extend our search to trigrams, we see the most common bigrams get extended. The five most frequent are:
You can see that the bigram "going to" is extended in both directions to "(is) going to (be)", while "this is" is extended to "(so) this is (the)". When we move to 4-grams we can see how the "is equal to" fits in.
More of the same, expanding the sentence fragments further.
You can see how these could be pieced together to form the fragment "is going to be equal to the same thing as". Other interesting 4-grams are "x is equal to" (1015), "the square root of" (691), "in the last video" (379) and "with respect to x" (309).
You can probably see where this is going now.
Clearly, the clause "this is going to be equal to" is very common. Interestingly, "is the same thing as" is also very common, and essentially the same thing. The fragment "let's see if we can" is very much part of Sal's inclusive style of talking. Other 5-grams include "both sides of this equation" (209), "so let's say I have" (130), "the limit as x approaches" (109) and "let's say I have a" (107).
To be continued...