Introduction

For this exploration of language models, I'm going to imagine there's a group of primitive cave people trying to describe their world. We will create a simple language they will use to describe their world, and at the same time develop and AI that learns this language. The language they use will be a stripped down version of English, that has a very limited vocabulary and simple grammar.

Some simple sentences about sheep

DALL·E generated image of cave people with some sheep

In the beginning, our cave people look out at the world and see some sheep. They describe them with two sentences. These are the only possible sentences they can say.

Sheep are herbivores
Sheep are slow

- Some cave people

Tokenisation

In order to create a language model that can generate sentences, we create a neural network that learns to predict the next word in a sentence. The first step is to split the sentences into tokens. For this simple model, a token is just a word. I've also added a special token to indicate the start or end of a sentence, which will be useful later. I'm using <BR> for this token. It doesn't matter what the token is as long as it doesn't appear in a normal sentence.

If we split our two sentences into tokens we get:

  • <BR>, sheep, are, herbivores, <BR>
  • <BR>, sheep, are, slow, <BR>

Neural network

Now we'll create a neural network that takes in one word (token) as an input, and returns its prediction of what word should follow it. The network has one input node for every possible token and one output node for every possible token. To start with, we'll create the simplest model which is where all the inputs connected directly to all the outputs.

<BR> <BR> are are herbivores herbivores sheep sheep slow slow

Initially, the weight of each connection is random. We then feed in examples of tokens from our two sentences and see how closely the result matches our expectation e.g. given "sheep", we expect the output "are". The weights are then updated using back-propagation. There are many good explanations for how this works online, such as 3Blue1Brown's Neural Networks series. I'm using pytorch which handles the learning algorithm.

Results

After 10 000 iterations, we can look at the new weights of the network. If we just show connections with a positive weight, the network looks like this.

<BR> <BR> are are herbivores herbivores sheep sheep slow slow

With so few tokens and connections, it's relatively easy to verify that the network has learnt our sentences. We can also start with the token <BR>, see what token the network predicts next, then feed that token back into the network and keep going until we hit another <BR> token.

We can use the transitions from input to output to generate a Markov chain which clearly shows the two sentences we can generate.

<BR> sheep are slow herbivores <BR>

Conclusion

We created a simple neural network model that consisted of a single matrix. It was able to learn how to generate two simple sentences about sheep. In the next article, we'll look at how to represent words with vectors.