Hello World

Introduction

For this exploration of language models, I'm going to imagine there's a group of primitive cave people trying to describe their world. We will create a simple language they will use to describe their world, and at the same time develop an AI that learns this language. The language they use will be a stripped-down version of English with a very limited vocabulary and simple grammar.

Some simple sentences about sheep

DALL·E generated image of cave people with some sheep

In the beginning, our cave people look out at the world and see some sheep. They describe them with two sentences. These are the only possible sentences they can say.

Sheep are herbivores
Sheep are slow
- Some cave people

What is a language model?

A language model is a way of modelling a language in code. It works by looking at a large amount of example text and learning the statistical properties of the language. Specifically, what it learns is, given a sequence of words, the probability of each word coming next. For example, given the sequence "Once upon a", it should predict that the word "time" comes next with a high probability.

When you give a chatbot like ChatGPT a prompt such as "Tell me a fairy tale", that sequence of words is wrapped into a larger prompt along the lines of:

You are a helpful, smart computer agent. You are asked "Tell me a fairy tale", and you respond with:

In reality the prompt is a lot longer and more sophisticated, but that's the general idea. The language model will then try to predict the next word, maybe going for "Once". After that, the entire prompt is fed back into the language model, now with "Once" added on the end, and it tries to predict the next word, maybe going with "upon". And so on.

So what we're aiming to build is a function that takes a string of text and returns what it thinks is the most likely next word (or a word picked at random using the probabilities it predicted for the next word). We're unlikely to reach the point where it can respond to a long prompt, but it should be able to generate sentences that describe the world we've created.
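To make the two options concrete, here is a minimal sketch of "most likely next word" versus "word picked at random using the probabilities". The probability table and the function names are made up for illustration; a real model would compute the probabilities rather than look them up.

```python
import random

# Hypothetical next-word probabilities for one context, invented for illustration.
NEXT_WORD_PROBS = {
    "once upon a": {"time": 0.90, "hill": 0.06, "mattress": 0.04},
}


def most_likely_next_word(text):
    """Return the single highest-probability next word for a known context."""
    probs = NEXT_WORD_PROBS[text.lower()]
    return max(probs, key=probs.get)


def sample_next_word(text, rng=random):
    """Pick a next word at random, weighted by the predicted probabilities."""
    probs = NEXT_WORD_PROBS[text.lower()]
    words = list(probs)
    weights = [probs[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


print(most_likely_next_word("Once upon a"))
print(sample_next_word("Once upon a"))
```

The first function always gives the same answer for the same input, while the second will occasionally pick a less likely word, which is what lets a model produce varied text.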

Tokenisation

In order to create a language model that can generate sentences, we create a neural network that learns to predict the next word in a sentence. The first step is to split the sentences into tokens. For this simple model, a token is just a word. I've also added a special token to indicate the start or end of a sentence, which will be useful later. I'm using for this token. It doesn't matter what the token is as long as it doesn't appear in a normal sentence.

If we split our two sentences into tokens we get:

, sheep, are, herbivores,
, sheep, are, slow,
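As a rough sketch, the tokenisation step might look like this in Python. The marker string "<END>" is a placeholder I've chosen for the example, not necessarily the token used in this article, and the function and variable names are likewise hypothetical.

```python
# "<END>" is a hypothetical start/end marker chosen for this example only.
BOUNDARY = "<END>"


def tokenise(sentence):
    """Lower-case the sentence, split on whitespace, and wrap it in boundary tokens."""
    return [BOUNDARY] + sentence.lower().split() + [BOUNDARY]


sentences = ["Sheep are herbivores", "Sheep are slow"]
token_lists = [tokenise(sentence) for sentence in sentences]

# Build the vocabulary in order of first appearance and give each token an index.
vocab = []
for tokens in token_lists:
    for token in tokens:
        if token not in vocab:
            vocab.append(token)
token_to_index = {token: index for index, token in enumerate(vocab)}

print(token_lists)
print(token_to_index)
```

With these two sentences the vocabulary has five entries: the boundary marker plus four words.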

Neural network

Now we'll create a neural network that takes one word (token) as an input, and returns its prediction of what word should follow it. The network has one input node for every possible token and one output node for every possible token. To start with, we'll create the simplest model, in which all the inputs are connected directly to all the outputs.

Initially, the weight of each connection is random. We then feed in examples of tokens from our two sentences and see how closely the result matches our expectation, e.g. given "sheep", we expect the output "are". The weights are then updated using back-propagation. There are many good explanations for how this works online, such as 3Blue1Brown's Neural Networks series. I'm using PyTorch, which handles the learning algorithm.
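To give a feel for what the training loop does, here is a plain-Python stand-in rather than the PyTorch code used for this article. It assumes a softmax output with a cross-entropy loss, which may differ from the actual setup, and the "<END>" marker, learning rate and epoch count are all values picked for the example. With a single layer, back-propagation reduces to one gradient step on the row of weights belonging to the input token.

```python
import math
import random

# "<END>" is a hypothetical start/end marker chosen for this example only.
VOCAB = ["<END>", "sheep", "are", "herbivores", "slow"]
SENTENCES = [
    ["<END>", "sheep", "are", "herbivores", "<END>"],
    ["<END>", "sheep", "are", "slow", "<END>"],
]
INDEX = {token: i for i, token in enumerate(VOCAB)}

# Training pairs: (index of current token, index of the token that follows it).
PAIRS = [(INDEX[a], INDEX[b]) for s in SENTENCES for a, b in zip(s, s[1:])]


def softmax(values):
    """Exponentiate and normalise; subtracting the max keeps exp() well-behaved."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def train(epochs=1000, learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    n = len(VOCAB)
    # weights[i][j] connects input token i to output token j; start random.
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    for _ in range(epochs):
        for source, target in PAIRS:
            probs = softmax(weights[source])
            # Cross-entropy gradient for each score is (probability - one-hot target).
            for j in range(n):
                gradient = probs[j] - (1.0 if j == target else 0.0)
                weights[source][j] -= learning_rate * gradient
    return weights


weights = train()


def predict(token):
    """Return the output token with the largest weight from the given input token."""
    row = weights[INDEX[token]]
    return VOCAB[max(range(len(row)), key=row.__getitem__)]


print(predict("sheep"))
```

Because "are" is followed by "herbivores" in one sentence and "slow" in the other, training pushes both of those outputs up together instead of settling on one.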

Results

After 10 000 iterations, the network looks like this, where blue connections have positive weights and yellow connections have negative weights. If you click an input node in the diagram below it will highlight it, its connections and the resulting output.

Currently it considers any output with a positive value a possible output. The magnitude of each value is fairly arbitrary. Later I'll describe how to convert the output values into probabilities for a token.

If we just show the positive weights we can see more clearly how each input is converted to an output.

If we start with the special start/end token, see what token the network predicts next, then feed that token back into the network and keep going until we hit the start/end token again, we can generate a Markov chain. This shows that given our prompt token, we will generate a sentence that starts with "sheep are". Then we have a choice of word: "herbivores" or "slow". And then the sentence simply ends.
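The walk through the chain can be sketched in a few lines of Python. The scores below are invented to mimic the shape of a trained network (they are not the actual weights from the diagram), "<END>" is a placeholder name for the start/end token, and the helper names are hypothetical. Any output with a positive score counts as a possible next token.

```python
# "<END>" is a hypothetical start/end marker; the scores are illustrative, not real results.
BOUNDARY = "<END>"
SCORES = {
    BOUNDARY: {BOUNDARY: -1.2, "sheep": 2.1, "are": -0.8, "herbivores": -0.9, "slow": -0.7},
    "sheep": {BOUNDARY: -0.9, "sheep": -1.1, "are": 2.3, "herbivores": -0.6, "slow": -0.8},
    "are": {BOUNDARY: -1.0, "sheep": -0.7, "are": -1.3, "herbivores": 1.4, "slow": 1.5},
    "herbivores": {BOUNDARY: 2.0, "sheep": -0.8, "are": -0.9, "herbivores": -1.1, "slow": -0.6},
    "slow": {BOUNDARY: 1.9, "sheep": -1.0, "are": -0.7, "herbivores": -0.8, "slow": -1.2},
}


def possible_next(token):
    """Every token whose output score is positive counts as a possible successor."""
    return [nxt for nxt, score in SCORES[token].items() if score > 0]


def all_sentences(max_words=10):
    """Follow every positive transition from the boundary token until it recurs."""
    finished = []
    stack = [[token] for token in possible_next(BOUNDARY)]
    while stack:
        words = stack.pop()
        for nxt in possible_next(words[-1]):
            if nxt == BOUNDARY:
                finished.append(" ".join(words))
            elif len(words) < max_words:
                stack.append(words + [nxt])
    return sorted(finished)


print(all_sentences())
```

The max_words limit is only a safety net for networks whose chains contain loops; with these scores every path ends after three words.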

Matrix view

So far, I've shown the networks as diagrams of nodes and connections, but how does this actually work in code?

The weights of the network are represented by a 5x5 matrix of numbers (one row and one column for each of the five tokens), where weight[i][j] is the weight of the connection from input node i to output node j.

To pass a token into the network, we give it a vector with a length equal to the number of possible tokens, where all the values are zero except for a single one marking the token we want. This is called one-hot encoding.

We then multiply this input vector by the matrix of weights to get an output vector. The output vector is effectively the row of weights corresponding to the connections from the input node. We then apply a softmax to this vector, which means applying the exponential function to each value to ensure they are positive, and then normalising the values so that they sum to one. We can then use these numbers as the probability that each token follows the given input token.
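Here is the same pipeline written out in plain Python so each step is visible: one-hot encode, multiply by the weight matrix, then softmax. The weight values are invented for illustration rather than taken from the trained network, "<END>" is a placeholder for the start/end token, and the function names are my own.

```python
import math

# "<END>" is a hypothetical start/end marker chosen for this example only.
VOCAB = ["<END>", "sheep", "are", "herbivores", "slow"]

# Illustrative 5x5 weights: WEIGHTS[i][j] links input token i to output token j.
WEIGHTS = [
    [-1.0, 2.0, -1.0, -1.0, -1.0],  # from <END>
    [-1.0, -1.0, 2.0, -1.0, -1.0],  # from sheep
    [-1.0, -1.0, -1.0, 1.5, 1.5],   # from are
    [2.0, -1.0, -1.0, -1.0, -1.0],  # from herbivores
    [2.0, -1.0, -1.0, -1.0, -1.0],  # from slow
]


def one_hot(token):
    """Vector of zeros with a single one at the token's position."""
    return [1.0 if candidate == token else 0.0 for candidate in VOCAB]


def vector_times_matrix(vector, matrix):
    """Row vector times matrix: output[j] = sum over i of vector[i] * matrix[i][j]."""
    columns = len(matrix[0])
    return [
        sum(vector[i] * matrix[i][j] for i in range(len(vector)))
        for j in range(columns)
    ]


def softmax(values):
    """Exponentiate each value, then divide by the total so the results sum to one."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_probabilities(token):
    scores = vector_times_matrix(one_hot(token), WEIGHTS)
    return dict(zip(VOCAB, softmax(scores)))


print(vector_times_matrix(one_hot("are"), WEIGHTS))
print(next_token_probabilities("are"))
```

The first print shows that multiplying by a one-hot vector simply picks out one row of the matrix, which is why real implementations usually skip the multiplication and index the row directly.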

Click on the input token on the diagram below to see this process in action.

Conclusion

We created a simple neural network model that consisted of a single matrix. It was able to learn how to generate two simple sentences about sheep. In the next article, we'll look at how to represent words with vectors (other than by one-hot encoding).