Expanding our Vocabulary
Exploring what happens as our language becomes larger.
Introduction
So far we've created a language that consists of two sentences and four different words, and built an AI that can generate those two sentences. In this article we'll expand the number of possible sentences and see how our model handles the larger (but still tiny) language.
Sheep eat grass
The first new sentence we'll add is "sheep eat grass", which gives us two new tokens: "eat" and "grass". The Markov chain of sentences now looks like this.
We keep our general network structure with a hidden layer of two nodes, but now our input and output layers have seven nodes. After training for 10,000 iterations we have a model that can reliably generate our three sentences.
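I haven't reproduced the training code in this post, so below is a minimal numpy sketch of the kind of network described: a one-hot input over the seven tokens, a two-node hidden layer (whose weights act as the token embedding), and a softmax output over the same seven tokens. The token names and the <BR> end-of-sentence marker follow the article; the learning rate, initialisation, and the linear hidden layer are my own simplifying assumptions.

```python
import numpy as np

# A one-hot input over the 7 tokens, a 2-node hidden layer (the embedding),
# and a softmax output over the same 7 tokens, trained to predict the next token.
rng = np.random.default_rng(0)
tokens = ["sheep", "are", "slow", "herbivores", "eat", "grass", "<BR>"]
idx = {t: i for i, t in enumerate(tokens)}
V, H = len(tokens), 2

# Training pairs (current token, next token) taken from the three sentences.
sentences = [["sheep", "are", "slow", "<BR>"],
             ["sheep", "are", "herbivores", "<BR>"],
             ["sheep", "eat", "grass", "<BR>"]]
pairs = [(idx[a], idx[b]) for s in sentences for a, b in zip(s, s[1:])]

W1 = rng.normal(0, 0.5, (V, H))   # input -> hidden; each row is a token's embedding
W2 = rng.normal(0, 0.5, (H, V))   # hidden -> output
lr = 0.1

for step in range(10_000):
    i, j = pairs[step % len(pairs)]
    h = W1[i]                              # hidden activation for the current token
    logits = h @ W2
    p = np.exp(logits - logits.max())
    p /= p.sum()                           # softmax over possible next tokens
    d_logits = p.copy()
    d_logits[j] -= 1.0                     # cross-entropy gradient
    W2 -= lr * np.outer(h, d_logits)
    W1[i] -= lr * (W2 @ d_logits)

print(np.round(W1, 2))                     # the learned 2-D token embedding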
If we look at the token embedding, we see that "grass" is near "slow" and "herbivores", which makes sense as all three words are followed by the sentence-end token.
The "eat" token is close to the origin, which allows it be distinct from the other tokens. Again, the exact position of the tokens is not important, just that they maximise their distance from one another.
Rabbits like running
The next sentence we'll add is:
rabbits like running
This introduces a whole new animal, plus three more tokens, bringing us to a total of ten.
Now our chain of possible sentences looks like this, with branching at every level (though not at every token).
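The same chain can also be written out as a mapping from each token to the tokens that can follow it in the training sentences. This is just my own transcription of the diagram, with a <START> marker added for illustration; it shows where the branching happens ("<START>", "sheep", and "are" each have more than one successor) and where it doesn't.

```python
# Each token maps to the tokens that can follow it in the four training sentences.
chain = {
    "<START>":    ["sheep", "rabbits"],
    "sheep":      ["are", "eat"],
    "rabbits":    ["like"],
    "are":        ["slow", "herbivores"],
    "eat":        ["grass"],
    "like":       ["running"],
    "slow":       ["<BR>"],
    "herbivores": ["<BR>"],
    "grass":      ["<BR>"],
    "running":    ["<BR>"],
}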
The token embedding now looks like this. Again, all sentence-ending tokens are clustered together, while the other tokens are spread out.
This time, when using the model to generate sentences I did get one incorrect sentence: "sheep eat". I think this is because the "eat" token is relatively close to the sentence-ending tokens, so there's a small probability of it predicting the end of the sentence after "eat".
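To make that failure mode concrete, here is a sketch of how sentences might be sampled from a trained model, reusing the W1, W2, tokens, and idx names from the earlier seven-token snippet (the ten-token model is the same shape, just with a bigger vocabulary). Starting from a fixed first word is a simplification of mine; the point is that if "eat" sits near the sentence-ending tokens, <BR> keeps a small probability after it and will occasionally be sampled, ending the sentence early.

```python
import numpy as np

# Sample a sentence token-by-token: pick the next token from the model's
# softmax output and stop when <BR> is drawn (or after a safety cap).
def generate(W1, W2, tokens, idx, start="sheep", rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    sentence = [start]
    while sentence[-1] != "<BR>" and len(sentence) < 10:
        h = W1[idx[sentence[-1]]]
        logits = h @ W2
        p = np.exp(logits - logits.max())
        p /= p.sum()                       # probability of each next token
        sentence.append(tokens[rng.choice(len(tokens), p=p)])
    return " ".join(t for t in sentence if t != "<BR>")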
Allowing the training to run longer would probably fix this issue, though it does suggest we are getting to the limits of what a hidden layer of two nodes can differentiate. Still, let's keep going.
The similarities between rabbits and sheep
So far, I've been creating sentences that might start the same, but then end differently. What about adding some sentences that start differently, but then become the same? For example, rabbits and sheep have some similarities:
rabbits are herbivores
rabbits eat grass
Our sentences now overlap a bit more, and there are tokens other than the final <BR> token that have multiple inputs. This means there should be more similarities between tokens for our model to uncover.
There is a slight issue with our sentences which I'll come back to later, but for now let's look at the token embedding. I trained the model four times so I could be sure any patterns I saw were repeated.
The first thing to notice is the cluster of four tokens, which I've not labelled to keep the graphs clearer. These correspond to the tokens "herbivores", "slow", "grass", and "running". As usual, they cluster together as tokens that are followed by the sentence ending.
The other pattern I noticed (and the reason I ran the training four times to confirm) is that "sheep" and "rabbits" are always quite close together. It makes sense that they would be, as both are the subjects of sentences and, more importantly, both can be followed by "are" or "eat".
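A quick way to check that this isn't just how the plots happen to look is to compare the distance between the "sheep" and "rabbits" rows of the embedding matrix with the distance between some other pair. This continues the naming from the earlier sketch (W1 rows as embeddings, idx as the token lookup) and is only an illustration of the check, not code from the article.

```python
import numpy as np

# Distance between two token embeddings (rows of W1).
def pair_distance(W1, idx, a, b):
    return np.linalg.norm(W1[idx[a]] - W1[idx[b]])

# If the pattern holds, something like
#   pair_distance(W1, idx, "sheep", "rabbits")
# should come out noticeably smaller than, say,
#   pair_distance(W1, idx, "sheep", "grass")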
It's all starting to go wrong
When I use this latest model to generate sentences, I sometimes get a sentence that isn't in the training examples; perhaps you can guess what it is. Later on we will want our model to generate novel sentences, and we could see this as making a generalisation. Unfortunately, it is an incorrect generalisation, or a hallucination.
The sentence this model generates is:
rabbits are slow
If you look at the chain of possible sentences above, you can see that if we are at the token "are", our possible predictions for the next token are "herbivores" and "slow". We should only pick "slow" if the token before "are" was "sheep", but our model has no knowledge of what came before its current token. This is a problem we will address in the next article.
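Counting next-token frequencies in the six training sentences makes the problem concrete (this is a small sketch of mine, not the model itself): conditioned only on "are", the best any model can do is a fixed mixture of "slow" and "herbivores", regardless of which animal started the sentence.

```python
from collections import Counter, defaultdict

sentences = [["sheep", "are", "slow"], ["sheep", "are", "herbivores"],
             ["sheep", "eat", "grass"], ["rabbits", "like", "running"],
             ["rabbits", "are", "herbivores"], ["rabbits", "eat", "grass"]]

# Count how often each token follows each other token.
follows = defaultdict(Counter)
for s in sentences:
    for a, b in zip(s, s[1:]):
        follows[a][b] += 1

print(dict(follows["are"]))   # {'slow': 1, 'herbivores': 2}
# Seeing only "are", the model has to give "slow" roughly a third of the
# probability, whether the sentence began with "sheep" or "rabbits".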
Making things worse
Since we've broken our model, let's make it worse by removing the rabbits and introducing a wolf. We will train on four sentences:
sheep are slow
sheep are herbivores
sheep eat grass
wolves eat sheep
Now if we train our model and use it to generate sentences we can generate our example sentences, but we also generate some novel (and incorrect) sentences:
- wolves eat sheep eat sheep
- sheep
- wolves eat grass
- wolves eat sheep are herbivores
- sheep eat sheep eat grass
There are two issues now. The first is the same as before: when we reach the "eat" token, we don't know what preceded it, so we can get sentences like "sheep eat sheep" and "wolves eat grass". Similarly, we can get the one-word sentence "sheep", because that token can be both the first and the last word of a sentence.
If we look at the language chain, we can see the second issue. Because "sheep" can now be the subject or the object of a sentence, we have a loop from "sheep" to "eat" and back again.
(Note: to keep the diagram simple, I've merged the "slow" and "herbivores" tokens as they are interchangeable.)
The loop in the language chain is what causes the model to generate sentences like "wolves eat sheep are herbivores". In theory we could generate infinitely long sentences, like "sheep eat sheep eat sheep eat ...". This is an essential feature of real languages, so we need to be able to handle it. Simply trying to predict the next token based on our current token is not going to work any more; we need to give our model more context.
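To see the loop in action, here is a small random walk over the wolf-language chain. The chain is my own transcription of the diagram above (with a <START> marker added for illustration); because "sheep" can follow "eat", the walk can re-enter the loop and in principle run forever, so the length cap below is only there to stop the example.

```python
import random

# Next-token options for the four wolf/sheep sentences.
chain = {
    "<START>":    ["sheep", "wolves"],
    "sheep":      ["are", "eat", "<BR>"],   # "sheep" can also end "wolves eat sheep"
    "wolves":     ["eat"],
    "are":        ["slow", "herbivores"],
    "eat":        ["grass", "sheep"],       # the loop: eat -> sheep -> eat -> ...
    "slow":       ["<BR>"],
    "herbivores": ["<BR>"],
    "grass":      ["<BR>"],
}

token, words = "<START>", []
while token != "<BR>" and len(words) < 12:   # cap the walk; the loop itself is unbounded
    token = random.choice(chain[token])
    if token != "<BR>":
        words.append(token)
print(" ".join(words))   # one possible walk, e.g. "wolves eat sheep eat grass"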