Going Deeper
Exploring the effect of adding a hidden layer to a neural network.
Introduction
In the previous article, we built a simple neural network that had a layer of input nodes and a layer of output nodes. Deep learning models are called deep because they have multiple layers of nodes. In this article, we'll make our model slightly deeper by adding a hidden layer of nodes between the input and output layers. We'll then explore what this hidden layer is doing.
Adding a hidden layer
To add a hidden layer, we need to create two matrices instead of one. The first matrix connects the input layer to the hidden layer, and the second matrix connects the hidden layer to the output layer. Each matrix is trained in the same way as before, using back-propagation to adjust the weights based on the error in the output.
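As a rough sketch of how the two matrices are trained together (using NumPy, with illustrative names, sizes, and a sigmoid activation, which may differ from the actual setup), one back-propagation step might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 5, 2                               # vocabulary size, hidden size (illustrative)
W1 = rng.normal(scale=0.1, size=(V, H))   # input  -> hidden
W2 = rng.normal(scale=0.1, size=(H, V))   # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, target, W1, W2, lr=0.5):
    """One back-propagation step; returns the updated matrices."""
    hidden = sigmoid(x @ W1)              # hidden-layer activations
    output = sigmoid(hidden @ W2)         # predicted next-token scores
    # Output-layer error, pushed back through both matrices
    d_out = (output - target) * output * (1 - output)
    d_hid = (d_out @ W2.T) * hidden * (1 - hidden)
    W2 = W2 - lr * np.outer(hidden, d_out)
    W1 = W1 - lr * np.outer(x, d_hid)
    return W1, W2

# Train on one (input, next-token) pair, e.g. 'sheep' -> 'are'
x = np.array([0.0, 0.0, 0.0, 1.0, 0.0])
target = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
for _ in range(1000):
    W1, W2 = train_step(x, target, W1, W2)
```

Both matrices receive updates from the same output error: the error is first used to adjust the hidden-to-output matrix, then propagated back through it to adjust the input-to-hidden matrix.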
This is what the network looks like after 100 000 steps of training. Blue indicates a positive weight, and red indicates a negative weight.
Network size
While the network might look bigger with the extra layer, it actually has fewer weights than before. The previous network consisted of a single 5 x 5 matrix, giving 25 values. This network has a 5 x 2 matrix and a 2 x 5 matrix, giving a total of 20 values. (In practice, this setup also has 7 bias values, but we'll ignore those for simplicity.)
This might seem like a small difference, but as we increase the number of words in our vocabulary, the difference becomes more significant. The first network has V x V weights to learn, where V is the vocabulary size. The second network has 4 x V weights to learn (a V x 2 matrix plus a 2 x V matrix). For a vocabulary of 10 000 words, that's 100 million weights versus 40 000 weights. This assumes we keep the hidden layer size fixed at 2, but even if we increase it to 100, that's still only 2 million weights to learn. We'll discuss the impact of changing the size of the hidden layer below.
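The arithmetic above can be checked with a small helper (the function name is mine, not from the article):

```python
def weight_counts(vocab_size, hidden_size):
    """Compare a direct V x V network with a V x H plus H x V network."""
    direct = vocab_size * vocab_size
    with_hidden = 2 * vocab_size * hidden_size
    return direct, with_hidden

# For a 10 000-word vocabulary:
print(weight_counts(10_000, 2))    # direct vs hidden layer of size 2
print(weight_counts(10_000, 100))  # direct vs hidden layer of size 100
```

The direct network's cost grows with the square of the vocabulary, while the hidden-layer network's cost grows only linearly in V.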
Embedding vectors
In the network above, the connections coming out of the input nodes are:
| Input | Connection 1 | Connection 2 |
|---|---|---|
| <BR> | blue | red |
| are | red | blue |
| herbivores | blue | blue |
| sheep | red | red |
| slow | blue | blue |
Another way to look at this is to write blue as 1 (meaning the node in the hidden layer is activated) and red as 0 (meaning it is not). We can then write a vector of hidden-node values for each input.
| Input | One-hot encoding vector | Embedding vector |
|---|---|---|
| <BR> | [1, 0, 0, 0, 0] | [1, 0] |
| are | [0, 1, 0, 0, 0] | [0, 1] |
| herbivores | [0, 0, 1, 0, 0] | [1, 1] |
| sheep | [0, 0, 0, 1, 0] | [0, 0] |
| slow | [0, 0, 0, 0, 1] | [1, 1] |
I've included the original one-hot encoding vectors that represent the input tokens. This shows how the initial 5-dimensional vector is compressed into a 2-dimensional vector. Compressing to a smaller number of dimensions means that as we add more intermediate matrices to our network, they can be smaller, which saves both memory and training time.
We're able to compress the original vectors because in one-hot encoding only one digit is a one, so most of the possible vectors are wasted. In the embedding vectors, we use all possible combinations of ones and zeros. Compression is also helped by the fact that 'herbivores' and 'slow' map to the same vector, [1, 1]: given the sentences we have, the two words are interchangeable, so the network has effectively made a generalisation.
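Because the input is one-hot, multiplying it by the input-to-hidden matrix simply selects one row, so each row of that matrix is a token's embedding vector. A sketch using the 1/0 vectors from the table above (the matrix values are idealised, not actual trained weights):

```python
import numpy as np

# Idealised input-to-hidden matrix: one row per token,
# rows matching the embedding vectors in the table above.
W1 = np.array([
    [1, 0],  # <BR>
    [0, 1],  # are
    [1, 1],  # herbivores
    [0, 0],  # sheep
    [1, 1],  # slow
])

one_hot = np.array([0, 0, 1, 0, 0])   # 'herbivores'
embedding = one_hot @ W1              # picks out row 2 of W1
print(embedding)
```

This is why the first matrix is often called an embedding matrix: looking up a token's embedding is just a matrix row selection.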
In fact, embedding is more powerful than I've described: the embedding matrix values are not limited to ones and zeros, so there is more space for tokens to occupy. If we plot the actual values on a 2D chart, we can see each token appears in a different quadrant, except 'herbivores' and 'slow', which appear in exactly the same place.
One thing to note is that the embedding vectors are arbitrary. If I run the training again, the tokens are likely to be embedded as different vectors. However, 'herbivores' and 'slow' will always share the same value, and the tokens will tend to spread out to maximise the distance between one another. We'll explore this idea more in the next article.
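To make the distance idea concrete, here is a sketch with made-up 2-D embedding values standing in for one hypothetical training run (the numbers are illustrative, not real trained weights):

```python
import numpy as np

# Hypothetical 2-D embeddings from one training run (values illustrative).
embeddings = {
    '<BR>':       np.array([ 2.1, -1.8]),
    'are':        np.array([-1.9,  2.0]),
    'herbivores': np.array([ 1.7,  2.2]),
    'sheep':      np.array([-2.0, -2.1]),
    'slow':       np.array([ 1.7,  2.2]),   # same point as 'herbivores'
}

def distance(a, b):
    """Euclidean distance between two tokens' embeddings."""
    return float(np.linalg.norm(embeddings[a] - embeddings[b]))

print(distance('herbivores', 'slow'))  # interchangeable words coincide
print(distance('sheep', 'are'))        # distinct roles stay far apart
```

Re-running training would move every point, but the pattern holds: interchangeable tokens land on the same spot, while the rest push apart.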
Conclusion
In this article we have shown how making the network "deeper" by adding another layer of nodes gives us fewer weights to learn. It also allows us to create a compressed representation of each token, which we can plot in 2D space. In the next article, we'll see how our network handles an expanded vocabulary.