Neural Networks and Language Models

Sequential Data and Neural Networks


Learning Objectives

  • You know of challenges related to using sequential data with neural networks.

Many types of data, such as text, are sequential, which leads to a need to train neural networks on sequential data.

With classic neural networks, sequential data could in principle be handled in the input layer by adding nodes. As an example, for learning to predict the next move from sequences of two rock paper scissors moves, we could have six nodes in the input layer (three per move, one for each possible move). Similarly, for learning to predict the next move from sequences of three rock paper scissors moves, we could have nine nodes in the input layer. And so on.
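To make the idea concrete, the following is a minimal sketch (not part of the course material) of encoding a fixed-length sequence of moves as input values; the move names and the one-hot encoding scheme are assumptions made for this illustration.

```python
# A minimal sketch of encoding a fixed-length move sequence as input nodes.
# The move vocabulary and the one-hot scheme are assumptions made for illustration.
MOVES = ["rock", "paper", "scissors"]

def one_hot(move):
    """Encode a single move as three values, one per possible move."""
    return [1.0 if move == m else 0.0 for m in MOVES]

def encode_sequence(moves):
    """Concatenate one-hot encodings: two moves -> 6 input values, three -> 9."""
    encoded = []
    for move in moves:
        encoded.extend(one_hot(move))
    return encoded

print(encode_sequence(["rock", "scissors"]))
# [1.0, 0.0, 0.0, 0.0, 0.0, 1.0]  -> six input values for a two-move sequence
```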

This approach would lead to the same issue that was present in the n-gram language model: capturing long-range dependencies can become infeasible, as the input layer only covers a fixed window of the sequence and enlarging the window quickly becomes impractical.
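As a rough back-of-the-envelope illustration (the numbers simply extend the rock paper scissors example above, and the exact data requirements are an assumption): widening the input to cover a dependency k moves back means 3 × k input nodes, while the number of distinct contexts the model may need to see grows as 3^k.

```python
# Illustrative only: the input size grows linearly, the context space exponentially.
for k in (2, 3, 5, 10, 20):
    print(f"window of {k:>2} moves: {3 * k:>3} input nodes, {3 ** k:>13,} possible contexts")
```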


A key advance on this front has been the introduction of recurrent neural networks (RNNs, 1980s). Like the neural networks discussed earlier, recurrent neural networks have layers, weights, biases, and activation functions. In addition, a recurrent neural network feeds the output of each node back as input to the same node, which provides feedback to the node and makes it possible to “remember” earlier inputs.

The feedback functionality can be viewed as a loop, where the output of the node is multiplied by a weight and then fed back to the node. The weight, like all weights, is learned during training. This allows the output of a node to influence its own input on the next step, which through training makes it possible to capture information across longer distances.
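To make the loop concrete, here is a minimal sketch of a single recurrent node in Python; the scalar hidden state, the random weights, and the tanh activation are choices made only for illustration, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-dimensional state for clarity; real networks use vectors and weight matrices.
w_in = rng.normal()        # weight for the current input
w_feedback = rng.normal()  # weight on the node's own previous output (the loop)
bias = 0.0

def rnn_step(x, previous_output):
    """The output depends on the current input and on the node's previous output."""
    return np.tanh(w_in * x + w_feedback * previous_output + bias)

# Unfolding the loop over an input sequence: one step per sequence element.
sequence = [0.5, -1.0, 0.25, 0.9]
output = 0.0  # initial "memory"
for x in sequence:
    output = rnn_step(x, output)
    print(f"input={x:+.2f} -> output={output:+.4f}")
```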

In practice, the feedback loop is implemented by adding new layers for each step in the input sequence (unfolding the loop). The longer the input sequence, the more layers the unfolding produces, and the more weights the error signal has to pass through during backpropagation, when the difference between the actual output of the network and the expected output is propagated back through the network. Since the signal tends to shrink at every step it passes through, the weights in the early parts of the unfolded network may receive very little information about the needed changes. This problem is called the vanishing gradient problem.
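The shrinking of the error signal can be illustrated with made-up numbers: roughly speaking, each unfolded step multiplies the backpropagated signal by the feedback weight times the derivative of the activation function at that step, so the signal reaching the early steps is a long product of such factors.

```python
# Illustrative only: each unfolded step multiplies the backpropagated signal
# by (feedback weight) * (derivative of the activation at that step).
w_feedback = 0.8
tanh_derivative = 0.6   # tanh'(z) is at most 1 and often well below it

gradient = 1.0
for step in range(1, 31):
    gradient *= w_feedback * tanh_derivative
    if step % 10 == 0:
        print(f"after {step} unfolded steps the signal is about {gradient:.2e}")
# The early layers of the unfolded network receive almost no error information.
```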

The choice of activation function also has an effect. The ReLU activation function, as an example, is less susceptible to the problem than some of the other activation functions: ReLU maps all negative inputs to zero, and for positive inputs its derivative is exactly one, so it does not scale the error signal down the way saturating activation functions do.
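The contrast can be sketched with the derivative values alone (the chain length below is made up): the derivative of the sigmoid is at most 0.25, whereas the derivative of ReLU is exactly 1 for positive inputs, so a chain of active ReLU units passes the error signal through without shrinking it.

```python
steps = 20
sigmoid_derivative_max = 0.25  # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) <= 0.25
relu_derivative_active = 1.0   # ReLU'(z) = 1 when z > 0 (and 0 when z < 0)

print("sigmoid chain:", sigmoid_derivative_max ** steps)  # about 9.1e-13
print("ReLU chain:   ", relu_derivative_active ** steps)  # 1.0
```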


The vanishing gradient problem was in part addressed by the introduction of long short-term memory (LSTM, 1997) networks, which are a type of RNN better at learning long-term dependencies. An LSTM features a gating mechanism in each of its nodes, which allows the network to decide what to “forget” and what to keep, allowing more efficient training of neural networks for sequential data. Even with LSTMs, however, training neural networks on long sequences remains a challenge.
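As a rough sketch of the gating idea, the step below follows the standard LSTM formulation, with small dimensions and random weights chosen only for illustration; it is not the exact variant used in any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix and bias per gate, plus one for the candidate cell state.
W = {name: rng.normal(scale=0.1, size=(hidden_size, input_size + hidden_size))
     for name in ("forget", "input", "output", "candidate")}
b = {name: np.zeros(hidden_size) for name in W}

def lstm_step(x, h_prev, c_prev):
    """One LSTM step: the gates decide what to forget, what to add, and what to output."""
    z = np.concatenate([x, h_prev])
    f = sigmoid(W["forget"] @ z + b["forget"])        # how much of the cell state to keep
    i = sigmoid(W["input"] @ z + b["input"])          # how much new information to let in
    o = sigmoid(W["output"] @ z + b["output"])        # how much of the cell state to output
    c_tilde = np.tanh(W["candidate"] @ z + b["candidate"])
    c = f * c_prev + i * c_tilde                      # updated cell state ("memory")
    h = o * np.tanh(c)                                # output of the node
    return h, c

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):            # a sequence of five inputs
    h, c = lstm_step(x, h, c)
print(h)
```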
