Neural Networks for Language

Sequential Models (RNNs, LSTMs)


Learning Objectives

  • You understand why sequential data poses unique challenges for neural networks.
  • You know what recurrent neural networks (RNNs) and long short-term memory (LSTM) networks are, at a high level.
  • You understand the vanishing gradient problem and why LSTMs help address it.

Why sequential data is challenging

Many types of data are fundamentally sequential — they consist of ordered elements where position and order carry crucial meaning. Examples include text (sentences and documents), speech (sequences of sounds), music (notes over time), and time series data like stock prices or weather measurements. In these cases, order is essential to interpretation:

  • “The cat chased the dog” means something completely different from “The dog chased the cat” — reversing the order changes who is doing the chasing and who is being chased
  • Predicting tomorrow’s stock price requires understanding trends over recent days or weeks, not just today’s isolated value
  • Understanding spoken language requires processing sounds in the order they occur — hearing phonemes out of order would make speech incomprehensible

A basic feed-forward neural network (the kind we have looked at so far) has no inherent mechanism for handling sequential structure. Each input is processed independently, without regard to position or order. We might attempt to work around this by adding more input nodes to accommodate multiple time steps:

  • For 2 rock-paper-scissors moves: 6 input nodes (2 moves × 3 options each)
  • For 3 moves: 9 input nodes
  • For 10 moves: 30 input nodes

This approach quickly becomes impractical and faces the same fundamental issues as n-gram models, including only being able to handle fixed-length sequences, as the sketch below illustrates.
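
To make the growth of the input layer concrete, here is a minimal Python sketch of this fixed-window workaround. The one-hot representation and the helper names (one_hot, encode_window) are illustrative assumptions introduced here, not details from the chapter's earlier example.

```python
# A minimal sketch (illustrative, not from the text) of the fixed-window workaround:
# each past move is one-hot encoded, so the input size grows linearly
# with the number of moves we want the network to see.

MOVES = ["rock", "paper", "scissors"]

def one_hot(move: str) -> list[int]:
    """Encode a single move as three input values."""
    return [1 if move == m else 0 for m in MOVES]

def encode_window(moves: list[str]) -> list[int]:
    """Concatenate one-hot vectors for a fixed-length window of moves."""
    vector = []
    for move in moves:
        vector.extend(one_hot(move))
    return vector

print(len(encode_window(["rock", "paper"])))             # 6 input nodes for 2 moves
print(len(encode_window(["rock", "paper", "scissors"])))  # 9 input nodes for 3 moves
```

Each additional move adds three more input nodes, and a history longer than the chosen window cannot be fed to the network at all without redesigning and retraining it with a larger input layer.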


Recurrent neural networks (RNNs)

Recurrent neural networks (RNNs), developed in the 1980s, introduced a different approach to handling sequential data. They resemble ordinary neural networks but with one crucial architectural difference: each node has a recurrent connection that feeds its output back to itself at the next time step.

This creates a form of memory, allowing information to persist across time steps. Effectively, the network maintains a hidden state that summarizes what it has seen so far in the sequence.

Think of reading a story word by word. As you encounter each new word, you don’t process it in isolation — you carry forward your evolving understanding of what came before. An RNN works similarly: it processes one element of the sequence at a time while maintaining a “hidden state” that summarizes what it has observed so far.

How RNNs process sequences

When processing the sentence “the cat sat,” an RNN operates as follows:

  1. Process the word “the” → produce hidden state h₁ (encoding information about seeing “the”)
  2. Process the word “cat” along with h₁ → produce hidden state h₂ (encoding information about “the cat”)
  3. Process the word “sat” along with h₂ → produce hidden state h₃ (encoding information about “the cat sat”)

Each hidden state serves as both an output and an input for the next time step, creating a chain of processing where earlier information influences later processing.
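
To make this recurrence concrete, here is a minimal NumPy sketch of the forward pass. The weight names (W_xh, W_hh), the layer sizes, the tanh activation, and the random dummy word vectors are illustrative assumptions rather than details from this chapter; the key point is that the same weights are reused at every step and the hidden state is carried forward.

```python
import numpy as np

# Illustrative sizes; a real model would learn these weights during training.
input_size, hidden_size = 4, 3
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden (recurrent) weights
b = np.zeros(hidden_size)

def rnn_forward(inputs):
    """Process a sequence one element at a time, carrying a hidden state forward."""
    h = np.zeros(hidden_size)          # h0: empty memory before the sequence starts
    states = []
    for x in inputs:                   # the same weights are applied at every time step
        h = np.tanh(W_xh @ x + W_hh @ h + b)
        states.append(h)
    return states                      # h1, h2, h3, ... one hidden state per time step

# Three dummy word vectors standing in for "the", "cat", "sat".
sequence = [rng.normal(size=input_size) for _ in range(3)]
h1, h2, h3 = rnn_forward(sequence)
```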

To train an RNN, we conceptually “unfold” it through time into multiple copies — one for each time step in the sequence. The same weights are shared across all time steps (the network applies the same transformation at each step), and we can apply backpropagation through this unfolded network, a process called “backpropagation through time.”


The vanishing gradient problem

RNNs introduced an important capability for handling sequential data, but they suffer from a limitation called the vanishing gradient problem.

During training, error signals must backpropagate through many time steps to update the network’s weights. As these signals flow backward through time, they are multiplied repeatedly by weight matrices and by derivatives of activation functions. When those multiplying factors are less than 1 (as is commonly the case), the gradients shrink exponentially with each multiplication:

  • After 10 time steps, a gradient might diminish to only 10% of its original magnitude
  • After 20-30 time steps, it may effectively vanish to near-zero values
  • After 50+ time steps, it becomes numerically insignificant

As a consequence, information from early parts of a sequence does not produce meaningful learning signals for updating the weights. If understanding a sentence correctly requires remembering a word from 20 or 30 positions earlier, a standard RNN struggles to learn this connection, because the gradient signal from the error has vanished by the time it reaches the relevant earlier time step.
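
A quick back-of-the-envelope calculation shows how fast this happens. The per-step shrink factor of 0.8 below is an assumed value chosen purely for illustration; in a real network it depends on the weights and the activation-function derivatives at each step.

```python
# Rough numeric illustration of the vanishing gradient problem:
# if the backward signal is scaled by a factor below 1 at every time step,
# it shrinks exponentially with the number of steps.

shrink_per_step = 0.8  # assumed value; in practice it varies with weights and activations

for steps in (10, 30, 50):
    remaining = shrink_per_step ** steps
    print(f"after {steps:2d} steps: {remaining:.6f} of the original gradient")

# after 10 steps: 0.107374 of the original gradient
# after 30 steps: 0.001238 of the original gradient
# after 50 steps: 0.000014 of the original gradient
```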

This limitation restricts what RNNs can learn, particularly for tasks requiring long-term dependencies like understanding paragraph-level context or translating complex sentences.


Long short-term memory (LSTM) networks

To address the vanishing gradient problem, researchers developed Long Short-Term Memory (LSTM) networks in 1997.

LSTMs are specialized RNNs with a more sophisticated internal structure designed specifically to maintain information over long sequences. Instead of a simple recurrent connection passing a hidden state forward, each LSTM unit contains:

  • A memory cell that stores information across many time steps without transformation, providing a path for information to flow with minimal degradation.

  • Three learned gates that control information flow:

    • Forget gate: Decides what information from the memory cell should be discarded (which memories to let fade)
    • Input gate: Decides what new information should be stored in the memory cell (which new memories to form)
    • Output gate: Decides what information from the memory cell should be exposed as output (which memories to use right now)

These gates are themselves small neural networks (typically single layers with sigmoid activations producing values between 0 and 1) that learn appropriate behavior during training. A gate value near 0 means “block this information,” while a value near 1 means “pass this information through.”
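
The following NumPy sketch shows one LSTM time step built from these pieces. The gating equations follow the standard formulation, but the sizes, the random initialization, and the single-example setup are assumptions made for illustration; this is a sketch of the idea, not a complete trainable implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

input_size, hidden_size = 4, 3
rng = np.random.default_rng(0)

# One weight matrix and bias per gate, plus one for the candidate memory.
# Each acts on the previous hidden state concatenated with the current input.
def make_params():
    return rng.normal(size=(hidden_size, hidden_size + input_size)), np.zeros(hidden_size)

W_f, b_f = make_params()  # forget gate
W_i, b_i = make_params()  # input gate
W_o, b_o = make_params()  # output gate
W_c, b_c = make_params()  # candidate memory

def lstm_step(x, h_prev, c_prev):
    """One LSTM time step: gates decide what to forget, what to store, what to output."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z + b_f)          # near 0: let old memory fade, near 1: keep it
    i = sigmoid(W_i @ z + b_i)          # how much of the new candidate to store
    o = sigmoid(W_o @ z + b_o)          # how much of the memory to expose as output
    c_candidate = np.tanh(W_c @ z + b_c)
    c = f * c_prev + i * c_candidate    # memory cell: a mostly additive update
    h = o * np.tanh(c)                  # hidden state passed to the next time step
    return h, c

# One step for a single dummy input vector.
x = rng.normal(size=input_size)
h0, c0 = np.zeros(hidden_size), np.zeros(hidden_size)
h1, c1 = lstm_step(x, h0, c0)
```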

Why LSTMs work better

The memory cell provides a path for gradients to flow backward through time with minimal transformation. Information can pass through the memory cell across many time steps without being repeatedly multiplied by small weight values.

The gates learn when to preserve information (keep the forget gate open), when to incorporate new information (open the input gate), and when to use stored information (open the output gate). This learned control allows the network to selectively maintain relevant information while discarding irrelevant details.

This architecture allows LSTMs to retain important information for dozens or even hundreds of time steps, making them more effective for tasks requiring long-term dependencies than basic RNNs.
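
For readers comfortable with a little notation, the standard memory-cell update can be written as below, where f_t and i_t are the forget- and input-gate values and c̃_t is the candidate new memory (these symbols are an assumption here, since the chapter describes the gates only in words).

```latex
% Standard LSTM memory-cell update (\odot is element-wise multiplication):
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
% Direct one-step gradient along the memory-cell path:
\frac{\partial c_t}{\partial c_{t-1}} = \mathrm{diag}(f_t)
```

Along the direct path through the memory cell, the backward signal is scaled by the learned forget-gate values rather than repeatedly by the same weight matrix, and the network can learn to keep those values close to 1 whenever information needs to be preserved.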


Limitations that remain

LSTMs represented a major advance and dominated sequence modeling from the late 1990s through the mid-2010s, enabling many applications that standard RNNs could not handle. However, they still face significant limitations. LSTMs process sequences one step at a time, in order, which makes training computationally slow: the hidden state at step 100 depends on step 99, which depends on step 98, and so on. This sequential dependency cannot be eliminated.

Because each step depends on the previous step’s hidden state, we cannot process multiple time steps simultaneously during training. This contrasts with feed-forward networks, where all computations for an input can occur in parallel (though, as we saw above, feeding a whole sequence to a feed-forward network requires an input layer large enough to hold it).

Furthermore, the gates and memory cell make each LSTM unit more complex than a standard RNN unit, requiring more computation per time step, and even LSTMs still struggle with sequences spanning hundreds or thousands of time steps.

These remaining limitations motivated development of attention mechanisms and transformer architectures, which we’ll explore in subsequent chapters.
