Neural Networks for Language

Transformers


Learning Objectives

  • You know what the transformer architecture is and its key components.
  • You understand how transformers build on self-attention mechanisms.
  • You know why transformers replaced RNNs and LSTMs in most modern NLP tasks.

From sequential processing to pure attention

Earlier chapters showed that RNNs and LSTMs process sequences one word at a time:

  • They maintain hidden states that carry information forward through the sequence
  • This approach works reasonably well for shorter texts but becomes slow and inefficient for long sequences
  • Important information can degrade as it passes through many sequential steps
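
As a reminder of what processing a sequence one word at a time looks like, here is a bare-bones recurrent loop in NumPy; the weight names and toy dimensions are illustrative, not tied to any particular model:

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """One word at a time: step t cannot start before step t-1 finishes."""
    h = np.zeros(W_h.shape[0])              # hidden state carried through the sequence
    for x in inputs:                        # strictly sequential loop
        h = np.tanh(x @ W_x + h @ W_h + b)  # new state mixes the input and the old state
    return h                                # final state summarizes the whole sequence

# Toy usage: a "sentence" of 100 word vectors, each of dimension 16
rng = np.random.default_rng(0)
inputs = rng.normal(size=(100, 16))
h = rnn_forward(inputs, rng.normal(size=(16, 32)), rng.normal(size=(32, 32)) * 0.1, np.zeros(32))
print(h.shape)  # (32,)
```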

Self-attention addressed these problems by allowing each word to directly attend to all other words in the sequence. However, self-attention alone doesn’t constitute a complete model — we needed a full architecture built around this mechanism.
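
To make the mechanism concrete before looking at the full architecture, here is a minimal NumPy sketch of scaled dot-product self-attention, the variant used in transformers. The weight matrices W_q, W_k, W_v and the toy dimensions are illustrative assumptions, not taken from any real model:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Minimal scaled dot-product self-attention over a sequence X.

    X: (seq_len, d_model) matrix of word representations.
    Each position's output is a weighted mix of *all* positions,
    so information never has to pass through intermediate steps.
    """
    Q = X @ W_q                          # queries
    K = X @ W_k                          # keys
    V = X @ W_v                          # values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every position to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V                   # each row attends to the whole sequence

# Toy example: 4 "words", model dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q = rng.normal(size=(8, 8))
W_k = rng.normal(size=(8, 8))
W_v = rng.normal(size=(8, 8))
print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 8)
```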


The transformer architecture

In 2017, researchers at Google introduced the Transformer in their paper "Attention Is All You Need".

The title reveals the central insight: instead of relying on recurrence (RNNs/LSTMs) to process sequences, transformers rely on attention to model relationships between positions in sequences.


Core components

Transformers incorporate three main architectural innovations:

  • Self-attention layers allow each position to attend to all positions across the entire sequence, regardless of distance. This creates direct connections between any pair of positions without information needing to flow through intermediate states.

  • Stacked layers repeat the attention and processing multiple times. Each layer refines the representations further, learning increasingly abstract patterns and relationships. Modern transformers typically stack 12, 24, or even 96+ layers, with each layer building on the representations from the previous layer.

  • Parallel processing enables processing all positions simultaneously rather than sequentially. Unlike RNNs that must wait for position N-1 before processing position N, transformers can process all positions at once. This makes training dramatically faster on modern parallel computing hardware like GPUs.

The architecture also includes position-wise feed-forward layers (which process each position's representation independently through the same neural network), layer normalization, and residual connections. These components help training stability and model performance, but the core innovation is stacking self-attention to build increasingly sophisticated representations, as the sketch below illustrates.
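
Putting these pieces together, a single transformer layer can be sketched roughly as follows. This is a simplified, single-head version that reuses the self_attention function from above; real implementations add multiple attention heads, trained weights, dropout, and other details:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's representation independently
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward network: the same weights applied at every position
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def transformer_layer(X, p):
    # 1. Self-attention with a residual connection and layer normalization
    attn_out = self_attention(X, p["W_q"], p["W_k"], p["W_v"])
    X = layer_norm(X + attn_out)
    # 2. Position-wise feed-forward, again with residual connection + layer norm
    ffn_out = feed_forward(X, p["W1"], p["b1"], p["W2"], p["b2"])
    return layer_norm(X + ffn_out)

def transformer_encoder(X, layers):
    # Stacking layers: each layer refines the previous layer's representations
    for p in layers:
        X = transformer_layer(X, p)
    return X

# Toy usage: 2 stacked layers, sequence of 4 positions, model dimension 8
rng = np.random.default_rng(1)
def rand_params():
    return {
        "W_q": rng.normal(size=(8, 8)), "W_k": rng.normal(size=(8, 8)),
        "W_v": rng.normal(size=(8, 8)),
        "W1": rng.normal(size=(8, 32)), "b1": np.zeros(32),
        "W2": rng.normal(size=(32, 8)), "b2": np.zeros(8),
    }
print(transformer_encoder(rng.normal(size=(4, 8)), [rand_params(), rand_params()]).shape)  # (4, 8)
```

Note that nothing in the loop over positions is sequential: every row of X is transformed at once with matrix operations, which is exactly what makes the architecture friendly to GPUs.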


Two components: encoder and decoder

The original transformer architecture had two main components serving different purposes:

  • Encoder processes the input sequence and builds contextualized representations
  • Decoder generates output sequences one position at a time

This encoder-decoder design proved powerful for sequence-to-sequence tasks like machine translation, where you need to both understand the input and generate appropriate output.
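
In rough pseudocode, a sequence-to-sequence loop with this design looks something like the sketch below. The encode and decode_step functions and the token ids are hypothetical placeholders standing in for whatever a trained model would provide:

```python
import numpy as np

def translate(source_tokens, encode, decode_step, bos_id=1, eos_id=2, max_len=50):
    """Sketch of encoder-decoder inference with greedy decoding.

    encode and decode_step are placeholders for a trained model's parts:
      encode(source_tokens)         -> contextual representations of the input
      decode_step(memory, outputs)  -> a score for every vocabulary item at the next position
    """
    memory = encode(source_tokens)       # encoder: read the whole input once
    outputs = [bos_id]                   # decoder: start from a beginning-of-sequence token
    for _ in range(max_len):
        scores = decode_step(memory, outputs)   # condition on the input and the output so far
        next_token = int(np.argmax(scores))     # greedily pick the most likely next token
        if next_token == eos_id:                # stop at end-of-sequence
            break
        outputs.append(next_token)              # feed it back in for the next step
    return outputs[1:]                   # drop the start token
```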

However, modern applications typically use just one of the components, depending on the task:

  • Encoder-only models excel at understanding and analysis tasks: text classification, question answering, information extraction, and sentiment analysis
  • Decoder-only models excel at generation tasks: text completion, dialogue, creative writing, and code generation

Most contemporary large language models like GPT are decoder-only.
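
The practical difference between the two shows up in the attention pattern: encoder-style models let each position attend in both directions, while decoder-style models add a causal mask so each position can only attend to earlier positions. A minimal NumPy illustration of such a mask (it would be added to the attention scores before the softmax):

```python
import numpy as np

def causal_mask(seq_len):
    # Lower-triangular mask: position i may attend to positions 0..i only
    allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    return np.where(allowed, 0.0, -np.inf)  # -inf scores become weight 0 after softmax

print(causal_mask(4))
# [[  0. -inf -inf -inf]
#  [  0.   0. -inf -inf]
#  [  0.   0.   0. -inf]
#  [  0.   0.   0.   0.]]
```

An encoder-only model simply omits this mask, so every position sees the whole sequence; a decoder-only model applies it in every layer, which is what lets it generate text left to right.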


The foundation for LLMs

This architectural foundation enabled the development of large language models that have transformed natural language processing:

  • BERT (2018) with 340 million parameters demonstrated the power of large-scale pre-training on masked language modeling, achieving state-of-the-art results across many understanding tasks.

  • GPT-2 (2019) with 1.5 billion parameters showed impressive text generation capabilities and early signs of zero-shot task transfer, performing tasks it was never explicitly fine-tuned for.

  • GPT-3 (2020) with 175 billion parameters exhibited emergent capabilities like few-shot and zero-shot learning, where the model could perform many tasks without task-specific training.

  • Modern models continue scaling to hundreds of billions or even trillions of parameters, with capabilities continuing to expand.

In the next part of this course, we’ll explore how these models are trained at scale, what capabilities emerge from that training, the techniques used to align them with human intentions, and the recent trends in the area.
