Neural Networks for Language

Attention Mechanisms


Learning Objectives

  • You understand why attention mechanisms were needed beyond embeddings, RNNs, and LSTMs.
  • You know the basic idea of attention: focusing on the most relevant parts of a sequence.
  • You understand what self-attention is and how it contextualizes word meanings.

The remaining problem

So far in this part, we’ve learned that:

  • Embeddings convert words into numerical vectors that capture semantic meaning and relationships
  • RNNs and LSTMs process sequences step by step, maintaining memory of past information through hidden states

Together, these innovations represented substantial progress toward effective language modeling. However, even LSTMs face limitations. They process information sequentially, which means that information from early in the sequence must pass through many intermediate steps to influence later processing; this sequential computation is costly, and the information can degrade or be lost along the way.

This limitation suggested the need for a fundamentally different approach — one where each word could directly connect to any other relevant word, regardless of distance in the sequence.


The core idea of attention

Consider how humans read and understand sentences. You don’t treat every word with equal importance or process them in a purely linear fashion. Instead, you focus more attention on certain words to extract meaning, and you automatically connect related words even when they appear far apart.

Take this sentence:

The book that the student borrowed from the library was fascinating.

When you reach “was fascinating,” you immediately understand that it describes the book, not the library or the student. You mentally connect these words despite the seven intervening words. This selective focusing — paying more attention to relevant words while largely ignoring irrelevant ones — is what attention mechanisms provide to neural networks.

The attention mechanism allows the model to learn which words are most relevant for understanding each word in context, and to weight their contributions accordingly.


How attention actually works

The attention mechanism uses three learned transformations of each word’s embedding:

  • Query: Represents what this word is “looking for” in other words — what information it needs to understand itself in context

  • Key: Represents what this word “offers” to other words — what information it can provide to help understand them

  • Value: Represents the actual information this word carries that will be passed along

To compute attention for a given word:

  1. Compute the query vector for that word
  2. Compare this query with the key vectors of all words using dot products (measuring similarity)
  3. Apply a softmax function to convert these similarity scores into weights that sum to 1.0
  4. Use these weights to compute a weighted sum of all the value vectors

The output is a new representation that emphasizes information from words the query deemed relevant. All of this happens through matrix operations that modern hardware (especially GPUs) can compute very efficiently in parallel across all words simultaneously.
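
As a minimal sketch, the four steps above can be written in a few lines of NumPy, using small random matrices as stand-ins for the learned query, key, and value transformations. The function and variable names here are purely illustrative, not taken from any library, and real implementations typically also scale the dot products by the square root of the key dimension before the softmax:

import numpy as np

def attention_for_word(embeddings, W_q, W_k, W_v, position):
    # Step 1: the query vector for the chosen word
    q = embeddings[position] @ W_q
    # Every word offers a key and carries a value
    K = embeddings @ W_k
    V = embeddings @ W_v
    # Step 2: dot-product similarity between the query and all keys
    scores = K @ q
    # Step 3: softmax turns the scores into weights that sum to 1.0
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    # Step 4: weighted sum of the value vectors
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8                    # 6 toy "words", 8-dimensional embeddings
embeddings = rng.normal(size=(seq_len, d_model))
W_q = rng.normal(size=(d_model, d_model))  # stand-ins for learned weight matrices
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

output = attention_for_word(embeddings, W_q, W_k, W_v, position=0)
print(output.shape)   # (8,): a new, context-aware vector for word 0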

Intuitively, you could think of this as each word “asking questions” (via its query) of all the other words, listening more closely to the ones that provide the most relevant answers (through their keys), and then updating its own understanding by blending their values in proportion to how useful they are.


Self-attention: Applying attention everywhere

If we compute attention for every word in the sequence, we get self-attention — the sequence attends to itself:

  • Each word compares itself against all words (including itself) to determine which are relevant
  • Each word updates its representation based on the weighted combination of relevant words

Crucially, every word can attend to any other word with equal computational ease, regardless of distance in the sequence. There’s no information degradation over long sequences, as occurs with RNNs, where information must flow through many intermediate hidden states. The attention mechanism provides direct connections between any pair of positions.

As a result, every word’s representation becomes contextualized — it reflects not just the word itself but also its relationship to surrounding words.
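
Continuing the sketch above (and reusing its toy embeddings and weight matrices), self-attention is the same computation carried out for every position at once, as a handful of matrix operations; again, the names are illustrative rather than drawn from a specific library:

def self_attention(embeddings, W_q, W_k, W_v):
    Q = embeddings @ W_q              # one query per word
    K = embeddings @ W_k              # one key per word
    V = embeddings @ W_v              # one value per word
    scores = Q @ K.T                  # every query compared with every key
    # Row-wise softmax: each word's attention weights sum to 1.0
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                # one contextualized vector per word

contextualized = self_attention(embeddings, W_q, W_k, W_v)
print(contextualized.shape)           # (6, 8): every word updated in parallel

The only difference from the single-word version is that all queries are stacked into a matrix, so the entire sequence is contextualized in one pass.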


Why attention was key

The key result of the attention mechanism was that it enabled dynamic, contextualized representations: a word’s representation is updated based on its surrounding context, so each occurrence of a word receives its own context-specific vector. For example, the word bank might refer to the bank of a river in one sentence and to a bank that stores money in another, and attention gives each occurrence a representation that reflects the difference.

This contextual sensitivity allows models to capture the nuanced, context-dependent nature of language where the same word can mean different things in different contexts.

In addition, from a computational perspective, attention for all words can be computed simultaneously, instead of processing the sequence one word at a time as in an RNN or LSTM. This allows training to be parallelized. Furthermore, attention works well even for very long sequences, because each word can directly access information from any other word without passing through a chain of intermediate steps. This eliminates the problem of long-distance dependencies degrading across the sequence, making it easier for the model to capture relationships between far-apart words.

These two advantages — parallelization and global context handling — are what made attention the foundation for the Transformer architecture, which we discuss next.
