Neural Networks for Language

Embeddings and Word Representations


Learning Objectives

  • You know what embeddings are and why they are needed.
  • You understand what word embeddings are and how they capture semantic meaning.
  • You know at a high level how word embeddings are learned using neural networks.

Why embeddings?

Neural networks operate exclusively on numbers — they perform mathematical operations like multiplication, addition, and applying activation functions to numerical values. However, much of the data we want to process — text, images, categorical variables — is not inherently numerical. Words like “cat” or “understand” are symbols, not quantities that can be directly added or multiplied.

We need a way to convert these non-numeric entities into numerical representations that computers can process mathematically. This conversion process is called creating an embedding: a numerical representation that captures important properties of the original data in a form that neural networks can work with.

The challenge is not merely to assign numbers arbitrarily, but to create representations that preserve meaningful relationships from the original domain. A good embedding should encode relevant information in a way that the neural network can exploit for its task — similar items should have similar representations, and important differences should be reflected in the numerical structure.

A simple example: Rock-paper-scissors

To understand the challenge of creating good representations, consider how we might encode the three moves in rock-paper-scissors as numbers.

Naive approach: Assign simple integers: rock = 0, paper = 1, scissors = 2.

Problem: This representation implies an artificial ordering, as if scissors (2) is “greater than” paper (1), which is “greater than” rock (0). This ordering doesn’t reflect the actual game relationships where each move beats one option and loses to another in a circular pattern. A neural network might incorrectly learn patterns based on this false numerical ordering, potentially treating the difference between scissors and rock as larger than the difference between rock and paper simply because 2 - 0 = 2 while 1 - 0 = 1.

Better approach: Use a vector with one position for each possible move, placing a 1 in the position corresponding to the chosen move and 0 everywhere else:

  • Rock = [1, 0, 0]
  • Paper = [0, 1, 0]
  • Scissors = [0, 0, 1]

This representation is called one-hot encoding. Each move is represented by a vector of length 3 (the number of possible moves), with exactly one “hot” (set to 1) position. This avoids creating false orderings — rock, paper, and scissors are all equally distant from each other in this representation (the distance between any two different one-hot vectors is the same), with no move privileged over others.
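
As a small illustration, the one-hot vectors above can be built and compared with a few lines of Python (numpy is used here purely for convenience):

    import numpy as np

    # One-hot encodings for the three moves, exactly as listed above.
    moves = ["rock", "paper", "scissors"]
    one_hot = {move: np.eye(len(moves))[i] for i, move in enumerate(moves)}

    print(one_hot["rock"])   # [1. 0. 0.]
    print(one_hot["paper"])  # [0. 1. 0.]

    # Every pair of distinct moves is the same distance apart,
    # so no move is privileged over another.
    for a in moves:
        for b in moves:
            if a < b:
                d = np.linalg.norm(one_hot[a] - one_hot[b])
                print(a, b, round(d, 3))  # always about 1.414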

If we built a neural network to predict rock-paper-scissors moves, the input layer would have three nodes (one for each position in the input vector), and the output layer would also have three nodes (producing probabilities for each possible next move). Figure 1 below shows such a network with a single hidden layer of two nodes.

Figure 1 — A simple neural network for predicting rock-paper-scissors moves using one-hot encoded inputs and outputs.

Depending on the data, weights, biases, and activation functions, the above network could learn to predict the next move based on the current move. For example, if the data has many instances where rock [1, 0, 0] is followed by paper [0, 1, 0], the network might learn to associate rock with a high probability of paper: given the input [1, 0, 0] (rock), it would then output a distribution close to [0, 1, 0] (paper).

In practice, if the goal is to avoid losing, the network would learn to output the move that beats the opponent's last move: if the opponent played rock, the network should play paper to win.
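
To make this concrete, below is a minimal sketch of the forward pass of such a network in Python. The weight values are made up for illustration (they are not learned), chosen so that a rock input yields a high probability for paper:

    import numpy as np

    def softmax(z):
        # Turn raw scores into probabilities that sum to 1.
        e = np.exp(z - np.max(z))
        return e / e.sum()

    # Made-up weights for a 3 -> 2 -> 3 network (input, hidden, output).
    W1 = np.array([[ 1.0, -1.0],
                   [-1.0,  1.0],
                   [ 0.5,  0.5]])       # input -> hidden (2 hidden nodes)
    W2 = np.array([[-1.0,  2.0, -1.0],
                   [ 1.0, -2.0,  1.0]]) # hidden -> output

    rock = np.array([1.0, 0.0, 0.0])    # one-hot input

    hidden = np.tanh(rock @ W1)         # hidden layer activations
    probs = softmax(hidden @ W2)        # probabilities for rock/paper/scissors

    print(np.round(probs, 2))           # highest value in the "paper" position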

Word embeddings

Representing words as numbers presents a far more complex challenge than rock-paper-scissors. Unlike game moves, words have rich semantic relationships that we would like to capture:

  • Words like “understand” and “comprehend” are nearly synonymous — they have very similar meanings and can often substitute for each other in sentences.

  • Words like “understand” and “eat” are semantically distant — they refer to completely different concepts with no obvious relationship.

  • Some words share partial meaning: “cat” and “kitten” both refer to felines, but carry different connotations about age and size.

If we used one-hot encoding for words (as we did for rock-paper-scissors), we would create vectors like:

  • “understand” = [0, 0, 0, 1, 0, 0, ...] (assuming it’s the 4th word in our vocabulary)
  • “comprehend” = [0, 0, 0, 0, 0, 1, ...] (perhaps the 6th word)
  • “eat” = [1, 0, 0, 0, 0, 0, ...] (perhaps the 1st word)

With one-hot encoding, every word is represented by a vector that is equally distant from every other word — “understand” is just as different from “comprehend” as it is from “eat” in terms of vector distance. The representation completely ignores semantic relationships between words. For a vocabulary of 50,000 words, we would need 50,000-dimensional vectors where each word differs from every other word in exactly two positions (its own position and the other word’s position), making all pairwise distances identical.

To address this limitation, researchers developed word embeddings: dense, lower-dimensional numerical representations of words that encode information about their meanings and relationships.

Word embeddings are typically dense vectors with continuous values like [0.23, -0.11, 0.76, 0.45, -0.33, ...] rather than sparse one-hot vectors containing mostly zeros. These embeddings are learned from large amounts of text data, and the training process ensures that words with similar meanings end up with similar vector representations. For example, “understand” and “comprehend” would have embeddings that are close to each other in the vector space (small distance between them), while “understand” and “eat” would be far apart (large distance).
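
As a sketch, this contrast can be seen by comparing made-up low-dimensional vectors (real embeddings are learned and have hundreds of dimensions; the numbers below are invented for illustration):

    import numpy as np

    # Hypothetical 4-dimensional embeddings, invented for illustration only.
    embeddings = {
        "understand": np.array([ 0.23, -0.11,  0.76,  0.45]),
        "comprehend": np.array([ 0.25, -0.09,  0.72,  0.48]),
        "eat":        np.array([-0.60,  0.80, -0.10,  0.05]),
    }

    def distance(a, b):
        # Euclidean distance between two embedding vectors.
        return np.linalg.norm(embeddings[a] - embeddings[b])

    print(distance("understand", "comprehend"))  # small: similar meanings
    print(distance("understand", "eat"))         # large: unrelated meanings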

How embeddings capture meaning

Word embeddings are learned by training neural networks on language tasks. The key insight underlying this approach is that words that appear in similar contexts tend to have similar meanings — an idea rooted in the linguistic principle we mentioned earlier: “You shall know a word by the company it keeps.”

Common training tasks include:

  • Predicting a word from its context: Given surrounding words with a blank, such as “I ??? the meaning of the word”, the task is to predict which word fills the ??? position. From examples like this, the network learns that “understand” is a good fit. At the same time, words like “comprehend,” “grasp,” or “know” would also fit well, so during training the network comes to assign similar representations to words used in similar contexts.

  • Predicting the next word in a sequence (language modeling): Given “I understand the meaning of the ???”, predict which word fills the ??? position; likely completions are “word,” “sentence,” “text,” and the like.

During training on millions of sentences, the neural network adjusts its internal weights through backpropagation. Words that frequently appear in similar contexts — like “understand” and “comprehend” — gradually develop similar internal representations because they help the network make similar predictions in similar situations.

There is no need to explicitly (or manually) state that “these words mean similar things” — similarity emerges naturally from training on huge amounts of text data.
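
As a sketch of where the training examples come from, the snippet below slides a window over a sentence and produces (context, target) pairs for the “predict a word from its context” task; the sentence and window size are arbitrary choices for illustration:

    # Build (context words, target word) training pairs from a sentence.
    sentence = "i understand the meaning of the word".split()
    window = 2  # number of words taken on each side of the target

    pairs = []
    for i, target in enumerate(sentence):
        context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
        pairs.append((context, target))

    for context, target in pairs:
        print(context, "->", target)
    # e.g. ['i', 'the', 'meaning'] -> understand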

The result is a learned embedding space where:

  • Similar words cluster together: “cat,” “kitten,” “feline” occupy nearby regions of the vector space
  • Semantic relationships are encoded geometrically: The direction from “king” to “queen” is similar to the direction from “man” to “woman,” capturing gender relationships
  • Distance reflects semantic dissimilarity: Unrelated words like “understand” and “refrigerator” are far apart in the space

Learning word embeddings with neural networks

A typical architecture for learning word embeddings looks like this:

Input layer: One node for each word in the vocabulary. For a vocabulary of 50,000 words, this layer has 50,000 nodes. The input is represented as a one-hot vector indicating which word we’re currently processing.

Hidden layer (embedding layer): A much smaller set of nodes — typically a few hundred, though modern models sometimes use larger dimension sizes. This layer has far fewer nodes than the vocabulary size, creating a “bottleneck” that forces the network to learn compressed, meaningful representations rather than simply memorizing which words appeared where.

Output layer: Predicts the target word(s) for the training task. For example, this might output probabilities for each word in the vocabulary appearing in the predicted position.

During training:

  • The network processes thousands or millions of examples from text corpora.
  • The weights connecting the input layer to the hidden layer are adjusted through backpropagation to improve predictions on the training task.
  • These learned weights become the word embeddings. Each word’s embedding is the set of weights from its one-hot input node to all the hidden layer nodes.

For example, if the hidden layer has 300 nodes, each word is represented by a 300-dimensional vector — the 300 weights connecting that word’s input node to the hidden layer nodes. These vectors are dense (most values are non-zero) and continuous (values can be any real number, not just 0 or 1).

That is, a single word embedding consists of just 300 numbers (the weights from that word’s input node to the hidden layer nodes), rather than a sparse vector of 50,000 dimensions with only one non-zero entry.
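
A small sketch of why the embedding is just this set of weights: multiplying a one-hot input vector by the input-to-hidden weight matrix simply selects the row belonging to that word. The sizes below are scaled down from 50,000 x 300 for readability:

    import numpy as np

    vocab_size, embedding_dim = 5, 3   # tiny stand-ins for 50,000 and 300
    rng = np.random.default_rng(0)

    # Input-to-hidden weight matrix: one row of weights per vocabulary word.
    W = rng.normal(size=(vocab_size, embedding_dim))

    word_index = 2                     # position of some word in the vocabulary
    one_hot = np.zeros(vocab_size)
    one_hot[word_index] = 1.0

    # The one-hot multiplication picks out row `word_index` of W,
    # which is exactly that word's embedding.
    print(one_hot @ W)
    print(W[word_index])               # identical result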

After training, we can measure similarity between words by comparing their embedding vectors using metrics like cosine similarity, which measures the angle between two vectors. Vectors pointing in similar directions (small angle) have high cosine similarity, indicating semantic similarity between the words they represent.
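
A minimal cosine similarity sketch (again with invented embedding values):

    import numpy as np

    def cosine_similarity(a, b):
        # Cosine of the angle between two vectors: 1 means same direction,
        # 0 means orthogonal, -1 means opposite directions.
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    understand = np.array([0.23, -0.11, 0.76, 0.45])   # made-up embeddings
    comprehend = np.array([0.25, -0.09, 0.72, 0.48])
    eat = np.array([-0.60, 0.80, -0.10, 0.05])

    print(cosine_similarity(understand, comprehend))   # close to 1
    print(cosine_similarity(understand, eat))          # much lower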

Word2Vec and beyond

One of the most influential methods for learning word embeddings is Word2Vec, introduced by researchers at Google in 2013. Word2Vec trains a shallow neural network to predict either:

  • Given a word, predict the surrounding words in a text window (e.g., given “understand,” predict nearby words like “I,” “the,” “meaning”); this variant is called skip-gram.
  • Given surrounding words, predict the middle word (e.g., given “I ___ the meaning,” predict “understand”); this variant is called continuous bag-of-words (CBOW).

The hidden layer weights from this training become the word embeddings — no additional processing or extraction is needed beyond reading these learned weights.

Word2Vec revealed that word embeddings capture patterns in language, and that some of these patterns can even be expressed through simple vector arithmetic. Some famous examples include:

  • king - man + woman ≈ queen
  • Paris - France + Italy ≈ Rome
  • walked - walking + swimming ≈ swam

These patterns emerge naturally from training on large text corpora without explicit programming of grammatical or semantic rules. The network discovers these relationships simply by learning to predict word contexts.
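
As a sketch, such an analogy query is just vector arithmetic followed by a nearest-neighbour search over the vocabulary. The embeddings below are invented toy values; with real pretrained Word2Vec vectors the same procedure is what recovers “queen”:

    import numpy as np

    # Invented 3-dimensional embeddings for illustration only; real Word2Vec
    # vectors typically have a few hundred dimensions.
    emb = {
        "king":   np.array([0.9, 0.8, 0.1]),
        "queen":  np.array([0.9, 0.2, 0.8]),
        "man":    np.array([0.1, 0.9, 0.1]),
        "woman":  np.array([0.1, 0.3, 0.8]),
        "prince": np.array([0.8, 0.9, 0.2]),
        "apple":  np.array([0.0, 0.0, 0.1]),
    }

    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def nearest(vector, exclude):
        # Word whose embedding is most similar to `vector`, skipping query words.
        scores = {w: cos(vector, v) for w, v in emb.items() if w not in exclude}
        return max(scores, key=scores.get)

    query = emb["king"] - emb["man"] + emb["woman"]
    print(nearest(query, exclude={"king", "man", "woman"}))  # queen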

