Basics of Neural Networks
Learning Objectives
- You know what a neural network is and have a high-level understanding of how it works.
- You understand the role of layers, weights, biases, and activation functions.
Earlier, we explored n-gram models and their fundamental limitations: they can only examine a short, fixed window of previous words and struggle to capture long-range patterns or semantic relationships. To address these limitations, researchers turned to neural networks — computational models loosely inspired by the structure of biological brains, capable of learning and representing far more complex relationships in data.
This course aims for an intuitive understanding of neural networks rather than mathematical rigor. You don’t need to master calculus and linear algebra to grasp the key concepts: networks have layers of nodes, connections with weights that determine information flow, small adjustments called biases, and activation functions that help capture nonlinear patterns in data.
For deeper mathematical treatment, see the free online book Dive into Deep Learning.
Neural networks: The big picture
A neural network is a machine learning model composed of layers of interconnected nodes (also called neurons or units). These nodes are organized into distinct layers:
- Input layer: Receives raw data as numerical values. For language models, these might be numerical representations of words or subword tokens.
- Hidden layers: One or more intermediate layers that process and transform data through weighted connections. These layers extract increasingly abstract features from the input. A network might have dozens or even hundreds of hidden layers in modern deep learning systems.
- Output layer: Produces the final prediction or result. For a language model, this typically outputs probabilities for what word or token comes next in the sequence.
Figure 1 below illustrates a simple neural network with an input layer, two hidden layers, and an output layer. It is a fully-connected network, meaning every node in one layer connects to every node in the next layer.
You can think of the network as a series of transformations that gradually convert raw input numbers into useful predictions. Each layer extracts different patterns or features from the data, with early layers often detecting simple patterns and deeper layers combining these into more complex abstractions.
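To make the layered structure concrete, here is a minimal sketch of a fully-connected forward pass in Python with NumPy. The layer sizes, the random weights, and the choice of ReLU are illustrative assumptions, not details from the figure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative layer sizes: 3 input nodes, two hidden layers of 4 nodes each, 2 output nodes.
sizes = [3, 4, 4, 2]

# One weight matrix and one bias vector between each pair of adjacent layers.
weights = [rng.normal(size=(n_in, n_out)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

def relu(x):
    return np.maximum(0.0, x)

def forward(x):
    """Send an input vector through every layer in turn."""
    for w, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ w + b)              # hidden layers: weighted sum + bias + activation
    return x @ weights[-1] + biases[-1]  # output layer: weighted sum + bias only

print(forward(np.array([1.0, 0.5, -0.2])))  # three numbers in, two numbers out
```

Because every node in one layer connects to every node in the next, a single matrix multiplication handles all of a layer's connections at once.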
Data flow inside a network
Each connection between nodes has a weight — a numerical parameter determining how strongly information passes from one node to the next. During training, the network learns appropriate values for these weights through exposure to examples.
Understanding how information flows through a network clarifies how it works. Here’s what happens at a single node:
- The node receives inputs from all nodes in the previous layer (or from raw data if it’s in the input layer).
- Each input is multiplied by the weight associated with its connection. This determines how much influence each input has on the current node.
- All weighted inputs are summed together, and a bias is added. The bias is a learnable parameter that allows the node to shift its activation threshold — essentially allowing the node to fire even when weighted inputs are small, or to require stronger input before activating.
- The result passes through an activation function (such as ReLU, sigmoid, or tanh), which introduces nonlinearity into the model.
- The output is then sent forward to all nodes in the next layer.
This process repeats at every node in every layer. Through repeated transformation across many nodes and layers, neural networks can represent extremely complex functions — far more complex than simple linear relationships or the pattern matching possible with n-gram models.
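As a rough sketch of these steps for a single node, here is the computation in plain Python. The specific inputs, weights, bias, and the choice of a sigmoid activation are made up for illustration:

```python
import math

def node_output(inputs, weights, bias):
    """One node: multiply inputs by weights, sum, add the bias, apply an activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-weighted_sum))  # sigmoid activation squashes the sum into (0, 1)

# Hypothetical node with three incoming connections.
print(node_output(inputs=[0.5, -1.0, 2.0], weights=[0.8, 0.2, -0.5], bias=0.1))
```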
An activation function determines what a node outputs after summing its weighted inputs and adding its bias. Common activation functions include (but are not limited to):
- Identity function: Outputs the input unchanged.
- ReLU (Rectified Linear Unit): Outputs max(0, x), keeping positive values unchanged and zeroing out negative values.
- Sigmoid function: Maps inputs to values between 0 and 1, following an “S-shaped curve”.
Activation functions matter because they introduce nonlinearity: they let the network learn curves, thresholds, interactions, and other complex patterns. This is what gives neural networks their power to approximate arbitrarily complex functions.
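The three activation functions listed above take only a few lines of Python; this sketch is one conventional way to write them:

```python
import math

def identity(x):
    return x                       # pass the input through unchanged

def relu(x):
    return max(0.0, x)             # keep positive values, zero out negatives

def sigmoid(x):
    return 1 / (1 + math.exp(-x))  # squash any input into the range (0, 1)

for f in (identity, relu, sigmoid):
    print(f.__name__, f(-2.0), f(0.5))
```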
Example calculation
To make this concrete, let’s trace through a calculation in a minimal neural network with just three nodes:
- 1 input node
- 1 hidden node
- 1 output node
We’ll set weights as follows: 0.1 (input → hidden) and 0.3 (hidden → output). Both biases are set to 0 for simplicity. Activation functions: ReLU for the hidden node, identity for the output node.
Let’s trace what happens when we input the value 5:
Step 1: Input to hidden layer
- The input (5) is multiplied by the weight (0.1): 5 × 0.1 = 0.5
- Add the bias (0): 0.5 + 0 = 0.5
- Apply ReLU activation: ReLU(0.5) = 0.5 (since 0.5 is positive, it passes through unchanged)
Step 2: Hidden to output layer
- The hidden node’s output (0.5) is multiplied by the weight (0.3): 0.5 × 0.3 = 0.15
- Add the bias (0): 0.15 + 0 = 0.15
- Apply identity activation: identity(0.15) = 0.15
Final result: 0.15
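The same trace can be reproduced in a few lines of Python. This is only a sketch of the arithmetic above; the function and variable names are chosen here for clarity:

```python
def relu(x):
    return max(0.0, x)

def identity(x):
    return x

w_input_hidden, w_hidden_output = 0.1, 0.3  # weights from the example
b_hidden, b_output = 0.0, 0.0               # both biases set to 0

x = 5                                                   # input value
hidden = relu(x * w_input_hidden + b_hidden)            # step 1: ReLU(5 * 0.1 + 0) = 0.5
output = identity(hidden * w_hidden_output + b_output)  # step 2: 0.5 * 0.3 + 0 = 0.15
print(output)                                           # 0.15
```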
This is a very simple example with just two connections, but the same principles apply when scaling to networks with millions or billions of parameters. The complexity comes from having many nodes operating in parallel within each layer, multiple layers in sequence, and learning appropriate weight values through training on large datasets.