Neural Networks
Learning Objectives
- You know the term neural network and have a high-level understanding of it.
N-gram language models are efficient in a number of ways, but they have limitations. One limitation we discussed earlier is that they can handle sequences of only limited length.
To address this limitation, researchers have looked into other ways of working with text data. One of the prominent approaches has been to use neural network-based models.
In this course, the aim is not to form an in-depth understanding of how neural networks work. Instead, the objective is that you form a high-level, abstract understanding of key concepts, including neural networks, layers, weights, embeddings, and how sequential data can be handled with neural networks.
For an in-depth view of the topic, we recommend looking into the book Dive into Deep Learning.
Neural networks
Neural networks are a class of machine learning models. They are composed of layers of nodes, where each node is connected to the nodes in the previous and next layers. The first layer is the input layer, the last layer is the output layer, and the layers in between are called hidden layers.
Machine learning is an area of artificial intelligence focusing on the development of algorithms that allow computers to learn from data and to make predictions or decisions based on it.
The following picture shows a neural network with one node in the input layer (with green outline), two hidden layers with two nodes each (with black outline), and one node in the output layer (with red outline).
A neural network with four layers: one input layer, two hidden layers, and one output layer.
Input to neural networks is given in numeric format through the input layer. Each node in the input layer corresponds to one numeric input value. The network processes the input by passing the numeric input through the hidden layers to produce an output. Each node in the output layer corresponds to one numeric output value.
The connections between the nodes have multipliers (called weights) that are used to adjust values passed from one node to the next.
Data processing at each node
Each node in the hidden layer and the output layer can perform operations on the incoming values.
The value coming into a node is often determined by calculating a weighted sum of the inputs (the sum of the inputs, each multiplied by the weight of its connection) and adding a constant to the weighted sum. The constant is called a bias; it is used as an additional way to adjust the output of the node.
The operation done in the node to the incoming value depends on an activation function, which is given the input (calculated from the weighted sum and the bias). The activation function then returns the output of the node, which is passed to the next node. There is a range of activation functions; two of the more popular ones are the rectified linear unit (ReLU), which returns 0 if the input is negative and otherwise returns the input unchanged, and the identity function, which simply returns the input.
In summary, for each node, (1) the weighted sum of the inputs is calculated by multiplying the input values by the weights and summing the results, (2) the bias is added to the weighted sum, (3) the result is passed to an activation function, (4) the activation function returns the output of the node, and (5) the output is passed to the next node.
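To make these steps concrete, below is a minimal Python sketch of the computation at a single node, assuming a ReLU activation. The function names are ours, chosen for illustration; actual neural network libraries implement the same idea internally.

```python
def relu(x):
    # ReLU activation: returns 0 for negative inputs, the input itself otherwise
    return max(0.0, x)

def node_output(inputs, weights, bias):
    # (1) weighted sum: multiply each input by its weight and sum the results
    weighted_sum = sum(value * weight for value, weight in zip(inputs, weights))
    # (2) add the bias to the weighted sum
    total = weighted_sum + bias
    # (3)-(5) pass the result to the activation function and return its output
    return relu(total)

print(node_output([5.0], [0.1], 0.0))  # prints 0.5
```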
As an example, the following neural network has one node in the input layer, one hidden layer with one node, and one node in the output layer. The weight of the connection from the input node to the hidden node is 0.1, and the weight of the connection from the hidden node to the output node is 0.3.
Assuming that the biases are zero and the activation function of the hidden node is ReLU, the output of the above neural network for an input value 5 would be calculated as follows.
First, the value 5 would be multiplied by 0.1, giving a weighted sum of 0.5. As the bias is zero, the weighted sum would be passed to the activation function as is, and the function would return the value 0.5. This value would be the output of the hidden node, and it would be passed on to the output layer.
Then, the value 0.5 would be multiplied by 0.3, resulting in 0.15. This value would again be handled by an activation function, which would produce the final output. If the activation function of the output node were the identity function, the final output would be 0.15. That is, the observable output for the input value 5 would be 0.15.
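The same calculation can be written out as a short Python sketch. This is just the above arithmetic in code form, with the hidden node using ReLU, the output node using the identity function (returning the value unchanged), and both biases set to zero as assumed above.

```python
def relu(x):
    return max(0.0, x)

def forward(x):
    # hidden node: weight 0.1, bias 0, ReLU activation
    hidden = relu(x * 0.1 + 0.0)
    # output node: weight 0.3, bias 0, identity activation (value passed through unchanged)
    return hidden * 0.3 + 0.0

print(forward(5))  # prints 0.15
```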
While the above example is intentionally simple, neural networks can and often do consist of many more nodes and layers.
Training a neural network
Neural networks are trained on data that contains inputs and expected outputs. During training, the weights and biases of the network are adjusted to minimize the difference between the predicted output of the neural network and the actual output.
The training involves two steps: forward propagation and backpropagation. In forward propagation, the current weights and biases of the network are applied to the input data to calculate the output (as we did above with the input value 5, which resulted in the output value 0.15).
In backpropagation, the difference between the calculated output and the expected output is measured, and the weights and biases are adjusted. There is a bit more to it, but in essence, each weight and bias is adjusted in the direction that reduces the difference between the calculated output and the expected output.
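As an illustration of the idea, the following sketch repeatedly adjusts the single weight of a one-connection network (output = weight * input) on one training example, using the squared difference between the calculated and expected output as the measure of error. The numbers are made up for the example; real training adjusts many weights and biases over many examples, but the principle of nudging each value to reduce the difference is the same.

```python
x, expected = 5.0, 0.75   # one training example: input and expected output
w = 0.1                   # initial weight
learning_rate = 0.01

for step in range(100):
    predicted = w * x                 # forward propagation with the current weight
    error = predicted - expected      # difference between calculated and expected output
    gradient = 2 * error * x          # derivative of the squared error with respect to w
    w -= learning_rate * gradient     # adjust the weight to decrease the difference

print(round(w, 6))  # approaches 0.15, since 0.15 * 5 = 0.75
```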