Neural Networks and Language Models

Embeddings and Word Embeddings


Learning Objectives

  • You know the terms embedding and word embedding.

Embeddings

Neural networks take numeric data as input. To process non-numeric data, such data needs to be converted into numeric format. This conversion process is called embedding. Embeddings are numerical representations of data.

As an example, for the rock paper scissors game that we played earlier, we could represent each possible move as a number. Rock could be represented as 0, paper as 1, and scissors as 2. Such a representation could be problematic, however, as it would imply that there is some sort of ranking between the moves.

An alternative approach would be to represent each move using a vector of three numbers, where one of the numbers would be 1 and the others would be 0. As an example, the vector [1, 0, 0] could represent rock, [0, 1, 0] paper, and [0, 0, 1] scissors.
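
As a small sketch, the two representations could be written in Python as follows (the variable and function names here are just illustrative).

moves = ["rock", "paper", "scissors"]

# Single-number representation: implies an ordering between the moves.
as_number = {move: index for index, move in enumerate(moves)}
print(as_number)  # {'rock': 0, 'paper': 1, 'scissors': 2}

# One-hot representation: each move gets its own position in a vector.
def one_hot(move):
    vector = [0, 0, 0]
    vector[moves.index(move)] = 1
    return vector

print(one_hot("rock"))      # [1, 0, 0]
print(one_hot("paper"))     # [0, 1, 0]
print(one_hot("scissors"))  # [0, 0, 1]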

If we wanted to create a neural network that takes such a move as an input and produces an output, the input layer would have three nodes, one for each of the numbers in the vector, and the output layer would have three nodes, one for each of the possible moves.

One example of such a network could be as follows: three nodes in the input layer, two hidden layers with four nodes each, and three nodes in the output layer.
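
As a rough sketch, a network with this shape could be defined with the PyTorch library as follows (the choice of library and the use of the ReLU activation are assumptions made here for illustration).

import torch
import torch.nn as nn

# Input layer (3 nodes) -> hidden layer (4 nodes) -> hidden layer (4 nodes)
# -> output layer (3 nodes).
model = nn.Sequential(
    nn.Linear(3, 4),
    nn.ReLU(),
    nn.Linear(4, 4),
    nn.ReLU(),
    nn.Linear(4, 3),
)

# Feeding in the one-hot vector for rock gives three output values, one per
# possible move. The network is untrained here, so the values are arbitrary.
rock = torch.tensor([[1.0, 0.0, 0.0]])
print(model(rock))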



Word embeddings

Some words share a meaning, while other words can have different meanings depending on the context. As an example, the words “understand” and “comprehend” have a similar meaning, while the words “understand” and “eat” have clearly different meanings.

Because of this, representing words with a single number or with one-hot vectors like the ones above would be problematic: a single number would imply a ranking between words, and a one-hot vector would place every word at an equal “distance” from every other word. Neither representation would allow words to share meanings.

To learn how words relate to each other, and to provide this information to language models, researchers have looked into creating word embeddings. Word embeddings are, as the name implies, embeddings for words, i.e. numeric representations of words. They are a bit more complex than the vectors used for the rock paper scissors game, where each vector, e.g. [0, 0, 1], explicitly separated a choice from the other choices.

Word embeddings are numeric representations of words that encode information about the meaning of the words. They are constructed by training a neural network to learn a task related to the words, such as learning to predict the next word or to fill in a gap in text.

Learning word embeddings involves training on text, such as “i understand the meaning of the word” and “i comprehend the meaning of the word”. During training, the neural network could be given the task of filling in a gap in sentences such as “i (gap) the meaning of the word”, learning to replace “(gap)” with the correct word.
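
As a small sketch, such gap-filling training examples could be constructed from the two sentences above roughly as follows (the exact procedure is just illustrative).

sentences = [
    "i understand the meaning of the word",
    "i comprehend the meaning of the word",
]

training_examples = []
for sentence in sentences:
    words = sentence.split()
    for position, target in enumerate(words):
        # Replace one word with a gap; the removed word is the label to predict.
        context = words[:position] + ["(gap)"] + words[position + 1:]
        training_examples.append((" ".join(context), target))

print(training_examples[1])
# ('i (gap) the meaning of the word', 'understand')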

Through training, the neural network would learn that the words “understand” and “comprehend” are similar, while the words “understand” and “eat” are not.


Learning word embeddings

Currently, the most common approaches for learning word embeddings involve neural networks. One possibility is to have a node in the input layer for each word in the data, then a hidden layer with a given number of nodes, and then an output layer.

Learning the word embeddings involves using the neural network to learn a task related to the words, such as learning to predict the next word. When the neural network is trained, it learns the weights, and in doing so, the weights of words that are similar to each other end up closer together than the weights of words that are dissimilar.

Once the training has finished, the neural network has a set of learned weights for each input node, going to the nodes in the hidden layer. These weights can be used to represent the words: the resulting weights are the word embeddings. If there were, for example, three nodes in the hidden layer, then each word would be represented by three numbers.
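
As a rough sketch, such a network and the extraction of the embeddings from its weights could look as follows with PyTorch (the library, the tiny vocabulary, and the omission of the actual training loop are simplifications made here; an untrained network would give random embeddings).

import torch.nn as nn

# One input node per word, a hidden layer with three nodes, and one output
# node per word.
vocabulary = ["i", "understand", "comprehend", "eat", "the", "meaning", "of", "word"]
vocab_size = len(vocabulary)

hidden = nn.Linear(vocab_size, 3, bias=False)   # input layer -> hidden layer
output = nn.Linear(3, vocab_size, bias=False)   # hidden layer -> output layer

# (Training on a task such as gap filling would happen here.)

# The weights going from each input node to the hidden layer are the word
# embeddings: one three-number vector per word in the vocabulary.
embedding_matrix = hidden.weight.detach().numpy().T  # shape: (vocab_size, 3)
embeddings = {word: embedding_matrix[i] for i, word in enumerate(vocabulary)}
print(embeddings["understand"])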

Now, the similarity between word embeddings, calculated for example using cosine similarity, can be used to determine how similar the words are to each other.
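
As a small sketch, cosine similarity between two embedding vectors could be computed as follows (the three-number embeddings below are made up just to illustrate the calculation).

import numpy as np

# Cosine similarity: close to 1 when the vectors point in the same direction,
# close to 0 when they are unrelated.
def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings, not learned from any real data.
understand = [0.9, 0.1, 0.3]
comprehend = [0.8, 0.2, 0.3]
eat = [0.1, 0.9, 0.7]

print(cosine_similarity(understand, comprehend))  # close to 1 -> similar
print(cosine_similarity(understand, eat))         # clearly smaller -> less similar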

As an example of a method for building word embeddings, see Word2Vec. Word2Vec is a set of methods for learning word embeddings from text data, focusing on predicting a word given its surrounding context (the continuous bag-of-words model) and on predicting the surrounding context given a word (the skip-gram model).
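
As a rough sketch, training a Word2Vec model could look as follows using the gensim library (this assumes gensim 4.x; in practice, meaningful embeddings require far more text than two sentences).

from gensim.models import Word2Vec

sentences = [
    "i understand the meaning of the word".split(),
    "i comprehend the meaning of the word".split(),
]

# sg=1 selects the skip-gram variant; vector_size sets the embedding length.
model = Word2Vec(sentences, vector_size=3, window=2, min_count=1, sg=1)

print(model.wv["understand"])                           # a 3-number embedding
print(model.wv.similarity("understand", "comprehend"))  # cosine similarity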