Probabilistic Models and Language Models

Word n-gram Language Model


Learning Objectives

  • You know what the n-gram language model is.
  • You know of examples of how the n-gram language model has been used.
  • You know of some of the problems of the n-gram language model.

A word n-gram language model is a type of probabilistic language model for predicting the next word in a sequence of words. The model is constructed from a text database, and it models the probability of a word given the previous n-1 words (hence the name n-gram).
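To make the idea a bit more precise (the notation here is added only for illustration), the model estimates the probability of the next word from counts in the text database, roughly as

P(next word | previous n-1 words) ≈ count(the previous n-1 words followed by the next word) / count(the previous n-1 words).

The sections below walk through this counting for n = 1, 2, and 3.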

The model is strongly linked to the work of Shannon discussed earlier. Shannon also introduced the term n-gram.

1-gram language model

The simplest n-gram model is the 1-gram language model, which is also known as the unigram model. In the 1-gram model, the probability of a word is based solely on the frequency of the word in the text database. The model does not take into account the prior context of the word.

To construct a 1-gram language model, we count the number of times each word appears in the text database and divide the count by the total number of words. The resulting probability distribution can be used to predict the next word in a sequence of words.

Let’s consider the idea of building a simple 1-gram language model using the following three sentences.

  • “The capital of Finland is Helsinki”
  • “The capital of Sweden is Stockholm”
  • “The capital of Norway is Oslo”

In the above sentences, the word “The” appears three times out of the eighteen words in total, as do the words “capital”, “of”, and “is”. The words “Finland”, “Helsinki”, “Sweden”, “Stockholm”, “Norway”, and “Oslo” each appear once.

As a probability distribution, this would be the following.

Word         Probability
The          3/18
capital      3/18
of           3/18
is           3/18
Finland      1/18
Helsinki     1/18
Sweden       1/18
Stockholm    1/18
Norway       1/18
Oslo         1/18
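As a minimal sketch of how these counts could be computed in practice (assuming whitespace-separated, case-sensitive words), the following Python snippet builds the same 1-gram probabilities from the three example sentences:

```python
from collections import Counter

sentences = [
    "The capital of Finland is Helsinki",
    "The capital of Sweden is Stockholm",
    "The capital of Norway is Oslo",
]

# Split the sentences into words and count how many times each word occurs.
words = [word for sentence in sentences for word in sentence.split()]
counts = Counter(words)
total = len(words)  # 18 words in total

# The 1-gram probability of a word is its count divided by the total word count.
unigram_probabilities = {word: count / total for word, count in counts.items()}

print(unigram_probabilities["The"])       # 3/18, i.e. about 0.17
print(unigram_probabilities["Helsinki"])  # 1/18, i.e. about 0.06
```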

Such a 1-gram probability distribution would not be very useful for generating text, as it contains no information about the order of the words. The exercise below highlights this: with it, you can generate 1-gram language model based sentences using text from the book Romeo and Juliet.


Loading Exercise...

2-gram language model

The 2-gram language model takes the previous word into account when considering the probabilities.

In the above sentences, the word “capital” has been used three times after the word “The”, and the word “The” has been used a total of three times. Thus, the probability of the word “capital” given that the previous word was “The” is 3/3, or 1.

Doing the same for all the words, we get the following probability distribution. Some of the words (the ones that end a sentence) are never followed by another word, so their rows are left empty.

Previous word    Probability distribution of next word
The              capital: 1.0
capital          of: 1.0
of               Finland: 1/3, Sweden: 1/3, Norway: 1/3
is               Helsinki: 1/3, Stockholm: 1/3, Oslo: 1/3
Finland          is: 1.0
Helsinki
Sweden           is: 1.0
Stockholm
Norway           is: 1.0
Oslo

Now, we know that the word “The” is always followed by the word “capital”, and the word “capital” is always followed by “of”. The word “of” is followed by either “Finland”, “Sweden”, or “Norway”. Similarly, we know that the word “is” is always followed by either “Helsinki”, “Stockholm”, or “Oslo”.

Although nice, this is still not very informative. Based on the above probability distribution, generating text to complete the sentence “The capital of Finland is ” would produce the word “Helsinki” with probability 1/3, the word “Stockholm” with probability 1/3, and the word “Oslo” with probability 1/3. This stems from the model only knowing the previous word (“is”), and not the earlier words that tell us the sentence is about Finland.
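A similar sketch for the 2-gram case keeps a separate counter of next words for each previous word; it also shows why the model cannot pick “Helsinki” with certainty:

```python
from collections import Counter, defaultdict

sentences = [
    "The capital of Finland is Helsinki",
    "The capital of Sweden is Stockholm",
    "The capital of Norway is Oslo",
]

# For every word, count which words follow it.
next_word_counts = defaultdict(Counter)
for sentence in sentences:
    words = sentence.split()
    for previous, current in zip(words, words[1:]):
        next_word_counts[previous][current] += 1

# Turn the counts into conditional probabilities P(next word | previous word).
bigram_probabilities = {
    previous: {word: count / sum(counter.values()) for word, count in counter.items()}
    for previous, counter in next_word_counts.items()
}

# The model only sees the previous word "is", so all three capitals are equally likely.
print(bigram_probabilities["is"])
# {'Helsinki': 0.333..., 'Stockholm': 0.333..., 'Oslo': 0.333...}
```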

With the following, you can generate 2-gram language model based sentences using text from the book Romeo and Juliet.


Loading Exercise...

3-gram language model

The 3-gram language model takes the previous two words into account. We can calculate the probability of the word “of” given that the previous two words were “The” and “capital”. We can calculate the probability of the word “Finland” given that the previous two words were “capital” and “of”. And so on. In this case, the probability distribution would look as follows.

Previous two words    Probability distribution of next word
The capital           of: 1.0
capital of            Finland: 1/3, Sweden: 1/3, Norway: 1/3
of Finland            is: 1.0
of Sweden             is: 1.0
of Norway             is: 1.0
Finland is            Helsinki: 1.0
Sweden is             Stockholm: 1.0
Norway is             Oslo: 1.0

Now, we know that the words “The capital” are always followed by the word “of”. The words “capital of” are always followed by either “Finland”, “Sweden”, or “Norway”. The words “of Finland” are always followed by the word “is”, and so on. Importantly, the model now knows that the words “Finland is” are always followed by “Helsinki”.
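The same counting idea extends directly to the 3-gram case by using the pair of previous words as the key; a minimal sketch:

```python
from collections import Counter, defaultdict

sentences = [
    "The capital of Finland is Helsinki",
    "The capital of Sweden is Stockholm",
    "The capital of Norway is Oslo",
]

# For every pair of consecutive words, count which words follow that pair.
next_word_counts = defaultdict(Counter)
for sentence in sentences:
    words = sentence.split()
    for first, second, third in zip(words, words[1:], words[2:]):
        next_word_counts[(first, second)][third] += 1

trigram_probabilities = {
    pair: {word: count / sum(counter.values()) for word, count in counter.items()}
    for pair, counter in next_word_counts.items()
}

# With two words of context, "Finland is" determines the next word uniquely.
print(trigram_probabilities[("Finland", "is")])  # {'Helsinki': 1.0}
```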

With the following, you can generate 3-gram language model based sentences using text from the book Romeo and Juliet.


Loading Exercise...

Beyond the 3-gram model

The n-gram language model can be extended to include more words in the context. The 4-gram model takes the previous three words into account, the 5-gram model takes the previous four words into account, and so on. The larger the n, the more context the model has, and the more sensible the generated text is likely to be.
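To give a feel for how generation with such a model could work for an arbitrary n, here is a small illustrative sketch; the helper functions and the tiny example text are made up for this illustration (the course exercises use Romeo and Juliet instead):

```python
import random
from collections import Counter, defaultdict

def build_ngram_model(text, n):
    """Count, for each (n-1)-word context, how often each next word follows it."""
    words = text.split()
    model = defaultdict(Counter)
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])
        next_word = words[i + n - 1]
        model[context][next_word] += 1
    return model

def generate(model, context, length=20):
    """Sample words one at a time, always conditioning on the last n-1 words."""
    order = len(context)
    words = list(context)
    for _ in range(length):
        counter = model.get(tuple(words[-order:]))
        if not counter:
            break  # unseen context: the model cannot continue
        choices, weights = zip(*counter.items())
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

# A tiny placeholder text; in the exercises the text would be Romeo and Juliet.
text = "The capital of Finland is Helsinki . The capital of Sweden is Stockholm ."
model = build_ngram_model(text, n=3)  # a 3-gram model
print(generate(model, context=("The", "capital")))
```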

The following allows you to generate 4-gram language model based sentences from Romeo and Juliet. When you compare the output with the 1-gram and 2-gram language models, you’ll likely see some differences in how sensible the sentences are.

Uses and problems

Overall, the n-gram model is simple yet efficient. It has been widely used in a range of tasks, including speech recognition, machine translation, and text generation.

In 2006, Google released a large dataset of n-grams. The dataset was built from more than one trillion tokens of text, and it includes n-grams of up to five words, with more than a billion distinct 5-grams. Google also offers a tool called the Books Ngram Viewer, which allows you to search for n-grams in a corpus of books and study how their usage has changed over time.

The n-gram model also has a handful of limitations. It can only consider a fixed number of previous words, which limits the context the model can use to predict the next word. As an example, a 5-gram model could be quite good at predicting the continuation of the sentence “The capital of Finland is”, as the key piece of context (the word “Finland”) falls within the four previous words the model sees when filling in the last word, i.e. “Helsinki”.

On the other hand, the model would struggle to correctly predict the continuation of the sentence “Finland is a country in Northern Europe, with the capital being”, as the information about the country (Finland) lies further back than the few previous words the model can see.

Size-wise, the model can also grow quite large. The number of possible n-grams is the vocabulary size raised to the power n, so it grows very quickly with both the size of the vocabulary and the chosen n; effectively, the model suffers from the curse of dimensionality. For example, with a vocabulary of 10,000 words there are 10,000² = 100 million possible 2-grams and 10,000³ = one trillion possible 3-grams.

Regarding the vocabulary, the model can also struggle with missing words (or out-of-vocabulary words). If the model has not seen a word (or an n-gram) during training, it will not be able to predict the next word in the sequence. One solution to this problem is Laplace smoothing (also called additive smoothing), where a small constant is added to all counts to ensure that no probability is zero.
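As a rough illustration of the idea (with add-one smoothing and a made-up vocabulary size), the smoothed probability of a next word could be computed like this:

```python
def smoothed_probability(counts, word, vocabulary_size, alpha=1):
    """Add-one (Laplace) smoothed probability of `word`, given the observed
    next-word counts for some context and the number of distinct words the
    model knows about."""
    total = sum(counts.values())
    return (counts.get(word, 0) + alpha) / (total + alpha * vocabulary_size)

# With observed counts {"Helsinki": 1} and a (made-up) vocabulary of 10 words,
# an unseen word such as "Stockholm" still gets a small non-zero probability.
observed = {"Helsinki": 1}
print(smoothed_probability(observed, "Helsinki", vocabulary_size=10))   # 2/11, about 0.18
print(smoothed_probability(observed, "Stockholm", vocabulary_size=10))  # 1/11, about 0.09
```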

Finally, the model also has little information about the context of the words. It can predict the next word based on the previous words, but it has no information about the meaning of the words or the broader context of the sentence.

Loading Exercise...

Let’s next look into neural network-based large language models that address some of the limitations of the n-gram model.