Probabilistic Models and Language Models

Towards Language Models


Learning Objectives

  • You know what language models are and know of their origins.

Language models are probabilistic models of natural language. They can be used, among other things, to determine the probability of a sequence of words and the probability of a word given a sequence of previous words.
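For instance, the probability of a sequence of words can be written, using the chain rule of probability, as a product of conditional probabilities of each word given the words that precede it:

$$
P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
$$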

The following quote from John Rupert Firth aptly summarizes the key idea of language models.

You shall know a word by the company it keeps.

Experiments with letters

Some of the early language models can be traced to the works of Andrey Markov and Claude Shannon, who both explored the frequencies of letters in text, although from different viewpoints.

Markov conducted experiments where he tracked the occurrence of letters in text, counting vowels and consonants. He counted the number of times a consonant was followed by a vowel or by another consonant, and vice versa for vowels. He observed that letters in text are not randomly distributed, but that there is an underlying mechanism that affects their occurrence.

The following table shows an end result of counting occurrences of letters in text, going beyond counting vowels and consonants. The table gives the probability of a certain letter pair appearing in the text. The letters are listed on the left-hand side and at the top, and each cell shows the probability of the letter at the top occurring immediately after the letter on the left.

The table has been constructed from the first 50,000 characters of Romeo and Juliet by William Shakespeare.

      e        t        o        a        i
e     0.0038   0.0055   0.0027   0.0067   0.0012
t     0.0047   0.0009   0.0059   0.0025   0.0034
o     0.0003   0.0036   0.0031   0.0004   0.0004
a     -        0.0070   0.0000   -        0.0030
i     0.0033   0.0066   0.0026   0.0008   0.0005

The table above shows that, among these letters, the letter “e” is most likely to be followed by the letter “a”, and the letter “t” is most likely to be followed by the letter “o”. Interestingly, in the first 50,000 characters of the play, the letter “a” is never followed by the letter “e”, and it is likewise never followed by another “a”.
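To give a concrete idea of how such a table can be produced, here is a minimal Python sketch that counts how often each pair of consecutive letters occurs within the words of a text and divides the counts by the total number of pairs. The file name and the exact preprocessing are assumptions made for illustration, not the exact procedure used to build the table above.

```python
import re
from collections import Counter

def letter_pair_probabilities(text):
    # Split the text into lowercase words and count how often each
    # pair of consecutive letters occurs within a word.
    words = re.findall(r"[a-z]+", text.lower())
    pair_counts = Counter(pair for word in words for pair in zip(word, word[1:]))
    total = sum(pair_counts.values())
    # Divide each count by the total number of pairs to get probabilities.
    return {pair: count / total for pair, count in pair_counts.items()}

# Placeholder file name; the table above used the first 50,000 characters.
with open("romeo_and_juliet.txt", encoding="utf-8") as f:
    probabilities = letter_pair_probabilities(f.read()[:50000])

print(probabilities.get(("e", "a"), 0.0))
```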

This effort was part of Markov’s broader work on probability theory, in which events are linked and the likelihood of the next event depends on the current state. This idea is now known as the Markov chain.


Shannon, on the other hand, explored text generation. He first generated sentences by picking letters at random, and subsequently incorporated the probabilities of letters in text into the generation process. The more complex the probabilistic model (that is, the longer the sequence of prior characters taken into account), the more the generated sentences looked like something sensible.

As an example, the following snippet shows a randomly generated string.

XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD

The following shows a snippet where the probability of the next character is based on the two previous characters.

IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE
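This kind of generation can be sketched in a few lines of Python: record, for every pair of consecutive characters in a source text, how often each character follows that pair, and then repeatedly sample the next character from those counts. The file name and the starting pair below are placeholder assumptions, not Shannon's original setup.

```python
import random
from collections import Counter, defaultdict

def build_model(text):
    # For each pair of consecutive characters, count how often each
    # character follows that pair in the text.
    counts = defaultdict(Counter)
    for i in range(len(text) - 2):
        counts[text[i:i + 2]][text[i + 2]] += 1
    return counts

def generate(counts, start, length=100):
    # Repeatedly sample the next character given the two previous
    # characters, weighted by the observed counts.
    result = start
    while len(result) < length:
        followers = counts[result[-2:]]
        if not followers:
            break
        characters, weights = zip(*followers.items())
        result += random.choices(characters, weights=weights)[0]
    return result

# Placeholder source text and starting pair.
with open("romeo_and_juliet.txt", encoding="utf-8") as f:
    model = build_model(f.read()[:50000].upper())

print(generate(model, "TH"))
```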

Shannon also explored the notion that anyone who knows about the English language has implicit knowledge that allows filling in — or attempting to fill in — gaps in text. He demonstrated the notion through a game, where a person unfamiliar with a string of text would guess the letters one by one. If a letter was incorrectly guessed, the person would guess again, until the guess was correct, at which point the person would move to guessing the next character.
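A console version of such a guessing game might look roughly as follows; the passage to be guessed is a placeholder chosen for this example.

```python
def shannons_game(passage):
    # The player guesses the passage one character at a time; a wrong
    # guess simply means guessing again for the same position.
    total_guesses = 0
    revealed = ""
    for target in passage:
        while True:
            total_guesses += 1
            guess = input(f"Passage so far: '{revealed}'. Guess the next character: ")
            if guess.upper() == target.upper():
                revealed += target
                break
    print(f"Guessed {len(passage)} characters in {total_guesses} guesses.")

# Placeholder passage; any string works.
shannons_game("LANGUAGE")
```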

Try playing such a game below.

Shannon's Game

Clicking the "Make Guess" button opens up a prompt that asks for a character. Type in a character and press submit. If the guess is correct, you move to guessing the next character.


Shannon’s efforts led to the publication of the article A Mathematical Theory of Communication, in which he explored the notions of information and communication. The article is considered a foundational work in the fields of information theory and communication theory.


Experiments with words

As part of the article A Mathematical Theory of Communication, Shannon also explored the possibility of generating text using words. The following sentence was created by choosing words independently of one another, but accounting for their probability of appearing in English text.

REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE

Similarly, the following text was created by choosing each word based on its probability of appearing after the previous word (starting with the word “the”).

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED

Through experiments, Shannon demonstrated that the generated text looked more like English text when the probability of the next word was based on the previous word.

Following the same idea, we can create new text using words from Romeo and Juliet, again focusing on the first 50,000 characters of the text. The concept is the same as the one Shannon worked on: we start with a random word, and then pick each following word based on how likely it is to appear after the previous word.
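As a rough illustration, the following Python sketch counts, for each word in the source text, which words follow it, and then generates a sequence by repeatedly sampling the next word in proportion to those counts. The file name, the tokenization, and the starting word are assumptions made for this example, not the exact setup used in the course material.

```python
import random
import re
from collections import Counter, defaultdict

def build_word_model(text):
    # For each word, count how often each other word follows it.
    words = re.findall(r"[a-z']+", text.lower())
    followers = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        followers[current_word][next_word] += 1
    return followers

def generate(followers, start, length=20):
    # Start from the given word and repeatedly pick the next word in
    # proportion to how often it followed the previous one.
    sentence = [start]
    for _ in range(length - 1):
        options = followers[sentence[-1]]
        if not options:
            break
        words, weights = zip(*options.items())
        sentence.append(random.choices(words, weights=weights)[0])
    return " ".join(sentence)

# Placeholder file name; first 50,000 characters, as in the examples above.
with open("romeo_and_juliet.txt", encoding="utf-8") as f:
    model = build_word_model(f.read()[:50000])

print(generate(model, start="the"))
```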



Underlying principles

When considering the above experiments with letters and words, the underlying principles are very much the same. Using data, we can construct a probabilistic model, which we can use to predict the likely next event — regardless of whether the data contains rock paper scissors moves, letters, or words.

Next, we’ll look in a bit more detail at a language model that is based on words, the word n-gram language model.