Probabilities and Language

Markov Chains and n-gram Models


Learning Objectives

  • You understand what a Markov chain is and how it relates to language modeling.
  • You know what n-gram language models are and how they work.
  • You know examples of how n-gram language models have been used in practical applications.

Markov chains

A Markov chain is a mathematical system that transitions between states according to certain probabilities, where the probability of moving to the next state depends only on the current state — not on the full history of how the system arrived at that state. This property is called the Markov property or “memorylessness.”

This mathematical framework provides the theoretical basis for n-gram language models and many other probabilistic sequence models.

Formally, if X₁, X₂, …, Xₙ are random variables representing states (for example, words in a sentence), the Markov property states that:

P(Xₙ₊₁ | X₁, X₂, …, Xₙ) = P(Xₙ₊₁ | Xₙ)

This equation expresses that to predict the future (the next state), you only need to know the present (the current state) — not the entire sequence of past states. This simplification makes the model computationally tractable while still capturing important sequential dependencies.

Consider this very simple corpus of text consisting of two sentences:

  • “I like cats”
  • “I like dogs”

From this text, we can construct a Markov chain where:

  • At the start, we always see “I” (probability 1.0, since both sentences start with “I”).
  • After observing “I”, we always see “like” next (probability 1.0, since this transition occurred 2 out of 2 times).
  • After observing “like”, we might see “cats” (probability 0.5) or “dogs” (probability 0.5), since each occurred once out of two occurrences of “like”.
  • After observing “cats” or “dogs”, the sentence ends.

A Markov chain can be visualized as a directed graph where each node represents a state (such as a word), and arrows between nodes represent possible transitions. Each arrow is weighted by the probability of that transition occurring given the current state. As an example, the above text can be represented as a Markov chain with states “I”, “like”, “cats”, and “dogs”, along with start and end states — this is visualized in Figure 1 below.

Figure 1 — A Markov chain representing the simple text corpus “I like cats” and “I like dogs”. Each arrow is labeled with the probability of that transition occurring.

This simple model suffices to generate new sequences. Starting from “I”, we follow the arrows: first to “like” (the only option), then randomly to either “cats” or “dogs” based on their probabilities. This produces sentences like “I like cats” or “I like dogs” — which match our training data — but the model has learned a pattern it can apply probabilistically to generate outputs.
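To make this concrete, the chain above can be built and sampled with a few lines of Python. The sketch below is illustrative rather than part of the text; the <s> and </s> start/end markers are an implementation choice standing in for the start and end states in Figure 1.

    # Build a first-order Markov chain from the toy corpus and sample from it.
    import random
    from collections import defaultdict, Counter

    corpus = ["I like cats", "I like dogs"]

    # Count how often each word follows another; <s> and </s> mark the
    # start and end states.
    transitions = defaultdict(Counter)
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for current, nxt in zip(words, words[1:]):
            transitions[current][nxt] += 1

    def sample_sentence():
        """Follow the chain from <s> to </s>, picking each next word in
        proportion to its observed transition count."""
        word, generated = "<s>", []
        while True:
            counts = transitions[word]
            word = random.choices(list(counts), weights=list(counts.values()))[0]
            if word == "</s>":
                return " ".join(generated)
            generated.append(word)

    print(sample_sentence())  # "I like cats" or "I like dogs", each with probability 0.5

Because every transition probability is estimated by counting, the sampled sentences occur in the same proportions as in the training data.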

Markov chains in language

When applied to text, each “state” in a Markov chain can represent different linguistic units: individual letters, syllables, or words. By analyzing a corpus of text, we estimate the transition probabilities — how likely each state is to follow another. The resulting chain can then generate new text resembling the training data, or predict what is likely to come next in a sequence.



Shannon’s experiments with words

After experimenting with text generation at the character level, as discussed in the previous chapter, Claude Shannon also conducted experiments at the word level. His goal was to demonstrate how the amount of context used dramatically affects how natural the generated text appears.

Shannon showed that:

  • When words are chosen only by their overall frequency in the language (ignoring context entirely), the result is incoherent.
  • When words are chosen while considering the context of recent words, the results become progressively more coherent as more context is included.

Choosing words purely by their frequency in English, similar to a unigram model, might yield output like:

REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE

This text contains recognizable English words, but they are arranged in completely nonsensical order. There is no grammatical structure or semantic coherence — just a random jumble of common English words.

When the choice of each word depends on the single word that came just before it (similar to a bigram model), the text already appears much more English-like:

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED

While this still doesn’t make complete sense as a whole, you can observe fragments of grammatical structure emerging: “frontal attack on,” “the character of,” “another method for.” The local word sequences sound more natural, even though the overall meaning remains unclear and the text doesn’t cohere into a meaningful passage.


Underlying principle

Whether working with letters, words, or even game moves (as in the rock-paper-scissors examples from earlier chapters), the underlying principle remains consistent: collect data → estimate probabilities → use those probabilities to predict the next step.

This simple but powerful approach forms the foundation of statistical language modeling.


Word n-gram language model

The word n-gram language model makes Shannon’s experimental intuition systematic and provides a practical framework for building language models.

Instead of the next word depending only on the current word (a first-order Markov chain), we allow the probability of the next word to depend on the last n − 1 words. This provides more context and typically produces more coherent text.
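In symbols, an n-gram model approximates the full conditional probability of the next word by keeping only the most recent n − 1 words of context:

P(Xₖ₊₁ | X₁, X₂, …, Xₖ) ≈ P(Xₖ₊₁ | Xₖ₋ₙ₊₂, …, Xₖ)

With n = 2 this is exactly the first-order Markov property shown earlier; larger n simply widens the window of remembered context.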

The simplest n-gram model is the 1-gram (unigram) model. In this model, the probability of each word is based solely on how frequently that word appears in the training text. No context is considered — each word is treated as independent of all others.

Consider these three simple training sentences:

  • “The capital of Finland is Helsinki”
  • “The capital of Sweden is Stockholm”
  • “The capital of Norway is Oslo”

Counting word frequencies across all three sentences (18 words total) yields a probability distribution:

Word        Count   Probability
The         3       3/18 ≈ 0.167
capital     3       3/18 ≈ 0.167
of          3       3/18 ≈ 0.167
is          3       3/18 ≈ 0.167
Finland     1       1/18 ≈ 0.056
Helsinki    1       1/18 ≈ 0.056
Sweden      1       1/18 ≈ 0.056
Stockholm   1       1/18 ≈ 0.056
Norway      1       1/18 ≈ 0.056
Oslo        1       1/18 ≈ 0.056

This distribution allows generating text by randomly selecting words in proportion to their frequencies. However, because word order and context are completely ignored, the results are typically incoherent jumbles. You might generate “Helsinki of capital The is” or “Sweden The Finland of Oslo” — grammatically nonsensical sequences where individual words are valid but their arrangement is meaningless.
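As a quick sketch (illustrative Python, not part of the text), a unigram generator needs nothing more than these frequencies:

    # Unigram sketch: sample each word independently, in proportion to its frequency.
    import random
    from collections import Counter

    corpus = [
        "The capital of Finland is Helsinki",
        "The capital of Sweden is Stockholm",
        "The capital of Norway is Oslo",
    ]

    counts = Counter(word for sentence in corpus for word in sentence.split())
    words, freqs = zip(*counts.items())

    # Draw six words, each draw ignoring all the others.
    print(" ".join(random.choices(words, weights=freqs, k=6)))
    # e.g. "of The is capital Helsinki The"

Each draw is independent of the previous ones, which is precisely why no grammatical structure emerges.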

Try generating unigram-based sentences below using text from Shakespeare’s Romeo and Juliet:

Notice how the individual words all come from the play, but they don’t form meaningful sentences or even grammatical fragments.


2-gram language model

The 2-gram (bigram) model represents a significant improvement by considering the previous word when predicting the next word. This captures basic sequential dependencies in language — patterns in which words typically follow other words.

Using the same example sentences, we can build a bigram model by counting word pairs:

  • After “The”, the word “capital” always appears (3 out of 3 times).
  • After “capital”, the word “of” always appears (3 out of 3 times).
  • After “of”, we observe “Finland”, “Sweden”, or “Norway” (each appearing once, so probability 1/3 each).
  • After “is”, we observe “Helsinki”, “Stockholm”, or “Oslo” (each appearing once, so probability 1/3 each).

This yields conditional probability distributions:

Previous word   Probability distribution of next word
The             capital: 1.0
capital         of: 1.0
of              Finland: 1/3, Sweden: 1/3, Norway: 1/3
Finland         is: 1.0
Sweden          is: 1.0
Norway          is: 1.0
is              Helsinki: 1/3, Stockholm: 1/3, Oslo: 1/3

This model produces considerably more structured text because it respects word-to-word transitions observed in training data. Starting with “The”, we’ll always get “capital” next, then “of”, then one of the three countries, forming grammatically structured phrases.

However, the bigram model has fundamental limitations. It loses track of longer context — it has no memory beyond the immediately preceding word. For example, given “The capital of Finland is”, the model only sees “is” as context and might still predict “Stockholm” or “Oslo” (each with probability 1/3), even though “Finland” appeared earlier in the sentence. The model has no way to remember that we’re discussing Finland specifically, because it only conditions on the single previous word.
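This limitation is easy to reproduce in code. The sketch below (again illustrative rather than from the text) builds the bigram counts from the three sentences and shows that "Finland" has already been forgotten by the time the model predicts the word after "is":

    # Bigram sketch: condition each prediction on the single previous word.
    import random
    from collections import defaultdict, Counter

    corpus = [
        "The capital of Finland is Helsinki",
        "The capital of Sweden is Stockholm",
        "The capital of Norway is Oslo",
    ]

    bigram_counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            bigram_counts[prev][nxt] += 1

    def next_word(prev):
        """Sample the next word given only the previous word."""
        candidates = bigram_counts[prev]
        return random.choices(list(candidates), weights=list(candidates.values()))[0]

    print(next_word("is"))  # "Helsinki", "Stockholm", or "Oslo", each with probability 1/3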

Generate bigram sentences using text from Romeo and Juliet:

You should notice that the text looks more natural than unigrams — there are recognizable short phrases and local grammatical structure — but longer coherence remains lacking.


3-gram language model

The 3-gram (trigram) model takes the previous two words into account when predicting the next word. This provides enough context to resolve many ambiguities that bigrams cannot handle.

For our example sentences, the trigram model captures sequences like:

Previous two words   Probability distribution of next word
The capital          of: 1.0
capital of           Finland: 1/3, Sweden: 1/3, Norway: 1/3
of Finland           is: 1.0
of Sweden            is: 1.0
of Norway            is: 1.0
Finland is           Helsinki: 1.0
Sweden is            Stockholm: 1.0
Norway is            Oslo: 1.0

Now the model correctly captures the relationship between countries and their capitals. Given “The capital of Finland is”, the model sees “Finland is” as the two-word context and knows with certainty that “Helsinki” should follow. It will never incorrectly predict “Stockholm” or “Oslo” after “Finland is”, because those sequences never appeared in training data.
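Extending the earlier sketch to two words of context shows the difference directly (illustrative code, not from the text):

    # Trigram sketch: condition on the previous two words.
    from collections import defaultdict, Counter

    corpus = [
        "The capital of Finland is Helsinki",
        "The capital of Sweden is Stockholm",
        "The capital of Norway is Oslo",
    ]

    trigram_counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            trigram_counts[(w1, w2)][w3] += 1

    # With "Finland is" as context, only one continuation was ever observed.
    print(trigram_counts[("Finland", "is")])  # Counter({'Helsinki': 1})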

Try generating trigram-based sentences using text from Romeo and Juliet:

The generated text should display better local coherence, with recognizable phrases and more consistent grammatical structure over longer spans.


Beyond 3-gram models

The approach naturally extends to higher-order n-grams:

  • 4-grams condition on the previous 3 words
  • 5-grams condition on the previous 4 words
  • And so on for larger values of n

The more context words included, the richer and more specific the contextual information becomes. Larger values of n can capture longer-range dependencies and produce more coherent text that maintains consistency over longer spans.

However, there’s a fundamental trade-off: larger n means exponentially more possible n-gram sequences to track. For example, with a vocabulary of 10,000 words:

  • Bigrams have 10,000 * 10,000 = 100 million possible combinations
  • Trigrams have 10,000 * 10,000 * 10,000 = 1 trillion possible combinations
  • 4-grams have 10 quadrillion possible combinations

Most of these combinations will never appear in any finite training corpus, making probability estimation increasingly difficult and unreliable. This is known as the data sparsity problem. Even with very large training corpora, most possible longer n-grams simply won’t occur, leaving the model unable to estimate their probabilities reliably. When the model encounters an n-gram it’s never seen during training, it has no basis for prediction.
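One way to observe the sparsity problem yourself is to count how many distinct n-grams in a corpus occur only once. The sketch below assumes a plain-text file (the filename romeo_and_juliet.txt is a placeholder, not a file supplied with the text); in a typical corpus, the share of n-grams seen exactly once grows quickly with n:

    # Sketch: how many distinct n-grams occur exactly once in a corpus?
    from collections import Counter

    with open("romeo_and_juliet.txt", encoding="utf-8") as f:  # placeholder path
        words = f.read().split()

    for n in (1, 2, 3, 4):
        ngrams = Counter(zip(*(words[i:] for i in range(n))))
        singletons = sum(1 for count in ngrams.values() if count == 1)
        print(f"{n}-grams: {len(ngrams)} distinct, "
              f"{singletons / len(ngrams):.0%} occur only once")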

Try generating 4-gram text below and observe both the increased coherence and the potential challenges:


Applications of n-gram models

Despite their limitations, n-gram models were historically important and widely deployed in practical applications before neural language models became dominant. Their applications included:

Speech recognition systems used n-gram models to predict likely word sequences from acoustic signals. When acoustic analysis suggests several possible words, the language model helps choose the most likely option based on context — favoring “recognize speech” over “wreck a nice beach” even if the acoustic signals are similar.

Machine translation systems employed n-gram models to select fluent translations. When multiple translations are grammatically possible, the language model favors those that sound more natural in the target language.

Text input prediction for keyboards and mobile devices used n-gram models to suggest likely next words, enabling faster typing through autocomplete features.
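As a toy illustration of the idea (not a description of any real keyboard's implementation), ranking candidate next words by bigram count is already enough for a crude autocomplete:

    # Toy autocomplete: rank next-word suggestions by bigram count.
    from collections import defaultdict, Counter

    training_text = "I like cats . I like dogs . I like my dog ."  # tiny made-up corpus
    words = training_text.split()

    bigram_counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        bigram_counts[prev][nxt] += 1

    def suggest(last_word, k=3):
        """Return the k most frequent words observed after last_word."""
        return [word for word, _ in bigram_counts[last_word].most_common(k)]

    print(suggest("like"))  # ['cats', 'dogs', 'my']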

Spelling correction systems used n-gram models to identify likely intended words when misspellings were detected. The model favors corrections that produce more probable word sequences.
