Markov Chains and n-gram Models
Learning Objectives
- You understand what a Markov chain is and how it relates to language modeling.
- You know what n-gram language models are and how they work.
- You know examples of how n-gram language models have been used in practical applications.
Markov chains
A Markov chain is a mathematical system that transitions between states according to certain probabilities, where the probability of moving to the next state depends only on the current state — not on the full history of how the system arrived at that state. This property is called the Markov property or “memorylessness.”
This mathematical framework provides the theoretical basis for n-gram language models and many other probabilistic sequence models.
Formally, if $X_1, X_2, \ldots, X_n$ are random variables representing states (for example, words in a sentence), the Markov property states that:

$$P(X_{n+1} = x \mid X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = P(X_{n+1} = x \mid X_n = x_n)$$
This equation expresses that to predict the future (the next state), you only need to know the present (the current state) — not the entire sequence of past states. This simplification makes the model computationally tractable while still capturing important sequential dependencies.
Consider this very simple corpus of text consisting of two sentences:
- “I like cats”
- “I like dogs”
From this text, we can construct a Markov chain where:
- At the start, we always see “I” (probability 1.0, since both sentences start with “I”).
- After observing “I”, we always see “like” next (probability 1.0, since this transition occurred 2 out of 2 times).
- After observing “like”, we might see “cats” (probability 0.5) or “dogs” (probability 0.5), since each occurred once out of two occurrences of “like”.
- After observing “cats” or “dogs”, the sentence ends.
A Markov chain can be visualized as a directed graph where each node represents a state (such as a word), and arrows between nodes represent possible transitions. Each arrow is weighted by the probability of that transition occurring given the current state. As an example, the above text can be represented as a Markov chain with states “I”, “like”, “cats”, and “dogs”, along with start and end states — this is visualized in Figure 1 below.
This simple model suffices to generate new sequences. Starting from “I”, we follow the arrows: first to “like” (the only option), then randomly to either “cats” or “dogs” based on their probabilities. This produces sentences like “I like cats” or “I like dogs” — which match our training data — but the model has learned a pattern it can apply probabilistically to generate outputs.
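To make this concrete, here is a minimal Python sketch of the same process, assuming nothing beyond the standard library. The “<s>” and “</s>” tokens are our own markers for the start and end states; they are not part of the corpus.

```python
import random
from collections import defaultdict

corpus = ["I like cats", "I like dogs"]

# Count transitions between consecutive tokens, with <s> and </s>
# marking the start and end of each sentence.
transition_counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for current, nxt in zip(tokens, tokens[1:]):
        transition_counts[current][nxt] += 1

# Normalise counts into transition probabilities.
transitions = {
    state: {nxt: count / sum(followers.values()) for nxt, count in followers.items()}
    for state, followers in transition_counts.items()
}

def generate_sentence():
    """Follow the chain from <s> to </s>, sampling each step by its probability."""
    state, words = "<s>", []
    while True:
        options = transitions[state]
        state = random.choices(list(options), weights=list(options.values()))[0]
        if state == "</s>":
            return " ".join(words)
        words.append(state)

print(generate_sentence())  # "I like cats" or "I like dogs"
```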
When applied to text, each “state” in a Markov chain can represent different linguistic units: individual letters, syllables, or words. By analyzing a corpus of text, we estimate the transition probabilities — how likely each state is to follow another. The resulting chain can then generate new text resembling the training data, or predict what is likely to come next in a sequence.
Shannon’s experiments with words
After his character-level experiments with letters, discussed in the previous chapter, Claude Shannon also conducted experiments generating text at the word level. His goal was to demonstrate how dramatically the amount of context used affects how natural the generated text appears.
Shannon showed that:
- When words are chosen only by their overall frequency in the language (ignoring context entirely), the result is incoherent.
- When words are chosen while considering the context of recent words, the results become progressively more coherent as more context is included.
Choosing words purely by their frequency in English, similar to a unigram model, might yield output like:
REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE
This text contains recognizable English words, but they are arranged in completely nonsensical order. There is no grammatical structure or semantic coherence — just a random jumble of common English words.
When the choice of each word depends on the single word that came just before it (similar to a bigram model), the text already appears much more English-like:
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED
While this still doesn’t make complete sense as a whole, you can observe fragments of grammatical structure emerging: “frontal attack on,” “the character of,” “another method for.” The local word sequences sound more natural, even though the overall meaning remains unclear and the text doesn’t cohere into a meaningful passage.
Whether working with letters, words, or even game moves (as in the rock-paper-scissors examples from earlier chapters), the underlying principle remains consistent: collect data → estimate probabilities → use those probabilities to predict the next step.
This simple but powerful approach forms the foundation of statistical language modeling.
Word n-gram language model
The word n-gram language model makes Shannon’s experimental intuition systematic and provides a practical framework for building language models.
Instead of the next word depending only on the current word (a first-order Markov chain), we allow the probability of the next word to depend on the last n − 1 words. This provides more context and typically produces more coherent text.
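In standard notation, an n-gram model approximates the probability of a word given its full history by conditioning only on the preceding n − 1 words, and estimates that conditional probability from counts in the training corpus. One common way to write this (the exact notation varies between textbooks) is:

$$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \frac{\text{count}(w_{i-n+1}, \ldots, w_{i-1}, w_i)}{\text{count}(w_{i-n+1}, \ldots, w_{i-1})}$$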
The simplest n-gram model is the 1-gram (unigram) model. In this model, the probability of each word is based solely on how frequently that word appears in the training text. No context is considered — each word is treated as independent of all others.
Consider these three simple training sentences:
- “The capital of Finland is Helsinki”
- “The capital of Sweden is Stockholm”
- “The capital of Norway is Oslo”
Counting word frequencies across all three sentences (18 words total) yields a probability distribution:
| Word | Count | Probability |
|---|---|---|
| The | 3 | 3/18 ≈ 0.167 |
| capital | 3 | 3/18 ≈ 0.167 |
| of | 3 | 3/18 ≈ 0.167 |
| is | 3 | 3/18 ≈ 0.167 |
| Finland | 1 | 1/18 ≈ 0.056 |
| Helsinki | 1 | 1/18 ≈ 0.056 |
| Sweden | 1 | 1/18 ≈ 0.056 |
| Stockholm | 1 | 1/18 ≈ 0.056 |
| Norway | 1 | 1/18 ≈ 0.056 |
| Oslo | 1 | 1/18 ≈ 0.056 |
This distribution allows generating text by randomly selecting words in proportion to their frequencies. However, because word order and context are completely ignored, the results are typically incoherent jumbles. You might generate “Helsinki of capital The is” or “Sweden The Finland of Oslo” — grammatically nonsensical sequences where individual words are valid but their arrangement is meaningless.
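As a rough sketch of how such a unigram model can be built and sampled, assuming plain Python and no external libraries (the variable names below are our own):

```python
import random
from collections import Counter

sentences = [
    "The capital of Finland is Helsinki",
    "The capital of Sweden is Stockholm",
    "The capital of Norway is Oslo",
]

# Unigram model: each word's probability is simply its relative frequency.
words = [w for s in sentences for w in s.split()]
counts = Counter(words)                 # e.g. counts["The"] == 3
total = sum(counts.values())            # 18 words in total
probabilities = {w: c / total for w, c in counts.items()}

# Generate six words independently, each drawn in proportion to its frequency.
sample = random.choices(list(probabilities), weights=list(probabilities.values()), k=6)
print(" ".join(sample))                 # e.g. "of is The capital is Oslo"
```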
Try generating unigram-based sentences below using text from Shakespeare’s Romeo and Juliet:
Notice how the individual words all come from the play, but they don’t form meaningful sentences or even grammatical fragments.
2-gram language model
The 2-gram (bigram) model represents a significant improvement by considering the previous word when predicting the next word. This captures basic sequential dependencies in language — patterns in which words typically follow other words.
Using the same example sentences, we can build a bigram model by counting word pairs:
- After “The”, the word “capital” always appears (3 out of 3 times).
- After “capital”, the word “of” always appears (3 out of 3 times).
- After “of”, we observe “Finland”, “Sweden”, or “Norway” (each appearing once, so probability 1/3 each).
- After “Finland”, “Sweden”, or “Norway”, the word “is” always appears (1 out of 1 time for each).
- After “is”, we observe “Helsinki”, “Stockholm”, or “Oslo” (each appearing once, so probability 1/3 each).
This yields conditional probability distributions:
| Previous word | Probability distribution of next word |
|---|---|
| The | capital: 1.0 |
| capital | of: 1.0 |
| of | Finland: 1/3, Sweden: 1/3, Norway: 1/3 |
| Finland | is: 1.0 |
| Sweden | is: 1.0 |
| Norway | is: 1.0 |
| is | Helsinki: 1/3, Stockholm: 1/3, Oslo: 1/3 |
This model produces considerably more structured text because it respects word-to-word transitions observed in training data. Starting with “The”, we’ll always get “capital” next, then “of”, then one of the three countries, forming grammatically structured phrases.
However, the bigram model has fundamental limitations. It loses track of longer context — it has no memory beyond the immediately preceding word. For example, given “The capital of Finland is”, the model only sees “is” as context and might still predict “Stockholm” or “Oslo” (each with probability 1/3), even though “Finland” appeared earlier in the sentence. The model has no way to remember that we’re discussing Finland specifically, because it only conditions on the single previous word.
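A small Python sketch makes this limitation concrete: the bigram counts are indexed by a single previous word, so the distribution after “is” is identical no matter which country appeared earlier. The helper function name below is our own, not from any particular library.

```python
from collections import defaultdict

sentences = [
    "The capital of Finland is Helsinki",
    "The capital of Sweden is Stockholm",
    "The capital of Norway is Oslo",
]

# Count how often each word follows each previous word.
bigram_counts = defaultdict(lambda: defaultdict(int))
for s in sentences:
    tokens = s.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        bigram_counts[prev][nxt] += 1

def next_word_distribution(prev):
    """Conditional distribution P(next word | previous word) estimated from counts."""
    followers = bigram_counts[prev]
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}

# Only the single previous word is visible as context, so after "is"
# all three capitals are equally likely, whatever country came before.
print(next_word_distribution("is"))
# {'Helsinki': 0.33..., 'Stockholm': 0.33..., 'Oslo': 0.33...}
```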
Generate bigram sentences using text from Romeo and Juliet:
You should notice that the text looks more natural than unigrams — there are recognizable short phrases and local grammatical structure — but longer coherence remains lacking.
3-gram language model
The 3-gram (trigram) model takes the previous two words into account when predicting the next word. This provides enough context to resolve many ambiguities that bigrams cannot handle.
For our example sentences, the trigram model captures sequences like:
| Previous two words | Probability distribution of next word |
|---|---|
| The capital | of: 1.0 |
| capital of | Finland: 1/3, Sweden: 1/3, Norway: 1/3 |
| of Finland | is: 1.0 |
| of Sweden | is: 1.0 |
| of Norway | is: 1.0 |
| Finland is | Helsinki: 1.0 |
| Sweden is | Stockholm: 1.0 |
| Norway is | Oslo: 1.0 |
Now the model correctly captures the relationship between countries and their capitals. Given “The capital of Finland is”, the model sees “Finland is” as the two-word context and knows with certainty that “Helsinki” should follow. It will never incorrectly predict “Stockholm” or “Oslo” after “Finland is”, because those sequences never appeared in training data.
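Extending the same sketch to trigrams shows the difference: indexing the counts by the previous two words keeps the country in the context, so “Finland is” leads to “Helsinki” with certainty.

```python
from collections import defaultdict

sentences = [
    "The capital of Finland is Helsinki",
    "The capital of Sweden is Stockholm",
    "The capital of Norway is Oslo",
]

# Count how often each word follows each pair of preceding words.
trigram_counts = defaultdict(lambda: defaultdict(int))
for s in sentences:
    tokens = s.split()
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        trigram_counts[(w1, w2)][w3] += 1

# With two words of context, the country disambiguates the capital.
print(dict(trigram_counts[("Finland", "is")]))   # {'Helsinki': 1} -> probability 1.0
print(dict(trigram_counts[("Sweden", "is")]))    # {'Stockholm': 1}
```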
Try generating trigram-based sentences using text from Romeo and Juliet:
The generated text should display better local coherence, with recognizable phrases and more consistent grammatical structure over longer spans.
Beyond 3-gram models
The approach naturally extends to higher-order n-grams:
- 4-grams condition on the previous 3 words
- 5-grams condition on the previous 4 words
- And so on for larger values of n
The more context words included, the richer and more specific the contextual information becomes. Larger values of n can capture longer-range dependencies and produce more coherent text that maintains consistency over longer spans.
However, there’s a fundamental trade-off: larger n means exponentially more possible n-gram sequences to track. For example, with a vocabulary of 10,000 words:
- Bigrams have 10,000 × 10,000 = 100 million possible combinations
- Trigrams have 10,000 × 10,000 × 10,000 = 1 trillion possible combinations
- 4-grams have 10,000 × 10,000 × 10,000 × 10,000 = 10 quadrillion possible combinations
Most of these combinations will never appear in any finite training corpus, making probability estimation increasingly difficult and unreliable. This is known as the data sparsity problem. Even with very large training corpora, most possible longer n-grams simply won’t occur, leaving the model unable to estimate their probabilities reliably. When the model encounters an n-gram it’s never seen during training, it has no basis for prediction.
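As a rough illustration of this trade-off, the sketch below (our own, using only the standard library) counts how many distinct n-grams actually occur in a toy corpus and compares that with the number of possible n-grams over its vocabulary. With realistic corpora and vocabularies the gap grows far faster.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (tuples of n consecutive tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy corpus; replace with any tokenised text you have available.
tokens = ("the capital of finland is helsinki "
          "the capital of sweden is stockholm "
          "the capital of norway is oslo").split()

vocab_size = len(set(tokens))
for n in (1, 2, 3, 4):
    observed = len(ngram_counts(tokens, n))
    possible = vocab_size ** n
    print(f"n={n}: {observed} distinct n-grams observed out of {possible} possible")
```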
Try generating 4-gram text below and observe both the increased coherence and the potential challenges:
Applications of n-gram models
Despite their limitations, n-gram models were historically important and widely deployed in practical applications before neural language models became dominant. Their applications included:
Speech recognition systems used n-gram models to predict likely word sequences from acoustic signals. When acoustic analysis suggests several possible words, the language model helps choose the most likely option based on context — favoring “recognize speech” over “wreck a nice beach” even if the acoustic signals are similar.
Machine translation systems employed n-gram models to select fluent translations. When multiple translations are grammatically possible, the language model favors those that sound more natural in the target language.
Text input prediction for keyboards and mobile devices uses n-gram models to suggest likely next words, enabling faster typing through autocomplete features.
Spelling correction systems use n-gram models to identify likely intended words when misspellings are detected. The model favors corrections that produce more probable word sequences.