Probabilities and Language

Limitations of Early Models


Learning Objectives

  • You know the main limitations of early statistical language models (Markov chains, n-grams).
  • You understand why these limitations motivated the development of neural language models.
  • You can explain challenges such as fixed context windows, data sparsity, and lack of semantic understanding.

Why early models fall short

Early approaches like Markov chains and n-gram models represented historically important steps toward language modeling. They demonstrated that using probability calculations based on observed patterns could generate text resembling human language in its surface structure. For specific tasks with limited scope — such as predicting the next word in highly predictable contexts — they could perform adequately.

However, these models had fundamental limitations that prevented them from scaling to meet the complex demands of modern natural language processing applications like machine translation, dialogue systems, document summarization, or question answering. Understanding these limitations helps explain why the field eventually shifted toward neural approaches.


Limitation: Fixed context window

Markov chains and n-gram models rely on a fixed, limited window of previous words to make predictions. This constraint is built into their fundamental architecture rather than being an incidental implementation detail.

  • A bigram model considers only the last word.
  • A trigram model considers only the last two words.
  • A 5-gram model considers only the last four words.

This creates serious problems when information crucial for the current prediction is located farther back in the text. Language frequently contains long-range dependencies — relationships between words separated by many intervening words.

Consider this example:

“Finland is a country in Northern Europe known for its education system and natural beauty. Its capital is …”

The word “Finland” appears at the beginning of the sentence, but it’s crucial for predicting that the capital is “Helsinki.” A 5-gram model examining only “beauty. Its capital is” would have no access to “Finland” and would lack the critical information needed for accurate prediction. Without that context, the model can only fall back on whatever happens to follow “capital is” most often in its training data, and might produce any capital city.
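
A short Python sketch makes the truncation concrete (the naive whitespace tokenization is used purely for illustration): the model never even receives the word “Finland.”

    def ngram_context(tokens, n):
        """An n-gram model conditions only on the last n-1 tokens of the history."""
        return tokens[-(n - 1):]

    history = ("Finland is a country in Northern Europe known for its "
               "education system and natural beauty. Its capital is").split()

    print(ngram_context(history, 5))
    # ['beauty.', 'Its', 'capital', 'is'] -- "Finland" is outside the window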

This fixed context window becomes increasingly problematic for:

  • Long sentences with complex grammatical structure
  • Paragraphs where information from earlier sentences matters for understanding later ones
  • Any text requiring understanding of relationships that span significant distances

Human language routinely creates dependencies across much longer distances than n-gram models can capture. A pronoun might refer to a noun mentioned several sentences earlier. An answer to a question might require information from paragraphs above. These patterns are beyond the reach of fixed-window models.


Limitation: Data sparsity

As 𝑛 increases (as we examine longer sequences), the number of possible n-gram combinations grows explosively. This exemplifies the curse of dimensionality — a problem arising when working in high-dimensional spaces where the volume of possible combinations vastly exceeds what can be adequately sampled.

Consider the mathematics with a vocabulary of 10,000 words:

  • There are 10,000 possible unigrams.
  • There are 10,000 × 10,000 = 100 million possible bigrams.
  • There are 10,000³ = 1 trillion possible trigrams.
  • There are 10,000⁴ = 10 quadrillion possible 4-grams.
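
These counts are easy to verify: the number of possible n-grams is simply the vocabulary size raised to the power n, as this short Python check shows.

    vocab_size = 10_000
    for n in range(1, 5):
        print(f"{n}-grams: {vocab_size ** n:,} possible sequences")

    # 1-grams: 10,000
    # 2-grams: 100,000,000
    # 3-grams: 1,000,000,000,000
    # 4-grams: 10,000,000,000,000,000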

Even with large training corpora, most longer n-grams will never appear in the training data, despite representing perfectly valid sequences:

  • A corpus of several million words contains at most a few million trigram occurrences, so only a few million distinct trigrams can be observed at best.
  • But billions of valid trigrams exist in the language that could theoretically occur, out of the trillion combinatorially possible ones.
  • The vast majority of possible n-grams will have zero observed occurrences.

This sparsity creates serious problems:

  • The model assigns zero probability to unseen sequences, even if they are grammatically correct and semantically meaningful. If “chocolate birthday cake” never appeared in training data but “vanilla birthday cake” did, the model cannot generalize between them.
  • Probability estimates become increasingly unreliable as n grows, because most n-grams have very few or zero observations.
  • Obtaining reliable estimates for larger n would require exponentially more training data — an impossible requirement that grows faster than any feasible data collection.
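
The zero-probability failure is easy to reproduce. The sketch below estimates trigram probabilities from a tiny invented corpus (both the corpus and the example words are made up for illustration); a perfectly sensible trigram that happens not to appear receives probability zero.

    from collections import Counter

    corpus = "we ate vanilla birthday cake and we ate chocolate ice cream".split()

    trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
    bigram_counts = Counter(zip(corpus, corpus[1:]))

    def trigram_prob(w1, w2, w3):
        """Maximum-likelihood estimate of P(w3 | w1, w2) from raw counts."""
        if bigram_counts[(w1, w2)] == 0:
            return 0.0
        return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

    print(trigram_prob("vanilla", "birthday", "cake"))    # 1.0 -- observed in training
    print(trigram_prob("chocolate", "birthday", "cake"))  # 0.0 -- valid, but never observed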

Techniques like smoothing (such as add-one or Laplace smoothing, which redistributes small amounts of probability mass to unseen events) partially address this by preventing zero probabilities. However, these are crude approximations that don’t solve the fundamental problem. The model still lacks any principled understanding of why certain sequences should be probable — it can only mechanically adjust counts.
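
As a rough sketch of what add-one smoothing does (continuing the toy-corpus style above, with the vocabulary size V taken from that tiny corpus), every possible bigram receives one pseudo-count, so nothing is left at exactly zero:

    from collections import Counter

    corpus = "the capital of finland is helsinki".split()
    V = len(set(corpus))                      # vocabulary size

    bigram_counts = Counter(zip(corpus, corpus[1:]))
    unigram_counts = Counter(corpus)

    def laplace_bigram_prob(w1, w2):
        """P(w2 | w1) with one pseudo-count added to every possible bigram."""
        return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

    print(laplace_bigram_prob("capital", "of"))        # seen bigram: relatively high
    print(laplace_bigram_prob("capital", "helsinki"))  # unseen bigram: small but non-zero

The unseen bigram now gets some probability mass, but only because the count table was padded, not because the model knows anything about what the words mean.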


Limitation: Out-of-vocabulary words

Early statistical models depend entirely on the vocabulary present in their training data. They construct probability tables based exclusively on words they’ve observed.

This creates the out-of-vocabulary (OOV) problem:

  • If a word never appeared during training, the model literally cannot process it — no entry exists in its probability tables.
  • This particularly affects rare words, proper names, technical terms, neologisms, and newly coined expressions.
  • Names of people, places, products, or organizations absent from training data become impossible to handle correctly.

Practical systems typically addressed this limitation by:

  • Replacing rare words (below some frequency threshold) with a special UNK (unknown) token during both training and inference.
  • This allows the system to function without crashing, but at the cost of losing important information and reducing practical usefulness.

For example, if someone asks “What is the capital of Finand?” (note the typo) and “Finand” wasn’t in the training vocabulary, the model would effectively see “What is the capital of UNK?” and lose the crucial information needed to answer correctly. It could only respond based on whatever generally followed “capital of UNK” in training, producing answers unrelated to the actual question.
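
A minimal sketch of that UNK strategy (the frequency threshold and the tiny training text are invented for illustration) shows how the information disappears:

    from collections import Counter

    training_tokens = "what is the capital of finland the capital of sweden".split()
    counts = Counter(training_tokens)
    MIN_COUNT = 2                             # frequency threshold (illustrative)

    vocab = {w for w, c in counts.items() if c >= MIN_COUNT}

    def replace_oov(tokens):
        """Map every word outside the training vocabulary to a single UNK token."""
        return [w if w in vocab else "<UNK>" for w in tokens]

    print(replace_oov("what is the capital of finand".split()))
    # ['<UNK>', '<UNK>', 'the', 'capital', 'of', '<UNK>'] -- the misspelled name is gone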


Limitation: Lack of semantic understanding

N-gram and Markov models capture surface-level statistical patterns without any representation of underlying meaning, grammatical principles, or world knowledge. They are purely mechanical frequency counters — tallying sequences without comprehension of what those sequences mean or represent.

Consider the implications:

  • An n-gram model knows that “The capital of Finland is …” is statistically often followed by “Helsinki” in its training data, but it doesn’t know why this is true.
  • The model has no concept that Finland is a country, that countries have capitals, or that Helsinki is a city located within Finland’s borders.
  • It cannot distinguish between factually correct (“Helsinki”) and factually incorrect (“Stockholm” or “Paris”) continuations.

This lack of understanding makes n-gram models fundamentally brittle:

  • Generated text may appear grammatically plausible locally while containing factual errors or logical contradictions.
  • The model cannot reason about what makes sense in context — it can only report observed patterns.
  • It cannot transfer knowledge between related contexts. Knowing that “France’s capital is Paris” provides no help with “The capital of France is…” because these are treated as completely separate patterns rather than expressions of the same underlying fact.
  • The model treats all sequences as arbitrary symbol patterns, with no representation of the real-world relationships those symbols describe.
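
The “no transfer” point can be seen directly in how a count-based model stores what it knows: two phrasings of the same fact live under completely unrelated keys. The toy counts below are invented for illustration.

    from collections import Counter

    context_counts = Counter()

    # Suppose the training data contained this phrasing 120 times:
    context_counts[("France's", "capital", "is", "Paris")] += 120

    # A different phrasing of the very same fact is a different key entirely:
    print(context_counts[("capital", "of", "France", "is", "Paris")])
    # 0 -- the 120 observations above contribute nothing here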

Limitation: Storage and computational efficiency

Storing all n-grams for large 𝑛 and large vocabularies requires enormous memory, quickly becoming impractical.

Consider the theoretical storage requirements: A 5-gram model with a vocabulary of 100,000 words would have:

100,000⁵ = 10²⁵

possible 5-gram sequences. That’s 10 septillion possible combinations — far more than could be observed in any corpus and far more than could be stored with any conceivable computing infrastructure.
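
The arithmetic is easy to check, and even a heavily pruned model stays large. In the sketch below, the one-in-a-trillion retention rate and the bytes-per-entry figure are assumptions chosen only to illustrate the scale.

    vocab_size = 100_000
    possible_5grams = vocab_size ** 5
    print(f"{possible_5grams:.1e}")            # 1.0e+25 possible 5-grams

    # Even if only one in a trillion of these were ever observed and each entry
    # took ~50 bytes (both figures are assumptions), storage would still be huge:
    observed = possible_5grams // 10**12       # 10**13 entries
    print(f"~{observed * 50 / 1e12:.0f} TB")   # ~500 TB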

Even after pruning to keep only observed n-grams, storage requirements for large-scale models remain substantial:

  • Google’s Web 1T 5-gram corpus (based on 1 trillion words of web text) requires hundreds of gigabytes of storage.
  • Looking up probabilities in these massive tables incurs computational expense — each query requires searching through enormous data structures.
  • The model provides no compression or generalization — each n-gram must be stored and retrieved individually as a separate entry.

This inefficiency limits practical deployment of large n-gram models, especially in resource-constrained environments like mobile devices where memory and processing power are limited.


Broader limitations across early AI approaches

The limitations of n-gram models parallel challenges faced by other early AI approaches discussed in previous chapters, suggesting these were not isolated problems but symptoms of fundamental inadequacies in the approaches themselves.

Symbolic AI and expert systems, discussed in the part on AI and Generative Models, encountered similar problems:

  • As systems grew, the number of rules or knowledge base entries became unmanageable. Maintaining consistency across thousands of explicit rules proved extremely difficult, as adding new rules could interact unpredictably with existing ones.
  • Systems failed catastrophically when inputs fell outside their carefully defined rule base or training data. They couldn’t generalize or adapt to novel situations not explicitly anticipated by designers.
  • Both symbolic rules and n-gram probabilities lacked grounding in meaning, common sense, or world knowledge. They manipulated symbols or counted patterns without any representation of what those symbols meant or why those patterns existed.

These parallel limitations across different AI paradigms suggested that the fundamental approaches — whether rule-based symbolic systems or statistical-but-shallow pattern matching — were insufficient for capturing the richness and flexibility of human language and intelligence. Something qualitatively different was needed.


Historical turning point

By the late 2000s and early 2010s, researchers increasingly turned to neural models to overcome the brittleness and data limitations of n-grams. Initial neural language models used feed-forward and recurrent network architectures. Later, the introduction of the transformer architecture (2017) in the paper “Attention Is All You Need” led to dramatic improvements in both performance and scalability. These neural models — particularly transformers — became the foundation of today’s large language models like GPT, BERT, and their successors, which power modern applications from conversational AI to document analysis.

