Towards Language Models
Learning Objectives
- You know what language models are and where their core ideas originated.
- You can explain, at a high level, how simple probabilistic models predict the next letter or word.
- You understand how early experiments with letters and words established foundations for modern language models.
Probabilistic models of natural language
Language models are probabilistic models of natural language. They use probability theory to capture patterns in how language works by analyzing large amounts of text data. Specifically, they estimate quantities such as:
- The probability of a sequence of words: “How likely is this particular sentence to occur?”
- The probability of the next word given previous words: “Given the words we’ve seen so far, what word is likely to come next?”
For example, after observing the words “The cat sat on the,” a language model would assign high probability to continuation words like “mat,” “floor,” or “chair,” and very low probability to words like “jumped” or “telescope.” This reflects learned statistical patterns about which words typically follow others in natural language.
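To make this concrete, the sketch below estimates next-word probabilities by counting word pairs in a tiny toy corpus. The corpus and the `next_word_probabilities` helper are invented purely for illustration; real language models are estimated from vastly larger collections of text.

```python
from collections import Counter, defaultdict

# A tiny toy corpus; real models learn from vastly more text.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the floor",
    "the dog sat on the chair",
]

# Count how often each word follows each preceding word.
following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1

def next_word_probabilities(prev):
    """Estimate P(next word | previous word) from the counts."""
    counts = following[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_probabilities("the"))
# {'cat': 0.33..., 'mat': 0.17..., 'floor': 0.17..., 'dog': 0.17..., 'chair': 0.17...}
```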
Linguist J. R. Firth captured the fundamental intuition underlying this approach:
You shall know a word by the company it keeps.
In other words, a word’s meaning and usage can be understood by examining the contexts in which it typically appears. Language models operationalize this principle by learning statistical patterns from large amounts of text — discovering which words typically occur together, which sequences are common, and which are rare.
Experiments with letters
Some of the earliest ideas behind language models trace to Andrey Markov (early 1900s) and Claude Shannon (1940s-1950s). Both studied how letters in text are not independent but instead depend on what preceded them. This represented a crucial insight: you cannot treat each letter as a random, isolated event if you want to understand or generate realistic text.
In English, the letter “q” almost always precedes “u,” and “h” frequently follows “t” but rarely follows “x.” By counting how often one letter follows another in actual text, clear patterns emerge. The table below is built from the first 50,000 characters of Shakespeare’s Romeo and Juliet. Rows represent the current letter, and columns represent the next letter. Each cell shows the probability that the column letter appears after the row letter.
| | e | t | o | a | i |
|---|---|---|---|---|---|
| e | 0.0038 | 0.0055 | 0.0027 | 0.0067 | 0.0012 |
| t | 0.0047 | 0.0009 | 0.0059 | 0.0025 | 0.0034 |
| o | 0.0003 | 0.0036 | 0.0031 | 0.0004 | 0.0004 |
| a | - | 0.0070 | 0.0000 | - | 0.0030 |
| i | 0.0033 | 0.0066 | 0.0026 | 0.0008 | 0.0005 |
Look at the table above. Some pairs like “et” or “at” occur relatively frequently, while others like “ae” or “aa” almost never appear. The crucial point is that meaningful structure emerges simply from counting patterns in data. You don’t need to understand grammar rules or semantic meaning — frequency statistics alone capture substantial structure about how language works.
To further illustrate this, use your browser’s search function: first search for “et” on this page, then search for “ae”. Certain letter combinations are much more common than others!
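The table above boils down to exactly this kind of counting. The sketch below shows one way to tally adjacent letter pairs in a piece of text and turn the counts into conditional probabilities; the placeholder string and the normalization choice (dividing by the total count for the current letter) are assumptions made for illustration, and the table above may normalize its numbers differently:

```python
from collections import Counter, defaultdict

# Placeholder text; the table above was built from the first 50,000
# characters of Romeo and Juliet instead.
text = "two households both alike in dignity in fair verona where we lay our scene"

# Count adjacent letter pairs, lowercased and skipping pairs that span
# spaces or punctuation.
counts = defaultdict(Counter)
for current, nxt in zip(text.lower(), text.lower()[1:]):
    if current.isalpha() and nxt.isalpha():
        counts[current][nxt] += 1

def next_letter_probabilities(current):
    """Estimate P(next letter | current letter) from the counts."""
    total = sum(counts[current].values())
    return {nxt: n / total for nxt, n in counts[current].items()}

print(next_letter_probabilities("t"))
# {'w': 0.33..., 'h': 0.33..., 'y': 0.33...}
```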
Claude Shannon demonstrated the power of statistical patterns in language through a series of influential text-generation experiments. He explored generating text by:
- First, selecting letters completely at random, ignoring all structure and producing pure nonsense.
- Then, choosing letters based on their observed likelihood of following previous letters, using statistics like those shown in the visualization above.
Random letters with no structure produce output like:
XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD
This is completely unreadable and resembles no natural language.
When each letter is chosen considering even minimal preceding context — such as the letters that just appeared — the results begin to resemble English, even though the “words” remain meaningless:
IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE
Notice this second example contains more plausible letter combinations. You can almost perceive English words forming, even though most are nonsensical. This demonstrates that capturing even simple statistical dependencies between letters produces text resembling natural language in its surface structure.
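The same pair counts can also be used to generate text, much as in Shannon’s second experiment. The sketch below trains on a toy string (a stand-in for the tables of English letter statistics Shannon used) and then repeatedly samples each character in proportion to how often it was observed to follow the previous one:

```python
import random
from collections import Counter, defaultdict

# Toy training text; a stand-in for real English letter statistics.
text = "the cat sat on the mat and the dog sat on the log in the fog"

# Count which character follows which, including spaces so that
# word boundaries can be generated too.
transitions = defaultdict(Counter)
for current, nxt in zip(text, text[1:]):
    transitions[current][nxt] += 1

def generate(start, length, seed=0):
    """Generate text character by character, sampling each next
    character in proportion to how often it followed the current one."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        counts = transitions[out[-1]]
        if not counts:  # no observed successor for this character
            break
        chars, weights = zip(*counts.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

print(generate("t", 40))  # gibberish, but with English-like letter pairs
```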
Shannon also proposed a guessing game to demonstrate how humans use implicit knowledge of language structure to predict what comes next. In this game, you attempt to guess each successive letter in hidden text. Your ability to make accurate guesses — often predicting the correct letter on the first or second attempt — reveals how much statistical structure exists in language and how thoroughly humans internalize these patterns. Try it:
Shannon's Game
Clicking the "Make Guess" button opens up a prompt that asks for a character. Type in a character and press submit. If the guess is correct, you move to guessing the next character.
As you play, notice certain letters prove much easier to predict than others. After “th,” you might correctly guess “e” (forming “the”), but after an uncommon combination, the next letter becomes considerably harder to predict. When context provides strong constraints, prediction becomes easier; when context is less informative, uncertainty increases.
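For readers without access to the interactive version, here is a rough console sketch of the same game; the hidden passage is a placeholder, not the one used above:

```python
# A rough console version of Shannon's guessing game.
passage = "LANGUAGE MODELS PREDICT TEXT"   # placeholder hidden text

correct = 0
revealed = ""

for target in passage:
    guess = input(f"Passage so far: '{revealed}'  Guess the next character: ")
    if guess.upper() == target.upper():
        correct += 1
        print("Correct!")
    else:
        print(f"It was '{target}'.")
    revealed += target

print(f"Correct guesses: {correct} / {len(passage)}")
```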
These experiments contributed in part to Shannon’s groundbreaking work A Mathematical Theory of Communication (1948), which established the mathematical foundations of information theory. This work introduced fundamental concepts including entropy (a measure of uncertainty or information content) and redundancy (the degree to which language contains predictable patterns).
Shannon demonstrated that English text is highly redundant — much of it is predictable from context. This explains why compression algorithms work effectively on text and why humans can often understand messages even with missing letters. The predictability isn’t accidental — it’s an inherent property of how language conveys meaning efficiently while remaining robust to errors.
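One rough way to see this redundancy is to compute the entropy of the single-letter distribution in a sample of text and compare it to the maximum that 26 equally likely letters would give. The sketch below uses a placeholder sentence; Shannon’s own estimates were based on large samples and also accounted for longer-range context, which lowers the entropy further:

```python
import math
from collections import Counter

# Placeholder sample; Shannon worked with much larger bodies of English text.
text = "the quick brown fox jumps over the lazy dog and the cat sat on the mat"
letters = [c for c in text.lower() if c.isalpha()]

# Entropy of the single-letter distribution, in bits per letter.
counts = Counter(letters)
total = sum(counts.values())
entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())

# If all 26 letters were equally likely, each would carry log2(26) bits.
max_entropy = math.log2(26)

print(f"Entropy of the letter distribution: {entropy:.2f} bits per letter")
print(f"Maximum for 26 equally likely letters: {max_entropy:.2f} bits per letter")
print(f"Single-letter redundancy: {1 - entropy / max_entropy:.0%}")
```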
From letters to language models
These early experiments revealed a profound insight: even simple statistical rules based on letter frequencies can produce text that resembles natural language in its surface structure. The key principle is that language contains statistical regularities that can be captured and exploited for both prediction and generation.
This approach scales. If counting letter pairs produces somewhat English-like text, then counting word pairs, word triples, or longer sequences should produce increasingly realistic results. This progression — from letters to words to longer sequences — traces the path that led from early experiments to n-gram models and eventually to modern neural language models.
The fundamental idea remains consistent across this evolution: language exhibits statistical patterns, and models that capture these patterns can predict and generate language-like sequences. What has changed is the sophistication of the models and the scale of data they learn from.