Probabilities and Language

Probabilities and Predictions


Learning Objectives

  • You understand and can apply the concept of probability.
  • You know that computers can make predictions based on data.
  • You know about the law of large numbers.
  • You know the concepts of probability distributions and probabilistic models.

Rock-paper-scissors

Rock-paper-scissors is a hand game where two players simultaneously choose one of three options: rock, paper, or scissors. The rules form a cycle: rock beats scissors by crushing it, scissors beats paper by cutting it, and paper beats rock by covering it. When both players choose the same option, the result is a draw.

You can play against the computer below. The scoreboard tracks points for both you and the computer. After you select an option, the computer makes its choice, and the winner is determined according to the rules.

[Interactive game: rock-paper-scissors]

Finger-flashing games

Rock-paper-scissors belongs to a long tradition of finger-flashing games played across many cultures. The earliest known references come from Egypt’s Beni Hasan burial site, dating to approximately 2000 BC. Similar games have emerged independently in numerous societies throughout history, suggesting these simple competitive patterns appeal across cultural boundaries.


Probability

In the rock-paper-scissors game above, the computer chooses randomly with equal likelihood for each option. Over many rounds, a pattern emerges: you win approximately one-third of games, lose approximately one-third, and draw approximately one-third. This occurs because when both players choose randomly with equal probabilities, each outcome becomes equally likely.

Probability quantifies the likelihood of an event occurring. It ranges from 0 (the event cannot happen) to 1 (the event will certainly happen). A probability of 0.5 indicates a 50% chance — the event is equally likely to occur or not occur. Higher probabilities indicate greater likelihood.

For example, the probability of rolling a 3 on a fair six-sided die is 1/6 (approximately 0.167), because one favorable outcome exists among six equally likely possibilities. The probability of drawing a heart from a standard 52-card deck is 13/52 = 0.25, because 13 of the 52 cards belong to the hearts suit.
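To make the arithmetic concrete, here is a minimal Python sketch of the same two calculations (the variable names are ours, purely for illustration):

```python
# Probability = favorable outcomes / all equally likely outcomes.
p_roll_three = 1 / 6     # one favorable face out of six
p_draw_heart = 13 / 52   # thirteen hearts among 52 cards

print(f"P(roll a 3)     = {p_roll_three:.3f}")  # ~0.167
print(f"P(draw a heart) = {p_draw_heart:.2f}")  # 0.25
```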


Now consider a biased version where the computer’s choices are not equally probable. In the next game, the computer plays rock with probability 0.7, scissors with probability 0.2, and paper with probability 0.1. This means rock appears far more frequently than paper — about seven times out of ten on average.

Understanding this probability distribution allows you to adapt your strategy. Since rock appears 70% of the time and paper beats rock, choosing paper frequently gives you a significant advantage over many games. Try this strategy and observe how your win rate changes:

[Interactive game: rock-paper-scissors]
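To see why the paper-heavy strategy works, here is a minimal Python simulation sketch. It assumes the biased probabilities given above (0.7 rock, 0.2 scissors, 0.1 paper); the function and variable names are illustrative, not the course's actual implementation:

```python
import random

MOVES = ["rock", "scissors", "paper"]
WEIGHTS = [0.7, 0.2, 0.1]  # the biased computer from the text

def play_round_as_paper():
    """One round in which the human always plays paper."""
    computer = random.choices(MOVES, weights=WEIGHTS)[0]
    if computer == "rock":
        return "win"    # paper covers rock
    if computer == "paper":
        return "draw"
    return "loss"       # scissors cuts paper

results = [play_round_as_paper() for _ in range(10_000)]
print("win rate:", results.count("win") / len(results))  # ~0.70
```

Over many rounds, the win rate settles near 0.70, matching the probability that the computer plays rock.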


Predictions and artificial intelligence

Knowledge of probabilities enables prediction — making informed estimates about what will likely happen next. If you know the computer plays rock 70% of the time, you can predict rock is probably coming and plan your response accordingly.

Prediction means making an informed estimate about future events based on available information. Predictions express likelihood rather than certainty — they indicate what is probable, not what is guaranteed.

Humans make predictions constantly: estimating tomorrow’s weather from current atmospheric conditions, forecasting stock market movements from economic indicators, or anticipating sports outcomes from team performance records. Computers can perform similar reasoning, and this ability to make data-based predictions forms a core component of many AI systems.

Prediction underlies many AI tasks: forecasting future events, making decisions under uncertainty, and translating between languages.

The next version uses an AI that attempts to predict your moves by analyzing your playing history.

[Interactive game: rock-paper-scissors]

The AI maintains a record of your move history and searches for patterns in your behavior. Based on discovered patterns, it makes informed guesses about your next choice. For instance, if you frequently play rock after winning with paper, the AI detects this tendency and exploits it by playing paper to counter your expected rock.

This may seem mysterious, but the mechanism is straightforward: counting events and their sequences. Patterns that appear frequently in your history are judged more likely to repeat than rare ones. The AI doesn’t “understand” the game in any deep sense — it simply identifies statistical regularities in your choices and uses them to inform predictions.
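As a rough sketch of this counting idea in Python (not the game's actual code; the names are invented for illustration), the AI only needs a table of follow-up counts and a rule for countering the most frequent follow-up:

```python
from collections import Counter, defaultdict

BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

# follow_ups[previous_move] counts what the player chose next.
follow_ups = defaultdict(Counter)

def record(previous_move, next_move):
    follow_ups[previous_move][next_move] += 1

def respond(previous_move):
    """Counter the player's most frequent follow-up to their last move."""
    counts = follow_ups[previous_move]
    if not counts:
        return "rock"                        # no data yet: pick anything
    predicted = counts.most_common(1)[0][0]  # most frequent next move
    return BEATS[predicted]                  # play what beats the prediction

# Example: a player who tends to follow paper with rock.
record("paper", "rock")
record("paper", "rock")
record("paper", "scissors")
print(respond("paper"))  # -> "paper", countering the expected rock
```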


The following version makes the AI’s learning process transparent. You can open a table showing how many times each three-move sequence has occurred in your playing history. Play several rounds, then examine the table. Do certain patterns appear more frequently than others? Can you identify regularities or habits in your own play that you weren’t consciously aware of?

[Interactive game: rock-paper-scissors]

As you play more games, the table accumulates more data, and the AI’s predictions become more confident. This frequency data is precisely what the AI uses for prediction — it examines recent patterns and estimates your next move based on what you’ve previously done in similar situations.


Law of large numbers

A fundamental principle in probability theory is the law of large numbers. It states that as an experiment is repeated many times, the observed frequency of outcomes converges toward the true underlying probabilities.

Over many trials, the average result approaches the expected value, increasingly reflecting the actual probabilities governing the system.

Coin tossing provides a classic illustration. A fair coin has probability 0.5 for heads and 0.5 for tails. If you toss it only 10 times, you might observe 7 heads (70%) and 3 tails (30%) — substantially different from the expected 50-50 distribution. This deviation is normal with small sample sizes. However, if you toss the coin thousands of times, the observed proportions will converge closer and closer to 50% heads and 50% tails. The law of large numbers mathematically guarantees this convergence.
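A small Python sketch makes the convergence visible (the sample sizes are arbitrary choices):

```python
import random

# Observed proportion of heads for increasingly many fair-coin tosses.
for n in [10, 100, 1_000, 10_000, 100_000]:
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(f"{n:>7} tosses: {heads / n:.3f} heads")
```

With 10 tosses the proportion can easily be 0.3 or 0.7; with 100,000 it rarely strays far from 0.500.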

[Interactive simulation: coin tosses]

The same principle applies to rock-paper-scissors. If you play many rounds against a computer, the distribution of its choices will eventually reflect the true underlying probabilities more accurately.

[Interactive simulation: rock-paper-scissors]

For comparison, in the simulation below the computer's probabilities for rock, paper, and scissors are not equal. Run many rounds and watch how the observed proportions converge toward the set probabilities. Try running multiple simulations and notice how the results stabilize with larger sample sizes.

[Interactive simulation: rock-paper-scissors]

This principle proves crucial for AI and machine learning: more data enables better estimation of true probabilities, which in turn produces more accurate predictions.


Probability distributions

As you accumulate observations, your results approximate the underlying probabilities more closely. What you’re estimating is the probability distribution of the outcomes.

A probability distribution is a mathematical function describing the likelihood of different possible outcomes in a system. It specifies how probability is allocated across all possible events.

We can construct probability distributions from observed data. Suppose you observe a player’s sequence of 10 moves:

Playing history
rock rock rock paper rock rock paper scissors paper scissors

Counting the frequency of each choice: rock appears 5 times, paper appears 3 times, and scissors appears 2 times. Dividing each count by the total number of moves (10) yields the estimated probability distribution:

Choice      Probability
rock        0.50
paper       0.30
scissors    0.20

This distribution already provides strategic insight. Since rock appears most frequently (50% of the time), choosing paper becomes a strong response — you would win approximately half the time if this pattern continues.

The law of large numbers reminds us that estimation accuracy improves with sample size. With only 10 moves, these probability estimates may be unreliable — the player might have happened to choose rock frequently in this particular sequence without it representing their typical behavior. With 1,000 moves, the estimates would be much more trustworthy, as random fluctuations average out over larger samples.
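The estimate itself is nothing more than counting and dividing, as this Python sketch shows (using the 10-move history from above):

```python
from collections import Counter

history = ["rock", "rock", "rock", "paper", "rock",
           "rock", "paper", "scissors", "paper", "scissors"]

counts = Counter(history)
distribution = {move: count / len(history) for move, count in counts.items()}
print(distribution)  # {'rock': 0.5, 'paper': 0.3, 'scissors': 0.2}
```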


Conditional probability distributions

We can extend this analysis by considering how choices depend on previous actions. This produces a conditional probability distribution, which captures dependencies between events.

A conditional probability distribution describes the likelihood of outcomes given that a specified condition has been met. Rather than asking “what usually happens?” we ask “what usually happens after X?”

Instead of only counting overall moves, we track what follows what. Examining the same 10-move history:

  • After playing rock (which occurred 5 times), the next choice was rock 3 times and paper 2 times.
  • After playing paper (which occurred 3 times), the next choice was rock once and scissors twice.
  • After playing scissors (which occurred twice, though the second scissors ended the sequence and was not followed by another move), the next choice was paper once.

We can organize this information into a conditional probability table:

Previous choice    Next choice probabilities
rock               rock: 0.60, paper: 0.40, scissors: 0.00
paper              rock: 0.33, paper: 0.00, scissors: 0.67
scissors           rock: 0.00, paper: 1.00, scissors: 0.00

This provides additional predictive power beyond simple frequency counts. We now know not just what usually happens but what usually happens after specific preceding events. If you just played rock, this table indicates you’re more likely to play rock again (60% probability) than paper (40% probability), and you never follow rock with scissors in this sample.
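Building the conditional table works the same way, except that we count consecutive pairs of moves instead of single moves. A sketch in Python, again using the history above:

```python
from collections import Counter, defaultdict

history = ["rock", "rock", "rock", "paper", "rock",
           "rock", "paper", "scissors", "paper", "scissors"]

# For each move, count what came immediately after it.
transitions = defaultdict(Counter)
for previous, following in zip(history, history[1:]):
    transitions[previous][following] += 1

for previous, counts in transitions.items():
    total = sum(counts.values())
    probs = {move: round(n / total, 2) for move, n in counts.items()}
    print(previous, "->", probs)
# rock -> {'rock': 0.6, 'paper': 0.4}
# paper -> {'rock': 0.33, 'scissors': 0.67}
# scissors -> {'paper': 1.0}
```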

Patterns in longer sequences

More sophisticated patterns emerge when examining longer histories — considering the previous two or three moves rather than just one. This is exactly what the AI in the game does: it constructs conditional distributions from your past moves, analyzing sequences like “rock-paper-?” or “scissors-rock-paper-?” to predict what comes next. Longer sequences can enable more specific predictions, though they also require substantially more data to estimate reliably. With limited data, longer sequences may have occurred too infrequently to provide useful probability estimates.
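Extending the context is a small change to the same counting code: key the table on a tuple of the last two moves instead of a single move. This sketch also shows the data-sparsity problem the paragraph mentions:

```python
from collections import Counter, defaultdict

history = ["rock", "rock", "rock", "paper", "rock",
           "rock", "paper", "scissors", "paper", "scissors"]

# Condition on the previous TWO moves instead of one.
transitions = defaultdict(Counter)
for a, b, following in zip(history, history[1:], history[2:]):
    transitions[(a, b)][following] += 1

print(transitions[("rock", "paper")])
# Counter({'rock': 1, 'scissors': 1}) -- only two observations, so the
# estimated probabilities (0.5 and 0.5) rest on very little evidence.
```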



Probabilistic models

A probabilistic model uses information about past events to estimate the likelihood of future ones. These models range from simple (like frequency tables counting past occurrences) to highly complex (like the neural networks underlying large language models), but all rely fundamentally on probability and conditional probability distributions.

In the rock-paper-scissors game, the AI constructs a conditional probability distribution of your moves from your playing history and uses this as a probabilistic model to predict your next choice. When you’ve just played the sequence rock-paper, the model consults its table to determine what you typically do after rock-paper, then predicts the most likely outcome.
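Putting the pieces together, a toy version of such a probabilistic model is just the conditional table plus a lookup. The sketch below is our own illustration, not the game's actual code:

```python
from collections import Counter, defaultdict

BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

class FrequencyModel:
    """Predicts a player's next move from counts of past follow-ups."""

    def __init__(self, context_length=2):
        self.context_length = context_length
        self.counts = defaultdict(Counter)

    def update(self, history):
        """Count what followed each context of recent moves."""
        n = self.context_length
        self.counts.clear()
        for i in range(len(history) - n):
            context = tuple(history[i:i + n])
            self.counts[context][history[i + n]] += 1

    def counter_move(self, history):
        """Predict the player's next move and return what beats it."""
        context = tuple(history[-self.context_length:])
        observed = self.counts[context]
        if not observed:
            return "rock"  # no data for this context: fall back
        predicted = observed.most_common(1)[0][0]
        return BEATS[predicted]

model = FrequencyModel()
history = ["rock", "paper", "rock", "rock", "paper", "rock"]
model.update(history)
print(model.counter_move(history))  # predicts rock after paper-rock, plays paper
```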

Probabilistic models appear throughout modern technology. Weather forecasts predict tomorrow's temperature based on current atmospheric conditions and historical patterns. Stock market analysis predicts price movements from historical data and current economic indicators. Medical diagnosis systems predict diseases from observed symptoms and known disease patterns. Most importantly for this course, language models predict the next word in a sequence based on the words that came before. Language models are sophisticated probabilistic models, and we'll examine them in detail in subsequent chapters.
