Probability Distributions and Probabilistic Models
Learning Objectives
- You know of the law of large numbers.
- You know of the concepts of probability distribution and probabilistic models.
In the previous part, you played a handful of games of rock paper scissors. At the end, you had access to a version of the game that showed a table of how many times you had chosen each combination of the three options. The computer used this table to make predictions about your next move.
Here, we’ll look in more detail into how the computer could use that information to decide what to play against you.
Law of large numbers
A key concept in making informed choices in the long run is the law of large numbers. The law of large numbers dictates that the underlying probabilities of events in a system start to emerge when the system is observed a large number of times.
The law of large numbers states that the average of a large number of trials will be close to the expected value, reflecting the actual probabilities in the data.
Let’s consider this using an example of coin tossing. When tossing a fair coin, the probability of getting heads is 0.5, and the probability of getting tails is 0.5. If you toss the coin a number of times, you can compute the averages. As an example, if you toss a coin 10 times, you might get 7 heads and 3 tails. The average for heads in these tosses would be 0.7, while the average for tails would be 0.3.
The more times you toss the coin, the closer these averages should get to 0.5. Below, you can have a computer toss a coin a number of times. When you click one of the options, the computer tosses a coin that many times and shows the averages for heads and tails. As you will notice, with a large number of tosses the averages end up close to 0.5.
Simulate coin tosses
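If you want to experiment beyond the widget above, the same simulation can be sketched in a few lines of Python. The function name and the use of a fixed seed are illustrative choices, not part of the course's game; the point is only that the average fraction of heads drifts toward 0.5 as the number of tosses grows.

```python
import random

def toss_average(n, seed=None):
    """Toss a fair coin n times and return the fraction of heads."""
    rng = random.Random(seed)  # seeded generator, so runs are repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n))
    return heads / n

# With more tosses, the average tends to settle near 0.5.
for n in (10, 100, 10_000):
    print(n, toss_average(n))
```

Running this a few times shows exactly the behavior described above: with 10 tosses the average jumps around, while with 10,000 it stays close to 0.5.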
The same holds for playing the rock paper scissors game. If you play the game for a long time, the computer’s choices should reflect the probabilities that were set for it: the more games you play, the closer the observed frequencies of the computer’s choices should be to those probabilities.
The following can be used to visualize the choices of a computer that plays rock with a probability of 0.5, paper with a probability of 0.3, and scissors with a probability of 0.2. Similarly to above, you can simulate a number of choices and see how the choices average out.
Simulate rock paper scissors
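The weighted simulation above can also be sketched in Python. The function and names below are illustrative; the probabilities 0.5, 0.3, and 0.2 are the ones stated in the text, passed as weights to the standard library's `random.choices`.

```python
import random
from collections import Counter

def simulate_rps(n, seed=None):
    """Simulate n computer choices with fixed probabilities
    and return the observed average for each move."""
    rng = random.Random(seed)
    moves = rng.choices(["rock", "paper", "scissors"],
                        weights=[0.5, 0.3, 0.2], k=n)
    counts = Counter(moves)
    return {move: counts[move] / n for move in ("rock", "paper", "scissors")}

print(simulate_rps(10))       # small sample: averages can be far off
print(simulate_rps(100_000))  # large sample: close to 0.5, 0.3, 0.2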
Probability distribution
When simulating the coin tosses and the computer choices in the rock paper scissors game, you observed that the larger the number of simulations, the closer the outcomes were to the actual probabilities. In fact, you were approximating the probability distribution of the outcomes.
A probability distribution is a mathematical function that provides the probabilities of occurrence of different possible outcomes in an experiment.
Probability distributions can be formed based on data. The table below outlines a sequence of 10 choices from a game of rock paper scissors. The first move has been rock, the second move paper, the third move rock, and so on.
Playing history |
---|
rock rock rock paper rock rock paper scissors paper scissors |
A probability distribution of the choices can be formed by counting the number of times each choice has been made, and dividing the counts by the total number of choices. In the above, rock has been chosen five times out of ten (50%), paper three times out of ten (30%), and scissors two times out of ten (20%). As a probability distribution, this would be the following:
Choice | Probability |
---|---|
rock | 0.50 |
paper | 0.30 |
scissors | 0.20 |
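The counting described above is straightforward to express in Python. This is a minimal sketch, not code from the course's game; it uses the 10-move history from the table and `collections.Counter` to tally the choices.

```python
from collections import Counter

history = ["rock", "rock", "rock", "paper", "rock",
           "rock", "paper", "scissors", "paper", "scissors"]

# Count each choice and divide by the total number of choices
# to turn the counts into a probability distribution.
counts = Counter(history)
distribution = {move: counts[move] / len(history) for move in counts}
print(distribution)  # {'rock': 0.5, 'paper': 0.3, 'scissors': 0.2}
```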
These probabilities already give us insight into how to play against a player who makes moves with such probabilities. Since rock is the most common choice, it would be a good idea to play paper, as paper beats rock.
Note that, as discussed in the part on the law of large numbers, one needs quite a few trials before the averages approach the actual probabilities. For an approximation, however, a smaller number can suffice.
Conditional probability distribution
Let’s next look at forming a probability distribution, where we take the previous choice into account. This is called a conditional probability distribution.
A conditional probability distribution is a probability distribution that provides the probabilities of occurrence of different possible outcomes in an experiment, given that a certain condition has been met.
When forming a conditional probability distribution, instead of counting the number of times each choice has been made, we count the number of times each choice has been followed by each of the possible choices.
In the playing history, rock has been followed by another choice a total of five times. Of these five times, rock has been followed by rock three times and by paper two times (never by scissors). Thus, the probability of rock after rock is 3/5, of paper after rock 2/5, and of scissors after rock 0/5. As a probability distribution, this would be the following.
Choice | Probability distribution of next choice |
---|---|
rock | rock: 0.60, paper: 0.40, scissors: 0.00 |
Similarly for paper: paper has been followed by rock once and by scissors twice, so the probability of rock after paper is 1/3, of paper after paper 0/3, and of scissors after paper 2/3. Adding to the previous table, our probability distribution would now be as follows.
Choice | Probability distribution of next choice |
---|---|
rock | rock: 0.60, paper: 0.40, scissors: 0.00 |
paper | rock: 0.33, paper: 0.00, scissors: 0.67 |
In a similar vein, we can calculate the probability distribution for scissors. Scissors has been followed by another choice only once, when it was followed by paper. Thus, the probability of rock after scissors is 0/1, of paper after scissors 1/1, and of scissors after scissors 0/1. Adding on to the previous table, our probability distribution would now be as follows.
Choice | Probability distribution of next choice |
---|---|
rock | rock: 0.60, paper: 0.40, scissors: 0.00 |
paper | rock: 0.33, paper: 0.00, scissors: 0.67 |
scissors | rock: 0.00, paper: 1.00, scissors: 0.00 |
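The full conditional distribution in the table above can be computed with a few more lines of Python. Again a sketch with illustrative names: we pair each move with the move that followed it, tally the pairs, and normalize each row of counts. Note that combinations never observed (such as scissors after rock) simply don't appear in the counts, whereas the table shows them explicitly as 0.00.

```python
from collections import Counter, defaultdict

history = ["rock", "rock", "rock", "paper", "rock",
           "rock", "paper", "scissors", "paper", "scissors"]

# Count how often each choice follows each previous choice.
followers = defaultdict(Counter)
for previous, current in zip(history, history[1:]):
    followers[previous][current] += 1

# Normalize each row of counts into a conditional distribution.
conditional = {
    previous: {move: count / sum(counts.values())
               for move, count in counts.items()}
    for previous, counts in followers.items()
}
print(conditional["rock"])   # {'rock': 0.6, 'paper': 0.4}
print(conditional["paper"])  # rock 1/3, scissors 2/3
```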
Patterns often start to emerge when we look into longer sequences of data. We could, for example, look for the probability of the next move, taking the two previous moves into account. This is exactly what the AI in the rock paper scissors game does — nothing more, nothing less.
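Extending the counting to two previous moves only requires conditioning on pairs instead of single moves. The sketch below is illustrative, not the game's actual code; note that with only 10 moves of history the counts are very sparse, which is why longer sequences of data are needed in practice.

```python
from collections import Counter, defaultdict

history = ["rock", "rock", "rock", "paper", "rock",
           "rock", "paper", "scissors", "paper", "scissors"]

# Count what follows each pair of two consecutive moves.
pair_followers = defaultdict(Counter)
for a, b, c in zip(history, history[1:], history[2:]):
    pair_followers[(a, b)][c] += 1

# In this history, the pair (rock, rock) was followed by
# paper twice and by rock once.
print(pair_followers[("rock", "rock")])
```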
Probabilistic models
Probabilistic models use information about prior events to predict future events. There are many techniques for building probabilistic models; one straightforward approach is to use probability distributions and conditional probability distributions to model the likelihood of possible outcomes.
As an example, the AI of the rock paper scissors game recorded each choice that you made and constructed a conditional probability distribution of your choices. This was the probabilistic model that the AI used for predicting your next choice.
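To make the prediction step concrete, here is a hypothetical sketch of such a model in use. For readability it conditions on one previous move rather than the two the game's AI uses, and the function name and fallback rule are illustrative assumptions: it predicts the player's most likely next move from the history and returns the move that beats it.

```python
from collections import Counter, defaultdict

# Which move beats which: rock is beaten by paper, and so on.
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

def counter_move(history):
    """Predict the player's next move from their history
    and return the move that beats that prediction."""
    followers = defaultdict(Counter)
    for previous, current in zip(history, history[1:]):
        followers[previous][current] += 1
    last = history[-1]
    if last in followers:
        # Most common follower of the player's last move.
        predicted = followers[last].most_common(1)[0][0]
    else:
        # No data for this condition: fall back to the
        # player's overall most common move.
        predicted = Counter(history).most_common(1)[0][0]
    return BEATS[predicted]

history = ["rock", "rock", "rock", "paper", "rock",
           "rock", "paper", "scissors", "paper", "scissors"]
print(counter_move(history))
```

In this history the last move is scissors, scissors has only ever been followed by paper, so the model predicts paper and plays scissors against it.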
Probabilistic models are used in a wide range of applications, including in language models, which we look into next.