Maths for AI 101: Fundamentals of Probability
What is Probability?
Probability is all about measuring uncertainty or the chances of something happening. For example, when we say, "the probability of a coin landing heads is 0.5," what are we really saying? There are two main ways to think about this: the Frequentist and the Bayesian views.
The Frequentist view sees probability as the long-term frequency of an event happening if you repeated it many times. So, if we say a coin has a 0.5 probability of landing heads, it means that if we flip the coin over and over again, we’d expect it to come up heads about half the time.
The Bayesian view, however, treats probability more like a measure of our own uncertainty or belief about an event. So, saying there's a 0.5 chance of getting heads simply means we're equally unsure whether the next flip will be heads or tails, without needing to flip it hundreds of times.
Even though these two interpretations look at probability differently, the core rules stay the same. Moving forward, we’ll dive deeper into probability from a Bayesian perspective, especially focusing on Bayes' Theorem, which is the backbone of Bayesian statistics and a key tool for updating our beliefs as we get new information.
Basic Terminologies
Probability helps us deal with uncertainty, but what exactly are we uncertain about? For example, when we toss a coin, we are uncertain whether we will get heads or tails. In this case, we know the possible outcomes—heads or tails—which together form our sample space. Once we perform the experiment of tossing the coin and, say, it lands on heads, this result is called the event of getting a head.
Now, if we are interested in an outcome that is not getting a head, we refer to this as the complement of the event. The complement of an event includes everything in the sample space that does not belong to the original event. For instance, if our event is getting a head, the complement of this event is getting a tail. The probabilities of an event and its complement always add up to 1. Thus, if the probability of getting a head is 0.5, the probability of getting a tail (the complement) is also 0.5.
Now let's consider rolling a die, where the sample space is S = {1, 2, 3, 4, 5, 6}. In probability experiments, we may encounter situations where some events overlap and others do not. For instance, let Event E be rolling an odd number {1, 3, 5} and Event F be rolling an even number {2, 4, 6}. These two events have no outcomes in common and do not overlap; such events are called disjoint events. On the other hand, consider Event E as rolling a number greater than 3 {4, 5, 6} and Event F as rolling an even number less than 6 {2, 4}. In this case, these events overlap because both include the outcome 4; such events are referred to as joint events.
When we have multiple events, we can perform operations on them to combine outcomes in different ways. If we want to find all outcomes that are present in both events, this operation is called intersection. On the other hand, if we want to find all outcomes that are present in either of the events, this operation is called union.
- Event E: Rolling an even number {2, 4, 6}
- Event F: Rolling a number greater than 3 {4, 5, 6}
- Intersection (E ∩ F): {4, 6}
- Event E: Rolling an even number {2, 4, 6}
- Event F: Rolling a number greater than 3 {4, 5, 6}
- Union (E ∪ F): {2, 4, 5, 6}
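These event operations map directly onto Python's built-in set operations. A minimal sketch using the die events above (the variable names are our own choices):

```python
# Die-roll events represented as Python sets.
E = {2, 4, 6}      # rolling an even number
F = {4, 5, 6}      # rolling a number greater than 3
odd = {1, 3, 5}    # rolling an odd number

print(E & F)       # intersection E ∩ F → {4, 6}
print(E | F)       # union E ∪ F → {2, 4, 5, 6}
print(E & odd)     # disjoint events share no outcomes → set()
```

An empty intersection is exactly the test for disjointness: E and `odd` are disjoint, while E and F are joint events because they share the outcome 4.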
Axioms Of Probability
Probability rests on three basic axioms: (1) non-negativity — for any event E, P(E) ≥ 0; (2) normalization — P(S) = 1, where S is the sample space; and (3) additivity — for disjoint events E and F, P(E ∪ F) = P(E) + P(F). These axioms provide a foundation for deriving additional rules that will guide us toward understanding Bayes' Theorem.
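A quick sketch checking the three axioms for a fair six-sided die; the uniform 1/6 probabilities are an illustrative assumption, and `fractions` keeps the arithmetic exact:

```python
from fractions import Fraction

# Fair six-sided die: each outcome has probability 1/6 (our assumption).
sample_space = {1, 2, 3, 4, 5, 6}
P = {outcome: Fraction(1, 6) for outcome in sample_space}

def prob(event):
    """Probability of an event (a subset of the sample space)."""
    return sum(P[o] for o in event)

# Axiom 1: probabilities are non-negative.
assert all(p >= 0 for p in P.values())

# Axiom 2: the probability of the whole sample space is 1.
assert prob(sample_space) == 1

# Axiom 3: for disjoint events, probabilities add.
odd, even = {1, 3, 5}, {2, 4, 6}
assert prob(odd | even) == prob(odd) + prob(even)
```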
Let's understand this with an example.
Imagine we have a standard deck of 52 playing cards. Let's define two events:
- Event E: drawing a Heart
- Event F: drawing a Queen
We want to find the conditional probability of drawing a Queen, given that we have drawn a Heart.
Since we know we have drawn a Heart (Event E), our sample space is reduced to the 13 Hearts in the deck. We are no longer considering the entire deck of 52 cards but only those 13 Hearts.
Now, we need to find the probability that we have drawn both a Heart and a Queen. There is only 1 Queen of Hearts in the deck, so P(E ∩ F) = 1/52.
Thus our formula, P(F|E) = P(E ∩ F)/P(E) = (1/52)/(13/52) = 1/13, normalizes by dividing the joint probability of E and F by the probability of E. This effectively adjusts the probability from the original sample space (52 cards) to the new sample space where only Hearts (13 cards) are considered.
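The card example can be verified with a short sketch; the tuple encoding of the deck below is just one convenient representation:

```python
from fractions import Fraction

# Model a 52-card deck as (rank, suit) pairs.
ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['Hearts', 'Diamonds', 'Clubs', 'Spades']
deck = [(r, s) for r in ranks for s in suits]

hearts = [c for c in deck if c[1] == 'Hearts']                # event E
queen_and_heart = [c for c in deck if c == ('Q', 'Hearts')]   # E ∩ F

p_E = Fraction(len(hearts), len(deck))                # 13/52
p_E_and_F = Fraction(len(queen_and_heart), len(deck)) # 1/52

# Conditional probability: P(F|E) = P(E ∩ F) / P(E)
p_F_given_E = p_E_and_F / p_E
print(p_F_given_E)  # → 1/13
```

Note that 1/13 is exactly the chance of drawing a Queen from the 13 Hearts alone, which is what "reducing the sample space" means.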
Suppose our sample space S has a more complicated event structure: events A₁, A₂, …, Aₙ form a partition of S provided that Aᵢ ∩ Aⱼ = ∅ for i ≠ j and the union of all the Aᵢ is the entire sample space S.
Now we place a disc on top of it, representing Event B. What, then, is the probability of event B?
As per the Law of Total Probability, it is: P(B) = P(B|A₁)P(A₁) + P(B|A₂)P(A₂) + … + P(B|Aₙ)P(Aₙ) = Σᵢ P(B|Aᵢ)P(Aᵢ).
From Fig.10 and the formula provided in Fig.11, we see that Event B and the event spaces Aᵢ overlap. This means that B intersects with each partitioned event space Aᵢ, forming new sub-events that belong to both B and Aᵢ. To find the total probability of event B, we use the Law of Total Probability, which tells us how to express the probability of an event that spans multiple partitions of the sample space.
Think of each partition Aᵢ as a separate "world" in which event B might occur. The conditional probability P(B|Aᵢ) tells us how likely B is to occur within that specific "world," and P(Aᵢ) tells us the likelihood of being in that "world" in the first place. The Law of Total Probability essentially combines these possibilities to give us the overall probability of B happening across the entire sample space S.
Let's understand it with the example of a pizza:
Imagine you have a big pizza (the Sample Space, S), and this pizza is cut into different slices of different flavors (these slices are like the event spaces: A1, A2, A3,...). Now, each slice has a different flavor, and together, they make up the whole pizza.
Now, let's say you put a piece of cheese on the pizza. The cheese lands on some parts of different slices (this is Event B). The question is, how much of the pizza has cheese on it?
To figure that out, you need to look at each slice separately. You see how much cheese is on each slice and add it all up. This is what the formula does! It adds up all the little parts of the slices that have cheese (the cheese on A1, the cheese on A2, and so on) to get the total amount of cheese on the pizza.
So, the formula is like saying, "Let's see how much cheese is on each flavor slice and then add them all together to know how much of the whole pizza has cheese!"
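The pizza analogy can be sketched numerically. The slice sizes and cheese fractions below are hypothetical numbers chosen purely for illustration; the only requirement is that the slice probabilities sum to 1:

```python
from fractions import Fraction

# Hypothetical partition: three flavor slices covering the whole pizza.
# P(A_i): the fraction of the pizza each slice takes up (must sum to 1).
p_A = {'A1': Fraction(1, 2), 'A2': Fraction(1, 3), 'A3': Fraction(1, 6)}

# P(B | A_i): the fraction of each slice covered by cheese (hypothetical).
p_B_given_A = {'A1': Fraction(1, 4), 'A2': Fraction(1, 2), 'A3': Fraction(1, 1)}

# Sanity check: the slices form a partition of the pizza.
assert sum(p_A.values()) == 1

# Law of Total Probability: P(B) = Σᵢ P(B|Aᵢ) · P(Aᵢ)
p_B = sum(p_B_given_A[a] * p_A[a] for a in p_A)
print(p_B)  # → 11/24
```

Each term of the sum is "how much cheese is on this slice, weighted by how big the slice is," and adding the terms gives the total cheese-covered fraction of the pizza.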
Bayes' Theorem
Bayes' Theorem states: P(H|E) = P(E|H) P(H) / P(E). Just by looking at this formula, you may have noticed that it is derived from conditional probability and uses the Law of Total Probability to compute the denominator P(E).
Let's understand this with an example.
Imagine you have a toy box with two types of toy cars: red cars and blue cars. Some cars make noise, and some are quiet. You want to find a red car that makes noise. But you don't know which car will be noisy until you pull one out!
What We Know About the Toy Box
- There are 10 red cars and 10 blue cars in the toy box (20 cars total).
- Out of the 10 red cars, 2 are noisy.
- Out of the 10 blue cars, 6 are noisy.
At the start, you might think, "If I pick a noisy car, it could be red or blue." But as you get more information (like hearing noise), you need to update your belief.
New Information — You Hear Noise!
You reach into the toy box without looking and pull out a car. You hear it making noise! Now you know it’s a noisy car, but you don’t know if it’s red or blue.
You start to wonder, "How likely is it that this noisy car is red?"
Bayes' Theorem helps us update our belief (hypothesis) based on the new evidence (information).
- Hypothesis (H): "The noisy car is red."
- Evidence (E): "The car makes noise."
In the formula P(H|E) = P(E|H) × P(H) / P(E):
- P(H|E): The updated belief — probability that the car is red given that it makes noise.
- P(E|H): The likelihood — probability that a red car makes noise. Since 2 out of 10 red cars make noise, P(E|H) = 2/10 = 0.2.
- P(H): The prior belief — probability that any car you pick is red before knowing if it makes noise. Since there are 10 red cars out of 20 total cars, P(H) = 10/20 = 0.5.
- P(E): The total evidence — probability of picking any noisy car, regardless of color. There are 2 noisy red cars and 6 noisy blue cars, so P(E) = (2 + 6)/20 = 8/20 = 0.4.
Applying Bayes' Theorem
Now, let’s calculate the probability that the noisy car is red using Bayes' Theorem:
P(H|E) = (0.2 × 0.5) / 0.4 = 0.25
- P(H|E) = 0.25 means there is a 25% chance the noisy car is red after hearing it makes noise.
- At first, you might think there's a good chance a noisy car could be red or blue. But after hearing the noise and knowing that more noisy cars are blue, you realize the chance of it being a noisy red car is actually lower (only 25%).
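The whole toy-box calculation can be checked with a short sketch using exact fractions, with the counts taken directly from the example above:

```python
from fractions import Fraction

# Toy box counts from the example.
red_total, blue_total = 10, 10
red_noisy, blue_noisy = 2, 6
total = red_total + blue_total

p_H = Fraction(red_total, total)               # prior P(H): car is red = 1/2
p_E_given_H = Fraction(red_noisy, red_total)   # likelihood P(E|H) = 2/10
p_E = Fraction(red_noisy + blue_noisy, total)  # evidence P(E): noisy = 8/20

# Bayes' Theorem: P(H|E) = P(E|H) · P(H) / P(E)
p_H_given_E = p_E_given_H * p_H / p_E
print(p_H_given_E)  # → 1/4, i.e. a 25% chance the noisy car is red
```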
Bayes' Theorem is like being a clever detective. You start with a guess (hypothesis) — "I might have a noisy red car." But when you get new evidence (you hear the noise), you use Bayes' Theorem to update your guess. Now, you realize it's more likely a noisy blue car because there are more noisy blue cars in the box. This is how Bayes' Theorem helps us change our belief when we learn something new!