Understanding Bayesian Estimation with Examples in Python
Rany ElHousieny, PhD
Generative AI ENGINEERING MANAGER | ex-Microsoft | AI Solutions Architect | Generative AI & NLP Expert | Proven Leader in AI-Driven Innovation | Former Microsoft Research & Azure AI | Software Engineering Manager
Bayesian estimation is a statistical method that involves updating the probability estimate for a hypothesis as additional evidence is acquired. This approach to probability allows us to systematically improve our estimates in light of new data.
Note: This article is a continuation of the previous article:
Bayes' Theorem
Bayes' Theorem is a formula that describes how to update the probabilities of hypotheses when given evidence. It relies on incorporating prior knowledge and observed data. The theorem is mathematically stated as:

P(A∣B) = P(B∣A) · P(A) / P(B)

where P(A∣B) is the posterior probability of hypothesis A given evidence B, P(B∣A) is the likelihood of the evidence given the hypothesis, P(A) is the prior probability of the hypothesis, and P(B) is the overall probability of the evidence.
To understand the concept, let's solve the snowfall problem below:
A father claims that it snowed last night. The first daughter says that the probability of snowfall on any particular night is 1/8. The second daughter says that 5 out of 6 times, the father lies! What is the probability that there actually was a snowfall?
To solve this problem using Bayes' theorem, let's define the events:

- S: it actually snowed last night
- C: the father claims that it snowed

We are given:

- P(S) = 1/8, the prior probability of snowfall on any given night
- The father tells the truth 1/6 of the time and lies 5/6 of the time
We want to find P(S∣C), the probability that there was snowfall, given that the father claims there was snowfall.
Bayes' theorem states that:

P(S∣C) = P(C∣S) · P(S) / P(C)

We have P(S) = 1/8, but we need P(C∣S) and P(C). P(C∣S) is the probability that the father claims snowfall given that it actually snowed; since he tells the truth 1/6 of the time, P(C∣S) = 1/6.

Since the father's claim is about snowfall, we consider two cases for P(C):

- If it snowed (probability 1/8), he claims snowfall when he tells the truth: P(C∣S) = 1/6
- If it did not snow (probability 7/8), he claims snowfall when he lies: P(C∣¬S) = 5/6

Now we can find P(C), the total probability that the father claims there was snowfall:

P(C) = P(C∣S)·P(S) + P(C∣¬S)·P(¬S) = (1/6)(1/8) + (5/6)(7/8) = 1/48 + 35/48 = 36/48 = 3/4

Now, we can apply Bayes' theorem:

P(S∣C) = P(C∣S)·P(S) / P(C) = (1/48) / (3/4) = 1/36
The probability that there was actually snowfall given the father's claim is 1/36, which is quite low.
Insights from the result using Bayes' theorem:
This low probability might seem counterintuitive at first since the father did claim there was snowfall. However, the father's credibility is low (he tells the truth only 1/6 of the time), and the prior probability of snowfall on any given night is also low (1/8).
Bayes' theorem combines these pieces of information to update our belief in the event of snowfall based on the father's claim. It reflects the fact that a claim from an unreliable source does not significantly increase the likelihood of the event being true, especially when the event itself is already known to be rare.
Thus, despite the claim of snowfall, the probability of it having actually snowed remains low when considering the father's tendency to lie and the low likelihood of snowfall in general. This demonstrates how Bayes' theorem can adjust our beliefs based on the reliability of new information and the prior likelihood of the event.
Now, let's solve it with Python:
# Prior probability of snowfall
P_S = 1/8
# Probability that the father is lying
P_L = 5/6
# Probability that the father is telling the truth
P_T = 1 - P_L
# Probability of the father claiming snowfall, given there was snowfall
P_C_S = 1/6
# Probability of the father claiming snowfall, given there was no snowfall (lying)
P_C_NS = P_L
# Using Bayes' Theorem to find the posterior probability of snowfall given the claim
P_S_C = (P_C_S * P_S) / ((P_C_S * P_S) + (P_C_NS * (1 - P_S)))
print(f"Probability of actual snowfall given the father's claim: {P_S_C:.2f}")
Probability of actual snowfall given the father's claim: 0.03
Understanding the Problem
Variables Defined in the Program

- P_S = 1/8: prior probability of snowfall
- P_L = 5/6: probability that the father is lying
- P_T = 1 − P_L = 1/6: probability that the father is telling the truth
- P_C_S = 1/6: probability of the father claiming snowfall given that it snowed (he tells the truth)
- P_C_NS = 5/6: probability of the father claiming snowfall given that it did not snow (he lies)

Bayes' Theorem

Bayes' Theorem is used to update the probability of snowfall based on the father's claim. The theorem is stated as:

P(S∣C) = P(C∣S) · P(S) / P(C)

Where:

- P(S∣C) is the posterior probability of snowfall given the claim
- P(C∣S) is the likelihood of the claim given snowfall
- P(S) is the prior probability of snowfall
- P(C) is the total probability of the father making the claim

Calculating the Posterior Probability

P(S∣C) = (1/6 × 1/8) / ((1/6 × 1/8) + (5/6 × 7/8)) = (1/48) / (36/48) = 1/36 ≈ 0.03
Analyzing the results:
The result of the program, which calculates the probability of actual snowfall given the father's claim, came out to be approximately 0.03, or 3%. Let's delve into what this means and the insights we can gain from it:
In summary, the probability of about 0.03 (3%) reflects a situation where a claim is made under questionable credibility, combined with an event (snowfall) that was initially unlikely. This result demonstrates the power of Bayesian analysis in updating beliefs and making decisions under uncertainty.
The Chocolate Example
Imagine a box containing 50% dark chocolates and 50% white chocolates. Half of the dark chocolates are wrapped in gold paper and the other half in silver paper. All of the white chocolates are wrapped in silver paper. A kid picks a chocolate from the box, and it is wrapped in silver paper. What is the probability that the picked one is dark chocolate?
To solve this problem using Bayes' theorem, we want to find the probability that the picked chocolate is dark, given that it is wrapped in silver paper P(D∣S).
Here's what we know:

- P(D) = 0.5: the probability of picking a dark chocolate
- P(W) = 0.5: the probability of picking a white chocolate
- P(S∣D) = 0.5: the probability of a silver wrapper given a dark chocolate
- P(S∣W) = 1: the probability of a silver wrapper given a white chocolate

Bayes' theorem states that:

P(D∣S) = P(S∣D) · P(D) / P(S)

where P(S) comes from the law of total probability:

P(S) = P(S∣D)·P(D) + P(S∣W)·P(W) = (0.5)(0.5) + (1)(0.5) = 0.25 + 0.5 = 0.75

Substituting the given probabilities into the equation gives us:

P(D∣S) = (0.5 × 0.5) / 0.75

Now, calculate the result:

P(D∣S) = 0.25 / 0.75
P(D∣S) = 1/3
Therefore, the probability that the chocolate is dark, given that it is wrapped in silver paper, is 1/3 or approximately 33.33%.
The result we obtained using Bayes’ theorem in the context of the chocolate problem is insightful in understanding how prior knowledge can update our beliefs about a particular event.
Bayes' theorem, in essence, provides a mathematical framework for updating our beliefs based on new evidence. It allows us to refine our predictions about the probability of an event by incorporating our existing knowledge (the prior) and the new evidence (the likelihood).
In the chocolate example, the prior knowledge is that there is a 50% chance of picking a dark chocolate at random from the box P(D). However, this probability is not very informative when we are presented with additional evidence: the chocolate we picked is wrapped in silver paper P(S).
Before considering the wrapping, the chances of picking a dark chocolate were equal to picking a white chocolate. But once we account for the fact that we've picked a silver-wrapped chocolate, we need to update this probability because not all chocolates are wrapped in silver. All white chocolates are in silver wrappers, but only half of the dark chocolates are. This shifts the likelihood toward a white chocolate because there are more silver-wrapped white chocolates than silver-wrapped dark chocolates.
By applying Bayes’ theorem, we integrate this new evidence to update our belief. The likelihood of picking a silver-wrapped chocolate if it’s dark is P(S∣D)=0.5, but the overall probability of picking a silver-wrapped chocolate is P(S)=0.75. This overall probability of picking a silver wrapper is greater because it includes all the white chocolates as well.
The updated probability P(D∣S)=1/3 tells us that, given the evidence that a silver-wrapped chocolate has been picked, there is a roughly 33.33% chance that this chocolate is dark. This is less than our initial belief P(D)=0.5, which makes sense intuitively: since there are more silver-wrapped white chocolates, picking a silver wrapper makes it less likely that we’ve picked a dark chocolate compared to our initial assumption before we knew the color of the wrapper.
Thus, Bayes' theorem crucially allows us to move from our prior (initial guess based on overall proportions) to our posterior (revised guess, taking the evidence into account), giving us a more accurate assessment of the situation based on all available information.
Considering the chocolate problem, we can use Python to calculate the posterior probability that the selected chocolate is dark, given that it is wrapped in silver paper.
# Probability of picking dark chocolate (prior)
P_D = 0.5
# Probability of picking silver paper chocolate
P_S = 0.75 # All chocolates are in silver paper except half of the dark ones
# Probability of silver paper given dark chocolate
P_S_D = 0.5
# Posterior probability of dark chocolate given silver paper
P_D_S = (P_S_D * P_D) / P_S
print(f"Probability that the chocolate is dark given it's wrapped in silver paper: {P_D_S:.2f}")
Probability that the chocolate is dark given it's wrapped in silver paper: 0.33
Bayesian Equation:

The basic equation of Bayesian inference is P(θ∣y) ∝ P(y∣θ) · P(θ), where:

- P(θ∣y) is the posterior distribution of the parameter θ given the data y
- P(y∣θ) is the likelihood of the data given θ
- P(θ) is the prior distribution of θ
Prior and Posterior Distributions
The concept of prior and posterior distributions is fundamental to Bayesian inference. The prior distribution encapsulates our beliefs about an unknown parameter before we have seen the data. The posterior distribution, on the other hand, is the updated belief after observing the data.
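As a minimal numerical sketch of this update, assuming an unknown success probability θ and 7 successes observed in 10 trials, the prior-to-posterior step can be computed on a grid:

```python
import numpy as np

# Candidate values for an unknown probability theta (grid approximation)
theta = np.linspace(0, 1, 101)

# Prior: uniform belief over theta before seeing any data
prior = np.ones_like(theta)
prior /= prior.sum()

# Data: 7 successes in 10 trials; binomial likelihood up to a constant
likelihood = theta**7 * (1 - theta)**3

# Posterior is proportional to likelihood times prior; normalize to sum to 1
posterior = likelihood * prior
posterior /= posterior.sum()

# The posterior concentrates near the observed success rate of 0.7
print(f"Most probable theta: {theta[np.argmax(posterior)]:.2f}")
```

Grid approximation like this works for any prior and likelihood, which makes it a useful mental model even when conjugate shortcuts (discussed next) are available.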
Conjugate Priors
In Bayesian statistics, conjugate priors are a way to simplify the process of updating beliefs with new evidence. A conjugate prior is a prior distribution that, when combined with a likelihood function from the same distribution family, yields a posterior distribution that is also from the same distribution family. This property makes the analytical computation of the posterior distribution much simpler.
Several common likelihood functions and their corresponding conjugate priors:

- Bernoulli / Binomial likelihood → Beta prior
- Poisson likelihood → Gamma prior
- Normal likelihood (known variance) → Normal prior
- Multinomial likelihood → Dirichlet prior
The use of conjugate priors simplifies the Bayesian updating process, because the posterior distribution can be derived algebraically from the prior and the likelihood without the need for numerical integration or advanced sampling methods. This makes the computation of the posterior distribution more manageable, especially in cases where there's a need for repeated updates with new data.
Beta and Binomial Distributions
The Beta distribution is often used as a conjugate prior for the Binomial distribution because of this simplification. In the case of the Beta distribution, the probability parameter is itself a random variable, while for the Binomial distribution, the probability of success is a fixed parameter.
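This conjugacy can be checked numerically: multiplying a Beta(2, 2) prior by the binomial likelihood for 7 heads in 10 flips and normalizing should reproduce the analytic Beta(9, 5) posterior. A quick sketch, assuming that prior and data:

```python
import numpy as np
from scipy.stats import beta, binom

# Prior Beta(2, 2); data: 7 heads in 10 flips
a, b = 2, 2
n, k = 10, 7

theta = np.linspace(0, 1, 2001)

# Numerical posterior: likelihood × prior, normalized on the grid
unnorm = binom.pmf(k, n, theta) * beta.pdf(theta, a, b)
numeric_post = unnorm / (unnorm.sum() * (theta[1] - theta[0]))

# Conjugate update: posterior is Beta(a + k, b + n - k) = Beta(9, 5)
analytic_post = beta.pdf(theta, a + k, b + n - k)

print(np.allclose(numeric_post, analytic_post, atol=1e-3))  # True
```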
Example for Binomial/Bernoulli Likelihood - Beta Distribution
Here’s an example involving the beta distribution as a conjugate prior for the binomial distribution:
Problem: Suppose you have an unfair (biased) coin and you are interested in estimating the probability p that the coin lands heads up. You don't know p, so you decide to model your beliefs about p using a beta distribution, which is the conjugate prior for the binomial likelihood function you will use.
Why Did We Use a Beta Distribution?

The Beta distribution is used as the prior in the coin flip problem for several key reasons:

1. Matching support: the Beta distribution is defined on the interval [0, 1], exactly the range a probability p can take.
2. Flexibility: by choosing α and β, it can express a uniform, symmetric, or skewed prior belief about p.
3. Conjugacy: it is the conjugate prior for the binomial likelihood, so the posterior is again a Beta distribution.
This graph shows a series of Beta distributions with various α (alpha) and β (beta) parameter values. The Beta distribution is a continuous probability distribution defined on the interval [0, 1], and is frequently used in Bayesian statistics to model the distribution of probability values.
The parameters α and β control the shape of the distribution:

- α = β = 1 gives the uniform distribution: no preference for any value
- α = β > 1 gives a symmetric distribution peaked at 0.5
- α > β shifts the mass toward 1, making higher probability values more likely
- α < β shifts the mass toward 0, making lower probability values more likely
- Larger values of α and β make the distribution more concentrated, expressing stronger prior confidence
In summary, this graph visually demonstrates how the choice of α and β parameters in the Beta distribution can express different prior beliefs about a probability value. This flexibility makes the Beta distribution very useful for Bayesian inference, especially for binary or proportion data like the rate of success in a series of experiments or the probability of an event occurring.
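These shapes can be reproduced with a short script; the (α, β) pairs chosen below are illustrative assumptions, not necessarily the exact curves from the original figure:

```python
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt

x = np.linspace(0.001, 0.999, 500)

# Illustrative (alpha, beta) pairs and the prior belief each expresses
shapes = [
    (1, 1, "uniform: no preference"),
    (2, 2, "symmetric, mild peak at 0.5"),
    (5, 2, "skewed toward 1"),
    (2, 5, "skewed toward 0"),
    (20, 20, "strong belief near 0.5"),
]

for a, b, label in shapes:
    plt.plot(x, beta.pdf(x, a, b), label=f"Beta({a},{b}): {label}")

plt.xlabel("Probability value")
plt.ylabel("Density")
plt.title("Beta distributions for different alpha and beta")
plt.legend()
plt.show()
```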
4. Analytical Tractability: Working with the Beta distribution in conjunction with the binomial likelihood allows for closed-form solutions for the posterior distribution's parameters. This is particularly advantageous computationally, as it avoids the need for numerical methods or simulation-based approaches like Markov Chain Monte Carlo (MCMC).
In the context of an unfair coin, where p is not known and you cannot assume it to be 0.5 (as you would with a fair coin), the Beta distribution lets you incorporate your prior beliefs about p. For instance, if you believe the coin is biased towards heads, you might choose a Beta distribution with α>β to reflect this belief.
After you observe data from flipping the coin (your empirical evidence), you can update your Beta distribution parameters by adding the number of observed heads to α and the number of tails to β. This results in a new Beta distribution that reflects both your prior beliefs and the evidence from the coin flips, giving you the posterior distribution for p.
Prior: You choose a beta distribution with parameters α and β (written as Beta(α,β)) to model your prior beliefs about p. Suppose you want to start with a relatively uniform prior, which expresses uncertainty about whether the coin is biased or not, so you choose α=2 and β=2.
Data: You flip the coin 10 times, resulting in 7 heads and 3 tails.
Likelihood: The likelihood of observing 7 heads in 10 tosses, if the true probability of heads is p, is given by the binomial distribution:

P(k = 7 ∣ n = 10, p) = C(10, 7) · p⁷ · (1 − p)³
Posterior: Using the conjugate prior property, you can calculate the posterior distribution analytically. The posterior parameters α′ and β′ for a beta distribution are updated as follows:

α′ = α + number of heads, β′ = β + number of tails

So after observing the data, the updated α′ and β′ are:

α′ = 2 + 7 = 9, β′ = 2 + 3 = 5
The posterior distribution is therefore Beta(α′,β′)=Beta(9,5).
This posterior distribution now represents your updated beliefs about the probability p after taking into account the evidence from your coin flips. It is more peaked around the ratio of heads to total flips, indicating increased confidence that p is closer to 0.7 than to 0.5. However, it still reflects some uncertainty due to the relatively small number of flips.
The advantage of conjugate priors is that you can continue to update your beliefs as you gather more data by simply updating the parameters α′ and β′ without having to perform complex integrations or use numerical methods to find the posterior distribution.
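As a sketch of how little code this update takes, using SciPy's frozen-distribution API (the particular summary statistics shown are illustrative choices):

```python
from scipy.stats import beta

# Conjugate update from the example: Beta(2, 2) prior + 7 heads, 3 tails
alpha_prior, beta_prior = 2, 2
heads, tails = 7, 3

posterior = beta(alpha_prior + heads, beta_prior + tails)  # Beta(9, 5)

# Summarize the updated belief about p
print(f"Posterior mean: {posterior.mean():.3f}")  # 9/14 ≈ 0.643
low, high = posterior.interval(0.95)              # central 95% credible interval
print(f"95% credible interval: ({low:.3f}, {high:.3f})")
```

Gathering more data just means adding the new heads and tails counts to the same two parameters; no re-derivation is needed.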
Example 2:
Let’s dig into the concept of prior and posterior distributions through an example.
Prior Distribution
Imagine you are a scientist studying a new drug meant to reduce blood pressure. Before any tests are done, based on previous studies and theoretical knowledge, you have some belief about the effectiveness of this drug. Perhaps you believe there is a 70% chance that this drug can effectively reduce blood pressure in patients with hypertension.
To represent this belief in a Bayesian framework, you would use a prior distribution. A common distribution to use for a probability like this is the Beta distribution because it is defined on the interval [0, 1], which is suitable for probabilities.
You might choose a Beta distribution with parameters that reflect your belief. In this case, Beta(α = 7, β = 3) might be a reasonable choice because its expected value, α / (α + β) = 7 / 10 = 0.7, reflects your belief that there is a 70% chance of the drug being effective.
Posterior Distribution
Now, let’s say you run a clinical trial with 100 patients and find that 80 of them show a significant reduction in blood pressure after taking the drug. With this data, you want to update your belief about the drug's effectiveness.
In a Bayesian context, this update is done by calculating the posterior distribution. The data you collected (80 out of 100) will be modeled as a likelihood function, which in this case could be a binomial likelihood because you're counting the number of successes out of a fixed number of trials.
Given that the Beta distribution is a conjugate prior for the binomial likelihood, you can easily update your Beta parameters. The posterior α (alpha) will be your prior α plus the number of successes (80), and the posterior β (beta) will be your prior β plus the number of failures (20).
Note: The Beta distribution is used as the prior in this scenario because of its mathematical convenience and its relevance to problems involving proportions or probabilities.
So, your posterior distribution is Beta(αposterior=7+80,βposterior=3+20) which simplifies to Beta(αposterior=87,βposterior=23).
Interpretation of the Posterior Distribution
The posterior distribution represents your updated belief about the drug's effectiveness after considering the new evidence from the clinical trial. In this case, the posterior distribution is centered around 87 / (87 + 23) ≈ 0.79, which suggests that, given the trial results, you now believe there's a 79% chance of the drug being effective.
Insights and Decisions
The shift from the prior to the posterior distribution is a crucial part of Bayesian inference. It quantifies how evidence changes our beliefs. In practice, the posterior distribution can then be used for decision-making. For example, if the posterior probability that the drug is effective exceeds a certain threshold, it might be considered for approval by medical authorities.
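Such a threshold rule can be read directly off the posterior. In the sketch below, the 0.7 approval threshold is a hypothetical assumption for illustration:

```python
from scipy.stats import beta

# Posterior from the trial: Beta(7 + 80, 3 + 20) = Beta(87, 23)
posterior = beta(87, 23)

threshold = 0.7  # hypothetical approval threshold

print(f"Posterior mean effectiveness: {posterior.mean():.3f}")  # 87/110 ≈ 0.791
print(f"P(effectiveness > {threshold}): {1 - posterior.cdf(threshold):.3f}")
```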
This Bayesian approach contrasts with frequentist statistics, where you would use the data to perform hypothesis testing and calculate p-values without considering prior beliefs. In Bayesian inference, both prior beliefs and new data are integrated to inform the final analysis.
Example 3: Bayesian Estimation with Beta-Binomial Conjugacy
Let's say we have a coin and we want to estimate the probability of it landing heads (H) using a Beta prior and updating this belief as we observe new data (coin flips).
from scipy.stats import beta
import matplotlib.pyplot as plt
import numpy as np
# Prior belief: Beta distribution with alpha=2, beta=2 (symmetric belief)
alpha_prior = 2
beta_prior = 2
# We flip the coin 10 times and observe 6 heads
heads, tails = 6, 4
# Update the prior to get the posterior distribution
alpha_posterior = alpha_prior + heads
beta_posterior = beta_prior + tails
# Plot the prior and posterior Beta distributions
x = np.linspace(0, 1, 100)
plt.plot(x, beta.pdf(x, alpha_prior, beta_prior), label='Prior')
plt.plot(x, beta.pdf(x, alpha_posterior, beta_posterior), label='Posterior')
plt.title('Beta Distributions of Prior and Posterior')
plt.xlabel('Probability of Heads')
plt.ylabel('Density')
plt.legend()
plt.show()
In the plot, you observe that the peak of the posterior distribution has shifted towards the observed relative frequency of heads, reflecting our updated belief about the probability of the coin landing heads up.
Performing Bayesian Estimation in Python
With these concepts in mind, Bayesian estimation can be performed in Python using probabilistic programming libraries like PyMC (formerly PyMC3) or Stan. These libraries allow for more complex models and inferences, handling the computational complexity under the hood.
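To give a flavor of what such libraries do under the hood, here is a hand-rolled random-walk Metropolis sampler for the earlier coin example; this is a didactic sketch in plain NumPy, not PyMC or Stan code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: 7 heads in 10 flips; Beta(2, 2) prior on the heads probability p
heads, tails = 7, 3
a, b = 2, 2

def log_posterior(p):
    """Unnormalized log posterior: binomial log-likelihood + Beta log-prior."""
    if p <= 0.0 or p >= 1.0:
        return -np.inf
    return (heads + a - 1) * np.log(p) + (tails + b - 1) * np.log(1 - p)

# Random-walk Metropolis: propose a nearby p, accept with the Metropolis rule
samples = []
p = 0.5
for _ in range(20000):
    proposal = p + rng.normal(0, 0.1)
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(p):
        p = proposal
    samples.append(p)

draws = np.array(samples[2000:])  # discard burn-in
print(f"MCMC posterior mean: {draws.mean():.3f}")  # analytic answer is 9/14 ≈ 0.643
```

For this conjugate model the sampler is overkill, but the same loop works for models with no closed-form posterior, which is exactly where probabilistic programming libraries earn their keep.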
Summary
Bayesian estimation offers a powerful framework for understanding uncertainty and making decisions under uncertainty. By starting with prior beliefs and systematically updating these beliefs as new data becomes available, Bayesian methods can provide more nuanced and adaptive inferences than traditional frequentist statistics. The use of conjugate priors, where applicable, simplifies computations and is especially useful when real-time updates to the probability distributions are required.
=====================
Practice Problem 1: Medical Diagnostic Test
Scenario:
Imagine you are evaluating a new diagnostic test for a disease that is present in 1% of the population. The test has a 95% probability of correctly identifying an individual with the disease (true positive rate). However, the test also has a 2% probability of incorrectly indicating the disease in an individual who doesn't have it (false positive rate).
Task:
A patient takes the test and receives a positive result. Use Bayesian estimation to calculate the probability that the patient actually has the disease, given the positive test result.
Given:

- P(Disease) = 0.01: the prevalence of the disease
- P(Positive ∣ Disease) = 0.95: the true positive rate
- P(Positive ∣ No Disease) = 0.02: the false positive rate

To Calculate:

P(Disease ∣ Positive): the probability that the patient has the disease given a positive test result.
=======
Solution:
Let’s walk through the problem step-by-step. We'll apply Bayes' Theorem to find the posterior probability, which is the probability that the patient actually has the disease, given that they tested positive.
Given:

- P(Disease) = 0.01
- P(Positive ∣ Disease) = 0.95
- P(Positive ∣ No Disease) = 0.02
We want to find P(Disease∣Positive), the probability that the patient has the disease given a positive test result.
We can calculate it using Bayes' Theorem:

P(Disease ∣ Positive) = P(Positive ∣ Disease) · P(Disease) / P(Positive)
We have everything except P(Positive), which is the total probability of testing positive. We can calculate this using the Law of Total Probability:

P(Positive) = P(Positive ∣ Disease) · P(Disease) + P(Positive ∣ No Disease) · P(No Disease)

Where:

- P(No Disease) = 1 − P(Disease) = 0.99
Let’s calculate this in Python:
# Given probabilities
P_Disease = 0.01
P_Positive_Disease = 0.95
P_Positive_NoDisease = 0.02
P_NoDisease = 1 - P_Disease
# Total probability of testing positive
P_Positive = (P_Positive_Disease * P_Disease) + (P_Positive_NoDisease * P_NoDisease)
# Posterior probability of having the disease given a positive test result
P_Disease_Positive = (P_Positive_Disease * P_Disease) / P_Positive
print(f"The probability of the patient having the disease given a positive test result: {P_Disease_Positive:.4f}")
The probability of the patient having the disease given a positive test result: 0.3242
Now, let’s plot the prior and posterior probabilities to visualize how our belief about the presence of the disease changes after we receive a positive test result.
import matplotlib.pyplot as plt
# Probabilities for the plot
probabilities = {
'Prior': P_Disease,
'Posterior': P_Disease_Positive
}
# Create bar plot
plt.bar(probabilities.keys(), probabilities.values(), color=['blue', 'green'])
# Adding the text labels on the bars
for i, value in enumerate(probabilities.values()):
    plt.text(i, value, f"{value:.4f}", ha='center', va='bottom')
plt.title('Prior vs Posterior Probability of Disease')
plt.ylabel('Probability')
plt.ylim(0, 1)
plt.show()
This diagram shows the stark difference between our initial belief (prior) and updated belief (posterior) regarding the likelihood of the patient having the disease.
The plot shows that despite a positive test result, the actual chance of having the disease is still not extremely high due to the low prior probability (prevalence) of the disease and the test's false positive rate. This is a classic example of how Bayesian updating can yield counterintuitive but more accurate results by incorporating prior knowledge and evidence.
Additional Resources: