Mastering Probability Distributions: A Beginner’s Guide
Amarendra Nayak
Data Analyst | Power BI Developer | DAX | Turning Raw Data into Actionable Insights | Data Enthusiast | Visualization & Analytics Expert | Power BI | SQL | Python | Immediate Joiner
When analyzing data, it’s important to know its shape or distribution. Why? Because it tells us how the data behaves, helping us choose the right analysis techniques. In this blog, we’ll explore the most common types of data distributions and how to recognize them.
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Distribution:
At its core, distribution refers to the way probabilities or frequencies are shared among various data points or outcomes in a random process. It gives us insight into how the total probability of an event is distributed across different possibilities in a random experiment.
Here’s a breakdown of the key concepts that define distribution:
Random Variable
A random variable is a numerical representation of the outcomes from a random process. It helps us quantify uncertainty. For example, in tossing a coin, the random variable might represent the outcomes “Heads” (1) or “Tails” (0).
Probability Mass Function (PMF)
For discrete distributions, the PMF assigns a specific probability to each individual outcome. Think of it as a map showing how likely each distinct value is. For instance, the roll of a six-sided die has a PMF where each outcome (1 through 6) has a probability of 1/6.
Probability Density Function (PDF)
For continuous distributions, we can’t pinpoint exact probabilities for specific outcomes (as there are infinite possibilities). Instead, the PDF gives us a density of probabilities over a range of values. For example, the heights of people in a population might follow a bell-shaped curve, where the PDF shows which height ranges are more common.
Cumulative Distribution Function (CDF)
The CDF represents the probability that a random variable is less than or equal to a given value. It’s a way of accumulating probabilities. For example, in a dice roll, the CDF at 3 would be the probability of rolling a 1, 2, or 3, which is 3/6 = 0.5.
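The die example can be sketched in a few lines of Python. This is only an illustration: the names `pmf` and `cdf` are chosen here and do not come from any library, and exact fractions are used so the results match the hand calculation.

```python
from fractions import Fraction

# PMF of a fair six-sided die: each face has probability 1/6
faces = range(1, 7)
pmf = {face: Fraction(1, 6) for face in faces}

def cdf(x):
    """P(X <= x): accumulate the PMF over all faces up to and including x."""
    return sum(p for face, p in pmf.items() if face <= x)

print(sum(pmf.values()))  # 1   (the probabilities add up to one)
print(cdf(3))             # 1/2 (rolling a 1, 2, or 3)
```

The CDF is simply a running total of the PMF, which is why it reaches 1 at the largest outcome.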
Why Is This Important?
Understanding distribution helps us uncover patterns in data, make predictions, and quantify uncertainty in decision-making. Whether you’re analyzing sales trends, studying customer behavior, or building machine learning models, distribution is a cornerstone concept in data analysis and statistics.
By breaking down these characteristics, we can better grasp how probability flows through data, enabling us to draw meaningful insights and make informed decisions.
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Types of Probability Distributions:
Discrete Probability Distributions:
A discrete probability distribution describes the probabilities of the outcomes of a random variable that can take on a finite or countable number of distinct values. It provides a way to model situations where outcomes are discrete (e.g., integers, whole numbers).
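Key Features:
1. Countable Outcomes: The random variable takes a finite or countable set of distinct values.
2. Probability Function: Assigns a probability to each possible value of the random variable.
3. Sum of Probabilities: The probabilities of all possible values sum to 1: ∑P(X=x)=1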
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Bernoulli Distribution:
The Bernoulli distribution is a discrete probability distribution that takes a binary outcome, typically 1 for success and 0 for failure. It can be used to model the probability of success or failure in a single experiment or trial.
Formula
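The Probability Mass Function (PMF) of a Bernoulli random variable K is:
P(K = k) = p^k * (1 - p)^(1 - k), for k ∈ {0, 1}
Where:
p is the probability of success (K = 1)
1 - p is the probability of failure (K = 0)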
Example:
1. What is the probability of getting an even number when rolling a die?
Random experiment (RE): Rolling a die
Total possible outcomes: {1,2,3,4,5,6}
Favorable outcomes: {2,4,6}
Probability: 3/6
= 0.5
2. A coin is flipped once. The probability of getting Heads, P(Heads), is 0.5, and the probability of getting Tails, P(Tails), is also 0.5.
3. Here is example code that plots a Bernoulli distribution with a success probability of 0.6:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import bernoulli
# Define the success probability p
p = 0.6
# Create a Bernoulli distribution object
dist = bernoulli(p)
# Generate some random samples
samples = dist.rvs(size=1000)
# Calculate the probability mass function for all possible outcomes
x = np.arange(2)
pmf = dist.pmf(x)
# Plot the probability mass function
fig, ax = plt.subplots()
ax.stem(x, pmf)
ax.set_xlabel('Outcome')
ax.set_ylabel('Probability')
ax.set_title('Bernoulli Distribution (p=0.6)')
plt.show()
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Binomial Distribution:
The binomial distribution represents a discrete probability distribution that applies to experiments characterized by two mutually exclusive outcomes, often called Bernoulli trials. It is utilized in sequences of independent trials where only two possible outcomes exist.
The Binomial Distribution has four attributes:
1. The experiment consists of a fixed number of trials, n.
2. Every trial can result in one of two possible outcomes.
3. The probability of success in any given trial is constant, denoted by p, which implies that the probability of failure is consistently q = 1 - p.
4. Trials are mutually independent, meaning the outcome of one trial does not influence the outcome of another.
Formula
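The probability of observing exactly k successes in n trials is given by the Probability Mass Function (PMF):
P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
for k = 0, 1, 2, …, n, where C(n, k) = n! / (k! * (n - k)!) is the number of ways to choose k successes from n trials.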
Example:
1. A coin has a probability of 0.3 of landing on heads. If the coin is flipped 8 times, find the probability of getting exactly 3 heads.
p = 0.3, n = 8, k = 3
P(X = 3) = 8! / ((8 - 3)! * 3!) * (0.3)^3 * (0.7)^5
= 56 * (0.3)^3 * (0.7)^5
≈ 0.2541
2. Consider a coin flip where the probability of heads is 1/2 and the probability of tails is also 1/2. If the coin is tossed five times, what is the probability of getting two heads and three tails?
from math import comb
N = 5
X = 2
p_heads = 1/2
p_tails = 1/2
# The binomial probability formula
# P(X) = C(N, X) * (p^X) * ((1-p)^(N-X))
probability = comb(N, X) * (p_heads ** X) * (p_tails ** (N - X))
# result: 0.3125
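With p = 0.5, as in this example, the binomial distribution chart is symmetric. For any other value of p it is skewed: to the right when p < 0.5 and to the left when p > 0.5.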
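Whether a binomial chart is symmetric depends on p. The following minimal sketch checks this numerically and also reproduces the two worked examples above. The helper names `binomial_pmf` and `is_symmetric` are illustrative and not part of any library.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def is_symmetric(n, p, tol=1e-12):
    """True when P(X = k) equals P(X = n - k) for every k from 0 to n."""
    return all(
        abs(binomial_pmf(k, n, p) - binomial_pmf(n - k, n, p)) < tol
        for k in range(n + 1)
    )

print(round(binomial_pmf(3, 8, 0.3), 4))  # 0.2541 (example 1)
print(binomial_pmf(2, 5, 0.5))            # 0.3125 (example 2)
print(is_symmetric(5, 0.5))               # True
print(is_symmetric(8, 0.3))               # False
```

With p = 0.3 most of the probability sits on small values of k, so the chart leans to the left and has a longer tail on the right.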
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Poisson Distribution:
The Poisson Distribution models the number of times an event occurs in a fixed interval of time, space, or other continuous domain, provided that the events occur independently and at a constant average rate.
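Formula:
P(X = x) = (λ^x * e^(-λ)) / x!, for x = 0, 1, 2, …
e: the base of the natural logarithm (approximately 2.71828)
x: the number of occurrences, i.e. the value taken by the Poisson random variable
λ: the average number of occurrences per interval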
The Poisson distribution applies under certain conditions: the events occur independently of one another and at a constant average rate over the interval.
In a Poisson distribution, the mean is E(X) = λ.
The mean and the variance are equal: E(X) = V(X) = λ, where V(X) is the variance.
Example:
The number of customers arriving at a store follows a Poisson distribution with a mean of 5 customers per hour. Find the probability that exactly 3 customers will arrive in the next hour.
λ = 5, x = 3
P(X = 3) = 5^3 * e^(-5) / 3! ≈ 0.1404
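The store example can be evaluated directly from the Poisson formula. The sketch below uses only the standard library. The function name `poisson_pmf` is illustrative, and cutting the sums off at 60 is an assumption made here (the probability beyond that point is negligible for λ = 5).

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with average rate lam."""
    return lam**x * exp(-lam) / factorial(x)

lam = 5  # customers per hour, from the example
p_three = poisson_pmf(3, lam)
print(round(p_three, 4))  # 0.1404

# Mean and variance computed from the PMF; both should be close to lam
support = range(60)  # assumed cut-off for the infinite sum
mean = sum(x * poisson_pmf(x, lam) for x in support)
variance = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in support)
print(mean, variance)
```

Both printed values land on λ up to floating-point error, which illustrates the property E(X) = V(X).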
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Continuous Probability Distributions:
A continuous probability distribution describes the probabilities of the outcomes of a random variable that can take on an infinite number of values within a given range. These distributions are used when the random variable is continuous, meaning it can assume any value within an interval.
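Key Features:
1. Uncountable Outcomes: The random variable can take any value within an interval.
2. Probability Density Function (PDF): Describes how densely probability is spread over the values; the probability of a range is the area under the PDF over that range.
3. Total Area Under the Curve: The total area under the PDF curve is equal to 1.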
1. Uniform Distribution:
The uniform distribution is a probability distribution where each value within a certain range is equally likely to occur and values outside of the range never occur. If we make a density plot of a uniform distribution, it appears flat because no value is any more likely (and hence has any more density) than another.
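Properties:
A discrete uniform distribution is a symmetric distribution with the following property: if a random variable X follows a discrete uniform distribution over k distinct values x1, x2, x3, …, xk, then the PMF of X is
P(X = xi) = 1/k, for i = 1, 2, …, k
Formula:
For a continuous uniform distribution on the interval [a, b], the PDF is
f(x) = 1 / (b - a), for a ≤ x ≤ b, and f(x) = 0 otherwise
Mean and Variance:
Mean: E(X) = (a + b) / 2
Variance: V(X) = (b - a)^2 / 12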
Example:
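A bus arrives at a random time between 10:00 AM and 10:20 AM, with every moment equally likely. Find the probability that it arrives between 10:05 and 10:10. The favorable window is 5 minutes out of 20, so the probability is 5/20 = 0.25.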
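The bus example can be worked through with the CDF of a continuous uniform distribution. In this sketch, time is measured in minutes after 10:00 AM, and `uniform_cdf` is an illustrative helper written for this example, not a library function.

```python
def uniform_cdf(x, a, b):
    """P(X <= x) for a continuous uniform distribution on [a, b]."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

# The bus arrives uniformly between minute 0 (10:00) and minute 20 (10:20)
a, b = 0, 20
p_between = uniform_cdf(10, a, b) - uniform_cdf(5, a, b)
print(p_between)  # 0.25
```

Subtracting two CDF values gives the probability of an interval, which for a uniform distribution is just the interval's share of the total range.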
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
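2. Exponential Distribution
The exponential distribution models the time between successive events that occur independently at a constant average rate λ. For example, if the mean rate of messages per hour, λ, is 240, then the average time between 2 messages would be (1/240) hrs = (3600/240) seconds = 15 seconds.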
The probability density function (pdf) for an exponential distribution is given by the equation:
f(x; λ) = λ * e^(-λx), for x ≥ 0
where:
λ = rate at which an event occurs
x = random variable (time between 2 events)
f(x; λ) = probability density of the time between 2 events at x units
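To make the rate and the waiting time concrete, here is a small sketch built on the message example. The closed-form CDF, 1 - e^(-λx), is a standard result that the text above does not state, and the helper names are illustrative.

```python
from math import exp

def expon_pdf(x, lam):
    """Density f(x; lam) = lam * exp(-lam * x) for x >= 0, else 0."""
    return lam * exp(-lam * x) if x >= 0 else 0.0

def expon_cdf(x, lam):
    """P(X <= x) = 1 - exp(-lam * x) for x >= 0, else 0."""
    return 1 - exp(-lam * x) if x >= 0 else 0.0

rate_per_hour = 240
mean_gap_seconds = 3600 / rate_per_hour   # average time between messages
rate_per_second = rate_per_hour / 3600    # lambda expressed per second

print(mean_gap_seconds)  # 15.0

# Probability that the next message arrives within the average gap
p_within_mean = expon_cdf(mean_gap_seconds, rate_per_second)
print(round(p_within_mean, 4))  # 0.6321
```

More than half of the gaps are shorter than the average gap, which is one consequence of the long right tail.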
Let’s plot an exponential distribution using Python:
Q. Plot exponential distributions given that the average time between two successive messages is 50, 60 and 80 seconds.
from scipy.stats import expon
import matplotlib.pyplot as plt
import seaborn as sns
#When average time between 2 messages is 50 seconds
data1 = expon.rvs(scale=50, size=10000)
#When average time between 2 messages is 60 seconds
data2 = expon.rvs(scale=60, size=10000)
#When average time between 2 messages is 80 seconds
data3 = expon.rvs(scale=80, size=10000)
#Plot sample data
sns.kdeplot(x=data1, fill=True, label='1/lambda=50')
sns.kdeplot(x=data2, fill=True, label='1/lambda=60')
sns.kdeplot(x=data3, fill=True, label='1/lambda=80')
plt.xlabel('Units of time between successive events')
plt.ylabel('Density')
plt.title('Exponential Distribution')
plt.legend()
plt.xlim(0, 200)
plt.show()
The exponential distribution is strongly right-skewed, which pulls the mean to the right of the peak at zero. As (1/λ) increases, the curve flattens and stretches out, so the mean moves further away from the peak; the skewness coefficient itself stays fixed at 2 for every λ.
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
3. Normal Distribution
How to check:
Fun fact: The Central Limit Theorem says that if you repeatedly draw sufficiently large samples, their means will approximately follow a normal distribution, even if the underlying data isn’t normally distributed (as long as it has finite variance)!
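A quick simulation can make the Central Limit Theorem tangible. The sketch below draws from a heavily skewed exponential distribution and looks at the means of repeated samples. The sample size of 50, the 5000 repetitions, and the fixed seed are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

SAMPLE_SIZE = 50     # assumed size of each sample
NUM_SAMPLES = 5000   # assumed number of repeated samples

def sample_mean(n):
    """Mean of n draws from an exponential distribution with mean 1."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

means = [sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]
center = statistics.fmean(means)
spread = statistics.stdev(means)

# For a normal curve, about 68% of values fall within one standard deviation
share_within_one_sd = sum(abs(m - center) <= spread for m in means) / NUM_SAMPLES

print(center)               # close to 1.0, the mean of the raw data
print(spread)               # close to 1 / sqrt(50), roughly 0.141
print(share_within_one_sd)  # close to 0.68
```

Even though individual exponential draws are far from bell-shaped, the sample means cluster symmetrically around the true mean.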
Here is an example Python code that generates a dataset with a normal distribution and plots the histogram of the data using the matplotlib library:
import numpy as np
import matplotlib.pyplot as plt
# Generate a dataset with a normal distribution
mean = 5
std_dev = 2
data = np.random.normal(mean, std_dev, 1000)
# Plot the histogram of the data
plt.hist(data, bins=20)
plt.xlabel('Data')
plt.ylabel('Frequency')
plt.title('Normal Distribution')
plt.show()
In this code, we first set the mean and standard deviation of the distribution to be 5 and 2, respectively, and then use the numpy.random.normal() function to generate a dataset of 1000 data points that follow a normal distribution with these parameters. Finally, we plot the histogram of the data using the plt.hist() function from the matplotlib library, which shows the shape of the distribution as a bell curve.
The probability density function (PDF) of the normal distribution is given by the following mathematical equation:
f(x) = (1 / (σ√(2π))) * e^(-((x - μ)^2) / (2σ^2))
where:
μ is the mean of the distribution
σ is the standard deviation of the distribution
This equation describes the shape of the normal distribution curve, which is symmetrical around the mean value. The parameter σ determines how spread out the curve is, with smaller values of σ resulting in a narrower, more peaked curve, and larger values of σ resulting in a flatter, more spread-out curve. The parameter μ determines the location of the curve along the x-axis.
The cumulative distribution function (CDF) of the normal distribution is given by the following equation:
F(x) = 1/2 * [1 + erf((x-μ)/(σ√2))]
where erf is the error function, and μ and σ are the mean and standard deviation of the distribution.
The CDF describes the probability that a random variable from the normal distribution is less than or equal to a certain value x. It is useful for calculating percentiles and probabilities of events in a normal distribution.
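The PDF and CDF formulas translate almost directly into Python. The sketch below implements both with the standard library and compares them against `statistics.NormalDist` as a reference; the function names `normal_pdf` and `normal_cdf` are illustrative.

```python
from math import erf, exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def normal_cdf(x, mu, sigma):
    """F(x) = 1/2 * [1 + erf((x - mu) / (sigma * sqrt(2)))]."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, sigma = 5, 2  # same parameters as the histogram example
reference = NormalDist(mu, sigma)

print(normal_cdf(mu, mu, sigma))  # 0.5 (half the probability lies below the mean)
print(abs(normal_pdf(6, mu, sigma) - reference.pdf(6)) < 1e-12)  # True
print(abs(normal_cdf(7, mu, sigma) - reference.cdf(7)) < 1e-12)  # True
```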
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
4. Standard Normal Distribution:
A Standard Normal Variate (Z) is a standardized form of the normal distribution with mean = 0 and standard deviation = 1.
Z = (X - μ) / σ
where, μ is the mean of X and σ is the standard deviation of X.
The standard normal variate (Z-score) is important in statistics for a number of reasons:
Here’s an example of how the Z-score can be used in statistics:
Suppose we have a dataset of students’ exam scores, and we want to compare the scores of two different classes to see which one performed better. However, the scores in the two classes are measured on different scales, with one class having a mean of 70 and a standard deviation of 10, and the other class having a mean of 75 and a standard deviation of 5.
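One way to carry out this comparison is to convert a score from each class into a Z-score. In the sketch below, the raw scores 85 and 84 are hypothetical values chosen for illustration; only the class means and standard deviations come from the example.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above the mean."""
    return (x - mean) / std_dev

# Hypothetical students: one scored 85 in the first class, one 84 in the second
z_first = z_score(85, mean=70, std_dev=10)
z_second = z_score(84, mean=75, std_dev=5)

print(z_first)   # 1.5
print(z_second)  # 1.8
```

Although 85 is the higher raw score, the student in the second class stands further above their class average once the different scales are taken into account.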
Here’s another example to demonstrate how the Z-score can be used to calculate probabilities:
Suppose that the weight of a certain population of dogs follows a normal distribution with a mean of 30 kilograms and a standard deviation of 5 kilograms. We want to find the probability of selecting a dog from this population that weighs between 25 and 35 kilograms.
Example :
Suppose the heights of adult males in a certain population follow a normal distribution with a mean of 68 inches and a standard deviation of 3 inches. What is the probability that a randomly selected adult male from this population is taller than 72 inches?
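Both probability questions above (the dog weights and the adult heights) can be answered by standardizing to Z and reading the standard normal CDF. This is a minimal sketch using `statistics.NormalDist` from the standard library; the helper `to_z` is an illustrative name.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def to_z(x, mu, sigma):
    """Standardize x using Z = (X - mu) / sigma."""
    return (x - mu) / sigma

# Dogs: P(25 < X < 35) with mean 30 kg and standard deviation 5 kg
p_dog = (standard_normal.cdf(to_z(35, 30, 5))
         - standard_normal.cdf(to_z(25, 30, 5)))
print(round(p_dog, 4))  # 0.6827

# Heights: P(X > 72) with mean 68 in and standard deviation 3 in
p_tall = 1 - standard_normal.cdf(to_z(72, 68, 3))
print(round(p_tall, 4))  # 0.0912
```

The dog interval spans exactly one standard deviation on each side of the mean, so about 68% of dogs fall in it, while roughly 9% of men are taller than 72 inches.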
Normal distribution has several properties that make it useful in statistical analysis. Here are some of the key properties of the normal distribution:
5. Z-scores: Z-scores are a way to standardize data using the mean and standard deviation of a normal distribution. Z-scores tell us how many standard deviations away from the mean a given data point is, and they can be used to calculate probabilities.
These properties make the normal distribution a useful tool for statistical analysis, as it allows us to make predictions and draw conclusions about large datasets.
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
5. Lognormal Distribution
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
6. Pareto Distribution
Example:
How to check:
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
How to Visualize These Distributions
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Why Do Distributions Matter?
Understanding distributions helps you:
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Conclusion:
Data distributions aren’t just theory — they’re the key to understanding your dataset. Whether your data follows a normal curve, decays exponentially, or adheres to Pareto’s principle, identifying its distribution helps you unlock powerful insights.