Path to Data science - Zero to Hero Series 1 - Week2
RAHUL Muralidharan
Datascience enthusiastic | IIM Calcutta | IIT Patna | Data governance | Communication and Trade Surveillance
Let's start our Week 2 exploration of distributions.
This week on our journey through data analysis, we're diving deeper into the world of distributions! Last time, we explored the significance of inference, the importance of statistics in data analysis, and touched on the probability of data. If you missed out on the previous article, be sure to catch up before delving into this week's content. Stay tuned for an enlightening exploration of different types of distributions and how they shape our understanding of data. Don't miss out! #DataAnalysis #Statistics #Probability #Distributions
Normal Distribution:
Data can be "distributed" (spread out) in different ways such as below.
But there are many cases where the data tends to be around a central value with no bias left or right, and it gets close to a "Normal Distribution" like this: It is often called a "Bell Curve" because it looks like a bell.
The normal distribution, also called the Gaussian distribution, de Moivre distribution, or “bell curve,” is a probability distribution that is symmetric about a central mean: half of data falls to the left of the mean and half falls to the right. The bulk of data are clustered around the mean, which results in a bell-shaped curve when graphed.
The Normal Distribution has:
• mean = median = mode
• symmetry about the center
• 50% of values less than the mean and 50% greater than the mean
Many real-world measurements closely follow a Normal Distribution.
The Standard Deviation is a measure of how spread out numbers are.
In the above image:
The empirical rule tells us that:
• 68% of data falls within one standard deviation of the mean.
• 95% of data falls within two standard deviations of the mean.
• 99.7% of data falls within three standard deviations of the mean.
The empirical rule tells you what percentage of your data falls within a certain number of standard deviations from the mean. The standard deviation controls the spread of the distribution. A smaller standard deviation indicates that the data is tightly clustered around the mean, resulting in a taller and thinner normal distribution. A larger standard deviation indicates that the data is spread out around the mean; the normal distribution will be flatter and wider.
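The percentages in the empirical rule can be checked numerically. The sketch below is only an illustration and assumes a perfectly normal variable; the helper name prob_within is made up for this example. It uses the standard identity P(|Z| < k) = erf(k / sqrt(2)) for a standard normal Z.

```python
import math

def prob_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    # For a standard normal Z, P(|Z| < k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```

The rounded figures 68%, 95% and 99.7% come straight from these values.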
Binomial Distribution:
A classic example of a probability distribution is the binomial distribution. It describes the probability of outcomes when each trial has only two possible, mutually exclusive results (success or failure).
The probability distribution becomes a binomial probability distribution when it meets the following requirements.
Binomial Distribution Formula in Probability
The formula for the binomial probability distribution is as stated below:
$P(x) = {}^{n}C_{x} \, p^{x} (1 - p)^{n - x}$ (or) $P(r) = \frac{n!}{r!\,(n - r)!} \, p^{r} (1 - p)^{n - r}$
Where,
n = Total number of trials
r (or) x = Total number of successful trials.
p = Probability of success on a single trial.
${}^{n}C_{r} = \frac{n!}{r!\,(n - r)!}$
1 - p = Probability of failure on a single trial.
Let’s work out an example:
The probability that a person can achieve a target is 3/4. The count of tries is 5. What is the probability that he will attain the target at least thrice?
Solution:
Given that p = 3/4, q = 1 - p = 1/4, n = 5.
Using the binomial distribution formula, we get $P(X = x) = {}^{n}C_{x} \, p^{x} (1 - p)^{n - x}$
Thus, the required probability is: $P(X = 3) + P(X = 4) + P(X = 5)$
$= {}^{5}C_{3} (3/4)^{3} (1/4)^{2} + {}^{5}C_{4} (3/4)^{4} (1/4)^{1} + {}^{5}C_{5} (3/4)^{5}$
$= 270/1024 + 405/1024 + 243/1024 = 918/1024$
$= 459/512$.
Therefore, the probability that the person will attain the target at least thrice is 459/512 (about 0.8965).
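The same answer can be reproduced in a few lines of code. This is a minimal sketch, not part of the original worked solution; the function name binomial_pmf is chosen for this example, and exact fractions are used so the result can be compared with 459/512 directly.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

p = Fraction(3, 4)  # probability of hitting the target on one try
n = 5               # number of tries

# "At least thrice" means 3, 4 or 5 successes.
at_least_three = sum(binomial_pmf(x, n, p) for x in range(3, n + 1))
print(at_least_three, float(at_least_three))  # 459/512 0.896484375
```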
Another example: Number of Fraudulent Transactions
Banks use the binomial distribution to model the probability that a certain number of credit card transactions are fraudulent.
For example, suppose it is known that 2% of all credit card transactions in a certain region are fraudulent. If there are 50 transactions per day in a certain region, we can use a Binomial Distribution Calculator to find the probability that more than a certain number of fraudulent transactions occur in a given day:
P(X > 1 fraudulent transaction) = 0.26423
P(X > 2 fraudulent transactions) = 0.07843
P(X > 3 fraudulent transactions) = 0.01776
And so on.
This gives banks an idea of how likely it is that more than a certain number of fraudulent transactions will occur in a given day.
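If a calculator is not at hand, the tail probabilities above can be approximated with a short script. This sketch assumes the same figures as the example (50 transactions, 2% fraud rate, independent transactions); the helper names are illustrative.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def prob_more_than(k, n, p):
    """P(X > k), computed as 1 minus P(X <= k)."""
    return 1.0 - sum(binomial_pmf(x, n, p) for x in range(k + 1))

n, p = 50, 0.02  # transactions per day, fraud rate
for k in (1, 2, 3):
    print(f"P(X > {k}) = {prob_more_than(k, n, p):.5f}")
```

The printed values should agree with the figures listed above to five decimal places.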
Properties of Binomial Distribution
Shape of Binomial Distribution
Binomial Distribution may be symmetrical or skewed. If the probability of success, p, is equal to 0.5, then the binomial distribution would be symmetrical, regardless of the value of n. If p < 0.5, the distribution will be positively skewed; while for p > 0.5, the distribution will be negatively skewed. Further, for a given value of n, the greater is the departure from 0.5, the greater is the degree of skewness.
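To see the skewness claim in numbers, one option is to compute the skewness (third central moment divided by the cube of the standard deviation) directly from the binomial probabilities. This is only a sketch; n = 10 and the values of p are arbitrary choices for illustration.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binomial_skewness(n, p):
    """Skewness computed from the probabilities themselves."""
    xs = range(n + 1)
    mean = sum(x * binomial_pmf(x, n, p) for x in xs)
    var = sum((x - mean) ** 2 * binomial_pmf(x, n, p) for x in xs)
    third = sum((x - mean) ** 3 * binomial_pmf(x, n, p) for x in xs)
    return third / var ** 1.5

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p = {p}: skewness = {binomial_skewness(10, p):+.3f}")
```

The sign is positive for p < 0.5, negative for p > 0.5, essentially zero at p = 0.5, and the magnitude grows as p moves away from 0.5.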
Now let's look at another important probability distribution.
Poisson Distribution:
It is useful to describe the probability that a given event can happen within a given period (for instance, how many thoracic traumas could need the involvement of the thoracic surgeon in a day, or a week, etc.).
Poisson distribution is named after the French mathematician Denis Poisson. The events that may be described by this distribution have the following characteristics:
The events are independent from one another;
Within a given interval the event may present from 0 to infinite times;
The probability of an event to happen increases when the period of observation is longer.
To predict the probability, we must know how the events behave; this information comes from previous (historical) observations of the same event, made before the period being analyzed. The mean number of events in a given interval, derived from those observations, is called λ. The Poisson distribution follows the formula $P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$, for k = 0, 1, 2, ...
where the number e is an important mathematical constant that is the base of the natural logarithm. It is approximately equal to 2.71828.
Let us understand this with an example. A customer care center receives 100 calls per hour, 8 hours a day. The calls are independent of each other, so the number of calls per minute has a Poisson probability distribution: there can be any number of calls in a given minute, irrespective of the number received in the previous minute. Below is the curve of the probabilities for a fixed value of λ of a function following Poisson distribution:
If we find the probability that more than 150 calls could be received in an hour, the call center can use it to improve its standard of customer care, for instance by employing more services to cater to the needs of its customers.
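As a rough illustration of the call center question, the sketch below evaluates the standard Poisson probability λ^k e^(-λ) / k! and sums the upper tail for λ = 100 calls per hour. The function names and the cutoff of 2000 terms are assumptions made for this example; the probabilities are computed in log space so that large factorials do not overflow.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) = lam**k * exp(-lam) / k!, evaluated in log space for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def prob_more_than(k, lam, terms=2000):
    """P(X > k), approximated by summing the upper tail term by term."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1, k + 1 + terms))

lam = 100  # average calls per hour, from the example above
print(f"P(more than 150 calls in an hour) = {prob_more_than(150, lam):.3e}")
```

Summing the tail directly, rather than subtracting from 1, keeps some precision when the answer is very small, as it is here.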
Poisson Distribution Mean and Variance
For a Poisson distribution with average rate λ over a fixed interval of time, the mean and the variance are equal. So for X following a Poisson distribution, λ is both the mean and the variance of the distribution.
Hence: E(X) = V(X) = λ
where
E(X) is the expected value (mean)
V(X) is the variance
λ > 0
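The equality of mean and variance can be checked numerically by summing over the Poisson probabilities. This is a sketch only; λ = 4 and the cutoff of 200 terms are arbitrary choices for illustration.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) = lam**k * exp(-lam) / k!, evaluated in log space for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

lam = 4.0
ks = range(200)  # terms beyond this point are negligibly small for lam = 4

mean = sum(k * poisson_pmf(k, lam) for k in ks)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)
print(f"mean = {mean:.6f}, variance = {variance:.6f}")
```

Both values should come out equal to λ, up to floating-point rounding.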
Properties of Poisson Distribution
The Poisson distribution applies to situations with a large number of possible events, each of which is rare and independent of the others. The following are the properties of the Poisson Distribution. In the Poisson distribution,
Applications of Poisson Distribution
There are various applications of the Poisson distribution. The random variables that follow a Poisson distribution are as follows:
• To count the number of defects of a finished product
• To count the number of deaths in a country by any disease or natural calamity
• To count the number of infected plants in the field
• To count the number of bacteria in the organisms or the radioactive decay in atoms
• To calculate the waiting time between the events.
Note: The binomial distribution refers only to discrete variables (those that take a limited number of values within a given interval). However, in nature, many variables can take infinitely many values within a given interval. These are called continuous variables.
The Exponential Distribution:
The exponential distribution is often concerned with the amount of time until some specific event occurs. For example, the amount of time (beginning now) until an earthquake occurs has an exponential distribution. Other examples include the length, in minutes, of long distance business telephone calls, and the amount of time, in months, a car battery lasts. It can be shown, too, that the value of the change that you have in your pocket or purse approximately follows an exponential distribution.
Values for an exponential random variable occur in the following way. There are fewer large values and more small values. For example, the amount of money customers spend in one trip to the supermarket follows an exponential distribution. There are more people who spend small amounts of money and fewer people who spend large amounts of money.
Example:
The number of days ahead travelers purchase their airline tickets can be modeled by an exponential distribution with the average amount of time equal to 15 days. Find the probability that a traveler will purchase a ticket fewer than ten days in advance. How many days do half of all travelers wait?
To find the probability that a traveler will purchase a ticket fewer than ten days in advance using an exponential distribution, we can use the cumulative distribution function (CDF) of the exponential distribution.
The cumulative distribution function of an exponential distribution with rate parameter λ is given by $F(x) = 1 - e^{-\lambda x}$, where x is the time (number of days in this case) and λ is the rate parameter.
Given that the average time until purchase is 15 days, and the mean of an exponential distribution is equal to 1/λ, we have λ = 1/15.
Now, let's find the probability that a traveler will purchase a ticket fewer than ten days in advance:
$P(X < 10) = 1 - e^{-(1/15) \times 10}$
$P(X < 10) = 1 - e^{-2/3}$
$P(X < 10) \approx 1 - e^{-0.6667}$
Since $e^{-0.6667} \approx 0.5134$,
$P(X < 10) \approx 1 - 0.5134$
$P(X < 10) \approx 0.4866$
So, the probability that a traveler will purchase a ticket fewer than ten days in advance is approximately 0.4866.
To find how many days half of all travelers wait, we need the median of the exponential distribution, which is given by ln(2)/λ.
$\text{Median} = \frac{\ln 2}{\lambda} = \frac{\ln 2}{1/15} = 15 \ln 2$
ln(2) is approximately 0.6931, so
$\text{Median} \approx 15 \times 0.6931 \approx 10.3965$
So, half of all travelers purchase their tickets fewer than about 10.4 days in advance.
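Both parts of the airline ticket example can be reproduced with a short script. This is a minimal sketch; the function names are chosen for illustration, and the only inputs are the 15-day average and the 10-day threshold from the example.

```python
import math

def exponential_cdf(x, rate):
    """P(X < x) for an exponential distribution with the given rate."""
    return 1.0 - math.exp(-rate * x)

def exponential_median(rate):
    """Value below which half of the observations fall."""
    return math.log(2) / rate

rate = 1 / 15  # mean of 15 days corresponds to a rate of 1/15 per day

print(round(exponential_cdf(10, rate), 4))  # 0.4866
print(round(exponential_median(rate), 4))   # 10.3972
```

The median printed here differs slightly from 10.3965 because the hand calculation rounds ln(2) to four decimal places before multiplying.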
Relation Between Poisson and Exponential Distribution
Poisson distribution deals with number of occurrences of an event in a time period whereas exponential distribution deals with the time between these events.
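One way to see this relation is a small simulation: draw exponential waiting times between events, place the events on a timeline, and count how many land in each unit interval. The sketch below is an illustration under assumed values (a rate of 3 events per interval, 20000 intervals, a fixed random seed); if the relation holds, the counts should have mean and variance close to the rate, as a Poisson distribution would.

```python
import random

def simulate_counts(rate, n_intervals, seed=0):
    """Count events per unit interval when gaps between events are exponential."""
    rng = random.Random(seed)
    counts = [0] * n_intervals
    t = rng.expovariate(rate)        # time of the first event
    while t < n_intervals:
        counts[int(t)] += 1          # the event falls in interval [int(t), int(t) + 1)
        t += rng.expovariate(rate)   # add the waiting time until the next event
    return counts

counts = simulate_counts(rate=3.0, n_intervals=20000)
mean_count = sum(counts) / len(counts)
var_count = sum((c - mean_count) ** 2 for c in counts) / len(counts)
print(f"mean count = {mean_count:.3f}, variance = {var_count:.3f}")
```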