Path to Data science - Zero to Hero Series 1 - Week2

Let’s start our Week 2 exploration of distributions.

This week on our journey through data analysis, we're diving deeper into the world of distributions! Last time, we explored the significance of inference, the importance of statistics in data analysis, and touched on probability. If you missed the previous article (linked below), be sure to catch up before delving into this week's content. Read on for an exploration of different types of distributions and how they shape our understanding of data. #DataAnalysis #Statistics #Probability #Distributions

https://www.dhirubhai.net/pulse/path-data-science-zero-hero-series-1-rahul-muralidharan-flbpc/?trackingId=nwHboNhMSguTDnh%2FPNYWpA%3D%3D

Normal Distribution:

Data can be "distributed" (spread out) in different ways.

But there are many cases where the data tends to cluster around a central value with no bias left or right, and it gets close to a "Normal Distribution". It is often called a "Bell Curve" because its graph looks like a bell.

The normal distribution, also called the Gaussian distribution, de Moivre distribution, or “bell curve,” is a probability distribution that is symmetric about a central mean: half of data falls to the left of the mean and half falls to the right. The bulk of data are clustered around the mean, which results in a bell-shaped curve when graphed.

The Normal Distribution has:

  • mean = median = mode
  • symmetry about the center
  • 50% of values less than the mean and 50% greater than the mean

Many examples in the real world tend to closely follow a Normal Distribution:

  • heights of people
  • size of things produced by machines
  • errors in measurements
  • blood pressure
  • marks on a test

The Standard Deviation is a measure of how spread out numbers are.

The empirical rule tells us that:

  • 68% of data falls within one standard deviation of the mean.
  • 95% of data falls within two standard deviations of the mean.
  • 99.7% of data falls within three standard deviations of the mean.

The empirical rule tells you what percentage of your data falls within a certain number of standard deviations from the mean. The standard deviation controls the spread of the distribution. A smaller standard deviation indicates that the data is tightly clustered around the mean, resulting in a taller and thinner normal distribution. A larger standard deviation indicates that the data is spread out around the mean; the normal distribution will be flatter and wider.
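The percentages in the empirical rule can be checked numerically. The following is a minimal sketch using Python's standard library; the helper name `within_k_sigma` is made up for illustration and is not part of any library.

```python
from statistics import NormalDist

def within_k_sigma(k, mu=0.0, sigma=1.0):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    dist = NormalDist(mu, sigma)
    # Area under the bell curve between mu - k*sigma and mu + k*sigma
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {within_k_sigma(k):.4f}")
```

The result does not depend on the particular mean or standard deviation chosen, which is why the rule applies to any normal distribution.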

Binomial Distribution:

A classic example of a probability distribution is the binomial distribution. It describes the number of successes in a fixed number of trials when each trial has only two possible, mutually exclusive outcomes.


A probability distribution is a binomial probability distribution when it meets the following requirements.

  • Each trial can have only two outcomes, or outcomes that can be reduced to two. These outcomes are labeled success and failure.
  • The number of trials must be fixed.
  • The outcome of each trial must be independent of the others.
  • The probability of success must remain the same for each trial.

Binomial Distribution Formula in Probability

The formula for the binomial probability distribution is as stated below:

$P(x) = {}^{n}C_{x} \, p^{x} (1 - p)^{n - x}$ (or) $P(r) = \frac{n!}{r!\,(n - r)!} \, p^{r} (1 - p)^{n - r}$

Where,

n = Total number of trials

r (or) x = Total number of successful trials.

p = Probability of success on a single trial.

${}^{n}C_{r} = \frac{n!}{r!\,(n - r)!}$

1 − p = Probability of failure on a single trial (also written q).

Let’s work out an example:

The probability that a person can achieve a target on a single try is 3/4. The person makes 5 tries. What is the probability that the person attains the target at least three times?

Solution:

Given that p = 3/4, q = 1 − p = 1/4, n = 5.

Using the binomial distribution formula, we get $P(X = x) = {}^{n}C_{x} \, p^{x} (1 - p)^{n - x}$

Thus, the required probability is: P(X = 3) + P(X = 4) + P(X = 5)

$= {}^{5}C_{3} (3/4)^{3} (1/4)^{2} + {}^{5}C_{4} (3/4)^{4} (1/4)^{1} + {}^{5}C_{5} (3/4)^{5}$

= 270/1024 + 405/1024 + 243/1024

= 459/512.

Therefore, the probability that the person will attain the target at least three times is 459/512 (about 0.896).
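The arithmetic in this example can be reproduced exactly with fractions. This is a small sketch, assuming a helper named `binomial_pmf` (an illustrative name, not a library function) that implements the binomial formula directly.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, x, p):
    """P(X = x) for X ~ Binomial(n, p): nCx * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

p = Fraction(3, 4)  # probability of success on a single try
# "At least three" successes out of 5 means x = 3, 4 or 5
at_least_three = sum(binomial_pmf(5, x, p) for x in range(3, 6))
print(at_least_three, float(at_least_three))
```

Using `Fraction` keeps the result exact, so it can be compared directly with 459/512.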

Another example: Number of Fraudulent Transactions

Banks use the binomial distribution to model the probability that a certain number of credit card transactions are fraudulent.

For example, suppose it is known that 2% of all credit card transactions in a certain region are fraudulent. If there are 50 transactions per day in a certain region, we can use a Binomial Distribution Calculator to find the probability that more than a certain number of fraudulent transactions occur in a given day:

P(X > 1 fraudulent transaction) = 0.26423

P(X > 2 fraudulent transactions) = 0.07843

P(X > 3 fraudulent transactions) = 0.01776

And so on.

This gives banks an idea of how likely it is that more than a certain number of fraudulent transactions will occur in a given day.
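The same tail probabilities can be computed without an online calculator. The sketch below assumes n = 50 transactions and p = 0.02 as in the example; `binomial_tail` is an illustrative helper name.

```python
from math import comb

def binomial_tail(n, p, k):
    """P(X > k) for X ~ Binomial(n, p), computed as 1 - P(X <= k)."""
    cdf = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k + 1))
    return 1 - cdf

n, p = 50, 0.02  # 50 transactions per day, 2% fraudulent
for k in (1, 2, 3):
    print(f"P(X > {k}) = {binomial_tail(n, p, k):.5f}")
```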

Properties of Binomial Distribution

  • Binomial distribution has a fixed number of independent trials; i.e., n.
  • In each trial, there are only two outcomes, success or failure.
  • The probability of success (p) remains constant across all trials.
  • Each trial is independent, with no impact on others.
  • It is a discrete probability distribution with specific, countable values.
  • The probability mass function (PMF) gives the probability of exactly ‘x’ successes in ‘n’ trials.
  • Mean (μ) equals np, and Variance (σ²) equals npq, where q = 1 − p.
  • The shape of the binomial curve varies based on ‘n’ and ‘p,’ tending towards symmetry with larger ‘n.’
  • For large ‘n,’ it approximates a normal distribution (Central Limit Theorem).
  • Cumulative Distribution Function (CDF) finds cumulative probabilities for ≤ ‘x’ successes.

Shape of Binomial Distribution

A binomial distribution may be symmetrical or skewed. If the probability of success, p, is equal to 0.5, the distribution is symmetrical regardless of the value of n. If p < 0.5, the distribution is positively skewed, while for p > 0.5 it is negatively skewed. Further, for a given value of n, the further p departs from 0.5, the greater the degree of skewness.
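One way to put numbers on this is the skewness coefficient of the binomial distribution, (1 − 2p)/√(np(1 − p)). This is a standard result that the text above does not state, so treat the snippet as a supplementary sketch; `binomial_skewness` is an illustrative name.

```python
from math import sqrt

def binomial_skewness(n, p):
    """Skewness coefficient of Binomial(n, p): (1 - 2p) / sqrt(n * p * (1 - p))."""
    return (1 - 2 * p) / sqrt(n * p * (1 - p))

n = 20
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    # Positive values mean a longer right tail, negative a longer left tail
    print(f"p = {p}: skewness = {binomial_skewness(n, p):+.3f}")
```

The sign flips at p = 0.5, and the magnitude grows as p moves away from 0.5 for a fixed n.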



Now let’s look at another important probability distribution.

Poisson Distribution:

It is useful for describing how many times a given event happens within a given period (for instance, how many thoracic traumas could need the involvement of the thoracic surgeon in a day, or a week, etc.).

The Poisson distribution is named after the French mathematician Siméon Denis Poisson. The events that may be described by this distribution have the following characteristics:

  • The events are independent of one another;
  • Within a given interval the event may occur any number of times, from 0 upward with no fixed limit;
  • The probability of observing an event increases when the period of observation is longer.

To predict the probability, we must know how the events behave (this information comes from previous, or historical, observations of the same event made before the analysis). The parameter that summarizes this, the mean number of events in a given interval as derived from previous observations, is called λ. The Poisson distribution follows the formula

$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$, for k = 0, 1, 2, ...

where k is the number of events in the interval, and the number e is an important mathematical constant that is the base of the natural logarithm. It is approximately equal to 2.71828.

Let us try to understand this with an example. A customer care center receives an average of 100 calls per hour, 8 hours a day. The calls are independent of each other, so the number of calls per minute has a Poisson probability distribution, with λ = 100/60 ≈ 1.67 calls per minute. There can be any number of calls in a minute, irrespective of the number of calls received in the previous minute.

If we find the probability that more than 150 calls could be received in an hour, the call center can use it to plan staffing and improve its customer care standards, based on its understanding of the Poisson distribution.
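As a rough sketch of that calculation, the snippet below uses the Poisson probability P(X = k) = λ^k e^(−λ) / k! with λ = 100 calls per hour, and also the per-minute rate λ = 100/60. The helper names `poisson_pmf` and `poisson_tail` are made up for illustration.

```python
from math import exp

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), built up term by term to avoid huge factorials."""
    term = exp(-lam)
    for i in range(1, k + 1):
        term *= lam / i
    return term

def poisson_tail(k, lam):
    """P(X > k) for X ~ Poisson(lam), computed as 1 - P(X <= k)."""
    term = exp(-lam)
    cdf = term
    for i in range(1, k + 1):
        term *= lam / i
        cdf += term
    return 1 - cdf

print("P(more than 150 calls in an hour):", poisson_tail(150, 100))
print("P(no calls in a given minute):", poisson_pmf(0, 100 / 60))
```

With an average of 100 calls per hour, an hour with more than 150 calls turns out to be very unlikely, which is the kind of information a staffing decision could be based on.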

Poisson Distribution Mean and Variance

For a Poisson distribution with average rate λ over a fixed interval of time, the mean and the variance are equal. So for X following a Poisson distribution, λ is the mean as well as the variance of the distribution.

Hence: E(X) = V(X) = λ

where

E(X) is the expected value (mean)

V(X) is the variance

λ > 0

Properties of Poisson Distribution

The Poisson distribution applies when there is a large number of possible events, each of which is rare and independent of the others. The following are the properties of the Poisson distribution. In the Poisson distribution,

  • The events are independent.
  • Only the average number of events in the given period is needed to describe the distribution. No two events can occur at exactly the same time.
  • The Poisson distribution is the limiting case of the binomial distribution when the number of trials n is indefinitely large.
  • mean = variance = λ
  • np = λ is finite, where λ is constant.
  • The standard deviation is always equal to the square root of the mean μ.
  • The probability that a random variable X with mean μ takes the value a is given by $P(X = a) = \frac{\mu^{a} e^{-\mu}}{a!}$
  • If the mean is large, then the Poisson distribution is approximately a normal distribution.
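The link between the binomial and Poisson distributions (large n with np = λ held constant) can be seen numerically. This is a small sketch with illustrative helper names; λ = 2 is an arbitrary value chosen for the demonstration.

```python
from math import comb, exp, factorial

def binomial_pmf(n, k, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * exp(-lam) / factorial(k)

def max_gap(n, lam, k_max=10):
    """Largest difference between Binomial(n, lam/n) and Poisson(lam) over k = 0..k_max."""
    return max(abs(binomial_pmf(n, k, lam / n) - poisson_pmf(k, lam))
               for k in range(k_max + 1))

lam = 2.0  # arbitrary mean, kept fixed while n grows
for n in (10, 100, 1000):
    print(f"n = {n}: largest gap = {max_gap(n, lam):.6f}")
```

The gap shrinks as n grows, which is what the limiting property describes.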

Applications of Poisson Distribution

There are various applications of the Poisson distribution. Examples of random variables that are commonly modeled with it:

  • The number of defects in a finished product
  • The number of deaths in a country from a disease or natural calamity
  • The number of infected plants in a field
  • The number of bacteria in an organism, or the number of radioactive decays in a sample of atoms
  • The number of events that arrive during a waiting period (the waiting time between events is itself described by the exponential distribution).

Note: The binomial and Poisson distributions describe only discrete variables (those that take a countable set of values, such as 0, 1, 2, ...). However, in nature, many variables can take any of an infinite number of values within a given interval. These are called continuous variables.

The Exponential Distribution:

The exponential distribution is often concerned with the amount of time until some specific event occurs. For example, the amount of time (beginning now) until an earthquake occurs has an exponential distribution. Other examples include the length, in minutes, of long distance business telephone calls, and the amount of time, in months, a car battery lasts. It can be shown, too, that the value of the change that you have in your pocket or purse approximately follows an exponential distribution.

Values for an exponential random variable occur in the following way. There are fewer large values and more small values. For example, the amount of money customers spend in one trip to the supermarket follows an exponential distribution. There are more people who spend small amounts of money and fewer people who spend large amounts of money.

Example:

The number of days ahead travelers purchase their airline tickets can be modeled by an exponential distribution with the average amount of time equal to 15 days. Find the probability that a traveler will purchase a ticket fewer than ten days in advance. How many days do half of all travelers wait?

To find the probability that a traveler will purchase a ticket fewer than ten days in advance using an exponential distribution, we can use the cumulative distribution function (CDF) of the exponential distribution.

The cumulative distribution function of an exponential distribution with rate parameter λ is given by: $F(x) = 1 - e^{-\lambda x}$, where x is the time (number of days in this case) and λ is the rate parameter.

Given that the average time until purchase is 15 days, we can use the fact that the average of an exponential distribution is equal to 1/λ, so λ = 1/15.

Now, let's find the probability that a traveler will purchase a ticket fewer than ten days in advance:

$P(X < 10) = 1 - e^{-(1/15) \times 10}$

$P(X < 10) = 1 - e^{-2/3}$

$P(X < 10) \approx 1 - e^{-0.6667}$

We find that $e^{-0.6667} \approx 0.5134$

$P(X < 10) \approx 1 - 0.5134$

$P(X < 10) \approx 0.4866$

So, the probability that a traveler will purchase a ticket fewer than ten days in advance is approximately 0.4866.

To find how many days half of all travelers wait, we need to find the median of the exponential distribution, which is given by ln(2)/λ:

$\text{Median} = \frac{\ln 2}{\lambda} = \frac{\ln 2}{1/15} = 15 \ln 2$

ln(2) is approximately 0.6931, so

$\text{Median} \approx 15 \times 0.6931 \approx 10.3965$

So, half of all travelers purchase their tickets fewer than about 10.4 days in advance.
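Both results in this example can be reproduced in a few lines. The sketch below assumes the mean of 15 days from the example; `exponential_cdf` and `exponential_median` are illustrative helper names.

```python
from math import exp, log

def exponential_cdf(x, rate):
    """F(x) = 1 - e^(-rate * x) for an exponential distribution."""
    return 1 - exp(-rate * x)

def exponential_median(rate):
    """Median of an exponential distribution: ln(2) / rate."""
    return log(2) / rate

rate = 1 / 15  # mean of 15 days, so rate = 1/15 per day
print(f"P(X < 10) = {exponential_cdf(10, rate):.4f}")
print(f"median    = {exponential_median(rate):.4f} days")
```

By definition, the CDF evaluated at the median is 0.5, which gives a quick sanity check on both functions.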

Relation Between Poisson and Exponential Distribution

The Poisson distribution deals with the number of occurrences of an event in a time period, whereas the exponential distribution deals with the time between these events.
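This relationship can be illustrated with a small simulation: generate exponential waiting times between events, then count how many events land in each unit-length interval. The rate of 4 events per interval, the seed, and the function name `simulate_counts` are all assumptions made for the sketch.

```python
import random
from statistics import mean, pvariance

def simulate_counts(rate, n_intervals, seed=0):
    """Count events per unit interval when gaps between events are exponential(rate)."""
    rng = random.Random(seed)
    counts = [0] * n_intervals
    t = rng.expovariate(rate)        # time of the first event
    while t < n_intervals:
        counts[int(t)] += 1          # the event falls in interval [int(t), int(t) + 1)
        t += rng.expovariate(rate)   # exponential waiting time until the next event
    return counts

counts = simulate_counts(rate=4.0, n_intervals=10_000, seed=1)
print("mean count per interval:", mean(counts))
print("variance of the counts: ", pvariance(counts))
```

If the counts behave like a Poisson variable, their mean and variance should both come out close to the rate used for the waiting times.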


