The Trick That Helps All Statisticians Survive

The Trick That Helps All Statisticians Survive

If, like me, you are not a fan of the code formatting in LinkedIn articles, you can view this article on Medium .

I have a fair coin and I am going to toss it 3 times. What is the probability that precisely one of those tosses result in heads? The answer is not too hard to calculate using our formula for the binomial distribution:

where n is the total number of events, k is the number of ‘successful’ events and p is the probability of a ‘successful’ event. In this case, there are 3 events, 1 ‘successful’ event (that is, a toss of heads), and since the coin is fair, the probability of a ‘successful’ event is 0.5. So our probability is 3?(0.5)1?(0.5)2 = 3?0.125 = 0.375. We can verify this using our favourite statistical software, R ??:

> choose(3,1)*0.5^3
[1] 0.375        

Great, pretty easy. Now what if we wanted to look at a situation where there were ten tosses and we want to get the probability that precisely three of them are heads. Then we have to calculate:

I’m not very interested in trying to calculate that manually, so let’s go straight to R:

> choose(10,3)*0.5^10
[1] 0.1171875        

Finally, let’s extend to 10,000 tosses, and we are interested in the probability that 4,500 of them are heads. There is absolutely no chance of a manual calculation here, and if we put this in R we get:

> choose(10000,4500)*0.5^10000
[1] NaN        

So now we have hit our computational limit in R, which tells us that absolutely precise calculations of binomial probabilities is not possible for large numbers of events.

Binomial distribution calculations in statistical software

Of course, today we live in the age of high performance computing, and as you’d expect, there are algorithms available which, though not able to calculate these distributions to absolute precisions, they can get very close. Close enough that the difference is negligible for pretty much all practical uses. For example, in R we can find our final probability above simply by using:

> dbinom(4500, size = 10000, prob = 0.5)
[1] 1.422495e-24        

Clearly the dbinom() function does not use the formula we tried earlier. Instead it uses a clever algorithm developed by Catherine Loader, a Stanford statistics PhD and career statistician who published her method along with the C code in 2000 while working for Bell Labs. The algorithm literally stores 15 hard-calculated values for a particular parameter and then calculates the distribution on the back of these, and it’s good enough to agree with the precise value up to 18 digits.

What did statisticians do in the good old days?

In the good old days before statisticians had access to high performance computing, Binomial probabilities were given in tables for small n. Here’s an example:

Table of Binomial densities for small n

But once you got past small n — usually for n > 20 — statisticians relied on the normal approximation of the Binomial distribution to estimate probability densities. Distributions like the Binomial distribution (and the Poisson distribution) are only defined for discrete integer values, but the normal distribution is a continuous distribution with a defined formula, allowing for a greater number of calculation and estimation options. You can see the discrete nature of the binomial distribution when you try to plot it as a continuous function:

library(ggplot2)

ggplot() +
  xlim(40, 60) +
  geom_function(fun = function(x) dbinom(x, 100, 0.5), color = "blue") +
  labs(x = "Successes", y = "Density") +
  theme_minimal()        

The normal approximation of the Binomial distribution

Let’s look at what happens when we overlay a normal distribution with a certain mean (50) and standard deviation (5) onto our previous plot:

ggplot() +
  xlim(40, 60) +
  geom_function(fun = function(x) dbinom(x, 100, 0.5), color = "blue") +
  geom_function(fun = function(x) dnorm(x, 50, 5), color = "red") +
  labs(x = "Successes", y = "Density") +
  theme_minimal()        

We see that, for the relevant integer values, the density calculated using a normal distribution with our chosen mean and standard deviation pretty much coincides with the density from the binomial distribution.

For the Binomial distribution, if n is the number of events and p is the probability of a successful event, then

  • we would expect np successful events. Hence we can regard np as our mean for this distribution.
  • each independent event (eg coin toss) has a variance of p(1-p), so by the sum of variance law the total variance is np(1-p) and so the standard deviation is √(np(1-p)).

Hence we can conclude that, for integer values, the binomial distribution B(n, p) can be approximated by the the normal distribution N(np, (np(1-p))). So what we have illustrated in the above diagram is that B(100, 0.5) ~ N(50, 5).

In fact this approximation is pretty accurate unless n is really small or the event is very biased (ie p is quite far away from 0.5). To test this, we can write a simple function that takes a number of events, a probability of a successful event, and a range of integer values to plot, and we can compare the binomial and normal densities:

test_norm <- function(size, prob, range) {
  df <- data.frame(
    n = range,
    binom = dbinom(range, size = size, prob = prob),
    norm = dnorm(range, mean = size*prob, sd = sqrt(size*prob*(1-prob)))
  )
  
  ggplot(data = df, aes(x = n)) +
    geom_line(y = df$binom, color = "red") +
    geom_line(y = df$norm, color = "blue") +
    theme_minimal()
}        

Now let’s test for 5, 10, 25 and 50 coin tosses with a fair coin:

library(patchwork)

for (n in c(5, 10, 50, 100)) {
  assign(paste0("p", n), test_norm(n, 0.5, (n/2-0.3*n):(n/2+0.3*n)))
}

p5 + p10 + p25 + p50        

We can see that, even with n as low as 10, the red and blue lines are almost coinciding. Let’s look at some biased coins with 100 coin tosses:

for (p in c(0.1, 0.2, 0.3, 0.4)) {
  assign(paste0("p", p), test_norm(100, p, (100*p-0.6*100*p):(100*p+0.6*100*p)))
}

p0.1 + p0.2 + p0.3 + p0.4        

We can see that, even for coins as biased as p = 0.2, the normal approximation is difficult to separate from the binomial distribution.

An example of how this is useful

A Cambridge University mathematics entrance question in 1991 posed the following question:

A fair coin is thrown 10,000 times. On each throw, 1 point is scored for a head and 1 point is lost for a tail. Find an approximation for the probability that the final score is greater than 100.

To answer this, we need there to be at least 100 more heads than tails in the 10,000 events, which means that the number of heads must be greater than 5,050. So we are being asked for P(x > 5050) in the binomial distribution B(10000, 0.5).

We can approximate B(10000, 0.5) to N(5000, 50). So effectively we are being asked for the area under the normal curve of N(5000, 50) for x ≥ 5051. It is advisable to use a continuity correction here to cope with the fact that the normal distribution is continuous but the binomial distribution is discrete, and therefore we should look for P(x > 5050.5).

We need the area under the curve to the right or the dotted red line

Our standard deviation is 50, so we are looking for the area under the normal curve to the right of 50.5/50 = 1.01 standard deviations above the mean, so our z-score is +1.01. We can use tables, an appropriate scientific calculator, or a function in R to calculate the appropriate upper-tail p-value for this z-score:

> pnorm(1.01, lower.tail = FALSE)
[1] 0.1562476        

Let's see if this agrees with the R function for calculating the p-value for a binomial distribution:

> pbinom(5050, 10000, 0.5, lower.tail = FALSE)
[1] 0.1562476        

A perfect match. And there we are — the probability that the score is greater than 100 is approximately 15.62%.

If you are interested in following this article with Python code instead of R code, Christian Janiake has written a gist here .


What did you think of this journey through some statistics fundamentals? Feel free to comment!

Christian Janiake

Data Scientist | Machine Learning Engineer

3 个月

Hi Keith amazing stuff! I took the liberty of migrating the code to python: https://gist.github.com/cjaniake/261bf0ef489b78d7cc08609572ca0ea3

Ed Powers

Customer Success leader and consultant

3 个月

Of course, you must assume you’re flipping the coin in a perfect vacuum… https://phys.org/news/2023-10-flipped-coins-fair-thought.html

Anthony Ciavarelli

Human Factor’s Psychologist

3 个月

Cool!

回复
Christina Phillips

Catalysing the journey to data driven cultures by design using Human Centric Analytics. Proud to be dyspraxic/dyslexic and tail end menopausal! I heat the sea!

3 个月

Nice post, thank you.

回复

要查看或添加评论,请登录

社区洞察

其他会员也浏览了