Concise Basic Stats - Part III: Normal and T Distributions

Hello, welcome back to another article in the concise stats series. As you see in the title, this time we will have something really exciting to talk about. We are going to talk about two of the most important probability distributions in statistics for independent, random variables.

The Normal Distribution

The origin of the Normal distribution can be traced to the French mathematician Abraham de Moivre. He had a scientific interest in gambling and often acted as a consultant to gamblers to determine probabilities. De Moivre was studying the probability distribution of coin flips. He was trying to come up with a mathematical expression for problems such as finding the probability of 60 or more tails out of one hundred coin flips. As an answer to this question he derived a bell-shaped distribution, which is commonly referred to as the normal curve.

The Normal distribution is perhaps the most important probability distribution due to its versatility in describing values for many natural phenomena. Examples include the distribution of heights, blood pressure, measurement errors and IQ scores, all of which approximately follow a normal distribution.

To get the intuitive notion behind this, let’s take the example of the height of the human population. When you think of how tall people are, in general, how could you describe the distribution of such values for a population? You can think of this by perhaps restricting your population to be, for instance, people from your classroom or your workplace colleagues, or by restricting it by age or gender. How many of those people would be considered to be of “average” height and how many could be considered “very short” or “very tall”? The simple notion of having extremes or exceptional cases is derived from our understanding of what is considered to be “normal”, or average. Nobody could be considered exceptionally tall or very short if it weren’t for the fact that the majority of people tend to be of “average” height.

Figure: The bell-shaped (Normal) distribution

Now, referring back to the normal distribution: take the x-axis to be the scale of values for height, from very-short to very tall. The y-axis is of course the frequency of observations (number of people) that fall in each value bracket. Intuitively we know that the majority of people will fall around the mean, or the middle part, thus making it a taller region (around the mean). Those who are exceptional in terms of height (either very short or very tall) will be increasingly rare as we approach the tail of the distribution, thus explaining the low frequency of observations in those areas.

Empirical Rule

As we saw, the standard deviation is a measure of variability. In the case of the normal distribution, it is one of the two parameters (the other being the mean) that we use to define it. The standard deviation dictates the width of the normal distribution. It also determines how far from the mean the values tend to fall. As a recap: it represents the typical distance between the observations and the average. Changing the standard deviation either tightens or spreads out the distribution along the x-axis. Larger standard deviations produce wider distributions. See the example below:

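import ipywidgets as widgets
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

def viz(sample: int, std: int):
    # Draw `sample` values from a normal r.v. with mean 0 and the chosen standard deviation
    rv = stats.norm(loc=0, scale=std)
    sample_values = rv.rvs(size=sample)
    sns.displot(pd.Series(sample_values))
    plt.title(f'sample size: {sample}; std: {std}')
    plt.xlim(-100, 100)

# Interactive sliders for the sample size and the standard deviation
# (min=1 keeps both the sample size and the scale strictly positive)
widgets.interact(
    viz,
    sample=widgets.IntSlider(value=100, min=1, max=1000),
    std=widgets.IntSlider(value=1, min=1),
)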

Figure: The range of the x values increases as I tweak the standard deviation parameter of a normally distributed r.v.

The standard deviation becomes particularly valuable for the empirical rule. By knowing the value of the standard deviation, we can determine the proportion of values that fall within certain distances from the mean. When data is normally distributed:

  • About 68% of the observations fall within +/- 1 standard deviation of the mean
  • About 95% of the observations fall within +/- 2 standard deviations of the mean
  • About 99.7% of the observations fall within +/- 3 standard deviations of the mean
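
The three proportions above can be checked numerically. The snippet below is a minimal sketch that assumes SciPy is available; the helper name prob_within is only illustrative.

```python
from scipy import stats

# Standard normal r.v.: mean 0, standard deviation 1
standard_normal = stats.norm(loc=0, scale=1)

def prob_within(k: float) -> float:
    """Probability that a normal observation lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within +/- {k} standard deviation(s): {prob_within(k):.4f}")
```

Because every normal distribution can be rescaled to the standard one, the same proportions hold for any mean and standard deviation.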

As we will see in later chapters, most of the traditional hypothesis tests as well as analysis of variance assume that the underlying data is normally distributed. Therefore it is of utmost importance to check the assumptions of the particular method we are applying, as well as the distribution our data comes from. We will explore normality tests that will help us answer those questions. In the case of non-normally distributed data, we will also explore ways to transform our data in order to have something more bell-shaped. As we saw in an earlier chapter, the normal distribution is symmetrical around its mean: most of the observations cluster around the central peak, and the probabilities for values further away from the mean diminish equally on both sides as we move away from the center.

Figure: Visualizing the empirical rule in a standard normal distribution.

The equation which defines the outline of the shape above is called the density function.

f(x) = 1 / (σ √(2π)) · exp( -(x - μ)² / (2σ²) )

The probability density function (p.d.f.) of a normally distributed r.v. with mean μ and standard deviation σ

The problem with the above equation is that it has no closed-form antiderivative, so it cannot be integrated by hand. The reason why we want to integrate it is to be able to find the area under the curve between two values, which gives us the probability of the variable falling in that interval.
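
In practice the area is computed numerically. As a minimal sketch (assuming SciPy, with illustrative values for the mean, the standard deviation and the interval), we can integrate the density with a general-purpose routine and compare the result with the cumulative distribution function that SciPy already provides:

```python
from scipy import integrate, stats

mu, sigma = 0.0, 1.0   # illustrative parameters
a, b = -1.0, 1.0       # interval we want the probability for

normal = stats.norm(loc=mu, scale=sigma)

# Numerical integration of the density between a and b
area, abs_error = integrate.quad(normal.pdf, a, b)

# Same probability obtained from the cumulative distribution function
via_cdf = normal.cdf(b) - normal.cdf(a)

print(f"quad: {area:.6f}, cdf difference: {via_cdf:.6f}")
```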

Normal Approximations

Another reason for the importance of the normal distribution comes from its versatility when approximating other distributions. Let’s take a look at two such cases with the Binomial and Poisson distributions.

Binomial Approximation

In most cases, working out a problem using the normal distribution may be easier than using a Binomial distribution, due to the complexity of computing the probability values. The normal distribution can be used as an approximation to the binomial distribution, if the following conditions are met:

  • Let X ~ B(n, p), i.e. X is a r.v. following a Binomial distribution with parameters n (number of trials) and p (probability of success). In order to approximate it using the normal distribution, we must check that both n*p and n*q (where q = 1 - p) are greater than or equal to 5. If that is the case, then X is approximately N(np, npq).
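
The rule of thumb above can be sketched as a small helper. The function name normal_approx_params is hypothetical; note that N(np, npq) is written in terms of the variance, so the helper returns its square root as the standard deviation.

```python
import math

def normal_approx_params(n: int, p: float):
    """Return (mean, std) of the approximating normal, or None if the rule of thumb fails."""
    q = 1 - p
    if n * p >= 5 and n * q >= 5:
        # N(np, npq): mean np, variance npq, so the standard deviation is sqrt(npq)
        return n * p, math.sqrt(n * p * q)
    return None

params = normal_approx_params(15, 0.5)      # n*p = n*q = 7.5, so the rule is met
too_skewed = normal_approx_params(15, 0.1)  # n*p = 1.5 < 5, so the rule is not met

print(params, too_skewed)
```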

import ipywidgets as widgets
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

def viz(n_values: int):
    # Generate a Binomial r.v. with n=15, p=0.5 and draw n_values samples from it
    n = 15
    p = .5
    rv = stats.binom(n=n, p=p)
    value_list = rv.rvs(n_values)

    # Display the sampled values as a histogram with a KDE overlay
    sns.displot(
        pd.Series(value_list),
        kde=True,
        color=list(plt.rcParams['axes.prop_cycle'])[5]['color']
    )
    plt.title(f'Binom({n}, {p}): {n_values} values;')
    plt.xlabel('Values')
    plt.xlim(0, 15)
    plt.ylim(0, 130)

# Interactive slider for the number of sampled values
widgets.interact(
    viz,
    n_values=widgets.IntSlider(value=10, min=10, step=10, max=600),
)

Figure: Normal approximation of the binomial distribution. Note how the mean of the distribution is around 7.5, which is the same value as n * p.

Poisson Approximation

We can also use the normal distribution to approximate the Poisson distribution, whose only parameter λ is both its mean and its variance.

If X ~ Poisson(λ), then for large values of λ (as a rule of thumb, λ > 30), X ~ N(λ, λ) approximately. That is, we approximate a Poisson through a normal distribution by using the value of λ as both the mean and the variance of the normal.
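
As a rough check of this approximation, the sketch below compares an exact Poisson probability with its normal counterpart. The values of lam and k are illustrative assumptions, and since SciPy's scale argument is a standard deviation, the variance λ enters as its square root.

```python
import math
from scipy import stats

lam = 50   # illustrative rate, above the rule of thumb of 30
k = 55     # we look at P(X <= k)

# Exact probability from the Poisson distribution
exact = stats.poisson.cdf(k, mu=lam)

# Normal approximation N(lam, lam): mean lam, variance lam, so scale = sqrt(lam)
approx = stats.norm.cdf(k, loc=lam, scale=math.sqrt(lam))

print(f"exact: {exact:.4f}, normal approximation: {approx:.4f}")
```

The two values are close but not identical, which is expected when a discrete distribution is approximated with a continuous one.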

Continuity Correction

If you have been paying attention to the different kinds of probability distributions we have presented so far, you must’ve realized that both the Binomial and Poisson distributions are discrete probability distributions. Therefore, we are approximating discrete distributions using a continuous one (the Normal). As you might think, such approximations cannot be made without an adjustment or “correction” in terms of how each probability value is obtained. That is where continuity correction comes into play.

In the discrete case, each probability is represented by a rectangle. Take the binomial distribution. On the x-axis you have the (discrete) values which the r.v. X can take. The height of each rectangle corresponds to the probability of that outcome (which is equal to the area of the rectangle, since it has width 1). So, if we wish to use the normal distribution to approximate a probability in the Binomial distribution, we have to make sure we are covering the whole rectangle, by either adding 0.5 to or subtracting 0.5 from the cutoff points of the interval we want to calculate the area for.

For instance, let’s say we wish to calculate the probability of obtaining 18 to 23 heads in 40 tosses of a fair coin. Using the Binomial distribution directly, we would formulate this problem as finding the value of P(18 <= X <= 23). Here we are taking whole rectangles, since each value of X corresponds to an entire bar. However, using the Normal approximation to the binomial, we need continuity correction to include the outer edges of the rectangles at both ends of the interval (otherwise we would only be counting from the midpoint of the first rectangle to the midpoint of the last one). Therefore, in this example, we would formulate the desired probability as P(17.5 < X < 23.5) in the normal distribution.
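
The coin-toss example can be worked through numerically. The sketch below assumes SciPy and compares the exact binomial probability with the normal approximation, with and without the 0.5 adjustment; the variable names are only illustrative.

```python
import math
from scipy import stats

n, p = 40, 0.5
binomial = stats.binom(n=n, p=p)
# Approximating normal: mean n*p, standard deviation sqrt(n*p*(1-p))
normal = stats.norm(loc=n * p, scale=math.sqrt(n * p * (1 - p)))

# Exact: P(18 <= X <= 23) = P(X <= 23) - P(X <= 17)
exact = binomial.cdf(23) - binomial.cdf(17)

# Normal approximation with continuity correction: P(17.5 < X < 23.5)
corrected = normal.cdf(23.5) - normal.cdf(17.5)

# Normal approximation without the correction: P(18 < X < 23)
uncorrected = normal.cdf(23) - normal.cdf(18)

print(f"exact: {exact:.4f}")
print(f"with correction: {corrected:.4f}")
print(f"without correction: {uncorrected:.4f}")
```

Without the correction, half of the first and half of the last bar are left out, so the approximation underestimates the probability.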


Of course, in practice we rarely need to apply continuity correction by hand, since statistical software typically either applies it for us or computes the exact discrete probability directly. This is just to illustrate how we can approximate a binomial (discrete) distribution using a normal (continuous) distribution, and how this relationship plays out.

Z-Scores and the Standard Normal Distribution

In practice, when we want to model our data after the normal distribution, we will be working with the Standard Normal Distribution. The difference is that the Standard Normal Distribution has its values scaled so that it has mean 0 and variance 1. We do that because the area under the curve of a standard normal only has to be obtained once through numerical integration and can then be reused for any normal distribution.

A value on the standard normal distribution is known as a standard score or a Z-score. A Z-score represents the number of standard deviations above or below the mean that a specific observation falls. For example, a standard score of 2.5 indicates that the observation is 2.5 standard deviations above the mean. A negative value, on the other hand, represents a value below the average (remember that the mean of a standard normal distribution is 0). Not only can Z-scores help us calculate probabilities, they also allow us to compare two scores that come from different normal distributions, since Z-scores are always centered on the same mean and have the same variance. Now, let’s see how we can take a measurement from any Normal distribution and convert it to a Z-score (standard Normal). In order to do that, we are going to perform a simple calculation:

z = (x - μ) / σ

That is, we take our array of values, calculate its mean and standard deviation and then, for each value, subtract the mean and divide by the standard deviation. At the end of this operation, your new values will be as if they came from a standard normal distribution! This process is called standardization.

from scipy import stats
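
A minimal sketch of standardization, assuming NumPy and SciPy. The heights array holds made-up values for illustration only. The manual calculation and scipy.stats.zscore should agree, since both use the population standard deviation (ddof=0) by default.

```python
import numpy as np
from scipy import stats

# Hypothetical height measurements in centimetres (made-up values)
heights = np.array([158.0, 163.5, 170.2, 171.0, 175.4, 180.1, 186.3])

# Manual standardization: subtract the mean, divide by the standard deviation
manual_z = (heights - heights.mean()) / heights.std()

# Same calculation using SciPy's helper
scipy_z = stats.zscore(heights)

print(np.round(scipy_z, 3))
```

After the transformation the values have mean 0 and standard deviation 1, which is what lets us compare scores coming from different normal distributions.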

Student's t Distribution

In the real world, however, most of the time we will end up using a distribution called Student's t, or simply the t distribution. That is because, as we saw, in order to use the Normal Distribution we would need to know the population standard deviation parameter. In most cases, however, that value is unknown and has to be estimated from our sample. That is when the t distribution comes in handy. Another scenario where we would prefer to use the t-distribution instead of the normal is when our sample size is small (a popular rule of thumb is less than 30), which hinders our ability to accurately approximate a normal distribution.

The smaller our sample size is, the less the t-distribution resembles the normal distribution. Compared with the normal distribution, the t-distribution has a lower peak at the center and heavier tails.
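
To put numbers on this comparison, the sketch below (assuming SciPy, with illustrative degrees of freedom) looks at the height of the density at the center and at the probability of falling more than 3 units above it:

```python
from scipy import stats

# Reference values for the standard normal distribution
normal_peak = stats.norm.pdf(0)   # density at the center
normal_tail = stats.norm.sf(3)    # P(Z > 3)

print(f"normal:      peak={normal_peak:.4f}, tail={normal_tail:.5f}")

for df in (2, 5, 30, 1000):
    peak = stats.t.pdf(0, df)     # density of the t-distribution at the center
    tail = stats.t.sf(3, df)      # P(T > 3)
    print(f"t (df={df:>4}): peak={peak:.4f}, tail={tail:.5f}")
```

As the degrees of freedom grow, both quantities approach those of the normal distribution.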

The t-distribution was published by William Gosset in 1908. He published under the pseudonym "Student" because he was working for the Guinness brewery and they did not, according to legend, allow him to use his own name!

In order to transform our data into values that can be compared and tested via the t-distribution we must apply the following:

t = (x̄ - μ) / (s / √n)

The t-statistic is obtained by using the sample standard deviation (s), scaled by √n, in place of the unknown parameter sigma.
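
A minimal sketch of this calculation, assuming the one-sample form t = (sample mean - hypothesised mean) / (s / sqrt(n)). The sample values and the hypothesised mean mu_0 are made up for illustration; the result is compared with scipy.stats.ttest_1samp.

```python
import math
import numpy as np
from scipy import stats

# Made-up sample and hypothesised population mean
sample = np.array([4.9, 5.3, 5.1, 4.7, 5.6, 5.2, 4.8, 5.4])
mu_0 = 5.0

n = sample.size
s = sample.std(ddof=1)   # sample standard deviation (divides by n - 1)

# t-statistic computed by hand
t_manual = (sample.mean() - mu_0) / (s / math.sqrt(n))

# Same statistic from SciPy's one-sample t-test
t_scipy = stats.ttest_1samp(sample, popmean=mu_0).statistic

print(f"manual: {t_manual:.4f}, scipy: {t_scipy:.4f}")
```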

Degrees of freedom

Recall that the density function of the normal distribution is fully described by the mean and the variance. The shape of the t distribution, in turn, depends on the number of degrees of freedom.

In statistics, degrees of freedom refers to the number of values in the final calculation which are free to vary. For example: imagine we already know the mean of an array of 5 values. If I tell you the values of 4 of them, I would not need to tell you what the 5th one is. You would be able to deduce that number quite easily. Therefore the degrees of freedom in this example would be equal to 4. In other words, the degrees of freedom combine how much data you have with how many parameters you need to estimate. When it comes to the t-distribution, our degrees of freedom are given by our sample size n minus 1, so n-1. Intuitively, the more data we have in our sample, the more precise our estimates will be (generally), and therefore we want to have many degrees of freedom.
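
Both ideas can be sketched in a few lines, assuming SciPy. The numbers are made up: first we recover the fifth value from the known mean, then we look at how the two-sided 95% critical value of the t-distribution shrinks as the degrees of freedom grow.

```python
from scipy import stats

# Known mean of 5 values and 4 of the values (made-up numbers)
known_mean = 10.0
known_values = [8.0, 12.0, 9.0, 11.0]
n = 5

# The fifth value is fully determined: the total must equal n * mean
fifth_value = n * known_mean - sum(known_values)
print(f"fifth value: {fifth_value}")

# Two-sided 95% critical values for several degrees of freedom
critical_values = {df: stats.t.ppf(0.975, df) for df in (4, 9, 29, 999)}
normal_critical = stats.norm.ppf(0.975)

for df, value in critical_values.items():
    print(f"df={df:>3}: {value:.3f}")
print(f"normal: {normal_critical:.3f}")
```

With more degrees of freedom the critical value moves toward the normal one, which reflects the added precision of larger samples.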

Thanks for following all the way through this post. I hope you had a chance to discover something new, or at least that it helped solidify some concepts. If you liked it, please share it and give it a reaction. Constructive criticism is also always appreciated. I hope to see you in the next post. Godspeed!


