Be confident with confidence intervals
Abhi Sharma
Cloud Engineer @ Google | Ex-Founder | Building on Entrepreneurial Foundations
Continuing from my last blog on statistics, "Data analytics is all about Statistics", where we saw various probability distributions, this article looks at confidence intervals. The concept belongs to a very special branch of statistics called inferential statistics, where we try to draw inferences about a population just by analysing samples.
Central Limit Theorem
The Central Limit Theorem (CLT) forms the basis of inferential statistics, so it is very important to understand it before moving ahead.
A sampling distribution is the distribution of a measure computed from a large number of samples, each of a fixed size n, drawn from a population. The sampling distribution of means is the distribution of the means of those samples. Let's assume we have a population of 20 million students with an average height of 5.8 feet. We draw a sample of 50 students and calculate their mean height, which comes out to be 6.1 feet. If we now draw 10,000 such samples and calculate the mean of each one, the distribution of those 10,000 means is called the sampling distribution of the means.
The Central Limit Theorem states that the sampling distribution of means will be approximately normal if the sample size is large enough. Whatever distribution the population follows (as long as its variance is finite), the sampling distribution of means approaches a normal distribution as the sample size grows.
Properties of central limit theorem
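Two standard properties are worth keeping in mind: the mean of the sampling distribution of means equals the population mean μ, and its standard deviation (the standard error) equals σ/√n. The Python sketch below checks both on a simulated population. The exponential population, the seed and the sample counts are illustrative assumptions, not values from this article.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical, deliberately skewed population (exponential with mean 2.0).
population = [random.expovariate(0.5) for _ in range(100_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

n = 50               # size of each sample
num_samples = 5_000  # how many samples we draw

# Draw many samples of size n and record the mean of each one.
sample_means = [
    statistics.fmean(random.sample(population, n))
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
std_error_observed = statistics.stdev(sample_means)
std_error_theory = sigma / n ** 0.5

print(f"population mean      : {mu:.3f}")
print(f"mean of sample means : {mean_of_means:.3f}")
print(f"observed std. error  : {std_error_observed:.3f}")
print(f"sigma / sqrt(n)      : {std_error_theory:.3f}")
```

Even though the population is far from normal, the mean of the sample means should sit close to μ and their spread close to σ/√n.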
Confidence Intervals
It is not an easy task to infer a population parameter from a sample. There is always a chance of error, which is why the concept of confidence intervals is introduced.
A confidence interval is a range of values (a lower bound and an upper bound) within which a population parameter such as the mean or standard deviation is expected to lie with a certain level of confidence.
In the last blog we saw the normal distribution and the standard normal distribution, and in this blog we have seen the significance of the central limit theorem. Now we are going to apply these concepts to find confidence intervals.
Let's say you want to infer the population mean from a sample mean. The central limit theorem describes how sample means are distributed around the population mean, but in this case we only have one sample, and we certainly cannot say that the sample mean is the population mean. However, the CLT applies whatever distribution the population follows, so we can use it to put a range around our single sample mean. Assume your sample mean comes out to be 132.5 and we say that the population mean lies somewhere between 128.5 and 136.5 with a 95% confidence level. Isn't this interesting? Let's see how we can calculate confidence intervals for a given confidence level.
Point Estimate - A single value calculated from a sample, such as the sample mean or sample standard deviation, that is used to estimate the corresponding population parameter.
Confidence Interval - The range of values within which a population parameter is expected to lie.
Confidence Level - It expresses how confident we are that the population parameter lies within the confidence interval: if we repeated the sampling many times, it is the share of the resulting intervals that would contain the parameter. CL is presented in percentage form. The most common CLs are 90%, 95% and 99%.
Significance Level - The probability that the interval does not contain the population parameter, that is, 1 - CL. It is denoted by the Greek letter alpha (α).
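To make the confidence level concrete, here is a small simulation sketch in Python. It assumes a normal population with a known standard deviation and builds the interval as sample mean ± z * σ/√n; the population values (mu, sigma), the sample size and the seed are made up for illustration.

```python
import random
import statistics
from statistics import NormalDist

random.seed(7)  # fixed seed so the sketch is repeatable

mu, sigma = 100.0, 15.0        # hypothetical population mean and std. dev.
n = 40                         # sample size
confidence_level = 0.95
alpha = 1 - confidence_level   # significance level

# z-value that leaves alpha/2 in each tail of the standard normal.
z = NormalDist().inv_cdf(1 - alpha / 2)
margin = z * sigma / n ** 0.5

trials = 10_000
hits = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = statistics.fmean(sample)
    # Does this sample's interval contain the true population mean?
    if x_bar - margin <= mu <= x_bar + margin:
        hits += 1

coverage = hits / trials
print(f"alpha = {alpha:.2f}, share of intervals containing mu = {coverage:.3f}")
```

The share of intervals that contain the true mean should land close to the chosen confidence level, and the remaining share close to α.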
Confidence intervals when population variance is known
Suppose we have a sample of 30 data scientist salaries from Gurgaon and we want to infer the average data scientist salary for the entire city. We have been given the population variance σ². By the CLT, we know that the standard error, i.e. the standard deviation of the sampling distribution, is σ/√n.
Since we have one sample of n=30 and we know population variance, we can estimate the confidence interval by
[ x̄ - Z-value of α/2 * (σ/√n) , x̄ + Z-value of α/2 * (σ/√n)]
We can find the z-values from the z-table. For a 90% confidence level, α = 0.1 and the z-value for α/2 is 1.65 (more precisely 1.645). Similarly, for 95% and 99% the corresponding z-values are 1.96 and 2.58 respectively.
Note - We can use z-values to find confidence intervals when we know the population variance and the number of data points is 30 or more.
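The known-variance interval can be sketched in a few lines of Python. This version reads the z-value from the standard library's NormalDist instead of a printed z-table; the salary figures and the population standard deviation are hypothetical placeholders, not real Gurgaon data.

```python
from statistics import NormalDist, fmean

def z_confidence_interval(sample, sigma, confidence_level=0.95):
    """Interval for the population mean when the population std. dev. is known."""
    n = len(sample)
    alpha = 1 - confidence_level
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z-value of alpha/2
    x_bar = fmean(sample)
    margin = z * sigma / n ** 0.5            # z * standard error
    return x_bar - margin, x_bar + margin

# The z-values quoted in the text, recomputed: 1.645, 1.960, 2.576
for cl in (0.90, 0.95, 0.99):
    print(f"{cl:.0%}: z = {NormalDist().inv_cdf(1 - (1 - cl) / 2):.3f}")

# Hypothetical salaries (lakh INR per year), n = 30. Not real data.
salaries = [
    18.5, 22.0, 25.4, 19.8, 30.1, 27.3, 21.6, 24.9, 26.2, 23.7,
    20.4, 28.8, 22.9, 25.0, 19.2, 31.5, 24.1, 26.7, 21.0, 23.3,
    27.9, 20.8, 29.4, 22.4, 25.8, 24.5, 18.9, 26.0, 23.0, 28.1,
]
assumed_sigma = 3.5  # assumed known population standard deviation

low, high = z_confidence_interval(salaries, assumed_sigma, 0.95)
print(f"95% CI for the mean salary: [{low:.2f}, {high:.2f}]")
```

Raising the confidence level widens the interval, while a larger sample narrows it through the √n in the standard error.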
Confidence Intervals when population variance is unknown
Alright, now the problem is how to find the confidence interval when we don't know the population variance. The solution is Student's t distribution (for a refresher, refer to this blog). We can use t-values to find confidence intervals for a smaller sample with unknown population variance using the formula below:
[ x̄ - t-value(n-1,α/2) * (s/√n) , x̄ + t-value(n-1,α/2) * (s/√n)]
Here, x̄ is the sample mean, s = sample standard deviation, n = sample size, α = significance level and t-value(n-1,α/2) is the t-value for n-1 degrees of freedom.
We can use the t-table to find the t-value.
We can use the t-value method whenever the population variance is unknown. However, beyond about 30 degrees of freedom the t-values closely resemble z-values, so it is advisable to use Student's t statistic to find the confidence interval when the sample size is smaller than 30.
Refer to the Kaggle notebooks for CLT and z-statistics and for Student's t distribution.
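Here is a minimal Python sketch of the t-based interval. Python's standard library has no t distribution, so the sketch takes the t-value as an input, looked up from a t-table. The small lookup dictionary holds standard two-tailed 95% table values for a few degrees of freedom, and the sample is made up for illustration.

```python
import statistics

# t-value(df, alpha/2) for alpha = 0.05, copied from a standard t-table.
T_TABLE_95 = {4: 2.776, 9: 2.262, 14: 2.145, 19: 2.093, 24: 2.064, 29: 2.045}

def t_confidence_interval(sample, t_value):
    """Interval for the population mean when the population variance is unknown."""
    n = len(sample)
    x_bar = statistics.fmean(sample)
    s = statistics.stdev(sample)     # sample std. dev. (divides by n - 1)
    margin = t_value * s / n ** 0.5  # t * estimated standard error
    return x_bar - margin, x_bar + margin

# Hypothetical sample of 10 observations. Not real data.
sample = [12.1, 14.3, 11.8, 13.5, 15.0, 12.7, 13.9, 14.6, 12.4, 13.2]
df = len(sample) - 1                 # degrees of freedom = n - 1

low, high = t_confidence_interval(sample, T_TABLE_95[df])
print(f"95% CI with {df} degrees of freedom: [{low:.2f}, {high:.2f}]")
```

Notice how the table values shrink towards the z-value of 1.96 as the degrees of freedom grow, which is why the choice matters most for small samples.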
Some advanced concepts for confidence intervals
So far we have dealt with a single sample and calculated confidence intervals for its mean, but that is not always the case. In real life, we may have to analyse two samples and compare them. These can be classified as dependent and independent samples. Dependent samples are basically the same sample measured twice, e.g. a sample of patients whose blood sugar levels were tested before and after a medication. Independent samples are entirely different samples, e.g. the test scores of engineering students and of management students.
Now we will see how confidence intervals are calculated for these different types of samples.
Confidence Intervals for dependent samples
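Suppose we have a sample of 10 patients whose blood sugar levels were measured before and after giving a certain drug. We first calculate the difference (after minus before) for each patient, and then use the method below on those differences to analyse the effect of the drug on this sample.
[ x̄ - t-value(n-1,α/2) * (s/√n) , x̄ + t-value(n-1,α/2) * (s/√n)]
Here, x̄ is the mean of the differences, s is the standard deviation of the differences and n is the number of patients.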
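For dependent samples the interval is computed on the per-patient differences, which turns the problem back into a one-sample t interval. The Python sketch below assumes the differences are taken as after minus before; the blood sugar readings are invented for illustration, and 2.262 is the standard t-table value for 9 degrees of freedom at a 95% confidence level.

```python
import statistics

def paired_confidence_interval(before, after, t_value):
    """Interval for the mean change in a before/after (dependent) sample."""
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    # One difference per subject: after minus before.
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    d_bar = statistics.fmean(diffs)    # mean of the differences
    s_d = statistics.stdev(diffs)      # std. dev. of the differences
    margin = t_value * s_d / n ** 0.5
    return d_bar - margin, d_bar + margin

# Hypothetical blood sugar readings (mg/dL) for 10 patients. Not real data.
before = [142, 156, 138, 161, 149, 153, 145, 158, 150, 147]
after = [135, 149, 136, 150, 144, 147, 141, 149, 146, 140]

T_VALUE_DF9_95 = 2.262  # t-table value for n - 1 = 9 degrees of freedom

low, high = paired_confidence_interval(before, after, T_VALUE_DF9_95)
print(f"95% CI for the mean change: [{low:.2f}, {high:.2f}]")
```

If the whole interval lies below zero, the data suggest that the drug lowered blood sugar in this sample at the chosen confidence level.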
Confidence Intervals for Independent samples with known population variance
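Let's assume we have a sample of grades of 50 engineering students and 40 management students, and we want to find the confidence interval for the difference between the two population means.
Calculate the confidence interval:
[ mean difference - z-value(α/2) * sqrt(σ²) , mean difference + z-value(α/2) * sqrt(σ²)]
Here, mean difference is the difference between the two sample means and σ² is the variance of that difference, σe²/ne + σm²/nm, where σe² and σm² are the known population variances of the engineering and management grades and ne = 50 and nm = 40 are the sample sizes.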
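A short Python sketch of this two-sample z interval follows. It assumes the variance of the difference between the sample means is var_1/n_1 + var_2/n_2, which holds for independent samples. The means and variances are hypothetical; only the sample sizes (50 and 40) come from the example above.

```python
from statistics import NormalDist

def two_sample_z_interval(mean_1, mean_2, var_1, var_2, n_1, n_2,
                          confidence_level=0.95):
    """Interval for the difference of two means with known population variances."""
    alpha = 1 - confidence_level
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = mean_1 - mean_2
    var_diff = var_1 / n_1 + var_2 / n_2   # variance of the mean difference
    margin = z * var_diff ** 0.5
    return diff - margin, diff + margin

# Hypothetical summary statistics. Only the sample sizes are from the text.
eng_mean, eng_var, eng_n = 72.0, 100.0, 50
mgmt_mean, mgmt_var, mgmt_n = 68.0, 64.0, 40

low, high = two_sample_z_interval(eng_mean, mgmt_mean, eng_var, mgmt_var,
                                  eng_n, mgmt_n, confidence_level=0.95)
print(f"95% CI for the difference in mean grades: [{low:.2f}, {high:.2f}]")
```

An interval that excludes zero points to a real difference between the two groups at the chosen confidence level.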
Confidence Intervals for Independent samples with unknown population variance - assumed to be equal
When the samples are independent and the population variances are unknown but assumed to be equal, we can use the pooled sample variance for our calculations.
Since the variance in this case is unknown, we are going to use Student's t statistic for the confidence intervals.
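As a sketch of the pooled approach, the Python snippet below assumes the usual textbook definitions: pooled variance = ((n1-1)*s1² + (n2-1)*s2²) / (n1+n2-2), standard error = sqrt(pooled/n1 + pooled/n2), and n1+n2-2 degrees of freedom for the t-value. The summary statistics are invented, and 2.120 is the standard t-table value for 16 degrees of freedom at a 95% confidence level.

```python
def pooled_t_interval(mean_1, mean_2, var_1, var_2, n_1, n_2, t_value):
    """Interval for the difference of two means, unknown but equal variances."""
    df = n_1 + n_2 - 2
    # Weighted average of the two sample variances.
    pooled_var = ((n_1 - 1) * var_1 + (n_2 - 1) * var_2) / df
    std_error = (pooled_var / n_1 + pooled_var / n_2) ** 0.5
    diff = mean_1 - mean_2
    margin = t_value * std_error
    return diff - margin, diff + margin

# Hypothetical sample statistics: (mean, sample variance, size). Not real data.
mean_a, var_a, n_a = 20.0, 4.0, 10
mean_b, var_b, n_b = 18.0, 6.0, 8

T_VALUE_DF16_95 = 2.120  # t-table value for 10 + 8 - 2 = 16 degrees of freedom

low, high = pooled_t_interval(mean_a, mean_b, var_a, var_b, n_a, n_b,
                              T_VALUE_DF16_95)
print(f"95% CI for the difference: [{low:.2f}, {high:.2f}]")
```

Pooling gives the larger sample more weight in the variance estimate, which is only reasonable when the equal-variance assumption is believable.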
Confidence Intervals for Independent samples with unknown population variance - assumed to be unequal
This is similar to comparing apples and oranges and is rarely used in real life. However, for the sake of covering all the topics related to confidence intervals, we are going to discuss it as well.
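We will dive straight into the formulas, which will come in handy when you do the maths. The standard form of the confidence interval for this case is as follows:
[ (x̄1 - x̄2) - t-value(DF,α/2) * sqrt(s1²/n1 + s2²/n2) , (x̄1 - x̄2) + t-value(DF,α/2) * sqrt(s1²/n1 + s2²/n2)]
Here, instead of the pooled sample variance, we use the sample variances s1² and s2² of the two samples separately. Another thing to note is DF, the degrees of freedom, which is calculated by another formula:
DF = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1-1) + (s2²/n2)²/(n2-1) ]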
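The unequal-variance case can be sketched in Python as below. The sketch assumes the Welch-Satterthwaite approximation for the degrees of freedom, which is the usual choice for this setting but is my assumption about the formula intended here. The summary statistics are invented, and 2.160 is the standard t-table value for 13 degrees of freedom at a 95% confidence level.

```python
import math

def welch_degrees_of_freedom(var_1, var_2, n_1, n_2):
    """Welch-Satterthwaite approximation of the degrees of freedom."""
    a = var_1 / n_1
    b = var_2 / n_2
    return (a + b) ** 2 / (a ** 2 / (n_1 - 1) + b ** 2 / (n_2 - 1))

def unequal_variance_interval(mean_1, mean_2, var_1, var_2, n_1, n_2, t_value):
    """Interval for the difference of two means, unknown and unequal variances."""
    # Each sample keeps its own variance; nothing is pooled.
    std_error = (var_1 / n_1 + var_2 / n_2) ** 0.5
    diff = mean_1 - mean_2
    margin = t_value * std_error
    return diff - margin, diff + margin

# Hypothetical sample statistics: (mean, sample variance, size). Not real data.
mean_a, var_a, n_a = 20.0, 4.0, 10
mean_b, var_b, n_b = 18.0, 6.0, 8

df = welch_degrees_of_freedom(var_a, var_b, n_a, n_b)
df_for_table = math.floor(df)  # round down for a conservative table lookup
print(f"DF = {df:.2f}, use the t-table row for {df_for_table}")

T_VALUE_DF13_95 = 2.160  # t-table value for 13 degrees of freedom

low, high = unequal_variance_interval(mean_a, mean_b, var_a, var_b,
                                      n_a, n_b, T_VALUE_DF13_95)
print(f"95% CI for the difference: [{low:.2f}, {high:.2f}]")
```

The DF is usually not a whole number; rounding it down before the table lookup gives a slightly wider, more cautious interval.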
Conclusion
In this article, we covered a very important topic in inferential statistics: confidence intervals. We also discussed what the central limit theorem is and how it makes our life easier when calculating confidence intervals. If you are an aspiring data scientist, these are among the most fundamental concepts you should know.
In the coming blogs, I will cover the most in-demand topic, hypothesis testing. Next week, I will also be talking about Generative AI in a separate blog and video, so follow this newsletter to stay updated about my new articles.
See you soon! Have a nice week ahead!
Author: Abhi Sharma - LinkedIn