Be confident with confidence intervals

In continuation of my last blog on statistics, "Data analytics is all about Statistics", where we saw various probability distributions, in this article we are going to see what confidence intervals are. This concept comes under a branch of statistics called Inferential Statistics, where we try to draw inferences about a population just by analysing samples.

Central Limit Theorem

The Central Limit Theorem (CLT) forms the basis of much of inferential statistics, so it is important to understand it before moving ahead.

A sampling distribution is the distribution of a measure derived from a large number of samples of a fixed size, say n, drawn from a population. The sampling distribution of means is the distribution of the means of the various samples drawn from a population. Let's assume we have a population of 20 million students with an average height of 5.8 feet. We draw a sample of 50 students and calculate the mean height for those 50 students, which comes out to be 6.1 feet. If we now draw 10,000 such samples and calculate the mean of each, the distribution of those 10,000 means is called the sampling distribution of the means.

The Central Limit Theorem states that the sampling distribution of means will be approximately normal if the sample size is large enough. It does not matter which distribution the population follows (as long as its variance is finite); the sampling distribution of means will still approach a normal distribution.

Properties of central limit theorem

  1. The mean of the sampling distribution is approximately equal to the population mean: μ(x̄) = μ.
  2. The standard deviation of the sampling distribution is also called the standard error, and it is equal to the standard deviation of the population divided by the square root of the sample size: SE = σ/√n.
  3. The larger the sample size, the more closely the sampling distribution resembles a symmetrical normal distribution.
  4. As a rule of thumb, the sample size should be greater than or equal to 30 for the CLT to apply.
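
The first two properties can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the exponential population (mean 1, standard deviation 1), the seed, and all variable names are my own choices and do not come from the height example above.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Assumed, clearly non-normal population: exponential with mean 1.0.
# For this distribution the standard deviation is also 1.0.
POP_MEAN = 1.0
POP_SD = 1.0


def sample_mean(n):
    """Draw one sample of size n from the population and return its mean."""
    return statistics.fmean(random.expovariate(1 / POP_MEAN) for _ in range(n))


n = 50                 # size of each sample
num_samples = 10_000   # number of samples drawn

# The sampling distribution of the means.
means = [sample_mean(n) for _ in range(num_samples)]

mean_of_means = statistics.fmean(means)
sd_of_means = statistics.stdev(means)
standard_error = POP_SD / n ** 0.5  # sigma / sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (population mean {POP_MEAN})")
print(f"sd of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) = {standard_error:.3f})")
```

Even though the population is heavily skewed, the mean of the sample means should land close to the population mean, and their standard deviation close to σ/√n.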

Confidence Intervals

It is not an easy task to infer a population parameter from a sample. There is always a chance of error, which is why the concept of confidence intervals is introduced.

A confidence interval is a range of values (a lower bound and an upper bound), built around a point estimate such as a sample mean, within which the population parameter is expected to lie with a certain level of confidence.

In the last blog we saw the normal distribution and the standard normal distribution, and in this blog we have understood the significance of the central limit theorem. Now we are going to see some applications of these concepts in finding confidence intervals.

Let's say you want to infer the population mean from a sample mean. The central limit theorem describes how sample means behave across many samples, but in this case we only have one sample. We certainly cannot say that the sample mean is the population mean. However, the CLT applies to all kinds of distributions, so we know how far a sample mean typically falls from the population mean. Assume your sample mean comes out to be 132.5; we can then say that the population mean lies somewhere between 128.5 and 136.5 with a 95% confidence level. Isn't this interesting? Let's see how we can calculate confidence intervals with a certain confidence level.

Point Estimate - It is a single value calculated from a sample (like the sample mean or sample standard deviation) that is used to estimate the corresponding population parameter.

Confidence Interval - Range of values within which a population parameter is expected to lie

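Confidence Level - It shows how confident we are that the confidence interval contains the population parameter. Formally, it is the share of intervals that would contain the parameter if we repeated the sampling many times. CL is presented in percentage form. The most common CLs are 90%, 95% and 99%.

Significance Level - It is the probability that the interval does not contain the population parameter, i.e. 1 minus the confidence level (0.05 for a 95% CL). It is denoted by the Greek letter alpha, α.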

Confidence intervals when population variance is known

Suppose we have a sample of 30 data scientist salaries from Gurgaon and we want to infer the average data scientist salary for the entire city. We have been given the population variance σ². By the CLT, we know that the standard error, i.e. the standard deviation of the sampling distribution, is σ/√n.

Since we have one sample of n=30 and we know population variance, we can estimate the confidence interval by

[ x̄ - z-value(α/2) * (σ/√n) , x̄ + z-value(α/2) * (σ/√n) ]

We can find the z-values from the z-table. For a 90% confidence level, α = 0.1 and the z-value for α/2 is 1.65 (1.645 to three decimals). Similarly, for 95% and 99% the corresponding z-values are 1.96 and 2.58 respectively.

Figure: Z-table

Note - We can use z-values for finding confidence intervals when we know the population variance and the number of data points is equal to or above 30.
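
Instead of reading a z-table, the z-values can also be computed. The following is a minimal sketch using Python's standard library; the function names and the population standard deviation of 11.2 are hypothetical values chosen for illustration, picked so that the result lines up with the 132.5 example used earlier.

```python
from statistics import NormalDist


def z_critical(confidence_level):
    """Two-sided z-value for a confidence level such as 0.95."""
    alpha = 1 - confidence_level              # significance level
    return NormalDist().inv_cdf(1 - alpha / 2)


def ci_known_variance(sample_mean, sigma, n, confidence_level=0.95):
    """Confidence interval for the population mean when sigma is known."""
    margin = z_critical(confidence_level) * sigma / n ** 0.5  # z * SE
    return sample_mean - margin, sample_mean + margin


for cl in (0.90, 0.95, 0.99):
    print(f"{cl:.0%}: z = {z_critical(cl):.3f}")

# Hypothetical inputs: sample mean 132.5, known sigma 11.2, n = 30.
low, high = ci_known_variance(sample_mean=132.5, sigma=11.2, n=30)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```

With these assumed inputs the 95% interval comes out at roughly 128.5 to 136.5.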


Confidence Intervals when population variance is unknown

Alright, now the problem is how to find the confidence interval when we don't know the population variance. The solution is the student's t distribution (for a refresher, refer to my earlier blog). We can use t-values to find confidence intervals for a smaller sample with unknown population variance by using the formula below:

[ x̄ - t-value(n-1,α/2) * (s/√n) , x̄ + t-value(n-1,α/2) * (s/√n) ]

Here, x̄ is the sample mean, s is the sample standard deviation, n is the sample size, α is the significance level and t-value(n-1,α/2) is the t-value for n-1 degrees of freedom.

We can use the t-table to find the t-value.

Figure: T-table

We can use the t-value method whenever the population variance is unknown. However, t-values beyond 30 degrees of freedom closely resemble z-values, so using the student's t statistic is especially advisable when the sample size is smaller than 30.
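
Here is a minimal sketch of the t-based interval. Python's standard library has no t distribution, so the t-value is passed in as read from a t-table (2.262 for 9 degrees of freedom at 95% confidence). The salary figures and the function name are hypothetical and only serve the illustration.

```python
import statistics


def ci_unknown_variance(sample, t_value):
    """Confidence interval for the mean when the population variance is unknown.

    t_value is the t-table entry for len(sample) - 1 degrees of freedom
    at the chosen confidence level.
    """
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)       # sample standard deviation (n - 1 denominator)
    margin = t_value * s / n ** 0.5    # t * (s / sqrt(n))
    return x_bar - margin, x_bar + margin


# Hypothetical sample of 10 salaries (units are arbitrary).
salaries = [52, 48, 55, 60, 47, 51, 58, 49, 53, 57]

T_VALUE_DF9_95 = 2.262  # t-table value: 9 degrees of freedom, alpha/2 = 0.025

low, high = ci_unknown_variance(salaries, T_VALUE_DF9_95)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```

Because the t-value (2.262) is larger than the corresponding z-value (1.96), the interval is wider than a z-based interval would be, which reflects the extra uncertainty of estimating the variance from a small sample.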

Refer to the Kaggle notebooks for CLT and z-statistics, and for the student's T distribution.


Some advanced concepts for confidence intervals

So far we have dealt with a single sample and calculated confidence intervals for its mean, but that is not always the case. In real life, we may have to analyse two samples and compare them. These can be classified as dependent and independent samples. Dependent samples are basically the same subjects measured twice, e.g. a sample of patients who were tested for blood sugar levels before and after a medication. Independent samples come from entirely different groups, e.g. the test scores of engineering students and management students.

Now, we will see how confidence intervals for these different types of multiple samples are calculated.

Confidence Intervals for dependent samples

Suppose we have a sample of 10 patients whose blood sugar levels were measured before and after they were given a certain drug. We can use the method below to analyse the effect of the drug on this sample.

  1. Calculate the difference between the two blood sugar readings (before and after) for each patient.
  2. Find the mean and standard deviation of the differences.
  3. Since the sample size is less than 30 and we don't know the population variance of the differences, we can use the student's t distribution to get the confidence interval with a certain confidence level:

[ x̄ - t-value(n-1,α/2) * (s/√n) , x̄ + t-value(n-1,α/2) * (s/√n) ]

Here, x̄ and s are the mean and standard deviation of the differences, and n is the number of patients.
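
The three steps above can be sketched as follows. The blood sugar readings are made-up numbers for illustration, the function name is my own, and the t-value 2.262 is the t-table entry for 9 degrees of freedom at 95% confidence.

```python
import statistics


def ci_paired(before, after, t_value):
    """Confidence interval for the mean difference of dependent (paired) samples."""
    # Step 1: difference between the two readings for each subject.
    diffs = [a - b for b, a in zip(before, after)]
    # Step 2: mean and standard deviation of the differences.
    n = len(diffs)
    d_bar = statistics.mean(diffs)
    s_d = statistics.stdev(diffs)
    # Step 3: t-based interval with n - 1 degrees of freedom.
    margin = t_value * s_d / n ** 0.5
    return d_bar - margin, d_bar + margin


# Hypothetical blood sugar readings for 10 patients.
before = [140, 152, 138, 160, 147, 155, 149, 143, 158, 151]
after = [132, 145, 135, 150, 140, 149, 146, 137, 147, 144]

T_VALUE_DF9_95 = 2.262  # t-table value: 9 degrees of freedom, alpha/2 = 0.025

low, high = ci_paired(before, after, T_VALUE_DF9_95)
print(f"95% CI for mean change: ({low:.2f}, {high:.2f})")
```

In this made-up data the whole interval lies below zero, which would suggest that the readings dropped after the drug. If the interval had included zero, we could not rule out "no change" at this confidence level.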

Confidence Intervals for Independent samples with known population variance

Let's assume we have the grades of a sample of 50 engineering students and 40 management students, and we want to find the confidence interval for the difference between the mean grades of the two streams.

  1. Calculate the mean grade for each of the two streams separately and take the difference between the two means.
  2. Find the variance of the difference between the two means, σ², using the formula:

Figure: Variance of the difference
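
For reference, the variance of the difference between two independent sample means is commonly written as below. This is the standard textbook form, offered as a sketch because the formula image is not reproduced here; the subscripts 1 and 2 (for the two streams) are assumed symbol names.

```latex
\sigma^2 = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}
```

Here σ₁² and σ₂² are the known population variances of the two groups, and n₁ and n₂ are the two sample sizes.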

  3. Calculate the confidence interval:

[ mean difference - z-value(α/2) * sqrt(σ²) , mean difference + z-value(α/2) * sqrt(σ²) ]

Here σ² is the variance of the difference computed in step 2.
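
A minimal sketch of this calculation is shown below. The sample sizes (50 and 40) follow the example above, but the means and population variances are hypothetical numbers, and the function name is my own.

```python
from statistics import NormalDist


def ci_diff_known_variances(mean1, mean2, var1, var2, n1, n2, confidence_level=0.95):
    """CI for the difference of two means from independent samples, known variances."""
    alpha = 1 - confidence_level
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z-value for alpha/2
    var_diff = var1 / n1 + var2 / n2          # variance of the difference of the means
    diff = mean1 - mean2
    margin = z * var_diff ** 0.5
    return diff - margin, diff + margin


# Hypothetical inputs: engineering mean 72 (variance 100, n = 50),
# management mean 68 (variance 81, n = 40).
low, high = ci_diff_known_variances(72, 68, 100, 81, 50, 40)
print(f"95% CI for the difference in means: ({low:.2f}, {high:.2f})")
```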

Confidence Intervals for Independent samples with unknown population variance - assumed to be equal

When the samples are independent and the population variances are unknown but assumed to be equal, we can use the concept of pooled sample variance for our calculations.

Figure: Pooled sample variance for 2 samples
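
For reference, the pooled sample variance is commonly written as below. This is the standard textbook form, given as a sketch because the formula image is not reproduced here; the symbol names are assumptions.

```latex
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
```

Here s₁² and s₂² are the two sample variances and n₁ and n₂ are the two sample sizes, so each sample variance is weighted by its degrees of freedom.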

Since the variance in this case is unknown, we are going to use the student's t statistic for the confidence interval.

Figure: Confidence interval for independent samples with unknown but equal variance
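
In its usual textbook form, this interval is the difference of the sample means plus or minus t-value(n1+n2-2, α/2) * sqrt(sp²/n1 + sp²/n2), where sp² is the pooled sample variance. The sketch below assumes that form; the two grade samples and the function names are hypothetical, and 2.228 is the t-table entry for 10 degrees of freedom at 95% confidence.

```python
import statistics


def pooled_variance(sample1, sample2):
    """Pooled sample variance: each variance weighted by its degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    s1_sq = statistics.variance(sample1)
    s2_sq = statistics.variance(sample2)
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)


def ci_diff_pooled(sample1, sample2, t_value):
    """CI for the difference of means, unknown variances assumed equal.

    t_value is the t-table entry for n1 + n2 - 2 degrees of freedom.
    """
    n1, n2 = len(sample1), len(sample2)
    sp_sq = pooled_variance(sample1, sample2)
    diff = statistics.mean(sample1) - statistics.mean(sample2)
    margin = t_value * (sp_sq / n1 + sp_sq / n2) ** 0.5
    return diff - margin, diff + margin


# Hypothetical grades for two independent groups of 6 students each.
engineering = [78, 82, 75, 90, 85, 80]
management = [70, 74, 68, 79, 73, 76]

T_VALUE_DF10_95 = 2.228  # t-table value: 10 degrees of freedom, alpha/2 = 0.025

low, high = ci_diff_pooled(engineering, management, T_VALUE_DF10_95)
print(f"95% CI for the difference in means: ({low:.2f}, {high:.2f})")
```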

Confidence Intervals for Independent samples with unknown population variance - assumed to be unequal

This is similar to comparing apples and oranges, which is rarely done in real life; however, for the sake of covering all the topics related to confidence intervals, we are going to discuss this as well.

We will dive straight into the formulas, which will come in handy when you do the maths. The formula to find the confidence interval is as follows:

Figure: Confidence interval for independent samples with unknown and unequal variance
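
For reference, this interval is commonly written as below (often called Welch's interval). This is the standard textbook form, given as a sketch because the formula image is not reproduced here; the symbol ν for the degrees of freedom is an assumed name.

```latex
(\bar{x}_1 - \bar{x}_2) \;\pm\; t_{\nu,\,\alpha/2}\,\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}
```

Here x̄₁ and x̄₂ are the two sample means, s₁² and s₂² the two sample variances, and n₁ and n₂ the two sample sizes.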

Here, instead of the pooled sample variance, we use the sample variance of each sample separately. Another thing to note is DF, the degrees of freedom, which is calculated by another formula:

Figure: Degrees of freedom
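
The degrees of freedom for this case are usually computed with the Welch-Satterthwaite approximation. The sketch below assumes that this is the formula meant here; the data and function names are hypothetical. Because the result is generally not a whole number, the sketch rounds it down to 9 and uses the t-table value 2.262, which is a slightly conservative choice.

```python
import statistics


def welch_df(sample1, sample2):
    """Welch-Satterthwaite approximation of the degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    v1 = statistics.variance(sample1) / n1   # s1^2 / n1
    v2 = statistics.variance(sample2) / n2   # s2^2 / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))


def ci_diff_unequal_variance(sample1, sample2, t_value):
    """CI for the difference of means, unknown variances not assumed equal."""
    v1 = statistics.variance(sample1) / len(sample1)
    v2 = statistics.variance(sample2) / len(sample2)
    diff = statistics.mean(sample1) - statistics.mean(sample2)
    margin = t_value * (v1 + v2) ** 0.5
    return diff - margin, diff + margin


# Hypothetical grades for two independent groups.
engineering = [78, 82, 75, 90, 85, 80]
management = [70, 74, 68, 79, 73, 76]

df = welch_df(engineering, management)
print(f"degrees of freedom: {df:.2f}")

T_VALUE_DF9_95 = 2.262  # t-table value for 9 degrees of freedom (df rounded down)

low, high = ci_diff_unequal_variance(engineering, management, T_VALUE_DF9_95)
print(f"95% CI for the difference in means: ({low:.2f}, {high:.2f})")
```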


Conclusion

In this article, we covered a very important topic of inferential statistics: confidence intervals. We also discussed what the central limit theorem is and how it makes our life easier while calculating confidence intervals. If you are an aspiring data scientist, these are some of the most fundamental concepts you should know.

In the coming blogs, I will cover the most demanding topic, which is hypothesis testing. Next week, I will also be talking about Generative AI in a separate blog and video, so follow this newsletter to stay updated about my new articles.

See you soon! Have a nice week ahead!



Author: Abhi Sharma - LinkedIn














