Be confident with confidence intervals

Abhi Sharma

Cloud Engineer @ Google | Ex-Founder | Building on Entrepreneurial Foundations

Published: June 16, 2023

This article continues my last blog on statistics, "Data analytics is all about Statistics", where we looked at various probability distributions. Here we are going to look at confidence intervals. This concept belongs to a special branch of statistics called inferential statistics, where we try to draw inferences about a population just by analysing samples.

Central Limit Theorem

The Central Limit Theorem (CLT) underpins much of inferential statistics, so it is important to understand it before moving ahead.

A sampling distribution is the distribution of a measure computed from a large number of samples, each of a fixed size n, drawn from a population. The sampling distribution of means is the distribution of the means of those samples. Let's assume we have a population of 20 million students with an average height of 5.8 feet. We draw a sample of 50 students and calculate the mean height for those 50 students, which comes out to be 6.1 feet. If we now draw 10,000 such samples and calculate the mean of each one, the distribution of those 10,000 means is called the sampling distribution of the means.

The Central Limit Theorem states that the sampling distribution of means will be approximately normal if the sample size is large enough. It does not matter which distribution the population follows (as long as its variance is finite); the sampling distribution of means gets closer to a normal distribution as the sample size grows.

Properties of central limit theorem

The mean of the sampling distribution equals the population mean, so the average of many sample means is approximately the population mean: μ_x̄ = μ
The standard deviation of the sampling distribution is also called the standard error, and it equals the population standard deviation divided by the square root of the sample size: SE = σ/√n
The larger the samples, the closer the sampling distribution is to a symmetric normal distribution.
As a rule of thumb, the sample size should be greater than or equal to 30 for the CLT to apply.
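
The properties above can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the population is a made-up, strongly skewed exponential distribution with mean 2.0 (not the student heights from the example), and the sample size and number of samples are arbitrary choices.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical, strongly skewed population: exponential with mean 2.0.
# For an exponential distribution the standard deviation equals the mean.
POP_MEAN = 2.0
POP_STD = 2.0


def sample_mean(n):
    """Draw one sample of size n from the population and return its mean."""
    return statistics.fmean(random.expovariate(1 / POP_MEAN) for _ in range(n))


n = 50                # size of each sample
num_samples = 10_000  # number of samples drawn

sample_means = [sample_mean(n) for _ in range(num_samples)]

mean_of_means = statistics.fmean(sample_means)
observed_se = statistics.stdev(sample_means)  # spread of the sample means
theoretical_se = POP_STD / n ** 0.5           # sigma / sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (population mean {POP_MEAN})")
print(f"observed standard error: {observed_se:.3f}")
print(f"sigma / sqrt(n):         {theoretical_se:.3f}")
```

Even though the population itself is far from normal, the mean of the sample means should land close to the population mean and their spread close to σ/√n.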

Confidence Intervals

Inferring a population parameter from a sample is not an easy task. There is always a chance of error, which is why the concept of confidence intervals is introduced.

A confidence interval is a range of values (a lower bound and an upper bound), computed from a sample, within which a population parameter such as the mean or standard deviation is expected to lie with a certain level of confidence.

In the last blog we saw the normal distribution and the standard normal distribution, and in this blog we have seen the significance of the central limit theorem. Now we are going to apply these concepts to find confidence intervals.

Let's say you want to infer the population mean from a sample mean. The central limit theorem describes how the means of many samples are distributed, but in this case we have only one sample, and we certainly cannot say that the sample mean is the population mean. However, because the CLT applies to all kinds of population distributions, we know how far a sample mean is likely to fall from the population mean. Assume your sample mean comes out to be 132.5; we can then say that the population mean lies somewhere between 128.5 and 136.5 with a 95% confidence level. Isn't this interesting? Let's see how we can calculate confidence intervals for a given confidence level.

Point Estimate - A single value computed from a sample, such as the sample mean or sample standard deviation, that is used to estimate the corresponding population parameter.

Confidence Interval - The range of values within which a population parameter is expected to lie.

Confidence Level - It expresses how confident we are that the population parameter lies within the confidence interval: if the sampling were repeated many times, about this percentage of the resulting intervals would contain the parameter. The CL is presented as a percentage. The most common CLs are 90%, 95% and 99%.

Significance Level - It is the probability that the interval does not contain the population parameter, that is, 1 - CL. It is denoted by the Greek letter alpha (α).

Confidence intervals when population variance is known

Suppose we have a sample of 30 data scientist salaries from Gurgaon and we want to infer the average data scientist salary for the entire city. We have been given the population variance σ². By the CLT, we know that the standard error, i.e. the standard deviation of the sampling distribution, is σ/√n.

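Since we have one sample of n = 30 and we know the population variance, we can estimate the confidence interval as

[ x̄ - z-value(α/2) * (σ/√n) , x̄ + z-value(α/2) * (σ/√n) ]

Here, x̄ is the sample mean and z-value(α/2) is the z-value that leaves a probability of α/2 in the upper tail of the standard normal distribution.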

We can find the z-values from the z-table. For a 90% confidence level, α = 0.1 and the z-value for α/2 is about 1.65 (1.645 to three decimals). Similarly, for 95% and 99% the corresponding z-values are 1.96 and 2.58 respectively.
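
Instead of reading a printed table, the same z-values can be obtained from the inverse CDF of the standard normal distribution. This is a minimal sketch using Python's standard library; the helper name `z_critical` is my own and not part of the original article.

```python
from statistics import NormalDist


def z_critical(confidence_level):
    """Two-sided z-value for a confidence level such as 0.95."""
    alpha = 1 - confidence_level
    # The z-value leaves alpha/2 in the upper tail of the standard normal.
    return NormalDist().inv_cdf(1 - alpha / 2)


for cl in (0.90, 0.95, 0.99):
    print(f"{cl:.0%} confidence -> z = {z_critical(cl):.3f}")
# 90% confidence -> z = 1.645
# 95% confidence -> z = 1.960
# 99% confidence -> z = 2.576
```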

[Figure: Z-table]

Note - We can use z-values to find confidence intervals when we know the population variance and the number of data points is 30 or more.
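
As a rough sketch of the known-variance case, the function below computes the interval for a single sample. The salary figures are randomly generated placeholders and the known σ of 4.0 is an assumption made purely for illustration; neither comes from real Gurgaon data.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def z_confidence_interval(sample, population_std, confidence_level=0.95):
    """Confidence interval for the mean when the population std is known."""
    n = len(sample)
    x_bar = fmean(sample)
    alpha = 1 - confidence_level
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z-value for alpha/2
    margin = z * population_std / sqrt(n)    # z-value times standard error
    return x_bar - margin, x_bar + margin


# Made-up data: 30 hypothetical salaries with an assumed known sigma.
random.seed(7)
SIGMA = 4.0
salaries = [random.gauss(20.0, SIGMA) for _ in range(30)]

low, high = z_confidence_interval(salaries, SIGMA, confidence_level=0.95)
print(f"sample mean: {fmean(salaries):.2f}")
print(f"95% CI: [{low:.2f}, {high:.2f}]")
```

The interval is always centred on the sample mean; only its half-width depends on σ, n and the chosen confidence level.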

Confidence Intervals when population variance is unknown

Alright, now the problem is how to find the confidence interval when we don't know the population variance. The solution is Student's t distribution (for a refresher, refer to this blog). We can use t-values to find confidence intervals for a smaller sample with unknown population variance by using the formula below:

[ x̄ - t-value(n-1, α/2) * (s/√n) , x̄ + t-value(n-1, α/2) * (s/√n) ]

Here, x̄ is the sample mean, s is the sample standard deviation, n is the sample size, α is the significance level and t-value(n-1, α/2) is the t-value for n-1 degrees of freedom.

We can use the t-table to find the t-value.
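
A minimal sketch of the unknown-variance case, assuming SciPy is available: `scipy.stats.t.ppf` stands in for the t-table lookup, and the ten sample values are invented for illustration.

```python
from math import sqrt
from statistics import fmean, stdev

from scipy import stats


def t_confidence_interval(sample, confidence_level=0.95):
    """Confidence interval for the mean when the population variance is unknown."""
    n = len(sample)
    x_bar = fmean(sample)
    s = stdev(sample)  # sample standard deviation (n - 1 in the denominator)
    alpha = 1 - confidence_level
    t_crit = stats.t.ppf(1 - alpha / 2, n - 1)  # t-value for n - 1 degrees of freedom
    margin = t_crit * s / sqrt(n)
    return x_bar - margin, x_bar + margin


# Made-up sample of 10 measurements.
sample = [12.1, 11.4, 13.0, 12.7, 11.9, 12.3, 13.4, 11.2, 12.8, 12.0]

low, high = t_confidence_interval(sample)
print(f"sample mean: {fmean(sample):.2f}")
print(f"95% CI: [{low:.2f}, {high:.2f}]")
```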

The t-value method applies whenever the population variance is unknown. Beyond about 30 degrees of freedom, however, t-values closely resemble z-values, so Student's t statistic matters most when the sample size is smaller than 30.
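
To see how quickly t-values approach z-values, one can print both side by side. This is a small sketch assuming SciPy is installed; the degrees of freedom shown are arbitrary picks.

```python
from statistics import NormalDist

from scipy import stats

z = NormalDist().inv_cdf(0.975)  # two-sided 95% z-value

# Two-sided 95% t-values for a few degrees of freedom.
t_values = {df: float(stats.t.ppf(0.975, df)) for df in (5, 10, 30, 100)}

for df, t_crit in t_values.items():
    print(f"df = {df:>3}: t = {t_crit:.3f}, z = {z:.3f}, gap = {t_crit - z:.3f}")
```

The gap shrinks steadily as the degrees of freedom grow, which is why the two methods give very similar intervals for larger samples.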

Refer to the Kaggle notebooks for CLT and z-statistics and for Student's t distribution.

Some advanced concepts for confidence intervals

So far we have dealt with a single sample and calculated confidence intervals for its mean, but that is not always the case. In real life, we may have to analyse two samples and compare them. These can be classified as dependent and independent samples. Dependent samples are essentially the same subjects measured twice, e.g. a sample of patients whose blood sugar levels were tested before and after a medication. Independent samples are entirely different groups, e.g. the test scores of engineering students and of management students.

Now we will see how confidence intervals are calculated for these different types of samples.

Confidence Intervals for dependent samples

Suppose we have a sample of 10 patients whose blood sugar levels were measured before and after giving a certain drug. We can use the method below to analyse the effect of the drug on this sample.

Calculate the difference between the two blood sugar readings for each patient.
Find the mean and standard deviation of the differences.
Since the sample size is less than 30 and we don't know the population variance of the differences, use Student's t distribution to get the confidence interval at the chosen confidence level.

[ d̄ - t-value(n-1, α/2) * (s_d/√n) , d̄ + t-value(n-1, α/2) * (s_d/√n) ]

Here, d̄ is the mean of the differences, s_d is the standard deviation of the differences and n is the number of patients.
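
The three steps for dependent samples can be sketched as follows. The before and after readings are invented numbers for 10 hypothetical patients, and SciPy is assumed to be available for the t-value lookup.

```python
from math import sqrt
from statistics import fmean, stdev

from scipy import stats

# Made-up blood sugar readings for 10 hypothetical patients.
before = [142, 150, 138, 160, 155, 147, 152, 149, 158, 145]
after = [135, 144, 136, 150, 149, 140, 147, 145, 150, 139]

# Step 1: difference between the two readings for each patient.
differences = [b - a for b, a in zip(before, after)]

# Step 2: mean and standard deviation of the differences.
n = len(differences)
d_bar = fmean(differences)
s_d = stdev(differences)

# Step 3: t-based interval with n - 1 degrees of freedom (95% confidence).
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, n - 1)
margin = t_crit * s_d / sqrt(n)
low, high = d_bar - margin, d_bar + margin

print(f"mean difference: {d_bar:.2f}")
print(f"95% CI for the mean difference: [{low:.2f}, {high:.2f}]")
```

If the whole interval lies above zero, as it does for these made-up readings, the data are consistent with the drug lowering blood sugar on average.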

Confidence Intervals for Independent samples with known population variance

Let's assume we have the grades of a sample of 50 engineering students and a sample of 40 management students, and we want to find the confidence interval for the difference between the two mean grades.

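Calculate the mean grade for each stream separately and take the difference between the two means.
Find the variance of that difference from the two known population variances using the formula:

σ²(diff) = σ₁²/n₁ + σ₂²/n₂

Here, σ₁² and σ₂² are the population variances of the two streams, and n₁ and n₂ are the two sample sizes.
Calculate the confidence interval:

[ mean difference - z-value(α/2) * sqrt(σ²(diff)) , mean difference + z-value(α/2) * sqrt(σ²(diff)) ]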
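
A minimal sketch of this calculation, assuming the variance of the difference is the sum of each known population variance divided by its sample size. The means and variances passed in are illustrative placeholders, not real grade data, and the function name is my own.

```python
from math import sqrt
from statistics import NormalDist


def z_interval_two_means(mean_1, mean_2, var_1, var_2, n_1, n_2,
                         confidence_level=0.95):
    """CI for the difference of two means with known population variances."""
    mean_diff = mean_1 - mean_2
    var_diff = var_1 / n_1 + var_2 / n_2  # variance of the difference in means
    alpha = 1 - confidence_level
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sqrt(var_diff)
    return mean_diff - margin, mean_diff + margin


# Hypothetical inputs: 50 engineering and 40 management students.
low, high = z_interval_two_means(
    mean_1=58.0, mean_2=65.0,   # made-up sample means
    var_1=100.0, var_2=25.0,    # assumed known population variances
    n_1=50, n_2=40,
)
print(f"95% CI for the difference in means: [{low:.2f}, {high:.2f}]")
```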

Confidence Intervals for Independent samples with unknown population variance - assumed to be equal

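When the samples are independent and the population variances are unknown but assumed to be equal, we can use the pooled sample variance for our calculations:

s_p² = ((n₁ - 1) * s₁² + (n₂ - 1) * s₂²) / (n₁ + n₂ - 2)

Since the variance is unknown in this case, we use Student's t statistic, with n₁ + n₂ - 2 degrees of freedom, for the confidence interval:

[ mean difference - t-value(n₁+n₂-2, α/2) * sqrt(s_p²/n₁ + s_p²/n₂) , mean difference + t-value(n₁+n₂-2, α/2) * sqrt(s_p²/n₁ + s_p²/n₂) ]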
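
The following is a sketch of the equal-variance case, assuming the standard pooled-variance formula (each sample variance weighted by its degrees of freedom) and n₁ + n₂ - 2 degrees of freedom for the t-value. The grade lists are made up, the function name is my own, and SciPy is assumed to be installed.

```python
from math import sqrt
from statistics import fmean, variance

from scipy import stats


def pooled_t_interval(sample_1, sample_2, confidence_level=0.95):
    """CI for the difference of two means, assuming equal unknown variances."""
    n_1, n_2 = len(sample_1), len(sample_2)
    mean_diff = fmean(sample_1) - fmean(sample_2)
    df = n_1 + n_2 - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n_1 - 1) * variance(sample_1)
                  + (n_2 - 1) * variance(sample_2)) / df
    alpha = 1 - confidence_level
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    margin = t_crit * sqrt(pooled_var / n_1 + pooled_var / n_2)
    return mean_diff - margin, mean_diff + margin


# Made-up grades for two small, independent groups.
engineering = [72, 65, 80, 58, 69, 75, 62, 70]
management = [68, 74, 81, 77, 70, 85, 79, 73, 76, 82]

low, high = pooled_t_interval(engineering, management)
print(f"95% CI for the difference in means: [{low:.2f}, {high:.2f}]")
```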

Confidence Intervals for Independent samples with unknown population variance - assumed to be unequal

This is similar to comparing apples and oranges; for the sake of covering all the topics related to confidence intervals, we are going to discuss this case as well.

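We will dive straight into the formulas, which will come in handy when you do the maths. The formula for the confidence interval is as follows:

[ mean difference - t-value(DF, α/2) * sqrt(s₁²/n₁ + s₂²/n₂) , mean difference + t-value(DF, α/2) * sqrt(s₁²/n₁ + s₂²/n₂) ]

Here, instead of the pooled sample variance, we use the sample variance of each sample separately. Another thing to note is DF, the degrees of freedom, which can be calculated by another formula:

DF = (s₁²/n₁ + s₂²/n₂)² / [ (s₁²/n₁)²/(n₁ - 1) + (s₂²/n₂)²/(n₂ - 1) ]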
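
A sketch of the unequal-variance case, assuming the usual Welch approach: each sample keeps its own variance, and the degrees of freedom come from the Welch-Satterthwaite approximation. The two groups are invented data with visibly different spreads, and SciPy is assumed to be available.

```python
from math import sqrt
from statistics import fmean, variance

from scipy import stats


def welch_t_interval(sample_1, sample_2, confidence_level=0.95):
    """CI for the difference of two means without assuming equal variances."""
    n_1, n_2 = len(sample_1), len(sample_2)
    mean_diff = fmean(sample_1) - fmean(sample_2)
    v_1 = variance(sample_1) / n_1  # s1^2 / n1
    v_2 = variance(sample_2) / n_2  # s2^2 / n2
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (v_1 + v_2) ** 2 / (v_1 ** 2 / (n_1 - 1) + v_2 ** 2 / (n_2 - 1))
    alpha = 1 - confidence_level
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    margin = t_crit * sqrt(v_1 + v_2)
    return mean_diff - margin, mean_diff + margin, df


# Made-up data: group_a is much more spread out than group_b.
group_a = [52, 61, 47, 70, 58, 66, 49, 73]
group_b = [64, 66, 63, 67, 65, 64, 66, 65, 63, 67]

low, high, df = welch_t_interval(group_a, group_b)
print(f"degrees of freedom: {df:.2f}")
print(f"95% CI for the difference in means: [{low:.2f}, {high:.2f}]")
```

The degrees of freedom are generally not a whole number here, which is one reason a software lookup is more convenient than a printed t-table for this case.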

Conclusion

In this article, we covered a very important topic in inferential statistics: confidence intervals. We also discussed what the central limit theorem is and how it makes our life easy when calculating confidence intervals. If you are an aspiring data scientist, these are some of the most fundamental concepts you should know.

In the coming blogs, I will cover the most in-demand topic, which is hypothesis testing. Next week, I will also be talking about Generative AI in a separate blog and video, so follow this newsletter to stay updated about my new articles.

See you soon! Have a nice week ahead!

Author: Abhi Sharma - LinkedIn

Machines who think - ML & AI
