C for confidence interval and C for confusion
Sakshi Jain
Data Analyst | Transitioning to Data Science | A/B Testing & Statistical Analysis Expert | Passionate About Data-Driven Insights
“We are 95% confident that the population mean falls within the confidence interval.”
I am sure you have seen a statement like the one above in many research papers, newspapers and data analysis articles. But have you ever thought about what a confidence interval really means and how these researchers become so sure about their results? Let’s explore it.
A researcher collects data for a limited time and then works on this small dataset. The results from this dataset are used to infer what the whole is like. We call this small portion of data a sample and the entire group it was drawn from the population. The challenge is how to apply the results obtained from the sample to the population. This is where our friend, the confidence interval, helps us.
A confidence interval expresses the degree of uncertainty or certainty in an estimate that comes from a sample. In other words, it tells us how confident we can be that the results from a sample reflect what we would expect to find if it were possible to work on the entire population. Confidence intervals are intrinsically connected to confidence levels. Now, what is a confidence level? Is there any difference between a confidence interval and a confidence level? Are you confused now? Don’t be. I am here to help you.
Confidence levels are expressed as a percentage (for example, a 95% confidence level). It means that if you repeated an experiment or survey over and over again, building an interval each time, about 95 percent of those intervals would contain the true population value (in other words, your method would be sound!). Confidence intervals are your results, and they are usually a pair of numbers. For example, you survey a group of pet owners to see how many cans of dog food they purchase a year. You compute your statistic at the 99 percent confidence level and get a confidence interval of (200, 300). That means you estimate that they buy, on average, between 200 and 300 cans a year. You’re super confident (99% is a very high level!) that your results are sound, statistically.
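To make the "repeat it over and over" idea concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption for illustration and not part of the survey example: the population is taken to be normally distributed with a mean of 250 cans and a standard deviation of 40, each sample has 100 pet owners, and the interval uses the normal approximation. The function names are hypothetical.

```python
import random
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for a population mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / n ** 0.5
    # Two-sided critical value, about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err


def coverage(true_mean=250.0, true_sd=40.0, n=100, trials=2000,
             confidence=0.95, seed=0):
    """Fraction of repeated samples whose interval contains the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(trials):
        # Hypothetical population: normal with the assumed mean and spread.
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        low, high = mean_confidence_interval(sample, confidence)
        if low <= true_mean <= high:
            hits += 1
    return hits / trials


print(f"Share of intervals containing the true mean: {coverage():.3f}")
```

Under these assumptions the printed share should land close to 0.95. Any single interval either contains the true mean or it does not; the 95% describes how often the procedure succeeds across many repetitions.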
We are not covering how to calculate a confidence interval here. There are a lot of websites and statistics books you can refer to for that. The aim of this article is to look at the term from a bird’s-eye view, and I hope I have explained it clearly.
Now I am sure that when you saw the term confidence level, you also saw the term significance level. Although they sound very similar, significance level and confidence level are different concepts, even though they are linked: a 95% confidence level corresponds to a significance level of 0.05. In a hypothesis test, the significance level, alpha, is the probability of making the wrong decision when the null hypothesis is true, that is, of rejecting a null hypothesis that is actually correct. (The null hypothesis usually states the opposite of what an investigator or experimenter predicts or expects.)
Above, I described a confidence level as answering the question: "...if the poll/test/experiment was repeated (over and over), would the results be the same?" In essence, confidence levels deal with repeatability. Significance levels, on the other hand, are not a description of your results. They are set at the beginning of a specific type of experiment (a "hypothesis test") and controlled by the researcher/experimenter.
I hope you are not confused by so many terms. The last term I want to cover is margin of error, which also comes up frequently alongside confidence intervals. A margin of error tells you by how many percentage points your results may differ from the real population value. For example, a 95% confidence level with a 4 percent margin of error means that your statistic will be within 4 percentage points of the real population value 95% of the time.
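As a rough illustration of where a figure like "4 percent" can come from, the sketch below uses the standard normal-approximation formula for a survey proportion, z * sqrt(p * (1 - p) / n). The sample size of 600 and the observed proportion of 0.5 are hypothetical values chosen for the example, and the formula assumes a simple random sample.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(p_hat, n, confidence=0.95):
    """Normal-approximation margin of error for a sample proportion."""
    # Two-sided critical value, about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(p_hat * (1 - p_hat) / n)


# Hypothetical survey: 600 respondents, half of whom answer "yes".
moe = margin_of_error(p_hat=0.5, n=600)
print(f"{moe * 100:.1f} percentage points")  # prints "4.0 percentage points"
```

With these assumed numbers, the survey result would be reported as 50% plus or minus about 4 points. A larger sample shrinks the margin, and a higher confidence level widens it.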
The idea behind confidence levels and margins of error is that any survey or poll will differ from the true population by a certain amount. Confidence intervals and margins of error reflect the fact that there is room for error, so although 95% or 98% confidence with a 2 percent margin of error might sound like a very good statistic, room for error is built in, which means that sometimes statistics are wrong. For example, a Gallup poll in 2012 (incorrectly) stated that Romney would win the 2012 election, with Romney at 49% and Obama at 48%. The stated confidence level was 95% with a margin of error of +/- 2 percentage points, which means that the results were calculated to be accurate to within 2 percentage points 95% of the time.
The real results from the election were Obama 51% and Romney 47%. Obama's actual share was 3 points above the poll's figure, which is outside the Gallup poll’s margin of error (2 percent), showing that not only can statistics be wrong, but polls can be too.
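The comparison can be checked with a few lines of arithmetic. This small sketch uses only the rounded percentages quoted above; the variable names are made up for the example.

```python
# Figures quoted in the article, in percent.
poll = {"Romney": 49, "Obama": 48}
actual = {"Romney": 47, "Obama": 51}
margin = 2  # stated margin of error, in percentage points

results = {}
for candidate, polled in poll.items():
    miss = actual[candidate] - polled
    # A miss counts as "outside" only if it exceeds the stated margin.
    results[candidate] = (miss, abs(miss) > margin)
    status = "outside" if abs(miss) > margin else "within"
    print(f"{candidate}: off by {miss:+d} points, {status} the margin")
```

On these rounded figures, Romney's result sits right at the edge of the 2-point margin, while Obama's result is 3 points away from the poll and falls outside it.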
And that’s why I say welcome to the statistical world, my friends!
References:
https://towardsdatascience.com/confidence-intervals-explained-simply-for-data-scientists-8354a6e2266b
https://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/margin-of-error/
https://www.dummies.com/education/science/biology/confidence-interval-basics/