Why use t-stat?
I hope you’ve read my article on Confidence Intervals. If not, it’s better to read that one before proceeding.
When you calculate a confidence interval, you have two choices: use the z-stat or the t-stat. Most often, because you don’t know the standard deviation of the population, you approximate it with the standard deviation of the sample (S). In that case, you’re told to use the t-stat. But why?
The calculation of a confidence interval rests on the assumption that the sampling distribution is normal.
But what if it is not?
In a normal distribution, about 95% of the values lie within 2 standard deviations of the mean. Extreme values have very little chance of occurring. Can we still say that once we replace the population standard deviation with S? No, there is extra uncertainty, and this uncertainty grows when the sample size is small (n < 30).
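You can see the heavier tails directly. The sketch below (assuming scipy is available) compares the chance of landing more than 3 standard deviations above the center under the normal distribution versus under a t-distribution with only 5 degrees of freedom:

```python
from scipy import stats

# Upper-tail probability P(X > 3) under each distribution.
# The t-distribution with few degrees of freedom assigns far more
# probability to extreme values than the normal does.
p_normal = stats.norm.sf(3)      # roughly 0.0013
p_t5 = stats.t.sf(3, df=5)       # roughly 0.015, about 10x larger

print(f"Normal tail:  {p_normal:.5f}")
print(f"t (df=5) tail: {p_t5:.5f}")
```

The exact degrees of freedom (5 here) is just an illustrative choice; the gap shrinks as the degrees of freedom grow.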
To account for this uncertainty, we assume a t-distribution instead. We are now saying that extreme values have a greater chance of occurring than under the normal distribution.
As a result, when you calculate confidence intervals using the t-stat, the band will be wider: at the same confidence level, the t-distribution covers more of the x-axis.
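Here is a minimal sketch of that widening, computed both ways on a small made-up sample (the numbers are purely illustrative, not from any real dataset):

```python
import numpy as np
from scipy import stats

sample = np.array([4.1, 5.3, 4.8, 5.9, 4.4, 5.1, 4.7, 5.5])  # hypothetical data
n = len(sample)
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(n)   # standard error, using S (ddof=1)

# Critical values for a 95% confidence interval
z_crit = stats.norm.ppf(0.975)          # ~1.96
t_crit = stats.t.ppf(0.975, df=n - 1)   # larger, since df = n - 1 = 7 is small

z_interval = (mean - z_crit * sem, mean + z_crit * sem)
t_interval = (mean - t_crit * sem, mean + t_crit * sem)

print(f"z interval: ({z_interval[0]:.3f}, {z_interval[1]:.3f})")
print(f"t interval: ({t_interval[0]:.3f}, {t_interval[1]:.3f})")  # wider band
```

Both intervals are centered on the same sample mean; only the multiplier on the standard error changes, so the t interval is strictly wider.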
But this divergence between the t-distribution and the normal distribution shrinks as the sample size increases.
So when n > 30, you can calculate confidence intervals using the z-stat alone.
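The convergence above is easy to verify: as the degrees of freedom grow, the t critical value approaches the z critical value of about 1.96.

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)  # ~1.96
print(f"z critical value: {z_crit:.4f}")

# The 97.5th-percentile t critical value drifts down toward 1.96
# as the degrees of freedom (n - 1) increase.
for df in (5, 10, 30, 100, 1000):
    print(f"df={df:5d}: t critical value = {stats.t.ppf(0.975, df):.4f}")
```

By df = 30 the t value is already within about 0.1 of 1.96, which is why n > 30 is the usual rule of thumb.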
You might ask: in the world of big data, when do we ever have a sample size of fewer than 30? Why even learn about the t-distribution?
Many ML algorithms use t-distributions internally. For example, t-SNE (a dimensionality reduction technique) models pairwise similarities in the low-dimensional space with a Student’s t-distribution. So there are applications beyond simple inference.