Homoscedasticity — From a line in a checklist to a key element in data analysis

During my graduate program, in training courses, and even in books and papers on statistics, I have heard or read a very clear message:

“Before following a parametric approach to analyze your data and check what hypothesis test you should perform, you should check for the required assumptions — normality and homoscedasticity.”

Normality has always been a clear-cut issue for me: if the distribution is asymmetric, using statistics such as the mean and standard deviation, which depend heavily on data symmetry, is a clear mistake. To assess this, I normally plot the data, calculate skewness and kurtosis and, just to be on the safe side, run a Shapiro-Wilk test.
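This kind of check is straightforward to script. The sketch below uses scipy and numpy on a hypothetical sample (the data, seed, and parameters are illustrative, not from the post):

```python
import numpy as np
from scipy import stats

# Hypothetical sample drawn from a normal distribution for illustration
rng = np.random.default_rng(42)
sample = rng.normal(loc=10, scale=2, size=100)

# Shape statistics: both should be close to 0 for normal data
skew = stats.skew(sample)
kurt = stats.kurtosis(sample)  # excess kurtosis (0 for a normal distribution)

# Shapiro-Wilk: H0 = the data come from a normal distribution
w_stat, p_value = stats.shapiro(sample)
print(f"skewness={skew:.3f}, kurtosis={kurt:.3f}, Shapiro-Wilk p={p_value:.3f}")
```

A large Shapiro-Wilk p-value means we fail to reject normality; combined with near-zero skewness and kurtosis, it supports a parametric approach.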

Homoscedasticity assessment is also an important step in determining which statistical approach to follow. However, why treat this assessment as "only" a step, rather than as an important feature of data analysis in its own right?

What is homoscedasticity and why is it important?

A simple definition of homoscedasticity is homogeneity of variances. In practice it means that, for example, when comparing two samples, the two are homoscedastic if their variances are not statistically different. But why is this important?

Looking just at the t-test, there are practical implications of knowing whether the variances of the two samples are similar or unknown/different. Assessing homoscedasticity tells you whether you should apply Welch's test (also known as the unequal-variance t-test) instead of Student's t-test.
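In scipy this choice is a single flag on `ttest_ind`. A minimal sketch, with made-up data where the means match but the spreads differ:

```python
import numpy as np
from scipy import stats

# Illustrative samples: same mean, different standard deviations
rng = np.random.default_rng(0)
a = rng.normal(loc=0, scale=1, size=50)
b = rng.normal(loc=0, scale=3, size=50)

# Student's t-test assumes equal variances...
t_student, p_student = stats.ttest_ind(a, b, equal_var=True)

# ...while Welch's t-test does not
t_welch, p_welch = stats.ttest_ind(a, b, equal_var=False)

print(f"Student p={p_student:.3f}, Welch p={p_welch:.3f}")
```

When the variances really are unequal, Student's version can give misleading p-values, which is why the homoscedasticity check comes first.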

Nevertheless, the importance of variance homogeneity doesn't stop here. If you go back to your Statistics manuals (or do a quick search on the Internet), you'll find that a population with a normal distribution can be characterized by two parameters: mean and standard deviation. Therefore, two populations with normal distributions are different if their means and/or standard deviations are different.

For some scientific fields, it is impossible to work on full populations; therefore, they need to deal with samples from those populations. A sample can be defined as a set of individuals that are taken from a population using a specific mechanism (e.g. randomization). So, if the original population follows a normal distribution, the corresponding samples can also be characterized by a mean and a standard deviation.

Let’s work with one simple example: we have two different samples of 100 entries and we want to answer if these samples come from the same population.

By just looking at the plot, we already suspect the samples may be different, but let's proceed with the analysis. If they come from populations with normal distributions (which we are assuming), the temptation is to quickly check whether the variances are equal and then run the appropriate t-test. The resulting p-value is 0.3336; even a Welch t-test tells us the two populations are not different. But how can that be? As the output shows, the null hypothesis of the t-test concerns the means, so it addresses only one of the two parameters that can differ between the samples. Let's now look at the variances, using Levene's test:
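The original data and seeds are not given, so the sketch below reproduces the spirit of the example with assumed parameters: two samples of 100 values with the same mean but different standard deviations, compared first on means (Welch) and then on variances (Levene):

```python
import numpy as np
from scipy import stats

# Assumed setup: same mean, different spread (the post's actual data are not given)
rng = np.random.default_rng(7)
sample1 = rng.normal(loc=50, scale=5, size=100)
sample2 = rng.normal(loc=50, scale=15, size=100)

# Welch's t-test: H0 = equal means
t_stat, p_means = stats.ttest_ind(sample1, sample2, equal_var=False)

# Levene's test: H0 = equal variances
w_stat, p_var = stats.levene(sample1, sample2)

print(f"Welch t-test p={p_means:.4f}")
print(f"Levene's test p={p_var:.2e}")
```

With data built this way, the t-test typically fails to reject equal means, while Levene's test rejects equal variances decisively, which is exactly the pattern described above.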


As you can see, we get a very small p-value, meaning that the variances of these two samples are not equal. So we now know that they have similar means but different variances, which is an indication that the samples were taken from two different populations.

Even if we follow a non-parametric approach (more “relaxed” concerning the distribution assumption), we get similar results:

  • Wilcoxon Test

  • Fligner-Killeen Test (Non-parametric option for homogeneity of variances)
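
Both non-parametric tests are also available in scipy. A sketch on the same kind of assumed data (scipy's `mannwhitneyu` is the two-independent-samples form of the Wilcoxon rank-sum test):

```python
import numpy as np
from scipy import stats

# Assumed data, as before: same center, different spread
rng = np.random.default_rng(7)
sample1 = rng.normal(loc=50, scale=5, size=100)
sample2 = rng.normal(loc=50, scale=15, size=100)

# Wilcoxon rank-sum / Mann-Whitney U: H0 = same location
u_stat, p_location = stats.mannwhitneyu(sample1, sample2)

# Fligner-Killeen: non-parametric H0 = equal variances
f_stat, p_spread = stats.fligner(sample1, sample2)

print(f"Mann-Whitney p={p_location:.4f}, Fligner-Killeen p={p_spread:.2e}")
```

The pattern mirrors the parametric case: the location test alone can miss the difference, while the spread test flags it.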

Taking all of this into consideration, should we say our samples don’t come from the same population? The evidence points to yes — and they should, since they were produced using random numbers from two different normal distributions, which varied in the standard deviation parameter.

So, the key message of this post is: we should give variance analysis greater importance, since it can tell us as much as (or even more than) means and medians. In practice, we should report the results of both traditional hypothesis tests and tests for homogeneity of variances, since together they help us interpret the data and extract meaningful information about the phenomena we want to study.
