A Lay Interpretation of Statistical Significance and p-values: Slow Burn, Love at First Sight or Enduring Love?
It's rare that a lay interpretation of a technical matter is more accurate, or at least less misleading, than its technical one, but in the case of statistical significance and the metric used to quantify it, the p-value, this is probably true. This post goes out to non-technical folks and Statisticians/Data Scientists alike, because what people have studied in rigorous Statistics courses has, surprisingly even to many Statisticians/Data Scientists, been flawed all along.
First, what is the usual technical (but also couched as 'lay,' at least in technical circles) interpretation of a p-value? Say you're trying to see if there's a statistically significant difference between the means of an outcome like height between two groups. Then, the p-value is the probability that you would see a result (a difference in mean heights between the two groups) at least as extreme as the one you're seeing from your samples if there were truly no difference in means between the two groups in the entire population. The importance of this lies in the intuition that it is very unlikely (the p-value is very small) that the mean heights of the two samples, one from each group, would be so vastly different if the true population means were the same. Therefore, it's likely that the mean heights in these two different populations truly differ from each other. In Statistics, the phrase 'truly differ' is not used; instead, 'are statistically significantly different' is. Does this necessarily mean that there is a large, meaningful difference between the two groups? No, and that's probably why terms like 'significantly different' (without 'statistically' preceding it), 'truly different' and all manner of more concise and 'human' terms are not to be used in the case of hypothesis testing (or what the non-Statistical community calls 'A/B testing') or statistical regression modeling that outputs p-values. The term 'statistically significant' has a very specific interpretation. Unfortunately, the definition in terms of probability that I gave above has long been debunked as quantitatively inaccurate, though it is still the standard definition taught in school and used at work. For those who're interested in its very complex and esoteric quantitative definition, you know how to find it. Nevertheless, understanding the definition above will partly lead us to a very good understanding of the only two things you really need to know about statistical significance:
- It is affected by sample size. For any given real (nonzero) difference, the larger the sample, the greater the likelihood that your p-value will fall below 0.05, the standard criterion for declaring statistical significance.
- It is affected by the magnitude of the 'effect size.' For a given sample size, the larger the difference in means or proportions between your two groups, the greater the likelihood of attaining statistical significance. (The small simulation after this list sketches both factors.)
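To make this concrete, here's a rough simulation sketch in Python (the heights, differences and sample sizes are made up purely for illustration, not drawn from any real study) that computes the p-value from the definition above with a standard two-sample t-test and shows both factors at work:

```python
# Illustrative simulation of the two factors above, using made-up height data.
# The p-value is the chance of seeing a difference in sample means at least
# this extreme if the two groups' true population means were equal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def p_value(true_diff_cm, n_per_group):
    """Simulate two groups of heights and return the Welch two-sample t-test p-value."""
    group_a = rng.normal(loc=170, scale=7, size=n_per_group)
    group_b = rng.normal(loc=170 + true_diff_cm, scale=7, size=n_per_group)
    return stats.ttest_ind(group_a, group_b, equal_var=False).pvalue

# The same small true difference (0.5 cm): significance tends to appear only once the sample is huge.
print("0.5 cm difference, n=50 per group:     p =", round(p_value(0.5, 50), 3))
print("0.5 cm difference, n=20,000 per group: p =", round(p_value(0.5, 20_000), 4))

# A big true difference (8 cm) tends to be statistically significant even with a tiny sample.
print("8 cm difference,   n=15 per group:     p =", round(p_value(8.0, 15), 4))
```

The exact numbers will vary from run to run, but the pattern is the point: the same 0.5 cm difference can flip from 'not significant' to 'significant' purely by collecting more data, while a large difference tends to be significant almost immediately.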
Often, people without a background in Statistics (I don't necessarily mean academically) are unaware of, or forget about, the two factors above that determine statistical significance, and thus jump to wrong conclusions when they view p-values. You can't really blame them, given that the media often use terms like 'significant factors' (not 'statistically significant factors') with meanings that lay people conflate with the meaning of 'statistically significant.' It's not exactly the fault of these media in cases where they never meant to convey any statistical perspective, and besides, the use of the word 'significant' to mean 'substantial' is so prevalent in everyday language that it's easy for most people to assume 'statistically significant' means 'substantial' in whatever sense they take 'substantial' to mean. If someone who's not statistically trained sees that a p-value is >0.05, they may simply assume that an input variable in a regression model has no intrinsic association with the outcome variable. It is right, however, to think only that there is no statistically significant association - yet. The sample size may simply not have been large enough, and statistical significance may emerge later once a greater sample has been gathered. On the other hand, just because an input variable is statistically significant doesn't necessarily mean that it has much business value; it could be statistically significant just because the sample size used is huge, such that even a small 'effect size' on the outcome variable is enough to make the variable statistically significant. I put the term 'effect size' in quotes simply to avoid any suggestion that the variable has a causal effect on the outcome variable. This 'effect size' (called a coefficient in regression modeling) is the magnitude of the association with the outcome variable, quantifying how much the outcome variable changes when the input variable increases by one unit, holding all other variables constant.
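As an illustration of 'statistically significant but practically tiny,' here's a small sketch (again with made-up data, not a real model) of a regression where a huge sample makes a negligible coefficient come out as statistically significant:

```python
# Made-up illustration: with a huge sample, even a true coefficient of ~0.01
# (a negligible 'effect size') comes out overwhelmingly 'statistically significant'.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

n = 500_000                           # a huge sample
x = rng.normal(size=n)                # input variable
y = 0.01 * x + rng.normal(size=n)     # outcome; the true coefficient is a mere 0.01

result = stats.linregress(x, y)
print(f"estimated coefficient: {result.slope:.4f}")   # tiny association with the outcome
print(f"p-value:               {result.pvalue:.2e}")  # yet far below 0.05
```

Whether a coefficient of roughly 0.01 outcome units per unit of the input is worth acting on is a business question; the p-value alone can't answer it.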
From the two factors above, here is a simple lay aphorism that can help you remember what drives statistical significance: 'slow burn, love at first sight or enduring love?' (Frankly, the term 'enduring passion' fits better.) Say you've been visiting a city and the views through repeated visits across all seasons have been consistently pleasant. You'd move there after all these visits and a declaration to yourself that it's statistically significantly more beautiful than the average city, or the one you're moving from, but you probably wouldn't move there after just one or two visits. That is statistical significance derived from a slow burn. Now, suppose you visit a city so beautiful to you that you wish heaven were modeled after it; you'd move there after just one visit. That, of course, is statistical significance from love at first sight. (I would not recommend this. Seasons change, municipal budgets to clean up the streets change, etc.) Of course, statistical significance can be derived from 'enduring passion' too, and I think that is the statistical significance that most people solving business problems would like to see - a huge sample and a meaningfully large 'effect size' driving the statistical significance.