Significance Testing is Broken (and How to Fix it)

Making inferences from a sample of data is hard. We often use significance testing to see how well we can generalize from our sample to the population of interest.

However, when used inappropriately, p-values can be dangerous.

Consider the following:

  1. If you have a large sample and get a significant result, it might be an effect so small that it doesn’t matter.
  2. Regardless of sample size, one in twenty significance tests of effects that don’t actually exist will come back significant purely by chance (at the standard .05 threshold). You’ll get significant results when absolutely nothing is happening. This is just how probability works.
  3. If you are working with small samples, there may be real effects in the data that you lack the statistical power to detect, while the false-positive rate is unaffected. You may end up finding all the results that aren’t there (false positives) but none of the results that are there (true positives).

Confused? Don’t worry, let’s break this down. I promise that you will feel significantly less confused by the end of this article.

This article will provide a conceptual understanding of significance testing. If you work with research but don’t have a background in this area, I hope this helps you get more value from the data. If you work in data analysis, this will likely be nothing new, but will hopefully help you consolidate your knowledge.

What p-values actually mean

I’ve worked with many smart people who’ve been taught the textbook definition of what p-values actually refer to:

“The p-value is the probability of obtaining data at least as extreme as the observed data, assuming that the null hypothesis is true.”

The problem with this definition is that, although it’s entirely correct, I honestly have no idea what it means. I had more luck reading Kant during my undergraduate studies. Something this fundamental to research should be taught in plain English, in a way that helps people actually get it.

Please allow me to have a go.

Imagine you wake up and your whole house is covered in smoke, the smell of burning wood is everywhere and the air is almost unbearably hot. The p-value is the likelihood of all of these things happening, assuming your house is not on fire.

Given that this is quite unlikely, we reject the null hypothesis of no fire and you hopefully survive the ordeal.

The p-value is the likelihood of your data assuming that there is no effect. Let’s not confuse that with the likelihood of finding this result due to chance (which people often say, but which I can categorically state is wrong).

Let’s take a real-life scenario and make it a bit more practical.

Say, for instance, we have a customer base of 10 million people with a mean satisfaction score of 7.5 and a standard deviation of 2.7. How do we know this? I don’t know. This is a hypothetical, so I’m taking liberties with the laws of physics. Given that the mean is 7.5, if you draw a sample, its mean will probably be pretty close to 7.5. But you might get unlucky and draw a sample composed of unusually satisfied or unusually unsatisfied customers, even if your sampling method is perfectly unbiased.

What makes you more or less likely to get a score that is wildly different to 7.5? Sample size.

If you decided you had to cut your research budget and just asked ten people, you’d be quite likely to get an average score quite different from 7.5. But how different? Let’s play a quick thought experiment. Imagine you decide that you’ll take 10,000 samples of ten people. That’s lots of samples but each one has only ten people. If you look below, you can see a histogram with the average satisfaction scores from each of these 10,000 samples.

Most of the time you’d be somewhere in the ballpark of 6.5 to 8.5, but these samples seem to vary a lot. Keep in mind, the scores below are average scores for each sample, not individual customer scores. For example, look on the left and you’ll see that once in 10,000 samples you’d have a mean of 4.27. Imagine how the board would react to a customer satisfaction average of 4.27? And what about a score of 10?

Keep in mind that the average score for the 10 million customers is still 7.5. It’s just that sometimes we draw samples of more or less satisfied customers and get different estimates. This is all pure chance and the only way to improve this is to increase the sample size.
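If you want to try this yourself, here’s a rough sketch of the simulation in Python. The normal shape of the hypothetical population is a simplifying assumption on my part; only the mean of 7.5 and the standard deviation of 2.7 come from the example.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# A hypothetical population of 10 million satisfaction scores with mean 7.5
# and standard deviation 2.7. The normal shape is just a convenient assumption;
# what matters for this exercise is the mean and the spread.
population = rng.normal(loc=7.5, scale=2.7, size=10_000_000)

# 10,000 samples of 10 customers each, then the mean of every sample.
samples = rng.choice(population, size=(10_000, 10))
sample_means = samples.mean(axis=1)

plt.hist(sample_means, bins=50)
plt.title("Histogram of sample means, n = 10")
plt.xlabel("Average satisfaction score per sample")
plt.ylabel("Number of samples")
plt.show()
```

Changing the 10 to 100 or 1,000 reproduces the other two histograms discussed below.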

So how about we do just that? What if we did the same thing with samples of 100 people each time?

Once again, we see a very similar shape. But just look at the scale. The lowest average score is 6.4 and the highest is 8.6. Most of the samples have averages between 7 and 8 so we are generally landing in the ballpark.

But what if we supersized our sample and moved up to 1,000 people in each sample?

Now, we almost always land between 7.3 and 7.7. It’s like a game of darts where the first player (our samples of 10) is blind drunk, the second (samples of 100) has never played before, and the last (samples of 1,000) is a pro who hits pretty close to the bullseye every time.
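What drives this is the standard error of the mean: the spread of the sample means shrinks with the square root of the sample size. With our population standard deviation of 2.7:

```python
import numpy as np

sigma = 2.7  # population standard deviation from the hypothetical example

for n in (10, 100, 1000):
    se = sigma / np.sqrt(n)
    print(f"n = {n:4d}: standard error of the mean = {se:.3f}")

# n =   10: standard error of the mean = 0.854
# n =  100: standard error of the mean = 0.270
# n = 1000: standard error of the mean = 0.085
```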

Wait, but what about p-values?

The answer is sitting right in front of you. The p-value is about probability. It’s the probability of getting a certain average score from a sample given a population average and the size of your sample. In this case, the population average is 7.5. If you found 1,000 people with an average score of 7.33 or lower or 7.67 or higher, we are talking about something that happens pretty rarely when the true mean score is actually 7.5. If you look below, you can see the two tails of the distribution, which show the fairly rare occurrences. Only one in twenty samples will have means outside the red lines, which is not very frequent.

The application of the p-value is simple. If we got a score of 7.5 last time but a score of 7.0 this time, it could be that this happened due to chance. But, if you look above, this is a very improbable outcome if the customer base has an average satisfaction score of 7.5 and both measurements used samples of 1,000. So we conclude that we have a bad assumption: our customer base has an average satisfaction score of something other than 7.5. But the p-value doesn’t actually tell us what the actual score is; we simply know which score it probably isn’t.
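To make those numbers concrete, here’s a quick back-of-the-envelope check using the normal approximation; the standard deviation of 2.7 and the sample size of 1,000 come straight from the example.

```python
import numpy as np
from scipy import stats

mu, sigma, n = 7.5, 2.7, 1000
se = sigma / np.sqrt(n)                      # standard error, roughly 0.085

# How unusual is a sample mean of 7.67 (or 7.33) if the true mean is 7.5?
z_cutoff = (7.67 - mu) / se                  # about 1.99, right at the two-tailed 5% boundary
p_cutoff = 2 * stats.norm.sf(abs(z_cutoff))  # about 0.047

# And a sample mean of 7.0?
z_low = (7.0 - mu) / se                      # about -5.9
p_low = 2 * stats.norm.sf(abs(z_low))        # about 0.000000005, vanishingly rare if 7.5 is the truth

print(p_cutoff, p_low)
```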

Confidence Intervals

The flip side of the p-value is the confidence interval. Rather than comparing our data to the null hypothesis, we build a distribution around the mean of our sample. Although the data below is simulated, in practice we can effectively build this distribution by resampling from our own sample and seeing what plausible values we would get if we ran the same study many times over. What we get is something very familiar. Keep in mind that the standard confidence interval formulas give you pretty much the same result; this is just a more useful way of thinking about what the interval means: what is a likely range of values for the population mean, given our sample data?

If you look above, you can see that drawing subsets from our actual sample will give us samples that, 95% of the time, have average scores between 6.86 and 7.14. This then gives us the concept of confidence. We can say we are 95% confident that the actual population mean is between these two values. Luckily, we have 1,000 people in our sample. If we only had 10, our confidence interval would be massive (between 5.84 and 9.15) and we’d almost be better off guessing.
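Here’s a minimal sketch of that resampling (bootstrap) idea, using a simulated stand-in for this quarter’s 1,000 responses; with real survey data you would resample your actual sample instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# A simulated stand-in for this quarter's survey: 1,000 responses with mean ~7.0.
sample = rng.normal(loc=7.0, scale=2.7, size=1000)

# Bootstrap: resample the sample (with replacement) many times, recording each mean.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])

# The middle 95% of those bootstrap means is a 95% confidence interval for the mean.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"95% CI: {lower:.2f} to {upper:.2f}")  # close to the 6.86-7.14 interval quoted above
```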

I should add an important caveat here, one that took me a long time to understand. When we say we are 95% confident, that does not mean that there is a 95% chance that the actual mean is within our confidence interval. This is a common criticism of traditional (frequentist) statistics.

The idea of the confidence interval is that, if you were to repeatedly draw random samples from the population and compute a 95% confidence interval from each one, 95% of those intervals would contain the true mean score.

But what does it mean for us regular folk? Much like p-values, confidence intervals can be off due to chance. If you compute 95% confidence intervals for 10,000 metrics, around 500 of those intervals will not contain the true population value, purely due to random error.

Those caveats aside, this shows that we can be quite confident that customer satisfaction has declined since last quarter, as the two confidence intervals don’t overlap; in fact, they miss each other by a fairly substantial margin. It looks like our customer base has become less satisfied, on average, than they were last quarter.

So, why is this so dangerous?

One academic, over ten years ago, argued that most research findings were likely false and demonstrated this mathematically. This is because many academic studies are underpowered and so we get very imprecise measures. Just like with our samples of 10, the mean for the population of interest has a wide range of plausible values. This means, if we look for differences between groups or try to get model coefficient estimates, we don’t have confidence that any differences exist.

But that doesn’t mean that the laws of probability don’t apply. One in twenty significance tests will still be significant as, 5% of the time, you find yourself in the tails of the distribution. You get a sample score that is unlikely, given the population mean, due to chance. Your confidence intervals will tell you much the same story.

You wake up in your house and it seems that there’s a fire, but your housemates are just cooking a Bombe Alaska. It’s a false alarm. Meanwhile, your smoke alarm is broken, so when there is a real fire, you don’t notice it at all. Underpowered research works the same way: the false alarms keep arriving at the usual rate, while the real effects go undetected.

Is this meaningful in practice?

Most researchers are under pressure to find insights in data. In market research, for example, people often test every hypothesis under the sun, and one in twenty of those tests will be a false positive. If you don’t have any real findings (or don’t have enough sample to find them), all the findings you do get will be false positives. Although p-hacking is often thought of as the deliberate misuse of significance testing, you can do it without realizing it. If you come up with fifty ideas that have no basis in truth, two or three will be supported by the data, despite being nonsense. I even tested this myself while writing this article, by creating uncorrelated variables and checking for significant differences in crosstabs and looking for correlations.
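A sketch of that kind of exercise, using purely random data, might look like this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# 50 completely independent "metrics" measured on 200 "respondents": pure noise.
n_respondents, n_metrics = 200, 50
data = rng.normal(size=(n_respondents, n_metrics))

n_tests = 0
false_positives = 0
for i in range(n_metrics):
    for j in range(i + 1, n_metrics):
        r, p = stats.pearsonr(data[:, i], data[:, j])
        n_tests += 1
        false_positives += p < 0.05

print(f"{false_positives} 'significant' correlations out of {n_tests} tests of pure noise")
# With 1,225 tests at p < .05, you'd expect roughly 60 of them to come back significant.
```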

It was not hard to find significant effects.

Furthermore, large sample sizes give you very narrow confidence intervals, which means that tiny effects become significant. If you have a sample of one million, the 95% confidence interval for the customer base is roughly 7.495 to 7.505. If average customer satisfaction increases from 7.50 to 7.51, do we really care? And would it be misleading to tell business stakeholders that customer satisfaction has significantly increased? Probably. The bottom line? If you work in big data, significance testing may often be not just unnecessary, but potentially dishonest.
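A quick sketch of that arithmetic, again using the normal approximation and the standard deviation of 2.7 from our example:

```python
import numpy as np
from scipy import stats

sigma, n = 2.7, 1_000_000
se = sigma / np.sqrt(n)                   # 0.0027

# Half-width of the 95% confidence interval with a sample of one million:
half_width = stats.norm.ppf(0.975) * se   # about 0.0053, so roughly 7.495 to 7.505 around 7.5

# A shift from 7.50 to 7.51 measured across two such samples:
z = 0.01 / (se * np.sqrt(2))              # about 2.6
p = 2 * stats.norm.sf(z)                  # about 0.009, "statistically significant" but trivial
print(half_width, z, p)
```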

What should we do?

There are no hard and fast rules; statistical thinking requires nuance, and I won't claim to be wise beyond my years. But here are some ideas that are generally advisable.

Confidence Intervals: Don’t just use p-values. Use confidence intervals so you get a range of plausible values, rather than a binary significant/not-significant verdict based on an arbitrary threshold.

Consider Probability: If you pull out a coin and flip five heads in a row, something rather improbable has happened (about a 1-in-32 chance). But if you flip sets of five coins 1,000 times a day, you’d expect to see five heads more than 11,000 times over the course of a year (1/32 × 1,000 × 365 ≈ 11,400). If you are running lots of different tests, you will get more false positives. So if you do a lot of tests, set a higher burden of proof (perhaps p < .01 or lower).

Design Good Research: A well-designed questionnaire or eDM campaign with a carefully considered sample size will make it easier for you to say, with some confidence, that an effect is present or absent. Trying to hack the data post-hoc will lead you down a dark and scary path. Trust me, every data analyst has been there and it is not pleasant. 

Consider Alternative Statistical Methodologies: The abuse of the t-test needs to stop. If you have a categorical variable with several groups and you are looking for differences between them, run an ANOVA and correct for multiple comparisons rather than running an uncorrected t-test for every pair. Also consider Bayesian statistics, which incorporate prior beliefs and, despite being more mathematically demanding, often provide more interpretable results.
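As a rough sketch of what that can look like in practice (the three customer segments and their scores here are entirely hypothetical):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(3)

# Hypothetical satisfaction scores for three customer segments.
scores = {
    "web":    rng.normal(7.5, 2.7, 300),
    "mobile": rng.normal(7.3, 2.7, 300),
    "retail": rng.normal(7.6, 2.7, 300),
}

# One overall test for any difference between segments...
f_stat, p_value = stats.f_oneway(*scores.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.3f}")

# ...followed by pairwise comparisons with Tukey's correction,
# rather than an uncorrected t-test for every pair.
values = np.concatenate(list(scores.values()))
groups = np.repeat(list(scores.keys()), [len(v) for v in scores.values()])
print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```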

Develop Statistical Thinking: Develop a deep knowledge of statistics and research design. The more these are embedded in your thinking, the better you will be able to use them to draw meaningful insights from data. I won’t lie, it’s a work in progress for me too.

So, next time you find some counter-intuitive insight in your data, resist the temptation to start telling a story with it. First consider how the result was found and how confident you are that it is not simply an artifact of probability deftly working its magic against you.

Comments

Carlos Karl Robles

Senior Data Scientist at Riot Games

7y

There's debate currently going on in academia regarding changing the acceptable p-value threshold from .05 to .005, relegating findings at the .05 level to be considered as "suggestive" rather than significant. There was a great podcast on p-values done last month that I think you'll enjoy, check it: https://partiallyderivative.com/podcast/2017/08/08/p-values

Ryan Johnson, PhD

Executive in Technology | Empathetic Leader | Data Scientist | Data Engineering | Machine Learning | Analytics and Business Intelligence | AI

7y

Very good. The "Design Good Research" solution is one that is sorely under-utilized. Improving overall statistical thinking is also an important step in truly solving this common issue. I think migrating towards language similar to "this value is the chance our results are a false positive" rather than just reporting a p-value is useful in guiding the sort of thinking that actually leads to reliable results. You don't go so far as to say it, but I'm starting to think not reporting p-values until other metrics have been shown and a preliminary decision has been made might be a good strategy.

Scott Edwards

Staff Data Engineer

7y

Jehan Gonsal, awesome article that summarizes in a very cogent and concise manner many of the misunderstood concepts in significance testing. However, I'm a bit confused about how you derived the distribution under the section "Confidence Intervals." That graph has the exact same title as the graph above ("Histogram of Sample Means with n=1000") but the distribution along the x-axis appears to be quite different (a center of 7.0 vs. 7.5 for the graph above). Would you mind going into a bit more detail on how you derived this distribution and how it is used to generate confidence intervals? You hinted at "resampling" so perhaps this uses a newer technique similar to the bootstrap that I'm less familiar with. Thanks! --Scott
