Demystifying p<.05: A Balanced Approach to Significance Testing (or Avoiding it Altogether)
John Neuhoff
Associate Director of UX Research | AI Strategist | PhD Psychology | Retired
Feeling shamed for not adhering to a p<.05 statistical significance rule in your UX research? Don’t.
The p<.05 standard is a benchmark set by Ronald Fisher in the 1920s, before modern computers were available. The p-value is the probability of observing results at least as extreme as yours if there were truly no effect (the null hypothesis). For example, if five users prefer Design A and four prefer Design B, would you be confident that the larger population prefers Design A? Of course not. Why? Because if you reran the test with new users, you could easily get four who prefer A and five who prefer B. There’s no evidence that your designs differ in preference, because results like these are entirely consistent with chance. In Fisher’s terms, the p-value is “greater than 5%.”
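The nine-user example above can be checked with an exact binomial test. This is a minimal sketch using only Python's standard library; the null hypothesis is that preference is a 50/50 coin flip.

```python
# Exact two-sided binomial p-value for 5 of 9 users preferring Design A,
# under the null hypothesis of no real preference (p = 0.5).
from math import comb

n, k = 9, 5      # 9 users, 5 prefer Design A
expected = n * 0.5

# Probability of each possible count under the null (binomial distribution)
pmf = [comb(n, i) * 0.5**n for i in range(n + 1)]

# Two-sided p-value: total probability of outcomes at least as far from
# the expected count (4.5) as the observed one
observed_dev = abs(k - expected)
p_value = sum(p for i, p in enumerate(pmf) if abs(i - expected) >= observed_dev)

print(round(p_value, 3))  # → 1.0: the data are entirely consistent with chance
```

With 9 users, a 5-to-4 split is the least extreme outcome possible, so every outcome is "at least as extreme" and the p-value is 1.0, which is exactly the intuition in the example above.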
So, how did we end up with the .05 standard? Early statisticians thought it was a reasonable cutoff, and scientific journals picked it up and made it gospel. Besides, calculating the exact probability of your results by hand could have taken months! So statisticians published tables of “critical values” that you could compare your test statistic against, to see whether it fell over or under the threshold corresponding to 5%. For its time, it was a useful convention.
Old traditions die hard. Even though we can now calculate exact p-values in a flash, many people (and journals) still cling religiously to the old notion of p<.05.
But think about it. What if your p-value is .06? Fisher would say your results are not statistically significant. But if you’re in business, is there an appreciable difference in your decision-making when the false-positive risk is 6% versus 5%? What about 9%?
The answer, of course, is “It depends.” What are the costs of a false positive? A p<.20 threshold might be a reasonable standard if the costs are relatively small. If the stakes are life and death, p<.05 seems woefully inadequate. Would you take an experimental treatment if there were “only” a 5% chance it would kill you?
Statistical significance testing also often ignores the importance of “effect size.” Let’s say you have a very large sample, and your new design is preferred over the old one with a statistical significance level of p<.01. Great, right? Fisher would be proud. Now let’s say the mean preference on a 1-10 scale for the new design is 7.6, and the mean preference for the old design is 7.5. It’s a reliable, statistically significant difference that would almost certainly replicate time and time again. But is it worth implementing given the associated costs? No, because although the difference is statistically significant, the effect size is too small to matter in practice.
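The effect-size trap above is easy to demonstrate. This sketch uses the article's means (7.6 vs. 7.5); the standard deviation (1.5) and per-group sample size (10,000) are illustrative assumptions, chosen to show how a huge sample makes a trivial difference "highly significant."

```python
# Two-sample z-test plus Cohen's d, standard library only.
from math import sqrt, erf

mean_new, mean_old = 7.6, 7.5
sd, n = 1.5, 10_000          # assumed common SD and per-group sample size

# z-test for the difference in means
se = sqrt(sd**2 / n + sd**2 / n)                   # standard error of the difference
z = (mean_new - mean_old) / se
p_value = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))   # two-sided p-value

# Cohen's d: the difference in units of standard deviation
cohens_d = (mean_new - mean_old) / sd

print(f"z = {z:.2f}, p = {p_value:.6f}")   # p far below .01
print(f"Cohen's d = {cohens_d:.3f}")       # ~0.07: a negligible effect
```

The test comes back wildly significant, yet Cohen's d is about 0.07, well under the conventional 0.2 cutoff for even a "small" effect. Significance tells you the difference is real; effect size tells you whether anyone should care.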
Is there a better way? Enter Bayesian analysis. Bayesian methods shift the focus from rigid, binary "significant or not" decisions to probabilistic reasoning. Think of it as a nuanced conversation with your data. Instead of asking, "Is this result statistically significant at the p<.05 level?" Bayesian analysis prompts a more relevant question: "Given the data and our prior knowledge, what is the probability that one design is genuinely better than the other?" This approach is particularly advantageous when dealing with complex or uncertain scenarios common in UX research. It allows for incorporating prior knowledge and expertise into the analysis, yielding contextually richer insights and often more directly applicable to business decisions.
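Here is what that Bayesian question looks like in practice, as a minimal stdlib-only sketch. The counts (Design A preferred by 30 of 50 users, Design B by 38 of 50) and the uniform Beta(1, 1) prior are illustrative assumptions, not data from the article.

```python
# Beta-Binomial model: with a Beta(1, 1) prior, the posterior for each
# design's preference rate is Beta(1 + successes, 1 + failures).
# Monte Carlo draws from the two posteriors estimate P(B's rate > A's rate).
import random

random.seed(42)

a_succ, a_fail = 30, 20   # Design A: 30 of 50 users preferred it (assumed)
b_succ, b_fail = 38, 12   # Design B: 38 of 50 users preferred it (assumed)

draws = 100_000
wins = sum(
    random.betavariate(1 + b_succ, 1 + b_fail)
    > random.betavariate(1 + a_succ, 1 + a_fail)
    for _ in range(draws)
)
prob_b_better = wins / draws

# Instead of a binary significant/not verdict, we answer the business
# question directly: how likely is it that B is genuinely preferred?
print(f"P(B better than A) ≈ {prob_b_better:.2f}")
```

The output is a probability statement stakeholders can act on ("B is very likely better"), and the prior gives you a principled place to encode previous research or domain expertise.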
Let's be clear: advocating for a more nuanced approach than the p<.05 standard is not a call to abandon hypothesis evaluation; far from it. Statistical analysis remains a cornerstone of robust UX research. But it's time to rethink our adherence to the p<.05 dogma in UX research and embrace a more flexible, nuanced approach.
It's crucial to consider the real-world implications of our findings, the magnitude of effect sizes, and the consequences and practicality of decision-making thresholds. With their probabilistic and contextual richness, Bayesian methods offer a compelling alternative. So, let's break free from the shackles of p<.05 and step into a more informed and adaptable era of data analysis, where the true goal is insightful, actionable conclusions, not just statistical victories.
#UXResearchInsights #BeyondP05 #StatisticalSignificance #BayesianAnalysisUX #DataDrivenDesign
#RethinkStatistics