Hypothesis test, significance test, and confidence intervals
Kaspar Rufibach
Passionate about supporting statisticians to continuously challenge the status quo. | "The most meaningful way to succeed is to help others succeed." (Adam Grant) | "A man who casts no shadow has no soul." (Iron Maiden)
While working on our textbook about 15 years ago, I looked into the history of concepts such as the confidence interval and the hypothesis and significance test, and had many discussions with Leo Held about the topic. My impression is that in the current debate people voice their views without reference to that history and without a proper understanding of where these concepts actually come from. To quote Brad Efron:
Those Who Ignore Statistics Are Condemned to Reinvent it.
So, let’s get started:
Fisher "invented" p-values to quantify the evidence against a null hypothesis of no effect. For him, they were a tool independent of a given test and sample size which allowed combination (meta-analysis anyone?) of evidence from multiple different experiments to finally arrive at a final conclusion about a scientific question based on the totality of evidence. He called this concept SIGNIFICANCE test.
Neyman and Pearson proposed a framework for making a decision between a precisely specified null and alternative hypothesis. Whenever you make an explicit decision you run the risk of that decision being wrong, and N-P wanted to quantify that risk. This framework is referred to as a HYPOTHESIS test. If you reject the null, you call the result "statistically significant". If you remember how you learned to make the decision in a hypothesis test – comparing a test statistic to a quantile of its distribution under the null – no p-value is needed at all.
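To see that point in code, here is a minimal sketch of a one-sample z-test decided purely by comparing the test statistic with the null quantile (the data and the known standard deviation are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Invented example: H0: mu = 0 vs. H1: mu != 0, known standard deviation
sigma, n = 1.0, 50
x = rng.normal(loc=0.3, scale=sigma, size=n)

# Test statistic and the critical value (null quantile) at two-sided level 5%
z = np.sqrt(n) * x.mean() / sigma
critical_value = stats.norm.ppf(1 - 0.05 / 2)   # roughly 1.96

# Neyman-Pearson decision: compare the statistic with the quantile; no p-value involved
reject_h0 = abs(z) > critical_value
print(f"z = {z:.2f}, critical value = {critical_value:.2f}, reject H0: {reject_h0}")
```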
Finally, with a confidence interval we assess which true population values are compatible with the data we have observed.
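Staying with the same invented setup, a 95% confidence interval simply collects the population means that are compatible with the observed data:

```python
import numpy as np
from scipy import stats

# Invented summary data: sample mean, known standard deviation, sample size
x_bar, sigma, n = 0.34, 1.0, 50

# 95% confidence interval for the mean: x_bar +/- z_{0.975} * sigma / sqrt(n)
half_width = stats.norm.ppf(0.975) * sigma / np.sqrt(n)
print(f"95% CI for the mean: ({x_bar - half_width:.2f}, {x_bar + half_width:.2f})")
```

Any mean inside this interval would not be rejected by the level-5% test above, which is the well-known duality between confidence intervals and tests.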
It should become clear that these three concepts were initially meant to address very distinct goals. So, where does the confusion come from?
So, when to use what? In my opinion, a hypothesis test only makes sense (if at all) in the context of pharmaceutical drug development. There, you only start a Phase 3 trial based on sufficient prior, synthesized evidence, and you select the primary endpoint very carefully. Furthermore, the relevant stakeholders, primarily regulators, have a genuine interest in controlling false-decision probabilities, notably the type I error. Hypothesis tests have proven useful in providing a pre-specifiable framework that leads to reasonably sized trials.
However, in science at large, this is not how we generate evidence. Rather, to quote Sterne and Smith (2001):
In many cases published medical literature requires no firm decision: it contributes incrementally to an existing body of knowledge.
Furthermore, Blume and Peipert (2003):
The reporting of scientific results is not about making decisions, but about collecting, summarizing, and reevaluating evidence.
So what do I recommend?
That’s a wrap! I leave the discussion of “relevance” vs. “significance” and how Bayes plays into all that for another day.
Head of Biostatistics and Data Management, Carmot Therapeutics
1 day ago: Go Bayesian!
Healthcare Entrepreneurship | ex-Roche & Novartis | INSEAD MBA
4 days ago: Jettison the p-value and report the confidence interval instead.
Senior Principal Scientist; ASA BIOP Section Chair 2025
5 days ago: I supposedly shuffle my deck well and hand out cards. You have to decide whether I shuffled. 3 of a kind or better = 2.87% (some evidence maybe you didn't shuffle well). Straight or better = 0.76% (good evidence you didn't shuffle well). Full house or better = 0.17% (clearly you didn't shuffle). Not an exact 1-1 correspondence, but perhaps a reasonable analogy. Notice that a pair or better is 50%.
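For readers who want to check these figures, here is a small sketch using the standard combinatorial counts for 5-card poker hands (independent of the commenter's exact setup):

```python
from math import comb

total = comb(52, 5)  # number of possible 5-card hands

# Exact counts of 5-card poker hands from the standard combinatorial formulas
straight_flush  = 10 * 4
four_of_a_kind  = 13 * 12 * 4
full_house      = 13 * comb(4, 3) * 12 * comb(4, 2)
flush           = comb(13, 5) * 4 - straight_flush
straight        = 10 * 4**5 - straight_flush
three_of_a_kind = 13 * comb(4, 3) * comb(12, 2) * 4**2
two_pair        = comb(13, 2) * comb(4, 2)**2 * 44
one_pair        = 13 * comb(4, 2) * comb(12, 3) * 4**3

# Cumulative "or better" probabilities quoted in the comment
cuts = {
    "3 of a kind or better": three_of_a_kind + straight + flush + full_house
                             + four_of_a_kind + straight_flush,
    "straight or better":    straight + flush + full_house + four_of_a_kind + straight_flush,
    "full house or better":  full_house + four_of_a_kind + straight_flush,
}
cuts["pair or better"] = one_pair + two_pair + cuts["3 of a kind or better"]

for label, count in cuts.items():
    print(f"{label}: {100 * count / total:.2f}%")
```

This reproduces roughly 2.87%, 0.76%, 0.17%, and about 50%.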
PhD Professor, University of Thessaly / Inselspital, University of Bern | Biostatistics/Biometry/Data Science
6 days ago: Thanks, mostly agree! Suggesting a good read here: https://brnw.ch/21wPUI0 P-values have played a central role in the advancement of research in virtually all scientific fields; however, there has been significant controversy over their use. "The ASA president's task force statement on statistical significance and replicability" has provided a solid basis for resolving the quarrel, but while the significance part is clearly dealt with, the replicability part raises further discussion. Given the clear statement regarding significance, in this article the authors take the validity of p-value use for statistical inference as given. They briefly review the literature on the controversy of recent years and illustrate how already proposed approaches, or slight adaptations thereof, can readily be implemented to address both significance and reproducibility, adding credibility to empirical study findings. The definitions used for the notions of replicability and reproducibility are also clearly described. The authors argue that any p-value should be reported along with its corresponding s-value, followed by a (1−α)% confidence interval and the rejection replication index.
Principal Biostatistician at ICON
1 week ago: I tend to agree with this opinion, although I don't see much value in reporting p-values outside of pre-specified hypotheses and decision making. A confidence interval is always the best option for all other cases.