Hypothesis test, significance test, and confidence intervals
[Figure: qualification of evidence based on p-values, extracted from Pocock et al (2015). https://doi.org/10.1016/j.jacc.2015.10.014]


While working on our textbook about 15 years ago, I looked into the history of concepts such as the confidence interval and the hypothesis and significance test, and had many discussions with Leo Held about the topic. My impression is that in the current debate many people voice their views without reference to this history and without a proper understanding of where these concepts actually come from. To quote Brad Efron:

Those Who Ignore Statistics Are Condemned to Reinvent it.

So, let’s get started:

Fisher "invented" p-values to quantify the evidence against a null hypothesis of no effect. For him, they were a tool independent of a given test and sample size which allowed combination (meta-analysis anyone?) of evidence from multiple different experiments to finally arrive at a final conclusion about a scientific question based on the totality of evidence. He called this concept SIGNIFICANCE test.

Neyman and Pearson proposed a framework for making a decision between a precisely specified null and alternative hypothesis. Whenever you make an explicit decision you run the risk of that decision being wrong, and Neyman and Pearson wanted to quantify that risk. This framework is referred to as a HYPOTHESIS test. If you reject the null, you call the result "statistically significant". If you remember how you learned to make the decision in a hypothesis test – comparing the test statistic to the quantile of its distribution under the null – you will notice that no p-value is needed at all.
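As a minimal illustration (simulated data, not from any real study), this is what such a decision looks like when the test statistic is simply compared with the corresponding quantile of its null distribution; no p-value is computed anywhere:

```python
# Minimal sketch: one-sample t-test decision via the critical value only.
# The data are simulated for illustration; alpha is pre-specified.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.3, scale=1.0, size=50)   # hypothetical sample
alpha = 0.05                                  # pre-specified type I error

t_stat = np.sqrt(len(x)) * x.mean() / x.std(ddof=1)      # test statistic
critical = stats.t.ppf(1 - alpha / 2, df=len(x) - 1)     # null quantile

reject_h0 = abs(t_stat) > critical            # the decision, no p-value needed
print(f"t = {t_stat:.2f}, critical value = {critical:.2f}, reject H0: {reject_h0}")
```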

Finally, with a confidence interval we assess which true population values are compatible with the data we have observed.
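A minimal sketch of that reading, using the same simulated data as above: the 95% confidence interval is the range of true means compatible with what was observed.

```python
# Minimal sketch: 95% confidence interval for a mean, read as the set of
# true values compatible with the observed (simulated) data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.3, scale=1.0, size=50)

se = x.std(ddof=1) / np.sqrt(len(x))                  # standard error of the mean
t_crit = stats.t.ppf(0.975, df=len(x) - 1)
ci = (x.mean() - t_crit * se, x.mean() + t_crit * se)
print(f"95% CI for the mean: ({ci[0]:.2f}, {ci[1]:.2f})")
```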

It should become clear that these three concepts were initially meant to address very distinct goals. So, where does the confusion come from?

  • You can make a decision in a hypothesis test using either a p-value (the infamous p <= 0.05) or the confidence interval (is the null value outside the CI?); see the sketch after this list. However, that was not at all the initial intention of either concept.
  • Of course, it does not help that we call the result of a HYPOTHESIS test "statistically significant" when Fisher called his p-value concept a SIGNIFICANCE test.
  • A lot of what we call "science" is incentivized to make claims, and it has become easy to separate "findings" from "non-findings" by simply comparing some p-value to the arbitrary threshold of 0.05, although that mixes the concepts of hypothesis and significance test and completely ignores the issue of multiplicity induced by the former.
  • p-values are often used to brand "findings" as "statistically significant". This mixes the concepts of significance and hypothesis test, and it was never intended by either concept's "inventor". I therefore consider it odd, to say the least, to "ditch" p-values because they are not used properly. Rather, (1) statisticians need to use these concepts properly and (2) educate science as a whole about how they should be used and what their respective properties are. I know, not easy, but still.
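To make the first bullet above concrete, here is a minimal sketch (same simulated setting as above) showing that the p-value route, the critical-value route, and the confidence-interval route all lead to the same decision about H0: mu = 0.

```python
# Minimal sketch: the three decision routes for H0: mu = 0 at alpha = 0.05
# (p-value, critical value, confidence interval) give the same answer.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.3, scale=1.0, size=50)
alpha = 0.05

res = stats.ttest_1samp(x, popmean=0)                   # statistic and p-value
crit = stats.t.ppf(1 - alpha / 2, df=len(x) - 1)
se = x.std(ddof=1) / np.sqrt(len(x))
ci = (x.mean() - crit * se, x.mean() + crit * se)

reject_via_p  = res.pvalue <= alpha                     # p <= 0.05
reject_via_t  = abs(res.statistic) > crit               # statistic beyond the null quantile
reject_via_ci = not (ci[0] <= 0 <= ci[1])               # null value outside the CI
print(reject_via_p, reject_via_t, reject_via_ci)        # identical decisions
```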

So, when to use what? In my opinion, a hypothesis test only makes sense (if at all) in the context of pharmaceutical drug development. There, you only start a Phase 3 trial based on sufficient prior synthesized evidence, and you very carefully select the primary endpoint. Furthermore, relevant stakeholders, primarily regulators, have a genuine interest in controlling false-decision probabilities, notably the type I error. Hypothesis tests have proven useful in providing a pre-specifiable framework that leads to reasonably sized trials.

However, in science at large, this is not how we generate evidence. Rather, to quote Sterne and Smith (2001):

In many cases published medical literature requires no firm decision: it contributes incrementally to an existing body of knowledge.

Furthermore, Blume and Peipert (2003):

The reporting of scientific results is not about making decisions, but about collecting, summarizing, and reevaluating evidence.

So what do I recommend?

  • Unless we are really talking about a pre-specified hypothesis test with interest in one single pre-specified hypothesis, we should not use the term "statistically significant". Specifically, when doing exploratory analyses and looking at many p-values, we should not assign this label to those below a certain threshold. A potential qualification of evidence based on p-values (see the picture at the top) is given in the paper by Pocock et al (2015).
  • Use "statistically significant" for a clearly defined hypothesis test that actually entails a decision, like in pharmaceutical drug development (acknowledging that a significant hypothesis test for a primary endpoint does not mean "approval", but is at least an entry ticket for negotiation with regulators).
  • The decision about "significance" in a hypothesis test is a binary one: a hypothesis test is either significant or it is not. Labels like "highly significant" or, even worse, something like "trend to significance" are bogus and should not be used. Or is a p-value of 0.04 a "trend towards non-significance"?

That’s a wrap! I leave the discussion of “relevance” vs. “significance” and how Bayes plays into all that for another day.

Jingtao Wu

Head of Biostatistics and Data Management, Carmot Therapeutics

1 day ago

Go Bayesian!

Ramzi Mrad

Healthcare Entrepreneurship | ex-Roche & Novartis | INSEAD MBA

4 days ago

Jettison the p-value and report the Confidence Interval instead.

Erik Bloomquist

Senior Principal Scientist; ASA BIOP Section Chair 2025

5 days ago

I supposedly well shuffle my deck and hand out cards. You have to decide whether I shuffled. 3 of a kind or better = 2.87% (some evidence maybe you didn't shuffle well). Straight or better = 0.76% (good evidence you didn't shuffle well). Full house or better = 0.17% (clearly you didn't shuffle). Not an exact 1-1 correspondence but perhaps a reasonable analogy. Notice that a pair or better is 50%.
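These hand probabilities can be checked by exact enumeration over all C(52, 5) = 2,598,960 five-card hands; a minimal sketch:

```python
# Minimal sketch: verify the quoted poker-hand probabilities by exact
# enumeration of all C(52, 5) = 2,598,960 five-card hands (takes a few minutes).
from itertools import combinations
from collections import Counter

RANKS = range(2, 15)                      # 2..14, with 14 = ace
DECK = [(r, s) for r in RANKS for s in "CDHS"]

def category(hand):
    """Ordinal hand category; higher means rarer."""
    ranks = sorted(r for r, _ in hand)
    counts = sorted(Counter(ranks).values(), reverse=True)
    flush = len({s for _, s in hand}) == 1
    straight = (len(set(ranks)) == 5 and
                (ranks[4] - ranks[0] == 4 or ranks == [2, 3, 4, 5, 14]))
    if straight and flush:  return 8      # straight flush (incl. royal)
    if counts[0] == 4:      return 7      # four of a kind
    if counts == [3, 2]:    return 6      # full house
    if flush:               return 5      # flush
    if straight:            return 4      # straight
    if counts[0] == 3:      return 3      # three of a kind
    if counts == [2, 2, 1]: return 2      # two pair
    if counts[0] == 2:      return 1      # one pair
    return 0                              # high card

tally = Counter(category(h) for h in combinations(DECK, 5))
total = sum(tally.values())

for label, cutoff in [("pair or better", 1), ("three of a kind or better", 3),
                      ("straight or better", 4), ("full house or better", 6)]:
    p = sum(v for c, v in tally.items() if c >= cutoff) / total
    print(f"{label}: {p:.4f}")
# -> approximately 0.4988, 0.0287, 0.0076, 0.0017
```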

Christos Nakas

PhD Professor, University of Thessaly | Inselspital/University of Bern | Biostatistics/Biometry/Data Science

6 days ago

Thanks, mostly agree! Suggesting a good read here: https://brnw.ch/21wPUI0 P-values have played a central role in the advancement of research in virtually all scientific fields; however, there has been significant controversy over their use. “The ASA president’s task force statement on statistical significance and replicability” has provided a solid basis for resolving the quarrel, but although the significance part is clearly dealt with, the replicability part raises further discussions. Given the clear statement regarding significance, in this article, authors consider the validity of p-value use for statistical inference as de facto. They briefly review the bibliography regarding the relevant controversy in recent years and illustrate how already proposed approaches, or slight adaptations thereof, can be readily implemented to address both significance and reproducibility, adding credibility to empirical study findings. The definitions used for the notions of replicability and reproducibility are also clearly described. Authors argue that any p-value must be reported along with its corresponding s-value followed by (1 − α)% confidence intervals and the rejection replication index.

David Manteigas

Principal Biostatistician at ICON

1 week ago

I tend to agree with this opinion, although I don't see much value in reporting p-values outside of prespecified hypotheses and decision-making. A confidence interval is always the best option in all other cases.
