Hypothesis test, significance test, and confidence intervals
Kaspar Rufibach
Passionate about supporting statisticians to continuously challenge the status quo. | "The most meaningful way to succeed is to help others succeed." (Adam Grant) | "A man who casts no shadow has no soul." (Iron Maiden)
While working on our textbook about 15 years ago, I looked into the history of concepts such as the confidence interval and the hypothesis and significance test, and had many discussions with Leo Held about the topic. My impression is that in the current debate people voice their views without reference to that history and without a proper understanding of where these concepts actually come from. To quote Brad Efron:
Those Who Ignore Statistics Are Condemned to Reinvent it.
So, let’s get started:
Fisher "invented" p-values to quantify the evidence against a null hypothesis of no effect. For him, they were a tool independent of a given test and sample size which allowed combination (meta-analysis anyone?) of evidence from multiple different experiments to finally arrive at a final conclusion about a scientific question based on the totality of evidence. He called this concept SIGNIFICANCE test.
Neyman and Pearson proposed a framework for making a decision between a precisely specified null and alternative hypothesis. Whenever you make an explicit decision you run the risk of that decision being wrong, and N-P wanted to quantify that risk. This framework is referred to as a HYPOTHESIS test. If you reject the null, you call the result "statistically significant". If you remember how you learned to make the decision in a hypothesis test – comparing a test statistic to a quantile of its distribution under the null – no p-value is needed at all.
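To see that point in code, here is a minimal sketch of a one-sample z-test decided purely by comparing the test statistic with the null quantile (the data and the known standard deviation are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Invented example: H0: mu = 0 vs. H1: mu != 0, known standard deviation
sigma, n = 1.0, 50
x = rng.normal(loc=0.3, scale=sigma, size=n)

# Test statistic and the critical value (null quantile) at two-sided level 5%
z = np.sqrt(n) * x.mean() / sigma
critical_value = stats.norm.ppf(1 - 0.05 / 2)   # roughly 1.96

# Neyman-Pearson decision: compare the statistic with the quantile; no p-value involved
reject_h0 = abs(z) > critical_value
print(f"z = {z:.2f}, critical value = {critical_value:.2f}, reject H0: {reject_h0}")
```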
Finally, with a confidence interval we assess which true population values are compatible with the data we have observed.
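Staying with the same invented setup, a 95% confidence interval simply collects the population means that are compatible with the observed data:

```python
import numpy as np
from scipy import stats

# Invented summary data: sample mean, known standard deviation, sample size
x_bar, sigma, n = 0.34, 1.0, 50

# 95% confidence interval for the mean: x_bar +/- z_{0.975} * sigma / sqrt(n)
half_width = stats.norm.ppf(0.975) * sigma / np.sqrt(n)
print(f"95% CI for the mean: ({x_bar - half_width:.2f}, {x_bar + half_width:.2f})")
```

Any mean inside this interval would not be rejected by the level-5% test above, which is the well-known duality between confidence intervals and tests.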
It should become clear that these three concepts were initially meant to address very distinct goals. So, where does the confusion come from?
So, when to use what? In my opinion, a hypothesis test only makes sense (if at all) in the context of pharmaceutical drug development. There, you only start a Phase 3 trial based on sufficient prior, synthesized evidence, and you select the primary endpoint very carefully. Furthermore, the relevant stakeholders, primarily regulators, have a genuine interest in controlling false-decision probabilities, notably the type I error. Hypothesis tests have proven useful in providing a pre-specifiable framework that leads to reasonably sized trials.
However, in science at large, this is not how we generate evidence. Rather, to quote Sterne and Smith (2001):
In many cases published medical literature requires no firm decision: it contributes incrementally to an existing body of knowledge.
Furthermore, Blume and Peipert (2003):
The reporting of scientific results is not about making decisions, but about collecting, summarizing, and reevaluating evidence.
So what do I recommend?
That’s a wrap! I leave the discussion of “relevance” vs. “significance” and how Bayes plays into all that for another day.
Head of Biostatistics and Data Management, Carmot Therapeutics
1 day ago: Go Bayesian!
Healthcare Entrepreneurship | ex-Roche & Novartis | INSEAD MBA
4 days ago: Jettison the p-value and report the confidence interval instead.
Senior Principal Scientist; ASA BIOP Section Chair 2025
5 days ago: I supposedly shuffle my deck well and hand out cards. You have to decide whether I shuffled. 3 of a kind or better = 2.87% (some evidence maybe you didn't shuffle well). Straight or better = 0.76% (good evidence you didn't shuffle well). Full house or better = 0.17% (clearly you didn't shuffle). Not an exact 1-1 correspondence, but perhaps a reasonable analogy. Notice that a pair or better is 50%.
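For readers who want to check these figures, here is a small sketch using the standard combinatorial counts for 5-card poker hands (independent of the commenter's exact setup):

```python
from math import comb

total = comb(52, 5)  # number of possible 5-card hands

# Exact counts of 5-card poker hands from the standard combinatorial formulas
straight_flush  = 10 * 4
four_of_a_kind  = 13 * 12 * 4
full_house      = 13 * comb(4, 3) * 12 * comb(4, 2)
flush           = comb(13, 5) * 4 - straight_flush
straight        = 10 * 4**5 - straight_flush
three_of_a_kind = 13 * comb(4, 3) * comb(12, 2) * 4**2
two_pair        = comb(13, 2) * comb(4, 2)**2 * 44
one_pair        = 13 * comb(4, 2) * comb(12, 3) * 4**3

# Cumulative "or better" probabilities quoted in the comment
cuts = {
    "3 of a kind or better": three_of_a_kind + straight + flush + full_house
                             + four_of_a_kind + straight_flush,
    "straight or better":    straight + flush + full_house + four_of_a_kind + straight_flush,
    "full house or better":  full_house + four_of_a_kind + straight_flush,
}
cuts["pair or better"] = one_pair + two_pair + cuts["3 of a kind or better"]

for label, count in cuts.items():
    print(f"{label}: {100 * count / total:.2f}%")
```

This reproduces roughly 2.87%, 0.76%, 0.17%, and about 50%.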
PhD Professor, University of Thessaly / Inselspital, University of Bern | Biostatistics/Biometry/Data Science
6 days ago: Thanks, mostly agree! Suggesting a good read here: https://brnw.ch/21wPUI0 P-values have played a central role in the advancement of research in virtually all scientific fields; however, there has been significant controversy over their use. "The ASA president's task force statement on statistical significance and replicability" has provided a solid basis for resolving the quarrel, but while the significance part is clearly dealt with, the replicability part raises further discussion. Given the clear statement regarding significance, in this article the authors take the validity of p-value use for statistical inference as given. They briefly review the literature on the controversy of recent years and illustrate how already proposed approaches, or slight adaptations thereof, can readily be implemented to address both significance and reproducibility, adding credibility to empirical study findings. The definitions used for the notions of replicability and reproducibility are also clearly described. The authors argue that any p-value should be reported along with its corresponding s-value, followed by a (1−α)% confidence interval and the rejection replication index.
Principal Biostatistician at ICON
1 week ago: I tend to agree with this opinion, although I don't see much value in reporting p-values outside of pre-specified hypotheses and decision making. A confidence interval is always the best option for all other cases.