Replication crisis

Stephen Senn

Statistical Consultant

Published: 12 June 2022

The story so far

In a previous blog, Halls of Fame, I explained how the original data for a dietary experiment involving 20 halls with 100 students per hall were lost. Summary figures at the start of the academic year (baseline) and at the end (outcome) were available. Using these figures of means only, veteran curmudgeon Guernsey McPearson was able to produce an analysis as follows:

Then, miraculously, the original individual data per student were discovered. What happened next? I shall now reveal all.

Drowning in data

The Newtrition research team wasted no time in getting to work on the newly discovered treasure-trove of original data. A meeting with Guernsey McPearson (GMcP) was requested by a representative (Rep) of the team and went something like this.

Rep Great news, Guernsey. You know that "trend towards non-significance" you were talking about? That is now a whoppingly significant result and not the puny P=0.023 you found. In fact, when we used the original data, not only was P<0.001, it was completely off the scale.

GMcP I can imagine. How did you handle the "hall" effect?

Rep Well, we did puzzle about that. We put "hall" in as a factor but it failed to compute and gave us some strange messages, so we took it out again. Et voilà.

GMcP Ah yes. You had hall as a fixed effect. You should, of course, have had it as a random effect. Handling the initial weight as a covariate would, however, have been tricky. Basically you should just have stuck with the summary measures approach. Handled correctly, the individual student data would have given you the same answer.

Rep What? How can that be?

GMcP Give me the original data and I shall show you.

When more is just the same

GMcP was as good as his word. He returned with the following analysis* of the original 2000 values and not just the 20 means per hall.

GMcP You see, as I predicted, the result is just the same.

Rep But how can this be? You only analysed the 20 summary statistics before. Now you have analysed the 2000 values and you get the same answer? Looking at the table the students don't seem to be doing anything useful when it comes to judging the diet effect.

GMcP The data are hierarchical. There are two variances, between-hall and within-hall, and two covariances, between and within. However, you varied the diet between halls...

Rep That was a practical necessity!

GMcP I don't dispute it. I think that you were right to do so. But Nature is unsympathetic and neither a respecter of motives nor of practical difficulties. Since diet varies between halls, the between-hall variance was relevant to judging the effect of diet and the between-hall covariance also. You have to watch out for the dangers of pseudoreplication.

Rep But does that mean it was pointless measuring 100 students per hall? Wouldn't 50 or 10 or even one have done just as well?

GMcP Oh no, not at all. The more students you measured, the greater the precision with which you measured things within each hall but this increase in precision was already reflected empirically in the means you calculated. I used these means in my previous analysis and in using them to estimate the variance I automatically, and without having to model it explicitly, took account not only of the between-hall variation but of the contribution that within-hall variation made to overall uncertainty. Going to the original data, however, I had to take care to model the various effects appropriately.
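GMcP's point about the means can be checked with a small simulation. The sketch below is in Python rather than Genstat and is not part of the original analysis. It assumes a simple two-level model (a random hall effect plus independent student variation) with made-up standard deviations, and compares the empirical variance of hall means with the textbook value: between-hall variance plus within-hall variance divided by the number of students per hall.

```python
import random
import statistics


def simulate_hall_means(n_halls, n_students, sd_between, sd_within, rng):
    """Return one mean per hall under a simple two-level model."""
    means = []
    for _ in range(n_halls):
        hall_effect = rng.gauss(0.0, sd_between)  # shared by every student in the hall
        weights = [hall_effect + rng.gauss(0.0, sd_within) for _ in range(n_students)]
        means.append(statistics.fmean(weights))
    return means


def expected_variance_of_mean(n_students, sd_between, sd_within):
    """Theoretical variance of a hall mean: sigma_b^2 + sigma_w^2 / n."""
    return sd_between ** 2 + sd_within ** 2 / n_students


# Assumed, purely illustrative values; they are not taken from the blog's data.
SD_BETWEEN, SD_WITHIN = 2.0, 8.0
rng = random.Random(2022)

results = {}
for n in (1, 10, 100):
    # Many more halls than the blog's 20, so the empirical variance is stable.
    means = simulate_hall_means(2000, n, SD_BETWEEN, SD_WITHIN, rng)
    results[n] = (statistics.variance(means),
                  expected_variance_of_mean(n, SD_BETWEEN, SD_WITHIN))
    print(f"students per hall = {n:3d}: empirical = {results[n][0]:.2f}, "
          f"theory = {results[n][1]:.2f}")
```

Under these assumptions, measuring more students shrinks the within-hall contribution but leaves the between-hall part untouched, and a variance estimated from the hall means picks up both without any explicit modelling.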

Rep Well this is all rather disappointing. Do you have any pearls of wisdom to impart?

GMcP Yes, two. The first is this. If you know something important about your data but the software code you are using doesn't reflect this, something is almost certainly wrong. Experiments in which diets vary between halls are obviously very different to those in which they vary within. This requires that the data are analysed differently.

Rep And the second?

GMcP That's about correlation...

Rep Yes, I know. "Correlation is not causation." You statisticians are always banging on about that.

GMcP Yes. We are always being told that that is all we statisticians have to say about causation but the issue is more subtle than that. Correlation can be relevant to judging causation and here it had two effects that could easily be overlooked. First, the correlation between halls does not have to be the same as that within but the former is relevant to judging the effect of treatment whereas the latter is what a naive analysis may pick up. Second, random variation between halls induces a correlation: students in the same hall cannot be treated as being independent and this affects the calculation of the variance.
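The second point, that students in the same hall are not independent, can be illustrated with a rough simulation. This Python sketch is not the Genstat analysis used in the story. It assumes 10 halls per diet, no true diet effect, no baseline covariate and invented variance components. It compares a naive two-sample t-test on all 2000 students with the same test applied to the 20 hall means.

```python
import math
import random


def mean_and_variance(xs):
    """Sample mean and unbiased sample variance."""
    n = len(xs)
    m = sum(xs) / n
    return m, sum((x - m) ** 2 for x in xs) / (n - 1)


def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic."""
    (ma, va), (mb, vb) = mean_and_variance(a), mean_and_variance(b)
    na, nb = len(a), len(b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))


def simulate_trial(rng, n_halls=20, n_students=100, sd_between=2.0, sd_within=8.0):
    """One null trial: diet is allocated by hall and has no true effect."""
    students = {0: [], 1: []}
    hall_means = {0: [], 1: []}
    for hall in range(n_halls):
        diet = hall % 2  # alternate halls; hall effects are i.i.d., so this is harmless here
        hall_effect = rng.gauss(0.0, sd_between)
        ys = [hall_effect + rng.gauss(0.0, sd_within) for _ in range(n_students)]
        students[diet].extend(ys)
        hall_means[diet].append(sum(ys) / n_students)
    t_naive = two_sample_t(students[0], students[1])     # treats 2000 students as independent
    t_hall = two_sample_t(hall_means[0], hall_means[1])  # treats 20 halls as the units
    return t_naive, t_hall


def false_positive_rates(n_trials=200, seed=1):
    """Share of null trials declared 'significant' at the two-sided 5% level."""
    rng = random.Random(seed)
    naive_hits = hall_hits = 0
    for _ in range(n_trials):
        t_naive, t_hall = simulate_trial(rng)
        naive_hits += abs(t_naive) > 1.961  # approximate 5% point for 1998 df
        hall_hits += abs(t_hall) > 2.101    # 5% point for 18 df
    return naive_hits / n_trials, hall_hits / n_trials


naive_rate, hall_rate = false_positive_rates()
print(f"naive student-level test: {naive_rate:.2f}, hall-means test: {hall_rate:.2f}")
```

With these assumed values the student-level test rejects far more often than the nominal 5%, while the hall-means test stays close to it. That is the pseudoreplication problem in miniature.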

Joining the dots

Clearly this story is a farrago of utter nonsense so what is the point? It has a connection to Lord's paradox and I invite the reader to join up the dots for themselves. A previous blog of mine treats this. Some references (1-6) are given below. (Note that with the exception of the excellent analysis by Holland and Rubin, I do not agree 100% with these analyses except that by Senn.)

However, the lessons are far from being theoretical. For a genuine and famous experiment where similar issues arise, see Student's discussion of the Lanarkshire Milk Experiment (7).

Pseudoreplication (8) is relevant here and also to the analysis of Lord's paradox, although this has not always been appreciated.

* All analyses were performed with Genstat.

Appendix: code for Genstat analyses

Comments are in quotes "". The rest is code with procedure names underlined. All four analyses are equivalent.

"ANCOVA of summaries"

BLOCKSTRUCTURE Hall_S "Hall, 20 values"

TREATMENTSTRUCTURE Between_S "Diet, 20 values"

COVARIATE X_mean "Mean initial weight per hall, 20 values"

ANOVA[FPROBABILITY=Yes;PRINT=aovt, info, cova,effects] Yb_mean "Mean final weight per hall, 20 values"

"Analysis of original values"

BLOCKSTRUCTURE Hall/Student "100 students in each of 20 halls"

TREATMENTSTRUCTURE Between "diet given to each student, 2000 values"

COVARIATE X "initial weight of each student, 2000 values"

ANOVA[FPROBABILITY=Yes;PRINT=aovt, info, cova,effects] Yb "final weight of each student, 2000 values"

"Regression model using summaries"

MODEL Yb_mean

TERMS X_mean+Between_S

FIT [PRINT=model,summary,estimates; CONSTANT=estimate; FPROB=yes; TPROB=yes] X_mean+Between_S

"Equivalent mixed model"

"XMpH is the mean initial weight per hall but ascribed to each student and has 2000 values. XDiff is the difference between the student's initial weigh"

VCOMPONENTS [FIXED=XMpH,XDiff,Between; FACTORIAL=1] \
RANDOM=Hall; INITIAL=1; CONSTRAINTS=none

REML [PRINT=model,components,waldTests,effects; \
FMETHOD=automatic; \
MVINCLUDE=*; METHOD=AI; \
MAXCYCLE=30] Yb; SAVE=_remlsave
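For readers who want to build the two covariates used in the mixed model outside Genstat, here is a minimal Python sketch. It is an illustration only. The function name split_covariate and the toy data are hypothetical, and it assumes that XDiff is each student's initial weight minus the mean for that student's hall, which is how the comment above appears to define it.

```python
from collections import defaultdict


def split_covariate(halls, x):
    """Split a student-level covariate into a hall-mean part and a within-hall deviation.

    Returns (x_mph, x_diff), both with one value per student:
      x_mph  - the hall mean repeated for every student in the hall (cf. XMpH)
      x_diff - the student's value minus that hall mean (assumed meaning of XDiff)
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for hall, value in zip(halls, x):
        totals[hall] += value
        counts[hall] += 1
    hall_mean = {hall: totals[hall] / counts[hall] for hall in totals}
    x_mph = [hall_mean[hall] for hall in halls]
    x_diff = [value - mean for value, mean in zip(x, x_mph)]
    return x_mph, x_diff


# Toy data: two halls with three students each (not the blog's data).
halls = ["A", "A", "A", "B", "B", "B"]
x = [60.0, 70.0, 80.0, 55.0, 65.0, 75.0]

x_mph, x_diff = split_covariate(halls, x)
print(x_mph)   # [70.0, 70.0, 70.0, 65.0, 65.0, 65.0]
print(x_diff)  # [-10.0, 0.0, 10.0, -10.0, 0.0, 10.0]
```

Splitting the covariate this way lets the between-hall and within-hall slopes differ, which is the distinction GMcP draws between the two covariances.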

I am grateful to members of the Genstat discussion list for help with formulating the mixed model. A relevant paper is by Mike Kenward and James Roger (9).

References

Lord's paradox

1. Holland PW, Rubin DB. On Lord's Paradox. In: Wainer H, Messick S, eds. Principals of Modern Psychological Measurement. Lawrence Erlbaum Associates; 1983:3-25.

2. Lord FM. A paradox in the interpretation of group comparisons. Psychological Bulletin. 1967;66:304-305.

3. Pearl J, Mackenzie D. The Book of Why. Basic Books; 2018.

4. Senn SJ. Change from baseline and analysis of covariance revisited. Statistics in Medicine. 2006;25(24):4334-4344.

5. Van Breukelen GJ. ANCOVA versus change from baseline had more power in randomized studies and more bias in nonrandomized studies. Journal of Clinical Epidemiology. 2006;59(9):920-925.

6. Wainer H, Brown LM. Two statistical paradoxes in the interpretation of group differences: Illustrated with medical school admission and licensing data. American Statistician. 2004;58(2):117-123.

Lanarkshire milk experiment

7. Student. The Lanarkshire milk experiment. Biometrika. 1931:398-406.

Pseudoreplication

8. Hurlbert SH. Pseudoreplication and the design of ecological field experiments. Ecological Monographs. 1984;54(2):187-211.

Mixed models

9. Kenward MG, Roger JH. The use of baseline covariates in crossover studies. Biostatistics. Jan 2010;11(1):1-17. doi:10.1093/biostatistics/kxp046

Stephen Senn

Statistical Consultant

2 years ago

I have now added a link to the data in case anybody else wishes to analyse them.
