Replication crisis
The story so far
In a previous blog, Halls of Fame, I explained how the original data for a dietary experiment involving 20 halls with 100 students per hall were lost. Summary figures at the start of the academic year (baseline) and at the end (outcome) were available. Using these figures of means only, veteran curmudgeon Guernsey McPearson was able to produce an analysis as follows:
Then, miraculously, the original individual data per student were discovered. What happened next? I shall now reveal all.
Drowning in data
The Newtrition research team wasted no time in getting to work on the newly discovered treasure-trove of original data. A meeting with Guernsey McPearson (GMcP) was requested by a representative (Rep) of the team and went something like this.
Rep Great news, Guernsey. You know that "trend towards non-significance" you were talking about? That is now a whoppingly significant result and not the puny P=0.023 you found. In fact, when we used the original data, not only was P<0.001, it was completely off the scale.
GMcP I can imagine. How did you handle the "hall" effect?
Rep Well, we did puzzle about that. We put "hall" in as a factor but it failed to compute and gave us some strange messages, so we took it out again. Et voilà.
GMcP Ah yes. You had hall as a fixed effect. You should, of course, have had it as a random effect. Handling the initial weight as a covariate would, however, have been tricky. Basically you should just have stuck with the summary measures approach. Handled correctly, the individual student data would have given you the same answer.
Rep What? How can that be?
GMcP Give me the original data and I shall show you.
When more is just the same
GMcP was as good as his word. He returned with the following analysis* of the original 2000 values and not just the 20 means per hall.
GMcP You see, as I predicted, the result is just the same.
Rep But how can this be? You only analysed the 20 summary statistics before. Now you have analysed the 2000 values and you get the same answer? Looking at the table, the students don't seem to be doing anything useful when it comes to judging the diet effect.
GMcP The data are hierarchical. There are two variances, between-hall and within-hall and two covariances between and within. However, you varied the diet between halls...
Rep That was a practical necessity!
GMcP I don't dispute it. I think that you were right to do so. But Nature is unsympathetic and neither a respecter of motives nor of practical difficulties. Since diet varies between halls, the between-hall variance was relevant to judging the effect of diet and the between-hall covariance also. You have to watch out for the dangers of pseudoreplication.
Rep But does that mean it was pointless measuring 100 students per hall? Wouldn't 50 or 10 or even one have done just as well?
GMcP Oh no, not at all. The more students you measured, the greater the precision with which you measured things within each hall but this increase in precision was already reflected empirically in the means you calculated. I used these means in my previous analysis and in using them to estimate the variance I automatically, and without having to model it explicitly, took account not only of the between-hall variation but of the contribution that within-hall variation made to overall uncertainty. Going to the original data, however, I had to take care to model the various effects appropriately.
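GMcP's answer can be put as a small calculation. What follows is only a sketch in Python (not the Genstat used for the analyses), ignoring the covariate and assuming 10 halls per diet; the variance components are invented for illustration and are not estimates from the Newtrition data. The variance of a hall mean is the between-hall variance plus the within-hall variance divided by the number of students, so extra students help, but only up to the floor set by the halls themselves.

```python
import math

def se_diet_difference(var_between, var_within, students_per_hall, halls_per_diet):
    """SE of the difference between two diet means when diet varies between halls."""
    # Variance of one hall mean: hall-level variation plus averaged student noise.
    var_hall_mean = var_between + var_within / students_per_hall
    # Each diet mean averages halls_per_diet hall means; the difference doubles the variance.
    return math.sqrt(2.0 * var_hall_mean / halls_per_diet)

# Hypothetical values, chosen only to illustrate the shape of the relationship.
VAR_BETWEEN = 4.0     # between-hall variance (assumed)
VAR_WITHIN = 25.0     # within-hall, student-level variance (assumed)
HALLS_PER_DIET = 10   # assumes the 20 halls were split equally between diets

se_by_n = {
    n: se_diet_difference(VAR_BETWEEN, VAR_WITHIN, n, HALLS_PER_DIET)
    for n in (1, 10, 50, 100)
}
# The SE that would remain even with infinitely many students per hall.
se_floor = math.sqrt(2.0 * VAR_BETWEEN / HALLS_PER_DIET)

for n, se in se_by_n.items():
    print(f"{n:>3} students per hall: SE = {se:.3f}")
print(f"limit as students grow: SE = {se_floor:.3f}")
```

Under these assumed values, going from 1 to 100 students per hall shrinks the standard error considerably, but no number of students can push it below the floor that depends only on the number of halls.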
Rep Well this is all rather disappointing. Do you have any pearls of wisdom to impart?
GMcP Yes two. The first is this. If you know something important about your data but the software code you are using doesn't reflect this, something is almost certainly wrong. Experiments in which diets vary between halls are obviously very different to those in which they vary within. This requires that the data are analysed differently.
Rep And the second?
GMcP That's about correlation...
Rep Yes, I know. "Correlation is not causation." You statisticians are always banging on about that.
GMcP Yes. We are always being told that that is all we statisticians have to say about causation but the issue is more subtle than that. Correlation can be relevant to judging causation and here it had two effects that could easily be overlooked. First, the correlation between halls does not have to be the same as that within but the former is relevant to judging the effect of treatment whereas the latter is what a naive analysis may pick up. Second, random variation between halls induces a correlation: students in the same hall cannot be treated as being independent and this affects the calculation of the variance.
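The second point can be illustrated with a small simulation. This is a sketch in Python with made-up parameters (no baseline covariate, no true diet effect, 10 halls per diet assumed), not a reanalysis of the Newtrition data. It compares the standard error claimed by a naive two-sample comparison of all students with the one obtained from the hall means.

```python
import math
import random
import statistics

def simulate(halls_per_diet=10, students=100, sd_between=2.0, sd_within=5.0, seed=1):
    """Simulate outcomes with a shared random hall effect; all parameters are assumed."""
    rng = random.Random(seed)
    data = {0: [], 1: []}  # diet -> list of halls, each a list of student outcomes
    for diet in (0, 1):
        for _ in range(halls_per_diet):
            hall_effect = rng.gauss(0.0, sd_between)  # common to everyone in the hall
            data[diet].append(
                [hall_effect + rng.gauss(0.0, sd_within) for _ in range(students)]
            )
    return data

def se_two_sample(a, b):
    """Pooled-variance standard error of mean(a) - mean(b)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return math.sqrt(pooled * (1.0 / na + 1.0 / nb))

data = simulate()
# Naive view: 1000 "independent" students per diet.
students_by_diet = {d: [y for hall in data[d] for y in hall] for d in data}
# Hall-level view: 10 hall means per diet.
hall_means = {d: [statistics.mean(hall) for hall in data[d]] for d in data}

se_naive = se_two_sample(students_by_diet[0], students_by_diet[1])
se_halls = se_two_sample(hall_means[0], hall_means[1])
print(f"naive SE (students as replicates): {se_naive:.3f}")
print(f"SE from hall means:                {se_halls:.3f}")
```

The naive analysis treats 2000 correlated values as if they were independent and so reports a standard error that is too small, which is how a P-value ends up "completely off the scale".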
Joining the dots
Clearly this story is a farrago of utter nonsense, so what is the point? It has a connection to Lord's paradox and I invite the reader to join up the dots for themselves. A previous blog of mine treats this. Some references (1-6) are given below. (Note that, apart from the excellent analysis by Holland and Rubin and that by Senn, I do not agree 100% with these analyses.)
However, the lessons are far from being theoretical. For a genuine and famous experiment where similar issues arise, see Student's discussion of the Lanarkshire Milk Experiment (7).
Pseudoreplication (8) is relevant here and also to the analysis of Lord's paradox, although this has not always been appreciated.
* All analyses were performed with Genstat.
Appendix: code for Genstat analyses
Comments are in double quotes (""). The rest is code. All four analyses are equivalent.
"ANCOVA of summaries"
BLOCKSTRUCTURE Hall_S "Hall, 20 values"
TREATMENTSTRUCTURE Between_S "Diet, 20 values"
COVARIATE X_mean "Mean initial weight per hall, 20 values"
ANOVA[FPROBABILITY=Yes;PRINT=aovt, info, cova,effects] Yb_mean "Mean final weight per hall, 20 values"
"Analysis of original values"
BLOCKSTRUCTURE Hall/Student "100 students in each of 20 halls"
TREATMENTSTRUCTURE Between "diet given to each student, 2000 values"
COVARIATE X "initial weight of each student, 2000 values"
ANOVA[FPROBABILITY=Yes;PRINT=aovt, info, cova,effects] Yb "final weight of each student, 2000 values"
"Regression model using summaries"
MODEL Yb_mean
TERMS X_mean+Between_S
FIT [PRINT=model,summary,estimates; CONSTANT=estimate; FPROB=yes; TPROB=yes] X_mean+Between_S
"Equivalent mixed model"
"XMpH is the mean initial weight per hall but ascribed to each student and has 2000 values. XDiff is the difference between the student's initial weigh"
VCOMPONENTS [FIXED=XMpH,XDiff,Between; FACTORIAL=1] \
  RANDOM=Hall; INITIAL=1; CONSTRAINTS=none
REML [PRINT=model,components,waldTests,effects; \
  FMETHOD=automatic; \
  MVINCLUDE=*; METHOD=AI; \
  MAXCYCLE=30] Yb; SAVE=_remlsave
I am grateful to members of the Genstat discussion list for help with formulating the mixed model. A relevant paper is by Mike Kenward and James Roger (9).
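For readers without Genstat, here is a minimal Python sketch of why the summaries regression and the student-level model built on XMpH and XDiff give the same diet estimate. It uses simulated data with invented parameters and an assumed alternating allocation of diets to halls, and it fits plain least squares with a small hand-written solver rather than REML. It therefore shows only that the point estimates agree in a balanced design. The student-level least-squares standard errors would still be wrong, which is what the random Hall term in the mixed model is there to put right.

```python
import random

def ols(columns, y):
    """Least-squares coefficients of y on the given columns, intercept first."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    a = [[sum(r[j] * r[k] for r in rows) for k in range(p)]
         + [sum(r[j] * yi for r, yi in zip(rows, y))] for j in range(p)]
    for j in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(j, p), key=lambda r: abs(a[r][j]))
        a[j], a[pivot] = a[pivot], a[j]
        scale = a[j][j]
        a[j] = [v / scale for v in a[j]]
        for r in range(p):
            if r != j:
                factor = a[r][j]
                a[r] = [vr - factor * vj for vr, vj in zip(a[r], a[j])]
    return [a[j][p] for j in range(p)]

# Simulated data: every numeric value below is an assumption for illustration.
rng = random.Random(2)
HALLS, N = 20, 100
hall_of, x, y, between = [], [], [], []
for h in range(HALLS):
    diet = h % 2                   # alternate diets between halls (assumed)
    hall_x = rng.gauss(0.0, 3.0)   # hall-level shift in initial weight
    hall_e = rng.gauss(0.0, 2.0)   # random hall effect on final weight
    for _ in range(N):
        xi = 70.0 + hall_x + rng.gauss(0.0, 8.0)
        x.append(xi)
        y.append(5.0 + 0.9 * xi + 1.5 * diet + hall_e + rng.gauss(0.0, 4.0))
        between.append(float(diet))
        hall_of.append(h)

# Hall summaries (20 values each), mirroring X_mean, Yb_mean and Between_S.
x_mean = [sum(x[h * N:(h + 1) * N]) / N for h in range(HALLS)]
y_mean = [sum(y[h * N:(h + 1) * N]) / N for h in range(HALLS)]
between_s = [float(h % 2) for h in range(HALLS)]

# Student-level covariates (2000 values each), mirroring XMpH and XDiff.
xmph = [x_mean[h] for h in hall_of]
xdiff = [xi - m for xi, m in zip(x, xmph)]

b_summaries = ols([x_mean, between_s], y_mean)   # [constant, X_mean, diet]
b_students = ols([xmph, xdiff, between], y)      # [constant, XMpH, XDiff, diet]

print(f"diet effect from 20 hall means:   {b_summaries[2]:.6f}")
print(f"diet effect from 2000 students:   {b_students[3]:.6f}")
```

The agreement is no accident. XDiff sums to zero within every hall, so it is orthogonal to the constant, to XMpH and to diet, and with equal numbers per hall the remaining terms are estimated from the hall means alone.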
References
Lord's paradox
1. Holland PW, Rubin DB. On Lord's Paradox. In: Wainer H, Messick S, eds. Principals of Modern Psychological Measurement. Lawrence Erlbaum Associates; 1983:3-25.
2. Lord FM. A paradox in the interpretation of group comparisons. Psychological Bulletin. 1967;68:304-305.
3. Pearl J, Mackenzie D. The Book of Why. Basic Books; 2018.
4. Senn SJ. Change from baseline and analysis of covariance revisited. Statistics in Medicine. 2006;25(24):4334-4344.
5. Van Breukelen GJ. ANCOVA versus change from baseline had more power in randomized studies and more bias in nonrandomized studies. Journal of Clinical Epidemiology. 2006;59(9):920-925.
6. Wainer H, Brown LM. Two statistical paradoxes in the interpretation of group differences: Illustrated with medical school admission and licensing data. American Statistician. 2004;58(2):117-123.
Lanarkshire milk experiment
7. Student. The Lanarkshire milk experiment. Biometrika. 1931:398-406.
Pseudoreplication
8. Hurlbert SH. Pseudoreplication and the design of ecological field experiments. Ecological Monographs. 1984;54(2):187-211.
Mixed models
9. Kenward MG, Roger JH. The use of baseline covariates in crossover studies. Biostatistics. 2010;11(1):1-17. doi:10.1093/biostatistics/kxp046