Complicated COVID comparisons
A dram of data
A dram of data is worth a pint of pontification. My previous two blogs on the AZ/Oxford COVID vaccine results announced last month were rather speculative, since the press release had been rather obscure. Admittedly, the story that had to be told was more complicated than the stories for the Pfizer/BioNTech and Moderna vaccines, because of various matters (of which more anon), but even so I think that the information could have been given more clearly.
The Lancet paper by Merryn Voysey et al, published on 8 December 2020, clears up many matters and in this blog I shall comment on just three. The first is why, despite the fact that my first blog got the subject numbers wrong, my second blog, which used what was known about the subject numbers and the claimed efficacy to reverse-engineer the case numbers, got very close to the results now given in the paper. This is a matter of no practical importance (now that we have the figures, speculations about them are irrelevant) but it is relevant statistically, and that is what interests me. The second matter is what exactly one should conclude about the efficacy of the vaccine, given that there are two different dosing regimens. The third is what one should conclude given that this is an interim analysis.
However, before I cover these three matters, there is one further one to note with pleasure. The Lancet paper has 82 authors, or about six authors for every ten cases of COVID infection. That's not the pleasing thing. What pleases me is that the lead author, Dr Merryn Voysey, is a statistician. The trials presented in the paper have clearly involved huge amounts of work by many different researchers and, of course, the willing cooperation of very many subjects. However, analysing and interpreting the results will also have involved a great deal of work and it is nice to see a statistician leading the publication.
COVID Coverage
As The Lancet paper puts it, "This analysis includes data from four ongoing blinded, randomised, controlled trials done across the UK, Brazil, and South Africa". These studies are:
- COV001 A single blind phase 1/2 study in the UK
- COV002 A single blind phase 2/3 study in the UK
- COV003 A single blind phase 3 study in Brazil
- COV005 A double blind phase 1/2 study in South Africa.
(See pp 3 & 4 of the paper.)
However, as is made clear from Table 2 of the paper, the efficacy analysis is not based on all four studies but on studies COV002 and COV003 only. Two further points of clarification are appropriate here. First, the two studies did not use exactly the same type of control. That fact will be ignored in what follows. Second, for various reasons, subjects in the early part of COV002 who were allocated to the vaccine were given a low dose (LD) for the first injection and a standard dose (SD) for their second one. For the latter part of the study the corresponding subjects were allocated SD on both occasions, as were the subjects given vaccine in COV003. That fact will not be ignored in what follows and for the purpose of analysis I created a factor, stratum, with three levels:
- First group in COV002
- Second group in COV002
- COV003.
For treatment, I again had a factor with three levels:
- LD/SD
- SD/SD
- Control.
That analysis is represented in Figure 1 below, which shows both my naive analysis, based on a logistic regression fitting stratum and treatment, and The Lancet analysis, which is more complicated, involving as it does a robust Poisson regression in which time of follow-up is also included as an offset. My analysis is given in blue and The Lancet paper analysis in red. There is virtually no difference between them.
Note that because an interim look was involved, the Lancet paper considers some 'alpha' as having been 'spent' and uses a 95.8% confidence interval for the combined efficacy of both doses; however, conventional 95% limits are used for individual doses and I have adopted the same levels.
Figure 1. Three contrasts for the two AZ trials that are combined. Each contrast is with control: one for the low dose, one for the high dose and one for the two doses combined. My analysis is in blue and placed higher for a given pair; the AZ one is in red and placed lower.
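As a rough check on the point estimates in the figure, the three contrasts can be reproduced from the case counts given in Table 2 of the paper. A minimal sketch follows; the counts are quoted here from the published figures as I read them (LD/SD: 3 of 1367 vaccinated against 30 of 1374 controls; SD/SD: 27 of 4440 against 71 of 4455), so check them against the paper before relying on them, and note that this crude relative-risk calculation ignores the follow-up offset of the robust Poisson analysis.

```python
# Vaccine efficacy is one minus the relative risk of infection.
# Case counts as (cases, n) per arm, quoted from Table 2 of the
# Lancet paper (an assumption to be checked against the source).
arms = {
    "LD/SD": ((3, 1367), (30, 1374)),
    "SD/SD": ((27, 4440), (71, 4455)),
}
# The combined contrast simply adds the two regimens' counts.
arms["Combined"] = ((30, 5807), (101, 5829))

def efficacy(vaccine, control):
    (cv, nv), (cc, nc) = vaccine, control
    return 1 - (cv / nv) / (cc / nc)

for name, (v, c) in arms.items():
    print(f"{name}: {100 * efficacy(v, c):.1f}%")
```

The crude figures land within a whisker of the 90%, 62% and 70% quoted, which is another way of seeing why the naive and sophisticated analyses barely differ.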
I shall now consider this efficacy analysis.
These are my numbers and if you don't like them I have others
In my previous posts I based the number of subjects treated on what I had gleaned from the press release, which stated this:
One dosing regimen (n=2,741) showed vaccine efficacy of 90% when AZD1222 was given as a half dose, followed by a full dose at least one month apart, and another dosing regimen (n=8,895) showed 62% efficacy when given as two full doses at least one month apart. The combined analysis from both dosing regimens (n=11,636) resulted in an average efficacy of 70%. All results were statistically significant (p<=0.0001)
I took these numbers to refer to the number of patients given AZD1222. However, judging by the paper, they were not the numbers solely in the given dosing regimen but also included those in the accompanying control group. The net result is that I described the trials as having twice as many subjects in the various arms as there were. I had simply assumed that there were the same number of controls and proceeded with my calculation accordingly.
In fact, this made almost no difference to the analysis, for two reasons. The first is that the measure of vaccine efficacy is a relative reduction in risk. It is thus a simple matter to show (I shall spare you the details) that provided the numbers of subjects on the vaccine and control arms are equal, the estimate of vaccine efficacy depends only on the relative numbers of cases. The second reason is that, although the total number of subjects usually makes a very great difference to the reliability of a statistical estimate, for logistic regression, if the proportion of cases is low, the variance is dominated by the number of cases and the number of non-cases plays almost no role. Figure 2 shows the standard errors of the two contrasts as a ratio of the minimum they could be if the trial were infinitely large, keeping the number of cases constant but allowing the numbers on each arm to increase. It can be seen that this ratio hardly changes over the range of interest.
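The second point can be seen directly from the usual large-sample variance of a log odds ratio, the scale on which logistic regression works. A sketch, using case counts of roughly the size seen in these trials (the counts are illustrative assumptions, not a reproduction of Figure 2):

```python
from math import sqrt

# Large-sample variance of a log odds ratio with a cases out of n in
# one arm and b cases out of n in the other:
#   var = 1/a + 1/(n - a) + 1/b + 1/(n - b)
# With the case counts fixed, the non-case terms vanish as n grows,
# so when events are rare the standard error is dominated by cases.
def se_log_or(a, b, n):
    return sqrt(1 / a + 1 / (n - a) + 1 / b + 1 / (n - b))

a, b = 30, 101               # roughly the trials' combined case counts
floor = sqrt(1 / a + 1 / b)  # minimum SE, attained as n -> infinity

for n in (2000, 6000, 1_000_000):
    print(n, round(se_log_or(a, b, n) / floor, 4))
```

Doubling or trebling the arm sizes moves the standard error by a fraction of a percent, which is why getting the subject numbers wrong scarcely mattered.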
Figure 2. The standard error of two contrasts as a ratio of their asymptotic value shown as a function of the number of subjects in each arm.
An interesting issue remains, however: why is my simple analysis so similar to the more complex AZ/Oxford one? One reason, I think, is that the follow-up times are probably fairly evenly distributed. A further reason may be to do with the connection between a conditional Poisson analysis and a binomial one, such as in logistic regression. I would like to say a lot more about this but I don't have space in the margin of this blog and so will leave you to speculate on what my cunning answer is.
Is pooling fooling?
However, what about the combination of both doses? Is this legitimate? The statistical analysis plan for the studies discusses in great detail whether pooling the studies is legitimate (considering the extent to which matters such as measurements and populations might affect results) but, as far as I can see, does not really discuss the legitimacy of pooling dosing regimens. I can see three justifications.
First, suppose we adopt as the null hypothesis that the vaccine has an efficacy less than some given level (20% and 30% are illustrated in Figure 1) whatever the dose at which it is given. A valid test of this hypothesis is given by pooling both dosing regimens. However, if the null hypothesis is rejected, one may then conclude that the vaccine is efficacious at at least one dose but not necessarily at both. (See Overstating the evidence and Is pooling fooling?) However, as part of a so-called step-down procedure one could then test each dose for efficacy with no penalty for multiple testing.
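The logic of that step-down argument can be sketched in a few lines. The p-values in the example are invented purely for illustration:

```python
# Step-down (closed-testing) logic: test the pooled (intersection)
# null first at level alpha; only if it is rejected may each dose be
# tested, also at the full alpha, with the familywise type 1 error
# still controlled at alpha.
def step_down(p_pooled, p_per_dose, alpha=0.05):
    if p_pooled >= alpha:
        return []  # cannot declare efficacy at any dose
    return [dose for dose, p in p_per_dose.items() if p < alpha]

# Hypothetical p-values, purely for illustration
print(step_down(0.001, {"LD/SD": 0.004, "SD/SD": 0.030}))
```

The point is that the pooled test buys entry to the individual-dose tests; it does not by itself establish efficacy at both doses.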
Second, if one could believe in a monotonic increase in efficacy with increasing dose, one could then assume that by combining the two dosing regimens one would validly estimate a lower bound for the efficacy of the higher dose. Note that the fact that the lower dose posts higher efficacy does not invalidate this approach, provided that the assumption of monotonicity can be made. However, others knowledgeable in the subject of vaccine efficacy (in which I claim no expertise) have informed me that this assumption is not necessarily reasonable.
Third, one could accept the implied bias variance trade-off. That is to say that one could accept that in pooling the two results, one is estimating a mixed result that in an ideal world would not be valid for either regimen. However, that ideal world is one in which huge amounts of case data have been obtained for both regimens and their controls. The world we live in is not that world. Across these two studies there were only 131 cases in total. Therefore, the argument would go, the lesser of two evils will be to pool the results in order to get less variability in the answer.
Of these three arguments, the first is valid but only gets you so far. The second relies on an assumption that may not be valid. The third is a judgement call. However, what I can say is this. You cannot simultaneously talk up a possible vaccine efficacy of the LD/SD regimen of 90% and an overall efficacy of 70%.
On the other hand, perhaps you could argue as follows. Given that there appears to be a 'significant' difference in favour of LD/SD compared to SD/SD (I get P=0.038, if I may be permitted to use this despised measure), even if we cannot accept this as proof that the lower dose is better than the higher dose, can we not take it as evidence that it can hardly be appreciably worse? In that case can't we just use the lower dose and take the more modest value of 70% as our best guess of its efficacy, rather than the 90%? It is true that the confidence interval is quite wide, but the lower bound, depending how you look at it, will be either above 67% (taking the results for LD/SD alone) or above 54% for the two combined, and even 50% would be useful.
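For what it is worth, a crude Wald test on the difference of the two log relative risks, ignoring the stratum adjustment of my logistic fit, lands in the same neighbourhood as that P=0.038. The case counts are quoted from the paper as I read them, so treat them as an assumption:

```python
from math import log, sqrt, erf

# Crude comparison of the LD/SD and SD/SD regimens on the log
# relative risk scale; counts are (cases, n) per arm as quoted
# from the Lancet paper.
def log_rr_and_var(cv, nv, cc, nc):
    # log relative risk and its usual large-sample variance
    return log((cv / nv) / (cc / nc)), 1/cv - 1/nv + 1/cc - 1/nc

ld, v_ld = log_rr_and_var(3, 1367, 30, 1374)    # LD/SD vs control
sd, v_sd = log_rr_and_var(27, 4440, 71, 4455)   # SD/SD vs control

z = (ld - sd) / sqrt(v_ld + v_sd)
p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal p
print(round(p, 3))
```

This gives a two-sided P of roughly 0.039, close to the stratified figure, which suggests the stratification is doing little work here.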
This in turn raises issues as to the appropriate standard of evidence. The common regulatory standard is two trials each significant at the 5% level, not a combination of trials significant at the 5% level, but even with this standard the vaccine plausibly passes (albeit at different doses). See Automatic for the people and chapter 12 of Statistical Issues in Drug Development, from which the illustration at the head of this blog is taken, for a discussion of the two-trials rule.
The MHRA will have an interesting time with this.
Last but not least
The third of my discussion topics is the status of this report as regards interim analysis. The title of the paper is
Safety and efficacy of the ChAdOx1 nCoV-19 vaccine (AZD1222) against SARS-CoV-2: an interim analysis of four randomised controlled trials in Brazil, South Africa, and the UK
(My added emphasis.) The paper also mentions that an alpha spending function was being used and the statistical analysis plan states
Gamma Alpha-Spending function is used to control the overall Type 1 Error at 5%. The planned alpha level is 1.13% for interim analysis and 4.44% for primary analysis.
It also states (p5)
The global pooled analysis plan allowed for an interim and a final efficacy analysis with α adjusted between the two analyses using a flexible gamma α-spending function, with significance being declared if the lower bound of the (1 − α)% CI is greater than 20%.
However, it states in the paper (p5)
the alpha level calculated using the gamma alpha spending function for this analysis is 4.6%.
This seems to square neither with the plan nor with the confidence limit of 95.8%. As regards the former, I am quite prepared to believe that the spending function turned a planned rate of 1.13% into 4.6% if the information fraction at the look is large, but this has the curious property of making the actual value used at the interim look greater than the value planned for the final look. This makes me think that the fraction at interim must have been very large indeed, but this raises the question as to why the trials were not run to conclusion.
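For readers unfamiliar with spending functions, the sort of calculation involved can be sketched with the Hwang-Shih-DeCani family, which is what 'gamma alpha-spending' usually refers to. The γ parameter and information fractions below are invented for illustration; the paper does not report them:

```python
from math import exp

# Hwang-Shih-DeCani ("gamma") alpha-spending function: cumulative
# type 1 error spent by information fraction t (0 <= t <= 1).
# gamma and the t values below are illustrative assumptions.
def gamma_spend(t, gamma, alpha=0.05):
    if gamma == 0:
        return alpha * t  # the family's linear limiting case
    return alpha * (1 - exp(-gamma * t)) / (1 - exp(-gamma))

for t in (0.25, 0.5, 0.9, 1.0):
    print(t, round(gamma_spend(t, gamma=-2), 4))
```

With a negative γ the function spends little alpha at early looks, consistent with a modest planned interim level, but a look taken at a large information fraction spends nearly the whole 5%, which is the behaviour puzzled over above.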
As regards why a 95.8% interval was used when (1 − α)% would be 95.4%, please give me your answers on a postcard or in your comments to the blog. (No doubt there is something that I have overlooked.)
Important update
The explanation has turned out to be simple. It's a blunder on my part (hangs head in shame). The alpha level used is 4.16%, not 4.6%, which squares with the 95.8% confidence interval (strictly 95.84%). I am grateful to Merryn Voysey for very kindly and diplomatically pointing this out to me.
Finally
Or at least, finally ad interim, I shall make three points. First, what matters are not my back of the envelope calculations and my amateur musings but the analysis the project team have put together and the evaluation that the MHRA and others will make. Second, this is a much more complicated story than either Pfizer/BioNTech or Moderna (it seems) have to tell. Complicated stories are not necessarily less reliable than simple ones but they can be harder to check. Third, the paper states (p10)
The prespecified analysis population, which was determined following feedback from national and international regulators before unblinding of the study, included a pooled analysis from several countries to improve generalisability, and inclusion of two dose subgroups within the UK trial.
How do you unblind a single-blind study? By revealing to the patients what they got?
If the MHRA manage to do another heroic job with this one but the statisticians don't get a specific mention in the press briefing, I shall want to know why.
Declaration of Interest
I act as a consultant to the pharmaceutical industry. See Stephen Senn Declaration of Interest (senns.uk) for further details.
Comments

Distinguished Statistical Scientist at Analytix Thinking, LLC; Fellow of the American Statistical Association
Stephen, thank you for your thoughtful comments. I cannot help but think about this study and its results in the context of the emerging "estimand" discussion. Having not studied the Lancet article or the SAP in detail (yet), I have to ask: did the primary analysis state the use of intent-to-treat? Much of the discussion appears to be whether to analyse the data via ITT (i.e. pool the LD/SD and SD/SD strata, since that is a randomised group) versus analyse the data "as treated" (i.e. separate the LD/SD and SD/SD groups). The question is, "What is the question?" WHAT are we trying to estimate? In answering this question, we must consider what is best for the patient. I won't propose an answer here (maybe I will have to blog about this as well), but it would be more satisfying to me if the discussion started with understanding WHAT we are estimating and then pursued appropriate estimators, and ultimately estimates, from which decisions can be made about the appropriate question. Clearly the trial did NOT start out to answer the question of any difference between LD/SD and SD/SD, but the mistake must have been known prior to the analysis. What did the new SAP say about WHAT was to be estimated?

Finally, Devan's comment is helpful, but I am always suspicious of post hoc explanations (as we all should be). I am not experienced in anti-viral development, but if this notion of "priming the immune system" is valid and well understood, then (a) as you have pointed out, why wasn't this part of the study design/treatment regimen, and (b) do other/all multi-dose vaccines use an LD followed by SD dosing strategy? If not, why not? The Pfizer and Moderna vaccines do not take this approach and appear to achieve 90+% efficacy (though admittedly they are mRNA vaccines and may have fundamentally different modes of action regarding the immune system, but I am way out of my lane here). Thanks again! This will be an interesting challenge for regulators.
Member of the Academic Council
There you go: pre-registration as an antidote to selective inference? It is not. Stephen has done work that would dignify Sherlock Holmes. Fantastic read. Thank you.
Executive Director, Global Group Head Biostatistics - Oncology at Novartis
Stephen, keep publishing ... interesting stuff as always!
Professor at Aarhus University
Given that the LD/SD efficacy is 90% = 1 − (3/1367)/(30/1374), i.e. given only 3 LD/SD cases, are any model-based tests, comparisons or CIs of much value? I guess that the asymptotics have not kicked in in the robust age-adjusted Poisson regression. Also, it is somewhat nonsensical that LD/SD has higher efficacy than SD/SD. Too little information, perhaps? Thank you for a brilliant exposition of the Lancet paper.
Statistical Consultant
Yes. Red faces all round, I think. They seem to be making a pretty decent job of dealing with it, but the effective amount of information available is less than for the Pfizer/BioNTech or Moderna vaccines.