It's harder than you think
From outwards to inwards
In a previous blog post, Being wrong about efficacy probabilities, I pointed out that one had to be careful about the inferences one could make from clinical trials and in particular the 'population' to which they applied. The safest conclusions were those obtained when the population was restricted to the patients in the trial and the trial was regarded as a random realisation of many that might have been run in that population, that is to say one of many keeping the same patients but simply changing the allocation to experimental and control treatments.
If we wanted to make inferences more widely, that is to say about patients not in the trial, we had to appreciate that this would depend on strong assumptions, for example that the patients studied were somehow, despite merely constituting a convenience sample, representative of some target population of interest. An alternative assumption might be that the subjects, although not representative, could nevertheless provide information that was transferable to a population of interest given a suitable scale of measurement. An extreme example of this is bioequivalence studies, usually carried out in healthy volunteers. Here it is assumed that although absolute bioavailability will be very different in the subjects in the trial than in patients who might be given the generic or brand name drugs being studied, the relative bioavailability of the two would be very similar. See Invisible Statistics for a discussion.
In this blog I propose to look inwards. What can we say about the effects of treatment for the individual patients in the trial rather than for the patients on average? As I shall demonstrate, this turns out to be surprisingly difficult.
What's the difference?
Figure one shows some results from a clinical trial in asthma, MTA/02, of the bronchodilator formoterol. The design and results are discussed in detail in an article I co-authored many years ago (1). Here I shall only give brief details.
Figure 1. First period values (FEV1 readings 12 hours after treatments) for two of the seven treatments compared in MTA/02.
Seven treatments, three doses of each of two formulations and a placebo, were compared for their effect on lung function as measured by forced expiratory volume in one second (FEV1), measured 12 hours after treatment. (High values of FEV1 are good.) The design was that of an incomplete blocks cross-over: each patient was treated in five periods only and therefore received only five of the seven treatments. Twenty-one sequences of five of the seven treatments were used, each patient being randomised to one of the 21 sequences and the sequences being chosen in such a way that each of the possible 7 × 6/2 = 21 pairs of treatments was equally represented and each treatment appeared equally often in each period.
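As a trivial check of the combinatorial claim, here is a minimal sketch in Python. The treatment labels are generic placeholders of my own, since only ISF 24 and placebo are named in this post.

```python
from itertools import combinations

# Generic placeholder labels: the seven MTA/02 treatments are not all named here.
treatments = ["T1", "T2", "T3", "T4", "T5", "T6", "Placebo"]

# All unordered pairs of treatments; the design ensures each such pair
# appears equally often across the 21 sequences.
pairs = list(combinations(treatments, 2))
print(len(pairs))   # 7 * 6 / 2 = 21
```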
The data in figure 1 show the results for only two of the treatments and only for the first period. Of course, they only represent a fraction of the information in the trial and should not be used to come to any firm conclusions. However, if these were the only data we had, they would have the structure of a conventional parallel group trial.
Two simple statistical tests may be used in the now (nearly) universally despised assessment of significance. We can compare the two groups using a t-test or a Mann-Whitney-Wilcoxon (MWW) test. The results are shown in Figure 2.
Figure 2. Extract of a statistical analysis of some data from MTA/02
Whether the parametric t-test or the non-parametric MWW test is used, the result is significant at the 5% level, the P-values being 0.023 and 0.035 respectively. If we accept this as indicating a genuine difference between treatments, then, since high values of FEV1 are good, it seems that on average ISF 24 is superior to placebo.
But what is the difference between the two treatments? The parametric procedure estimates an effect of 0.53 L with a 95% confidence interval of 0.08 L to 0.98 L. The non-parametric analysis quotes the median difference. Note that this is not the difference of medians: it is the median of all 25 × 21 = 525 possible differences that can be formed by pairing each of the 25 values under treatment with each of the 21 values under placebo. This is sometimes referred to as the Hodges-Lehmann estimate (HLE); here it is equal to 0.39 L, with a 95% confidence interval for the 'true' effect of 0.02 L to 1.00 L.
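For readers who want to reproduce this style of analysis, here is a minimal sketch using scipy. The FEV1 vectors are hypothetical placeholders of my own, since the raw MTA/02 readings are not reproduced in this post; substituting the real data would recover the figures quoted above.

```python
import numpy as np
from scipy import stats

def compare_groups(treated, control):
    """t-test, Mann-Whitney-Wilcoxon test and Hodges-Lehmann estimate
    (the median of all pairwise treated-minus-control differences)."""
    t_stat, t_p = stats.ttest_ind(treated, control)
    u_stat, u_p = stats.mannwhitneyu(treated, control, alternative="two-sided")
    diffs = np.subtract.outer(treated, control)   # 25 x 21 = 525 differences
    hle = np.median(diffs)
    return {"t_p": t_p, "mww_p": u_p, "hle": hle}

# Hypothetical FEV1 values in litres (NOT the real MTA/02 data).
rng = np.random.default_rng(2024)
isf24 = rng.normal(2.4, 0.75, 25)     # 25 patients randomised to ISF 24
placebo = rng.normal(1.9, 0.75, 21)   # 21 patients randomised to placebo
print(compare_groups(isf24, placebo))
```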
Beware the overlap trap
However, whether we use the t-test or the MWW test, what we are addressing is some sort of average effect, estimated by the difference in means in the first case and the HLE in the second. Can we say something about effects for the individual patient?
Closely related to the MWW test is the idea of an overlap probability. If X and Y stand for the two random variables being compared, then this is 'the probability that random variable X is lower than random variable Y' (2) (p3755). (In a practical application we could take X to be a measurement in the control group and Y a measurement in the experimental group.) This can be estimated directly from the 25 × 21 = 525 pairwise differences used to calculate the HLE by noting how many are positive. If this is done, it is found that 357 of the pairwise differences are positive, 166 are negative and 2 are zero. One could argue about what to do with the two ties, but here I shall ignore them. Calculating 357/525 = 0.68, we then have an estimate of the probability that a randomly chosen patient given the treatment would have a higher FEV1 value than a randomly chosen patient given placebo. Alternatively, the probability of this not being the case is 1 − 0.68 = 0.32. The situation is illustrated in Figure 3.
Figure 3. Cumulative plot of the 525 pairwise differences against the normalised rank, indicating the HLE of 0.39 and the proportion of 0.32 for which differences were negative.
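The estimate itself is a one-liner. A sketch, usable with the same hypothetical vectors as above; ties are simply not counted as positive, matching the 357/525 treatment of the two zero differences in the text.

```python
import numpy as np

def overlap_probability(treated, control):
    """Proportion of all pairwise differences (treated minus control)
    that are positive: an estimate of P(treated value > control value)."""
    diffs = np.subtract.outer(np.asarray(treated), np.asarray(control))
    return np.mean(diffs > 0)   # 357/525 = 0.68 for the MTA/02 extract
```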
An analogous estimate can be calculated using the parametric approach. Here we use the formula (3)

$$p = \Phi\left(\frac{\bar{x}_T - \bar{x}_C}{\sqrt{2}\,s}\right), \qquad D = 1 - p,$$

where the numerator of the term in brackets is the difference between the two means, the denominator is the estimate of the standard deviation of the difference between two observations (hence the factor root two), s is the estimated standard deviation of the original observations (based on both groups) and $\Phi$ is the cumulative distribution function of the standard Normal (3). If we apply this formula here, we get a value of p = 0.69 and therefore D = 0.31. The former figure is the probability that a randomly chosen patient from the treatment group will have a superior value to a randomly chosen patient from the control group; it agrees closely with the value of 0.68 calculated previously.
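In code, the parametric version looks like this: a sketch under the Normal model just described, with s estimated as the pooled standard deviation of the two groups.

```python
import numpy as np
from scipy.stats import norm

def parametric_overlap(treated, control):
    """p = Phi((mean_T - mean_C) / (sqrt(2) * s)); D = 1 - p."""
    t, c = np.asarray(treated), np.asarray(control)
    # Pooled variance across the two groups.
    pooled_var = (((len(t) - 1) * t.var(ddof=1) + (len(c) - 1) * c.var(ddof=1))
                  / (len(t) + len(c) - 2))
    p = norm.cdf((t.mean() - c.mean()) / (np.sqrt(2) * np.sqrt(pooled_var)))
    return p, 1.0 - p   # (p, D): 0.69 and 0.31 for the MTA/02 extract
```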
So what use are such probabilities? As an article (3) in The American Statistician puts it "Simply put, the D-value is the proportion of patients who got worse after the treatment" (p37). This is simply put but it is simply wrong, as I shall now demonstrate.
Cause for concern
Let us consider two extreme cases. In the first case, the patients are essentially identical, and the random differences we see in outcome values for patients in a given group simply reflect the fact that results vary from occasion to occasion. In the second case, the values are permanent: on whichever occasion we choose to measure them, the measurements will be the same.
Now, in the absence of any other knowledge, if case one applies, our best guess for a patient's value next time we measure them (but do not change the treatment) will not be the value for them we have just observed but the mean for the group to which they were assigned. Similarly, our best guess for what the value would be were we to switch the treatment is the mean value for the other group. That being so, we simply estimate the same benefit of treatment for every patient: the difference between the two means. The question of a probability of response is meaningless.
Now, suppose that case two applies and we consider a patient who has a value at centile 24 for the intervention group. Figure 4 shows the plot of the FEV1 values for all patients and it can be seen that this patient has a value of 1.87 L. Now, in calculating the 525 pairwise differences we used to estimate the overlap probability we calculated 21 for this particular subject (the subject was compared to each of 21 placebo patients) and clearly most of them will have been negative rather than positive. However, a little thought should show that given what we have assumed about the values, they are also irrelevant.
Figure 4. Cumulative distributions for the two treatments studied in MTA/02.
Given that (by assumption) we regard the value for this patient as being stable, why should we compare the result for this patient to every single one under placebo, a very unstable collection? Would it not make more sense to compare the patient to the corresponding patient (at the corresponding centile) in the placebo group? This has been done in Figure 4, where it can be seen that the value for the corresponding placebo patient is 1.32 L. Furthermore, if we compare the two distributions, we can see that the result for every single patient in the treatment group is better than the result for the corresponding patient in the control group. Taking this perspective, we could argue that the treatment works for everyone.
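The corresponding-centile comparison is easy to make explicit. A sketch: because the groups are of unequal size (25 versus 21), both empirical distributions are evaluated on a common grid of centiles; under the stable-values story, every difference being positive corresponds to every treated patient beating their placebo counterpart, as in Figure 4.

```python
import numpy as np

def centile_differences(treated, control, n_centiles=19):
    """Difference between the two empirical distributions at matched
    centiles, i.e. comparing each treated patient to the control
    patient at the corresponding point of the distribution."""
    q = np.linspace(0.05, 0.95, n_centiles)
    return np.quantile(treated, q) - np.quantile(control, q)

# If all entries are positive, the 'treatment works for everyone'
# reading of Figure 4 applies.
```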
Of course, we don't know that the stories I have given are true but we don't know that they aren't. In any case, it is clearly simply untrue that the overlap probability tells us how many patients benefitted.
Can we do better?
In parallel group trials it is very difficult to address the issues of individual response. The logic of such designs is that it takes dozens, sometimes hundreds and even thousands of patients to tell whether the treatment works at all. Replication is of the essence in judging causality but replication is not provided at the level of the individual patient.
However, MTA/02 was not a parallel group trial but an incomplete blocks cross-over. Some of the patients received both ISF 24 and placebo. What happens if we look at these patients? Figure 5 gives a scatter plot of the ISF 24 values plotted against the placebo values. If a point lies above and to the left of the diagonal line of equality, the ISF 24 value was higher; if it lies below and to the right, the placebo value was higher.
Figure 5. Scatter plot of FEV1 values for those patients in MTA/02 who received both ISF 24 and placebo.
An analysis of these results is given in Figure 6. This analyses the pairwise differences, referred to as basic estimators in my book on cross-over trials (4). Out of 63 such values, 51 are positive, suggesting a response probability of 0.81, rather higher than the value of 0.68 we previously estimated.
Figure 6. Analysis of the differences in FEV1 (ISF 24 minus placebo) for those patients in trial MTA/02 who took both treatments.
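Computationally, the within-patient analysis is simpler still. A sketch, again with placeholder inputs: the basic estimators are just the paired ISF 24 minus placebo differences for the patients who received both treatments.

```python
import numpy as np

def response_probability(isf24_paired, placebo_paired):
    """Proportion of positive basic estimators (within-patient
    ISF 24 minus placebo differences)."""
    diffs = np.asarray(isf24_paired) - np.asarray(placebo_paired)
    return np.mean(diffs > 0)   # 51/63 = 0.81 in MTA/02
```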
Of course, these are (mainly) a different set of patients from those examined previously. Nevertheless, a reason we carry out cross-over trials is that we think that values will vary from patient to patient even if given the same treatment, and if that is so, it is generally to be expected that the apparent response probability will be higher when measured this way.
Are we done?
Not quite. My personal opinion is that the value of 0.81 will be an underestimate of the proportion of patients benefitting from the treatment. It would apply if all the variation in the basic estimators were treatment-by-patient interaction. However, if the values that patients show when given the same treatment vary from occasion to occasion, it is quite possible that the means of the basic estimators by patient (averaged over trials), were it possible to repeat the trial, would have an even larger proportion of positive values. In fact, such replicate cross-over designs, where they can be run, are a superior approach to studying individual response (5).
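A small simulation makes the point, under assumptions entirely of my own choosing (patient-specific effects that are nearly all positive, plus occasion-to-occasion noise): averaging basic estimators over replicates strips out within-patient variation and pushes the proportion positive towards the true figure.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 63
true_effect = rng.normal(0.5, 0.15, n)   # patient-specific benefits (illustrative)
within_sd = 0.5                          # occasion-to-occasion SD of one reading

# Single cross-over: one basic estimator per patient; its noise is the
# difference of two readings, hence SD = within_sd * sqrt(2).
single = true_effect + rng.normal(0, within_sd * np.sqrt(2), n)

# Replicate cross-over: k basic estimators per patient, averaged,
# shrinking the within-patient noise by a factor of sqrt(k).
k = 4
replicate = true_effect + rng.normal(0, within_sd * np.sqrt(2 / k), n)

print("single cross-over: ", np.mean(single > 0))     # apparent proportion
print("replicate, k = 4:  ", np.mean(replicate > 0))  # closer to the truth
print("true proportion:   ", np.mean(true_effect > 0))
```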
Summing up
I have previously pointed out that Numbers Needed to Treat cannot be used to assess the proportion of responders in clinical trials. Overlap probabilities can be added to the list of techniques that don't work. What is necessary is careful examination of components of variation and clever approaches to design and analysis (5). Next time you meet a personalised medicine enthusiast (they are not hard to find) talking about the proportion of patients who do or don't respond, ask them how they know.
Further reading
A classic paper on overlap probabilities is that of David Hand (6). They are also treated by me in a Festschrift for Gary Koch (7).
References
1. Senn SJ, Lillienthal J, Patalano F, Till MD. An incomplete blocks cross-over in asthma: a case study in collaboration. In: Vollmar J, Hothorn LA, eds. Cross-over Clinical Trials. Stuttgart: Fischer; 1997:3-26.
2. Perme MP, Manevski D. Confidence intervals for the Mann-Whitney test. Statistical Methods in Medical Research. 2019;28(12):3755-3768.
3. Demidenko E. The p-value you can’t buy. The American Statistician. 2016;70(1):33-38.
4. Senn SJ. Cross-over Trials in Clinical Research. Second ed. Chichester: Wiley; 2002.
5. Senn SJ. Mastering variation: variance components and personalised medicine. Statistics in Medicine. 2016;35(7):966-977.
6. Hand DJ. On comparing two treatments. The American Statistician. 1992;46(3):190-192.
7. Senn SJ. U is for Unease: Reasons to mistrust overlap measures in clinical trials. Statistics in Biopharmaceutical Research. 2011;3(2):302-309.