The Death of the P-value (or The Untenable Nature of Bayesian Inference)
In any quantitative field it is not enough to simply apply a set of mathematical operations. One must also provide an interpretation. Statistics concerns itself with a special branch of mathematics: probability. There are two primary competing paradigms for interpreting probability: frequentist and Bayesian. These paradigms differ on what it means for something to be considered random and on what probability itself measures. A common example used to compare frequentist error rates and p-values with Bayesian posterior probabilities is the analysis of a cancer screening test or, in the current climate, a COVID-19 screening test. Often the aim of such an example is to herald the death of the p-value (or at least discourage its use) by demonstrating how it ignores relevant information outside of the test that can only be captured using a Bayesian approach. Such an example may also aim to show that Bayesian inference is easier to interpret than a frequentist solution using p-values. In this context the example below provides a thorough comparison and interpretation of these probabilities, and highlights the untenable nature of Bayesian inference. This example is important because the interpretation carries over to all areas of statistical inference.
Operating Characteristics of a Cancer Screening Test
Consider the 3x3 table below depicting the operating characteristics of a cancer screening test with 0.85 specificity and 0.80 sensitivity. The parameter space is shown across the top of the table and the support of the sampling distribution (test result) is displayed along the left side of the table so that this table is read vertically. If a subject truly has No Cancer the screening test will produce a Negative result, an At Risk result, and a Positive result 85%, 10% and 5% of the time respectively. Likewise, if the subject indeed has Cancer the test will return a Negative result, an At Risk result, and a Positive result 5%, 15%, and 80% of the time respectively. These long-run probabilities can be verified within a margin of error through repeated testing.
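As a minimal sketch, the operating characteristics can be written out in Python. The No Cancer and Cancer columns are the values stated above. The Pre-Cancer column (0.40, 0.50, 0.10) is not given in this paragraph; it is an assumption inferred from the 60% power and the 10% p-value quoted elsewhere in the article. The names `OC`, `STATUSES`, `RESULTS`, and `format_table` are illustrative only.

```python
# Operating characteristics: P(test result | true cancer status).
STATUSES = ("No Cancer", "Pre-Cancer", "Cancer")   # parameter space (columns)
RESULTS = ("Negative", "At Risk", "Positive")      # support of the test (rows)

OC = {
    # Stated in the text.
    "No Cancer":  {"Negative": 0.85, "At Risk": 0.10, "Positive": 0.05},
    # ASSUMPTION: inferred from the 60% power and the 10% p-value in the article.
    "Pre-Cancer": {"Negative": 0.40, "At Risk": 0.50, "Positive": 0.10},
    # Stated in the text.
    "Cancer":     {"Negative": 0.05, "At Risk": 0.15, "Positive": 0.80},
}


def format_table(oc):
    """Return the table as text: columns are true status, rows are test results."""
    header = f"{'Result':<10}" + "".join(f"{s:>12}" for s in STATUSES)
    rows = [
        f"{r:<10}" + "".join(f"{oc[s][r]:>12.2f}" for s in STATUSES)
        for r in RESULTS
    ]
    return "\n".join([header] + rows)


# Each column is a sampling distribution, so it should sum to 1.
column_sums = {s: sum(OC[s].values()) for s in STATUSES}
specificity = OC["No Cancer"]["Negative"]   # P(Negative | No Cancer)
sensitivity = OC["Cancer"]["Positive"]      # P(Positive | Cancer)

print(format_table(OC))
print(column_sums)
```

Reading a column top to bottom gives the sampling distribution of the test for one true status, which is why the table is read vertically.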
The power of the test shows the ex-ante sampling probability of observing an At Risk or Positive result testing the hypothesis Ho: No Cancer as a function of the unknown true cancer status for the subject at hand. This shows the long-run probability of “rejecting” or “ruling out” the hypothesis Ho: No Cancer depending on the subject’s true cancer status. If the subject truly has No Cancer we would incorrectly “reject” or “rule out” Ho: No Cancer 15% of the time. If the subject truly has Pre-Cancer we would correctly “rule out” Ho: No Cancer 60% of the time. If the subject indeed has Cancer we would correctly “rule out” Ho: No Cancer 95% of the time. This long-run probability forms the level of confidence in the next observed test result for the subject.
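The power figures above can be reproduced with a short sketch. It assumes the Pre-Cancer column is (0.40, 0.50, 0.10), which is inferred from the article rather than stated outright, and it treats At Risk and Positive as the results that "rule out" Ho: No Cancer. The names `OC`, `REJECT_H0_NO_CANCER`, and `power` are illustrative.

```python
# P(test result | true status); the Pre-Cancer column is an inferred assumption.
OC = {
    "No Cancer":  {"Negative": 0.85, "At Risk": 0.10, "Positive": 0.05},
    "Pre-Cancer": {"Negative": 0.40, "At Risk": 0.50, "Positive": 0.10},
    "Cancer":     {"Negative": 0.05, "At Risk": 0.15, "Positive": 0.80},
}

# Results that lead us to "rule out" Ho: No Cancer.
REJECT_H0_NO_CANCER = ("At Risk", "Positive")


def power(true_status):
    """Ex-ante probability of rejecting Ho: No Cancer given the true status."""
    return sum(OC[true_status][r] for r in REJECT_H0_NO_CANCER)


# Power as a function of the unknown true cancer status.
power_curve = {status: round(power(status), 2) for status in OC}
print(power_curve)
```

Under these assumptions the sketch gives 0.15 for No Cancer (the type I error rate), 0.60 for Pre-Cancer, and 0.95 for Cancer, matching the figures in the paragraph above.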
Frequentist Inference
The p-value represents the plausibility of a hypothesis given the data: the ex-post sampling probability of the observed result or something more extreme if the hypothesis is true. The p-value function testing Ho: No Cancer, Ho: Pre-Cancer, and Ho: Cancer as a function of the hypothesis and the observed data is read horizontally. For an At Risk result, both the upper-tailed and lower-tailed p-values are displayed for Ho: Pre-Cancer. If an At Risk test result is produced for a given subject, the upper-tailed p-value testing the hypothesis that the subject at hand has No Cancer is the probability of an At Risk or more extreme (Positive) test result given the subject has No Cancer, 0.10 + 0.05 = 0.15. Likewise, for the same At Risk result the lower-tailed p-value testing the hypothesis that the subject at hand has Cancer is the probability of an At Risk or more extreme (Negative) test result given the subject has Cancer, 0.15 + 0.05 = 0.20.
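The tail sums above can be sketched in a few lines. The code assumes the results are ordered Negative < At Risk < Positive and that the Pre-Cancer column is (0.40, 0.50, 0.10), a value inferred from the article rather than stated in this paragraph. The function names `upper_p` and `lower_p` are illustrative.

```python
# Test results ordered from least to most indicative of cancer.
RESULTS = ("Negative", "At Risk", "Positive")

# P(test result | true status); the Pre-Cancer column is an inferred assumption.
OC = {
    "No Cancer":  {"Negative": 0.85, "At Risk": 0.10, "Positive": 0.05},
    "Pre-Cancer": {"Negative": 0.40, "At Risk": 0.50, "Positive": 0.10},
    "Cancer":     {"Negative": 0.05, "At Risk": 0.15, "Positive": 0.80},
}


def upper_p(hypothesis, observed):
    """P(result at least as high as observed | hypothesis)."""
    i = RESULTS.index(observed)
    return sum(OC[hypothesis][r] for r in RESULTS[i:])


def lower_p(hypothesis, observed):
    """P(result at least as low as observed | hypothesis)."""
    i = RESULTS.index(observed)
    return sum(OC[hypothesis][r] for r in RESULTS[: i + 1])


observed = "At Risk"
for h in OC:
    print(h, round(upper_p(h, observed), 2), round(lower_p(h, observed), 2))
```

For an At Risk result this reproduces the upper-tailed p-value of 0.15 for Ho: No Cancer and the lower-tailed p-value of 0.20 for Ho: Cancer.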
The confidence level is a function of the hypothesis and the observed data. This table is read horizontally and shows that if the test returns an At Risk result we can "rule out" Ho: No Cancer at the 15% level and Ho: Cancer at the 20% level and are therefore 65% confident in the alternative, which is Pre-Cancer. The 65% confidence level is nothing more than a restatement of the p-values testing Ho: No Cancer and Ho: Cancer, 100(1–0.15–0.20)%. Similarly, if the test returns a Positive result we can "rule out" Ho: Pre-Cancer (and by extension Ho: No Cancer) at the 10% level, and are therefore 90% confident in the alternative, which is Cancer. Either the subject has Pre-Cancer (or No Cancer) and we have witnessed a 10% (or smaller) event, or the subject indeed has Cancer.
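The two confidence levels quoted above follow directly from the p-values, as this sketch shows. It covers only the two cases worked in the text (an At Risk result and a Positive result) and again assumes the inferred Pre-Cancer column (0.40, 0.50, 0.10). All names are illustrative.

```python
# Test results ordered from least to most indicative of cancer.
RESULTS = ("Negative", "At Risk", "Positive")

# P(test result | true status); the Pre-Cancer column is an inferred assumption.
OC = {
    "No Cancer":  {"Negative": 0.85, "At Risk": 0.10, "Positive": 0.05},
    "Pre-Cancer": {"Negative": 0.40, "At Risk": 0.50, "Positive": 0.10},
    "Cancer":     {"Negative": 0.05, "At Risk": 0.15, "Positive": 0.80},
}


def upper_p(hypothesis, observed):
    """Upper-tailed p-value: P(observed or higher | hypothesis)."""
    return sum(OC[hypothesis][r] for r in RESULTS[RESULTS.index(observed):])


def lower_p(hypothesis, observed):
    """Lower-tailed p-value: P(observed or lower | hypothesis)."""
    return sum(OC[hypothesis][r] for r in RESULTS[: RESULTS.index(observed) + 1])


# At Risk: rule out No Cancer (upper tail) and Cancer (lower tail).
conf_pre_cancer = 1 - upper_p("No Cancer", "At Risk") - lower_p("Cancer", "At Risk")

# Positive: rule out Pre-Cancer (and by extension No Cancer) via the upper tail.
conf_cancer = 1 - upper_p("Pre-Cancer", "Positive")

print(f"Confidence in Pre-Cancer given At Risk: {conf_pre_cancer:.0%}")
print(f"Confidence in Cancer given Positive:    {conf_cancer:.0%}")
```

The confidence level carries no information beyond the p-values themselves; it is one minus the levels at which the competing hypotheses are ruled out.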
Bayesian Inference
If we have verifiable knowledge that a given subject was randomly selected from an irreducible population that has No Cancer, Pre-Cancer, and Cancer in a 4:2:1 ratio, then the Bayesian posterior depicts the long-run probability of cancer status among randomly selected subjects, given a particular test result. In this context these posterior probabilities are often referred to as negative predictive value, false omission rate, false discovery rate, and positive predictive value. This long-run probability can be used to make inference on the cancer status of the subject at hand by imagining the subject was instead randomly selected from the posterior distribution. This is a direct contradiction to the earlier claim that the subject at hand was randomly selected from the prior distribution. The posterior sampling frame is correct only if the prior sampling frame is correct, yet there can only be a single sampling frame from which we obtained the randomly selected subject at hand.
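For reference, the posterior under the 4:2:1 prior can be computed with Bayes' theorem as sketched below. This is only the arithmetic; it takes no side on the interpretation discussed above. The Pre-Cancer column (0.40, 0.50, 0.10) is an assumption inferred from the article, and the names `PRIOR_ODDS` and `posterior` are illustrative.

```python
# P(test result | true status); the Pre-Cancer column is an inferred assumption.
OC = {
    "No Cancer":  {"Negative": 0.85, "At Risk": 0.10, "Positive": 0.05},
    "Pre-Cancer": {"Negative": 0.40, "At Risk": 0.50, "Positive": 0.10},
    "Cancer":     {"Negative": 0.05, "At Risk": 0.15, "Positive": 0.80},
}

# Population ratio of No Cancer : Pre-Cancer : Cancer from the text.
PRIOR_ODDS = {"No Cancer": 4, "Pre-Cancer": 2, "Cancer": 1}


def posterior(result):
    """P(true status | test result) via Bayes' theorem with the 4:2:1 prior."""
    joint = {s: PRIOR_ODDS[s] * OC[s][result] for s in OC}
    total = sum(joint.values())  # normalizing constant (up to the prior's scale)
    return {s: joint[s] / total for s in joint}


for result in ("Negative", "At Risk", "Positive"):
    rounded = {s: round(p, 3) for s, p in posterior(result).items()}
    print(result, rounded)
```

Under these assumptions a Positive result gives a posterior probability of Cancer of 2/3, and a Negative result gives a posterior probability of No Cancer of 0.80. These are long-run proportions among randomly selected subjects only if the 4:2:1 sampling frame actually holds.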
This line of reasoning is all too common, yet the internal contradiction of claiming two dependent sampling frames is never discussed. Other times an example using Bayes' theorem is presented using a long-run frequency interpretation to promote the Bayesian paradigm, yet it has nothing to do with Bayesian statistics (there is no prior to posterior transformation). If we really do have verifiable knowledge about how a given subject was randomly selected, this information can be presented alongside the p-value. In practice, though, we generally do not have such verifiable knowledge. The Bayesian prior and posterior probabilities might instead be interpreted as measuring the unfalsifiable subjective belief of the experimenter regarding the cancer status of the subject at hand, rather than long-run proportions of cancer status among randomly selected subjects. This subjective belief is not a verifiable statement about the actual parameter (true cancer status), the hypothesis, or the experiment. It is a statement about the experimenter. No matter what number the experimenter produces, he or she is always right. How can anyone claim to know the experimenter's beliefs better than the experimenter? If the Bayesian prior distribution is chosen in such a way that the posterior is dominated by the likelihood or is proportional to the likelihood, Bayesian belief is more objectively viewed as confidence based on frequency probability of the experiment.
Approximate P-value Functions
The likelihood is identified by reading the table of operating characteristics horizontally. The normalized likelihood can be seen as a posterior based on a 1:1:1 prior. It is more objectively viewed as an approximate p-value function. The normalization smooths the operating characteristics of the screening test so the probabilities sum to 1 over the parameter space. The plug-in sampling distribution transposes the operating characteristics of the screening test across the parameter space and also works as a crude approximate p-value function.
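The claim that the normalized likelihood equals a posterior under a 1:1:1 prior can be checked numerically with the sketch below. It covers the normalized likelihood only, not the plug-in sampling distribution. The Pre-Cancer column (0.40, 0.50, 0.10) is an assumption inferred from the article, and the function names are illustrative.

```python
# P(test result | true status); the Pre-Cancer column is an inferred assumption.
OC = {
    "No Cancer":  {"Negative": 0.85, "At Risk": 0.10, "Positive": 0.05},
    "Pre-Cancer": {"Negative": 0.40, "At Risk": 0.50, "Positive": 0.10},
    "Cancer":     {"Negative": 0.05, "At Risk": 0.15, "Positive": 0.80},
}


def likelihood(result):
    """Read the table horizontally: P(result | status) for each status."""
    return {s: OC[s][result] for s in OC}


def normalized_likelihood(result):
    """Rescale the likelihood so it sums to 1 over the parameter space."""
    lik = likelihood(result)
    total = sum(lik.values())
    return {s: v / total for s, v in lik.items()}


def posterior(result, prior):
    """Bayes' theorem with an arbitrary prior over the three statuses."""
    joint = {s: prior[s] * OC[s][result] for s in OC}
    total = sum(joint.values())
    return {s: v / total for s, v in joint.items()}


flat_prior = {"No Cancer": 1, "Pre-Cancer": 1, "Cancer": 1}
for result in ("Negative", "At Risk", "Positive"):
    norm = normalized_likelihood(result)
    post = posterior(result, flat_prior)
    same = all(abs(norm[s] - post[s]) < 1e-12 for s in OC)
    print(result, {s: round(v, 3) for s, v in norm.items()}, "matches 1:1:1 posterior:", same)
```

For an At Risk result the raw likelihood is (0.10, 0.50, 0.15), which sums to 0.75, so normalization divides each entry by 0.75.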
All five methods above use the sampling behavior of the screening test to form a "distribution estimate" of cancer status. In this setting the p-values do not form a distribution function on the parameter space (they do not sum to 1). If an additional follow-up test is to be conducted on the subject at hand, these distribution estimates can be used to perform inference on the power of the future test. If one is not satisfied with this inference on power, a more sensitive and specific test can be sought. Regardless of paradigm, multiple tests can be performed and the results convolved through the likelihood in a meta-analysis to improve the inference on the true cancer status for a given subject.
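One way to sketch the combination of multiple test results through the likelihood is to multiply the per-test likelihoods. This assumes the tests are conditionally independent given the subject's true status and share the same operating characteristics; neither assumption is stated in the article. The Pre-Cancer column (0.40, 0.50, 0.10) is likewise inferred, and the function names are illustrative.

```python
from math import prod

# P(test result | true status); the Pre-Cancer column is an inferred assumption.
OC = {
    "No Cancer":  {"Negative": 0.85, "At Risk": 0.10, "Positive": 0.05},
    "Pre-Cancer": {"Negative": 0.40, "At Risk": 0.50, "Positive": 0.10},
    "Cancer":     {"Negative": 0.05, "At Risk": 0.15, "Positive": 0.80},
}


def combined_likelihood(results):
    """Joint likelihood of several results, ASSUMING conditional independence."""
    return {s: prod(OC[s][r] for r in results) for s in OC}


def normalize(weights):
    """Rescale non-negative weights so they sum to 1 over the parameter space."""
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


# Hypothetical sequence: an At Risk result followed by a Positive result.
results = ["At Risk", "Positive"]
joint = combined_likelihood(results)
print({s: round(v, 4) for s, v in joint.items()})
print({s: round(v, 3) for s, v in normalize(joint).items()})
```

With an At Risk result followed by a Positive result, the joint likelihood is 0.005 for No Cancer, 0.05 for Pre-Cancer, and 0.12 for Cancer, so the second test shifts the weight of evidence from Pre-Cancer toward Cancer.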
Summary
Not every invocation of Bayes' theorem is an instance of Bayesian statistics. The Bayesian posterior is often presented as depicting long-run frequency probabilities of hypotheses to give it a sense of objectivity, but this leads to an internal contradiction and is not a valid form of inference. The Bayesian interpretation of probability as a measure of belief is internally consistent given its premise, but is unfalsifiable. Neither of these interpretations is enough to conclude that there is relevant external information being omitted when performing inference using p-values. The bottom line is that interpretation matters. The next time someone insists on making a probability statement like, "There is 78.4% probability that... ," be sure to ask: "78.4% of what?"
If I have made any mistakes, please let me know in the comments below.
#Statistics #BayesianStatistics
The example in the article above was taken from here, https://www.researchgate.net/publication/350055507_Decision_Making_in_Drug_Development_via_Inference_on_Power
If you enjoyed the article above you may also enjoy this one, https://www.dhirubhai.net/pulse/does-bayesian-probability-success-help-drug-geoffrey-johnson/