What did the New York Times article get wrong about NIPT? Part two: We have to cut somewhere
Now, let’s talk about the cause and impact of a low positive predictive value (PPV) in medical screening, such as NIPT or non-invasive oncology testing.
From the farmer vs. librarian story in part one (read it here), we learned:
Let’s apply these statistical principles to diagnostic screening and see where the NYT article went wrong.
Screening Tests vs. Diagnostic Tests: what are the differences?
First of all, what are the differences between a screening and a diagnostic test? The Cleveland Clinic gave a very nice illustration here.
The core difference is, really, who takes these tests. Screening tests are designed for the general population, people without symptoms, to catch a condition before it leads to a devastating outcome (usually later). If you were to design a screening method, what criteria must it meet?
To me, these are “must-haves”:
This list makes it easy to see why many screening methods are “non-invasive”. Non-invasive tests, which carry very low risk, usually rely on indirect measurements. Therefore, they may have higher false positive (FP) and false negative (FN) rates than a diagnostic test. But that is not the main driver: even with the same accuracy, a screening test will have a lower PPV than a diagnostic test. People taking diagnostic tests already have symptoms, or a positive screening result, so the prior (the background probability) of having the disease is much higher for diagnostic tests.
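The effect of the prior on PPV can be made concrete with Bayes’ rule. Here is a minimal sketch; the sensitivity, specificity, and prevalence numbers are illustrative assumptions, not real NIPT performance figures.

```python
# Sketch: the same test accuracy yields very different PPVs depending
# on the prior (disease prevalence). All numbers are illustrative
# assumptions, not real NIPT performance figures.

def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Identical test (99% sensitivity, 99% specificity) in two settings:
screening = ppv(0.99, 0.99, 0.001)  # general population, 1-in-1000 condition
diagnostic = ppv(0.99, 0.99, 0.20)  # symptomatic patients, 20% prior

print(f"Screening PPV:  {screening:.1%}")   # roughly 9%
print(f"Diagnostic PPV: {diagnostic:.1%}")  # roughly 96%
```

Same test, same accuracy, wildly different PPV; only the prior changed.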
As long as the two distributions overlap, we have to “cut” somewhere
If thinking about the confusion matrix is too confusing, let me illustrate with my ugly hand-drawn plot below. A screening test faces the healthy and diseased populations and tries to tell which person belongs to which group. The rarer the disease, the bigger the difference between the sizes of the two groups.
Although we want a perfect test that makes 0 mistakes in assigning patients to the correct group, the reality is that as long as there is overlap in the measurements between the distribution of the “Healthy” and “Diseased” populations, we have to decide where to cut, thus creating wrong assignments.
The cut-off shown here is very good in terms of specificity and sensitivity. Among all healthy people, a small number are called “diseased”—the ratio of the area of the blue triangle to the overall area under the blue line is very small. And among diseased people, the same thing: the ratio of the orange triangle over the area under the orange curve is small.
However, due to the much bigger healthy population, even with this very good cutoff, the tiny tail of healthy people above the threshold (the blue triangle) still overwhelms the disease cases called positive (the area under the orange curve to the right of the threshold), creating a low PPV. It is inevitable that some healthy people will be called diseased (FP) and some diseased people will be called healthy (FN).
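The hand-drawn picture can be simulated directly. In this sketch, two overlapping Gaussian measurement distributions stand in for the healthy and diseased groups; the means, spread, prevalence, and cutoff are all made up for illustration.

```python
# Sketch: even a "good" cutoff lets the huge healthy population's tiny
# tail outnumber the true cases above the threshold. Means, spread,
# prevalence, and cutoff are illustrative assumptions.
import random

random.seed(42)
N = 1_000_000
PREVALENCE = 0.001  # 1 in 1000 people are diseased
CUTOFF = 3.0        # measurement threshold for calling "diseased"

fp = tp = 0
for _ in range(N):
    if random.random() < PREVALENCE:
        value = random.gauss(4.0, 1.0)  # diseased measurement distribution
        tp += value > CUTOFF
    else:
        value = random.gauss(0.0, 1.0)  # healthy measurement distribution
        fp += value > CUTOFF

print(f"True positives:  {tp}")
print(f"False positives: {fp}")
print(f"PPV: {tp / (tp + fp):.1%}")
```

Even though only about 0.1% of the healthy group lands above the cutoff, that sliver of a million-person population still outnumbers the true positives, dragging the PPV down.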
FP and FN do not carry the same cost
From a purely mathematical point of view, the optimal cut-off is where the “Healthy” and “Diseased” curves cross, as shown in the middle plot above. At this point, the total number of FPs and FNs is minimal. This assumes, however, that FP and FN have equal costs. In real life, FP and FN do carry different costs, so the optimal cutoff may not be at the crossing point.
As we discussed, a screening method usually tries to catch a relatively rare condition before some (later) devastating event.
The costs of an FN, i.e., failing to report a disease to a patient, may include:
The costs of an FP, i.e., mistakenly reporting a disease to a healthy individual, may include:
These issues and costs are complicated and hard to calculate precisely. However, apart from the last item on the FP list, the cost of an FN is much higher than that of an FP.
We also need to consider a practical issue regarding the reporting threshold. Because the vast majority of people will get negative results, no follow-up test will be performed on them. So there is no way to “catch” these FNs until much later, when symptoms start to show. By contrast, since only a very small number of people will get positive results, follow-up confirmatory tests can be done to correct the screening results.
Indeed, a follow-up diagnostic test is always recommended for anyone with a positive screening result. This step minimizes the devastating consequences of an FP in a screening test, such as a wrongly terminated pregnancy.
Based on these considerations, many screening tests set the reporting threshold to allow a higher FP rate than the threshold that would minimize total FPs and FNs. So, in some cases, a low PPV is not only expected but desired.
What did the New York Times article get wrong about NIPT?
The NYT article “discovered” that many NIPT tests have a low positive predictive value (PPV) for rare conditions, and used this to question their validity. Unlike specificity and sensitivity, PPV is not an intrinsic characteristic of a test. A low PPV does NOT dismiss the validity of NIPT tests. Hopefully, by now, I have convinced you that:
Thus, using a low PPV to invalidate a screening test is fundamentally wrong. Of course, it is unfortunate that these FP patients have to go through extra stress and unnecessary diagnostic procedures, like amniocentesis. However, let’s not forget the screening method used before NIPT. Doctors used first-trimester screening (ultrasound + blood tests, combined with the mother’s age) to estimate the fetus's risk of a chromosomal disorder. The criterion for recommending further testing, such as amniocentesis or CVS, was a 1-in-250 chance or greater. That is to say, the old screening method’s PPV threshold was 1/250 = 0.4%, much lower than the PPV rates of NIPT that the NYT article criticized.
In fact, the number of amniocenteses performed has dropped dramatically since medical professional organizations recommended NIPT in the mid-2010s. NIPT is much more accurate than the old first-trimester screening method. Thanks to the wide use of NIPT, many more families are spared unnecessary, riskier amniocentesis or CVS compared with the “good old days”, quite the opposite of the picture the NYT article painted.
Is there hope to have an almost perfect screening test?
Now we know that the low PPV for rare conditions in screening tests is mainly caused by FPs from the much bigger healthy population. If we could build tests with an FP rate as low as the disease prevalence, i.e., with very high specificity, we could achieve a high PPV even for ultra-rare diseases. Is that possible?
As we discussed at the beginning of this article, a population screening test has to be relatively cheap and easy to perform. Pushing specificity high enough that the FP rate is comparable to the prevalence of a rare disease, i.e., specificity close to 1 − prevalence, without dramatically increasing cost is very difficult.
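To put a number on how demanding this is, we can rearrange Bayes’ rule to solve for the specificity a given target PPV would require. The target PPV, sensitivity, and prevalence below are illustrative assumptions.

```python
# Sketch: the specificity needed to reach a target PPV for a rare
# condition, from rearranging Bayes' rule. Target PPV, sensitivity,
# and prevalence are illustrative assumptions.

def required_specificity(sensitivity, prevalence, target_ppv):
    """Specificity needed so the PPV reaches target_ppv."""
    # PPV = sens*prev / (sens*prev + (1-spec)*(1-prev))
    # => 1 - spec = sens*prev*(1/target_ppv - 1) / (1-prev)
    fp_rate = sensitivity * prevalence * (1 / target_ppv - 1) / (1 - prevalence)
    return 1 - fp_rate

# For a 1-in-10,000 condition, a 90% PPV at 99% sensitivity needs:
spec = required_specificity(0.99, 1 / 10_000, 0.90)
print(f"Required specificity: {spec:.5%}")  # about 99.999%
```

Roughly one false positive per hundred thousand healthy people tested, an extremely tall order for a cheap, non-invasive assay.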
However, I remain hopeful. If you look at the ugly hand-drawn plot, you can see the cutoff is decided along a single measured dimension. If we add another data dimension, the healthy and diseased populations may separate better. The better the distributions are separated, the easier it is to assign people to the correct category. Besides the test data itself, we can collect data from other sources and in other modalities, such as other bloodwork results, ultrasound images, the mother’s personal and family history, lifestyle, and more, and use better tools like multi-modal AI. In that case, I am confident that almost perfect screening tests are possible in the not-too-distant future.