What did the New York Times article get wrong about NIPT? Part two: We have to cut somewhere

Now, let’s talk about the cause and impact of a low positive predictive value (PPV) in medical screening tests, such as NIPT or non-invasive oncology tests.

From the farmer vs. librarian story in part one (read it here), we learned:

  1. The performance metrics of a test like the bookworm test, namely sensitivity ("recall" in data science) and specificity, are intrinsic to the test itself.
  2. The population ratio of the two groups determines the positive predictive value (PPV, aka "precision") and negative predictive value (NPV). When this ratio is high, the impact is huge: for rare conditions, the PPV will be low despite the high accuracy of the test.
  3. Even with a low PPV, a positive result may still be very informative to the individual, because it signals a much-elevated risk compared to the background rate, even though this elevated risk may itself still be low.
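
To make points 2 and 3 concrete, here is a minimal sketch of the PPV calculation from Bayes’ rule. The prevalence, sensitivity, and specificity below are hypothetical numbers chosen for illustration, not figures from any particular NIPT assay.

```python
# Hypothetical numbers: a condition affecting 1 in 5,000 pregnancies,
# screened with a test that is 99% sensitive and 99.9% specific.
prevalence = 1 / 5_000
sensitivity = 0.99
specificity = 0.999

true_positives = prevalence * sensitivity                  # affected and called positive
false_positives = (1 - prevalence) * (1 - specificity)     # healthy but called positive
ppv = true_positives / (true_positives + false_positives)

print(f"PPV: {ppv:.1%}")                                   # ~16.5%: most positives are FPs
print(f"Post-test vs. background risk: {ppv / prevalence:.0f}x")  # ~800-fold elevation
```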

Let’s apply these statistical principles to medical screening and see what the NYT article got wrong.

Screening tests vs. diagnostic tests: what are the differences?

First of all, what are the differences between a screening and a diagnostic test? The Cleveland Clinic gave a very nice illustration here.

The core difference is really about who takes these tests. Screening tests are designed for the general population without symptoms, with the goal of catching a condition before it leads to a devastating outcome later. If you were to design a screening method, what criteria must it meet?

To me, these are “must-haves”:

  • Accurate – relatively high sensitivity and specificity, so there won’t be too many false positive (FP) or false negative (FN) cases
  • Very low-risk profile – should not cause risk to otherwise healthy people.
  • Cheap – can be applied to a large population.
  • Accessible – again, most of the population should be able to get it.

This list makes it easy to see why many screening methods are “non-invasive”. Non-invasive tests carry very little risk, but they usually rely on indirect measurements, so they may have higher FP and FN rates than diagnostic tests, though this is not always the case. Even with the same accuracy, a screening test will have a lower PPV than a diagnostic test: people taking diagnostic tests already have symptoms or a positive screening result, so their prior (background probability) of having the disease is much higher.
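
Here is a quick sketch of that last point, using made-up numbers: the same test, with the same sensitivity and specificity, yields very different PPVs when applied to an asymptomatic screening population versus symptomatic patients with a much higher prior.

```python
def ppv(sensitivity: float, specificity: float, prior: float) -> float:
    """PPV from Bayes' rule, as a function of the pre-test probability (prior)."""
    tp = prior * sensitivity
    fp = (1 - prior) * (1 - specificity)
    return tp / (tp + fp)

# The same hypothetical test (95% sensitive, 95% specific) facing two audiences:
screening_prior = 1 / 1_000   # asymptomatic general population
diagnostic_prior = 1 / 5      # symptomatic patients referred for work-up

print(f"Screening PPV:  {ppv(0.95, 0.95, screening_prior):.1%}")    # ~1.9%
print(f"Diagnostic PPV: {ppv(0.95, 0.95, diagnostic_prior):.1%}")   # ~82.6%
```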

As long as the two distributions overlap, we have to “cut” somewhere

If thinking about the confusion matrix is too confusing, let me illustrate with my ugly hand-drawn plot below. A screening test faces both the healthy and the diseased populations and tries to tell which group each person belongs to. The rarer the disease, the bigger the difference in size between the two groups.

Although we want a perfect test that makes zero mistakes in assigning people to the correct group, the reality is that as long as the measurement distributions of the “Healthy” and “Diseased” populations overlap, we have to decide where to cut, and some assignments will be wrong.

The cut-off shown here is very good in terms of specificity and sensitivity. Among all healthy people, only a small number are called “diseased”: the area of the blue triangle is a tiny fraction of the overall area under the blue curve. The same holds among diseased people: the orange triangle is a small fraction of the area under the orange curve.

However, because the healthy population is so much bigger, even with this very good cut-off, the tiny tail of healthy people above the threshold (the blue triangle) still overwhelms the true disease cases called positive (the area under the orange curve to the right of the threshold), which results in a low PPV. It is inevitable that some healthy people will be called diseased (FP) and some diseased people will be called healthy (FN).
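
For readers who prefer numbers to hand-drawn curves, here is a small sketch of the same idea. It assumes, purely for illustration, that the test score is Gaussian in both groups and that the disease affects 1 in 1,000 people: even a cut-off with good sensitivity and specificity ends up with a low PPV.

```python
from scipy.stats import norm

# Hypothetical score distributions: healthy ~ N(0, 1), diseased ~ N(3, 1),
# disease prevalence 1 in 1,000, cut-off at the midpoint between the means.
prevalence = 1 / 1_000
cutoff = 1.5

specificity = norm.cdf(cutoff, loc=0, scale=1)   # healthy correctly below the cut-off
sensitivity = norm.sf(cutoff, loc=3, scale=1)    # diseased correctly above the cut-off

true_pos = prevalence * sensitivity
false_pos = (1 - prevalence) * (1 - specificity)
ppv = true_pos / (true_pos + false_pos)

print(f"sensitivity = {sensitivity:.1%}")   # ~93.3%
print(f"specificity = {specificity:.1%}")   # ~93.3%
print(f"PPV         = {ppv:.1%}")           # ~1.4%: the blue tail swamps the orange one
```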


FP and FN do not carry the same cost

From a purely mathematical point of view, the optimal cut-off is where the “Healthy” and “Diseased” curves cross, as in the middle plot above. At this point, the total number of FP and FN cases is minimal, under the assumption that an FP and an FN carry equal costs. In real life, however, FP and FN have different costs, so the optimal cut-off may not be at the crossing point.

As we discussed, a screening method usually targets a relatively rare condition in order to prevent devastating events later on.

The costs of FN, i.e., failure to report disease to a patient, may include:

  • Missing therapeutic opportunities
  • Cost of later, more complicated treatments
  • Much worse quality of life for the patient
  • Possible loss of lives prematurely

The costs of FP, i.e., mistakenly reporting a disease to a healthy individual, may include:

  • Unnecessary stress for a prolonged time
  • Cost and risk of unnecessary follow-up diagnostic tests
  • Unnecessary procedures that cause devastating consequences, e.g., abortion of a healthy fetus

These issues and costs are complicated and hard to calculate precisely. However, aside from the last item on the FP list, the cost of an FN is generally much higher.

We also need to consider a practical issue regarding the reporting threshold. Because the vast majority of people get negative results, no follow-up test is performed on them, so there is no way to “catch” these FNs until much later, when symptoms appear. In contrast, since only a very small number of people get positive results, follow-up confirmation tests can be done to correct the screening results.

Indeed, a follow-up diagnostic test is always recommended for people with a positive screening result. This step minimizes the most devastating consequence of a screening FP, such as the abortion of a healthy fetus.

Based on these considerations, many screening tests set the reporting threshold to allow a higher FP rate than the threshold that would minimize the total number of FP and FN cases. So, in some cases, a low PPV is not only expected but desirable.
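
Here is a small sketch of how such a cost-weighted threshold can be chosen, again with made-up Gaussian score distributions and made-up costs: the more an FN costs relative to an FP, the lower the cut-off moves, calling more people positive and accepting a lower PPV.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical setup: healthy scores ~ N(0, 1), diseased scores ~ N(2, 1),
# prevalence 1 in 1,000, in a population of one million people.
population = 1_000_000
prevalence = 1 / 1_000
n_diseased = population * prevalence
n_healthy = population - n_diseased

thresholds = np.linspace(-2, 6, 801)
fp = n_healthy * norm.sf(thresholds, loc=0, scale=1)     # healthy called positive
fn = n_diseased * norm.cdf(thresholds, loc=2, scale=1)   # diseased called negative

def best_threshold(cost_fp: float, cost_fn: float) -> float:
    """Cut-off that minimizes the total expected cost over the grid."""
    return thresholds[np.argmin(cost_fp * fp + cost_fn * fn)]

print("Equal costs:        cut at", round(best_threshold(1, 1), 2))    # ~4.45
print("FN 50x more costly: cut at", round(best_threshold(1, 50), 2))   # ~2.50
# The costlier the FN, the lower the cut-off: more positives are called,
# more FPs are accepted, and the PPV drops by design.
```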

What did the New York Times article get wrong about NIPT?

The NYT article “discovered” that many NIPT tests have a low positive predictive value (PPV) for rare conditions, and used that to question their validity. Unlike specificity and sensitivity, PPV is not an intrinsic characteristic of a test. A low PPV does NOT dismiss the validity of NIPT tests. Hopefully, by now, I have convinced you that:

  1. Screening tests are designed with criteria different from diagnostic tests because their audience is very different.
  2. For rare conditions, even excellent screening tests will have a low PPV due to the overwhelming size of the healthy population. This is pure math.
  3. The costs of an FN and an FP are very different in most cases. Therefore, minimizing the total number of FP and FN cases may not be the best strategy for a screening test; in some cases, a higher FP rate is desired.
  4. A follow-up diagnostic test is always recommended for patients with positive screening results. This will minimize the most negative impact on patients.

Thus, using a low PPV to invalidate a screening test is fundamentally wrong. Of course, it is unfortunate that FP patients have to go through extra stress and unnecessary diagnostic procedures, like amniocentesis. However, let’s not forget the screening method used before NIPT. Doctors used first-trimester screening (ultrasound plus blood tests, combined with the mother’s age) to estimate the fetus’s risk of chromosomal disorders. The criterion for recommending further testing, such as amniocentesis or CVS, is a risk of 1 in 250 or greater. That is to say, at that referral threshold, the old screening method’s PPV is as low as 1/250 = 0.4%, much lower than the NIPT PPVs that the NYT article criticized.

In fact, the number of amniocenteses performed has dropped dramatically since medical professional organizations recommended NIPT in the mid-2010s. NIPT is much more accurate than the old first-trimester screening method. Thanks to the wide use of NIPT, many more families are spared unnecessary, riskier amniocentesis or CVS compared with the “good old days”, quite the opposite of the picture the NYT article painted.

Is there hope for an almost perfect screening test?

Now we know that the low PPV of screening tests for rare conditions is mainly caused by FPs from the much bigger healthy population. If we could have tests with an FP rate close to the prevalence of the disease, i.e., very high specificity, we could have a high PPV even for ultra-rare diseases. Is that possible?

As we discussed at the beginning of this article, a population screening test has to be relatively cheap and easy to perform. Pushing specificity up to the level required for a rare disease, roughly 1 minus the disease prevalence, without dramatically increasing the cost is very difficult.
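
To see how demanding that requirement is, here is a short sketch that solves the PPV formula for the specificity needed to reach a target PPV. The prevalence, sensitivity, and target below are hypothetical.

```python
def required_specificity(prevalence: float, sensitivity: float, target_ppv: float) -> float:
    """Specificity needed to reach a target PPV, solved from the definition of PPV."""
    # PPV = prev*sens / (prev*sens + (1-prev)*(1-spec)), rearranged for spec.
    fp_rate = prevalence * sensitivity * (1 / target_ppv - 1) / (1 - prevalence)
    return 1 - fp_rate

# Hypothetical case: a condition affecting 1 in 10,000, a 99%-sensitive test,
# and a goal of PPV = 50% (a positive call is right half the time).
print(required_specificity(1 / 10_000, 0.99, 0.50))   # ~0.9999: about one FP per 10,000
```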

However, I remain hopeful. If you look at the ugly hand-drawn plot, you can see that the cut-off is decided by a single dimension of measurement. If we add one more data dimension, the healthy and diseased populations may separate better, and the better the distributions are separated, the easier it is to assign people to the correct category. Besides the test data itself, we can collect data from other sources and modalities, such as other bloodwork results, ultrasound images, the mother’s personal and family history, lifestyle, and more, and use better tools like multi-modal AI. In that case, I am confident that almost perfect screening tests are possible in the not-too-distant future.
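
As a toy illustration of the extra-dimension idea, with simulated data rather than any real clinical measurements, combining two weakly informative features separates the two groups better than either feature alone:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical example: two measurements per person, each shifted by about one
# standard deviation between the healthy and the diseased groups.
healthy = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n, 2))
diseased = rng.normal(loc=[1.0, 1.0], scale=1.0, size=(n, 2))

def empirical_auc(scores_healthy: np.ndarray, scores_diseased: np.ndarray) -> float:
    """Probability that a random diseased score exceeds a random healthy score."""
    # Rank-based (Mann-Whitney U) estimate of the area under the ROC curve.
    combined = np.concatenate([scores_healthy, scores_diseased])
    ranks = combined.argsort().argsort() + 1
    diseased_ranks = ranks[len(scores_healthy):]
    u = diseased_ranks.sum() - len(scores_diseased) * (len(scores_diseased) + 1) / 2
    return u / (len(scores_healthy) * len(scores_diseased))

# One dimension alone vs. the simple sum of both dimensions as the score.
print("AUC, feature 1 only:", round(empirical_auc(healthy[:, 0], diseased[:, 0]), 3))               # ~0.76
print("AUC, both features: ", round(empirical_auc(healthy.sum(axis=1), diseased.sum(axis=1)), 3))   # ~0.84
```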

References:

https://my.clevelandclinic.org/-/scassets/files/org/patients-visitors/billing/understanding-difference-between-screening-and-diagnostic-colonoscopy.ashx

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7088458/pdf/nihms-1572971.pdf

https://obgyn.onlinelibrary.wiley.com/doi/10.1002/pd.6312

https://www.hopkinsmedicine.org/health/treatment-tests-and-therapies/first-trimester-screening-nuchal-translucency-and-nipt

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6046356/
