
Fixing Covid-19 Case Number and Death Toll Underreporting

Mateusz Maciejewski

Teaching machines teach us new biology

Published: April 20, 2020

In this challenging time, many of us are constantly being inundated with conflicting updates and statistics in the news on Covid-19. You may have seen recent headlines such as ‘Wuhan Revisiting their Corona Death Toll by up to 50%’, or ‘Why Epidemiologists Still Don't Know the Death Rate for Covid-19’.

This ambiguity and conflicting information can lead to frustration, anxiety, confusion and growing fear. As a scientist, I find that gaining a clearer understanding helps us shift from fear to conscious action. To help with this, I have created a simple and intuitive app (available at https://pharmhax.shinyapps.io/covid-corrector-shiny) and a walk-through to help you make sense of the data:

First, I’ll paint a picture for you to explain why I’ve developed this method. Maybe you’ll relate to the thought process, maybe you won’t, but the underlying reality surrounding it we can all relate to, because, as of today, it’s become inescapable.
Second, I’ll show you why the numbers are unfortunately likely much higher than reported.
Third, I’ll tell you how the accuracy of these numbers can be fixed.
Fourth, I’ll walk you through the corrected numbers, and point you to the app I have created for you to be able to check these numbers for yourself - it’s important to note here that you don’t need any specialized knowledge to use the app, all you need is a browser (or your thumbs if you insist on using dashboard-styled apps on your phone). I’ve also shared the GitHub repository in case you want to play with the code or extend it.
Fifth, I’ll ask you to share this article with your network, and share any data you might have regarding the underreporting at data@neurosynergy.ai, if you’re privy to it. This will increase the accuracy of the correction method presented in this writeup so that we, as a community, can better prepare and respond.

A Low Detail Rendering

Long lines in stores, guidelines to wear masks, requirements to stay in your apartment, and socially distance - you’ve seen these, but when you look at the number of reported coronavirus infections (depending on where you’re located), you might come to a conclusion that there ought to be only very few people around you who might possibly have corona, so why all the fuss?

It’s important to understand that it truly is risky to ignore the guidelines, and it seems to me that the crux of the matter is to see if the numbers of coronavirus cases (as well as fatalities) are underreported. If indeed a lot more people are infected, you may begin to see why the risk of being in public is higher than it might seem.

Spoiler alert: the numbers are indeed significantly higher than reported, so if you read only until this point, do stay at home… And do read the rest while you rest!

Ok. The picture has been outlined, at least in broad strokes. Let’s get into a deeper level of detail.

Covid-19 — the Morbid Showstopper of 2020 (and Maybe 2021, Too)

You’ve probably seen a lot of articles about coronavirus lately, and in particular about how massively underreported it might be. I’ve left some links for you above, but if you want to see more, just type in “coronavirus underreporting” in Google. You’ll be surprised by how widely this underreporting is, well, reported. So, if underreporting is no secret, why is nothing done about it, why can’t we see the real numbers?

One reason is that the real numbers simply can’t be measured.

‘Rona is very contagious, with an R0 (number of people who will catch a virus from an infected individual) estimated at 2 - 3, and with rampant discussions putting this number at even higher than that - according to a recent article from a group at Los Alamos National Lab, it’s closer to 5. Either way, it’s much higher than regular flu with its R0 of about 1.3.

But why don’t we even have a precise estimate of Covid-19’s R0? The most likely reasons behind it are the same as those leading to the systematic underreporting of coronavirus cases and deaths. Let’s have a look at some of them.

One major reason for missing reports is how often Covid-19 is asymptomatic. The director of the CDC recently put the fraction of asymptomatic carriers (who are still contagious) at 25%. Asymptomatic carriers will not notice their infection and thus won’t contribute to the reported case numbers, which are lowered by this fraction. Beyond that, a staggering number of coronavirus cases are presumed to be mild: according to a study published in Science, ~85% of cases were mild enough not to warrant a doctor’s visit and therefore went undocumented. 85%! Taken at face value, this is a truly hair-raising figure. This first (huge) source of underreporting will mostly affect the reported case counts and the estimates of coronavirus dynamics such as the R0. It will affect the reported fatality numbers much less, since these stem from the most severe cases, which will almost certainly end up in the ER.
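To get a feel for what an undocumented fraction means in absolute terms, here is a small Python sketch. It assumes, purely for illustration, that the documented cases are simply the remaining share of all cases; the function name and the count of 15,000 are hypothetical and not taken from any report.

```python
def implied_total(reported, undocumented_fraction):
    """Scale a reported count up to the total implied by an undocumented fraction."""
    if not 0 <= undocumented_fraction < 1:
        raise ValueError("undocumented_fraction must be in [0, 1)")
    # If a fraction f of cases is never documented, the reported ones are (1 - f) of the total.
    return reported / (1 - undocumented_fraction)


# Hypothetical example: 15,000 documented cases with 85% of all cases undocumented.
total = implied_total(15_000, 0.85)
multiplier = total / 15_000  # how many true cases stand behind each reported one
```

Under this simplifying assumption, an 85% undocumented fraction means each reported case stands for between six and seven actual ones.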

We will now switch gears and focus on the reports of deaths. Deaths due to coronavirus can indeed be underreported because of secondary causes of death, local reporting practices, and more miscellaneous circumstances.

Let’s start with the last and most esoteric category above, the miscellany of death underreporting. One example is provided by a WSJ article, which tells a story of several instances of large death tolls that have been noted in nursing homes in Italy, with as many as over a third of their residents dying in the month of March, where none of these deaths were reported as caused by coronavirus. Other miscellaneous examples would stem from deaths that happen at home - these also appear to be widely underreported - as well as deaths of patients who haven’t been tested for coronavirus due to test shortages, and therefore have not been reported as Covid victims.

Secondary causes of death are another reason for underreporting of coronavirus-related deaths: a death can be ascribed to a downstream cause, like respiratory failure, rather than to Covid-19 as the original reason behind the respiratory failure.

Local reporting practices might be another source of unreported Covid fatalities. One example comes from Poland, where only one of the two medical codes used to report coronavirus deaths was anointed for use in the official government-issued guidelines - U07.1, the ICD-10 code that denotes the cases of Covid-19 confirmed via laboratory testing, while U07.2, the code used for patients who are suspected to have contracted coronavirus because of their clinical presentation was not covered by the official guidelines. Take a wild guess at which of the two cases is encountered more often given the scarce availability of clinical tests. This way, all patients who should have been diagnosed via the U07.2 ICD code who die of coronavirus won’t appear in the officially reported numbers.

Ok, so coronavirus cases are underreported, and so are coronavirus fatalities - can we figure out what the real numbers are?

The How

This section goes into the methodological details - the underpinnings of how the code that powers our app operates to correct the reported numbers. If you’d prefer to skip “the how” right now and see these methods in action, please feel free to proceed to the following section, where the corrective methods are applied to the reports.

A couple of weeks back, I stumbled upon a very interesting medRxiv preprint on LinkedIn from Lachmann et al that focused on a part of the task outlined here: correcting the under-reported Covid-19 case numbers (as the preprint title itself reveals).

Preprints are curious things. The version of the article that I downloaded at the time has since been superseded, and the now current version of the article uploaded on March 31st carries a very different center of gravity, where the authors focus on interpolation and modeling hospitalizations, while the version of the article back from March 18th (still accessible here) put the focus on developing a simple way to correct the report numbers of any country by comparing the demographics and death rate of that country with the demographics and death rate of a reference country. The word rate in “death rate” is operative here. The underlying idea is that the true death rates should be identical in any two countries that have a similar healthcare and a similar demographic distribution.

Since the death rate is the number of deaths divided by the number of cases, if we have an imperfect knowledge of cases, then our calculation of the death rate will be off. The same applies if we have an incomplete knowledge of deaths, but in theory deaths are harder to miss than cases. As we learned in the last section, we got here in the first place because reporting of cases and deaths is imperfect across the planet, but as it turns out South Korea has been remarkably adept at administering coronavirus tests, providing great patient care, and bookkeeping of cases and deaths related to Covid, thus providing as good a benchmark as it gets.

The correction factor proposed by Lachmann et al is based on the average death rate of the reference country multiplied by the ratio of vulnerabilities of the corrected country and the reference country, where the vulnerability of each country is given as the sum sweeping through its citizens stratified by age multiplied by the mortality rates observed in people of that age. The equations describing these quantities and their relations are given in section C of version 1 of the manuscript, in case you’re interested.
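To make the verbal description above concrete, here is a minimal Python sketch. It is not the code from the app or the preprint. The last step, dividing reported deaths by the expected death rate to get an implied case count, is an assumption that follows from the definition of death rate given earlier. The function names, the three age bands, and every number are hypothetical placeholders.

```python
def vulnerability(age_fractions, age_mortality):
    """Sum over age bands of (share of population) * (mortality rate in that band)."""
    if len(age_fractions) != len(age_mortality):
        raise ValueError("age bands must line up")
    return sum(f * m for f, m in zip(age_fractions, age_mortality))


def expected_death_rate(ref_death_rate, target_vuln, ref_vuln):
    """Reference death rate scaled by the ratio of vulnerabilities."""
    return ref_death_rate * target_vuln / ref_vuln


def corrected_cases(reported_deaths, expected_rate):
    """Assumed final step: cases implied by the deaths at the expected death rate."""
    return reported_deaths / expected_rate


# Placeholder inputs (NOT real data): three age bands, young / middle / old.
age_mortality = [0.001, 0.01, 0.10]   # mortality rate per age band
ref_fractions = [0.5, 0.4, 0.1]       # reference country age distribution
target_fractions = [0.4, 0.4, 0.2]    # corrected country age distribution

ref_vuln = vulnerability(ref_fractions, age_mortality)
target_vuln = vulnerability(target_fractions, age_mortality)
rate = expected_death_rate(0.02, target_vuln, ref_vuln)
estimate = corrected_cases(1_000, rate)
```

In this toy setup the corrected country is older than the reference, so its expected death rate is higher and the same number of deaths implies fewer cases than it would at the reference rate.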

Great, now we have a way of correcting the unreported case numbers - an adjustment that brings us to a number relatively close to how many cases could be detected in a given country, if its testing practices (and the overall response to coronavirus) were as good as in South Korea. This doesn’t account, however, for other factors that are extremely difficult to account for, such as the asymptomatic (or very mild) cases.

As the next step, in order to account for other factors and adjust the case numbers further and to adjust the death report numbers as well, we will use a simple multiplier. This multiplier will not be a single point, but rather a whole Gaussian distribution, which means that our estimates of corrected report and death numbers will no longer be represented by one point per date either, but rather whole collections of possible values.
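A rough Python sketch of this idea follows. It assumes a multiplier mean of 3 and a standard deviation of 0.5, both chosen only for illustration, and it redraws any non-positive multiplier, which strictly speaking makes the distribution a truncated Gaussian. The function names and the empirical-quantile interval are my own choices here and are not taken from the app.

```python
import random


def sample_corrected(point_estimate, mean, sd, n_draws=10_000, seed=0):
    """Multiply a point estimate by Gaussian multiplier draws; return sorted results."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    draws = []
    while len(draws) < n_draws:
        m = rng.gauss(mean, sd)
        if m > 0:  # a non-positive multiplier is meaningless, so redraw it
            draws.append(point_estimate * m)
    draws.sort()
    return draws


def interval(sorted_draws, level=0.95):
    """Central interval read off the empirical quantiles of the sorted draws."""
    n = len(sorted_draws)
    lo = sorted_draws[int((1 - level) / 2 * n)]
    hi = sorted_draws[min(n - 1, int((1 + level) / 2 * n))]
    return lo, hi


# Hypothetical point estimate of 100 (in any unit), multiplier ~ N(3, 0.5).
draws = sample_corrected(100.0, mean=3.0, sd=0.5)
low, high = interval(draws)
```

Each date then carries a whole collection of possible values, and the interval is simply a summary of that collection.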

The mean value of this multiplier and its standard deviation are currently rather arbitrary. But going forward we will treat them as Bayesian priors, and once we have the data on what fraction of cases and deaths goes unreported in a given country, I will use the Bayes’ theorem to combine these and to arrive at far more realistic posterior estimates of the multiplicative corrective factors for each country.

It should be noted that there are other very promising ways to correct the death report numbers, including using the deviation from the expected number of deaths in a given region at a given time, which currently can all be ascribed to Covid-19 deaths. This approach has been used by The Economist, and I will track it carefully as eventually it might become a source of data that could be merged into the approach discussed and shown here.
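The excess-death idea can be sketched in a few lines of Python. This is a simplified illustration and not the method used by The Economist: the baseline here is just the mean of the same period in earlier years, and the function names and numbers in the usage note are hypothetical.

```python
from statistics import mean


def excess_deaths(observed, baseline_years):
    """Observed deaths minus the expected number (mean of the same period in past years)."""
    return observed - mean(baseline_years)


def unattributed_excess(observed, baseline_years, reported_covid_deaths):
    """Excess deaths not already explained by officially reported Covid-19 deaths."""
    return max(0.0, excess_deaths(observed, baseline_years) - reported_covid_deaths)
```

For example, with 1,500 observed deaths, a baseline of 1,000, and 200 reported Covid-19 deaths, `unattributed_excess(1500, [1000, 1100, 900], 200)` leaves 300 deaths unexplained by the official count.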

Correcting the Reports

With the methods described above in hand - the case number correction from Lachmann et al and the Gaussian multiplier correction (which will in the future be treated as a prior to a more accurate Bayesian correction, once the data on underreporting becomes available) - it seemed to me like it would be great to have an app to look at the numbers from each country, apply the corrections, and immediately visualize the estimates. So I built a Shiny app to do just that! Here it is:

https://pharmhax.shinyapps.io/covid-corrector-shiny/

and here’s the GitHub repository:

https://github.com/pharmhax/covid19-corrector

It should be noted that the code, referential correction methodology, and the visualizations are heavily based on the Lachmann et al preprint.

Let’s jump right into the visualizations and start by having a quick look at the raw, uncorrected case and death numbers in the USA and South Korea:

Note that in the plot above cases and recoveries are tracked on the left-hand y-axes, while the right-hand y-axes correspond to the deaths.

We can see that the death rate in the US is behaving in a less stable fashion than in South Korea:

Here the reference country (South Korea) is shown in grey; the country being corrected (USA) is shown in red.

This instability perhaps serves as an adequate harbinger of the corrected case number plot for the US:

~4.5M cases, 4.5M! Not under 1M - I hope it’s now clearer why thou shalt social distance and wear that face mask.

Now, let’s assume that only about a third of the deaths are reported in the first place and apply a statistical 3-fold multiplicative correction to both the deaths and cases:

That’s ~13.6M cases (95% CI: 9.6M - 18.4M) and 121.4k deaths (95% CI: 81.5k - 161.6k). It should be pointed out that the multiplicative correction is applied without any data, so before we have the data to make the statistical correction more realistic, these estimates should be taken with a bucket of salt!

Now let’s have a look at Poland as a representative country that doesn’t include the (predominant) U07.2 cases and deaths in its guidelines:

~9.3k reported cases have been corrected to ~37.4k - a significant increase, but this number might seem relatively lower (by population proportion) than what we’ve seen in other countries. Again, we’re going to assume that only about a third of the deaths are reported in the first place and apply a statistical 3-fold multiplicative correction to both the deaths and cases in Poland (which for the cases might be actually conservative, given that we’re also missing the asymptomatic patients!) and see the adjusted numbers:

Our cases have undergone a dramatic increase from ~37.4k projected cases to ~112.7k (95% CI: 77.4k - 146.7k), and from ~360 reported deaths to ~1,100 (95% CI: 740 - 1,450), so in each case the increase is approximately 3-fold, as expected from the settings that we used in our correction.

Let’s have a look at Italy:

These numbers are staggering - the current case number estimate of just under 180k is adjusted using the data from South Korea to over 10 times as much, i.e. 1.76M. Further adjustments using the multiplicative method (where we’re multiplying by 5, as Italy’s healthcare system has been particularly overwhelmed during this crisis) show that the cumulative number of cases might be as high as 8.8M (95% CI: 5.4M - 12.2M), with 117.5k deaths (95% CI: 74.8k - 160.4k).

And finally, let’s have a look at China:

China hovers at around 84k reported cases and has recently updated its death toll from <3.5k to ~4.6k (this sudden uptick can be seen across the plots above). After adjustment the number of cases rises over 10-fold, to ~900k. Given the recent articles, we can assume that the underreporting of both cases and deaths in China is widespread, so we apply a multiplicative correction of 5. This shows us that there may easily have been as many as 4.5M cases (95% CI: 2.7M - 6.3M), and 23k deaths (95% CI: 14k - 32.5k) - a number that, according to recent headlines, might still be understating the true magnitude of the crisis in China. This example provides a good illustration of the fact that data is badly needed to make this corrective measure more accurate and enable it to capture the nuances of reporting specific to each country on a case-by-case basis.

From Priors to Posteriors - you can help make the estimates more realistic!

As mentioned in passing earlier, the multiplicative Gaussian correction presented here lends itself well to the Bayesian paradigm. Without going into excessive mathematical detail, we can use the Bayes’ theorem roughly as follows:

P(m | D) = P(m) P(D | m) / P(D)

where P(m | D) is the posterior estimate of the multiplier once we’ve observed the real-life data on what fractions of cases and/or deaths go unreported, P(m) is our pre-data prior estimate (the one currently used in the app), P(D | m) is the likelihood of the observed data D under a given multiplier m, and P(D) is a normalizing constant that does not depend on m.

I bet you’ve seen this a million times - if you haven’t, you might have an inkling for which part of this is crucial, but missing. It’s the data, D!
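Once such data exist, the update could look roughly like the Python sketch below. It assumes the simplest possible case: a Gaussian prior on the multiplier, and independent observed multipliers with Gaussian noise of known standard deviation, which gives a closed-form (conjugate) posterior. The app may end up using a different likelihood; the function name and all numbers in the usage note are hypothetical.

```python
def normal_posterior(prior_mean, prior_sd, observations, obs_sd):
    """Posterior mean and sd of m for a Normal prior and Normal likelihood (known obs_sd)."""
    n = len(observations)
    if n == 0:
        return prior_mean, prior_sd  # no data D: the posterior is just the prior
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / obs_sd ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    # Precision-weighted average of the prior mean and the observations.
    post_mean = post_var * (prior_mean * prior_precision + sum(observations) / obs_sd ** 2)
    return post_mean, post_var ** 0.5
```

For instance, `normal_posterior(3.0, 1.0, [5.0], 1.0)` weighs a prior centred on 3 equally against a single observation of 5 and lands on a posterior mean of 4 with a smaller standard deviation than the prior. The more observations come in, the less the arbitrary prior matters.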

If you have any data, i.e. knowledge regarding what fraction of cases or what fraction of deaths due to coronavirus go unreported, please contact us at data@neurosynergy.ai. Also, please share this article with your network, and especially with healthcare professionals, as they might have valuable insight into what the unreported fractions are. This will help turn the simple multiplicative correctors presented in this article into realistic statistical correctors.

That’s all for now. Stay healthy, and remember - social distancing and wearing those face masks does make sense (and the numbers, especially when adequately corrected, *are* showing it).

---

Opinions and content expressed in this article are solely my own and do not express the views or opinions of my employer.

Originally published at NeuroSynergy.

Paul Yaworsky

Chief Scientific Officer at Mediar Therapeutics

4 years ago

Very interesting ! Thanks for sharing!


Nicholas Wells

Helping life sciences companies navigate UK, EU & US regulations to market & maintain their innovations successfully

Interesting article. 2 questions. 1) did you create this on the basis of underreporting being a fact? 2) did you build anything into your calculations on the potential for overreporting of deaths related to COVID-19?


Iain Kilty

Chief Executive Officer at Sitryx

Thanks for sharing this sobering analysis Matt, very interesting approach to analysing the data that starts to get at some of the big question
