COVID-19: Counting the Hidden Cases

A Data Scientist & Engineer Looks at the Data

[Figure: US COVID-19 total cases over time, plotted on a semi-log scale]

Like many during this period, I've been following the growing understanding of the global pandemic, and as someone who has studied exponential growth in artificial genetics (genetic algorithms and evolutionary computation), I've paid special attention to the data. Wide variation in model predictions is to be expected, because small changes in the parameters of an exponential model can cause large differences in outcomes. Using exponential growth models for practical matters should therefore always be tempered with a good deal of skepticism, a skepticism that has been lacking among model makers, media, and government members alike. Moreover, as an engineer and engineering educator who grew up on slide rules and graph paper, I am concerned by the modern tendency to rely on computational modeling outputs without a corresponding uptick in gray-matter engagement, a tendency exacerbated by the proliferation of canned simulation software and apps.

Noticing the Data: The Early Data Anomaly (15-29 February)

The other day, I was looking at the COVID-19 total case growth data for the US, and as an engineer brought up in the good old days when we had to plot our own semi-log graphs by hand, I was puzzled by an early anomaly in the US data (see above).

The anomaly. The data from 15-29 February shows flat to very slow growth, while the data from 1-19 March shows clear exponential growth (a straight line on a log-linear graph indicates exponential growth). Although the compartment models of epidemiology are nonlinear, they suggest that when an epidemic breaks out, growth will initially be exponential (linear on a semi-log graph) until a large number of infections have taken place.
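The link between a straight line on a semi-log graph and exponential growth can be written out explicitly. The symbols here are notation introduced for illustration, not the author's: N(t) is the total case count on day t, N_0 is the count at the start of the period, and s is the growth rate in decades per day.

```latex
N(t) = N_0 \cdot 10^{\,s t}
\quad\Longrightarrow\quad
\log_{10} N(t) = \log_{10} N_0 + s\,t
```

On a semi-log graph the vertical axis is log10 N, so constant exponential growth appears as a line with slope s, and a flat stretch corresponds to s near zero.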

Why the anomaly is important. The data anomaly is hugely important. COVID-19 is quite contagious, and once it arrives in a country we should expect it to infect people in an exponential manner. We do not expect a period of slow growth while the infection incubates or gets ready to infect. It should be exponential from the get-go. As such, careful reflection on the data suggests that the period of flat/slow growth was a period of largely undetected spread. Moreover, the data starts on 15 February, yet we know that there were cases on the West Coast as early as 20 January. We should therefore expect any early landing to have been spreading at exponential rates much earlier than the data set suggests (possibly even before 20 January).

Why was this missed by other modelers and data scientists? It's not easy to answer this question, but it may be that the period of early growth was ignored by data scientists who were looking to estimate the exponential growth rate. Graphs comparing growth in different countries typically start from the 100th case to eliminate the "bad" early data, and so show the expected exponential growth of the pandemic's early stages. This is understandable from a parameter estimation perspective, but the period of early bumpy data suggests a data collection problem, and straightforward reflection indicates that the virus was spreading exponentially fast during this period in cases that were under the radar. In other words, there are many, many hidden cases, cases that are not confirmed (or counted) because the patients were asymptomatic or untested.

Estimating these hidden cases is important to understand the severity of COVID-19 versus other health epidemics that are shouldered by society in more routine ways (seasonal flu, the common cold).

Period of Hidden Growth: Key to Counting Hidden Cases

As such, here we assume that the period before 1 March (the period of hidden growth) would be better represented by the exponential growth of the 1-19 March data. We proceed in three steps:

  1. Estimate the exponential growth using the 1-19 March data.
  2. Minimal correction: Project back to 15 February using the exponential growth of the 1-19 March data, and shift the curve up a constant. Doing so represents the assumption that the data collected represents a constant fraction of the total cases. We will consider this projection a lower bound on the number of cases (measured + hidden).
  3. Earlier entry correction: Project back even further (before 15 February) to help determine how high the case total might be under the assumption of earlier entry. These projections will be limited in accuracy because the linear projection does not account for the nonlinearity of epidemic growth. Moreover, working with exponential growth numbers is tricky because small changes in parameters give large changes in output. Nonetheless, the numbers are enlightening and suggest important lessons.

Consider each of these in turn.

Estimating the Exponential Growth Rate: 1-19 March 2020

Inspecting the curve, we see that the period 1-19 March 2020 shows as a relatively straight line on a semi-log plot. Taking log base 10 of the ratio of the case counts, log10(13865/75) = 2.267, and dividing by the number of days (19-1 = 18), we get the slope of the line as 0.1259 decades per day. Inverting, this tells us that the number of infections increases by a factor of 10 every 7.94 days, or roughly every 8 days. This number is consistent with data from other countries. Of course, this figure is averaged over data from many states, but infectious diseases don't really care about state boundaries, and while planes, trains, and automobiles are in operation, treating a given country as a single big pot is a good first approximation. The key thing is that doing so here will not change the estimate of this initial growth.
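The slope calculation above can be reproduced in a few lines. This is a minimal sketch using only the two endpoint counts quoted in the text (75 and 13,865); the function name `decades_per_day` is a hypothetical helper, and a two-point slope is of course cruder than a fit through every daily value.

```python
import math

def decades_per_day(cases_start, cases_end, days):
    """Slope of a straight line on a base-10 semi-log plot, in decades per day."""
    return math.log10(cases_end / cases_start) / days

# Endpoint counts quoted in the text for 1 March and 19 March 2020.
slope = decades_per_day(75, 13865, 19 - 1)

# Inverting the slope gives the number of days per tenfold increase.
days_per_tenfold = 1 / slope

print(f"slope: {slope:.4f} decades per day")
print(f"tenfold increase every {days_per_tenfold:.2f} days")
```

Rounded to the precision used in the text, this gives the same 0.1259 decades per day and 7.94 days per decade.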

To put this in better perspective, the number of total cases grows by a factor of 10 every 8 days or so. In 6*8 = 48 days (roughly 7 weeks) a single case would grow to a million. Given the size of the country (330 million) or, say, California (40 million), this growth would not slow in this period. Moreover, as long as people were traveling between cities, cases in early infected states would spread to other states, and importation of cases from China would be expected in any city with direct connections to the mainland following the outbreak in Wuhan. All these factors suggest that these corrections are conservative (on the low side).
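The "one case to a million" figure follows directly from the growth rate. The sketch below assumes the 0.1259 decades-per-day slope from the text and unchecked exponential growth; `days_to_reach` is a hypothetical helper name.

```python
import math

SLOPE = 0.1259  # decades per day, the value estimated in the text

def days_to_reach(target_cases, initial_cases=1.0, slope=SLOPE):
    """Days of constant exponential growth needed to go from initial to target cases."""
    return math.log10(target_cases / initial_cases) / slope

days = days_to_reach(1_000_000)
print(f"one case reaches a million in about {days:.1f} days ({days / 7:.1f} weeks)")
```

Using the unrounded slope rather than "8 days per decade" gives a figure slightly under the 48 days quoted above, which illustrates how sensitive these projections are to small changes in the rate.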

Minimal Conservative Correction --> 9.2 Million Cases on 14 April

Assuming that the 15-29 February period is badly represented by the data stream, and that the growth during that period is one decade every 7.94 days, we can conservatively project the data on 1 March back to 15 February. Doing so by looking at the chart and projecting that line back, we see that it suggests that roughly one case would be present on 15 February. We call this the linear back-projection of the data. Epidemiological models usually result in constant exponential growth early on, and this back-projection simply imposes this correction on the data, assuming that the earliest stages should have grown exponentially as fast as, if not faster than, the 1-19 March data.

Given that the exponential back-projection of the data dictates one case on 15 February, but the data shows 15, we can adjust the data upward. Taking the log10 of the ratio 15/1 gives us 1.176 decades. This suggests that the whole curve should be shifted up 1.176 decades to give a minimal correction for the incomplete data of 15 February-1 March. Making this adjustment means that the total number of cases shifts up by a factor of 15 (1.176 decades on a log scale), or 15*613,886 = 9,208,290 cases on 14 April. Given that data collection was inconsistent and missed many cases, using the 15 cases that were known is itself conservative. The actual number is likely to be much higher.
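The minimal correction amounts to a constant vertical shift on the log scale. The sketch below reproduces the arithmetic, taking the 14 April reported total as 613,886 (the figure implied by the product of 9,208,290 in the text); the variable names are illustrative only.

```python
import math

reported_15_feb = 15       # cases in the official data on 15 February
projected_15_feb = 1       # cases implied by the back-projected line (read off the chart)
reported_14_apr = 613_886  # reported total on 14 April, as implied by the text

# Vertical shift of the whole curve, in decades on the semi-log plot.
shift_decades = math.log10(reported_15_feb / projected_15_feb)

# A shift of d decades multiplies every reported count by 10**d.
corrected = reported_14_apr * 10 ** shift_decades

print(f"shift: {shift_decades:.3f} decades")
print(f"corrected 14 April total: {corrected:,.0f}")
```

The result depends entirely on the single back-projected value of one case on 15 February, so any error in reading the chart carries through as a multiplicative error in the final total.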

This correction suggests that all mortality estimates, hospital utilization estimates, and all other ratios based on total case count in the denominator are overstated by a factor of at least 15.

Moreover, we know that the virus was in the country earlier than 15 February, and projecting further back in time increases the total case count estimate still more.

Correcting for an Earlier Time Zero: Effective Time Zero

This initial correction of the data set leaves us with a time zero of infection of 15 February. The first case in the US was reported on 20 January 2020. Given the lack of testing and awareness, and given that the outbreak in China was well underway and well exported in January, 15 February seems much too late as the initial time of infection in the US. Here we can project the exponential curve back further in time by simply defining an effective time zero, the time at which one case was present in the US. For each earlier date, we get a higher estimate of the number of cases. Using the growth rate of 0.1259 decades per day, each week of back-projection adds 0.8813 decades of growth, a multiplicative factor of 7.609.

If we shift time zero back to 8 February using the back-projection method, this gives a total of 70.1 M cases. Projecting back further (one or more additional weeks) gives unrealistic, unphysical projections (numbers larger than the population of the US). The exercise is enlightening, however, and a later section explores some of the policy implications of this reasoning.
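The effect of moving the effective time zero earlier can be tabulated week by week. This sketch assumes the 0.1259 decades-per-day slope, the 9,208,290 minimal-correction total, and the 330 million population figure used earlier in the text; `shifted_estimate` is a hypothetical helper.

```python
SLOPE = 0.1259               # decades per day, from the 1-19 March estimate
BASELINE = 9_208_290         # minimal-correction total for 14 April (time zero = 15 February)
US_POPULATION = 330_000_000  # population figure used in the text

def shifted_estimate(weeks_earlier, baseline=BASELINE, slope=SLOPE):
    """Projected 14 April total if time zero moves back by the given number of weeks."""
    return baseline * 10 ** (slope * 7 * weeks_earlier)

for weeks in range(3):
    estimate = shifted_estimate(weeks)
    note = "exceeds US population" if estimate > US_POPULATION else "below US population"
    print(f"time zero {weeks} week(s) before 15 February: {estimate / 1e6:,.1f} M ({note})")
```

Each additional week multiplies the estimate by roughly 7.6, which is why the pure exponential projection runs past the population of the country after two weeks.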

Review of Modeling Assumptions

I am making the following assumptions:

  • Early growth of the epidemic follows roughly constant exponential growth until the number of cases grows to be a significant fraction of the population size.
  • The data is sampled sufficiently consistently from 1-19 March and the resulting estimation represents the growth process.
  • The total case data as measured is a roughly constant fraction of the actual total cases (hidden + measured).

All of these assumptions are reasonable, and the projections should be in the ballpark as long as the case count remains a fraction (less than roughly half) of the total population size.
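One way to see where the constant-exponential assumption starts to break down is to compare it with a curve that saturates. The sketch below uses a simple logistic curve as a stand-in for the compartment models mentioned earlier; that choice, the function names, and the single-case starting point are assumptions made for illustration, not part of the analysis above.

```python
SLOPE = 0.1259            # decades per day, from the 1-19 March estimate
POPULATION = 330_000_000  # population figure used in the text

def exponential(day, n0, slope):
    """Unbounded exponential growth from n0 cases."""
    return n0 * 10 ** (slope * day)

def logistic(day, n0, slope, capacity):
    """Logistic growth: same early rate as the exponential, saturating at capacity."""
    return capacity / (1 + (capacity / n0 - 1) * 10 ** (-slope * day))

for day in (30, 50, 60, 66, 70):
    e = exponential(day, 1, SLOPE)
    l = logistic(day, 1, SLOPE, POPULATION)
    print(f"day {day}: exponential at {e / POPULATION:.1%} of population, "
          f"overshoots logistic by a factor of {e / l:.2f}")
```

Under this logistic assumption the overshoot factor works out to 1 + (E - N0)/K, where E is the exponential projection, N0 the starting count, and K the population, so the projection is within about 50% of the saturating curve until it reaches half the population and diverges quickly after that.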

Implications for Personal Behavior & Policy Making

This analysis of the US COVID-19 total case data suggests that there have been many more total cases than have been accounted for in the official data.

  • There are at least 15x to 114x more cases than the raw data reports. Straightforward use and adjustment of the data suggests that COVID cases are undercounted by a factor of 15 and possibly more. Back projecting the data one week suggests a factor of 114x more cases than the data shows.
  • Death rates and hospital utilization rates are grossly overstated. A large number of hidden cases suggests that much of the rationale for shutting economies and suspending civil liberties is overstated. Taking the current ratio of deaths to cases, 26,185/615,895 = 4.25%, and derating by a factor of 15 gives a death rate of roughly 0.3%, comparable to a bad flu. The number is likely to be even lower.
  • Media hype and government action may have frightened people unnecessarily. The virus does kill (so does the flu), and the shut down of organs due to hypoxia is an awful death. Nonetheless, we do not do play-by-play on individual flu deaths in a bad flu year, and the attention paid to this virus is sensationalistic and out of proportion to its danger.
  • You may have already had COVID-19. If you had a cold or flu in January-March and recovered, it is reasonably likely that you had COVID-19 even if there were "few cases" in your state (you may be a hidden case).
  • Antibody testing is important to help quantify the number of hidden cases. The calculations herein are reasonable, but they are derived from a data set that was collected under varying conditions. Careful studies in Germany and elsewhere corroborate the suggestions of this article. We need the antibody data now. The delays and relative lack of urgency placed on antibody testing by our experts was and is a mistake.
  • The data does show slowing of the growth rates following 27 March. More complete modeling that properly accounts for the hidden cases is necessary to understand whether the social distancing and stay-in-place measures or growing herd immunity were responsible. If total case numbers are on the high side, then much of the slowing is due to increasing saturation of the population with infection.
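The death-rate derating in the list above is a change of denominator. The sketch below recomputes the ratio for the reported counts and for the two undercount factors discussed in this article (15x and 114x); `deaths_per_case` is a hypothetical helper, and the calculation assumes that deaths are fully counted while cases are not.

```python
DEATHS = 26_185           # deaths quoted in the text
REPORTED_CASES = 615_895  # reported cases quoted in the text

def deaths_per_case(deaths, reported_cases, undercount_factor=1.0):
    """Ratio of deaths to total cases when reported cases are scaled up by a factor."""
    return deaths / (reported_cases * undercount_factor)

for factor in (1, 15, 114):
    ratio = deaths_per_case(DEATHS, REPORTED_CASES, factor)
    print(f"undercount factor {factor:>3}x: {ratio:.3%} deaths per case")
```

The ratio scales inversely with the assumed undercount, so the conclusion is only as firm as the factor of 15 itself.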

In short, this analysis suggests that cooler heads approach the current situation with facts and data and more common sense. Media and rhetoric should change to match the actual risk of the situation. Individual cases should be treated with best known medical practices to save lives. And we should return to normalcy region by region in a prudent manner as quickly as possible using this cooler understanding of what has actually happened.

David E. Goldberg is perhaps best known as an AI pioneer (genetic algorithms) and for his first book, Genetic Algorithms in Search, Optimization, and Machine Learning (1989); his latest book is A Whole New Engineer: The Coming Revolution in Engineering Education (2014). In 2010, Dave resigned his tenure & a distinguished professorship to work full time for the transformation of higher education. He is a trained leadership coach (Georgetown) and president of ThreeJoy Associates, a change leadership, coaching, training & consulting firm in Douglas, MI. Dave can be reached at [email protected]


Luis R Lopez (BS, BA, MS Physics, CEH, Applied AI)

Applied AI, Cross Platform App Design, Machine Learning, Algorithm Development for Robotic Applications, Software Visualization and Re-Engineering.

4 years ago

I completely agree with your analysis, Dave. Estimating and measuring missing cases will be eye-opening. At least a 20x undercounting factor must be present to explain the current data. Having penetrance numbers sooner would have saved a lot of our economy. Bad mistake not to have begun penetrance testing on day 1.

Dan Heck MA, PMP

Mind Mover- Realize on-time deliverables by leading, teaching and facilitating execution of new or broken projects of all sizes.

4 years ago

How are you planning to get these insights to Dr. Fauci and his team? How does this tracking, which compares to a normal virus, account for the case loads in Italy, Spain, and NYC?
