Notes from Dr. Gene, No. 2: Data, models & predictions

On March 25th, I published my first comments and predictions on the coronavirus outbreak. Many people responded with comments, concerns, and questions. Thank you all very much! Surprisingly, a majority of the responses came via email or private message. Here in Notes Number 2 I will address the most frequently asked questions about my approach, including data, models, and predictions. Before I explain, I want to clarify that my notes are in no way scientific publications. In fact, they are quite the opposite. They are written in (hopefully) plain English, aim to explain sophisticated topics to both experts and non-experts, and do not require a deep grasp of data science.

First, the data. I agree with the many people who raised concerns that current data are limited, noisy, and somewhat untrustworthy, and therefore have to be treated with caution. For example, I couldn’t find data sets on infected cases for individual US metropolitan areas, only state-wide data. Additionally, as the number of individuals being tested increases, we can expect the number of confirmed cases to increase as well. Therefore, some normalization is needed. But this kind of data is not a new problem at all. In fact, I have worked with messy data my entire professional life. This is why I use data from different complementary sources.
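To make the normalization point concrete, here is a minimal sketch of one common choice: dividing new confirmed cases by the number of tests performed on the same day. This is not necessarily the normalization used in these notes, and the function name and the daily figures are hypothetical, chosen only for illustration.

```python
def positivity_rate(new_confirmed, new_tests):
    """Share of tests that came back positive; None when no tests were run."""
    if new_tests == 0:
        return None
    return new_confirmed / new_tests


# Hypothetical (new confirmed cases, tests performed) pairs for three days.
# These are made-up numbers, not real data from any source.
daily_figures = [(50, 500), (80, 1000), (90, 1500)]

rates = [positivity_rate(cases, tests) for cases, tests in daily_figures]

for (cases, tests), rate in zip(daily_figures, rates):
    print(f"{cases} cases from {tests} tests: {rate:.1%} positive")
```

In this toy example the raw case counts rise every day while the share of positive tests falls, which is exactly why raw counts alone can mislead when testing is expanding.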

Models come second and predictions come third. The results from various models are combined to produce more accurate predictions. Let me illustrate this three-step (data-models-predictions) approach using Worldometers data. (Disclosure: I have no financial interests in this site and have no idea who developed it.) The data for the tri-state area (NY, NJ, CT) and WA are attached in the PDF file (see below). I downloaded the data on the evenings of Sunday, March 29th and Saturday, April 4th. For simplicity, let’s look at the Washington data. As you can see, during this week (3/29-4/4), the number of total cases and fatalities grew in WA, but there were no new cases or fatalities on 4/4. This is indicative of a contained situation, where we are getting close to the bell curve’s peak (see 3/25 Notes).

I’m analyzing not only changes in the number of total cases (this rate of change is called the “first derivative”), but also changes in the number of daily new cases (the rate of change of the first derivative is called the “second derivative”). When we start on the bell curve’s left side, the number of new cases grows (both derivatives are positive). Eventually the growth slows down as we reach the curve’s peak, or maximum, where the number of daily new cases stops growing (the second derivative is zero). After that the number of new cases starts to fall: we are over the peak and moving down the curve’s right side (the second derivative is negative, while the first derivative stays positive but shrinks toward zero). And please, don’t forget about the importance of normalization, as mentioned in the paragraph about data above. This is why, despite the still-high number of cases, I am reconfirming my March 25th prediction on containment in Washington.
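For readers who like to see the arithmetic, the two rates of change can be sketched as simple day-over-day differences. This is only an illustration under stated assumptions: the cumulative totals below are made up (they are not the Worldometers figures), the helper name is hypothetical, and real data would need smoothing and normalization first.

```python
def differences(values):
    """Return the day-over-day changes of a list of numbers."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Hypothetical cumulative case totals for eight consecutive days.
total_cases = [100, 120, 150, 190, 225, 250, 265, 272]

# Discrete "first derivative": daily new cases.
new_cases = differences(total_cases)

# Discrete "second derivative": change in daily new cases.
acceleration = differences(new_cases)

# The bell curve of daily new cases peaks where new cases are highest,
# which is where the second difference switches from positive to negative.
peak_index = new_cases.index(max(new_cases))

print("new cases:   ", new_cases)
print("acceleration:", acceleration)
print("peak at position", peak_index, "of the new-cases series")
```

Note that the totals keep rising throughout this toy series; it is the sign of the second difference, not the size of the totals, that signals whether we are before or after the peak.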

Similarly, despite a significant increase in new infections and fatalities in the New York metropolitan area (using tri-state data as a proxy) during the last week, and despite anticipating that the next few weeks will be really bad, I'm reconfirming my prediction from March 25th that containment in the New York metropolitan area will take approximately two months.

After March 25th, some medical experts, officials, and media started talking about a slowdown in Washington, including Seattle mayor Jenny Durkan saying that "we slowed the transmission" (NYT, 3/29). Please let me know what you think. Keep safe and upbeat. We will overcome it.

Eugene Kolker


Levi Shapiro

Founder, mHealth Israel

4 years ago

Thanks Eugene Kolker, PhD for the valuable insights.

Thank you so much for the update, Professor! What I always love about your analytics is your ability to extrapolate from very noisy and incomplete data. This quality is very rare these days: most rely on incoming data as if it were some kind of axiom, so when even a small detail at the data ingress becomes invalid, the entire model falls apart. And that's what differentiates true data scientists like you from a bunch of other "predictors". Thanks a lot for doing this and for publishing it openly! BTW, I really hope more people will start using Sync.MD; then we will be able to supply incomparably higher-quality datasets.

Dave Beier

Center Director at Seattle Children's

4 years ago

Well maybe, except WA department of health says the data for the previous week is not up to date. And UW hospital system inpatients continue to increase (albeit slowly). So maybe.
