CoVid-19 Data Unleashed
There’s significant demand for understanding the trajectory of CoVid-19, driven not by a lack of data but by a strong desire to make sense of that data. Such a vacuum can bring armchair epidemiologists out of the woodwork. Worse, biased data analysis and skewed conclusions (see New York Times article) can drive political agendas. The very audacity!
We’re all in this together, regardless of politics. Present times should be an opportunity to learn something useful if only to get through with our sanity intact, but also to prepare for the inevitable next CoVid. In my humble opinion (as they say), we in the burgeoning data community need to do better.
I think it’s time we have that little talk about epidemiology. For example, take the concept of prevalence. The number of cases of CoVid-19 in a country or state is vital, but to compare two or more areas, we need prevalence. Simply put, prevalence is the number of cases per 100,000 population. For example, as of April 4, New York State had reported 90,279 cases to the CDC, which calls these “confirmed or presumptive positive” cases. (Caveats: No number is without issues. Doubtless, there are more cases out there. Not all states report consistently. Testing ramps up at different rates, and so forth.) And we’re trying to measure a fast-moving target.
Still, New York State reported 90,279 cases as of April 4. Divide that by a population of about 19.5 million and multiply by 100,000 for a CoVid-19 prevalence of 464, the highest among these United States. Now we can properly compare states without the unnecessary fog of population size. The top six states, ranked by CoVid-19 prevalence as of April 4, are New York (464), New Jersey (251), Louisiana (138), Massachusetts (112), Michigan (108), and Connecticut (100). Ultimately, prevalence is a measure of the relative “burden of the disease” on society.
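For the DIY-minded, the arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name is made up for this example, and the inputs are the figures quoted in the text. With the rounded population of 19.5 million the result comes out near 463; the 464 quoted above presumably reflects a more precise population figure.

```python
def prevalence_per_100k(cases: int, population: int) -> float:
    """Return the number of cases per 100,000 population."""
    return cases * 100_000 / population


# New York State as of April 4, using the figures quoted in the text.
# The population is the rounded 19.5 million, so the result is approximate.
ny_prevalence = prevalence_per_100k(90_279, 19_500_000)
print(f"New York: about {ny_prevalence:.0f} cases per 100,000")

# Prevalence values quoted in the text, deliberately listed out of order.
quoted = {
    "Michigan": 108,
    "New York": 464,
    "Connecticut": 100,
    "Louisiana": 138,
    "New Jersey": 251,
    "Massachusetts": 112,
}

# Rank states from highest to lowest prevalence.
ranked = sorted(quoted.items(), key=lambda item: item[1], reverse=True)
for state, value in ranked:
    print(f"{state}: {value}")
```

Because every state is expressed per 100,000 people, the ranking no longer depends on how big each state is.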
There are two kinds of prevalence: point prevalence and period prevalence. Point prevalence is the number of cases per 100,000 population on a given day, for example. Point prevalence for a given year permits year-over-year time series trending. Period prevalence, on the other hand, is the cumulative number of cases per 100,000 population over a period of time, say from January to April. The CDC-based numbers above represent period prevalence from January 21 (date of the first confirmed case in the US) to the present. Most cases during that period emerged and resolved, but they still count. So for tracking CoVid-19 now, period prevalence seems the more useful number, especially for comparing the phases (dormancy, take-off, slow-down, decline) of the disease from one state to another.
Prevalence has a close cousin, social distance aside: incidence. We all see the charts showing the number of new cases in a day. That’s crucial for assessing mitigating measures like “hand washing” and “stay-at-home” orders. But for proper state-to-state comparisons, we need to know the number of new cases per 100,000 population (ideally, the healthy at-risk population), that is, the daily incidence rate. Changes in daily incidence rates tell us if mitigation measures are working. Because incidence records new cases per 100,000 people, it is also a measure of the risk of contracting the disease. So prevalence is the burden; incidence is the risk.
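To make the distinction concrete, here is a rough Python sketch of a daily incidence rate computed from a running total of cases. The function name and the numbers are invented purely for illustration; they are not real state data. The sketch also uses total population as the denominator, a simplification of the “healthy at-risk population” mentioned above.

```python
def daily_incidence_per_100k(cumulative_cases, population):
    """Return new cases per 100,000 population for each day after the first.

    cumulative_cases is a list of running totals, one entry per day.
    """
    rates = []
    for yesterday, today in zip(cumulative_cases, cumulative_cases[1:]):
        new_cases = today - yesterday  # cases first reported on this day
        rates.append(new_cases * 100_000 / population)
    return rates


# Hypothetical running totals for three consecutive days (not real data).
example_cumulative = [100, 150, 225]
example_population = 1_000_000

rates = daily_incidence_per_100k(example_cumulative, example_population)
print(rates)
```

A falling sequence of daily rates would suggest mitigation is working, even though the cumulative count keeps climbing.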
Flattening the curve implies lowering the incidence rate. Prevalence will increase as long as more people contract the disease, but if daily new cases start to decline, lower incidence rates driven by mitigation measures will be responsible. Let’s look at some data! This is a DIY analysis, by the way.
This link takes you to the data on Tableau Public, which permits user interaction. For example, you can sort the states by CoVid-19 prevalence and by the percent change in prevalence. Note that Louisiana, Idaho, and South Dakota have the highest one-day percent change in prevalence. That’s only one data point, of course, so it is not necessarily a trend. Indexes are calculated to compare states relative to the US as a whole. Maps facilitate a more spatial perspective on the data. I prefer two-variable circle maps that show, for example, size representing prevalence and color representing change. The data will be updated periodically to permit analysis of incidence rate trending by state.
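For readers who want to reproduce the two derived columns, a minimal Python sketch follows. It rests on assumptions: the exact index formula is not spelled out here, so the sketch takes the common convention that the US as a whole equals 100, and all names and numbers are hypothetical placeholders rather than values from the Tableau workbook.

```python
def percent_change(today: float, yesterday: float) -> float:
    """One-day percent change between two prevalence values."""
    return (today - yesterday) * 100 / yesterday


def index_vs_us(state_prevalence: float, us_prevalence: float) -> float:
    """Index of a state relative to the US, where 100 means equal to the US.

    Assumption: this is one common way to build such an index, not
    necessarily the formula used in the workbook.
    """
    return state_prevalence * 100 / us_prevalence


# Hypothetical prevalence values (cases per 100,000), not real data.
state_yesterday = 100.0
state_today = 110.0
us_today = 55.0

change = percent_change(state_today, state_yesterday)
index = index_vs_us(state_today, us_today)
print(f"One-day change: {change:.1f}%")
print(f"Index vs. US: {index:.0f}")
```

An index above 100 would mean the state carries a heavier burden than the country as a whole; below 100, a lighter one.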
Thank you and be well…