Combating COVID-19 with Data: A Case for Moving Averages

As COVID-19 takes an unspeakable toll on the world with more than 5 million infected, and as medical professionals, policymakers, pundits, and others attempt to make sense of this pandemic, I’m at least encouraged (as a data scientist) by the central role of data in ongoing discussions. Newscasts, articles, congressional briefings, and other press briefings are replete with references to “bending the curve,” case rates, derivatives, and even multivariate regression. So, as the nation sprints toward the grim inevitability of surpassing 100,000 COVID-19 deaths, I wanted to dive into the data—specifically, to examine how moving averages are being implemented and how they improve our understanding of case rates and death rates.

Moving averages (also known as rolling averages) are commonly employed to smooth data without overtaxing the brain with complex algorithms; you can compute them in your head! For example, I binged 14 hours of The Office on Netflix two days ago, 12 hours of Arrested Development yesterday, and 10 hours of Archer today, so yesterday’s three-day moving average is 12 hours of viewing time. Moving averages are especially beneficial for time series data, which may vary predictably (such as by season or weekday), as well as for data that vary because of data collection or reporting practices, omissions, or errors. COVID-19 data meet both criteria, so it’s no surprise that moving averages (and especially the 7-day moving average) are commonly reported for both daily COVID-19 case and death rates.
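
The binge-watching arithmetic can be sketched in a few lines of Python. This is a minimal illustration rather than the scripts used for this analysis; the function name is hypothetical, and the window is assumed to be centered on the day in question (consistent with the May 18th average later described as spanning May 15th to May 21st).

```python
def centered_moving_average(values, window):
    """Centered moving average; days without a full window are None."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        if i - half < 0 or i + half >= len(values):
            smoothed.append(None)  # not enough neighbors on one side
        else:
            smoothed.append(sum(values[i - half : i + half + 1]) / window)
    return smoothed

# Viewing hours: two days ago, yesterday, today (from the example above).
hours = [14, 12, 10]
print(centered_moving_average(hours, 3))  # [None, 12.0, None]
```

Only yesterday has a neighbor on both sides, so it is the only day with a three-day average.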

Heartfelt thanks to Johns Hopkins University (JHU) for creating and continually updating the COVID-19 GitHub repository (https://github.com/CSSEGISandData/COVID-19) whose raw data comprised the substrate of this analysis. JHU simultaneously maintains the COVID-19 international tracker (https://coronavirus.jhu.edu/map.html), and both resources continue to be instrumental to the researchers, epidemiologists, medical professionals, and others combating the spread of the virus. 

Moving Averages in US Analysis

The cumulative total of COVID-19 cases in the US appears relatively smooth from a distance, in part due to the high number of cases and in part due to the nearly three months of data that are represented. The following graph depicts cumulative US cases from March 1st through May 21st.

[Figure: Cumulative US COVID-19 cases, March 1st through May 21st]

Despite the clear curvilinear trend in cumulative cases, COVID-19 daily case rates—the first derivative of the cumulative case count—demonstrate a scalloped pattern not readily discerned from the previous cumulative graph. Moving averages help smooth these undulations, with 3-, 5-, and 7-day moving averages displayed in the following graph.

[Figure: US daily COVID-19 case rates with 3-, 5-, and 7-day moving averages]

Note that the smoothness of a moving average increases commensurately with the duration of data that are included in the average. Thus, the 7-day moving average removes more variance (and is smoother) than the 3- and 5-day moving averages.
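
To illustrate why the longer window is smoother, the following sketch derives daily counts from a cumulative series and compares 3-, 5-, and 7-day centered averages. The weekly pattern is hypothetical (it is not JHU data), and the helper names are assumptions made for this example.

```python
from itertools import accumulate

def daily_from_cumulative(cumulative):
    """Day-over-day differences (one fewer value than the input)."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]

def centered_moving_average(values, window):
    """Centered moving average over full windows only (edges are dropped)."""
    half = window // 2
    return [sum(values[i - half : i + half + 1]) / window
            for i in range(half, len(values) - half)]

def spread(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

# Hypothetical reporting pattern (Monday..Sunday), repeated for four weeks.
weekly_pattern = [100, 120, 130, 130, 120, 60, 40]
cumulative = list(accumulate(weekly_pattern * 4))
daily = daily_from_cumulative(cumulative)

for window in (3, 5, 7):
    smoothed = centered_moving_average(daily, window)
    print(window, round(spread(smoothed), 1))
# 3 60.0
# 5 32.0
# 7 0.0
```

Because every 7-day window contains exactly one of each weekday, a purely weekly pattern cancels out completely, which is one reason the 7-day window is a natural choice for data with weekend reporting dips.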

COVID-19 death rates, like case rates, demonstrate a scalloped effect, with the one-week periodicity evident when Saturdays and Sundays are highlighted (in silver).

[Figure: US daily COVID-19 death rates, with Saturdays and Sundays highlighted in silver]

Periodic trends should be identified within time series data to facilitate analysis and interpretation. For example, a catchy, clickbait headline might read “US COVID-19 Death Rate Doubles May 18th to May 19th,” and although mathematically accurate (at least according to the JHU raw data), this irresponsible headline would fail to capture the downward trend over the past several weeks. Yes, May 17th recorded 89,562 cumulative US deaths, May 18th 90,347, and May 19th 91,921, so the reported daily death rate did increase from 785 to 1,574 in one day. The 7-day moving average, by contrast, clearly represents the downward trend.
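
The headline arithmetic is simply a pair of day-over-day differences of the cumulative counts quoted above, as this short sketch shows (the variable names are illustrative only).

```python
# Cumulative US deaths as quoted above (JHU raw data).
cumulative_deaths = [("May 17", 89562), ("May 18", 90347), ("May 19", 91921)]

# Daily deaths are the differences between consecutive cumulative counts.
daily_deaths = {
    later_day: later - earlier
    for (_, earlier), (later_day, later) in zip(cumulative_deaths, cumulative_deaths[1:])
}
print(daily_deaths)  # {'May 18': 785, 'May 19': 1574}

# Ratio of the two daily counts: the "doubling" in the hypothetical headline.
ratio = daily_deaths["May 19"] / daily_deaths["May 18"]
```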

Another way to evaluate the time series nature of COVID-19 data is to group daily rates by weekday. For example, during the six-week period spanning April 5th to May 16th, the influence of weekday is remarkable, with 15,745 US deaths reported on Thursdays and only 50 percent of that total (N=7,880) reported on Mondays.

[Figure: US COVID-19 deaths by weekday, April 5th through May 16th]

Although it is possible that weekday does influence death rate (or death rate reporting) in some measurable fashion, the majority of this observed variance can likely be attributed to data collection and reporting artifacts—the bottlenecks or breakdowns in the COVID-19 data pipeline, such as personnel who are understandably unavailable or working at a reduced capacity during the weekend.
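
Grouping by weekday, as described above, might look something like the following sketch. The daily counts here are hypothetical placeholders (not the JHU figures), chosen only so that the Monday total is half the Thursday total; only the date range comes from the text.

```python
from collections import defaultdict
from datetime import date, timedelta

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Hypothetical deaths reported per weekday (index 0 = Monday), NOT JHU data.
hypothetical_by_weekday = [1300, 2300, 2400, 2600, 2200, 1900, 1400]

start = date(2020, 4, 5)  # a Sunday; 42 days ends on Saturday, May 16th
days = [start + timedelta(days=i) for i in range(42)]
daily_deaths = [(day, hypothetical_by_weekday[day.weekday()]) for day in days]

# Sum the daily counts into one bucket per weekday.
totals = defaultdict(int)
for day, count in daily_deaths:
    totals[WEEKDAY_NAMES[day.weekday()]] += count

for name in WEEKDAY_NAMES:
    print(name, totals[name])
```

Because the period spans exactly six weeks, each weekday contributes six observations, so the weekday totals are directly comparable.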

Moving Averages in State-Level Analysis

Diving further into the COVID-19 data can help elucidate how states and regions contribute to time series variance observed at the national level, further making the case to incorporate moving averages (or some other method of data smoothing) into analyses and decision-making.

Some spikes and troughs in case and death rate may reflect “reality”—at least insofar as reality can be assessed given the paucity of testing in the US, and with testing rates varying dramatically by state and region. Notwithstanding the pervasively insufficient COVID-19 testing in the US at present, case rates can still approximate state and regional trends, with this approximation improving as testing penetration increases.

Indiana, for example, recorded its highest COVID-19 case rate (949 cases) on April 27th, with the cumulative number of cases jumping from 15,012 to 15,961 in one day.

[Figure: Indiana daily COVID-19 case rates]

A quick Google search yields the April 27th IndyStar article “Cass County [Indiana] accounts for nearly half of new coronavirus cases in Indiana Monday,” which states that “Cass County – home of Tyson Food's Logansport pork processing plant – saw 439 new cases of the novel coronavirus in Monday's count.” (https://www.indystar.com/story/news/environment/2020/04/27/cass-county-coronavirus-cases-spike-county-home-meat-plant/3033246001/)

Although the April 27th data spike reflects an actual spike in Indiana recorded cases, the 7-day moving average distributes this spike, reflecting the reality that cases likely would have occurred over a period of days.

In other cases, spikes or troughs in recorded case and death rates may not reflect reality but rather an artifact of data collection or reporting—underrepresenting some weekdays and overrepresenting others. This typically occurs when one or more sites, offices, agencies, or organizations in the COVID-19 testing pipeline—from swabbing patients to JHU data aggregation and reporting—is closed, has reduced operations, or is otherwise unable to process throughput fully.

For example, California daily death rates typically show lower-than-average Saturday rates and substantially reduced Sunday rates. These undulations most likely reflect one or more closures in the COVID-19 testing pipeline (in one or more counties), rather than actual variance in California death rates. Thus, the 7-day moving average smooths out this artifact, reflecting a more realistic death rate at any point in time—again, insofar as testing has identified COVID-19 related deaths.

[Figure: California daily COVID-19 death rates with 7-day moving average]

Finally, in still other cases, observed variance in time series data may be an artifact of data aggregation and reporting: essentially, a failure to report cases, which results in missing data within the JHU GitHub repository. For example, Massachusetts cumulative case counts (as reported by JHU) are identical for April 19th and 20th, implying zero new cases on April 20th, with a suspicious spike in cases on April 24th. Hmm…

[Figure: Massachusetts daily COVID-19 case rates as reported by JHU]

The simplest way to demonstrate that these data are incorrect is to first locate Suffolk County, Massachusetts (in which Boston is located) within the JHU GitHub repository for both April 19th and 20th. For April 19th, JHU records the following raw data:

25025,Suffolk,Massachusetts,US,2020-04-20 23:36:47,42.3279514,-71.07850442,8074,236,0,7838,"Suffolk, Massachusetts, US"

(https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/04-19-2020.csv)

JHU records identical raw data on April 20th, representing a duplication—i.e., an omission of new values:

25025,Suffolk,Massachusetts,US,2020-04-20 23:36:47,42.3279514,-71.07850442,8074,236,0,7838,"Suffolk, Massachusetts, US"

(https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/04-20-2020.csv)
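
A duplicated record of this kind can be flagged programmatically. The sketch below parses the two Suffolk County rows shown above and checks whether the later report merely repeats the earlier one. The column names are an assumption about the JHU daily-report layout at the time, and the helper functions are hypothetical.

```python
import csv
import io

# Assumed column names for the JHU daily reports of this period.
HEADER = ("FIPS,Admin2,Province_State,Country_Region,Last_Update,"
          "Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key")

ROW_APRIL_19 = ('25025,Suffolk,Massachusetts,US,2020-04-20 23:36:47,42.3279514,'
                '-71.07850442,8074,236,0,7838,"Suffolk, Massachusetts, US"')
ROW_APRIL_20 = ('25025,Suffolk,Massachusetts,US,2020-04-20 23:36:47,42.3279514,'
                '-71.07850442,8074,236,0,7838,"Suffolk, Massachusetts, US"')

def parse_row(line):
    """Parse one raw CSV line into a dict keyed by the assumed header."""
    return next(csv.DictReader(io.StringIO(HEADER + "\n" + line)))

def looks_stale(previous, current):
    """True when the later report repeats the earlier timestamp and counts."""
    return all(previous[key] == current[key]
               for key in ("Last_Update", "Confirmed", "Deaths"))

april_19 = parse_row(ROW_APRIL_19)
april_20 = parse_row(ROW_APRIL_20)
new_cases = int(april_20["Confirmed"]) - int(april_19["Confirmed"])
print(new_cases, looks_stale(april_19, april_20))  # 0 True
```

An unchanged timestamp together with unchanged counts points to a carried-forward record rather than a genuine day with no new cases.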

The data are further shown to be incorrect through inspection of the Massachusetts Department of Public Health COVID-19 Dashboard for April 20th, which confirms 1,705 new cases on April 19th (matching JHU data) and additionally shows 1,566 new cases on April 20th (whereas JHU data reflect 0 new cases). (https://www.mass.gov/doc/covid-19-dashboard-april-20-2020/download)

[Figure: Massachusetts Department of Public Health COVID-19 Dashboard for April 20th]

Given these missing data and the replacement of these data subsequently on April 24th, the 7-day moving average smooths Massachusetts data to present a more accurate trendline—despite the messy data.

Far from being an uncommon occurrence, more than 30 states demonstrate one or more similar shifts in data—in which the case rate or death rate reported for the entire state (within the JHU GitHub repository) is missing for one or more dates, immediately followed by a spike that (presumably) includes these missing data.

To date, the most egregious case (at the state level) of shifted COVID-19 data involves the Maryland death rate, with JHU reporting 736 cumulative Maryland deaths as of April 29th, 893 deaths as of April 30th, an unrealistic 1,730 deaths as of May 1st, and a reduced 1,001 deaths as of May 2nd. This extreme fluctuation, as well as the robustness of the 7-day moving average to overcome these degraded data, is demonstrated in the following graph.

[Figure: Maryland daily COVID-19 death rates with 7-day moving average]
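
The Maryland figures quoted above make the problem easy to see once the cumulative counts are differenced. The following sketch (with illustrative variable names) flags any day on which a cumulative count falls, which cannot happen with correctly reported cumulative deaths.

```python
# Cumulative Maryland deaths as quoted above (JHU raw data).
cumulative = [("April 29", 736), ("April 30", 893), ("May 1", 1730), ("May 2", 1001)]

# Day-over-day change in the cumulative count.
changes = [
    (later_day, later - earlier)
    for (_, earlier), (later_day, later) in zip(cumulative, cumulative[1:])
]
print(changes)  # [('April 30', 157), ('May 1', 837), ('May 2', -729)]

# A cumulative total should never fall, so negative changes mark suspect data.
suspect_days = [day for day, change in changes if change < 0]
print(suspect_days)  # ['May 2']
```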

More advanced techniques for data interpolation of missing values exist but are not discussed herein. It also goes without saying that the preferred method would be to fix the data at their origin—in this case, the JHU GitHub repository—so that downstream analyses could benefit from more accurate data. The JHU GitHub repository reveals, however, that the “Last Updated” date is typically unchanged after the initial data upload, as demonstrated in the following screenshot. This likely represents a conscious decision by JHU to correct case rates and death rates going forward rather than retroactively.

[Figure: Screenshot of the JHU GitHub repository showing “Last Updated” dates]

So Are State Case Rates Increasing or Decreasing?

With the majority of states now reducing professional and personal restrictions, COVID-19 case rates and death rates—among other metrics—are not only commonly discussed but also heavily utilized in policymaking. Irrespective of whether you agree or disagree with a particular federal, state, or local policy, it’s important to understand how interpretation of COVID-19 daily incidence can be skewed, especially when only raw data are consulted.

For example, the White House’s April 16th “Opening Up America Again” guidelines (https://www.whitehouse.gov/openingamerica/) propose “gating criteria” that states were expected to satisfy before beginning to reopen. Specifically, one criterion includes a “Downward trajectory of documented cases within a 14-day period” prior to “proceeding to Phased Comeback.” Note that this gating criterion can alternatively be satisfied by a “Downward trajectory of positive tests as a percent of total tests within a 14-day period (flat or increasing volume of tests).” However, this alternative criterion is not discussed herein, as this analysis examines only case and death rates irrespective of COVID-19 testing rates.

One method to investigate this single White House gating criterion could examine—for each day—whether the case rate increased or decreased for a particular region. For example, North Carolina COVID-19 case rates are demonstrated in a 15-day period between May 2nd and May 16th, with the raw case rate values shown above each bar.

[Figure: North Carolina daily COVID-19 case rates, May 2nd through May 16th, with raw values shown above each bar]

May 2nd to May 3rd represents a decrease in case rate (from 518 to 182), May 3rd to May 4th an increase in case rate (from 182 to 201), and so forth. This analysis demonstrates that the North Carolina case rate increased on seven days and decreased on seven days. A headline referencing these metrics would be mathematically accurate, yet grossly fail to represent that the daily case rate is consistently trending up during this period, as indicated by the 7-day moving average, which smooths the variance demonstrated in the raw case rate data.

Thus, a more useful representation of case rate trends could be to evaluate whether the 7-day moving average for each day increased, decreased, or remained the same. The following graph demonstrates that the moving average for North Carolina’s daily case rate increased for 71.4 percent (N=10) of the 14 days, with the values of the 7-day moving average appearing above this trendline. Again, to revisit this single White House gating criterion, a two-week downward trend in case rates is required—which these data do not demonstrate.

[Figure: 7-day moving average of North Carolina daily case rates, with values shown above the trendline]
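
The contrast between counting raw day-over-day changes and counting changes in the 7-day moving average can be sketched as follows. The series below is hypothetical (a steady upward trend plus a weekly reporting pattern), not North Carolina's actual data, and the function names are assumptions made for this example.

```python
def centered_moving_average(values, window=7):
    """Centered moving average over full windows only (edges are dropped)."""
    half = window // 2
    return [sum(values[i - half : i + half + 1]) / window
            for i in range(half, len(values) - half)]

def count_changes(values):
    """Return (increases, decreases, unchanged) across consecutive days."""
    pairs = list(zip(values, values[1:]))
    ups = sum(1 for earlier, later in pairs if later > earlier)
    downs = sum(1 for earlier, later in pairs if later < earlier)
    return ups, downs, len(pairs) - ups - downs

# Hypothetical data: 10 more cases per day, plus weekday offsets that sum to zero.
weekday_offsets = [0, 60, 80, 60, 0, -120, -80]
raw = [300 + 10 * day + weekday_offsets[day % 7] for day in range(21)]

smoothed = centered_moving_average(raw)  # 15 values, covering days 3..17
raw_same_days = raw[3:18]                # the same 15 days, unsmoothed

print(count_changes(raw_same_days))  # (8, 6, 0)
print(count_changes(smoothed))       # (14, 0, 0)
```

In this constructed example the raw series looks nearly balanced between rising and falling days, while the smoothed series rises on all 14 days, which is the same kind of discrepancy described above.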

But there is good news in the mix, as New York, devastated by COVID-19 like no other state, reported decreases in daily case rate for 93 percent (N=13) of days over the past two weeks!

[Figure: New York daily COVID-19 case rates over the past two weeks]

Moreover, when this analysis is expanded from 14 to 30 days, the New York case rate moving average decreased on 90 percent (N=27) of those days.

[Figure: New York daily COVID-19 case rates over the past 30 days]

With these two examples demonstrating that some states are trending toward higher case rates and others toward lower, a natural comparison among all states is warranted. The following graph demonstrates how all states compare in regard to the number of days of decreased cases over the past two weeks. However, because raw data have been utilized, the graph is flattened, with the majority of states appearing to have relatively stable rates (that is, case rates that are increasing about as many days as decreasing over a two-week period).

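[Figure: Number of days of decreased raw case rates by state over the past two weeks, shaded by 7-day average cases per million persons on May 18th]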

States to the left represent those whose daily case rates are most consistently decreasing (although not necessarily most steeply decreasing), whereas states to the right represent those that have increasing case rates more days than not. Shading, by contrast, represents the 7-day average case rate on May 18th per million persons per state, with darker shading representing those states with the highest per capita incidence of COVID-19. The averages appearing above the bars correspond to this shading.

For example, although New York has one of the most consistently downward-trending raw case rates among states (and thus appears to the left of the graph), it nevertheless had 98 cases per million persons, its 7-day average as calculated for May 18th (which includes data spanning May 15th to May 21st). Because raw case rates mute actual state trends, however, 7-day moving averages can be used in lieu of raw case rates.
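
The dates covered by a centered 7-day average follow directly from the window size, as this small sketch shows (the function name is hypothetical).

```python
from datetime import date, timedelta

def centered_window(center, window=7):
    """Return the first and last dates covered by a centered moving average."""
    half = window // 2
    return center - timedelta(days=half), center + timedelta(days=half)

first, last = centered_window(date(2020, 5, 18))
print(first, last)  # 2020-05-15 2020-05-21
```

This also explains why data through May 21st are needed to report a 7-day average for May 18th.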

The updated graph now counts the number of days of decreasing case rates (as measured by 7-day moving averages) in the same two-week period. Note that the number of states reporting seven days rising and seven days decreasing has been reduced from 17 to 11. Thus, the following graph helps eliminate variance to better reveal upward and downward trends in the data.

[Figure: Number of days of decreased case rates (7-day moving average) by state over the same two-week period]

Note that the lightest shaded states (e.g., Montana, Hawaii, Alaska, Vermont) may have daily case rates so low that their moving averages are rendered meaningless. For example, Montana appears to the far right, indicating that its case rate decreased on only three of the past 14 days; however, this was due in large part to Montana having several days in which its raw case rate was 0, and Montana’s 7-day moving average on May 18th (shown on the next graph) was an astoundingly low 2.4 cases per day! Given that Montana has 1,068,778 inhabitants, according to the 2019 US Census estimate, this equates to the 2.25 cases per million inhabitants (rounded to 2) shown above Montana’s graph.

[Figure: Montana daily COVID-19 case rates with 7-day moving average]
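
The per capita conversion for Montana is a one-line calculation, sketched here with the figures quoted above (the function name is illustrative).

```python
def cases_per_million(daily_average, population):
    """Scale a daily case average to cases per one million inhabitants."""
    return daily_average / population * 1_000_000

# Montana: 7-day moving average and 2019 US Census population estimate.
montana_rate = cases_per_million(2.4, 1_068_778)
print(round(montana_rate, 2))  # 2.25
print(round(montana_rate))     # 2
```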

As case and death rates decline toward zero, moving averages (as well as other measures of rate) can become too sensitive to change. For this reason, subsequent geospatial analyses depict states with low (per capita) case rates in gray to guard against false spikes. Notwithstanding these caveats, moving averages provide a straightforward method to smooth unwanted variance, elevating the signal-to-noise ratio so that trends can be revealed.

Let’s Get GEO

Viewing states geographically can also demonstrate where the nation is in its COVID-19 containment and abatement. The following choropleth depicts the number of days in a two-week period for which the state COVID-19 case rate (7-day moving average) increased, with reds indicating states with expanding case rates, and green indicating states with decreasing case rates. Note that Montana, Alaska, Hawaii, and Vermont are shown in gray due to their low per capita case rates, as described previously.

[Figure: Choropleth of US states by number of days of increased case rates (7-day moving average) over a two-week period]

This is encouraging news for the states in green, especially considering that many were yellow to red in previous weeks. For example, expanding the data range from a two-week to a 30-day period reveals states such as Washington, which has turned from yellow to green, and Arizona, which has turned from burnt sienna to yellow. Color ramping is consistent in these and subsequent national, state, and county maps to facilitate comparison, with color proportional to the “percent of days of increased case rates over [X] days.”

[Figure: Choropleth of US states by number of days of increased case rates (7-day moving average) over a 30-day period]

Comparison between the two maps can also, unfortunately, help identify states that may be trending upward after an initial decline—in which their 30-day case rate is better than their more recent two-week case rate. Florida, for example, trends green when viewing the past 30 days of case rate increases and decreases, whereas now it is trending yellow in the current two-week period. Examining daily case rates helps show that the state’s COVID-19 case rate has plateaued and may again be rising; this could be due to increased spread, although increased testing penetration could also cause case rates to rise.

[Figure: Florida daily COVID-19 case rates]

Diving further into Florida's data, the county level provides more granular detail about which specific regions are experiencing more days of increased versus decreased (or stable) case rates, again utilizing the 7-day moving average as a more accurate depiction of the case rate at a point in time. This map depicts Florida case rate averages from May 4th through May 18th, thus incorporating data from May 1st through May 21st. As before, counties whose May 18th case rates (7-day moving average) fell below five cases per million inhabitants (based on US Census 2019 county population estimates) are shown in gray.

[Figure: Florida county-level map of days of increased case rates (7-day moving average), May 4th through May 18th]
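
The gray masking rule for low-incidence counties could be expressed along the following lines. This is only a sketch: the threshold of five cases per million comes from the text above, but the function, the return values, and the sample counties are hypothetical and are not taken from the mapping scripts.

```python
LOW_INCIDENCE_THRESHOLD = 5.0  # cases per million inhabitants (from the text)

def map_value(days_increased, total_days, cases_per_million,
              threshold=LOW_INCIDENCE_THRESHOLD):
    """Return 'gray' for low-incidence regions, else the share of days that increased."""
    if cases_per_million < threshold:
        return "gray"  # moving average too small to be meaningful
    return days_increased / total_days

# Hypothetical counties: (days of increase in 14 days, cases per million).
sample = {
    "County A": (10, 80.0),
    "County B": (3, 45.0),
    "County C": (9, 2.1),
}
for name, (days_up, rate) in sample.items():
    print(name, map_value(days_up, 14, rate))
```

The returned share of days can then be passed to whatever color ramp the map uses, so that the same proportion always receives the same color across maps.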

These data demonstrate that several Florida counties do have increasing COVID-19 daily case rates, as reported over the recent two-week period.

Yet Florida is not alone, as at the national level, numerous states are experiencing pockets of increasing case rates despite others clearly bending the curve.

[Figure: US county-level map of days of increased case rates (7-day moving average)]

Arizona has several counties with rising COVID-19 case rates, in part due to COVID-19 sadly wreaking havoc on Native Americans and reservations. 

[Figure: Arizona county-level map of days of increased case rates (7-day moving average)]

Densely populated Southern California has COVID-19 case rates that continue to rise, despite the state’s long-running shutdown. These rising rates are contrasted with California's less densely populated Central Coast.

[Figure: California county-level map of days of increased case rates (7-day moving average)]

Minneapolis, Minnesota, and its surrounding counties are also seeing a surge in the number of days (in the past two weeks) for which the case rate (7-day moving average) increased.

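[Figure: County-level map of Minneapolis, Minnesota, and surrounding counties showing days of increased case rates (7-day moving average)]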

Zooming out to the Minnesota state level, this increase is discernible as the state’s case rates continue to trend up.

[Figure: Minnesota daily COVID-19 case rates]

Concluding Thoughts

As phased reopening continues, and as the fierce debate rages on about whether, how, and how fast to reopen businesses, release restrictions, and return to some semblance of normalcy, there is at least a consensus that these decisions should be data-driven, and informed to a large degree medically by COVID-19 case rates, per capita case rates, and state and local trends. Although this analysis has only scratched the surface of the available data, it demonstrates the benefit of utilizing data smoothing techniques (such as a 7-day moving average) to more contextually represent case and death rates, as well as the risk of drawing specious conclusions when only raw data are relied upon.

Data Sources and Analysis

All COVID-19 data were downloaded from the Johns Hopkins University (JHU) COVID-19 GitHub repository (https://github.com/CSSEGISandData/COVID-19) on 5-22-2020 and reflect data current as of 5-21-2020.

State-level shapefiles were downloaded from the US Department of Transportation (https://data-usdot.opendata.arcgis.com/datasets/states) and rendered in Python using the Matplotlib, Shapely, and Geopandas modules.

County-level shapefiles were downloaded from the US Department of Transportation (https://data-usdot.opendata.arcgis.com/datasets/counties) and rendered in Python using the Matplotlib, Shapely, and Geopandas modules.

State-level population statistics relied on US Census 2019 estimates (https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/state/detail/)

County-level population statistics relied on US Census 2019 estimates (https://www.census.gov/data/tables/time-series/demo/popest/2010s-counties-total.html)

All data download, cleaning, aggregation, transformation, and visualization were performed in Python 3.7 through automated scripts that generate results and 4K-resolution data products nightly when the JHU GitHub repository is updated. These scripts scale to produce thousands of graphs and maps daily, of which a handful were selected for this analysis.
