County-less COVID-19: How “Unassigned” and “Out of State” Counties Are Impeding JHU Data Analysis

On March 22, Johns Hopkins University (JHU) COVID-19 “Daily Reports” CSV files (https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports) began incorporating county-level FIPS codes for all US COVID-19 cases and deaths. Despite this update and tremendous improvement in data granularity, thousands of US cases (occurring after March 22) nevertheless lack a valid FIPS code, and thus cannot be attributed to a specific county. This degradation of data impedes COVID-19 analysis, especially for those states having higher levels of missing counties, including Georgia, Michigan, Rhode Island, New Mexico, Arkansas, Tennessee, Indiana, and Louisiana.

This analysis examines cases for which the COVID-19 county (i.e., Admin2 column in JHU data) is listed as “Unassigned” or “Out of State” (e.g., “Out of GA,” “Out of MI”), or otherwise contains a value that cannot be attributed to a specific county-level FIPS code. As of June 14, 14,386 COVID-19 cases and 1,345 COVID-19 deaths lack a county.

Heartfelt thanks to JHU for creating and continually updating the COVID-19 GitHub repository (https://github.com/CSSEGISandData/COVID-19) whose raw data comprised the substrate of this analysis. JHU simultaneously maintains the COVID-19 international tracker (https://coronavirus.jhu.edu/map.html), and both resources continue to be instrumental to the researchers, epidemiologists, medical professionals, and others combating the spread of the virus.

“Daily Reports” File Format and Unknown Categories

The file format for the “Daily Reports” CSV files has changed several times over the course of the pandemic, as described in two prior posts that interrogate the data quality of JHU COVID-19 CSV files: https://www.dhirubhai.net/posts/troy-hughes-27a998a8_covid-19-jhu-daily-reports-data-quality-activity-6671173160479547392-qUWD and https://www.dhirubhai.net/posts/troy-hughes-27a998a8_jhu-daily-reports-covid-19-longitudinal-activity-6672149643671035904-fKAM. Most notable for US county identification, the FIPS column was added on March 22; before that date, county was typically not listed for COVID-19 cases and deaths.

After March 22, many FIPS values continued to be omitted, or in some cases, a valid county name was replaced with one of two arbitrary constructs: “Unassigned” or “Out of State.” Unassigned records denote “Unassigned” in the Admin2 column, and in earlier CSV files, these records had no FIPS value. However, in later CSV files, a unique (yet arbitrary, invalid) “FIPS” code was provided for these placeholder records. For example, “Out of GA” is uniquely assigned the value 80013 in the FIPS column, despite not being a US Census-recognized FIPS value.

Unassigned records represent those cases attributable to a state but not to a county therein, and for which assignment to a county should occur in the future. These Unassigned buckets enable state-level COVID-19 analysis to rely on the most recent data while county-level analysis is understood to have missing values that will be added at a later point. Notwithstanding this delay, county-level determination should eventually be made, so Unassigned cases should trend toward zero over time for each state. This expected decrease contrasts with cumulative case counts for known counties, which should increase or remain stable over time.

Out of State records represent the second category of missing counties, in which the county (i.e., Admin2 column) is denoted as “Out of State,” with state referenced as the two-letter state abbreviation. For example, “Out of GA” denotes “Out of Georgia.” This designation is more ambiguous than the Unassigned category, in part because JHU does not define “Out of State” in its documentation or its GitHub data dictionary.

The NYTimes, which compiles its own county-level COVID-19 data from state and local health agencies, clarified via email that “States seem to use ‘Out of State’ in two ways. If it refers to residents of other states who have been diagnosed in the state in question, then we exclude those cases. If ‘Out of State’ refers to residents of that state who have been diagnosed in another state, we would include those cases within Unknown.” The NYTimes maintains a separate GitHub repository: https://github.com/nytimes/covid-19-data.

In its “Methodology and Definitions” section, the NYTimes further clarifies that “Many state health departments choose to report cases separately when the patient’s county of residence is unknown or pending determination. In these instances, we record the county name as ‘Unknown.’ As more information about these cases becomes available, the cumulative number of cases in ‘Unknown’ counties may fluctuate. Sometimes, cases are first reported in one county and then moved to another county. As a result, the cumulative number of cases may change for a given county.”

Other Unknown records represent the third category of cases for which a county is not identified. The most notable examples of this category stem from Michigan, in which 4,130 cases (and 68 deaths) are attributed to the “Michigan Department of Corrections (MDOC)” and 164 cases (and 5 deaths) to the “Federal Correctional Institution (FCI).” Unlike Unassigned and Out of State cases, which should trend toward zero over time as they are assigned to valid counties, this third category of unknown counties will continue to increase as new cases are identified, and will not decrease unless Michigan reformulates how it tracks inmate cases.
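The three categories above can be sketched as a small classifier over the Admin2 and FIPS columns. This is only an illustrative sketch, not the scripts used for this analysis: the function name classify_county, the sample rows, and their counts are hypothetical, and the rule that a valid county FIPS lies between 01001 and 56045 is a simplifying assumption that covers the 50 states and DC but would misfile US territories.

```python
import csv
import io
from collections import Counter


def classify_county(admin2, fips):
    """Classify one record as 'County', 'Unassigned', 'Out of State', or 'Other Unknown'."""
    name = (admin2 or "").strip()
    lowered = name.lower()
    if lowered == "unassigned":
        return "Unassigned"
    if lowered.startswith("out of "):
        return "Out of State"
    try:
        code = int(float(fips))
    except (TypeError, ValueError, OverflowError):
        return "Other Unknown"  # blank or non-numeric FIPS value
    # Assumption: county FIPS codes for the 50 states and DC fall between
    # 01001 and 56045; placeholder codes such as 80013 fall outside this range.
    if name and 1001 <= code <= 56045:
        return "County"
    return "Other Unknown"


# Hypothetical sample rows; the counts are invented for illustration only.
SAMPLE = """FIPS,Admin2,Province_State,Country_Region,Confirmed,Deaths
13121,Fulton,Georgia,US,100,5
80013,Out of GA,Georgia,US,20,1
,Unassigned,Georgia,US,30,2
,Michigan Department of Corrections (MDOC),Michigan,US,40,3
"""

# Sum cumulative confirmed cases per category.
totals = Counter()
for row in csv.DictReader(io.StringIO(SAMPLE)):
    totals[classify_county(row["Admin2"], row["FIPS"])] += int(row["Confirmed"])

print(dict(totals))
```

Checking the Admin2 name before the FIPS value matters here, because later CSV files give placeholder records a numeric code that would otherwise need its own special case.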

COVID-19 Cases with Missing County

As of June 14, 14,386 COVID-19 cases exist with no attributable county. Note that the bars in this figure represent cumulative case totals, not daily case rates, so it is alarming (from a data quality perspective) that the number of cases having no identifiable county continues to increase.

Only the “Other Unknown” category (shown in blue) should be increasing over time, as this category includes case totals for records such as those of Michigan inmates. However, both the Unassigned and Out of State categories should be resolved over time (and attributed to their respective counties) and should not be increasing. Thus, in the previous figure, Michigan’s high case totals reflect Michigan’s decision to report data incongruously from other states. Conversely, Georgia, Rhode Island, New Mexico, and Tennessee are inexplicably failing to report county for a significant number of cases.

The incidence of records without a county is highly dependent on state, as demonstrated in the following figure. For example, the initial spike in New Jersey cases around April 1 corresponds to the exponential growth in cases that New York and New Jersey were experiencing at the height of the US pandemic. From JHU data alone, it is impossible to determine to which counties these nearly 5,000 county-less cases belonged on April 1. However, these Unassigned records were later assigned (retroactively) as New Jersey case rates slowed, peaked, and began to decrease. This improvement in data quality is reflected in the decreasing trendline (dashed green line) of New Jersey in the following figure.

Georgia, on the other hand, contrasts sharply with New Jersey, as its number of cases without a county has steadily increased throughout the pandemic. Thus, not only will historical Georgia data be less precise, but current Georgia data (from which emerging trends should be derived) will also lack county-level specificity.
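One way to tell a New Jersey-style pattern from a Georgia-style pattern is to fit a straight line to a state's cumulative county-less case counts and inspect the sign of the slope. The sketch below assumes an ordinary least-squares fit over evenly spaced daily values; whether the trendlines in the figures were computed this way is not stated, and the function name trend_slope and both series are hypothetical.

```python
def trend_slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical cumulative county-less case counts on consecutive days.
resolving_state = [5000, 3000, 1000]  # backlog is being assigned to counties
worsening_state = [0, 1, 2, 3]        # backlog keeps growing

print(trend_slope(resolving_state))  # negative slope
print(trend_slope(worsening_state))  # positive slope
```

A negative slope suggests the backlog of county-less records is being resolved, while a positive slope suggests it is still growing.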

The number of cases in which the county is missing is a subset of each state’s total number of cases. Thus, a more equitable representation of missing county data (and one which facilitates interstate comparison) instead graphs the percentage of each state’s cumulative cases for which county is missing. Rhode Island fares worst, with 11.7 percent of all cases lacking county information, followed by Georgia and New Mexico at 7.8 and 7.0 percent, respectively. It is unclear why some states are failing to report county-level data in a timely manner, as well as failing to correct and report these data over time.
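The percentage described above is the sum of a state's county-less cumulative cases divided by that state's total cumulative cases. A minimal sketch, assuming each record has already been flagged as having a valid county or not; the function name percent_missing_by_state, the record layout, and the figures for “State A” and “State B” are hypothetical:

```python
from collections import defaultdict


def percent_missing_by_state(records):
    """records: iterable of (state, has_county, cumulative_cases) tuples."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for state, has_county, cases in records:
        total[state] += cases
        if not has_county:
            missing[state] += cases
    # Skip states with zero cases to avoid dividing by zero.
    return {s: 100.0 * missing[s] / total[s] for s in total if total[s] > 0}


# Hypothetical records: one row per county or placeholder bucket.
records = [
    ("State A", True, 900),
    ("State A", False, 100),  # e.g., an Unassigned bucket
    ("State B", True, 500),
]
shares = percent_missing_by_state(records)
print(shares)
```

Normalizing by each state's own total is what makes a small state such as Rhode Island comparable with a large one such as Georgia.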

COVID-19 Deaths with Missing County

JHU COVID-19 deaths show the same disturbing pattern as COVID-19 cases, in which county is too often missing. As of June 14, 1,345 deaths are not attributed to a specific county, with the majority of these deaths denoted as Unassigned.

As before, the incidence of deaths not attributed to a specific county is primarily influenced by state. Rhode Island has never reported deaths at the county level, so all Rhode Island deaths lack county attribution. Rhode Island does provide additional information on its Department of Health COVID-19 dashboard, although these data are unfortunately not integrated with the JHU GitHub repository or JHU dashboard.

When deaths lacking a county are viewed as a percentage of each state's deaths, it is evident that Wyoming deaths are also largely missing county, with 94.4 percent (17 of 18 deaths) lacking county data. Inspection of the Wyoming Department of Health's COVID-19 dashboard (https://health.wyo.gov/publichealth/infectious-disease-epidemiology-unit/disease/novel-coronavirus/covid-19-map-and-statistics/), however, details the counties in which these deaths occurred, so in this instance of missing data, the breakdown appears to lie somewhere in the Wyoming-JHU data ingestion pipeline. This issue is logged within GitHub (https://github.com/CSSEGISandData/COVID-19/issues/2712) and visualized below.

Unknown County Plaguing Historical Analysis

Care must be taken when analyzing historical JHU data at the county level, especially where a high percentage of records is missing county information. In general, JHU “Daily Reports” are not corrected retroactively, so values unknown at a point in time will remain unknown for those dates and within those CSV files. This is demonstrated most clearly in the historical analysis of New Jersey COVID-19 cases for which county was missing.

For example, an analyst reviewing New Jersey COVID-19 cumulative cases on April 6 would have observed the following county-level trends. The number of cases and cases per million county residents are listed for the seven counties having had the highest cumulative per capita COVID-19 cases reported on April 6.

But what about the unreported county-level incidence on April 6, that is, the New Jersey cases for which county was unknown at the time? These cases are referenced in the third footnote, which cautions that “8.6% (N=3,521) of New Jersey COVID-19 cases are not attributed to any county.” Although these cases were corrected going forward in subsequent weeks (in later JHU CSV files), the original CSV files (with missing counties) are unchanged in the JHU GitHub repository, in keeping with JHU’s decision not to correct data retroactively.

Thus, this historical picture of New Jersey underestimates county trendlines, owing to the more than 3,500 cases that are omitted (i.e., unattributed to any county) on April 6. This can especially skew analyses if the cases lacking county information are not distributed evenly, for example, if a majority of these cases would have been attributable to only a few counties.

As of June 14, New Jersey has only 196 cases for which the county is unknown.

This graph illustrates a success story: although New Jersey initially accumulated a large number of cases without a county, the state rapidly assigned these cases to their respective counties. However, other states have not fared so well, as demonstrated in subsequent examples.

Unknown County Plaguing Current Analysis

New Jersey and many other states have overcome this data integrity issue by successfully assigning Unassigned cases and deaths to their respective counties; however, some states are failing to do so, and even falling further behind. The following subsections enumerate the four greatest offenders in missing county data—Rhode Island, Georgia, New Mexico, and Indiana. As precise, county-level case and death rates are one essential tool in helping to identify newly emerging pockets of COVID-19 as well as local regions with rising rates, the lack of county identification can plague ongoing modeling and predictive analysis.

Rhode Island

Rhode Island, for example, has 11.7 percent (N=1,870) of cases that lack a county.

Rhode Island deaths are reported with even lower specificity, with no data available (via JHU) at the county level. The JHU FAQ page describes “Rhode Island: Currently, Rhode Island reports confirmed cases by county, but only reports deaths at the state level. This means that we are unable to report out on deaths at the county level. You will find the Rhode Island count of deaths are grouped under 'Unassigned.'” (https://coronavirus.jhu.edu/us-map-faq)

The Rhode Island Department of Health does maintain an active COVID-19 dashboard (https://ri-department-of-health-covid-19-data-rihealth.hub.arcgis.com/) that includes test rates and case rates by city and ZIP code; however, the inability to fuse these data to the JHU data repository thwarts interstate analysis, forcing analysts to derive data directly from the state (or from other sources such as the NYTimes GitHub repository) for county-level analysis. This is both unfortunate and unnecessary.

Georgia

Georgia is the second notable example of a state with an inexplicably high number of cases that have no county recorded. These include 1,675 Unassigned and 2,807 Out of State cases.

Although Georgia is trending toward increasingly more cases that lack county information, it has made intermittent strides to improve data quality, such as the decrease in Unassigned cases from May 24 through May 29, during which time many historical records were assigned a county.

This county assignment improves data quality, albeit while introducing new challenges to data analysis. For example, consider the following figure that demonstrates Georgia daily case rates over time. Note the corresponding substantial spike in Chattahoochee cases during this May 24 period, as records previously recorded as Unassigned were effectively shifted to this county all at once.

This same shift from Unassigned county to Chattahoochee can be observed in Georgia cumulative case counts, in which the number of cases rose vertically.

In sum, although these hapless COVID-19 cases should be ascribed to their respective counties to facilitate county-level analysis, this increase in data quality will confound some results, making it appear as though some counties have spiked, when in fact they have only been infused with old cases from days, weeks, or even months in the past.
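An apparent spike of this kind can be screened for by differencing the cumulative series and flagging days on which a county's count jumps while the state's Unassigned count falls. The following is a rough heuristic sketch rather than a validated method; the function names, the min_jump threshold, and both series are hypothetical and are not Georgia's actual figures.

```python
def daily_changes(cumulative):
    """Day-over-day differences of a cumulative series."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


def possible_backfill_days(county_cumulative, unassigned_cumulative, min_jump):
    """Indexes of days where the county jumps by at least min_jump
    while the Unassigned bucket for the same state decreases."""
    county = daily_changes(county_cumulative)
    unassigned = daily_changes(unassigned_cumulative)
    return [
        day + 1  # a change at position `day` is observed on day + 1
        for day, (county_delta, unassigned_delta) in enumerate(zip(county, unassigned))
        if unassigned_delta < 0 and county_delta >= min_jump
    ]


# Hypothetical cumulative counts over five consecutive days.
county_cases = [10, 12, 14, 114, 116]
unassigned_cases = [200, 210, 220, 120, 125]

flagged = possible_backfill_days(county_cases, unassigned_cases, min_jump=50)
print(flagged)
```

Days flagged this way are candidates for annotation or smoothing before computing daily case rates, since the increase may reflect reassigned historical records rather than new infections.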

The Georgia Department of Public Health lends some explanation to Georgia COVID-19 cases that lack county information. Their COVID-19 Dashboard Guide defines “county” as “Reflects the county of residence. This data element is often unreported to DPH, and in such instances, is reported as ‘Unknown.’” (https://ga-covid19.ondemand.sas.com/docs/GA_COVID19_Dashboard_Guide.pdf) This helps to explain how Georgia records Unassigned cases, but fails to justify the significant number of Out of State cases, especially as compared with that of other states.

The Georgia Department of Public Health releases CSV files daily that describe its COVID-19 cases and deaths (https://ga-covid19.ondemand.sas.com/docs/ga_covid_data.zip). Within the enclosed countycases.csv file, both “Non-Georgia” cases (N=2,909) and “Unknown” cases (N=1,551) are included, which respectively map to “Out of State” and “Unassigned” cases, as recorded by JHU. When the cases within the CSV files are summed, the total matches the 58,414 cases on the COVID-19 Department of Public Health dashboard (https://dph.georgia.gov/covid-19-daily-status-report).

Credit: Georgia Department of Public Health COVID-19 Dashboard

Thus, although Georgia is reporting “Out of State” cases on its dashboard, these cases are omitted from geospatial and other county-level analysis on the same site, potentially impeding the ability to identify outbreaks and other trends at the local level.

This is best demonstrated in the Georgia map prominently displayed on its dashboard, which states that “This chart is meant to aid understanding whether the outbreak is growing, leveling off, or declining and can help to guide the COVID-19 response.” The implementation has two major flaws, the first being that if a significant portion of new cases are being aggregated into various Unknown categories rather than counties, county-level trends will be muted and possibly unobservable.

The second issue with the dashboard is the description that “The charts below presents [sic] the number of newly confirmed COVID-19 cases over time.” The map beneath this description is a choropleth that is shaded by cumulative number of cases (or number of cases per capita, as is shown here) rather than daily case counts, as described. Thus, rather than representing the latest trends in COVID-19 cases, this map will instead be less responsive to spikes because it reflects several months of cumulative data. This is apparent in the data for Randolph County. For example, the 2,828 cases (shown both in the popup and the legend) represent 2,828 cumulative COVID-19 cases per 100,000 county residents, and not the daily case count as is described.

Credit: Georgia Department of Public Health COVID-19 Dashboard

New Mexico

New Mexico is yet another state with a substantial percentage of cases that are missing county, at 7 percent (N=679) of cases, all of which are Unassigned.

The New Mexico counties having the highest per capita incidence of COVID-19 are demonstrated in the following figure. Again, it is unclear where these 679 cases should be assigned in this county-level analysis, so unfortunately they must be omitted and can only be referenced as a footnote.

A recent bump in case rates in New Mexico around June 4 may explain the state's delay in reporting data to JHU, as the state struggles to process these data.

This increase is also visible from cumulative cases, which demonstrate that some counties are still continuing to experience high case rates.

As previously stated, the high percentage of missing counties can impede analysis, as it’s unclear to which counties these cases should be assigned, as well as why they are languishing without resolution. The New Mexico Department of Health COVID-19 dashboard (https://cvprovider.nmhealth.org/public-dashboard.html) does include these Unassigned cases in its cumulative number of cases; however, as previously encountered, these Unassigned cases are omitted from the New Mexico map and other county-level analyses on the dashboard, with no mention that a significant portion (7 percent) of cases are missing.

Indiana

Within Indiana, 7.5 percent (N=182) of deaths are missing county identification.

These deaths are described both on the Indiana COVID-19 dashboard (https://www.coronavirus.in.gov/) and on the NYTimes GitHub site (https://github.com/nytimes/covid-19-data/issues/263), and represent the May 6 shift in Indiana toward reporting probable COVID-19 deaths. It is unclear why the state has not retroactively attributed these deaths to their respective counties, given that JHU includes the deaths in its national and state totals.

Credit: Indiana COVID-19 Dashboard

These examples have showcased cases or deaths attributed to a state but which cannot be identified at the county level. The intent is to inform analysis as well as to encourage county-level reporting to best facilitate the identification of trends and the emergence of increasing case rates.

Data Sources and Analysis

All COVID-19 data were downloaded from the Johns Hopkins University (JHU) COVID-19 GitHub repository (https://github.com/CSSEGISandData/COVID-19) on June 14, 2020.

State-level shapefiles were downloaded from the US Department of Transportation (https://data-usdot.opendata.arcgis.com/datasets/states) and rendered in Python using the Matplotlib, Shapely, and Geopandas modules.

County-level shapefiles were downloaded from the US Department of Transportation (https://data-usdot.opendata.arcgis.com/datasets/counties) and rendered in Python using the Matplotlib, Shapely, and Geopandas modules.

State-level population statistics relied on US Census 2019 estimates (https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/state/detail/).

County-level population statistics relied on US Census 2019 estimates (https://www.census.gov/data/tables/time-series/demo/popest/2010s-counties-total.html).

All data download, cleaning, aggregation, transformation, and visualization were performed in Python 3.7 through automated scripts that generate results and 4K-resolution data products nightly when the JHU GitHub repository is updated. These scripts scale to produce thousands of graphs, maps, and videos daily, of which a handful were selected for this analysis.

Bill Clancy

CISSP, CCSP, CISA, CISM, CRISC, CDPSE, CEH, CNDA

4 years ago

As always...Troy is the "Data Wrangler", the untangler of data that just isn't quite right.

Bill Clancy

CISSP, CCSP, CISA, CISM, CRISC, CDPSE, CEH, CNDA

4 years ago

It's a bit alarming...

The y-axis is labeled as cumulative cases - therefore how can the height of the bars be going down in several places along the timeline?
