Covid-19: Lies, Damned Lies, and Statistics
Regardless of whom you attribute the quote to, many people distrust conclusions drawn from government data, and rightfully so. The data that is used, the data that is excluded, the quality of the data, the way it is presented, and the sources it comes from can all influence the veracity of any conclusion we draw. In the age of “Big Data,” this is truer than ever. Data scientists are now highly paid specialists hired to make sense of it all for us, using algorithms and other AI techniques to find patterns and draw distinctions.
One would think that in the midst of a pandemic, data and understanding would be the rule, and that all parties would seek truth. It turns out that data is being manipulated to influence public opinion and personal action. The latest headline came a few days ago, when the State of Florida fired the data scientist responsible for its health dashboard and then smeared her. The facts are not yet clearly understood, but the fired scientist claims her dismissal was due to her refusal to manipulate the data.
One wonders how it could be so difficult to present facts and figures on the number of tests performed, the number of new cases, and the total number of cases. It turns out: pretty difficult. How difficult? Let me count the ways.
Test types:
There are two main kinds of test: molecular and similar tests that use saliva or mucus samples to see whether you are currently infected, and blood tests that look for evidence that you were infected in the past. These tests are very different and are conducted for different purposes. Yet the CDC and some states have grouped them into the same categories. They shouldn’t be; otherwise you can present a distorted view of either herd immunity or new case counts.
Test data quality:
There are lots of different tests, and each has its own level of accuracy. Current detection tests are more accurate than the blood tests, but even detection tests aren’t perfect: they generate false positives and false negatives, and there is variance between individual manufacturers. The blood tests used to identify whether antibodies are present as the result of a Covid-19 infection are an order of magnitude worse than the tests used to detect the virus itself.
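To see why false positives and false negatives matter so much, it helps to work through Bayes’ rule. The sketch below computes the positive predictive value: the chance that a positive result actually means infection. The sensitivity, specificity, and prevalence figures are illustrative assumptions for the sake of the arithmetic, not published numbers for any specific test.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(infected | positive test), via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Illustrative numbers only: a test with 95% sensitivity and 95% specificity,
# used in a population where 2% of people are actually infected.
ppv = positive_predictive_value(0.95, 0.95, 0.02)
print(f"{ppv:.1%}")  # prints 27.9%
```

The counterintuitive result is the point: with a rare disease, even a quite accurate test produces mostly false positives, because the healthy population is so much larger than the infected one.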
The quality of the data can also be affected by who administers the test. For example, nasal swab tests often require sticking the swab deep into the nasal passage. Unless professionally administered, the swab may not collect the right sample material, producing bad test results.
Finally, the timeliness of the data matters. Getting results into databases promptly, and recording the date each test was actually performed, is significant when doing time-series studies.
What is counted:
When the virus was first reported, there weren’t good reporting standards. For example, it wasn’t clear what counted as a Covid-19-related death. Was someone who died while exhibiting Covid-19-like symptoms, but who was never tested, counted? If you died in a location away from your home, were you counted where you died or where you resided? Were you counted at the time of the doctor’s declaration of death, or only after confirmation by the coroner’s office?
Population tests, sample sizes and sampling methodologies:
Covid-19 is more than a nasty killer virus; it is a statistician’s and an epidemiologist’s worst nightmare. Many who contract the virus remain asymptomatic, which makes contact tracing difficult and skews the data. People who don’t exhibit symptoms won’t get tested. Those who manifest severe symptoms generally do. Those with milder cases probably won’t get tested either. This means that actual case counts are understated relative to the total population.
Trying to get a clear understanding of herd immunity is a challenge. Given the propensity for false positives and false negatives, getting precision from the data will require multiple tests. Further, pulling the right sample groups is hard: getting bias out of the sample will be tough, given that hospital access and the populations being tested are tied to factors like location and access to health insurance.
So What To Make of All of This?
One need only go back to the wars in Iraq and Afghanistan, and before that Vietnam, to see how government agencies manipulate data to serve a purpose. The current administration casts President Trump as a wartime president, and you can bet a wartime president is manipulating the data. Think back to the daily briefings, when the President would constantly highlight the same numbers about the availability of ventilators and other PPE. While he was claiming there was a surplus, there was both anecdotal and factual evidence that the equipment wasn’t getting where it needed to be quickly enough.
The President is now harping on the number of tests conducted. But as pointed out earlier, these aggregate totals are fairly meaningless, and conclusions drawn from them are questionable. I routinely pull raw data from the GitHub repository maintained by the New York Times. At the county and state level, it is obvious that most counties should not be reopening under the CDC guidelines.
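The NYT repository publishes cumulative case counts per county per day, so getting new daily cases means differencing consecutive rows yourself. Here is a minimal sketch using the repository’s us-counties.csv column layout (date, county, state, fips, cases, deaths) against an inline sample with made-up values; in practice you would download the real file from the nytimes/covid-19-data repository and group by county before differencing.

```python
import csv
import io

# Inline sample in the NYT us-counties.csv layout. Values are made up
# for illustration; the real file covers every reporting county.
SAMPLE = """date,county,state,fips,cases,deaths
2020-05-18,Travis,Texas,48453,100,5
2020-05-19,Travis,Texas,48453,112,5
2020-05-20,Travis,Texas,48453,130,6
"""

def new_daily_cases(csv_text):
    """Turn the cumulative 'cases' column into per-day new cases.
    Assumes the rows all belong to one county; the real file needs
    a group-by on fips first."""
    rows = sorted(csv.DictReader(io.StringIO(csv_text)),
                  key=lambda r: r["date"])
    result, prev = [], None
    for row in rows:
        cases = int(row["cases"])
        result.append((row["date"], cases - prev if prev is not None else cases))
        prev = cases
    return result

print(new_daily_cases(SAMPLE))
# [('2020-05-18', 100), ('2020-05-19', 12), ('2020-05-20', 18)]
```

This is exactly the kind of small transformation where errors creep into dashboards: mixing up cumulative and daily counts, or differencing across county boundaries, silently produces wrong trend lines.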
The bottom line: question everything! Find the sources of data for your own community and determine how they are being collected. It is really important to look at municipal- and county-level data. Look at who is publishing the data and question their motivation for presenting it. Is it unbiased? Is it serving an agenda? If so, what agenda?
Covid-19 can kill anybody. Certainly, some populations are more susceptible than others. While the government has put out guidelines, it is your personal decisions that determine your own risk. Do the investigative work before making any decision that affects your safety or your family’s.