COVID-19: Failure is an Option

Data scientists have collectively failed us, with great consequence, in the SARS-CoV-2 (COVID-19) pandemic. This failure is embedded in our seemingly intense desire to reduce data to its coarsest level. Call this process “destructive summarization”. It is a common process by which more useful data is routinely transformed into much less useful data. An industrial example is when multiple property measurements are reduced to whether the product passes or fails. In destructive summarization the data analyst does not get to see the original data; they get only the summarized data. Destructive summarization has two degrees. In the recoverable form, the data that was summarized still exists but was not provided. In the permanent form, the original data on which the summary is based is discarded and only the summary is kept. What might this have to do with routine Covid-19 data?

In any data research effort, data scientists need to specify the data that is required for analysis. The data needs to be timely and relevant, and it needs to support answering known questions and taking known desired actions. This appears to be an uncompleted task for routine Covid-19 data collection, with profound consequences. Where is this “data crime” of destructive summarization being routinely committed? A laboratory test is run; then the test result is reported as either you have Covid-19 or you do not. It is a summarized yes / no result or, by analogy, you pass (no Covid-19 detected) or fail (Covid-19 detected). Isn’t this exactly what we want to know? It is useful to know this, but the answer is no.

What does the data look like prior to destructive summarization? For this purpose, consider only the most common test, which uses PCR (roughly, iterative replication of genetic material to detect the disease marker(s) being searched for). The replication is run for up to some maximum number of cycles, say 40, and the test is known to gradually become less reliable as the number of cycles becomes large. Either Covid-19 is detected at some cycle count, or the maximum count is reached without detection and Covid-19 is reported as not found. Why is the replication threshold at which Covid-19 was detected so much more useful than a simple yes or no?

The replication threshold at which Covid-19 is detected is a proxy for how severe the infection is. The lower the replication count, the more severe the infection. The higher the replication count, the more likely it is that the infection is not active (the person is unlikely to currently spread Covid-19 but was likely exposed to it), and as the replication count increases the reliability of the test itself deteriorates substantially. Imagine what can be done with data that has not been destructively summarized. Having a crude measure of infectiousness and a crude measure of test reliability is highly valuable.
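To make the information loss concrete, here is a minimal Python sketch. The record type, field names, and sample values are all hypothetical, and the maximum of 40 cycles is simply the illustrative figure used above. It shows how two very different positive results become indistinguishable once only yes / no is reported.

```python
from dataclasses import dataclass
from typing import Optional

MAX_CYCLES = 40  # illustrative maximum ("say 40"), not a laboratory standard


@dataclass
class PcrResult:
    """Hypothetical record of one PCR test before summarization."""
    sample_id: str
    detection_cycle: Optional[int]  # None means no detection by MAX_CYCLES


def summarize_destructively(result: PcrResult) -> bool:
    """Reduce the result to yes / no, discarding the detection cycle."""
    return result.detection_cycle is not None


# Invented example data: an early detection, a late detection, and no detection.
results = [
    PcrResult("A", 18),
    PcrResult("B", 37),
    PcrResult("C", None),
]

summaries = [summarize_destructively(r) for r in results]
print(summaries)  # samples A and B look identical in the summary
```

Keeping `detection_cycle` alongside the yes / no flag costs one extra field per test, and the yes / no report can still be derived from it at any time.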

Contact tracing could be prioritized to address those who are likely to have active infections, starting with those with the most active infections. Contact tracing fails when the data does not support prioritizing infections and the number of infections becomes too large to manage. Better test data resolution allows better discrimination in how to respond most effectively to the pandemic. Timeliness of data acquisition and timeliness of contact tracing also greatly impact the practicality of contact tracing.
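As a rough illustration of such prioritization, the following Python sketch ranks positive cases by the cycle count at which detection occurred, lowest first, and keeps only as many as a tracing team could handle. The case identifiers, cycle counts, and the `capacity` parameter are invented for the example; a real scheme would also weigh factors such as test date and exposure setting.

```python
# Hypothetical positive results as (case_id, detection_cycle) pairs.
positives = [
    ("case-1", 36),
    ("case-2", 17),
    ("case-3", 29),
    ("case-4", 22),
]


def prioritize_for_tracing(cases, capacity):
    """Return up to `capacity` cases, lowest detection cycle first.

    A lower detection cycle is treated here as a crude signal of a more
    active infection, following the reasoning in the text.
    """
    ranked = sorted(cases, key=lambda case: case[1])
    return ranked[:capacity]


to_trace = prioritize_for_tracing(positives, capacity=2)
print(to_trace)
```

With only yes / no results, all four cases would carry equal priority and the choice of which two to trace first would be arbitrary.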

Even a summarization that is one level less coarse, such as “likely to have active Covid-19”, “uncertain whether Covid-19 is active”, and “unlikely to have active Covid-19”, is profoundly useful. Suppose several patients arrive at a hospital ER roughly equally ill, all having some difficulty breathing, and all are known to have received test results that day indicating they are positive for Covid-19. How should they be treated or further tested? For example, what if one tested positive but is also highly unlikely to have an active Covid-19 infection? This is a critical piece of information for the ER medical professionals. Paths of treatment could depend on this information; the patient might benefit from Tamiflu rather than a Covid-19 focused treatment. Even a very crude measure of degree of infection is invaluable for determining an optimal patient treatment strategy.
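The less coarse summarization described above could look something like the following Python sketch. The two cutoff values are placeholders chosen only to make the example run; they are not clinical thresholds, and suitable values would have to come from laboratory and medical experts for a specific assay.

```python
from typing import Optional

# Placeholder cutoffs for illustration only; NOT clinical values.
LIKELY_ACTIVE_MAX_CYCLE = 25
UNCERTAIN_MAX_CYCLE = 33


def activity_category(detection_cycle: Optional[int]) -> str:
    """Map a detection cycle to a coarse, but not binary, category.

    `None` means the marker was not detected within the maximum cycle count.
    """
    if detection_cycle is None:
        return "not detected"
    if detection_cycle <= LIKELY_ACTIVE_MAX_CYCLE:
        return "likely active"
    if detection_cycle <= UNCERTAIN_MAX_CYCLE:
        return "uncertain"
    return "unlikely active"


for cycle in (18, 30, 38, None):
    print(cycle, "->", activity_category(cycle))
```

Under these assumed cutoffs, the ER patient in the example who tested positive only at a very late cycle would be reported as “unlikely active” rather than simply “positive”.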

It is time that key data scientists better recognize that effective data science is always a strong function of both how data is summarized and which data is collected. Data scientists need to become obnoxiously insistent on the data that is needed for problem solving versus the data needed for reporting purposes. Without destructive summarization, one has both. Progress requires changes to our data reporting and summarization practices.

Failure is an option, but let’s not continue to suffer the consequences of key data reporting over-simplification. 

No claim of expertise in PCR based testing or medicine is intended in this note. While the author is a data scientist, he functions as such primarily outside of the contexts of Covid-19 data, medicine, and diagnostic tests. 


