Using Data Science and Machine Learning to stop Covid-19
As the threat posed by SARS-CoV-2, better known as the "Coronavirus", and the associated restrictions and lockdowns keep us all in suspense, I find myself asking: Are we really doing enough to make sure our healthcare system is used as effectively as possible in the face of impending overload, particularly of intensive care facilities? Given the impressive successes of Artificial Intelligence and Machine Learning in the field of medicine, this question seems especially justified.
Algorithms, mostly based on neural networks, have long been able to identify pneumonia [1, 2, 3, 4], malaria and skin cancer [5], as well as numerous other illnesses, with accuracy equal to or higher than that of the best specialists in the respective areas. These specialists would not be rendered superfluous by algorithms - quite the opposite: doctors would gain much-needed time for other tasks, such as improved patient education and information. Rather than taking the task of diagnosis out of the doctors’ hands, the algorithms would assist and supplement healthcare professionals in their work, also - indeed especially - in the case of Corona-infections.
Perhaps you are irritated by the abundance of “woulds” in the previous paragraph. The fact is that the current situation, at least in Germany and the rest of Europe, is on the whole rather paradoxical. While research institutes continue to break records in the computer-assisted identification of medical conditions, the practical adoption of these innovative systems in the field lags far behind, at least in Europe [6].
The reasons for this include the various requirements and restrictions regarding data protection, both for research purposes and for actual use in a clinical environment. Put plainly: anyone intending to train and deploy an algorithm for clinical use in recognizing and diagnosing symptoms requires the data of hundreds, or better yet thousands, of patients. More importantly, according to data protection regulations, they would require the consent of every last one of these patients to use that data for a given specific purpose (in this case the training of an AI/Machine Learning algorithm for use in a clinical setting), even if the data is anonymized. However, the General Data Protection Regulation does allow for one exception to these rules, namely in cases where there is found to be a “public interest” in the data in question, which in the present Corona-crisis would undoubtedly be the case [7, 8].
You may now wonder why your telecommunications provider is allowed to sell anonymized connection data [9, 10]. After all, this is also personal data, and in fact very personal data. The key lies in the definition of the word "anonymized".
According to current jurisprudence and legal interpretation, data can only be considered truly anonymized if the anonymization or encryption cannot be retroactively undone or decrypted, at least not by users. For telecom providers, this is quite a simple matter: by law, they must destroy the data after half a year, and you can’t decrypt data that no longer exists.
This is, of course, completely out of the question for medical data. Moreover, in order to train neural networks, it is occasionally necessary to be able to trace the data, such as when error analysis calls into question whether an illness has been correctly labelled (i.e. diagnosed) on a particular blood smear or X-ray image.
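One common way to reconcile traceability with data protection is pseudonymization: the hospital replaces each patient identifier with a keyed hash and keeps both the key and the reverse lookup table to itself. The following is a minimal sketch of that idea, not a description of any existing system; the function name, the identifiers and the file names are hypothetical. Whether data treated this way meets the legal definition of "anonymized" discussed above is a legal question this sketch does not answer.

```python
import hashlib
import hmac
import secrets


def make_pseudonym(patient_id: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from a patient ID using HMAC-SHA256."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    # Shorten the hex digest for readability (assumption: 16 hex chars suffice here).
    return digest.hexdigest()[:16]


# The key never leaves the hospital (hypothetical setup).
secret_key = secrets.token_bytes(32)

# Hypothetical records: patient ID -> image file.
hospital_records = {"patient-0001": "xray_a.png", "patient-0002": "xray_b.png"}

# What the research partner receives: pseudonym and image, no patient ID.
shared_dataset = [
    {"pseudonym": make_pseudonym(pid, secret_key), "image": image}
    for pid, image in hospital_records.items()
]

# What only the hospital keeps: a table to trace a pseudonym back to a patient,
# e.g. when error analysis questions the label on one particular image.
reverse_lookup = {make_pseudonym(pid, secret_key): pid for pid in hospital_records}
```

If error analysis flags an image, the research partner reports its pseudonym, and only the hospital can resolve it via `reverse_lookup`.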
Data protection is naturally of the utmost importance, especially in this time of increased digitalization. Nevertheless, medical data can save lives. It therefore falls to us to ensure the protection of personal data on the one hand, while exploring and utilizing the possibilities it offers on the other. Numerous national and international organisations - including the US Department of Health and National Institutes of Health (NIH) [11, 12] - as well as various international initiatives, such as “AI for Good” [13] or the Roundtable on Global Initiative and Data Commons [14], have come to the conclusion that our data, protected and anonymized according to rigorous security standards and regulations, can be considered a public good.
And why not?
After all, the German Minister of Health is of the opinion that we should forfeit our right to our organs in the case of brain death, unless we have explicitly declared our opposition to a transplant beforehand. Ultimately, this suggestion did not make it through the Bundestag; instead, citizens are to be asked more frequently whether they wish to opt in to post-mortem organ donation.
The importance of a similar solution regarding data is made clear by the possibilities offered by Machine Learning. A potential algorithm, trained to recognize and identify Corona-infections via X-ray or CT scan images of a human lung, would open up the following possibilities in the current crisis:
- According to recent media reports, doctors are currently forced to wait three or more days for the results of a Corona-test, provided they don’t have access to their own laboratory. Access to a detection algorithm would speed up diagnosis considerably, especially in more severe cases.
- In cases of overstretched intensive care units, the algorithm would allow for better targeted isolation of infected patients, which would both reduce the strain on hospital capacities and mitigate the risk of further, hospital-acquired infections.
- The algorithm could also be of help to doctors who have yet to come into contact with people infected with the virus, particularly in excluding Corona-infections.
- During later stages of the outbreak, the algorithm could be further refined to predict and identify cases where the infection has an increased likelihood of leading to a severe progression of the disease, allowing for earliest-possible specialized treatment.
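For the rule-out use case in the list above, the figures that matter most are sensitivity (how many infected patients the algorithm catches) and negative predictive value (how trustworthy a negative result is). As a rough sketch of how such an algorithm might be judged, the following computes these values from a confusion matrix. The counts are invented purely for illustration and are not results from any real model or study.

```python
def rule_out_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute basic diagnostic metrics from confusion-matrix counts."""
    return {
        # Share of infected patients the algorithm flags as infected.
        "sensitivity": tp / (tp + fn),
        # Share of non-infected patients the algorithm clears.
        "specificity": tn / (tn + fp),
        # Probability that a patient with a negative result is truly not infected.
        "npv": tn / (tn + fn),
    }


# Hypothetical evaluation on 200 images: 50 infected, 150 not infected.
metrics = rule_out_metrics(tp=45, fp=20, tn=130, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the negative predictive value depends on how common the infection is among the patients tested, so it would have to be re-estimated for each clinical setting.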
Initial research in this area is in full swing. For example, hospitals in Wuhan are already using a similar algorithm. However, their data is not publicly accessible [15, 16]. Moreover, a data set drawn exclusively from Chinese patients would be highly problematic, because an algorithm trained on it could be biased and might not generalize to patients elsewhere. Meanwhile, in the US, a scientist at Stanford University [17] is currently assembling a publicly accessible data set. However, at just 105 radiographs from 65 patients (as of March 22, 2020), it is still far too small to train a reliable algorithm on.
As of now, and to the best of my knowledge, there is still no comparable initiative here in Europe. Alexander Thamm GmbH would be capable of programming and training such an algorithm within a short amount of time and of making it available free of charge to any interested hospital or clinic.
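A data set with 105 images from 65 patients also contains several images of the same person. If images of one patient end up in both the training and the test set, the measured accuracy will look better than it really is. A minimal sketch of a patient-level split that avoids this is shown below; the file names, the distribution of images per patient and the 20% test share are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def split_by_patient(image_to_patient: dict, test_fraction: float = 0.2, seed: int = 0):
    """Split images into train and test sets so that no patient appears in both."""
    by_patient = defaultdict(list)
    for image, patient in image_to_patient.items():
        by_patient[patient].append(image)

    # Shuffle patients (not images) reproducibly, then cut off the test share.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    train = [img for p in patients if p not in test_patients for img in by_patient[p]]
    test = [img for p in patients if p in test_patients for img in by_patient[p]]
    return train, test


# Toy data mirroring the sizes above: 65 patients and 105 images
# (assumption: 40 patients with two images each, 25 with one).
image_to_patient = {}
for p in range(65):
    for i in range(2 if p < 40 else 1):
        image_to_patient[f"img_{p:03d}_{i}.png"] = f"patient_{p:03d}"

train_images, test_images = split_by_patient(image_to_patient)
print(len(train_images), "training images,", len(test_images), "test images")
```

Splitting by patient rather than by image leaves even fewer independent examples for testing, which underlines how small the data set still is.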
Alongside our hundreds of industrial projects utilizing Machine Learning, we also have experience in the field of medicine, such as in:
- The identification of pneumonia using X-ray images
- The identification of Malaria-infections using images of blood smears
- The classification of proteins using microscopic images
- Determining the severity of diabetes-related retinopathy using images of the eyeball
In order to realize this undertaking and to provide an algorithm as soon as possible in the present tense situation, we are seeking contact with hospitals and clinics able to supply us with anonymized data, i.e. lung X-ray and CT scan images, along with sponsors to support our endeavour. As a socially responsible company, we are prepared to shoulder part of the costs ourselves.
We will gladly provide a project sketch to any seriously interested sponsors or to partners from the field of medicine. Please direct your inquiries to:
Andreas Gillhuber (Co-CEO) - [email protected]
We would be happy to be of help in this current situation. We offer our sincerest thanks for your attention and interest.
Alex