What Does Synthetic Data Mean In Healthcare’s Artificial Intelligence Revolution?
Bertalan Meskó, MD, PhD
The Medical Futurist, Author of Your Map to the Future, Global Keynote Speaker, and Futurist Researcher
Data is the foundation of artificial intelligence. As the importance of A.I. grows in modern medicine, so does the need for data and for data annotation, the latter being one of the most important parts of building an algorithm. In healthcare, collecting data means utilising existing databases and using images, radiology results, samples, CT or MR scans, patient records and more. The more data you feed the system, the better the results can become.
Artificial intelligence has earned its place in multiple fields of medicine, from recognising patterns, supporting diagnoses and setting up treatment pathways to optimising healthcare logistics. Smart algorithms can sift through volumes of data that no human could, and derive clear trends from them.
It’s easy to guess that this data includes your own health-related data: EMRs, smartwatches, genetic reports, wearables and so on are all sources that feed A.I. with datasets. But what if we are never able to obtain enough data to advance A.I. in healthcare?
What if privacy concerns don’t allow hospitals to share medical records with companies?
That’s when synthetic data comes in.
Is synthetic data fake?
It is fake, but it’s based on real-life data. Moreover, there are methods that make synthetic data closely resemble the real thing. One of these is the generative adversarial network (GAN).
Let’s imagine there’s a painter who wants to create better and better copies of Picasso’s paintings to sell them as real ones. On the other side, there’s a detective who wants to catch him by spotting these fake Picassos.
By painting more and more fakes, the painter gradually gets better at creating them. At the same time, while going after him, the detective also gets better at recognising which works of art are replicas. They both keep trying to beat each other and, after many iterations, the painter creates paintings indistinguishable from a real Picasso. In machine learning terms, that is the goal of the whole exercise.
And this is exactly how A.I. can create synthetic photos of birthmarks (or, in fact, anything else) to support an algorithm’s development; in the case of birthmarks, to better detect melanoma or other skin conditions. Based on existing data, the algorithm attempts to generate data that is somewhat different from the original, but not so different as to lead to a false result. So it IS fake, but it isn’t.
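To make the painter-and-detective game a bit more concrete, here is a minimal sketch in Python. It is a toy, not a medical GAN, and everything in it is an assumption made for illustration: the "real data" is a single number drawn from a bell curve, the painter (generator) is the straight line G(z) = a*z + b, the detective (discriminator) is a one-variable logistic classifier, and all function names and parameter values are hypothetical. Real systems use deep neural networks and images, but the alternating loop follows the same idea.

```python
import math
import random


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def discriminator_step(w, c, x_real, x_fake, lr):
    """One gradient step for the detective D(x) = sigmoid(w*x + c).

    Minimises -[log D(x_real) + log(1 - D(x_fake))]: score real samples
    close to 1 and fake samples close to 0.
    """
    d_real = sigmoid(w * x_real + c)
    d_fake = sigmoid(w * x_fake + c)
    grad_w = (d_real - 1.0) * x_real + d_fake * x_fake
    grad_c = (d_real - 1.0) + d_fake
    return w - lr * grad_w, c - lr * grad_c


def generator_step(a, b, w, c, z, lr):
    """One gradient step for the painter G(z) = a*z + b.

    Minimises -log D(G(z)): make the detective score the fake as real.
    """
    d_fake = sigmoid(w * (a * z + b) + c)
    grad_a = (d_fake - 1.0) * w * z
    grad_b = (d_fake - 1.0) * w
    return a - lr * grad_a, b - lr * grad_b


def train_toy_gan(steps=2000, lr=0.01, real_mean=4.0, real_std=1.0, seed=0):
    """Alternate detective and painter updates on made-up 1-D data."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # painter starts far from the real data
    w, c = 0.0, 0.0  # detective starts with no opinion
    for _ in range(steps):
        x_real = rng.gauss(real_mean, real_std)  # assumed "real" sample
        x_fake = a * rng.gauss(0.0, 1.0) + b     # painter's current fake
        w, c = discriminator_step(w, c, x_real, x_fake, lr)
        a, b = generator_step(a, b, w, c, rng.gauss(0.0, 1.0), lr)
    return (a, b), (w, c)


if __name__ == "__main__":
    (a, b), (w, c) = train_toy_gan()
    print("painter   G(z) = %.3f*z + %.3f" % (a, b))
    print("detective D(x) = sigmoid(%.3f*x + %.3f)" % (w, c))
```

Each round, the detective first learns from one real and one fake sample, then the painter adjusts to fool the updated detective. A toy this small may keep circling around a good solution rather than settling on it, which hints at why training real GANs is considered delicate.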
As this study clearly states, “synthetic data can be created from perturbations using accurate forward models (that is, models that simulate outcomes given specific inputs), physical simulations or A.I.-driven generative models.”
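The quote mentions perturbations as one route to synthetic data. As a rough illustration, the Python sketch below takes a few records and produces slightly altered copies by nudging each value with small random noise. The records, the field names, the function name `perturb_records` and the 5% noise level are all made up for this example. This is only the simplest reading of "perturbation", not the forward-model or simulation approaches the study describes, and on its own it offers no privacy guarantee.

```python
import random


def perturb_records(records, rel_noise=0.05, n_copies=3, seed=0):
    """Return n_copies noisy variants of every record.

    Each numeric value is multiplied by (1 + e), where e is drawn from a
    normal distribution with mean 0 and standard deviation rel_noise.
    """
    rng = random.Random(seed)
    synthetic = []
    for record in records:
        for _ in range(n_copies):
            synthetic.append(
                {key: value * (1.0 + rng.gauss(0.0, rel_noise))
                 for key, value in record.items()}
            )
    return synthetic


# Hypothetical, made-up measurements (not real patient data).
real_records = [
    {"heart_rate": 72.0, "systolic_bp": 118.0},
    {"heart_rate": 88.0, "systolic_bp": 135.0},
]

if __name__ == "__main__":
    for row in perturb_records(real_records):
        print(row)
```

Fixing the seed makes the output repeatable, which helps when a synthetic dataset needs to be regenerated for review.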
Why is data important in healthcare A.I.?
The biggest obstacle to A.I. is the inadequacy of the available data. Without patient data, there is no A.I. in healthcare. On one hand, effective algorithms in healthcare need a huge amount of data to learn from. On the other hand, that data needs to be annotated: drawing lines around tumours, pinpointing cells or labelling ECG rhythm strips. That’s why the altruistic role of data annotators is so important.
On top of that, privacy concerns limit the amount of available data in medicine. Working with sensitive patient data is a tricky issue. It seems we cannot keep our privacy intact AND also benefit from A.I.'s advantages in our care. We have seen in many cases how sensitive information can leak even unintentionally, and we are not even talking about hacking, just a poorly protected database. New methods like federated learning might make it possible to train A.I. without breaching patients' privacy, but their scope is limited.
That is where synthetic data could be of help. It can fill in the missing data, making it possible to produce entirely fabricated patient datasets that are just as useful for training A.I. as the real thing, while keeping patient data protected.
Privacy, quality and bias
With such synthetic datasets, even existing bias in A.I. could be reduced. Bias is an ongoing issue in A.I. development because of limited access to data that covers race, skin colour and similar attributes. An MIT Media Lab study found that facial-recognition systems from companies like IBM and Microsoft were 11-19 percent more accurate on lighter-skinned individuals.
Synthetic data could help overcome this challenge, as the training could focus on such variables while making use of real-world environments. Take the example above: synthetic data could help an algorithm diagnose melanoma in patients with darker skin tones, something previous algorithms have often failed to do.
Source: www.geneticliteracyproject.org
Hands-on use
Synthetic data already has a number of practical use cases. A group of researchers in Michigan has developed a computer vision model that supports pathologists in diagnosing brain tumours more accurately. Their challenge was that when they used brain scans from other institutions, the algorithm’s performance dropped, as it could not compare the different types of scans.
By using synthetic data trained on much larger datasets, their algorithm was “better able to learn what to look for in our pathology images”, explained Dr. Todd Hollon, neurosurgeon and principal investigator of the machine learning in neurosurgery laboratory at Michigan Medicine.
Synthetic data might not be the holy grail for solving all the issues healthcare A.I. programming poses. (Some even claim it cannot effectively address the privacy issues raised.) However, it can provide a wider scope for research and, in principle, add to the protection of privacy in medical data.
ICT Advisor I MIS & MEL I Data Science I AI & e-Gov
2 years ago
In this digital age, data privacy plays a pivotal role in stopping data theft. Any comments on this from the author?
Founder & Managing Director at Nexmed Healthcare Solutions
3 years ago
That's why a fundamental precursor to the design and implementation of AI solutions is that more hospitals become digital, and quickly. This allows additional data streams to become available for AI modelling and machine learning.
In terms of racial biases in the patient/hospital data, unfortunately, those cannot be mitigated with synthetic data. Whether intentionally or not, physicians may treat patients of different races differently. There are at least two ways to solve this problem: 1) De-biasing of algorithms (IBM and Microsoft): https://pc3i.upenn.edu/recognizing-and-addressing-bias-in-health-care-ai-algorithms/ 2) Being selective about and attentive to your data. Some datasets, by definition, contain fewer biases than other datasets. For example, a dataset of medical studies would be less biased than a dataset of patient visits - although it may also serve a different purpose.
Author of 'Advancing AI in Healthcare' | Healthcare AI Fraud Investigator
3 years ago
Synthetic data has been used in statistics for over 70 years in the jackknife (and later bootstrapping) resampling techniques. No statistician would call it fake data because it has all of the characteristics of the real data. For example, WellAI data scientists use bootstrapping for model training on medical vignettes. They also use GANs in developing their dermatology model: https://www.dhirubhai.net/posts/wellaillc_how-general-adversarial-networks-gans-will-activity-6876730155771981824-mF2l
--
3 years ago
Disgusting thing synthetic data and original data more importantly.