The Good, Bad, and Ugly aspects of using Synthetic Image Data for continual self-improvement of Computer Vision Models
Harsha Srivatsa
I have worked on two "hard and wicked" problems where AI capabilities can be leveraged very well.
I read the excellent coverage of Synthetic Data in the paper "Best Practices and Lessons Learned on Synthetic Data for Language Models" [https://arxiv.org/pdf/2404.07503].
Towards the end of the paper, there is mention of future innovation work on an "Emergent self-improvement capability", with these interesting points:
- If a Vision ML model uses Generative AI capabilities to generate higher-quality data than its original training set, it could potentially bootstrap its own performance by iteratively learning from the enhanced synthetic data.
- This self-improvement capability could lead to the emergence of more advanced AI systems that can autonomously refine their skills and knowledge over time.
- Although recent work shows encouraging progress in this direction, the upper bound of self-improvement and the underlying reasons for its effectiveness remain open questions.
For my work on a Driver Drowsiness and Distraction Detection system, I set about designing an ML pipeline that uses two feedback-action loops. The first loop generates Synthetic Image Data of higher quality than the real data it was trained on, enabling the pipeline to continually improve its own data quality. In the second loop, the Machine Learning models are retrained on this higher-quality input data and report their output metrics back to the generation stage, so that each round produces still better Synthetic Data and the whole system becomes a self-improving mechanism.
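Conceptually, the pipeline looks something like the sketch below. This is a minimal illustration of the two loops, not my actual implementation; every name in it (generate, quality_score, train_on, evaluate) is a hypothetical placeholder rather than a real library API.

```python
# Minimal sketch of the two feedback-action loops described above.
# All object methods here (generate, quality_score, train_on, evaluate)
# are hypothetical placeholders, not an actual library API.

def run_self_improvement(real_data, generator, model, rounds=5,
                         quality_gate=0.8, f1_gate=0.9):
    """Loop 1 improves the synthetic data; Loop 2 retrains the model on it."""
    seed = real_data
    for r in range(rounds):
        # Loop 1: generate synthetic images, accept them only if they
        # score above the current quality gate.
        synthetic = generator.generate(seed)
        if generator.quality_score(synthetic) < quality_gate:
            continue  # regenerate next round instead of training on weak data

        # Loop 2: retrain the model and check its metrics against thresholds,
        # always scoring against real data to keep the loop honest.
        model.train_on(synthetic)
        metrics = model.evaluate(real_data)
        print(f"round {r}: {metrics}")
        if metrics["f1"] >= f1_gate:
            seed = synthetic  # feed the accepted batch back into generation
    return model
```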
However, there are good, bad, and ugly aspects to using Synthetic Data as I found out the hard way.
Good: The Vision Machine Learning models benefit from the higher-quality Synthetic Image Data and report better output metrics for Accuracy, Precision, Loss, Recall, and F1 Score.
Bad: With a continual diet of Synthetic Image Data, the ML models seem to get hooked on, or addicted to, it. Based on the programmatic thresholds for output metrics, they effectively reject real data and accept Synthetic Image Data only.
Ugly: Once real data is reintroduced, manually annotated and quality-checked samples seem to perform worse than before, even when they were part of prior training and evaluation data sets. Explainability also suffers: tracing the weights, biases, and hyperparameters back to the lineage of Synthetic versus Real Data became almost impossible.
The benefits of using Synthetic Image Data are:
- Addressing Data Scarcity and Privacy
- Cost-Effectiveness and Customization
- Enhancing Model Performance
- Overcoming the Simulation-to-Reality Gap
- Supplying adequate data for Machine Learning models, especially in Sparse Data Environments
However, as promising as the approach with Synthetic Image Data is, I have observed the following issues.
For the first feedback loop, i.e., generating image data that surpasses the quality of the real data the generator was trained on, the following issues were noted:
- Heavy dependence on the quality of the real training data. If the training data is not diverse or accurate enough, the synthetic data generated will likely inherit these limitations, making it difficult to surpass the quality of the original data.
- Substantial computing power is required for iterative generation and retraining.
- Synthetic data can inadvertently contain biases present in the training data. Ensuring that synthetic data is free from such biases and is truly representative is a significant challenge.
- Loss of detail and nuance. While synthetic data can replicate general patterns and structures found in real data, capturing the subtle nuances and unique outliers of real-world scenarios is challenging. This limitation can make synthetic data less effective for tasks where fine details are crucial.
- Verification and validation: Ensuring that synthetic data is accurate and reliable involves rigorous validation against real-world data. This process can be time-consuming and complex, particularly when the synthetic data is intended to be better than the real data it was modeled after. Verification steps are necessary to confirm that the synthetic data maintains fidelity to real-world conditions.
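As one concrete (and hedged) example of such validation, synthetic and real images can be compared through the statistics of their feature embeddings, in the style of the Fréchet Inception Distance. The sketch below assumes a pretrained feature extractor has already produced the two embedding arrays; that extractor is not shown.

```python
# Compare synthetic and real images via a Frechet-style distance between
# Gaussian fits of their feature embeddings. Lower is better. The embeddings
# are assumed to come from a pretrained backbone (e.g., Inception or CLIP).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats: np.ndarray, synth_feats: np.ndarray) -> float:
    """Both inputs are (n_samples, feature_dim) arrays of embeddings."""
    mu_r, mu_s = real_feats.mean(axis=0), synth_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_s = np.cov(synth_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_s)
    if np.iscomplexobj(covmean):  # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    return float(((mu_r - mu_s) ** 2).sum()
                 + np.trace(cov_r + cov_s - 2.0 * covmean))
```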
For the second feedback loop, i.e., using the high-quality Synthetic Data to self-improve model performance against thresholds for model output metrics, the following issues were noted:
Synthetic Data "Addiction": The Machine Learning models for Vision Intelligence might become "addicted" to Synthetic Image Data, potentially leading them to reject, or underperform on, Real Image Data.
Overfitting: If the synthetic data does not adequately capture the variability and complexity of real-world data, the model learns the details and noise in the training data (in this case, synthetic data) to an extent that it negatively impacts performance on new, unseen data (real-world data).
Thresholds and Performance Metrics: Models trained on synthetic data might develop different thresholds for classification metrics such as accuracy, precision, and recall. If these thresholds are optimized for the characteristics of synthetic data, they might not be appropriate for real data. This misalignment can lead to higher error rates when the model is applied to real data, as the decision boundaries or thresholds that worked well for synthetic data do not translate effectively to real-world data. I see this particularly in the Driver Drowsiness solution, where the Loss metric fluctuates considerably.
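One mitigation is to recalibrate the decision threshold on a held-out real-data split instead of carrying over the threshold tuned on synthetic data. A minimal sketch, assuming a binary drowsy/alert head that outputs probabilities:

```python
# Pick the classification threshold that maximizes F1 on real validation
# data, rather than reusing the threshold tuned on synthetic data.
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_threshold(y_true: np.ndarray, y_prob: np.ndarray) -> float:
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    # precision/recall have one more entry than thresholds, so drop the last.
    return float(thresholds[np.argmax(f1[:-1])])
```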
Generalization Issues: The primary risk of training exclusively or predominantly with Synthetic Data is that the model does not generalize well to Real Image data. This happens if the Synthetic Image Data is not sufficiently representative of the real-world scenarios the model will encounter. If the differences are significant, the model might effectively "reject" real data by performing poorly on it, as it has learned to recognize and respond to the patterns and noise specific to the synthetic data.
Domain adaptation: When the Vision Machine Learning Models are trained primarily on synthetic data, they are being optimized for the source domain (Synthetic Data) characteristics. If these characteristics do not adequately represent the real-world scenarios (target domain), the model may perform poorly when exposed to real data. This is a classic example of a domain shift, where the distribution of the training data (Synthetic) is different from that of the deployment data (Real).
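One cheap way to quantify such a shift is to train a probe classifier to tell synthetic embeddings from real ones: an AUC near 0.5 means the two domains largely overlap, while an AUC near 1.0 signals a strong shift. A sketch using scikit-learn, again assuming precomputed feature embeddings:

```python
# Domain-shift probe: if a simple classifier can separate synthetic from
# real embeddings, the two distributions differ substantially.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def domain_shift_auc(real_feats: np.ndarray, synth_feats: np.ndarray) -> float:
    X = np.vstack([real_feats, synth_feats])
    y = np.concatenate([np.zeros(len(real_feats)), np.ones(len(synth_feats))])
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             cv=5, scoring="roc_auc")
    return float(scores.mean())  # ~0.5 = domains overlap; ~1.0 = strong shift
```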
To address the above issues, I plan to study and implement the following mitigation strategies.
Integrate Real Data with Synthetic Data: I plan to use a combination of Synthetic Image and Real Image data during training to ensure the model learns to generalize across both types of data. I will be experimenting with different mix ratios.
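A sketch of how that mixing could be wired up in PyTorch, with the synthetic fraction exposed as a parameter to experiment with; the two dataset objects are assumed to exist already:

```python
# Mix real and synthetic samples at a configurable ratio per batch.
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def mixed_loader(real_ds, synth_ds, synth_ratio=0.5, batch_size=32):
    """Sample batches so that roughly synth_ratio of items are synthetic."""
    ds = ConcatDataset([real_ds, synth_ds])
    # Per-sample weights: real samples share (1 - ratio), synthetic share ratio.
    w_real = (1.0 - synth_ratio) / len(real_ds)
    w_synth = synth_ratio / len(synth_ds)
    weights = torch.tensor([w_real] * len(real_ds) + [w_synth] * len(synth_ds))
    sampler = WeightedRandomSampler(weights, num_samples=len(ds), replacement=True)
    return DataLoader(ds, batch_size=batch_size, sampler=sampler)
```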
Validate on Real Data: I plan to regularly validate the Machine Learning model's performance on curated, high-quality real-world image data to ensure it maintains high accuracy and generalizability. This serves as both a regression test and a validation step.
Domain Randomization: To address the domain adaptation issue, I plan to use techniques like domain randomization in synthetic data generation to introduce more variability, along with curriculum learning, which can help models generalize better to real-world data.
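As a rough illustration, randomization at generation time might look like the torchvision pipeline below; the specific parameter values are illustrative assumptions, not tuned settings:

```python
# Aggressively randomize lighting, pose, and blur on each synthetic frame
# so the model cannot latch onto renderer-specific regularities.
from torchvision import transforms

domain_randomize = transforms.Compose([
    transforms.ColorJitter(brightness=0.6, contrast=0.6, saturation=0.5, hue=0.1),
    transforms.RandomPerspective(distortion_scale=0.3, p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
    transforms.ToTensor(),
])
# Applied per synthetic PIL image before it enters the training set:
# tensor = domain_randomize(pil_image)
```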
Continuous Monitoring and Updating: With a focus on Real Image Data, I plan to continuously monitor the model's performance on real-world data and update the training dataset and model parameters as necessary to adapt to changes in real-world conditions and achieve desired thresholds for Model output metrics.
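A minimal sketch of what that monitoring could look like: keep a rolling window of real-data F1 scores and flag retraining once the window average drops below a floor. The window size and floor here are illustrative assumptions:

```python
# Rolling-window monitor over real-data F1; flags when retraining is due.
from collections import deque

class MetricMonitor:
    def __init__(self, window=20, f1_floor=0.85):
        self.scores = deque(maxlen=window)
        self.f1_floor = f1_floor

    def update(self, f1_on_real: float) -> bool:
        """Record a new real-data F1 score; return True if retraining is due."""
        self.scores.append(f1_on_real)
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and sum(self.scores) / len(self.scores) < self.f1_floor
```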
I further plan to explore and understand innovative approaches to mitigating the risks of Synthetic Data addiction, such as Google Project Dreambooth, MIT research on synthetic data generation [https://news.mit.edu/2022/synthetic-datasets-ai-image-classification-0315], and the use of GANs [https://www.dhirubhai.net/pulse/generative-ai-synthetic-data-changing-landscape-aruna-pattam/].
I will report on how these experiments go in a follow-up post.