The Good, Bad, and Ugly aspects of using Synthetic Image Data for continual self-improvement of Computer Vision Models
Harsha Srivatsa
I have worked on two "hard and wicked" problems where AI capabilities can be leveraged very well.
I read the excellent coverage of Synthetic Data in the paper "Best Practices and Lessons Learned on Synthetic Data for Language Models" [https://arxiv.org/pdf/2404.07503].
Towards the end of the paper, there is mention of future innovation work on an "Emergent self-improvement capability", with these interesting points:
- If a Vision ML model uses Generative AI capabilities to generate higher-quality data than its original training set, it could potentially bootstrap its own performance by iteratively learning from the enhanced synthetic data.
- This self-improvement capability could lead to the emergence of more advanced AI systems that can autonomously refine their skills and knowledge over time.
- Although recent work shows encouraging progress in this direction, the upper bound of self-improvement and the underlying reasons for its effectiveness remain open questions.
For my work on a Driver Drowsiness and Distraction Detection system, I set about designing an ML pipeline that uses two feedback-action loops. The first loop generates Synthetic Image Data of higher quality than the real data it was trained on, enabling the pipeline to continually improve its own data quality. In the second loop, the Machine Learning models are retrained on this higher-quality input data and report their output metrics back to the generation stage, so that each round produces still better Synthetic Data and the whole system becomes a self-improving mechanism.
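Conceptually, the pipeline looks something like the sketch below. This is a minimal illustration of the two loops, not my actual implementation; every name in it (generate, quality_score, train_on, evaluate) is a hypothetical placeholder rather than a real library API.

```python
# Minimal sketch of the two feedback-action loops described above.
# All object methods here (generate, quality_score, train_on, evaluate)
# are hypothetical placeholders, not an actual library API.

def run_self_improvement(real_data, generator, model, rounds=5,
                         quality_gate=0.8, f1_gate=0.9):
    """Loop 1 improves the synthetic data; Loop 2 retrains the model on it."""
    seed = real_data
    for r in range(rounds):
        # Loop 1: generate synthetic images, accept them only if they
        # score above the current quality gate.
        synthetic = generator.generate(seed)
        if generator.quality_score(synthetic) < quality_gate:
            continue  # regenerate next round instead of training on weak data

        # Loop 2: retrain the model and check its metrics against thresholds,
        # always scoring against real data to keep the loop honest.
        model.train_on(synthetic)
        metrics = model.evaluate(real_data)
        print(f"round {r}: {metrics}")
        if metrics["f1"] >= f1_gate:
            seed = synthetic  # feed the accepted batch back into generation
    return model
```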
However, there are good, bad, and ugly aspects to using Synthetic Data as I found out the hard way.
Good: The Vision Machine Learning models benefit from the higher-quality Synthetic Image Data and report better output metrics for Accuracy, Precision, Loss, Recall, and F1 Score.
Bad: With a continual diet of Synthetic Image Data, the ML models seem to get hooked on, or addicted to, it. Based on the programmatic thresholds for output metrics, they effectively reject real data and accept Synthetic Image Data only.
Ugly: Once real data is reintroduced, manually annotated and quality-checked samples seem to perform worse than before, even when they were part of prior training and evaluation data sets. Explainability also suffers: tracing the weights, biases, and hyperparameters back to the lineage of Synthetic versus Real Data became almost impossible.
The benefits of using Synthetic Image Data are:
- Addressing Data Scarcity and Privacy
- Cost-Effectiveness and Customization
- Enhancing Model Performance
- Overcoming the Simulation-to-Reality Gap
- Supplying adequate data for Machine Learning models, especially in Sparse Data Environments
However, as promising as the approach with Synthetic Image Data is, I have observed the following issues.
For the first feedback loop, i.e., generating image data that surpasses the quality of the real data the generator was trained on, the following issues were noted:
- Heavy dependence on the quality of the real training data. If the training data is not diverse or accurate enough, the synthetic data generated will likely inherit these limitations, making it difficult to surpass the quality of the original data.
- Substantial computing power is required for iterative generation and retraining.
- Synthetic data can inadvertently contain biases present in the training data. Ensuring that synthetic data is free from such biases and is truly representative is a significant challenge.
- Loss of detail and nuance. While synthetic data can replicate general patterns and structures found in real data, capturing the subtle nuances and unique outliers of real-world scenarios is challenging. This limitation can make synthetic data less effective for tasks where fine details are crucial.
- Verification and validation: Ensuring that synthetic data is accurate and reliable involves rigorous validation against real-world data. This process can be time-consuming and complex, particularly when the synthetic data is intended to be better than the real data it was modeled after. Verification steps are necessary to confirm that the synthetic data maintains fidelity to real-world conditions.
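As one concrete (and hedged) example of such validation, synthetic and real images can be compared through the statistics of their feature embeddings, in the style of the Fréchet Inception Distance. The sketch below assumes a pretrained feature extractor has already produced the two embedding arrays; that extractor is not shown.

```python
# Compare synthetic and real images via a Frechet-style distance between
# Gaussian fits of their feature embeddings. Lower is better. The embeddings
# are assumed to come from a pretrained backbone (e.g., Inception or CLIP).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats: np.ndarray, synth_feats: np.ndarray) -> float:
    """Both inputs are (n_samples, feature_dim) arrays of embeddings."""
    mu_r, mu_s = real_feats.mean(axis=0), synth_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_s = np.cov(synth_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_s)
    if np.iscomplexobj(covmean):  # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    return float(((mu_r - mu_s) ** 2).sum()
                 + np.trace(cov_r + cov_s - 2.0 * covmean))
```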
For the second feedback loop, i.e., using the high-quality Synthetic Data to self-improve model performance against thresholds for model output metrics, the following issues were noted:
Synthetic Data "Addiction": The Machine Learning models for Vision Intelligence might become "addicted" to Synthetic Image Data, potentially leading them to reject, or underperform on, Real Image Data.
Overfitting: If the synthetic data does not adequately capture the variability and complexity of real-world data, the model learns the details and noise in the training data (in this case, synthetic data) to an extent that it negatively impacts performance on new, unseen data (real-world data).
Thresholds and Performance Metrics: Models trained on synthetic data might develop different thresholds for classification metrics such as accuracy, precision, and recall. If these thresholds are optimized for the characteristics of synthetic data, they might not be appropriate for real data. This misalignment can lead to higher error rates when the model is applied to real data, as the decision boundaries or thresholds that worked well for synthetic data do not translate effectively to real-world data. I see this particularly in the Driver Drowsiness solution, where the Loss metric fluctuates considerably.
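One mitigation is to recalibrate the decision threshold on a held-out real-data split instead of carrying over the threshold tuned on synthetic data. A minimal sketch, assuming a binary drowsy/alert head that outputs probabilities:

```python
# Pick the classification threshold that maximizes F1 on real validation
# data, rather than reusing the threshold tuned on synthetic data.
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_threshold(y_true: np.ndarray, y_prob: np.ndarray) -> float:
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    # precision/recall have one more entry than thresholds, so drop the last.
    return float(thresholds[np.argmax(f1[:-1])])
```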
Generalization Issues: The primary risk of training exclusively or predominantly with Synthetic Data is that the model does not generalize well to Real Image data. This happens if the Synthetic Image Data is not sufficiently representative of the real-world scenarios the model will encounter. If the differences are significant, the model might effectively "reject" real data by performing poorly on it, as it has learned to recognize and respond to the patterns and noise specific to the synthetic data.
Domain adaptation: When the Vision Machine Learning Models are trained primarily on synthetic data, they are being optimized for the source domain (Synthetic Data) characteristics. If these characteristics do not adequately represent the real-world scenarios (target domain), the model may perform poorly when exposed to real data. This is a classic example of a domain shift, where the distribution of the training data (Synthetic) is different from that of the deployment data (Real).
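One cheap way to quantify such a shift is to train a probe classifier to tell synthetic embeddings from real ones: an AUC near 0.5 means the two domains largely overlap, while an AUC near 1.0 signals a strong shift. A sketch using scikit-learn, again assuming precomputed feature embeddings:

```python
# Domain-shift probe: if a simple classifier can separate synthetic from
# real embeddings, the two distributions differ substantially.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def domain_shift_auc(real_feats: np.ndarray, synth_feats: np.ndarray) -> float:
    X = np.vstack([real_feats, synth_feats])
    y = np.concatenate([np.zeros(len(real_feats)), np.ones(len(synth_feats))])
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             cv=5, scoring="roc_auc")
    return float(scores.mean())  # ~0.5 = domains overlap; ~1.0 = strong shift
```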
To address the above issues, I plan to study and implement the following mitigation strategies.
Integrate Real Data with Synthetic Data: I plan to use a combination of Synthetic Image and Real Image data during training to ensure the model learns to generalize across both types of data. I will be experimenting with different mix ratios.
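A sketch of how that mixing could be wired up in PyTorch, with the synthetic fraction exposed as a parameter to experiment with; the two dataset objects are assumed to exist already:

```python
# Mix real and synthetic samples at a configurable ratio per batch.
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def mixed_loader(real_ds, synth_ds, synth_ratio=0.5, batch_size=32):
    """Sample batches so that roughly synth_ratio of items are synthetic."""
    ds = ConcatDataset([real_ds, synth_ds])
    # Per-sample weights: real samples share (1 - ratio), synthetic share ratio.
    w_real = (1.0 - synth_ratio) / len(real_ds)
    w_synth = synth_ratio / len(synth_ds)
    weights = torch.tensor([w_real] * len(real_ds) + [w_synth] * len(synth_ds))
    sampler = WeightedRandomSampler(weights, num_samples=len(ds), replacement=True)
    return DataLoader(ds, batch_size=batch_size, sampler=sampler)
```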
Validate on Real Data: I plan to regularly validate the Machine Learning model's performance on curated, high-quality real-world image data to ensure it maintains high accuracy and generalizability. This serves as both a regression test and a validation step.
Domain Randomization: To address the domain adaptation issue, I plan to use techniques like domain randomization in synthetic data generation to introduce more variability, along with curriculum learning, which can help models generalize better to real-world data.
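As a rough illustration, randomization at generation time might look like the torchvision pipeline below; the specific parameter values are illustrative assumptions, not tuned settings:

```python
# Aggressively randomize lighting, pose, and blur on each synthetic frame
# so the model cannot latch onto renderer-specific regularities.
from torchvision import transforms

domain_randomize = transforms.Compose([
    transforms.ColorJitter(brightness=0.6, contrast=0.6, saturation=0.5, hue=0.1),
    transforms.RandomPerspective(distortion_scale=0.3, p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
    transforms.ToTensor(),
])
# Applied per synthetic PIL image before it enters the training set:
# tensor = domain_randomize(pil_image)
```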
Continuous Monitoring and Updating: With a focus on Real Image Data, I plan to continuously monitor the model's performance on real-world data and update the training dataset and model parameters as necessary to adapt to changes in real-world conditions and achieve desired thresholds for Model output metrics.
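A minimal sketch of what that monitoring could look like: keep a rolling window of real-data F1 scores and flag retraining once the window average drops below a floor. The window size and floor here are illustrative assumptions:

```python
# Rolling-window monitor over real-data F1; flags when retraining is due.
from collections import deque

class MetricMonitor:
    def __init__(self, window=20, f1_floor=0.85):
        self.scores = deque(maxlen=window)
        self.f1_floor = f1_floor

    def update(self, f1_on_real: float) -> bool:
        """Record a new real-data F1 score; return True if retraining is due."""
        self.scores.append(f1_on_real)
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and sum(self.scores) / len(self.scores) < self.f1_floor
```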
I further plan to explore and understand innovative approaches to mitigating the risks of Synthetic Data addiction, such as Google Project Dreambooth, MIT research on synthetic data generation [https://news.mit.edu/2022/synthetic-datasets-ai-image-classification-0315], and the use of GANs [https://www.dhirubhai.net/pulse/generative-ai-synthetic-data-changing-landscape-aruna-pattam/].
I will report on how these experiments go in a follow-up post.