Synthetic Data for AI Models: A New Path Forward
The Department of Homeland Security’s Science and Technology Directorate recently issued a solicitation for synthetic data solutions that can “generate synthetic data that models and replicates the shape and patterns of real data, while safeguarding privacy.” By using synthetic data, agencies may be able to train machine learning models in cases where real-world data is unavailable or limited. Synthetic data could also be used where training AI/ML models on real data would risk violating privacy, security, or civil liberties. This post explains what synthetic data is, why it matters, its potential benefits and risks, and its usefulness in the federal space.
Overview and Background
Balancing the need for accurate data models with the need to preserve privacy and security is always a challenge. In many instances there is a tradeoff: we sacrifice some utility for security or vice versa. These challenges can be partially addressed by adding differential privacy to the datasets that train AI models, but at the government level, restrictions around sharing and using sensitive information often make differential privacy practices unfeasible.[1] Synthetic data is a way to bridge the gap between utility and security. So, what is synthetic data and how does it work?
The Mechanics of Synthetic Data
Synthetic data generation typically starts with a base dataset of actual historical events, transactions, and similar records, then creates a synthetic representation of that data and builds on it to produce a robust synthetic dataset. It is generated using advanced machine learning algorithms and generative AI techniques. The result is a dataset that is similar in structure, features, and characteristics to the real-world data it is modeled on. The following graphic shows the general process for synthetic data generation.
The process of synthetic data generation requires multiple steps as seen above and can be done using three main methods: machine-learning based models, agent-based models, and hand-engineered methods. Let’s take a closer look at one example of each of these methods.
Machine-learning based Model: Generative Adversarial Network (GAN):
With a GAN model, data is generated by a two-part neural network system: one network (the generator) produces new data while the other (the discriminator) evaluates that data and classifies it as real or synthetic. This method is used largely for time series data, images, and text data.
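To make the two-part setup concrete, here is a minimal, hypothetical sketch in PyTorch of a GAN for tabular data. It is not drawn from the DHS solicitation or any specific tool; the layer sizes, learning rates, and the `real_batch` input are illustrative assumptions.

```python
# A minimal GAN sketch in PyTorch for generating synthetic tabular data.
# Assumes batches of real records arrive as tensors of shape
# (batch_size, N_FEATURES), already scaled to roughly [-1, 1].
import torch
import torch.nn as nn

NOISE_DIM, N_FEATURES = 32, 8  # illustrative sizes, not from the source

# Generator: maps random noise to a synthetic record.
generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 64), nn.ReLU(),
    nn.Linear(64, N_FEATURES), nn.Tanh(),
)

# Discriminator: scores how "real" a record looks.
discriminator = nn.Sequential(
    nn.Linear(N_FEATURES, 64), nn.LeakyReLU(0.2),
    nn.Linear(64, 1),  # raw logit; BCEWithLogitsLoss applies the sigmoid
)

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
loss_fn = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    batch_size = real_batch.size(0)
    ones, zeros = torch.ones(batch_size, 1), torch.zeros(batch_size, 1)

    # 1) Train the discriminator to separate real from generated records.
    fake_batch = generator(torch.randn(batch_size, NOISE_DIM)).detach()
    d_loss = loss_fn(discriminator(real_batch), ones) + \
             loss_fn(discriminator(fake_batch), zeros)
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) Train the generator to fool the discriminator.
    fake_batch = generator(torch.randn(batch_size, NOISE_DIM))
    g_loss = loss_fn(discriminator(fake_batch), ones)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()

# After training, synthetic records are drawn by sampling noise:
# synthetic = generator(torch.randn(1000, NOISE_DIM))
```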
Agent-based Model: Market Simulation:
In the example of a Market Simulation, an agent-based model can simulate the behavior of individual traders (for which there is data available) in a financial market. The model will consider each trader’s strategies and risk preferences to create synthetic financial market data. This could be used for testing algorithms and risk management strategies.
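As a rough illustration of this idea, and not a reference to any particular market model, the sketch below simulates a set of hypothetical trader agents whose strategies and risk tolerances drive a synthetic price series; every parameter is an invented assumption.

```python
# A toy agent-based market simulation sketch (illustrative only).
# Each agent has a simple strategy and a risk tolerance; their combined
# orders move a synthetic price series that can be recorded as data.
import random

class Trader:
    def __init__(self, strategy, risk_tolerance):
        self.strategy = strategy              # "momentum" or "contrarian"
        self.risk_tolerance = risk_tolerance  # scales order size

    def order(self, last_return):
        # Momentum traders buy after gains; contrarians do the opposite.
        direction = 1 if last_return > 0 else -1
        if self.strategy == "contrarian":
            direction = -direction
        return direction * self.risk_tolerance * random.random()

def simulate(n_steps=250, n_traders=100, start_price=100.0):
    traders = [
        Trader(random.choice(["momentum", "contrarian"]),
               random.uniform(0.5, 2.0))
        for _ in range(n_traders)
    ]
    price, last_return, series = start_price, 0.0, []
    for _ in range(n_steps):
        net_demand = sum(t.order(last_return) for t in traders)
        # Price impact: net demand nudges the price, plus small noise.
        new_price = price * (1 + 0.0005 * net_demand + random.gauss(0, 0.005))
        last_return = (new_price - price) / price
        price = new_price
        series.append(price)
    return series  # a synthetic daily price history

synthetic_prices = simulate()
```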
Hand-Engineered Model: Rule-Based Data Generation:
Hand-engineered models typically use rules and algorithms to generate synthetic data. In this example, synthetic data is created based on a set of predefined rules and conditions designed by the data scientist. This works well when the data is well understood and can be represented using statistical modeling.
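A minimal example of what rule-based generation might look like in Python is shown below; the fields, distributions, and rules are invented purely for illustration.

```python
# A minimal rule-based synthetic data sketch (hypothetical rules).
# Records are drawn from assumed distributions and constrained by
# predefined rules a data scientist might encode about the domain.
import random

def make_record():
    age = random.randint(18, 90)
    # Rule: income follows a rough lognormal shape and tends to be
    # lower for the youngest and oldest age bands.
    income = random.lognormvariate(10.5, 0.5)
    if age < 25 or age > 70:
        income *= 0.6
    # Rule: account tenure can never exceed adult lifetime.
    tenure_years = random.randint(0, age - 18)
    return {"age": age, "income": round(income, 2), "tenure_years": tenure_years}

synthetic_dataset = [make_record() for _ in range(10_000)]
```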
Regardless of the method used, the outcome is a synthetic dataset based on the original dataset. The graphic below gives a visual representation of a sample original dataset and its synthetic counterpart.
Benefits and Risks of Synthetic Data
One of the key values of synthetic data is its ability to represent or predict hypothetical situations beyond what real-world data captures. Other benefits include bias reduction, increased data variety, and enhanced privacy and security, all of which can improve the overall value of AI systems. The below graphic describes some of these benefits in more detail.
Despite its long list of benefits, the use of synthetic data also comes with risks, including limited generalization, a lack of realism, and, in some industries, ethical concerns. Perhaps the biggest risk is the potential for model degradation over time.
Model degradation happens when AI models are trained on data generated by older models. New models can become too reliant on patterns that exist only in the generated data, and each generation of synthetic output becomes a lower-quality training set for the next. If the model is not periodically re-anchored to underlying real-world data, it may degrade quickly and can collapse over time. In cases where there is no real-world data to sync with, synthetic data alone may be the only option, and its benefits may outweigh the cost of model degradation. In these instances, it is important that the synthetic data be as accurate and robust as possible and updated frequently.
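One way to keep a model anchored to real-world data, sketched below with assumed helper names and an assumed mixing ratio, is to blend freshly sampled real records back into each synthetic training set rather than training on synthetic data alone.

```python
# A hedged sketch of one mitigation for model degradation: blend real
# records back in with synthetic ones each training cycle so the model
# is periodically re-anchored to ground truth. The 30% default mix
# ratio and the record lists are illustrative assumptions.
import random

def build_training_set(real_records, synthetic_records, real_fraction=0.3):
    """Mix real and synthetic records so real data makes up roughly
    `real_fraction` of the training set (when enough real data exists)."""
    n_real = int(len(synthetic_records) * real_fraction / (1 - real_fraction))
    anchored = random.sample(real_records, min(n_real, len(real_records)))
    mixed = synthetic_records + anchored
    random.shuffle(mixed)
    return mixed
```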
For synthetic data to provide the most benefit, both its potential advantages and its risks must be understood; the best applications of synthetic data will account for the risks associated with its use. This is particularly important when applying synthetic data practices in the federal government. Next, we will look at a couple of potential use cases for synthetic data in the federal government.
Synthetic Data in the Federal Government: Use Cases
On a federal level, synthetic data could prove particularly useful for agencies like the Cybersecurity and Infrastructure Security Agency (CISA), which could use it to develop realistic exercise scenarios and model cyber and physical environments. Another example of its usefulness is training autonomous systems.
If an autonomous system’s AI model were trained only on real datasets of historical or actual events, the system may not be able to identify possible outliers or handle hypothetical scenarios for which there is no real-world data, which could lead to system failure if and when such a scenario occurs. However, if that model were also trained on synthetic data, it could better predict appropriate outcomes for scenarios in which real-world data is limited or nonexistent.
Of course, there are other use cases for synthetic data, particularly in healthcare, disaster relief and preparedness, risk management, robotics, banking and finance, agriculture, and a number of others.
Conclusion
Incorporating more synthetic data into federal AI practices has the potential to bridge the gap between utility and security, and sectors across the federal space can benefit from its appropriate incorporation into their AI practices. The DHS solicitation is bound to bring about valuable ideas and technologies for integrating synthetic data into federal AI models, maximizing their potential while minimizing risk. We are eager to see how the government moves forward with harnessing the power of synthetic data generation.
SOURCES:
[1] Differential privacy refers to the process of adding noise to the result of a query about a dataset in order to mask information about any individual in that dataset. In an ML pipeline, noise can be inserted at a number of different points: into the training data, during training, on the trained model, or on the model outputs. Challenges arise, though, because adding more noise makes the query outputs less useful. Moreover, as the number of queries grows, the noise required to mask sensitive data approaches infinity, making this approach infeasible in many instances.
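As a simple illustration of the mechanism described in this note (a sketch of the standard Laplace mechanism, not any agency’s implementation), the snippet below answers a counting query with noise scaled to the query’s sensitivity and a privacy budget epsilon; the toy data and epsilon value are assumptions.

```python
# Laplace-mechanism sketch: noisy answer to a counting query.
import numpy as np

def private_count(records, predicate, epsilon=0.5):
    """Answer a counting query with Laplace noise of scale sensitivity/epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0  # adding or removing one person changes a count by at most 1
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example over a made-up list of ages: how many people are over 40?
ages = [23, 35, 41, 58, 62, 29, 47]
print(private_count(ages, lambda a: a > 40, epsilon=0.5))
# Smaller epsilon means more noise: stronger privacy but less useful answers,
# which is the utility/privacy tradeoff described above.
```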