Synthetic Data for AI Models: A New Path Forward
The Department of Homeland Security’s Science and Technology Directorate recently issued a solicitation for synthetic data solutions that can “generate synthetic data that models and replicates the shape and patterns of real data, while safeguarding privacy.” By using synthetic data, agencies may be able to train machine learning models in cases where real-world data is unavailable or limited. Synthetic data could also be used where training AI/ML models on real data would risk violating privacy, security, or civil liberties. This post explains what synthetic data is, why it matters, its potential benefits and risks, and its usefulness in the federal space.
Overview and Background
Balancing the need for accurate data models with the need to preserve privacy and security is always a challenge. In many instances there is a tradeoff: we sacrifice some utility for security or vice versa. These challenges can be partially addressed by adding differential privacy to the datasets that train AI models, but at the government level, restrictions around sharing and using sensitive information often make differential privacy practices unfeasible.[1] Synthetic data is a way to bridge the gap between utility and security. So, what is synthetic data and how does it work?
The Mechanics of Synthetic Data
Synthetic data generation typically starts with a base dataset of actual historical events, transactions, and similar records, then creates a synthetic representation of that data and builds on it to produce a robust synthetic dataset. It is generated using advanced machine learning algorithms and generative AI techniques. The result is a dataset that is similar in structure, features, and characteristics to the real-world data it is modeled on. The following graphic shows the general process for synthetic data generation.
The process of synthetic data generation requires multiple steps as seen above and can be done using three main methods: machine-learning based models, agent-based models, and hand-engineered methods. Let’s take a closer look at one example of each of these methods.
Machine-learning based Model: Generative Adversarial Network (GAN):
With a GAN model, data is generated by a two-part neural network system: one network (the generator) produces new data while the other (the discriminator) evaluates that data and classifies it as real or synthetic. This method is used largely for time series data, images, and text data.
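To make the two-part setup concrete, here is a minimal, hypothetical sketch in PyTorch of a GAN for tabular data. It is not drawn from the DHS solicitation or any specific tool; the layer sizes, learning rates, and the `real_batch` input are illustrative assumptions.

```python
# A minimal GAN sketch in PyTorch for generating synthetic tabular data.
# Assumes batches of real records arrive as tensors of shape
# (batch_size, N_FEATURES), already scaled to roughly [-1, 1].
import torch
import torch.nn as nn

NOISE_DIM, N_FEATURES = 32, 8  # illustrative sizes, not from the source

# Generator: maps random noise to a synthetic record.
generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 64), nn.ReLU(),
    nn.Linear(64, N_FEATURES), nn.Tanh(),
)

# Discriminator: scores how "real" a record looks.
discriminator = nn.Sequential(
    nn.Linear(N_FEATURES, 64), nn.LeakyReLU(0.2),
    nn.Linear(64, 1),  # raw logit; BCEWithLogitsLoss applies the sigmoid
)

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
loss_fn = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    batch_size = real_batch.size(0)
    ones, zeros = torch.ones(batch_size, 1), torch.zeros(batch_size, 1)

    # 1) Train the discriminator to separate real from generated records.
    fake_batch = generator(torch.randn(batch_size, NOISE_DIM)).detach()
    d_loss = loss_fn(discriminator(real_batch), ones) + \
             loss_fn(discriminator(fake_batch), zeros)
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) Train the generator to fool the discriminator.
    fake_batch = generator(torch.randn(batch_size, NOISE_DIM))
    g_loss = loss_fn(discriminator(fake_batch), ones)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()

# After training, synthetic records are drawn by sampling noise:
# synthetic = generator(torch.randn(1000, NOISE_DIM))
```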
Agent-based Model: Market Simulation:
In the example of a Market Simulation, an agent-based model can simulate the behavior of individual traders (for which there is data available) in a financial market. The model will consider each trader’s strategies and risk preferences to create synthetic financial market data. This could be used for testing algorithms and risk management strategies.
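As a rough illustration of this idea, and not a reference to any particular market model, the sketch below simulates a set of hypothetical trader agents whose strategies and risk tolerances drive a synthetic price series; every parameter is an invented assumption.

```python
# A toy agent-based market simulation sketch (illustrative only).
# Each agent has a simple strategy and a risk tolerance; their combined
# orders move a synthetic price series that can be recorded as data.
import random

class Trader:
    def __init__(self, strategy, risk_tolerance):
        self.strategy = strategy              # "momentum" or "contrarian"
        self.risk_tolerance = risk_tolerance  # scales order size

    def order(self, last_return):
        # Momentum traders buy after gains; contrarians do the opposite.
        direction = 1 if last_return > 0 else -1
        if self.strategy == "contrarian":
            direction = -direction
        return direction * self.risk_tolerance * random.random()

def simulate(n_steps=250, n_traders=100, start_price=100.0):
    traders = [
        Trader(random.choice(["momentum", "contrarian"]),
               random.uniform(0.5, 2.0))
        for _ in range(n_traders)
    ]
    price, last_return, series = start_price, 0.0, []
    for _ in range(n_steps):
        net_demand = sum(t.order(last_return) for t in traders)
        # Price impact: net demand nudges the price, plus small noise.
        new_price = price * (1 + 0.0005 * net_demand + random.gauss(0, 0.005))
        last_return = (new_price - price) / price
        price = new_price
        series.append(price)
    return series  # a synthetic daily price history

synthetic_prices = simulate()
```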
Hand-Engineered Model: Rule-Based Data Generation:
Hand-engineered models typically use rules and algorithms to generate synthetic data. In this example, synthetic data is created based on a set of predefined rules and conditions designed by the data scientist. This works well when the data is well understood and can be represented using statistical modeling.
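A minimal example of what rule-based generation might look like in Python is shown below; the fields, distributions, and rules are invented purely for illustration.

```python
# A minimal rule-based synthetic data sketch (hypothetical rules).
# Records are drawn from assumed distributions and constrained by
# predefined rules a data scientist might encode about the domain.
import random

def make_record():
    age = random.randint(18, 90)
    # Rule: income follows a rough lognormal shape and tends to be
    # lower for the youngest and oldest age bands.
    income = random.lognormvariate(10.5, 0.5)
    if age < 25 or age > 70:
        income *= 0.6
    # Rule: account tenure can never exceed adult lifetime.
    tenure_years = random.randint(0, age - 18)
    return {"age": age, "income": round(income, 2), "tenure_years": tenure_years}

synthetic_dataset = [make_record() for _ in range(10_000)]
```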
Regardless of the method used, the outcome is a synthetic dataset based on the original dataset. The graphic below gives a visual representation of a sample original dataset and its synthetic counterpart.
Benefits and Risks of Synthetic Data
One of the key values of synthetic data is its ability to represent or predict hypothetical situations beyond what real-world data captures. Other benefits include bias reduction, increased data variety, and enhanced privacy and security, all of which can improve the overall value of AI systems. The below graphic describes some of these benefits in more detail.
Despite its long list of benefits, the use of synthetic data also comes with risks, including limited generalization, a lack of realism, and, in some industries, ethical concerns. Perhaps the biggest risk is the potential for model degradation over time.
Model degradation happens when AI models are trained on data generated by older models. New models can become too reliant on patterns that exist only in the generated data, and each generation of synthetic output becomes a lower-quality training set for the next. If the model is not periodically re-anchored to underlying real-world data, it may degrade quickly and can collapse over time. In cases where there is no real-world data to sync with, synthetic data alone may be the only option, and its benefits may outweigh the cost of model degradation. In these instances, it is important that the synthetic data be as accurate and robust as possible and updated frequently.
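One way to keep a model anchored to real-world data, sketched below with assumed helper names and an assumed mixing ratio, is to blend freshly sampled real records back into each synthetic training set rather than training on synthetic data alone.

```python
# A hedged sketch of one mitigation for model degradation: blend real
# records back in with synthetic ones each training cycle so the model
# is periodically re-anchored to ground truth. The 30% default mix
# ratio and the record lists are illustrative assumptions.
import random

def build_training_set(real_records, synthetic_records, real_fraction=0.3):
    """Mix real and synthetic records so real data makes up roughly
    `real_fraction` of the training set (when enough real data exists)."""
    n_real = int(len(synthetic_records) * real_fraction / (1 - real_fraction))
    anchored = random.sample(real_records, min(n_real, len(real_records)))
    mixed = synthetic_records + anchored
    random.shuffle(mixed)
    return mixed
```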
For synthetic data to provide the most benefit, both its potential advantages and its risks must be understood; the best applications of synthetic data will account for the risks associated with its use. This is particularly important when applying synthetic data practices in the federal government. Next, we will look at a couple of potential use cases for synthetic data in the federal government.
Synthetic Data in the Federal Government: Use Cases
On a federal level, synthetic data could prove particularly useful for agencies like the Cybersecurity and Infrastructure Security Agency (CISA), which could use it to develop realistic exercise scenarios and model cyber and physical environments. Another example of its usefulness is training autonomous systems.
If an autonomous system’s AI model were trained only on real datasets of historical or actual events, the system may not be able to identify possible outliers or handle hypothetical scenarios for which there is no real-world data, which could lead to system failure if and when such a scenario occurs. However, if that model were also trained on synthetic data, it could better predict appropriate outcomes for scenarios in which real-world data is limited or nonexistent.
Of course, there are other use cases for synthetic data, particularly in healthcare, disaster relief and preparedness, risk management, robotics, banking and finance, agriculture, and a number of others.
Conclusion
Incorporating more synthetic data into federal AI practices has the potential to bridge the gap between utility and security, and sectors across the federal space can benefit from its appropriate incorporation into their AI practices. The DHS solicitation is bound to bring about valuable ideas and technologies for integrating synthetic data into federal AI models, maximizing their potential while minimizing risk. We are eager to see how the government moves forward with harnessing the power of synthetic data generation.
SOURCES:
[1] Differential privacy refers to the process of adding noise to the result of a query about a dataset in order to mask information about any individual in that dataset. In an ML pipeline, noise can be inserted at a number of different points: into the training data, during training, on the trained model, or on the model outputs. Challenges arise, though, because adding more noise makes the query outputs less useful. Moreover, as the number of queries grows, the noise required to mask sensitive data approaches infinity, making this approach infeasible in many instances.
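As a simple illustration of the mechanism described in this note (a sketch of the standard Laplace mechanism, not any agency’s implementation), the snippet below answers a counting query with noise scaled to the query’s sensitivity and a privacy budget epsilon; the toy data and epsilon value are assumptions.

```python
# Laplace-mechanism sketch: noisy answer to a counting query.
import numpy as np

def private_count(records, predicate, epsilon=0.5):
    """Answer a counting query with Laplace noise of scale sensitivity/epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0  # adding or removing one person changes a count by at most 1
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example over a made-up list of ages: how many people are over 40?
ages = [23, 35, 41, 58, 62, 29, 47]
print(private_count(ages, lambda a: a > 40, epsilon=0.5))
# Smaller epsilon means more noise: stronger privacy but less useful answers,
# which is the utility/privacy tradeoff described above.
```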