Swimming through a quagmire of synthetic garbage data
Justin Lyon
CEO of Simudyne, which helps institutions solve complex problems and make better decisions. Barclays Techstars '17
Exploring Synthetic Data Generation
The growing need for synthetic data has led to the development of various techniques to generate realistic and privacy-preserving datasets. These techniques include:
1. Rule-based
2. Statistical model-based
3. Deep generative models
4. Simulation- and agent-based models
Rule-based
The simplest approach to generating synthetic data is to use explicit, hard-coded rules and conditions. This rule-based approach relies on predefined logic and decision-making processes to mimic the behaviour of real-world data, and is useful when the data generation process is well understood and can be accurately represented using deterministic rules. For example, image augmentation techniques generate synthetic data by rotating, translating, and adding noise to images; training on augmented images has been shown to improve accuracy in certain computer vision tasks.
Rule-based approaches can also incorporate more advanced algorithms and techniques to enhance the realism and complexity of the generated data. This may involve incorporating machine learning or expert systems to develop intelligent rules that adapt and evolve based on the input data. For example, an intelligent rule-based method could augment images to represent different lighting scenarios, using models that capture how shadows change under different light, making a computer vision model robust to low-light environments. Intelligent rule-based methods are effective when the data generation process involves complex relationships and patterns that can be learned from existing data.
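To make the idea concrete, here is a minimal sketch of rule-based image augmentation in Python using NumPy. The specific rules (a random flip, a small wrap-around shift, additive noise) and the `augment` helper are illustrative choices, not a prescribed recipe.

```python
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply simple rule-based augmentations to a 2-D grayscale image."""
    out = image.copy()
    # Rule 1: random horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        out = np.fliplr(out)
    # Rule 2: small random translation (wrap-around shift of up to 3 pixels).
    shift = rng.integers(-3, 4, size=2)
    out = np.roll(out, shift=tuple(shift), axis=(0, 1))
    # Rule 3: additive Gaussian pixel noise, clipped to the valid range.
    out = out + rng.normal(0.0, 0.05, size=out.shape)
    return np.clip(out, 0.0, 1.0)

# Example: build an augmented training set from a batch of (stand-in) real images.
rng = np.random.default_rng(0)
real_images = rng.random((8, 28, 28))
augmented = np.stack([augment(img, rng) for img in real_images])
```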
Statistical model-based
Rather than relying on hard-coded rules, synthetic data can be generated by looking at the statistical properties of the underlying dataspace. This statistical model-based approach to data generation leverages statistical models to generate data that closely resembles the statistical properties and patterns observed in real-world data. It involves fitting a statistical model, such as a Gaussian mixture model or Gaussian process, to the existing data and then sampling from the approximated distributions to give new data points that exhibit similar characteristics. Statistical model-based methods are useful when the data generation process can be described using probabilistic distributions and when capturing the statistical properties of the data is crucial.
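A minimal sketch of this fit-then-sample workflow with scikit-learn's `GaussianMixture` is shown below; the two-column stand-in dataset and the number of mixture components are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in for a real two-column dataset (e.g. transaction amount and duration).
real_data = np.column_stack([
    rng.lognormal(3.0, 0.5, size=1_000),
    rng.normal(60.0, 15.0, size=1_000),
])

# Fit a Gaussian mixture model to the observed data...
gmm = GaussianMixture(n_components=5, random_state=0).fit(real_data)

# ...then sample new synthetic points from the approximated distribution.
synthetic_data, _ = gmm.sample(n_samples=2_000)
```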
Deep generative models
Often, the underlying dataspace is high-dimensional. Deep learning models, built around artificial neural networks, are well suited to high-dimensional datasets, and several have proven highly successful as generative models. These include variational auto-encoders (VAEs), normalising flows (NFs), deep diffusion models (DDMs), and generative adversarial networks (GANs). There are significant differences between these methods, including whether they rely on learning a probabilistic representation of the dataspace (as with VAEs and NFs) and whether the approximated dataspace is accessible or implicit (as with GANs). For example, GANs typically consist of two components: a generator and a discriminator. The generator learns to produce synthetic data that resembles the real data, while the discriminator learns to distinguish between real and synthetic data. The two are trained simultaneously in a competitive manner, improving the quality and realism of the generated data over time. GANs are powerful for generating complex, high-dimensional data with fine-grained details. However, the generator does not explicitly model the underlying distribution of the data; instead it is trained to reproduce samples similar to the training data.
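The adversarial setup can be sketched in a few lines of PyTorch. The network sizes, learning rates, and the random tensor standing in for real data are all illustrative assumptions; a practical GAN would add batching, normalisation, and careful hyperparameter tuning.

```python
import torch
from torch import nn

latent_dim, data_dim = 16, 8   # hypothetical sizes for a small tabular dataset

# Generator: maps random noise to synthetic samples.
generator = nn.Sequential(
    nn.Linear(latent_dim, 64), nn.ReLU(),
    nn.Linear(64, data_dim),
)
# Discriminator: scores how "real" a sample looks.
discriminator = nn.Sequential(
    nn.Linear(data_dim, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

real_batch = torch.randn(128, data_dim)          # stand-in for real data
for step in range(1_000):
    # Train the discriminator to separate real from fake samples.
    fake_batch = generator(torch.randn(128, latent_dim)).detach()
    d_loss = bce(discriminator(real_batch), torch.ones(128, 1)) + \
             bce(discriminator(fake_batch), torch.zeros(128, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Train the generator to fool the discriminator.
    fake_batch = generator(torch.randn(128, latent_dim))
    g_loss = bce(discriminator(fake_batch), torch.ones(128, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

# After training, draw synthetic samples directly from the generator.
synthetic = generator(torch.randn(1_000, latent_dim)).detach()
```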
On the other hand, explicit models attempt to model the entire dataspace. For example, autoencoders use an encoder-decoder structure to ‘encode’ the data into a low-dimensional space, often called the latent space, and then ‘decode’ the latent representation back into the original data point as accurately as possible. Sampling from the latent space allows new synthetic data to be generated. Other methods learn probabilistic representations (such as VAEs) or aim to faithfully represent the data-generating process rather than the dataspace (such as DDMs). Deep generative models have attracted significant attention in recent years due to their aptitude for generating convincing images, such as the DALL-E models, and text, such as the GPT models. However, they are typically ‘black boxes’, or ‘rule-free’: it is usually not possible to explain why a particular synthetic example was produced or whether it is factually accurate. This raises significant challenges as these models are increasingly relied on in production environments, for example in real-time fraud detection.
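The encode–sample–decode idea can be sketched with a plain autoencoder in PyTorch (a VAE would add a probabilistic latent space and a KL regularisation term); the dimensions, training loop, and stand-in data are illustrative assumptions.

```python
import torch
from torch import nn

data_dim, latent_dim = 8, 2   # hypothetical sizes

# Encoder compresses a data point into a low-dimensional latent code;
# the decoder reconstructs the original point from that code.
encoder = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))

opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)
real_data = torch.randn(512, data_dim)            # stand-in for real data

for _ in range(1_000):
    # Train by minimising the reconstruction error through the latent bottleneck.
    recon = decoder(encoder(real_data))
    loss = nn.functional.mse_loss(recon, real_data)
    opt.zero_grad(); loss.backward(); opt.step()

# Generate new synthetic points by sampling latent codes and decoding them.
latent_samples = torch.randn(100, latent_dim)
synthetic = decoder(latent_samples).detach()
```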
Deep generative models are well suited to generating tabular synthetic data because of their ability to capture complex data distributions. By training on real payment transaction data, a model such as a GAN or VAE can generate synthetic data that closely resembles the original dataset. However, caution must be exercised to protect private customer data: the real data should be properly anonymized and sanitized, differential privacy techniques applied where appropriate, and the model's ability to preserve privacy rigorously evaluated. Furthermore, the generated data should be carefully checked to ensure that it is representative of the training data, does not contain hidden biases, and is coherent.
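The output checks mentioned above can start very simply. The sketch below, with a hypothetical `compare_marginals` helper, compares column means, standard deviations, and pairwise correlations between the real and synthetic tables; a production pipeline would add distributional tests, bias checks, and privacy audits on top.

```python
import numpy as np

def compare_marginals(real: np.ndarray, synthetic: np.ndarray, names: list[str]) -> None:
    """Quick sanity check that synthetic columns track the real data's
    means, standard deviations, and pairwise correlations."""
    for i, name in enumerate(names):
        print(f"{name:>12}: real mean={real[:, i].mean():8.2f} "
              f"synthetic mean={synthetic[:, i].mean():8.2f} "
              f"real std={real[:, i].std():8.2f} "
              f"synthetic std={synthetic[:, i].std():8.2f}")
    # Largest absolute gap between the real and synthetic correlation matrices.
    corr_gap = np.abs(np.corrcoef(real, rowvar=False) -
                      np.corrcoef(synthetic, rowvar=False)).max()
    print(f"max correlation difference: {corr_gap:.3f}")
```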
Simulation- and agent-based models (ABM)
Unlike deep generative models, simulation models use expert understanding of the underlying data generation process to reproduce the data-generation environment. For example, physics simulators use equations that describe the motion of fluid particles to generate realistic water effects, as used in CGI. One sub-class of simulators is agent-based models, which simulate the behaviour and interactions of autonomous agents within a given environment. These models capture individual-level decision-making processes, agent attributes, and the interactions among agents from a set of pre-defined rules. ABMs are particularly useful when the focus is on understanding emergent system-level behaviours that arise from the interactions of individual agents. They excel at capturing the temporal and dynamic aspects of real-world systems, making them valuable for generating synthetic data that reflects complex relationships and interdependencies, whilst remaining interpretable, explainable, and reliable.
Agent-Based Models (ABMs) can also generate tabular synthetic data for a payment processor. ABMs offer the advantage of capturing the dynamic behaviour of the payment system, including fraud detection, transaction volumes, and network effects. To protect private customer data, ABMs can be designed with privacy-enhancing mechanisms, such as using synthetic identities and transaction profiles. Additionally, careful consideration must be given to the calibration of agent behaviours and interaction rules to ensure the generated data does not reveal sensitive information and that the synthetic data realistically reproduces known features in the data.
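A toy ABM for a payment processor might look like the sketch below: each agent carries a synthetic identity and a simple spending profile, behavioural rules generate transactions (including occasional fraud), and the simulation output is written out as a tabular dataset. The agent attributes, rules, and file name are all illustrative assumptions.

```python
import csv
import random

class Customer:
    """An agent with a simple spending profile and a small fraud propensity."""
    def __init__(self, customer_id: int, rng: random.Random):
        self.customer_id = customer_id          # synthetic identity, not a real one
        self.mean_spend = rng.uniform(10, 200)  # agent attribute
        self.fraud_rate = rng.uniform(0.0, 0.02)
        self.rng = rng

    def transact(self, step: int) -> dict:
        """Behavioural rule: usual spend most of the time, an outlier when fraudulent."""
        is_fraud = self.rng.random() < self.fraud_rate
        amount = self.mean_spend * (self.rng.uniform(5, 20) if is_fraud
                                    else self.rng.lognormvariate(0.0, 0.3))
        return {"step": step, "customer_id": self.customer_id,
                "amount": round(amount, 2), "is_fraud": int(is_fraud)}

def run_abm(n_customers: int = 100, n_steps: int = 30, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    customers = [Customer(i, rng) for i in range(n_customers)]
    rows = []
    for step in range(n_steps):
        for c in customers:
            if rng.random() < 0.4:              # not every agent transacts every step
                rows.append(c.transact(step))
    return rows

# Write the simulated transactions out as a tabular synthetic dataset.
rows = run_abm()
with open("synthetic_transactions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```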
Summary of synthetic data generation techniques
Each approach has its strengths and weaknesses, and the choice of method depends on the specific data generation requirements and the characteristics of the underlying data. Rule-based and intelligent rule-based methods are suitable when the data generation process can be represented using explicit rules or learned patterns. Statistical model-based approaches are effective for capturing statistical properties of the data. Deep generative models excel in generating high-quality synthetic data with complex patterns and details. ABMs are valuable for studying emergent behaviours and interactions among agents and reproducing synthetic data with a high degree of control.
The selection of the most appropriate method depends on the specific use case and the desired characteristics of the synthetic data. In practice, a combination of the methods is most often used. In almost all production use-cases in financial services, clients benefit from a multi-method approach which includes ABM.
Multi-method synthetic data generation
When generating tabular synthetic data in financial services while ensuring privacy, the most robust approach is to combine simulation-based models such as ABMs with the other methods described above. Here are a few comments on how these techniques can be used in conjunction with ABMs.
Rule-based generators
Rule-based generators involve defining specific rules and heuristics for generating synthetic data. These rules dictate the behaviours, attributes, and interactions of agents within the ABM. By carefully designing these rules, it is possible to generate synthetic data that reflects the characteristics and patterns observed in real-world data. Rule-based generators allow for fine-grained control over the data generation process and can be tailored to specific requirements.
Statistical approaches
Statistical methods can be employed to generate synthetic data within the context of ABMs. This involves analysing the statistical properties and patterns present in real-world data and using this information to generate synthetic datasets. Techniques such as bootstrapping, resampling, and Monte Carlo simulations can be applied to mimic the statistical distribution and variability observed in the original data. Statistical approaches are particularly useful when the focus is on preserving the statistical characteristics of the data rather than capturing intricate details.
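Both ideas are straightforward to sketch with NumPy: bootstrapping resamples the observed values directly, while a simple Monte Carlo variant fits a parametric distribution and draws fresh values from it. The stand-in data and the log-normal assumption here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a column of observed transaction amounts.
observed = rng.lognormal(3.0, 0.8, size=500)

# Bootstrapping: resample the observed data with replacement to create
# synthetic samples that preserve its empirical distribution.
bootstrap_sample = rng.choice(observed, size=observed.size, replace=True)

# Monte Carlo variant: fit a simple parametric model (log-normal here)
# and draw new values from it, adding variability beyond the observed points.
mu, sigma = np.log(observed).mean(), np.log(observed).std()
monte_carlo_sample = rng.lognormal(mu, sigma, size=2_000)
```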
Machine learning techniques
Machine learning algorithms can be integrated into ABMs to improve the realism of synthetic data. For example, supervised learning techniques such as decision trees, random forests, or neural networks can be used to learn patterns and relationships from real data, and these can be integrated into the simulation to replace model components where there is high uncertainty, such as customer behaviours. Other techniques apply to the simulation output, such as using neural networks to calibrate an ABM. By leveraging the power of machine learning, ABMs can generate synthetic data that closely resembles the real data and captures complex patterns and relationships.
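A minimal sketch of the first idea: a decision tree is trained on (stand-in) historical data and then used as an agent's decision rule inside the simulation in place of a hand-written heuristic. The features, the toy labelling rule, and the `agent_decides_to_decline` helper are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Stand-in for historical customer data: [amount, hour_of_day] -> declined (0/1).
X_real = np.column_stack([rng.lognormal(3, 0.6, 2_000), rng.integers(0, 24, 2_000)])
y_real = (X_real[:, 0] > 60) & (X_real[:, 1] > 20)   # toy labelling rule

# Learn the behaviour from real data...
behaviour_model = DecisionTreeClassifier(max_depth=4).fit(X_real, y_real)

# ...and plug it into the simulation in place of a hand-written rule.
def agent_decides_to_decline(amount: float, hour: int) -> bool:
    """Agent decision rule driven by the learned model instead of fixed logic."""
    return bool(behaviour_model.predict([[amount, hour]])[0])
```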
Simulation-based inference with neural networks for ABM calibration
Simulation-based inference is a valuable machine learning technique for calibrating simulations to real data and observations. Variations on this technique combine neural networks, such as neural density estimators, with embedding networks to translate high-dimensional observations into calibrated simulations. Here is how it can be applied to an ABM for generating realistic synthetic data, using a payment processor as an example:
Simulation-based inference enables the calibration of agent-based models to produce synthetic data that closely matches real-world data. By iteratively adjusting the ABM parameters to reproduce observed data, the calibrated model can capture the intricate dynamics of the system and generate synthetic data that accurately reflects its behaviour. This approach ensures that the synthetic data generated maintains a high level of realism, which is essential for various applications such as testing new fraud detection algorithms, evaluating system performance, or developing robust risk management strategies.
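The sketch below shows a deliberately simplified, point-estimate flavour of this idea: parameters are drawn from a prior, a stand-in simulator maps them to summary statistics, and a small network is trained to invert that mapping so that observed summaries can be translated into calibrated parameters. Full simulation-based inference would use neural density estimators to recover a posterior rather than a point estimate; the simulator, prior ranges, and observed summary here are hypothetical.

```python
import torch
from torch import nn

def simulator(theta: torch.Tensor) -> torch.Tensor:
    """Hypothetical ABM stand-in: maps parameters (fraud rate, mean spend)
    to summary statistics of the simulated transaction stream."""
    fraud_rate, mean_spend = theta[:, 0:1], theta[:, 1:2]
    n_frauds = fraud_rate * 1_000 + torch.randn_like(fraud_rate) * 2
    avg_amount = mean_spend + torch.randn_like(mean_spend) * 5
    return torch.cat([n_frauds, avg_amount], dim=1)

# 1. Draw parameters from a (uniform) prior and run the simulator on each draw.
theta = torch.rand(5_000, 2) * torch.tensor([0.05, 200.0])
x = simulator(theta)

# 2. Train a network to map simulation outputs back to the parameters that
#    produced them (a point-estimate flavour of neural simulation-based
#    inference; full methods use neural density estimators instead).
net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(2_000):
    loss = nn.functional.mse_loss(net(x), theta)
    opt.zero_grad(); loss.backward(); opt.step()

# 3. Feed the *observed* summary statistics through the trained network to
#    obtain calibrated ABM parameters, then simulate with those parameters.
observed_summary = torch.tensor([[12.0, 85.0]])
calibrated_theta = net(observed_summary).detach()
```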
Data Augmentation
Data augmentation techniques can be applied in combination with ABMs to expand the available dataset and generate additional synthetic data points. This involves applying transformations, perturbations, or modifications to existing real data to create new synthetic samples. By introducing variations and noise to the original data, data augmentation techniques can generate diverse synthetic data points while maintaining the statistical properties and patterns observed in the real data. This can be especially useful when creating new training data to perform simulation-based inference where the simulation budget is restricted.
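For tabular data, a simple augmentation step might jitter each real row with small multiplicative noise to multiply the number of training examples available for simulation-based inference. The column layout, noise scale, and `augment_rows` helper below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for real tabular rows: [amount, account_age_days, n_prior_txns].
real_rows = np.column_stack([
    rng.lognormal(3, 0.5, 200),
    rng.integers(30, 3_000, 200),
    rng.poisson(40, 200),
]).astype(float)

def augment_rows(rows: np.ndarray, copies: int = 5, noise_scale: float = 0.02) -> np.ndarray:
    """Create perturbed copies of each real row by adding small relative noise."""
    augmented = []
    for _ in range(copies):
        noise = rng.normal(0.0, noise_scale, size=rows.shape)
        augmented.append(rows * (1.0 + noise))        # multiplicative jitter
    return np.vstack(augmented)

synthetic_rows = augment_rows(real_rows)   # five perturbed copies of the original rows
```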
Summary of hybrid synthetic data generation techniques
Hybrid approaches typically combine deep learning with traditional simulation-based approaches. For example, agents within an ABM could be replaced with a neural network to increase the overall flexibility of the simulation while retaining its explainability. By leveraging the strengths of different methods, hybrid approaches can generate synthetic data that captures various aspects of the real data, including complex patterns, statistical properties, and behavioural dynamics. These methods can be tailored and combined based on the specific requirements of the synthetic data generation task. By integrating multiple techniques within ABMs, organizations can generate synthetic data that accurately represents the complexity and characteristics of real-world systems while preserving privacy and maintaining data utility.
Conclusion
In the realm of synthetic data generation, both deep learning and ABMs provide powerful tools for generating tabular data. Deep learning models excel at capturing complex data distributions, while ABMs simulate the complex dynamic behaviours of systems based on predefined rules. When generating tabular synthetic data for a payment processor while ensuring privacy, a combination of both methods may be employed: deep generative methods can capture the intricate data patterns, while ABMs can simulate the dynamic aspects of the payment system. Novel deep-learning techniques, such as simulation-based inference (SBI), can be used to calibrate ABMs to create realistic synthetic data. Implementing privacy-preserving techniques, such as anonymization, differential privacy, and careful calibration of agent behaviours, is essential to safeguard private customer data. By leveraging the strengths of deep generative models and other techniques with ABMs and adopting privacy-centric approaches, organizations can generate high-quality synthetic data that supports analysis while upholding data privacy standards.
Most importantly, generating synthetic data relies on a careful understanding of the data space, especially when the synthetic data is used in machine learning pipelines, where care must be taken to avoid the perennial ‘garbage in, garbage out’ problem. In fact, using poorly constructed synthetic data can lead you into a quagmire and do more harm than good.