Swimming through a quagmire of synthetic garbage data
Justin Lyon
CEO of Simudyne, which helps institutions solve complex problems and make better decisions. Barclays Techstars '17
Exploring Synthetic Data Generation
The growing need for synthetic data has led to the development of various techniques to generate realistic and privacy-preserving datasets. These techniques include:
1. Rule-based
2. Statistical model-based
3. Deep generative models
4. Simulation- and agent-based models
Rule-based
The simplest approach to generating synthetic data is to use explicit, hard-coded rules and conditions. This rule-based approach relies on predefined logic and decision-making processes to mimic the behaviour of real-world data, and is useful when the data generation process is well understood and can be accurately represented using deterministic rules. For example, image augmentation techniques generate synthetic data by rotating, translating, and adding noise to images; training on augmented images has been shown to improve accuracy in certain computer vision tasks.
Rule-based approaches can also incorporate more advanced algorithms and techniques to enhance the realism and complexity of the generated data. This may involve incorporating machine learning or expert systems to develop intelligent rules that adapt and evolve based on the input data. For example, an intelligent rule-based method could augment images to represent different lighting scenarios, using models that capture how shadows change under different light, making a computer vision model robust to low-light environments. Intelligent rule-based methods are effective when the data generation process involves complex relationships and patterns that can be learned from existing data.
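To make the idea concrete, here is a minimal sketch of rule-based image augmentation in Python using NumPy. The specific rules (a random flip, a small wrap-around shift, additive noise) and the `augment` helper are illustrative choices, not a prescribed recipe.

```python
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply simple rule-based augmentations to a 2-D grayscale image."""
    out = image.copy()
    # Rule 1: random horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        out = np.fliplr(out)
    # Rule 2: small random translation (wrap-around shift of up to 3 pixels).
    shift = rng.integers(-3, 4, size=2)
    out = np.roll(out, shift=tuple(shift), axis=(0, 1))
    # Rule 3: additive Gaussian pixel noise, clipped to the valid range.
    out = out + rng.normal(0.0, 0.05, size=out.shape)
    return np.clip(out, 0.0, 1.0)

# Example: build an augmented training set from a batch of (stand-in) real images.
rng = np.random.default_rng(0)
real_images = rng.random((8, 28, 28))
augmented = np.stack([augment(img, rng) for img in real_images])
```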
Statistical model-based
Rather than relying on hard-coded rules, synthetic data can be generated by looking at the statistical properties of the underlying dataspace. This statistical model-based approach to data generation leverages statistical models to generate data that closely resembles the statistical properties and patterns observed in real-world data. It involves fitting a statistical model, such as a Gaussian mixture model or Gaussian process, to the existing data and then sampling from the approximated distributions to give new data points that exhibit similar characteristics. Statistical model-based methods are useful when the data generation process can be described using probabilistic distributions and when capturing the statistical properties of the data is crucial.
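A minimal sketch of this fit-then-sample workflow with scikit-learn's `GaussianMixture` is shown below; the two-column stand-in dataset and the number of mixture components are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in for a real two-column dataset (e.g. transaction amount and duration).
real_data = np.column_stack([
    rng.lognormal(3.0, 0.5, size=1_000),
    rng.normal(60.0, 15.0, size=1_000),
])

# Fit a Gaussian mixture model to the observed data...
gmm = GaussianMixture(n_components=5, random_state=0).fit(real_data)

# ...then sample new synthetic points from the approximated distribution.
synthetic_data, _ = gmm.sample(n_samples=2_000)
```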
Deep generative models
Often, the underlying dataspace is high-dimensional. Deep learning models, built around artificial neural networks, are well suited to high-dimensional datasets, and several have proven highly successful as generative models. These include variational auto-encoders (VAEs), normalising flows (NFs), deep diffusion models (DDMs), and generative adversarial networks (GANs). There are significant differences between these methods, including whether they rely on learning a probabilistic representation of the dataspace (as with VAEs and NFs) and whether the approximated dataspace is accessible or implicit (as with GANs). For example, GANs typically consist of two components: a generator and a discriminator. The generator learns to produce synthetic data that resembles the real data, while the discriminator learns to distinguish between real and synthetic data. The two are trained simultaneously in a competitive manner, improving the quality and realism of the generated data over time. GANs are powerful for generating complex, high-dimensional data with fine-grained details. However, the generator does not explicitly model the underlying distribution of the data; instead it is trained to reproduce samples similar to the training data.
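The adversarial setup can be sketched in a few lines of PyTorch. The network sizes, learning rates, and the random tensor standing in for real data are all illustrative assumptions; a practical GAN would add batching, normalisation, and careful hyperparameter tuning.

```python
import torch
from torch import nn

latent_dim, data_dim = 16, 8   # hypothetical sizes for a small tabular dataset

# Generator: maps random noise to synthetic samples.
generator = nn.Sequential(
    nn.Linear(latent_dim, 64), nn.ReLU(),
    nn.Linear(64, data_dim),
)
# Discriminator: scores how "real" a sample looks.
discriminator = nn.Sequential(
    nn.Linear(data_dim, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

real_batch = torch.randn(128, data_dim)          # stand-in for real data
for step in range(1_000):
    # Train the discriminator to separate real from fake samples.
    fake_batch = generator(torch.randn(128, latent_dim)).detach()
    d_loss = bce(discriminator(real_batch), torch.ones(128, 1)) + \
             bce(discriminator(fake_batch), torch.zeros(128, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Train the generator to fool the discriminator.
    fake_batch = generator(torch.randn(128, latent_dim))
    g_loss = bce(discriminator(fake_batch), torch.ones(128, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

# After training, draw synthetic samples directly from the generator.
synthetic = generator(torch.randn(1_000, latent_dim)).detach()
```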
On the other hand, explicit models attempt to model the entire dataspace. For example, autoencoders use an encoder-decoder structure to ‘encode’ the data into a low-dimensional space, often called the latent space, and then ‘decode’ the latent representation back into the original data point as accurately as possible. Sampling from the latent space allows new synthetic data to be generated. Other methods learn probabilistic representations (such as VAEs) or aim to faithfully represent the data-generating process rather than the dataspace (such as DDMs). Deep generative models have attracted significant attention in recent years due to their aptitude for generating convincing images, such as the DALL-E models, and text, such as the GPT models. However, they are typically ‘black boxes’, or ‘rule-free’: it is usually not possible to explain why a particular synthetic example was produced or whether it is factually accurate. This raises significant challenges as these models are increasingly relied on in production environments, for example in real-time fraud detection.
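The encode–sample–decode idea can be sketched with a plain autoencoder in PyTorch (a VAE would add a probabilistic latent space and a KL regularisation term); the dimensions, training loop, and stand-in data are illustrative assumptions.

```python
import torch
from torch import nn

data_dim, latent_dim = 8, 2   # hypothetical sizes

# Encoder compresses a data point into a low-dimensional latent code;
# the decoder reconstructs the original point from that code.
encoder = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))

opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)
real_data = torch.randn(512, data_dim)            # stand-in for real data

for _ in range(1_000):
    # Train by minimising the reconstruction error through the latent bottleneck.
    recon = decoder(encoder(real_data))
    loss = nn.functional.mse_loss(recon, real_data)
    opt.zero_grad(); loss.backward(); opt.step()

# Generate new synthetic points by sampling latent codes and decoding them.
latent_samples = torch.randn(100, latent_dim)
synthetic = decoder(latent_samples).detach()
```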
Deep generative models are well suited to generating tabular synthetic data because of their ability to capture complex data distributions. By training on real payment transaction data, a model such as a GAN or VAE can generate synthetic data that closely resembles the original dataset. However, caution must be exercised to protect private customer data: the real data should be properly anonymized and sanitized, differential privacy techniques applied where appropriate, and the model's ability to preserve privacy rigorously evaluated. Furthermore, the generated data should be carefully checked to ensure that it is representative of the training data, does not contain hidden biases, and is coherent.
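The output checks mentioned above can start very simply. The sketch below, with a hypothetical `compare_marginals` helper, compares column means, standard deviations, and pairwise correlations between the real and synthetic tables; a production pipeline would add distributional tests, bias checks, and privacy audits on top.

```python
import numpy as np

def compare_marginals(real: np.ndarray, synthetic: np.ndarray, names: list[str]) -> None:
    """Quick sanity check that synthetic columns track the real data's
    means, standard deviations, and pairwise correlations."""
    for i, name in enumerate(names):
        print(f"{name:>12}: real mean={real[:, i].mean():8.2f} "
              f"synthetic mean={synthetic[:, i].mean():8.2f} "
              f"real std={real[:, i].std():8.2f} "
              f"synthetic std={synthetic[:, i].std():8.2f}")
    # Largest absolute gap between the real and synthetic correlation matrices.
    corr_gap = np.abs(np.corrcoef(real, rowvar=False) -
                      np.corrcoef(synthetic, rowvar=False)).max()
    print(f"max correlation difference: {corr_gap:.3f}")
```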
Simulation- and agent-based models (ABM)
Unlike deep generative models, simulation models use expert understanding of the underlying data generation process to reproduce the data-generation environment. For example, physics simulators use equations that describe the motion of fluid particles to generate realistic water effects, as used in CGI. One sub-class of simulators is agent-based models, which simulate the behaviour and interactions of autonomous agents within a given environment. These models capture individual-level decision-making processes, agent attributes, and the interactions among agents from a set of pre-defined rules. ABMs are particularly useful when the focus is on understanding emergent system-level behaviours that arise from the interactions of individual agents. They excel at capturing the temporal and dynamic aspects of real-world systems, making them valuable for generating synthetic data that reflects complex relationships and interdependencies, whilst remaining interpretable, explainable, and reliable.
Agent-Based Models (ABMs) can also generate tabular synthetic data for a payment processor. ABMs offer the advantage of capturing the dynamic behaviour of the payment system, including fraud detection, transaction volumes, and network effects. To protect private customer data, ABMs can be designed with privacy-enhancing mechanisms, such as using synthetic identities and transaction profiles. Additionally, careful consideration must be given to the calibration of agent behaviours and interaction rules to ensure the generated data does not reveal sensitive information and that the synthetic data realistically reproduces known features in the data.
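A toy ABM for a payment processor might look like the sketch below: each agent carries a synthetic identity and a simple spending profile, behavioural rules generate transactions (including occasional fraud), and the simulation output is written out as a tabular dataset. The agent attributes, rules, and file name are all illustrative assumptions.

```python
import csv
import random

class Customer:
    """An agent with a simple spending profile and a small fraud propensity."""
    def __init__(self, customer_id: int, rng: random.Random):
        self.customer_id = customer_id          # synthetic identity, not a real one
        self.mean_spend = rng.uniform(10, 200)  # agent attribute
        self.fraud_rate = rng.uniform(0.0, 0.02)
        self.rng = rng

    def transact(self, step: int) -> dict:
        """Behavioural rule: usual spend most of the time, an outlier when fraudulent."""
        is_fraud = self.rng.random() < self.fraud_rate
        amount = self.mean_spend * (self.rng.uniform(5, 20) if is_fraud
                                    else self.rng.lognormvariate(0.0, 0.3))
        return {"step": step, "customer_id": self.customer_id,
                "amount": round(amount, 2), "is_fraud": int(is_fraud)}

def run_abm(n_customers: int = 100, n_steps: int = 30, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    customers = [Customer(i, rng) for i in range(n_customers)]
    rows = []
    for step in range(n_steps):
        for c in customers:
            if rng.random() < 0.4:              # not every agent transacts every step
                rows.append(c.transact(step))
    return rows

# Write the simulated transactions out as a tabular synthetic dataset.
rows = run_abm()
with open("synthetic_transactions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```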
Summary of synthetic data generation techniques
Each approach has its strengths and weaknesses, and the choice of method depends on the specific data generation requirements and the characteristics of the underlying data. Rule-based and intelligent rule-based methods are suitable when the data generation process can be represented using explicit rules or learned patterns. Statistical model-based approaches are effective for capturing statistical properties of the data. Deep generative models excel in generating high-quality synthetic data with complex patterns and details. ABMs are valuable for studying emergent behaviours and interactions among agents and reproducing synthetic data with a high degree of control.
The selection of the most appropriate method depends on the specific use case and the desired characteristics of the synthetic data. In practice, a combination of the methods is most often used. In almost all production use-cases in financial services, clients benefit from a multi-method approach which includes ABM.
Multi-method synthetic data generation
When generating tabular synthetic data in financial services while ensuring privacy, the most robust approach is to combine simulation-based models such as ABMs with the other methods described above. Here are a few comments on how these techniques can be used in conjunction with ABMs.
Rule-based generators
Rule-based generators involve defining specific rules and heuristics for generating synthetic data. These rules dictate the behaviours, attributes, and interactions of agents within the ABM. By carefully designing these rules, it is possible to generate synthetic data that reflects the characteristics and patterns observed in real-world data. Rule-based generators allow for fine-grained control over the data generation process and can be tailored to specific requirements.
Statistical approaches
Statistical methods can be employed to generate synthetic data within the context of ABMs. This involves analysing the statistical properties and patterns present in real-world data and using this information to generate synthetic datasets. Techniques such as bootstrapping, resampling, and Monte Carlo simulations can be applied to mimic the statistical distribution and variability observed in the original data. Statistical approaches are particularly useful when the focus is on preserving the statistical characteristics of the data rather than capturing intricate details.
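Both ideas are straightforward to sketch with NumPy: bootstrapping resamples the observed values directly, while a simple Monte Carlo variant fits a parametric distribution and draws fresh values from it. The stand-in data and the log-normal assumption here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a column of observed transaction amounts.
observed = rng.lognormal(3.0, 0.8, size=500)

# Bootstrapping: resample the observed data with replacement to create
# synthetic samples that preserve its empirical distribution.
bootstrap_sample = rng.choice(observed, size=observed.size, replace=True)

# Monte Carlo variant: fit a simple parametric model (log-normal here)
# and draw new values from it, adding variability beyond the observed points.
mu, sigma = np.log(observed).mean(), np.log(observed).std()
monte_carlo_sample = rng.lognormal(mu, sigma, size=2_000)
```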
Machine learning techniques
Machine learning algorithms can be integrated into ABMs to improve the realism of synthetic data. For example, supervised learning techniques such as decision trees, random forests, or neural networks can be used to learn patterns and relationships from real data, and these can be integrated into the simulation to replace model components where there is high uncertainty, such as customer behaviours. Other techniques apply to the simulation output, such as using neural networks to calibrate an ABM. By leveraging the power of machine learning, ABMs can generate synthetic data that closely resembles the real data and captures complex patterns and relationships.
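A minimal sketch of the first idea: a decision tree is trained on (stand-in) historical data and then used as an agent's decision rule inside the simulation in place of a hand-written heuristic. The features, the toy labelling rule, and the `agent_decides_to_decline` helper are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Stand-in for historical customer data: [amount, hour_of_day] -> declined (0/1).
X_real = np.column_stack([rng.lognormal(3, 0.6, 2_000), rng.integers(0, 24, 2_000)])
y_real = (X_real[:, 0] > 60) & (X_real[:, 1] > 20)   # toy labelling rule

# Learn the behaviour from real data...
behaviour_model = DecisionTreeClassifier(max_depth=4).fit(X_real, y_real)

# ...and plug it into the simulation in place of a hand-written rule.
def agent_decides_to_decline(amount: float, hour: int) -> bool:
    """Agent decision rule driven by the learned model instead of fixed logic."""
    return bool(behaviour_model.predict([[amount, hour]])[0])
```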
Simulation-based inference with neural networks for ABM calibration
Simulation-based inference is a valuable machine learning technique for calibrating simulations to real data and observations. Variations on this technique combine neural networks, such as neural density estimators, with embedding networks to translate high-dimensional observations into calibrated simulations. Here is how it can be applied to an ABM for generating realistic synthetic data, using a payment processor as an example:
Simulation-based inference enables the calibration of agent-based models to produce synthetic data that closely matches real-world data. By iteratively adjusting the ABM parameters to reproduce observed data, the calibrated model can capture the intricate dynamics of the system and generate synthetic data that accurately reflects its behaviour. This approach ensures that the synthetic data generated maintains a high level of realism, which is essential for various applications such as testing new fraud detection algorithms, evaluating system performance, or developing robust risk management strategies.
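The sketch below shows a deliberately simplified, point-estimate flavour of this idea: parameters are drawn from a prior, a stand-in simulator maps them to summary statistics, and a small network is trained to invert that mapping so that observed summaries can be translated into calibrated parameters. Full simulation-based inference would use neural density estimators to recover a posterior rather than a point estimate; the simulator, prior ranges, and observed summary here are hypothetical.

```python
import torch
from torch import nn

def simulator(theta: torch.Tensor) -> torch.Tensor:
    """Hypothetical ABM stand-in: maps parameters (fraud rate, mean spend)
    to summary statistics of the simulated transaction stream."""
    fraud_rate, mean_spend = theta[:, 0:1], theta[:, 1:2]
    n_frauds = fraud_rate * 1_000 + torch.randn_like(fraud_rate) * 2
    avg_amount = mean_spend + torch.randn_like(mean_spend) * 5
    return torch.cat([n_frauds, avg_amount], dim=1)

# 1. Draw parameters from a (uniform) prior and run the simulator on each draw.
theta = torch.rand(5_000, 2) * torch.tensor([0.05, 200.0])
x = simulator(theta)

# 2. Train a network to map simulation outputs back to the parameters that
#    produced them (a point-estimate flavour of neural simulation-based
#    inference; full methods use neural density estimators instead).
net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(2_000):
    loss = nn.functional.mse_loss(net(x), theta)
    opt.zero_grad(); loss.backward(); opt.step()

# 3. Feed the *observed* summary statistics through the trained network to
#    obtain calibrated ABM parameters, then simulate with those parameters.
observed_summary = torch.tensor([[12.0, 85.0]])
calibrated_theta = net(observed_summary).detach()
```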
Data Augmentation
Data augmentation techniques can be applied in combination with ABMs to expand the available dataset and generate additional synthetic data points. This involves applying transformations, perturbations, or modifications to existing real data to create new synthetic samples. By introducing variations and noise to the original data, data augmentation techniques can generate diverse synthetic data points while maintaining the statistical properties and patterns observed in the real data. This can be especially useful when creating new training data to perform simulation-based inference where the simulation budget is restricted.
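For tabular data, a simple augmentation step might jitter each real row with small multiplicative noise to multiply the number of training examples available for simulation-based inference. The column layout, noise scale, and `augment_rows` helper below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for real tabular rows: [amount, account_age_days, n_prior_txns].
real_rows = np.column_stack([
    rng.lognormal(3, 0.5, 200),
    rng.integers(30, 3_000, 200),
    rng.poisson(40, 200),
]).astype(float)

def augment_rows(rows: np.ndarray, copies: int = 5, noise_scale: float = 0.02) -> np.ndarray:
    """Create perturbed copies of each real row by adding small relative noise."""
    augmented = []
    for _ in range(copies):
        noise = rng.normal(0.0, noise_scale, size=rows.shape)
        augmented.append(rows * (1.0 + noise))        # multiplicative jitter
    return np.vstack(augmented)

synthetic_rows = augment_rows(real_rows)   # five perturbed copies of the original rows
```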
Summary of hybrid synthetic data generation techniques
Hybrid approaches typically combine deep learning with traditional simulation-based approaches. For example, agents within an ABM could be replaced with a neural network to increase the overall flexibility of the simulation while retaining its explainability. By leveraging the strengths of different methods, hybrid approaches can generate synthetic data that captures various aspects of the real data, including complex patterns, statistical properties, and behavioural dynamics. These methods can be tailored and combined based on the specific requirements of the synthetic data generation task. By integrating multiple techniques within ABMs, organizations can generate synthetic data that accurately represents the complexity and characteristics of real-world systems while preserving privacy and maintaining data utility.
Conclusion
In the realm of synthetic data generation, both deep learning and ABMs provide powerful tools for generating tabular data. Deep learning models excel at capturing complex data distributions, while ABMs simulate the complex dynamic behaviours of systems based on predefined rules. When generating tabular synthetic data for a payment processor while ensuring privacy, a combination of both methods may be employed: deep generative methods can capture the intricate data patterns, while ABMs can simulate the dynamic aspects of the payment system. Novel deep-learning techniques, such as simulation-based inference (SBI), can be used to calibrate ABMs to create realistic synthetic data. Implementing privacy-preserving techniques, such as anonymization, differential privacy, and careful calibration of agent behaviours, is essential to safeguard private customer data. By leveraging the strengths of deep generative models and other techniques with ABMs and adopting privacy-centric approaches, organizations can generate high-quality synthetic data that supports analysis while upholding data privacy standards.
Most importantly, generating synthetic data relies on a careful understanding of the data space, especially when the synthetic data is used in machine learning pipelines, where care must be taken to avoid the perennial ‘garbage in, garbage out’ problem. In fact, using poorly constructed synthetic data can lead you into a quagmire and do more harm than good.