What is synthetic data? How to use it in Machine Learning and AI.
Arivukkarasan Raja, PhD
PhD in Robotics with Applied AI | GCC Leadership | Expertise in Enterprise Solution Architecture, AI/ML, Robotics & IoT | Software Application Development | Service Delivery Management | Sales & Pre-Sales
#syntheticdata #machinelearning #datascience #ai #ml #bias #dataprivacylaw #python #numpy #abm #generativeai #mlops
One of today's most precious resources is data. But because real data is expensive, sensitive, and time-consuming to collect and process, gathering it is not always an option. Synthetic data can be a good substitute when training machine learning models. In this article, we will define synthetic data, look at its applications, explain when and why it should be used, and review the generation methods and tools that are currently available.
What is synthetic data?
Synthetic data is artificially generated data that imitates real-world observations; it is used to train machine learning models when actual data is difficult or expensive to obtain. Synthetic data should not be confused with augmented or randomized data. Let's use the synthesis of a human face as a very crude example to show how synthetic data differs from the other approaches. Imagine that we have a collection of images of real people.
Data augmentation involves adding slightly altered copies of existing samples to the data set. With augmentation, we would extend our collection with nearly identical faces that differ only in small details such as eye colour or skin tone.
Data randomization doesn't produce new elements either; it only recombines the items already present in the data pool. In our example, we would mix the existing facial features: the hair from person 1 would be combined with the mouth from person 2, the eyes from person 3, and so on.
Synthetic data, by contrast, gives us entirely new portraits of individuals who share the statistical traits of the original data set without reproducing any of the original faces. In essence, by creating synthetic data we recreate something that already exists in the real world, capturing its characteristics without directly copying any individual example or simply mashing existing ones together.
Synthetic data does not have to be computer-generated. Long before computers were invented, synthetic data already existed; it was simply produced by people. For instance, an artist drawing new faces performs the same kind of face generation.
People can also generate numerical data by hand, relying on statistical and mathematical knowledge rather than computational resources. However, even with advances in mathematics and probability theory, producing synthetic data without the aid of a computer is generally time-consuming and difficult.
3 problems with real datasets in ML and AI
1. Data volume and slow access
To produce useful results, machine learning models require a large amount of training data. Even a straightforward task needs thousands of data points, and more sophisticated tasks such as text, image, or video recognition may need millions.
This can be a problem: before real data can be used for ML, your business may have to go through drawn-out data access procedures that can take up to six months. As a result, AI/ML projects may be abandoned or put off indefinitely.
You might also decide not to go through this time-consuming data access procedure at all, simply because you are not yet sure whether the dataset is appropriate for your project.
2. Bias
Bias in machine learning is an error that arises from incorrect assumptions in the learning process or from unrepresentative training data. For instance, suppose your business holds information on thousands of clients and you want to use it to model the connection between demographic attributes and purchasing behavior.
If the original data is biased and does not reflect a genuine correlation between demographics and purchasing patterns, there is nothing meaningful to learn. To build a useful ML application, you would need data that exhibits a strong pattern and can be divided into clusters with distinct properties.
The bias issue not only makes AI less effective; it can also entrench discrimination. The Washington Post reported that Google showed women far fewer ads for high-paying executive jobs than it showed men. Because of this, businesses should feed machine learning algorithms reliable and representative data.
3. Privacy regulations
ML algorithms could use real data to address a variety of commercial problems. However, privacy laws such as the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and the Health Insurance Portability and Accountability Act (HIPAA) govern the use of Personally Identifiable Information (PII) and Protected Health Information (PHI).
These rules restrict how real-world data can be collected and used in order to prevent data leaks, unauthorised transfers, and misuse. Under such regulations, using real data can become difficult or even impossible. First, requesting secondary consent for an ML project degrades the customer experience. Second, it is hard to estimate the required scope of the data during the initial stages of an AI project, for instance how much of the data will actually be used and whether all edge cases are covered.
When and why is synthetic data used?
From training navigational robot models to conducting research on radio signal recognition, synthetic data can be used for a variety of tasks. In practice, synthesized data can serve essentially any project that relies on a computer simulation to forecast or study real events. There are several important reasons why a company would consider using synthetic data.
Saving time and money: If you don't have a suitable dataset, producing synthetic data can be much less expensive than gathering it from actual events. The same goes for time: while real data collection and processing may take weeks, months, or even years for some projects, synthesis can be completed in a few days.
Investigating rare data: In some situations, collecting data is risky or the events of interest are rare. A collection of exceptional fraud incidents is an example of rare data; road traffic accidents, to which self-driving cars must respond, are an example of dangerous real data. In such cases we can use synthetic accidents instead.
Addressing privacy concerns: Privacy must be taken into account whenever sensitive data needs to be processed or handed to third parties. Unlike anonymization, generating synthetic data eliminates all identity traces from the original data, producing a new, reliable data set without jeopardising privacy.
Easy labeling and control: Fully synthetic data makes labeling easy. For example, if a picture of a park is generated, labels for trees, people, and animals can be assigned automatically; we don't have to hire people to label these objects manually. Fully synthesized data can also be easily controlled and adjusted.
Techniques to Generate Synthetic Data
Drawing Numbers From a Distribution
In contrast to more sophisticated, machine-learning-based methods, drawing random numbers from a statistical distribution is a simple and common way to create synthetic data. Although this approach does not capture the deeper structure of real-world data, it can produce a distribution that closely resembles it.
In this example, four datasets with a normal distribution and slightly different centers (means) are produced using Python and the NumPy library's numpy.random.randn() function, as shown in the sketch below.
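Here is a minimal sketch of that idea; the offsets, sample size, and dataset names are arbitrary choices made for illustration.

```python
import numpy as np

# Four synthetic, normally distributed datasets whose centers (means)
# are shifted slightly apart; the spread stays at ~1 because randn()
# draws from the standard normal distribution.
offsets = [0.0, 0.5, 1.0, 1.5]   # hypothetical shifts of the mean
n_samples = 1_000

datasets = {
    f"dataset_{i}": np.random.randn(n_samples) + offset
    for i, offset in enumerate(offsets)
}

for name, values in datasets.items():
    print(f"{name}: mean={values.mean():.2f}, std={values.std():.2f}")
```

Plotting histograms of the four arrays (for example with matplotlib) makes the slightly shifted centers easy to see.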
Agent-Based Modeling
ABM, or agent-based modelling, is a simulation technique in which individual agents are created and interact with one another. It is especially helpful for examining how different agents, such as cells, people, or even computer programs, interact within a complex system. Python packages like Mesa provide pre-built core components that make it simple to create agent-based models quickly and view them in a browser-based interface; a minimal sketch of the underlying idea follows below.
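To make the idea concrete, here is a small, hypothetical agent-based sketch in plain Python (deliberately not using Mesa's API): "customer" agents randomly trade one unit of wealth per step, and the resulting transaction log becomes a synthetic dataset.

```python
import random

class CustomerAgent:
    """A hypothetical agent holding some wealth and trading with peers."""

    def __init__(self, agent_id, wealth=10):
        self.agent_id = agent_id
        self.wealth = wealth

    def step(self, agents, log, t):
        # Pick a random partner and hand over one unit of wealth.
        partner = random.choice(agents)
        if partner is not self and self.wealth > 0:
            self.wealth -= 1
            partner.wealth += 1
            log.append({"step": t, "from": self.agent_id,
                        "to": partner.agent_id, "amount": 1})

def run_model(n_agents=50, n_steps=100, seed=42):
    random.seed(seed)
    agents = [CustomerAgent(i) for i in range(n_agents)]
    log = []  # synthetic transaction records produced by the simulation
    for t in range(n_steps):
        for agent in agents:
            agent.step(agents, log, t)
    return log

synthetic_transactions = run_model()
print(len(synthetic_transactions), "synthetic transactions generated")
```

Mesa packages the same ideas (agent and model classes, data collection, browser-based visualization) so you don't have to build this scaffolding yourself.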
Generative Models
One of the most advanced methods for producing synthetic data is generative modelling. It can be characterised as an unsupervised learning task in which the model automatically discovers and learns the patterns in the data, so that it can then generate new examples that follow the same distribution as the real-world data it was trained on.
To train a generative model, it is usually necessary to first collect a large amount of data in a specific domain (such as images, natural-language text, or tabular data) and then train the model to produce more data that looks similar. Neural-network-based generative models typically have far fewer parameters than the amount of data they are trained on, which essentially forces them to discover the patterns in the data in order to generate new datasets. The models vary in architecture, but they are all based on neural networks. A lightweight sketch of the overall fit-then-sample idea follows below.
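The sketch below uses scikit-learn's GaussianMixture as a simple, classical stand-in for the neural generative models discussed here: it learns an approximation of the joint distribution of some (simulated) "real" tabular data and then samples brand-new synthetic records from it. The column meanings and parameter values are invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Simulated "real" tabular data, purely for illustration:
# column 0 ~ customer age, column 1 ~ yearly spend.
rng = np.random.default_rng(0)
real_data = np.column_stack([
    rng.normal(35, 10, 2_000),
    rng.normal(50_000, 15_000, 2_000),
])

# Learn an approximation of the joint probability distribution.
gmm = GaussianMixture(n_components=5, random_state=0).fit(real_data)

# Draw brand-new synthetic records from the learned distribution.
synthetic_data, _ = gmm.sample(n_samples=2_000)

print("real means:     ", real_data.mean(axis=0).round(1))
print("synthetic means:", synthetic_data.mean(axis=0).round(1))
```

A VAE or GAN follows the same fit-then-sample workflow, but replaces the mixture model with a neural network capable of capturing far more intricate relationships.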
Conclusion: context is key
With a generative model that has learned the joint probability distribution of the real data, you can create synthetic datasets by sampling fresh records from that distribution. In theory you could approximate this by simply counting the unique entries in a table, but the task becomes much harder with larger datasets and with more intricate relationships to capture.
Deep learning models such as variational autoencoders (VAE) and generative adversarial networks (GAN) have proven effective at such tasks. Which one to choose depends on the type of data, the user's skills, and the desired result.