Synthetic data: A new challenge in the data management scenario

by Michele Iurillo ([email protected])*

Synthetic data is artificially generated data: information that does not come from direct observation of the real environment but is produced using advanced computational techniques. It is generated by statistical and machine learning algorithms capable of reproducing distributions and characteristics similar to those observed in real data sets, preserving key statistical patterns without containing sensitive or identifiable information.
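
To make the idea concrete, here is a minimal, purely illustrative sketch in Python: it fits a simple multivariate Gaussian to a stand-in "real" data set and samples new records from it. Production generators rely on far more sophisticated models (GANs, variational autoencoders, copulas), but the principle of learning the statistical structure and then sampling from it is the same; every name and number below is invented for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Stand-in for a "real" data set with two numeric columns (invented values).
real = pd.DataFrame({
    "age": rng.normal(45, 12, size=1_000),
    "income": rng.lognormal(mean=10.5, sigma=0.4, size=1_000),
})

# "Learn" the joint structure: here simply the mean vector and covariance matrix.
mean = real.mean().to_numpy()
cov = real.cov().to_numpy()

# Sample brand-new records that preserve the learned correlations
# but correspond to no real individual.
synthetic = pd.DataFrame(
    rng.multivariate_normal(mean, cov, size=1_000),
    columns=real.columns,
)

print(real.describe())
print(synthetic.describe())
```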

The main purpose of synthetic data is to serve in the development, testing and validation of models, especially in environments where access to real data is limited or restricted for privacy or security reasons. It is often used to validate mathematical models and to train deep neural networks. Using this data, models can learn complex patterns and relationships without exposure to real data, which protects privacy and accelerates development and testing in artificial intelligence systems.

The advantage of using synthetic data is that it reduces the constraints that come with regulated or sensitive data, and it allows data to be created according to specific requirements that cannot be met with authentic data. Synthetic data sets are usually generated for quality assurance and software testing, but they can also open up new scenarios by letting us observe a reality that does not actually exist.

The disadvantage of synthetic data lies in the inconsistencies that appear when trying to reproduce the complexity of the original data, and in its inability to directly replace authentic data, since accurate real data is still needed to obtain useful results. Synthetic data can nevertheless be a formidable starting point for algorithm proofs of concept, as long as its "unreal" nature is not forgotten. Data scientists have to take it with a grain of salt and avoid drawing conclusions that would introduce sampling-style biases.

Real vs. synthetic data

Real data is collected or measured in the real world. This data is created every instant a person uses a smartphone, laptop or computer, wears a smartwatch, visits a website or makes an online purchase.

Synthetic data, on the other hand, is generated in digital environments. This data is fabricated so that it successfully mimics real data in its basic properties, even though it has not been obtained from any real-world event.

Using the various synthetic data generation techniques, the training data needed for machine learning models becomes readily available, which makes synthetic data a very promising alternative to real data. However, it cannot be stated emphatically that synthetic data can answer every real-world problem, and that does not diminish the significant advantages it offers.

Challenges and limitations of the use of synthetic data

While synthetic data offers several advantages to companies with data science initiatives, it also has certain limitations:

Data reliability: It is a well-known fact that any machine learning/deep learning model is only as good as its data source. In this context, the quality of synthetic data is significantly associated with the quality of the input data and the model used to generate the data. It is important to ensure that there are no biases in the source data, otherwise they could very well be reflected in the synthetic data. In addition, the quality of the data must be validated and verified before using it for any predictions.
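
As an illustration of this kind of validation, the hypothetical sketch below compares each column of a synthetic data set against its source with a two-sample Kolmogorov-Smirnov test. It assumes `real` and `synthetic` are pandas DataFrames with matching numeric columns, such as the ones produced in the earlier sketch.

```python
from scipy.stats import ks_2samp

def distribution_report(real_df, synthetic_df, alpha=0.05):
    """Flag columns whose synthetic distribution drifts from the source data."""
    for column in real_df.columns:
        stat, p_value = ks_2samp(real_df[column], synthetic_df[column])
        status = "OK" if p_value > alpha else "DRIFT"
        print(f"{column:>10}: KS={stat:.3f}  p={p_value:.3f}  [{status}]")

# `real` and `synthetic` are assumed to come from the generation sketch above.
distribution_report(real, synthetic)
```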

Requires expertise, time and effort: Although synthetic data may be easier and cheaper to produce than real data, it requires a certain level of expertise, time and effort.

User acceptance: Synthetic data is a new notion, and people who have not seen its advantages may be unwilling to trust predictions based on it. This means that awareness of the value of synthetic data must first be raised to gain wider user acceptance.

Outlier replication: Synthetic data can only resemble real-world data; it cannot be an exact duplicate. As a result, it may not cover some of the outliers that exist in the authentic data, even though those outliers can matter more than typical records.
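
One minimal, assumption-laden way to quantify this: count the extreme values (using the common rule of 1.5 times the interquartile range) in the real data and in the synthetic data. A generator that smooths away the tails will report far fewer outliers than the source. Again, `real` and `synthetic` are assumed to be DataFrames with the same numeric columns.

```python
def count_outliers(series):
    """Count values falling more than 1.5 * IQR beyond the first/third quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((series < lower) | (series > upper)).sum())

# `real` and `synthetic` are assumed to come from the generation sketch above.
for column in real.columns:
    print(f"{column}: real outliers={count_outliers(real[column])}, "
          f"synthetic outliers={count_outliers(synthetic[column])}")
```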

Quality checking and result control: Because the goal of creating synthetic data is to mimic real-world data, checking the generated data becomes critical. For complex data sets generated automatically by algorithms, it is imperative to verify their correctness, both with automated checks like the sketches above and with manual review, before feeding them into machine learning or deep learning models.

Real-world applications of synthetic data

Here are some real-world examples where synthetic data is being actively used.

Healthcare: Healthcare organizations use synthetic data to build models and a variety of test data sets for conditions for which no real data exists. In medical imaging, synthetic data is used to train AI models while always preserving patient privacy. They also employ synthetic data to forecast and predict disease trends.

Agriculture: Synthetic data is useful in computer vision applications that help predict crop yields, detect crop diseases, identify seeds, fruits and flowers, model plant growth, and more.

Disaster prediction and risk management: Government organizations are using synthetic data to predict natural calamities for disaster prevention and risk reduction.

Automotive and robotics: Companies use synthetic data to simulate and train self-driving cars/autonomous vehicles, drones or robots.

Finance: Banks and financial institutions can better identify and prevent online fraud, as data scientists can design and develop new effective fraud detection methods using synthetic data.

E-commerce: Companies benefit from efficient warehouse and inventory management, as well as an improved online shopping experience for customers, thanks to advanced machine learning models trained with synthetic data.

Manufacturing: Companies benefit from synthetic data for predictive maintenance and quality control.

Conclusions

Synthetic data opens new possibilities as long as we understand that it is not real and that its use has to be oriented above all toward training models. It is very dangerous to think that managing this data alone is enough to train a model: we will always have to confront the model with real data and verify that it works, because that is how we avoid biases.
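
As a minimal sketch of that workflow, and assuming `real` and `synthetic` are DataFrames that share the same feature columns plus a binary `target` column (an assumption made purely for illustration), the snippet below trains a scikit-learn model on synthetic records only and then measures its performance on real, held-out records.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical column layout: every column except "target" is a feature.
features = [c for c in synthetic.columns if c != "target"]

# Train only on synthetic records.
model = LogisticRegression(max_iter=1000)
model.fit(synthetic[features], synthetic["target"])

# Evaluate only on real records the generator and the model never saw.
real_auc = roc_auc_score(real["target"],
                         model.predict_proba(real[features])[:, 1])
print(f"AUC on real hold-out data: {real_auc:.3f}")
```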

  1. Potential for the development of AI and machine learning: Synthetic data has become established as a fundamental tool for developing and improving machine learning models, making it possible to train and validate algorithms in controlled environments and with larger volumes of data, even when real data is scarce or restricted.
  2. Data privacy and security protection: Because it does not contain real information, synthetic data enables the creation of high-value statistical representations for testing and development without compromising the privacy of individuals or the security of sensitive information. This opens up significant opportunities for sensitive sectors, such as healthcare and finance, where regulatory compliance is key.
  3. Resource optimization and cost reduction: Generating synthetic data can be cheaper and more efficient than collecting real data, especially in sectors where access to quality data is limited. By eliminating the need to collect costly or difficult-to-obtain data, companies can optimize their resources and reduce costs associated with managing and storing real data.
  4. Challenges in the representativeness and accuracy of the data generated: Although synthetic data offer multiple advantages, their use involves challenges. The fidelity with which real data patterns are represented is crucial, as any deviation could affect the accuracy and applicability of the trained models. This underscores the importance of using advanced algorithms and rigorous supervision in their generation.
  5. Impact on data governance and quality: The inclusion of synthetic data in data management requires a review of governance policies and quality standards. Organizations must establish clear criteria for differentiating, managing and auditing these data, ensuring that they maintain utility and avoid unwanted biases in analytical and predictive models.

The use of synthetic data in data management represents a significant advance, especially for highly regulated industries with limited access to real information. However, its adoption requires a careful technical and ethical approach to maximize its benefits and minimize the risks of bias, ensuring its correct integration into artificial intelligence and machine learning systems.

We will address this argument in one of the panel discussions at Data Management Summit 2025.

Inspiring articles:

https://www.turing.com/kb/synthetic-data-generation-techniques#what-is-synthetic-data

https://www.edps.europa.eu/press-publications/publications/techsonar/synthetic-data_en#:~:text=Synthetic%20data%20is%20artificial%20data,undergoing%20the%20same%20statistical%20analysis

https://www.ibm.com/topics/synthetic-data

https://blogs.manageengine.com/espanol/2023/03/15/synthetic-data-para-que-sirve-html.html

  • *Generative AI tools have been used for the translation of this article.
