From Pixels to Profits: How Synthetic Image Generation Changes Everything
Image generated with DALL-E.


You have probably heard a lot about the impressive capabilities of Deep Learning models across many fields of application: AI first learned to recognize images, speech, and text, then to manipulate them, and now attention has shifted to generative models that are becoming better and faster at creating content.

Their popularity exploded after ChatGPT was released in November 2022, but a variety of solutions exist on the market or can be built from scratch for custom and/or highly confidential purposes. Regardless of the hype, one thing stands at the foundation of any machine learning model’s training and its eventual success: collecting quality data.

However, collecting enough data can often be a challenge, especially when you consider the limitations of time, resources, and access to reliable datasets. But what if there was a way to overcome these hurdles and get more quality data efficiently?

Synthetic data generation paired with Domain Adaptation could be the answer: AI models can generate high-quality synthetic data to train other models.

In this article, we will dive deeper into generating images for Computer Vision models and how this can help overcome the limitations of classic data collection, such as data unavailability, privacy restrictions, or difficulty in acquisition due to high-precision requirements.

Typical application scenarios are those involving public environments where collecting data raises privacy concerns (e.g. detecting objects or people on roads), or those with very specific tasks requiring data acquired with expensive specialized hardware and/or procedures (e.g. gaze detection and eye tracking).

Synthetic images are created with extreme precision, simulating realistic scenarios and adapting them to the specific domain in which they will be used while providing a large saving in costs and time.

At the foundation of Foundation Models: Data.

Data is at the base of any AI model: if good-quality data is provided, the right model can solve any reasonable task.

Deep Learning models are very data-hungry, but simply having a lot of data is not enough to produce good results. “Good quality” is not just an empty adjective: it is the key point to ensure that the model is able to learn the correct patterns and correctly generalize the information, in order to obtain an optimal solution for the tasks it will be presented with.

As OpenAI co-founder Andrej Karpathy puts it, to be of “good quality” data must be abundant, clean, and diverse.

In summary, quantity is undoubtedly an important factor, but data must also be complete, without noise or errors, and capture the diversity of real-life situations and tasks the model will face.

If the data used to train the model is of poor quality, incomplete, or unrepresentative, the model may learn incorrect or inadequate information, producing inaccurate and unreliable results. For this reason, collecting and curating quality data is a critical step in developing an AI model.

Time and effort must be invested in ensuring that the data is accurate, complete, and representative of the context in which the model will operate. Only with quality data is it possible to obtain reliable results and fully exploit the potential of AI.

No easy feat

In many industrial applications, building datasets that satisfy the three pillars described above is costly or simply unfeasible.

In this era of digitalization and innovation, good data is very valuable, and companies are reluctant to share theirs.

So, what are the alternatives?

  1. One option is to use publicly available datasets released by researchers and academics. However, not all of them can be used for commercial purposes.
  2. Another solution is to purchase and label ad hoc data for the task we want to solve. But this can be very costly and time-consuming.
  3. Lastly, companies can use pre-trained models, which provide neural networks with general capabilities. They have already been trained on large datasets and can be good starting points for many tasks; however, they will still need some fine-tuning (a minimal fine-tuning sketch follows this list).
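
To make option 3 concrete, here is a minimal PyTorch sketch of fine-tuning a pre-trained network on a new task; the choice of model (ResNet-18), the number of classes, and the training loop are illustrative assumptions rather than a prescription.

```python
# A minimal fine-tuning sketch: start from an ImageNet pre-trained
# backbone, replace its classification head, and train only the new
# head on our (hypothetical) task-specific data.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
num_classes = 5  # hypothetical number of classes for our custom task
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Freeze the backbone; only the new head will be updated at first
for name, param in model.named_parameters():
    if not name.startswith("fc."):
        param.requires_grad = False

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def finetune_one_epoch(loader):
    """One pass over a DataLoader yielding (images, labels) from our own data."""
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

Unfreezing the backbone with a lower learning rate once the new head has converged is a common follow-up step.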

This data access problem can become insurmountable for small and medium-sized companies, which have limited resources compared to Big Tech. To read more on this topic, this article published in the Wall Street Journal explains how getting access to data is not just a matter of funds.

An example of a publicly available dataset for realistic textures and scene building.

The newest alternative: Synthetic Data Generation & Domain Adaptation

Consider a situation in which we have acquired some data in a production environment. It could be images of people whose actions we want to detect, or pictures of products on the shelves of a supermarket to implement a smart checkout and inventory system. The data is available, but it is usually scarce, unlabeled, or both.

You could search online for a pose-detection dataset, but the context and point of view of its images are likely different from your production environment. For the supermarket products, it is even harder to find pictures of the exact items of interest taken from different angles.

Here is where synthetic data can help solve the task.

An example of synthetic data generation for typical supermarket products. Companies worldwide have begun piloting checkout-free retail in various forms. Computer Vision AI can power autonomous stores, but the costs and complexity are not to be taken lightly. However, thanks to Synthetic Data Generation, it will be possible to convert existing stores and cover many real-life scenarios, much faster. Source:

Synthetic data

For Computer Vision purposes, synthetic data generally consists of images generated by simulation software. First, 3D models of the objects of interest and of the environments in which they are placed are created. These models could be built by a design team, taken from existing applications, or generated entirely by an AI model. From them, it is then possible to capture many images with full control over the parameters (point of view, background, context, pose, etc.).

Moreover, this computer-aided design provides all the labels associated with the images (e.g. the precise position of the product in the picture). With synthetic data generators, we can get a lot of clean, diverse images, and hence a good-quality dataset.
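
To make the idea of “labels for free” tangible, here is a toy Python sketch that composites a product cut-out onto random backgrounds and writes the bounding-box label alongside each image; the file names and the simple 2D paste approach are assumptions for illustration, while real pipelines usually render full 3D scenes.

```python
# Toy "cut-and-paste" synthetic image generator: it places a product
# cut-out at a random position and scale on a random background and
# records the exact bounding box, with no manual annotation needed.
# Paths are hypothetical; a production setup would use a 3D engine.
import json
import random
from pathlib import Path
from PIL import Image

backgrounds = list(Path("backgrounds").glob("*.jpg"))    # assumed folder of scene photos
product = Image.open("product.png").convert("RGBA")      # assumed cut-out with transparency

def generate_sample(out_path: str) -> dict:
    bg = Image.open(random.choice(backgrounds)).convert("RGBA")
    scale = random.uniform(0.2, 0.6)                      # randomized apparent distance
    w, h = int(product.width * scale), int(product.height * scale)
    obj = product.resize((w, h))
    x = random.randint(0, max(0, bg.width - w))           # randomized position
    y = random.randint(0, max(0, bg.height - h))
    bg.alpha_composite(obj, (x, y))
    bg.convert("RGB").save(out_path)
    return {"image": out_path, "bbox": [x, y, w, h]}      # the label is known exactly

labels = [generate_sample(f"synthetic_{i:04d}.jpg") for i in range(1000)]
Path("labels.json").write_text(json.dumps(labels))
```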

Example of simulation software and its controllable parameters (logged on screen) for developing autonomous driving systems. Source:
Synthetic human bodies with the corresponding labels (such as body key points and bounding boxes). Source:

This variety can also be valuable for testing and validating AI models in controlled environments. By generating specific scenarios, anomalies, or edge cases, it becomes possible to evaluate the model’s performance and uncover potential weaknesses.

Lastly, synthetic data has another big advantage over real data: privacy protection. Acquiring ad hoc real images can be very difficult, not only because of the cost but also because of privacy constraints. Synthetic data can be an effective means of preserving privacy.

By generating synthetic data that retains the statistical properties of the original data but does not disclose any sensitive information, it is possible to share or publish the synthetic data without compromising privacy.
At ICCV 2021 Microsoft presented how they generate 3D face models that include pixel-perfect segmentation, ten times as many landmark labels as usual, and a wide variety of realistic, diverse, and expressive faces with randomized identities. Source:

Domain adaptation

A domain is a set of data grouped by shared characteristics and acquisition environments. Domain Adaptation (DA) is the technique we use to bridge the gap between two different domains, i.e. datasets acquired in different environments and with different characteristics.

Domain adaptation allows us to train a model with a source dataset made of synthetic images generated with AI. Thanks to DA, the model will be capable of performing well when later used with our target (real) unlabeled data.

The various techniques we apply to reconcile source and target datasets are:

  1. Discrepancy-based methods: they minimize a statistical measure of the difference between the features extracted from synthetic and target data. This way, the model learns to identify image properties that remain invariant across the two domains, which tend to be the most relevant for our purposes (a minimal code sketch of this idea follows the list).
  2. Adversarial Discriminative methods: They pursue the same objective as discrepancy-based methods but use networks that are trained together on the same data, yet compete on opposing tasks. It works by training a main model to solve a chosen task on the synthetic data, while a second model (called a domain classifier) is trained to distinguish between real and synthetic images. This adversarial training teaches our main model to exclude the features which the domain classifier has defined as differentiators between real and synthetic images. Excluding them will automatically reduce the gap between the two domains.
  3. Adversarial generative methods: they use Generative Adversarial Networks (GANs) to modify source images and make them look more similar to the target ones.
  4. Self-Supervision-based methods: Self-supervised learning is a technique where a model learns important relations within the data without relying on explicit labels (if you’d like to read more about Self-Supervised Learning, we wrote another article on this topic). In the context of domain adaptation, it can be used to learn which features are meaningful and domain-invariant, which is then useful when adapting the source dataset’s images to those in the target dataset.
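
To give a flavor of the first family above, here is a minimal PyTorch sketch of a discrepancy-based penalty using the Maximum Mean Discrepancy (MMD); the feature extractor, kernel bandwidth, and loss weight are illustrative assumptions, not a reference implementation.

```python
# Discrepancy-based DA sketch: while training the task head on labeled
# synthetic data, penalize the Maximum Mean Discrepancy (MMD) between
# synthetic and real feature distributions so the extractor learns
# domain-invariant features. Shapes and hyperparameters are assumptions.
import torch

def gaussian_kernel(x, y, sigma=1.0):
    # Pairwise RBF kernel values between two batches of feature vectors
    dists = torch.cdist(x, y) ** 2
    return torch.exp(-dists / (2 * sigma ** 2))

def mmd_loss(source_feats, target_feats, sigma=1.0):
    k_ss = gaussian_kernel(source_feats, source_feats, sigma).mean()
    k_tt = gaussian_kernel(target_feats, target_feats, sigma).mean()
    k_st = gaussian_kernel(source_feats, target_feats, sigma).mean()
    return k_ss + k_tt - 2 * k_st

# Inside a training step (feature_extractor, task_head, criterion assumed to exist):
# feats_syn = feature_extractor(synthetic_images)        # labeled source batch
# feats_real = feature_extractor(real_images)            # unlabeled target batch
# loss = criterion(task_head(feats_syn), synthetic_labels) \
#        + 0.1 * mmd_loss(feats_syn, feats_real)         # 0.1 is an arbitrary weight
```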


Each of them has its advantages and disadvantages, and they can be more or less effective in certain given situations. The choice of method depends on the characteristics of the domains and the available data, as well as the specific requirements and constraints of the task at hand.

If you are interested in diving deeper into the technical aspects of these methods, here is a good survey on DA techniques: https://arxiv.org/abs/2009.00155

Application of CycleGAN, an adversarial generative method for domain adaptation. This network takes synthetic images from the famous video game Grand Theft Auto (GTA) as input (left) and produces a more realistic version of them (right). Source:

Practical Applications and Use Cases

Unity 3D (a popular rendering engine and editor for creating interactive content) provides a way to implement your own data generation software thanks to its Perception package. Many pre-made simulators built on Unity can be found online, with large libraries of ready-to-use 3D assets (people or objects). Whether you use those assets or import 3D models of your own objects of interest, you can easily set the scene and, very quickly, be ready to start acquiring data. What’s more, this software can be freely used for commercial purposes.

Here’s a real-world example of how we can use this technology: imagine a smartphone application for training at home, one of whose features is to guide the user to perform the exercises correctly, in real time. The application will recognize how good the user’s posture is while they perform an exercise and will tell them which body part has to be adjusted and how. To do this, we not only have to teach a model what the correct postures for each exercise are, but we also have to estimate the pose of a person by recognizing their body key points and segments.

Our problem here is that we need lots of images of people in those very specific poses (the exercises we want to include in the app), with a variety of shooting positions, often from unusual angles (e.g. the smartphone could be placed on the floor or on a low piece of furniture). For our model to reach optimal performance, we need a dataset that represents our situation with enough quantity, diversity, and labeled information.

Some publicly available pose-estimation datasets with commercial licenses are suitable for pretraining our network on the task. However, they contain people doing generic actions, in generic positions and at different distances from the camera, so we will still need fine-tuning data that matches our setting.
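
As a quick illustration of such a generic starting point, the sketch below runs torchvision’s Keypoint R-CNN, pre-trained on COCO, on a single image; the image path is a placeholder, and this is exactly the kind of general-purpose model we would later fine-tune with domain-adapted synthetic data.

```python
# Run an off-the-shelf keypoint detector pre-trained on COCO as a
# generic pose-estimation baseline. The input image is a placeholder.
import torch
from torchvision import models
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

weights = models.detection.KeypointRCNN_ResNet50_FPN_Weights.DEFAULT
model = models.detection.keypointrcnn_resnet50_fpn(weights=weights).eval()

image = convert_image_dtype(read_image("workout_frame.jpg"), torch.float)  # placeholder file
with torch.no_grad():
    prediction = model([image])[0]

keypoints = prediction["keypoints"]   # [num_people, 17, 3]: (x, y, visibility) per COCO key point
scores = prediction["scores"]         # detection confidence per person
```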

An example of a publicly available dataset (PeopleSansPeople). This framework generates images of people in highly randomized scenes (different poses, backgrounds, points of view) along with their labels (bounding boxes, skeleton key points, and more). Source:

We could build an ad hoc dataset by capturing smartphone images of people doing the exercises we need, but there are some problems:

  • Data quantity: since the application does not exist yet, we have no users from whom to acquire data
  • Data privacy: if we were to start a crowd-sourced data-gathering process, we would have to comply with data privacy regulations
  • Data labeling and cleaning: all acquired images must be manually annotated, a time-consuming and error-prone process

Creating a good-quality dataset in this way can be really difficult. Realistically, we would be able to acquire a few hundred unlabeled images that cover different situations.

We must create the fine-tuning dataset in a different way.

Unity researchers provided the PeopleSansPeople data generator, a tool for generating data involving humans for tasks such as detection and pose estimation. Images are generated with random human models, backgrounds, poses, viewpoints, and lighting conditions, to form a good-quality dataset. The developers released the code in a public repository with a commercial license, so we can use it as a starting point to control the poses and viewpoints of the generation process. This way, we can acquire a more suitable labeled dataset for our task.

Now we have a good synthetic dataset and a set of unlabeled images acquired from the real environment of our application. However, simply fine-tuning the network on the synthetic data will lead to poor performance due to the domain gap. Domain Adaptation solves this issue: for example, we can use the adversarial generative method mentioned previously, where a generative network trained with our data converts a synthetic image from PeopleSansPeople into an output image much closer to our real-world production setting.

The synthetic images we generated, now translated into the real-world domain by our generative network, can be used to fine-tune our pose estimation model effectively.
Pipeline of the AI solution for the workout app
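
The overall recipe can be summarized in a short sketch. Everything below (the trained synthetic-to-real generator, the pose model, the data loader, and the loss) is assumed to already exist; it is an outline of the pipeline described above, not the exact implementation.

```python
# End-to-end outline: translate synthetic images into the real-world
# domain with a trained generator (e.g. the synthetic-to-real half of a
# CycleGAN), then fine-tune the pose model on the translated images,
# whose key-point labels came for free from the simulator.
import torch

def adapt_and_finetune(generator, pose_model, synthetic_loader,
                       optimizer, criterion, epochs=5):
    generator.eval()        # already trained for synthetic -> real translation
    pose_model.train()
    for _ in range(epochs):
        for synth_images, keypoint_labels in synthetic_loader:
            with torch.no_grad():
                realistic_images = generator(synth_images)    # bridge the domain gap
            optimizer.zero_grad()
            predictions = pose_model(realistic_images)
            loss = criterion(predictions, keypoint_labels)
            loss.backward()
            optimizer.step()
    return pose_model
```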

Data acquisition comes with many real-world challenges, and we have explored a powerful solution to them: synthetic data generation.

By harnessing the power of artificial intelligence, we can now overcome the limitations of traditional data collection methods. Thanks to the rapid advancement of technologies and frameworks, the integration of synthetic data generation and adaptation into machine learning pipelines has become remarkably seamless and user-friendly.

In summary, although it may initially seem like an additional burden, leveraging generated data holds the key to slashing costs and speeding up development. Embracing the realm of synthetic data empowers us to propel our clients’ projects to new heights.

Are you ready to break free from the constraints of traditional data acquisition methods?

This article was written by Jason Ravagli, Machine Learning Engineer at Artificialy SA.
