DALL·E 3's depiction of synthetic data, aka "AI Inception"

Reimagining Reality

Potential and Pitfalls for Synthetic Data in Digital Twins

I recently wrote an article (Digital Twins and AI: The Dynamic Duo) about a fictional but completely plausible scenario involving AI-based automation responding to an emergency in a process plant. Although I barely alluded to it in that article, such an application can’t be created without synthetic data, which is a key enabler for numerous AI-based and non-AI applications. In this article, I’ll provide a foundation for understanding what synthetic data is, review common applications for digital twins, and discuss different approaches to generating this data. Since this will be a slightly longer and more technical article than I usually write, let’s jump right in.

The 101 of Synthetic 0s and 1s

Let’s start with a common understanding of synthetic data – what it is, some of its applications, and important “warning labels” related to its use. In short, synthetic data is artificially created information that is designed to mimic real-world data without directly copying it. This data can be static or time-series and can represent practically anything: images, 3D models, sensor and control inputs, internal system state estimates, location information, and more. Synthetic data is being used today in fields as diverse as healthcare, agriculture, finance, and robotics, to mention just a few. The key point is that synthetic data should reflect potential reality but with no direct dependence on real-world observations.

It’s clear what synthetic data is, but why use it? Quite simply, gathering real-world data is frequently not a practical option. Consider the case of autonomous vehicle development. An autonomous vehicle needs to operate safely in a huge number of potential situations, but it’s impractical (not to mention dangerously reckless) to deploy early-stage development vehicles on public roads so they can “learn by doing.” Not surprisingly, every major autonomous vehicle platform in the world currently uses synthetic data to simulate real-world conditions when training perception systems. Developers can generate extensive “corner cases” to refine system behavior in the most challenging scenarios, at much lower cost and with zero risk to the public. Of course, synthetic data can’t replace real-world experience, but it can be used to proactively identify and mitigate risks, as well as to find opportunities for optimization.

Minimizing safety risks during development and testing is only one of the motivations for incorporating synthetic data into applications. Collecting, validating, and annotating real-world data can be prohibitively expensive and time-consuming, and of course money and time are key constraints for any project. Privacy is another consideration, since real-world datasets can contain personally identifiable information (PII), seriously constraining how such data can be used and distributed.

While synthetic data can offer major advantages over real-world data, it is not a panacea and comes with potential pitfalls. One of these pitfalls is bias. Synthetic data should proportionately represent reality in a way that is appropriate for the application. This is easily understood in the context of synthetic digital humans, where it’s important to avoid bias related to demographic or cultural factors. Bias can creep in almost anywhere, however – considering our earlier autonomous vehicle example, bias can include variables such as weather, time of day, road conditions, signage and more. Another pitfall is realism, which is a bit of a double-edged sword. If synthetic data is not realistic enough, negative consequences can range from merely inconvenient (e.g., poor user experience in an employee training application) to highly damaging (e.g., a safety-critical system failing to respond appropriately). Conversely, hyper-realistic synthetic data can be equally problematic. In the real world, signals contain noise, and a system designed using only perfect data can lack robustness.
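To make the bias point concrete, here is a minimal sketch (in Python, with purely illustrative attribute names and target proportions, not drawn from any real dataset) of one simple check: comparing the observed mix of scenario attributes in a synthetic dataset against the mix it was intended to have.

```python
# Minimal sketch: compare attribute distributions in a synthetic dataset
# against target proportions to surface potential bias. Names and targets
# are illustrative assumptions.
from collections import Counter

def attribute_balance(records, attribute, targets):
    """Return observed vs. target share for each value of an attribute."""
    counts = Counter(r[attribute] for r in records)
    total = sum(counts.values())
    report = {}
    for value, target_share in targets.items():
        observed = counts.get(value, 0) / total if total else 0.0
        report[value] = {"observed": round(observed, 3),
                         "target": target_share,
                         "gap": round(observed - target_share, 3)}
    return report

# Hypothetical synthetic driving scenarios and the mix we intended to generate.
scenarios = [{"weather": "clear"}] * 70 + [{"weather": "rain"}] * 25 + [{"weather": "snow"}] * 5
targets = {"clear": 0.55, "rain": 0.30, "snow": 0.15}
print(attribute_balance(scenarios, "weather", targets))
```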

One of the trickier aspects of synthetic data is novelty, or rather the risk that synthetic data lacks novelty. The phrase “truth is stranger than fiction” reflects something fundamental about reality, and anyone who has lengthy experience with applications at scale probably has more than a few stories about events that occurred despite being highly improbable. Just as it is difficult to prove a negative, it’s difficult to identify missing data that represents unlikely but possible scenarios. It’s important for synthetic data to encompass outliers, but it’s not always easy to accomplish this.
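One simple, purely illustrative way to see why outliers need deliberate attention: a generator driven by Gaussian noise will almost never produce extreme events, while a heavier-tailed distribution produces them far more often. The short sketch below compares the two.

```python
# Illustrative only: Gaussian noise rarely produces extreme events, while a
# heavier-tailed distribution (Student's t) produces them regularly. Choosing
# the noise model deliberately is one simple way to include outliers.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
gaussian = rng.standard_normal(n)
heavy_tailed = rng.standard_t(df=3, size=n)  # df=3 gives much fatter tails

threshold = 5.0  # "rare event" cutoff in standard deviations
print("Gaussian samples beyond 5 sigma: ", np.sum(np.abs(gaussian) > threshold))
print("Student-t samples beyond 5 sigma:", np.sum(np.abs(heavy_tailed) > threshold))
```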

Synthetic Data, Meet Your Twin

Now that we’ve outlined the “why” of synthetic data and reviewed associated advantages and risks, it’s time to dive deeper into specific application opportunities. The first and most obvious use for synthetic data is training machine learning (ML) models, as exemplified by the autonomous vehicle example in the previous section. The potential extends far beyond autonomous vehicles. Synthetic data is a great solution to the need for large volumes of visual data to train perception and classification systems for applications as varied as robotics, retail and warehouse automation, security systems and more. These data aren’t limited just to images, either. Synthetic 3D models can be incorporated into a scene along with various lighting and environmental conditions to simulate sensor responses at multiple locations in a volume as inputs to ML training. Parameters such as object reflectivity across different wavelengths can be incorporated, as can active sensors such as LIDAR and radar. It’s also possible to introduce noise to signals or simulate component faults synthetically to understand the limits of system robustness, and to identify ways for the system to better adapt. There are of course limits to what can be simulated in real time. Reduced-order models can help, but compute constraints are still a consideration. However, real-time tools have improved recently, and using synthetic data in these tools is starting to become mainstream for training vision systems.

Synthetic Data with Annotation for Machine Vision Training (provided courtesy of FS Studio)
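As a concrete (and deliberately simplified) illustration of the noise and fault injection idea mentioned above, the sketch below corrupts a clean simulated sensor trace with Gaussian noise, slow drift, and a “stuck sensor” fault; the signal and parameters are illustrative assumptions, not drawn from any real system.

```python
# Minimal sketch: corrupt a clean simulated sensor trace with noise, drift,
# and a "stuck sensor" fault to probe the robustness of a downstream model.
import numpy as np

def degrade_signal(clean, noise_std=0.05, drift_per_step=0.0005,
                   stuck_start=None, stuck_len=50, rng=None):
    rng = rng or np.random.default_rng()
    noisy = clean + rng.normal(0.0, noise_std, size=clean.shape)
    noisy += drift_per_step * np.arange(len(clean))   # slow sensor drift
    if stuck_start is not None:                       # stuck-at fault
        end = min(stuck_start + stuck_len, len(noisy))
        noisy[stuck_start:end] = noisy[stuck_start]
    return noisy

t = np.linspace(0, 10, 1000)
clean = np.sin(2 * np.pi * 0.5 * t)        # stand-in for a simulated sensor
faulty = degrade_signal(clean, stuck_start=400)
```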

Not only is synthetic data useful for training machines, it’s also well suited to training people. Interactive visual training has been shown to produce superior results to other methods, and customers who adopt this technology tend to adopt it permanently. One constraint holding back progress, however, is the cost and complexity of building interactive scenes, especially for mixed reality applications. Synthetic data can be a very useful way of producing significant portions of a scene’s content in less time and at lower cost. The same technique is useful for other applications such as interactive marketing visualization (e.g., configurators) and architectural visualization. Synthetic tools that generate terrain, vegetation, humans, and other scene elements can provide a huge productivity and efficiency boost for artists and developers creating these applications.

Perhaps the ultimate opportunity for the application of synthetic data is for accelerating the development of embodied AI systems. Embodied AI is a very promising technique whereby an AI-enabled agent interacts with the real world to learn through multisensory stimuli and to observe the results of actions initiated by the agent. It doesn’t take much imagination to come up with examples of ways this could result in undesirable outcomes – but by using synthetic data within a virtual real-time model of the real world, an embodied AI agent can learn in a safe “sandbox.” In effect, the agent becomes one actor in a generative adversarial network (GAN) that continuously escalates scenarios to improve agent capabilities.
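To sketch that escalation idea (only conceptually; this is not a literal GAN, and the agent evaluation is a stand-in for running an embodied agent in simulation), the toy loop below keeps the scenarios a hypothetical agent handles worst and mutates them to produce progressively harder synthetic cases.

```python
# Conceptual sketch only: an adversarial-style curriculum in which a scenario
# generator keeps and mutates the scenarios a (hypothetical) agent handles worst.
import random

def evaluate_agent(scenario):
    # Placeholder: in practice this would run the embodied agent in a
    # simulated environment and return a performance score (higher is better).
    return 1.0 - abs(scenario["obstacle_speed"] - 0.7)

def escalate(scenarios, rounds=10, keep=5):
    for _ in range(rounds):
        scored = sorted(scenarios, key=evaluate_agent)   # worst performers first
        hard_cases = scored[:keep]
        mutated = [{"obstacle_speed": max(0.0, s["obstacle_speed"] + random.uniform(-0.1, 0.1))}
                   for s in hard_cases]
        scenarios = hard_cases + mutated                  # focus on hard cases
    return scenarios

seed = [{"obstacle_speed": random.random()} for _ in range(20)]
hardest = escalate(seed)
```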

These are just a few examples of ways to use synthetic data together with digital twins, drawn from a much larger set of applications that will only grow in the future as innovation unlocks new possibilities.

Beyond LLMs

A casual observer of AI would deduce from media coverage that transformer models such as ChatGPT, DALL·E 3, and Gemini are the “sine qua non” of all things AI. To be sure, transformers represent a real breakthrough and have potential for many applications, including synthetic data generation. A more holistic approach to synthetic data requires more than one tool in the toolbox, however.

In certain cases, AI is overkill for generating synthetic data. Consider a project to develop an ML-based diagnostic system for an electromechanical system with well-understood operating parameters, a model-based simulation of the system, and a complete fault tree analysis. Creating synthetic data to train this ML model is merely a matter of exercising the model-based simulation over the entire operating range and virtually triggering faults in the system. The resulting outputs can be used directly to train the ML-powered diagnostics. Not coincidentally, this is exactly the scenario I depicted in the article I referenced in the first sentence.
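A minimal sketch of that workflow, assuming a hypothetical simulate(load, fault) function as a stand-in for the plant model: sweep the operating range, inject each fault mode, and collect labeled rows for training the diagnostic.

```python
# Minimal sketch: exercise a (hypothetical) model-based simulation across its
# operating range and fault modes to produce labeled training data for an
# ML-based diagnostic. `simulate` is a stand-in for the real plant model.
import itertools
import numpy as np

FAULTS = ["none", "bearing_wear", "winding_short", "sensor_drift"]

def simulate(load, fault):
    """Stand-in physics: returns (vibration, temperature) for a motor at `load`."""
    vibration = 0.1 * load + (0.5 if fault == "bearing_wear" else 0.0)
    temperature = 40 + 30 * load + (15 if fault == "winding_short" else 0.0)
    if fault == "sensor_drift":
        temperature += 5.0
    return vibration, temperature

rows = []
for load, fault in itertools.product(np.linspace(0.1, 1.0, 50), FAULTS):
    vib, temp = simulate(load, fault)
    rows.append({"load": load, "vibration": vib, "temperature": temp, "label": fault})
# `rows` can now be used directly to train the diagnostic classifier.
```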

Established AI techniques other than LLMs can be ideally suited to generating synthetic data, depending upon the application. Techniques such as GANs, variational autoencoders (VAEs), and diffusion models are viable alternatives to LLMs. GANs and VAEs are well suited to generating time-series data with minimal risk of AI hallucination, for example. Researchers are continuing to push the state of the art with techniques such as generative adversarial transformers, which combine elements of GANs with those of transformers.
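As one illustration of the non-LLM route, the compact sketch below (assuming PyTorch is available; sizes and architecture are illustrative) shows a variational autoencoder for fixed-length time-series windows. Once trained on real traces, sampling its decoder from the prior yields synthetic sequences.

```python
# Compact sketch of a VAE for fixed-length time-series windows (PyTorch assumed).
import torch
import torch.nn as nn

class TimeSeriesVAE(nn.Module):
    def __init__(self, seq_len=64, latent_dim=8, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(seq_len, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, seq_len))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.decoder(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    recon_err = nn.functional.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_err + kl

# After training on real windows, draw synthetic time series from the prior:
model = TimeSeriesVAE()
with torch.no_grad():
    synthetic = model.decoder(torch.randn(16, 8))  # 16 synthetic windows
```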

That said, LLMs can indeed be useful for synthetic data generation, particularly when combined with techniques like procedural modeling. By coupling an LLM to the algorithms and parameters used to generate models, a developer or artist can constrain outputs within an envelope of realism while dramatically accelerating the speed and variability of content generation.
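The sketch below illustrates that “envelope of realism” idea: parameters proposed by an LLM (the call itself is omitted, and the sample output is hypothetical) are clamped against hand-authored ranges before being handed to a procedural generator.

```python
# Sketch of the "envelope of realism" idea: LLM-proposed procedural parameters
# (the LLM call is omitted; the proposal below is hypothetical) are validated
# and clamped against hand-authored ranges before scene generation.
ENVELOPE = {
    "building_height_m": (3.0, 120.0),
    "street_width_m": (6.0, 30.0),
    "tree_density_per_km": (0.0, 400.0),
}

def constrain(proposed: dict) -> dict:
    """Clamp proposed procedural parameters into plausible ranges."""
    safe = {}
    for key, (lo, hi) in ENVELOPE.items():
        value = float(proposed.get(key, (lo + hi) / 2))  # default to mid-range
        safe[key] = min(max(value, lo), hi)
    return safe

# Hypothetical output parsed from an LLM response:
llm_proposal = {"building_height_m": 400.0, "street_width_m": 12.0}
params = constrain(llm_proposal)
# `params` is now bounded by the envelope rather than by trust in the LLM.
```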

What Next?

It’s easy to be blinded by technology, especially when that technology is a groundbreaking innovation. It’s important to step back and consider where this technology can be applied practically in your business. This isn’t an argument to wait – I firmly believe that successful companies will be the ones who embrace innovative uses of AI and synthetic data. At the same time, I wouldn’t recommend jumping in headfirst without forethought. The best way to proceed is to leverage the knowledge of someone with experience who can help you navigate the perils on the way to achieving the potential of this innovation. It’s equally important to partner with vendors who have a proven ability to translate this technology into practical implementations.


Do you have problems or opportunities that can be addressed with synthetic data? Feel free to reach out. A call costs nothing but 30 minutes of your time.

Special thanks to Jose de Oliveira of Microsoft and Tim Martin of FS Studio (https://fsstudio.com) for their contributions to this article. Their contributions represent their own views and do not necessarily represent the views of their associated companies.

Ed Martin is the founder of Twinsight Consulting LLC. Learn more at https://twinsightconsulting.com.

Super article from Edward Martin warning us that, in a technology area as powerful, broad-reaching, and impactful as Digital Twins, you need the right mindset and experience to guide your projects through the many pitfalls that are certain to exist! Few people in the industry can provide this level of know-how, and Ed is absolutely one of those!

David Varela

Real-time 3D Technology Executive | Driving Business Growth & Product Strategy

8 months

Another great article Edward Martin. Topic aside, it is refreshing to read your articles/stories full of innovative tech wrapped in a digestible, down-to-earth approach to how things are done in the real world. Over the last 10 years we experienced an unprecedented boom in technologies that ‘would transform industry forever’, and mostly contributed to a fair share of billions’ worth of ‘Proof of Concept’ purgatory because of the lack of business and engineering rigor in many tech-driven decisions. Now that cash is more scarce, I bet companies will turn to knowledgeable, pragmatic experts like you to help them make every penny count.

Adam Crespi

Staff Technical Artist at Capgemini

8 months

Speaking as a designer and maker of synthetic data, this reads very true to me. I find the nuance is often in humans understanding what makes for good data, and what can be discarded or not considered. For example, on my current project we are spending considerable time making complex randomized versions of varying urban archetypes, but purposefully neglecting much of the street FF&E surrounding the buildings; we have also oversimplified the roadways. This is a pattern I have seen and crafted on many occasions, where the models and scenes constructed to produce synthetic data are exquisitely rendered but contextually odd for humans to view. This is of course not a problem, as the synthetic data is not meant for mere human consumption; it is made to feed the machines. The confluence of digital twin and synthetic data needs particularly interests me, as the data lake a twin sits on is ripe for contextual statistical variation. It also allows data mining and analysis to identify trends and outliers that can shape the synthetic data design and domain randomization.

Tobin Jones

Driving Collaboration at Unity Cloud || Creative Manager | Visualization Supervisor | Public Speaker | Industry Consultant | Pixar ILM DWA Apple

8 months

Excellent article Edward Martin. Reminds me of two cases I heard from an engineer working in autonomous vehicle training. In the first, they trained the system on traffic cones, but then it started identifying all A-frames (ladders, etc.) as traffic cones. A big problem. In the second, they made detailed animations of pedestrians walking, when it turned out static images of people in stride moving linearly were more than sufficient. Takeaways: like you said, finding the right amount of information is key. We have our own biases, and it’s a trap to input them into the system as we train AI to replace tasks.
