Disappearing data moats
Are companies overvaluing their data moats in our new world of AI? A deeper exploration into synthetic data suggests this may be the case…
Many companies believe they have strategic moats built from consumer data, data they've spent years aggregating and that now appears even more valuable in an AI-centric world. The phrase "data is the new oil" has been used to frame data as the asset that separates the haves from the have-nots.
In addition to the perceived value of data, we're seeing significant investment in generative AI (GenAI) applications. To me, the obvious places for value to be extracted from the market were infrastructure (NVIDIA) and data (Meta, Google, Amazon). Chamath Palihapitiya reaffirmed that conviction multiple times, in different venues, over the last year.
However, through researching data in generative AI, I discovered an under-discussed trend - synthetic data. This led me to realize that data is NOT the new oil and that these strategic data moats are shrinking. Human-generated data will likely become less valuable, shifting value towards synthetic data.
I'm not alone in this perspective. Gartner predicts that by 2024, 60% of the data used in AI will be synthetically generated, and estimates that synthetic data will overshadow real data in AI models by 2030, meaning nearly all data used to train AI by then will likely be synthetic.
But that's not all. Let's see what Sam Altman has to say:
In May 2023, Sam Altman was asked whether he was worried about regulatory probes into ChatGPT's potential privacy violations. He brushed it off, saying he was "pretty confident that soon all data will be synthetic data."
In a future where synthetic data is preferred over human-generated data, several key changes will emerge.
Synthetic data is poised to surpass human-generated data in both volume and quality, challenging the notion that real data is always superior. Real-world data is often problematic - it's messy, biased, and fraught with privacy issues. Synthetic data, on the other hand, is cleaner and more controlled.
Now, you may wonder how this is all possible.
Synthetic data generation - 101
Synthetic data generation, with a history spanning decades, primarily found its application in simulations. A notable example is Tesla's extensive simulation of outlier scenarios for their self-driving vehicles.
The methods for generating synthetic data vary, each suited to particular uses and carrying its own set of trade-offs. Common approaches include generative adversarial networks (GANs), variational autoencoders (VAEs), statistical and rule-based sampling, and large-scale simulation.
Today's primary methods for generating synthetic data start with "seed data," which is originally human-generated. This seed data serves as a base to ensure the synthetic version remains statistically similar to the original.
Experts in synthetic data generation focus on three key quality metrics: fidelity, diversity, and utility.
For more sensitive datasets, balancing fidelity and privacy becomes crucial. Our goal should be to maximize fidelity while preserving privacy.
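To make "fidelity" concrete, here is a minimal sketch of one common check: comparing the marginal distribution of each column in the real and synthetic data. The toy data, column count, and the choice of averaging a Kolmogorov-Smirnov statistic are illustrative assumptions, not a standard any particular vendor prescribes.

```python
# Minimal sketch: gauge fidelity by comparing per-column distributions
# of real vs. synthetic data. Toy data and thresholds are illustrative.
import numpy as np
from scipy.stats import ks_2samp

def marginal_fidelity(real: np.ndarray, synthetic: np.ndarray) -> float:
    """Mean Kolmogorov-Smirnov distance across columns
    (0 = identical marginals, 1 = completely different)."""
    stats = [
        ks_2samp(real[:, i], synthetic[:, i]).statistic
        for i in range(real.shape[1])
    ]
    return float(np.mean(stats))

# Toy example: synthetic data drawn from a distribution close to the real one
rng = np.random.default_rng(0)
real = rng.normal(loc=[50, 100], scale=[5, 20], size=(1_000, 2))
synthetic = rng.normal(loc=[51, 98], scale=[5, 22], size=(1_000, 2))

print(f"mean KS distance: {marginal_fidelity(real, synthetic):.3f}")  # lower = higher fidelity
```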
One method to protect individual privacy in the real-world datasets used to seed synthetic data is differential privacy. Differential privacy adds a small amount of calibrated random noise to the data, making it hard to identify any one person's information while still maintaining the overall usefulness of the dataset. A real-world use case we interact with daily is auto-complete for words and emojis on both Apple and Google devices. For optimal results, this method should mainly be used on massive datasets, where the noise barely affects aggregate statistics.
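As an illustration, here is a minimal sketch of the Laplace mechanism, the textbook way such noise is added. The dataset, bounds, and epsilon value are illustrative assumptions; production systems tune these carefully and track a privacy budget.

```python
# Minimal sketch of the Laplace mechanism for a differentially private mean.
# The dataset, bounds, and epsilon are illustrative assumptions.
import numpy as np

def dp_mean(values: np.ndarray, lower: float, upper: float, epsilon: float) -> float:
    """Differentially private mean of bounded values."""
    clipped = np.clip(values, lower, upper)           # bound each person's influence
    sensitivity = (upper - lower) / len(clipped)      # max change one record can cause
    noise = np.random.laplace(scale=sensitivity / epsilon)
    return clipped.mean() + noise

ages = np.random.randint(18, 90, size=100_000)        # large dataset, as recommended above
print(dp_mean(ages, lower=18, upper=90, epsilon=1.0))
```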
The state of synthetic data
The current landscape for synthetic data is interesting. There's growing market demand for synthetic data, with both startups and incumbents aiming to fill that need.
This market can be broadly classified into two categories: structured and unstructured synthetic data generators.
I predict that the real value capture and competition will center around unstructured data. This prediction is based on the use cases derived from unstructured data, most of which will focus on training AI models.
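To make the simpler, structured category concrete, here is a minimal sketch of a toy tabular generator that fits per-column distributions to seed data and samples new rows. The columns and distributions are illustrative assumptions; real products also model correlations between columns.

```python
# Minimal sketch of a "structured" synthetic data generator:
# fit simple per-column distributions to seed data, then sample new rows.
import numpy as np
import pandas as pd

def fit_and_sample(seed: pd.DataFrame, n_rows: int, random_state: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(random_state)
    out = {}
    for col in seed.columns:
        if pd.api.types.is_numeric_dtype(seed[col]):
            # numeric column: sample from a normal fitted to the seed data
            out[col] = rng.normal(seed[col].mean(), seed[col].std(), n_rows)
        else:
            # categorical column: sample with the observed category frequencies
            freqs = seed[col].value_counts(normalize=True)
            out[col] = rng.choice(freqs.index.to_numpy(), size=n_rows, p=freqs.to_numpy())
    return pd.DataFrame(out)

seed = pd.DataFrame({"age": [23, 35, 41, 29, 52], "plan": ["free", "pro", "pro", "free", "pro"]})
print(fit_and_sample(seed, n_rows=10))
```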
Advancements and challenges
Now that we understand the market structure, let's explore recent advancements in training AI using synthetic data and the associated challenges.
The adoption of synthetic data is rapidly growing in the field of generative AI, primarily through a concept called "bootstrapping." Bootstrapping involves training one model on data generated by another model. A typical example is using GPT-4 to generate training data for GPT-3.5 on specific tasks, like LLM evaluation.
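As a rough illustration of the idea (not OpenAI's actual pipeline), the sketch below uses a publicly available model as a stand-in teacher to produce synthetic prompt-response pairs that a smaller student model could later be fine-tuned on. The model name, prompts, and file format are assumptions made for the example.

```python
# Minimal sketch of "bootstrapping": a teacher model generates synthetic
# training pairs for a smaller student. Model name, prompts, and file
# format are illustrative assumptions.
import json
from transformers import pipeline

teacher = pipeline("text-generation", model="gpt2-large")  # stand-in for a stronger teacher

prompts = [
    "Explain differential privacy in one sentence:",
    "Summarize why synthetic data matters for AI training:",
]

with open("synthetic_train.jsonl", "w") as f:
    for prompt in prompts:
        generated = teacher(prompt, max_new_tokens=60, do_sample=True)[0]["generated_text"]
        # Each record becomes a (prompt, response) pair for fine-tuning the student model.
        record = {"prompt": prompt, "response": generated[len(prompt):].strip()}
        f.write(json.dumps(record) + "\n")
```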
“Recent research has shown that training small, efficient language models (SLMs) on high-quality, diverse data can achieve state-of-the-art results, even rivaling or surpassing LLMs 5x their size, such as Llama2-7b and Falcon-7b, on common tasks, as demonstrated by models like Microsoft's phi-1.5 (from their paper 'Textbooks Are All You Need'), Orca 2, and IBM's Granite.”
These small language models (SLMs) are paving the way for generating high-quality models using synthetic data, and this approach has the potential to scale to much larger models. Recent successes include Microsoft's Phi-2 and Google's ReST^EM.
Success in this field also brings its share of challenges, particularly within the realm of synthetic data. One crucial aspect is ensuring that synthetic data faithfully replicates real-world conditions. Failure to capture these complexities can lead to poor model performance in practical scenarios, which becomes challenging for complex data, like images.
Another significant concern voiced by skeptics of synthetic data is what's known as "mode collapse." This issue frequently arises with the GAN method mentioned earlier. Mode collapse occurs when a model designed to generate a wide range of diverse outputs ends up repetitively producing a limited set of them instead. It's like a chef who only cooks a handful of dishes despite knowing a vast cookbook, hence the term "mode collapse": the model converges on only one or a few modes.
Luckily, there are a variety of ways to address most challenges with synthetic data. The key lies in ensuring the diversity of your data and continually validating it. Additionally, incorporating updated original data into the data generator on an ongoing basis helps maintain the high fidelity of synthetic data.
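As one example of ongoing validation, the sketch below checks how evenly synthetic samples cover clusters found in the real data; a collapsed generator covers only a few of them. The cluster count, toy data, and scoring choice are illustrative assumptions.

```python
# Minimal sketch of a diversity check to catch mode collapse:
# measure how many real-data clusters the synthetic data actually reaches.
import numpy as np
from sklearn.cluster import KMeans

def coverage_ratio(real: np.ndarray, synthetic: np.ndarray, n_clusters: int = 10) -> float:
    """Fraction of real-data clusters that receive at least one synthetic sample."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(real)
    covered = np.unique(km.predict(synthetic))
    return len(covered) / n_clusters

rng = np.random.default_rng(0)
real = rng.normal(size=(2_000, 8))
collapsed = rng.normal(size=(2_000, 8)) * 0.05        # synthetic data stuck in one "mode"
healthy = rng.normal(size=(2_000, 8))

print(f"collapsed generator coverage: {coverage_ratio(real, collapsed):.2f}")  # low
print(f"healthy generator coverage:   {coverage_ratio(real, healthy):.2f}")    # near 1.0
```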
Throughout this post, I've argued that strategic data moats are diminishing while acknowledging the ongoing importance of human-generated "seed data." I understand this might seem contradictory. Currently, human-generated data plays a crucial role in training AI models, but its significance is expected to diminish over time. Let me provide you with two recent research findings that further support this trend, in case you aren't already convinced.
First, there’s MimicGen, which has demonstrated its ability to create high-quality synthetic data from small samples of real-world data. They successfully scaled up from 200 human demos to 50k synthetic demos. This research underscores the diminishing need for human-generated data.
Second, there's the concept of "no shot" or "zero-shot" synthetic data generation, where data can be generated without any initial real-world data. Rendered AI has hinted at its success with this approach on multiple occasions (see here and here).
In the end, if we want powerful AI incorporated into all aspects of our lives, then synthetic data is critical. The quantity and quality of real-world data alone will not be enough.