Building Better Models Faster with Synthetic Data: The Future of Data-Centric AI

In the final article of our GenAI Architecture Series, we focus on a pressing challenge for organizations and AI practitioners worldwide—data as a blocker to developing better models. Maarten Van Segbroeck, Head of Applied Science at Gretel, shared groundbreaking insights at the LLMOps Micro-Summit on how synthetic data can help overcome data scarcity and quality challenges. His discussion explored new techniques for generating high-quality synthetic data to fine-tune small language models (SLMs), making them more accurate, efficient, and adaptable for specific use cases. This article delves into how synthetic data is redefining AI and its transformative potential in shaping the future of machine learning.


The Data-Centric Wave: A New Era for AI Development

As Maarten aptly pointed out, we are moving into a data-centric wave of AI development. While the previous two decades saw significant advances in computational power and model architectures—from deep neural networks and GPUs in the 2000s to the explosion of Transformer architectures like BERT, GPT, and others in the 2010s—the focus is now shifting toward data quality. AI is reaching a critical juncture where the scarcity and limitations of high-quality, labeled data are becoming significant bottlenecks in model training.

Articles and discussions in AI communities increasingly highlight the growing concern that AI is running out of quality data. This sentiment is echoed by AI leaders like Sam Altman of OpenAI, who recently admitted that the world’s most valuable resource, data, is becoming harder to source and manage. Maarten described this trend: "More and more websites and open-source data communities are restricting access to their data, preventing it from being used for training large language models." As this trend continues, synthetic data is emerging as a viable way to work around these limitations and unlock new potential in AI development.

The Problem: Data Challenges in AI Development

Organizations face numerous challenges when dealing with data for training AI models:

  1. Messy or Incomplete Data: Even when data is available, it often requires extensive cleaning and preprocessing, which can take weeks or months.
  2. Data Scarcity: Some organizations lack the necessary datasets altogether or have insufficient data for certain tasks.
  3. Data Privacy and Security: When dealing with sensitive data subject to regulations like GDPR and HIPAA, privacy and security concerns become a significant challenge. This often results in data silos and difficulties in efficiently utilizing available data.


Maarten highlighted that these challenges lead to many AI projects getting stuck or never reaching completion. "We'd like to solve that using synthetic data, generating synthetic data for them," he explained. Synthetic data, created by generative AI models, is increasingly seen as a powerful tool to tackle these problems.

What is Synthetic Data?

At its core, synthetic data is data generated by generative AI models as an alternative to real-world data. This data looks and feels like the original data but offers the advantage of being fully controllable. Organizations can generate synthetic data that closely resembles their real-world data while maintaining control over its characteristics, making it a versatile tool for downstream applications.

Synthetic data generation has gained traction among major tech companies like Databricks, Microsoft, and Google. Databricks has demonstrated that injecting high-quality synthetic data can reduce the overall amount of data needed for training models. Similarly, Microsoft and Google have used synthetic data to complement their foundational models, underscoring its importance in developing robust AI systems.

Gretel’s Approach: Tackling Data Scarcity and Privacy Challenges with Synthetic Data

Gretel has positioned itself as a leader in synthetic data solutions, particularly in the realm of large and small language models. The platform supports multiple data modalities, from free text and tabular data to time-series and high-dimensional data. While generating synthetic data, Gretel incorporates privacy-preserving techniques such as differential privacy, ensuring that synthetic datasets do not expose sensitive information from the original data.

Maarten outlined three typical scenarios where synthetic data is beneficial:

  1. No Starting Data: Organizations lacking initial datasets can generate data from scratch using Gretel.
  2. Incomplete or Sparse Data: For those with some data but lacking coverage, synthetic data can fill in the gaps, boost datasets, and enhance the quality of the training data.
  3. Sensitive Data: For organizations with sensitive data that cannot be used directly due to privacy concerns, Gretel offers solutions to generate differentially private synthetic data that mimics the original data without risking re-identification.

Synthetic Data Solutions for Language Models

Gretel’s Navigator Model is a compound AI system in which multiple agents and tools work together to generate high-quality datasets from user prompts. It pairs LLMs with deterministic tools, such as Python functions, to compensate for areas where LLMs are weak, like arithmetic or generating unique identifiers.
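
To make the compound pattern concrete, here is a minimal Python sketch of the idea: the LLM drafts the language-heavy fields of a record while deterministic Python tools supply identifiers and arithmetic. This illustrates the pattern Maarten describes, not Gretel Navigator's actual implementation; the `call_llm` helper is a hypothetical placeholder for whatever model client you use.

```python
# Illustrative sketch only: the general "LLM + deterministic tools" pattern,
# not Gretel Navigator's implementation.
import json
import uuid


def generate_unique_id() -> str:
    """Deterministic tool: unique identifiers come from code, not the LLM."""
    return str(uuid.uuid4())


def compute_total(prices: list[float]) -> float:
    """Deterministic tool: arithmetic is delegated to Python, not the LLM."""
    return round(sum(prices), 2)


def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in any chat-completion client you use."""
    raise NotImplementedError("plug in your LLM provider here")


def generate_synthetic_order() -> dict:
    # The LLM handles the fuzzy, language-heavy fields...
    raw = call_llm(
        "Return a JSON object with keys 'customer_name' and 'items' "
        "(a list of {'name': str, 'price': float}) for a fictitious retail order."
    )
    record = json.loads(raw)
    # ...while Python tools fill in the fields LLMs tend to get wrong.
    record["order_id"] = generate_unique_id()
    record["order_total"] = compute_total([item["price"] for item in record["items"]])
    return record
```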

One of Gretel's notable achievements is the creation of the world's largest open-source text-to-SQL dataset, which has become highly popular on Hugging Face. This dataset enables training LLMs to convert natural language queries into SQL commands, allowing seamless interaction with databases.
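
For readers who want to explore it, the dataset can be pulled with the Hugging Face `datasets` library. The dataset identifier below is an assumption based on Gretel's Hugging Face organization; verify the exact name before running.

```python
# Sketch: pulling the open text-to-SQL dataset from Hugging Face.
# The dataset id is assumed; check Gretel's organization page for the exact name.
from datasets import load_dataset

ds = load_dataset("gretelai/synthetic_text_to_sql", split="train")
print(ds.column_names)  # inspect the available fields
print(ds[0])            # one record: a natural-language question with its target SQL
```

Each record pairs a natural-language question with the corresponding SQL, which is exactly the kind of supervision an SLM needs to learn database querying.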

Maarten illustrated the advantages of synthetic data for training SLMs with results on the BIRD text-to-SQL benchmark. When fine-tuned on Gretel’s synthetic dataset, SLMs like Meta LLaMA 3, Microsoft Phi, and Mistral showed significant accuracy improvements, reinforcing the value of synthetic data for building better models faster.
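
As a rough illustration of that fine-tuning step, the sketch below uses Hugging Face TRL's `SFTTrainer` to adapt a small open model on the synthetic text-to-SQL corpus. The column names and exact `SFTConfig` arguments are assumptions and vary across `trl` releases; treat this as a starting point rather than Gretel's recipe.

```python
# Hedged sketch: supervised fine-tuning of a small model on a synthetic
# text-to-SQL corpus with Hugging Face TRL. Adjust arguments to your trl version.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_ds = load_dataset("gretelai/synthetic_text_to_sql", split="train")


def to_text(example):
    # Assumed field names; adapt to the dataset's actual columns.
    return {"text": f"Question: {example['sql_prompt']}\nSQL: {example['sql']}"}


train_ds = train_ds.map(to_text)

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-v0.1",  # any small open model works here
    train_dataset=train_ds,
    args=SFTConfig(
        output_dir="sql-slm",
        dataset_text_field="text",
        max_steps=500,
    ),
)
trainer.train()
```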

Synthetic Data with Differential Privacy

For highly sensitive data, Gretel provides differentially private synthetic data generation. This approach involves training models on real-world data while injecting noise to prevent the risk of re-identification. Maarten provided a compelling example from healthcare data, demonstrating how synthetic data can retain semantic value without compromising patient privacy.
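
The usual mechanism behind such guarantees is DP-SGD: per-sample gradients are clipped and noised during training so that no single record can be memorized or reconstructed. Below is a toy sketch with the Opacus library and a stand-in generator; it illustrates the technique, not Gretel's production pipeline.

```python
# Hedged sketch of DP-SGD: clip per-sample gradients and add calibrated noise
# during training. Toy data and model; not Gretel's pipeline.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

records = torch.randn(1024, 16)                  # stand-in for sensitive tabular features
loader = DataLoader(TensorDataset(records), batch_size=64)

generator = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
optimizer = torch.optim.SGD(generator.parameters(), lr=0.05)

privacy_engine = PrivacyEngine()
generator, optimizer, loader = privacy_engine.make_private(
    module=generator,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.1,   # more noise: stronger privacy, lower fidelity
    max_grad_norm=1.0,      # per-sample gradient clipping bound
)

for (batch,) in loader:
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(generator(batch), batch)  # toy reconstruction objective
    loss.backward()
    optimizer.step()

print(f"epsilon spent: {privacy_engine.get_epsilon(delta=1e-5):.2f}")
```

The noise multiplier and clipping bound trade privacy for fidelity, and the epsilon reported at the end quantifies how much privacy budget the run consumed.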

Maarten stated, "Synthetic data is great, but you need to provide high-quality synthetic data. Everyone can create synthetic data, but high-quality synthetic data is really something that's very valuable for us."

Quality Control in Synthetic Data Generation

One of the key concerns with synthetic data is the potential introduction of bias or hallucinations. During the Q&A session, Maarten addressed this issue:

Audience Question (Spark, Solution Architect): "Sometimes we heard another story that we are afraid that when we synthesize data, we're going to introduce some bias or maybe make the hallucination even worse. How do you control the quality to make sure it's going in the right direction?"

Answer (Maarten): "Synthetic data is not always a replacement for your real data. Merging real data with synthetic data is always a good idea... Techniques like LLM as a judge, where you rely on another LLM to assess the quality of the synthetic data, that's also important. High-quality synthetic data is something that's very valuable for us."

Maarten's answer highlights the importance of rigorous evaluation and quality control methods, such as using another LLM as a judge to validate synthetic data.
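
A lightweight version of that LLM-as-a-judge check might look like the sketch below, where a second model scores each synthetic record against a rubric before it enters the training set. The `judge_llm` helper is a hypothetical placeholder for your chat-completion client, and the rubric is illustrative only.

```python
# Hedged sketch of the LLM-as-a-judge pattern: a second model scores each
# synthetic record against a rubric, and low-scoring records are filtered out.
import json

RUBRIC = (
    "Score the following synthetic record from 1 to 5 for factual plausibility, "
    "schema validity, and absence of obvious bias. "
    'Reply with JSON: {"score": <int>, "reason": "<short reason>"}.'
)


def judge_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's chat API."""
    raise NotImplementedError


def filter_synthetic(records: list[dict], min_score: int = 4) -> list[dict]:
    kept = []
    for record in records:
        verdict = json.loads(judge_llm(f"{RUBRIC}\n\nRecord:\n{json.dumps(record)}"))
        if verdict["score"] >= min_score:   # keep only records the judge rates highly
            kept.append(record)
    return kept
```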

Strategic Actions for Executives and CTOs

As we conclude our series on GenAI Architecture, it is crucial to consider the strategic actions needed to leverage synthetic data effectively:

  1. Adopt Synthetic Data Solutions: Consider incorporating synthetic data generation into your AI development pipeline to overcome data scarcity, privacy concerns, and data quality challenges.
  2. Combine Real and Synthetic Data: Ensure high-quality outcomes by blending synthetic data with real-world data, thereby reducing biases and improving model robustness.
  3. Invest in Advanced Evaluation Tools: Use advanced evaluation tools and techniques such as LLM as a judge to ensure that synthetic data meets high-quality standards.
  4. Explore Differential Privacy: For sensitive data, explore differential privacy techniques to create safe and effective synthetic datasets that maintain data utility without compromising privacy.
  5. Collaborate Across Teams: Foster collaboration between data scientists, AI practitioners, and domain experts to ensure that synthetic data aligns with real-world applications and maintains relevance.
  6. Stay Ahead of the Curve: Keep abreast of emerging trends in synthetic data generation and fine-tuning methods to stay competitive in the rapidly evolving AI landscape.

Looking Ahead

The future of AI development lies in leveraging synthetic data to build better models faster. By embracing innovative techniques and ensuring robust data quality controls, organizations can unlock new levels of performance and efficiency in their AI systems. As Gary Marcus noted, "people had the illusion that you could infinitely make large language models better by just using more and more data, but now they've basically used all the data they can." As natural data becomes scarce, synthetic data will be the key to unlocking the next wave of AI advancements.

By understanding the role of synthetic data and integrating it strategically, businesses can position themselves at the forefront of AI innovation. Let's build a future where AI development is both ethical and groundbreaking, leveraging the power of synthetic data for the next generation of intelligent systems.
