Measuring Privacy Risks and Preventing Model Collapse
On the art and science of synthetic data generation.

Welcome to the Gretel Epoch, a roundup of synthetic data news, product developments, and practical insights into generating high-quality data that is private by design.

Synthetics in the News

Synthetic data has been making headlines recently. Here are a few stories we’re following:

Hugging Face: "Imagine you are a developer in a large investment firm tasked with monitoring economic news sentiment toward companies in your investment portfolio. Until recently, you had two main options:

You could fine-tune your own model. This requires writing annotation instructions, creating an annotation interface, recruiting (crowd) workers, introducing quality assurance measures to handle low-quality data, fine-tuning a model on this data, and deploying it.

Or you could send your data with instructions to an LLM API. You skip fine-tuning and deployment entirely, and you reduce the data analysis process to writing instructions (prompts), which you send to an “LLM annotator” behind an API. In this case, the LLM API is your final inference solution and you use the LLM's outputs directly for your analysis.

Although Option 2 is more expensive at inference time and requires you to send sensitive data to a third party, it is significantly easier to set up than Option 1 and, therefore, used by many developers.

In 2024, synthetic data provides a third option: combining the cost benefits of Option 1 with the ease-of-use of Option 2. Simply put, you can use an LLM (the “teacher”) to annotate a small sample of data for you, and then you fine-tune a smaller, more efficient LM (the “student”) on this data."

The result: fine-tuning a custom small language model on synthetic data costs around $2.70, compared to $3,061 for running GPT-4 directly on the real-world data, while emitting significantly less CO2 and offering faster inference.

Figure 1. Comparing approaches to model development.
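If you want to try the teacher-student recipe yourself, the sketch below shows its general shape, assuming an OpenAI-compatible chat API as the teacher and Hugging Face transformers for the student. The model names, label set, and tiny corpus are illustrative placeholders, not the setup from the Hugging Face post.

```python
# Minimal sketch of LLM-as-teacher annotation + small-model fine-tuning.
# Assumes: an OpenAI API key in the environment, plus the openai,
# transformers, and datasets packages. Models, labels, and the tiny
# corpus below are illustrative placeholders.
from openai import OpenAI
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["negative", "neutral", "positive"]
texts = [
    "Shares of the company slid 8% after weak guidance.",
    "The firm reported results broadly in line with expectations.",
    "Record quarterly revenue sent the stock to an all-time high.",
]

# 1) Teacher: ask a large LLM to annotate a small sample of data.
client = OpenAI()

def teacher_label(text: str) -> int:
    prompt = (f"Classify the sentiment of this financial headline as one of "
              f"{LABELS}. Reply with the label only.\n\n{text}")
    reply = client.chat.completions.create(
        model="gpt-4o",  # placeholder teacher model
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip().lower()
    return LABELS.index(reply) if reply in LABELS else LABELS.index("neutral")

labels = [teacher_label(t) for t in texts]

# 2) Student: fine-tune a small, cheap-to-serve model on the teacher's labels.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=len(LABELS))

ds = Dataset.from_dict({"text": texts, "label": labels}).map(
    lambda row: tokenizer(row["text"], truncation=True,
                          padding="max_length", max_length=64))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="student-sentiment",
                           num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=ds,
)
trainer.train()  # the fine-tuned student now handles inference locally
```

Once trained, the student runs on your own infrastructure, so no further sensitive text needs to leave it at inference time.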

Towards Data Science: "The study’s methodology does not account for the continuous influx of new, diverse data that characterizes real-world AI model training. This limitation may lead to an overestimation of model collapse in practical scenarios, where fresh data serves as a potential corrective mechanism against degradation."
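In code, that corrective mechanism amounts to accumulating fresh data across retraining generations rather than training each generation only on its predecessor's synthetic output. The sketch below is a hedged illustration of that mixing step; the data pools, per-generation budget, and synthetic cap are arbitrary assumptions, not the study's protocol.

```python
# Illustrative sketch: accumulate fresh real data across generations instead
# of replacing it with synthetic output alone. Ratios are arbitrary assumptions.
import random

def build_training_set(real_pool, synthetic_pool, generation,
                       fresh_per_gen=10_000, max_synth_fraction=0.5):
    """Assemble the training mix for one generation of retraining."""
    # Fresh real data accumulates: each generation adds another slice.
    fresh = real_pool[: fresh_per_gen * (generation + 1)]
    # Synthetic data is capped relative to the real data, so it never
    # dominates the mix and feeds back on itself unchecked.
    synth_budget = int(len(fresh) * max_synth_fraction / (1 - max_synth_fraction))
    synth = random.sample(synthetic_pool, min(synth_budget, len(synthetic_pool)))
    mix = fresh + synth
    random.shuffle(mix)
    return mix
```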

Business Insider: “Why is synthetic data better than raw public data? Raw data is just that: raw. It's often filled with holes, inconsistencies, and biases from the processes used to capture, label, and leverage it. Synthetic data, on the other hand, allows us to fill those gaps, expand into areas that can't be captured in the wild, and intentionally design the data needed for specific applications. This level of control, with humans in the loop designing and refining the data, is crucial for pushing GenAI to new heights in a responsible, transparent, and secure manner.”

United Nations University: “The Global South often faces a 'data deficit,' limited data availability due to factors like resource constraints, uneven resource distribution and representation, and underdeveloped data infrastructures. Synthetic data can help bridge these gaps by providing additional and diverse data for analysis and model training. This is especially critical for digital inclusion by building robust and fair AI-enabled systems to serve their populations in today’s data-driven governance and economy, where a substantive number of their vulnerable populations remain invisible or misrepresented in such models.”

Forrester: “Synthetic data providers are emerging to democratize AI training — and their solutions are not limited to computer vision systems. Synthetic data provider Gretel, for example, released the world’s largest open-source text-to-SQL synthetic dataset to assist developers in training their models via tabular data.”

The Economist: “It is clear that access to more data—whether culled from specialist sources, generated synthetically or provided by human experts—is key to maintaining rapid progress in AI. Like oilfields, the most accessible data reserves have been depleted. The challenge now is to find new ones—or sustainable alternatives.”

What We're Building

Here's what Gretel's Applied Science and Product teams have been cooking up:

Figure 2. Comparing Gretel's compound AI generation approach to synthesizing data vs. frontier models and human expert-curated data.

High-Quality Synthetics for Fine-Tuning LLMs: Gretel Navigator’s synthetic data outperformed OpenAI's GPT-4 by 25.6%, Llama3-70b by 48.1%, and human expert-curated data by 73.6%. Watch this video to learn more.

Navigator Fine-Tuning Now Generally Available: We’re thrilled to announce that fine-tuning capabilities have been added to Navigator, our privacy-preserving compound AI system. Developers can now safely design tailor-made data solutions for their AI projects.

Figure 3. A Walkthrough of Gretel's Synthetic Data Scoring System & New Privacy Risk Scores.

New Privacy Risk Scores for Synthetic Tabular Data: The lack of industry-wide metrics for evaluating privacy risks, combined with research showing the insufficiency of traditional de-identification methods, has led Gretel to pioneer a robust, standardized approach. Our new privacy risk scoring system sets a new standard for safe AI development. Learn more in our data quality evaluation and privacy protection report.
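For intuition about what such metrics measure, here is a generic, hedged sketch of one widely used tabular privacy check, distance to closest record (DCR), which flags synthetic rows that land suspiciously close to real training rows. It assumes purely numeric columns and an arbitrary threshold, and it is not Gretel's privacy risk scoring method; see the linked report for that.

```python
# Illustrative distance-to-closest-record (DCR) check for synthetic tabular
# data. A generic sketch for intuition only, not Gretel's Privacy Risk Score;
# it assumes numeric columns, and the 5% threshold is an arbitrary choice.
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def dcr_report(real: pd.DataFrame, synthetic: pd.DataFrame, quantile=0.05):
    """Share of synthetic rows closer to a real row than the real data's own
    5th-percentile nearest-neighbor distance (a rough memorization signal)."""
    scaler = StandardScaler().fit(real)
    real_scaled = scaler.transform(real)
    synth_scaled = scaler.transform(synthetic)

    # Baseline: how close real records sit to each other (excluding self-match).
    nn_real = NearestNeighbors(n_neighbors=2).fit(real_scaled)
    real_dists = nn_real.kneighbors(real_scaled)[0][:, 1]
    threshold = np.quantile(real_dists, quantile)

    # DCR: distance from each synthetic record to its closest real record.
    synth_dists = nn_real.kneighbors(synth_scaled, n_neighbors=1)[0][:, 0]
    at_risk = (synth_dists < threshold).mean()
    return {"dcr_threshold": float(threshold), "share_at_risk": float(at_risk)}
```

A high share_at_risk suggests the generator may be memorizing training records rather than modeling their distribution, which is exactly the kind of signal a standardized privacy score needs to surface.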

New Synthetic Multilingual Prompts Dataset: We’ve released a new open synthetic dataset featuring 1,250 prompts in 7 languages, generated using Gretel Navigator. This resource is designed to enhance LLM interactions and is inspired by the popular awesome-chatGPT-prompts collection.

Gretel Demo Day: For a deeper dive into our latest platform updates, watch our recent Gretel Demo Day, featuring insights from our Applied Science and Product teams.

Gretel in the Wild

LLMOps Micro Summit: Missed the event? We joined experts from Apple, Checkr, Galileo, and Predibase to discuss Small Language Models (SLMs), rapid inference, and building better models with synthetic data. Watch Gretel’s Head of Applied Science Maarten Van Segbroeck’s presentation, or check out the full summit playlist.

Upcoming Event: We’ll be at Big Data London on Sept 18-19, booth #Y650. Come say hello!

Other Sightings: We recently explored the rise of Small Language Models (SLMs) and the confusion around licensing, collaborated with Dataiku, hosted a workshop with Lambda Labs, presented at the OpenDP Community Meetup, shared our insights with WIRED on the future of open source, and were featured in Microsoft’s AI Digest and Forbes.


If you have questions or comments, join us in the Synthetic Data Discord community. For those looking to develop their synthetic data expertise, Gretel University is now live—your ultimate hub for mastering synthetic data, with insightful videos and curated expert resources.

Go forth and synthesize.

—The Gretel Team
