Welcome to the Gretel Epoch—your monthly roundup of synthetic data news, applied research, and business insights on generating high-quality data that’s private by design.
This cycle, we explore the power of synthetic code, look at new strategies for building privacy-compliant chatbots, and take time to reflect.
- Impact of Code in Pre-Training: Cohere’s research shows that adding just 10% synthetic code data led to a 9% boost in natural language reasoning and a 44% improvement in code performance compared with models trained on web-based data. (Technical Report)
- OLMoE-1B-7B: Efficient Mixture-of-Experts (MoE) Model: This new MoE model has a 7B-parameter architecture but activates only 1B parameters per token. The design keeps inference costs similar to TinyLlama 1B while training approximately 2x faster. Its dataset blend includes more code and math data, and by using only 1B parameters per input it stays competitive with much larger models like Llama2-13B and DeepSeekMoE-16B without their compute cost. (Technical Report)
- Synthesizing Inner Monologues: In a recent podcast interview, AI researcher Andrej Karpathy introduced the concept of 'silent collapse', a subtle failure mode in models that is often overlooked. Karpathy suggests training models to mimic human-like inner monologues using techniques like reflection to capture detailed reasoning traces. “A billion of those, and AGI is here.”
- New Release: GSM8k Reflection Dataset: We’re thrilled to release our new GSM8k Reflection Dataset in Gretel Open, one of the first publicly available resources for experimenting with reflection-based AI training. Designed to push the boundaries of model reflection and explainability, the dataset enables models to simulate human-like thought processes, boosting reasoning capabilities and minimizing hallucinations. In our internal evaluations, reflection outperformed standard inference in 84% of AI reasoning tasks. Want to start building smarter models? Explore the blog, demo, and reflection dataset.
- LinkedIn’s Top 50 U.S. Startups: We’re excited to share that Gretel was named one of LinkedIn’s Top 50 U.S. Startups! This award celebrates emerging companies that are driving industry transformation and attracting top talent. Here's the list. If you’re passionate about building the future of synthetic data, we’re hiring—join our team.
- Privacy-Compliant Chatbots: Build privacy-preserving synthetic datasets for Retrieval Augmented Generation (RAG) workflows with Gretel and Databricks. This approach is ideal for enhancing chatbots in regulated industries like finance—enabling safe data use. (Tutorial)
- AI data bottlenecks: In our latest article in InfoWorld, we discuss three AI bottlenecks -- privacy, scarcity, and bias -- the bricks of the modern 'data wall,' and how synthetic data is a sledgehammer.
- Fine-Tuning with Ollama & Gretel: Learn how to fine-tune Llama 3.1 with Gretel’s synthetic text-to-SQL dataset for better text-to-code generation. By running training on GPUs and leveraging the Unsloth repository, you can streamline fine-tuning and deployment on SQL datasets with Ollama. (demo + dataset)
- Other Sightings: We discussed the power of compound AI systems for generating high-quality synthetic code at the KDD NL2Code workshop in Barcelona. We also co-hosted a fireside chat with leaders from LinkedIn and Deutsche Bank on synthetic data’s role in the modern enterprise.
Master Privacy-First AI Development at Gretel University
Gretel University is a dedicated learning hub for mastering the design of privacy-first AI systems. Learn at your own pace, and discover the art and science of synthesizing safe data. We'll be adding new content regularly, but if you want to get started today, check out some of these lightning demos:
If you’re passionate about synthetic data or have questions about the Gretel Platform, join us in the Synthetic Data Discord community. Connect with 1,600+ developers, engineers, data scientists, and privacy advocates. Be the first to access new datasets, get exclusive tips from our expert staff, and collaborate on cool projects with the Gretel community.
Until next time -- move fast, but don't break things.
What about synthetic data that helps create synthetic code? What effect will synthetic data have on all aspects of LLMs?