New synthetic Text-to-SQL dataset, LLM training workshop, and more
Hello and welcome to the Gretel Epoch, a roundup of synthetic data developments, community highlights, and privacy-first generative AI insights from our team.
What We’re Building
Fine-Tuning CodeLlama with Synthetic Data on Amazon SageMaker. After releasing our state-of-the-art synthetic Text-to-SQL dataset, we fine-tuned the CodeLlama-7B and CodeLlama-13B models on it, resulting in enhanced SQL query generation.
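For readers curious what this looks like in practice, here is a minimal sketch of parameter-efficient fine-tuning of CodeLlama-7B with LoRA using the Hugging Face transformers and peft libraries. This is illustrative only; the actual SageMaker training configuration in our post may differ.

```python
# Minimal sketch: LoRA fine-tuning setup for CodeLlama-7B (illustrative,
# not the exact configuration used in the SageMaker post).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# LoRA freezes the base weights and trains small rank-decomposition matrices,
# so a 7B model can be tuned with a fraction of the memory of full fine-tuning.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```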
v1.1 Release of Synthetic Text-to-SQL Dataset. We just released v1.1 of our synthetic Text-to-SQL dataset, the largest open-source dataset of its kind, designed to accelerate AI model training. This update addresses several suggestions from developers working with the dataset.
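If you want to explore v1.1 yourself, here is a minimal sketch of loading it with the Hugging Face datasets library. The hub ID gretelai/synthetic_text_to_sql is the one we publish under; check the dataset card for the authoritative list of fields.

```python
# Minimal sketch: load the synthetic Text-to-SQL dataset from Hugging Face.
from datasets import load_dataset

ds = load_dataset("gretelai/synthetic_text_to_sql", split="train")
print(ds.column_names)  # inspect the available fields
print(ds[0])            # one natural-language prompt / SQL record
```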
Gretel in the Wild
This week, we hosted an AI developer workshop with our friends at Predibase, demonstrating how synthetic data and efficient fine-tuning techniques can be leveraged to build high-performing LLMs quickly and cost-effectively. The recording is now available on our website.
Gretel’s co-founder and CEO, Ali Golshan, spoke with Fast Company about the growing need for expert data in training more specialized LLMs: “We built really great general-purpose machines that talk like humans, but just like humans [who] are not experts, they’re generalists. Now, what we’re saying is that these general-purpose machines need to become experts.” But “expert” training data is usually not public; it’s proprietary, held close by corporations. Gretel’s platform can be used to anonymize such data for use in training models.
We also sat down with the Washington Post to discuss AI’s current state: “If you compare a mature market to a mature tree, we’re just at the trunk. We’re at the genesis stage of AI.”
Other sightings: Gretel recently hosted a workshop with Google on automatically anonymizing sensitive customer data in financial documents, sat down for a fireside chat at Moonshot Capital’s Extraordinary Leadership Summit, and participated in an RSA Conference panel on how synthetic data can support security teams.
Upcoming events: The Gretel team is heading to Microsoft Build. Meet us at the Startup Hub to see how you can power your ML use cases with Gretel and the Azure family. If we don’t see you there, you can also connect with us at the Databricks Data & AI Summit.
What We’re Reading
Microsoft released a new Phi-3 model series. Its developers focused entirely on data quality and the data-optimal regime. The result? A 3.8B model that outperforms Llama 3 8B and the much larger Mixtral 8x7B MoE. From the Phi-3 technical report: “The innovation lies entirely in our dataset for training.”
Andrew Ng on agentic workflows and synthetic data: “A significant barrier to using agentic workflows to produce LLM training data is the cost of generating tokens. Say we want to generate 1 trillion tokens to extend a pre-existing dataset. At current retail prices, 1 trillion tokens from GPT-4-turbo ($30 per million output tokens), Claude 3 Opus ($75), Gemini 1.5 Pro ($21), and Llama-3-70B on Groq ($0.79) would cost, respectively, $30M, $75M, $21M, and $790K. Of course, an agentic workflow would require generating more than one token per final output token. But budgets for training cutting-edge LLMs easily surpass $100M, so spending a few million dollars more for data to boost performance is feasible. That’s why agentic workflows might open up new opportunities for high-quality synthetic data generation.”
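The quoted figures are easy to verify. This trivial back-of-the-envelope script reproduces them from the per-million-token prices Ng lists:

```python
# Back-of-the-envelope check of the per-model costs quoted above:
# price (USD per 1M output tokens) scaled to 1 trillion tokens.
PRICES_PER_MILLION = {
    "GPT-4-turbo": 30.00,
    "Claude 3 Opus": 75.00,
    "Gemini 1.5 Pro": 21.00,
    "Llama-3-70B on Groq": 0.79,
}
TOKENS = 1_000_000_000_000  # 1 trillion

for model, price in PRICES_PER_MILLION.items():
    cost = TOKENS / 1_000_000 * price
    print(f"{model}: ${cost:,.0f}")
# Prints $30,000,000, $75,000,000, $21,000,000, and $790,000 respectively,
# matching the $30M / $75M / $21M / $790K figures in the quote.
```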
A new paper from Google DeepMind and Stanford University researchers underscores the growing importance of synthetic data in AI development. It highlights how synthetic data is used across domains to scale systems, improve quality and diversity, and reduce data acquisition costs.
Anthropic’s Jack Clark on the important role of synthetic data in frontier AI development: "It's no longer a question of 'if' you should use synthetic data, but rather 'how much?'"
Thanks for reading. If you have questions or comments, join us in the Synthetic Data Discord community.
Gretel