New synthetic Text-to-SQL dataset, LLM training workshop, and more
Hello and welcome to the Gretel Epoch, a roundup of synthetic data developments, community highlights, and privacy-first generative AI insights from our team.
What We’re Building
Fine-Tuning CodeLlama with Synthetic Data on Amazon SageMaker. After releasing our state-of-the-art synthetic Text-to-SQL dataset, we fine-tuned the CodeLlama-7B and CodeLlama-13B models on it, resulting in enhanced SQL query generation.
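For readers curious what this looks like in practice, here is a minimal sketch of parameter-efficient fine-tuning of CodeLlama-7B with LoRA using the Hugging Face transformers and peft libraries. This is illustrative only; the actual SageMaker training configuration in our post may differ.

```python
# Minimal sketch: LoRA fine-tuning setup for CodeLlama-7B (illustrative,
# not the exact configuration used in the SageMaker post).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# LoRA freezes the base weights and trains small rank-decomposition matrices,
# so a 7B model can be tuned with a fraction of the memory of full fine-tuning.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```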
v1.1 Release of Synthetic Text-to-SQL Dataset. We just released v1.1 of our synthetic Text-to-SQL dataset, the largest open-source dataset of its kind, designed to accelerate AI model training. This update addresses several suggestions from developers working with the dataset.
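If you want to explore v1.1 yourself, here is a minimal sketch of loading it with the Hugging Face datasets library. The hub ID gretelai/synthetic_text_to_sql is the one we publish under; check the dataset card for the authoritative list of fields.

```python
# Minimal sketch: load the synthetic Text-to-SQL dataset from Hugging Face.
from datasets import load_dataset

ds = load_dataset("gretelai/synthetic_text_to_sql", split="train")
print(ds.column_names)  # inspect the available fields
print(ds[0])            # one natural-language prompt / SQL record
```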
Gretel in the Wild
This week, we hosted an AI developer workshop with our friends at Predibase, demonstrating how synthetic data and efficient fine-tuning techniques can be leveraged to build high-performing LLMs quickly and cost-effectively. The recording is now available on our website.
Gretel’s co-founder and CEO, Ali Golshan, spoke with Fast Company about the growing need for expert data in training more specialized LLMs: “We built really great general-purpose machines that talk like humans, but just like humans [who] are not experts, they’re generalists. Now, what we’re saying is that these general-purpose machines need to become experts.” But “expert” training data is usually not public; it’s proprietary, held close by corporations. Gretel’s platform can be used to anonymize such data for use in training models.
We also sat down with the Washington Post to discuss AI’s current state: “If you compare a mature market to a mature tree, we’re just at the trunk. We’re at the genesis stage of AI.”
Other sightings: Gretel recently hosted a workshop with Google on automatically anonymizing sensitive customer data in financial documents, sat down for a fireside chat at Moonshot Capital’s Extraordinary Leadership Summit, and participated in an RSA Conference panel on how synthetic data can support security teams.
Upcoming events: The Gretel team is heading to Microsoft Build. Meet us at the Startup Hub to see how you can power your ML use cases with Gretel and the Azure family. If we don’t see you there, you can also connect with us at the Databricks Data & AI Summit.
What We’re Reading
Microsoft released a new Phi-3 model series. Its developers focused entirely on data quality and the data-optimal regime. The result? A 3.8B model that outperforms Llama 3 8B and the much larger Mixtral 8x7B MoE. From the Phi-3 technical report: “The innovation lies entirely in our dataset for training.”
Andrew Ng on agentic workflows and synthetic data: “A significant barrier to using agentic workflows to produce LLM training data is the cost of generating tokens. Say we want to generate 1 trillion tokens to extend a pre-existing dataset. At current retail prices, 1 trillion tokens from GPT-4-turbo ($30 per million output tokens), Claude 3 Opus ($75), Gemini 1.5 Pro ($21), and Llama-3-70B on Groq ($0.79) would cost, respectively, $30M, $75M, $21M, and $790K. Of course, an agentic workflow would require generating more than one token per final output token. But budgets for training cutting-edge LLMs easily surpass $100M, so spending a few million dollars more for data to boost performance is feasible. That’s why agentic workflows might open up new opportunities for high-quality synthetic data generation.”
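The quoted figures are easy to verify. This trivial back-of-the-envelope script reproduces them from the per-million-token prices Ng lists:

```python
# Back-of-the-envelope check of the per-model costs quoted above:
# price (USD per 1M output tokens) scaled to 1 trillion tokens.
PRICES_PER_MILLION = {
    "GPT-4-turbo": 30.00,
    "Claude 3 Opus": 75.00,
    "Gemini 1.5 Pro": 21.00,
    "Llama-3-70B on Groq": 0.79,
}
TOKENS = 1_000_000_000_000  # 1 trillion

for model, price in PRICES_PER_MILLION.items():
    cost = TOKENS / 1_000_000 * price
    print(f"{model}: ${cost:,.0f}")
# Prints $30,000,000, $75,000,000, $21,000,000, and $790,000 respectively,
# matching the $30M / $75M / $21M / $790K figures in the quote.
```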
A new paper from Google DeepMind and Stanford University researchers underscores the growing importance of synthetic data in AI development. It highlights how synthetic data is used across domains to scale systems, improve quality and diversity, and reduce data acquisition costs.
Anthropic’s Jack Clark on the important role of synthetic data in frontier AI development: "It's no longer a question of 'if' you should use synthetic data, but rather 'how much?'"
Thanks for reading. If you have questions or comments, join us in the Synthetic Data Discord community.
Gretel