The rise and fall of synthetic datasets and smaller language models

It’s Sunday morning, we have some time with the coffee, so let me tell you about our recent and surprising journey in synthetic data and small language models.

This post is prompted by the upcoming release of an instant, in-browser model called SmolLM-360M (link at the end).

The journey started as Loubna and Anton, two of the leads of the BigCode and StarCoder projects, were looking for a new topic to explore. Around that time Microsoft had released Phi-1, a small model (1.7B), trained roughly half on synthetic data, which showed very impressive code capabilities and was followed by Phi-1.5, which extended the approach to natural language.

Benchmark numbers were really impressive, but with the training dataset kept private, people claimed that Phi-1 was gaming the benchmarks and had maybe been trained on very similar examples. Really intrigued by the performance and the secrecy here, Loubna and Anton decided to explore creating a large synthetic dataset. This led to the release of Cosmopedia 1 in winter 2024: 25B tokens of synthetically created data, generated by the best open model at the time, Mixtral-8x7B, along with an associated model.


Phi-1 and Phi-1.5 benchmarks

Performance was fine but still fell somewhat short of what Phi-1 and Phi-1.5 were showing, so they decided to go deeper.

A first breakthrough came in the spring when they dived into the audience of the synthetic prompts they were using. Let me explain a bit. In synthetic data generation, you ask a language model to generate educational content on a topic of your choice. But educational content can span a very large spectrum of language and complexity, from content for toddlers up to graduate-level content for PhD students.

US educational system

Just like there is no need to teach PhD-level concepts to toddlers, many of the prompts they were first using produced educational content that was far too complex for the small model they were training. Focusing the model’s target audience on middle school helped tremendously on many benchmarks (apart from MMLU, which is typically PhD-level as well).
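To make the audience idea concrete, here is a minimal sketch of what an audience-targeted generation prompt could look like. The wording, topics and the idea of piping each prompt into a teacher model are illustrative assumptions, not the actual Cosmopedia prompts.

```python
# Minimal sketch of audience-targeted synthetic data prompts.
# The template and topics are illustrative, not the actual Cosmopedia prompts.

TEMPLATE = (
    "Write an educational piece about \"{topic}\" aimed at {audience}. "
    "Use vocabulary and examples appropriate for that audience and keep "
    "the explanation clear and engaging."
)

topics = ["photosynthesis", "fractions", "the water cycle"]

# Targeting middle-school readers instead of PhD students was the key change.
audience = "middle-school students"

prompts = [TEMPLATE.format(topic=topic, audience=audience) for topic in topics]

for prompt in prompts:
    print(prompt)
    # Each prompt would then be sent to a strong teacher model
    # (Mixtral-8x7B in the post) to generate the synthetic text.
```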

A second and much more bittersweet discovery came later that month, as we were staying in a hotel in Lausanne, Switzerland, doing one of these “get-together-to-work-in-a-nice-place” off-sites we often have at Hugging Face. As a side project, to help Guilherme with the release of a large dataset that would become FineWeb, they explored using similar prompt-engineering techniques as before, but this time to filter a large chunk of the web, asking an LLM to rate the educational value of a webpage instead of writing content from scratch.
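As a rough sketch of that filtering idea (not the actual FineWeb-Edu pipeline), you can ask a judge model to grade every page on a small scale and keep only the pages above some threshold. The prompt wording, the 0-5 scale, the threshold and the `dummy_llm` stand-in below are assumptions.

```python
import re

RATING_PROMPT = (
    "Below is an extract from a web page. Rate how educational it is for a "
    "student on a scale from 0 (no educational value) to 5 (excellent, "
    "textbook-quality material). Answer with a single number.\n\n"
    "Extract:\n{page}\n\nScore:"
)

def score_page(page_text: str, llm) -> int:
    """Ask a judge model to grade the educational value of a web page (0-5)."""
    reply = llm(RATING_PROMPT.format(page=page_text[:2000]))  # truncate long pages
    match = re.search(r"\d", reply)
    return int(match.group()) if match else 0

def dummy_llm(prompt: str) -> str:
    # Stand-in for a real model call, so the sketch stays self-contained.
    return "4" if "Photosynthesis" in prompt else "1"

pages = [
    "Photosynthesis converts light energy into chemical energy ...",
    "BUY NOW!!! Limited offer on sneakers ...",
]

kept = [p for p in pages if score_page(p, dummy_llm) >= 3]  # keep educational pages
print(f"{len(kept)} of {len(pages)} pages kept")
```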

FineWeb and FineWeb-Edu datasets benchmarks

Using this heavily filtered web data sent performance straight up, surpassing all the other models of similar size, including Phi-1.5, on most benchmarks. This was bittersweet in that, while the performance was higher than we had ever seen, they had also spent so much time crafting synthetic data prompts only to discover that heavily filtering the web was still better, and much more diverse, with more than 1.3 trillion tokens available even when filtering heavily (in contrast to the difficulty of scaling the size and diversity of synthetic data).

Extending the same approach to code data, heavily filtering The Stack, the largest code dataset in the world, using prompts and language models also proved amazingly powerful, pushing the performance of a model that was stuck around 13% on HumanEval (a Python coding benchmark) to above 20% out of the box. Boom!

Filtering Python code by educational value
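For code, the same recipe mostly needs a different rubric: instead of grading prose, the judge model grades source files on how useful they are for learning. The prompt below is an illustrative sketch, not the actual prompt used to filter The Stack.

```python
# Illustrative rubric for grading source files; the score_page-style helper
# from the web-filtering sketch can be reused with this prompt instead.
CODE_RATING_PROMPT = (
    "Below is a Python file. Rate from 0 to 5 how useful it would be for "
    "teaching programming concepts: well-documented, self-contained code "
    "scores high; auto-generated or boilerplate code scores low. "
    "Answer with a single number.\n\nFile:\n{code}\n\nScore:"
)
```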

Is synthetic data still useful? Yes, but the web is so big and diverse that synthetic data really makes more sense for specific domains where the right data is lacking, say reasoning or math.

Now, right as they were excited by these new discoveries and results, they were joined by a new intern, Elie, who proved to be a great specialist in various training techniques, and they decided to push the experiments to the limit in terms of model size, going from 1.7B down to 360M and even 170M parameters, aka the sizes of the old GPT-1, BERT and GPT-2, to see how small a model could be while still keeping good performance.

One of the recipes for this good performance proved to be simply training for longer and longer, ignoring the usual wisdom that dictated you should avoid training smaller models for too long. Right now even these very small models end up being trained on multiple trillions of tokens, just like their larger counterparts. Another element of the recipe they discovered was to anneal the data, which means keeping a special set of high-quality training data for the last part of the training.
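Here is a minimal sketch of what such a two-phase data schedule could look like. The step counts, mixture names and proportions are made up for illustration and are not the actual SmolLM training mixture.

```python
import random

# Two-phase data schedule: train on the broad mixture for most of the run,
# then "anneal" on a held-out, high-quality mixture for the final steps.
TOTAL_STEPS = 1_000_000
ANNEAL_START = int(0.9 * TOTAL_STEPS)  # last ~10% of training

main_mixture = {"filtered_web": 0.7, "filtered_code": 0.2, "synthetic": 0.1}
anneal_mixture = {"high_quality_math": 0.5, "high_quality_code": 0.5}

def pick_source(step: int) -> str:
    """Choose which dataset the next batch is sampled from."""
    mixture = anneal_mixture if step >= ANNEAL_START else main_mixture
    sources, weights = zip(*mixture.items())
    return random.choices(sources, weights=weights, k=1)[0]

for step in (0, ANNEAL_START - 1, ANNEAL_START, TOTAL_STEPS - 1):
    print(step, "->", pick_source(step))
```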

This led to training, last week, a 360M model (more than 1000 times smaller than current frontier models like Llama-3.1-405B) which showed amazing performance on the benchmarks, beating all models under 500M parameters and even some larger ones.

So what’s next? Alignment and benchmarks


SmolLM models are starting to conquer the world

Given their small size, these models are still struggling to answer very complex or graduate-level math/code questions. That’s perfectly fine, because you don’t really need a model that can solve math olympiads in your daily life.

But one problem is that our evals usually contain a mix of complex and simpler questions, which adds noise to how we evaluate these small models.

Another problem is alignment, i.e. how to fine-tune these models to follow instructions. We’ve been developing datasets and techniques which work really well for larger models (SFT, DPO, PPO, etc.), but if you try the “Instant Smol” demo you’ll see that the aligned Smol models are still lacking on this aspect. This likely comes from the alignment datasets for LLMs, which contain many concepts too complex for small models (math, reasoning, etc.) and lack the simpler tasks they are well suited for (grammar correction, translation, etc.).
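One way to act on that observation, sketched below, would be to filter an instruction-tuning set toward tasks a small model can realistically handle: short prompts, no heavy math or multi-step reasoning. The keyword heuristic and the toy examples are illustrative assumptions, not the actual SmolLM alignment recipe.

```python
# Keep only "small-model-friendly" instructions: short prompts without heavy
# math or algorithmic content, favoring tasks like grammar fixes or rewriting.
COMPLEX_MARKERS = (
    "prove", "integral", "derivative", "theorem",
    "time complexity", "dynamic programming",
)

def is_simple_task(example: dict) -> bool:
    prompt = example["prompt"].lower()
    short_enough = len(prompt.split()) < 120
    no_complex_topic = not any(marker in prompt for marker in COMPLEX_MARKERS)
    return short_enough and no_complex_topic

sft_data = [
    {"prompt": "Fix the grammar in: 'she go to school yesterday'", "response": "..."},
    {"prompt": "Prove that the integral of x^2 from 0 to 1 equals 1/3", "response": "..."},
]

simple_sft_data = [ex for ex in sft_data if is_simple_task(ex)]
print(len(simple_sft_data), "example(s) kept")  # -> 1
```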

So what’s next for SmolLM? It’s going to be a really exciting year for them. A 360M-parameter model is basically 360 MB in size, which is tiny by today’s web standards (much smaller than many videos), and it gives basically instantaneous responses (50-70 tok/s in the browser) since it runs locally. With the knowledge around these models being progressively uncovered, I can see them being used more and more, everywhere, locally, with private data that doesn’t leave your computer, with instant responses, with small size and more energy efficiency versus larger models.
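The demo runs in the browser, but the same checkpoint can also be tried locally in a few lines of Python with `transformers`; the model id below is assumed to match the public SmolLM checkpoints on the Hub, and the prompt is just an example.

```python
from transformers import pipeline

# Run a ~360M-parameter model fully locally; the weights are only a few hundred MB.
# "HuggingFaceTB/SmolLM-360M" is assumed to be the base checkpoint discussed here.
generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM-360M")

prompt = "A synthetic dataset is"
output = generator(prompt, max_new_tokens=60, do_sample=False)
print(output[0]["generated_text"])
```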

An exciting year for Smol LMs!
