The rise and fall of synthetic datasets and smaller language models

It’s Sunday morning, we have some time with the coffee, so let me tell you about our recent and surprising journey in synthetic data and small language models.

This post is prompted by the upcoming release of an instant, in-browser model called SmolLM-360M (link at the end).

The journey started as Loubna and Anton, two of the leads of the BigCode and StarCoder projects, were looking for a new topic to explore. Around that time Microsoft had released Phi-1, a small model (1.7B), trained roughly half on synthetic data, which showed very impressive code capabilities and was followed by Phi-1.5, which extended the approach to natural language.

Benchmark numbers were really impressive, but with the training dataset kept private, people claimed that Phi-1 was gaming the benchmarks and had maybe been trained on very similar examples. Really intrigued by the performance and the secrecy here, Loubna and Anton decided to explore creating a large synthetic dataset. This led to the release of Cosmopedia 1 in winter 2024: 25B tokens of synthetically created data, generated by the best open model at the time, Mixtral-8x7B, along with an associated model.


Phi-1 and Phi-1.5 benchmarks

Performance was fine but still fell somewhat short of what Phi-1 and Phi-1.5 were showing, so they decided to go deeper.

A first breakthrough came in the spring when they dived into the audience of the synthetic prompts they were using. Let me explain a bit. In synthetic data generation, you ask a language model to generate educational content on a topic of your choice. But educational content can span a very large spectrum of language and complexity, from content for toddlers up to graduate-level content for PhD students.

US educational system

Just like there is no need to teach PhD-level concepts to toddlers, many of the prompts they were first using produced educational content that was far too complex for the small model they were training. Focusing the model’s target audience on middle school helped tremendously on many benchmarks (apart from MMLU, which is typically PhD-level as well).
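To make the audience idea concrete, here is a minimal sketch of what an audience-targeted generation prompt could look like. The wording, topics and the idea of piping each prompt into a teacher model are illustrative assumptions, not the actual Cosmopedia prompts.

```python
# Minimal sketch of audience-targeted synthetic data prompts.
# The template and topics are illustrative, not the actual Cosmopedia prompts.

TEMPLATE = (
    "Write an educational piece about \"{topic}\" aimed at {audience}. "
    "Use vocabulary and examples appropriate for that audience and keep "
    "the explanation clear and engaging."
)

topics = ["photosynthesis", "fractions", "the water cycle"]

# Targeting middle-school readers instead of PhD students was the key change.
audience = "middle-school students"

prompts = [TEMPLATE.format(topic=topic, audience=audience) for topic in topics]

for prompt in prompts:
    print(prompt)
    # Each prompt would then be sent to a strong teacher model
    # (Mixtral-8x7B in the post) to generate the synthetic text.
```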

A second and much more bittersweet discovery came later that month, as we were staying in a hotel in Lausanne, Switzerland, doing one of these “get-together-to-work-in-a-nice-place” off-sites we often have at Hugging Face. As a side project, to help Guilherme with the release of a large dataset that would become FineWeb, they explored using similar prompt-engineering techniques as before, but this time to filter a large chunk of the web, asking an LLM to rate the educational value of a webpage instead of writing content from scratch.
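As a rough sketch of that filtering idea (not the actual FineWeb-Edu pipeline), you can ask a judge model to grade every page on a small scale and keep only the pages above some threshold. The prompt wording, the 0-5 scale, the threshold and the `dummy_llm` stand-in below are assumptions.

```python
import re

RATING_PROMPT = (
    "Below is an extract from a web page. Rate how educational it is for a "
    "student on a scale from 0 (no educational value) to 5 (excellent, "
    "textbook-quality material). Answer with a single number.\n\n"
    "Extract:\n{page}\n\nScore:"
)

def score_page(page_text: str, llm) -> int:
    """Ask a judge model to grade the educational value of a web page (0-5)."""
    reply = llm(RATING_PROMPT.format(page=page_text[:2000]))  # truncate long pages
    match = re.search(r"\d", reply)
    return int(match.group()) if match else 0

def dummy_llm(prompt: str) -> str:
    # Stand-in for a real model call, so the sketch stays self-contained.
    return "4" if "Photosynthesis" in prompt else "1"

pages = [
    "Photosynthesis converts light energy into chemical energy ...",
    "BUY NOW!!! Limited offer on sneakers ...",
]

kept = [p for p in pages if score_page(p, dummy_llm) >= 3]  # keep educational pages
print(f"{len(kept)} of {len(pages)} pages kept")
```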

FineWeb and FineWeb-Edu datasets benchmarks

Using this heavily filtered web data sent performance straight up, surpassing all the other models of similar size, including Phi-1.5, on most benchmarks. This was bittersweet in that, while the performance was higher than we had ever seen, they had also spent so much time crafting synthetic data prompts only to discover that heavily filtering the web was still better, and much more diverse, with more than 1.3 trillion tokens available even when filtering heavily (in contrast to the difficulty of scaling the size and diversity of synthetic data).

Extending the same approach to code data, heavily filtering The Stack, the largest code dataset in the world, using prompts and language models also proved amazingly powerful, pushing the performance of a model that was stuck around 13% on HumanEval (a Python coding benchmark) to above 20% out of the box. Boom!

Filtering Python code by educational value
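For code, the same recipe mostly needs a different rubric: instead of grading prose, the judge model grades source files on how useful they are for learning. The prompt below is an illustrative sketch, not the actual prompt used to filter The Stack.

```python
# Illustrative rubric for grading source files; the score_page-style helper
# from the web-filtering sketch can be reused with this prompt instead.
CODE_RATING_PROMPT = (
    "Below is a Python file. Rate from 0 to 5 how useful it would be for "
    "teaching programming concepts: well-documented, self-contained code "
    "scores high; auto-generated or boilerplate code scores low. "
    "Answer with a single number.\n\nFile:\n{code}\n\nScore:"
)
```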

Is synthetic data still useful? Yes, but the web is so big and diverse that synthetic data really makes more sense for specific domains where the right data is lacking, say reasoning or math.

Now, right as they were excited by these new discoveries and results, they were joined by a new intern, Elie, who proved to be a great specialist in various training techniques, and they decided to push the experiments to the limit in terms of model size, going from 1.7B down to 360M and even 170M parameters, aka the sizes of the old GPT-1, BERT and GPT-2, to see how small a model could be while still keeping good performance.

One of the recipes for this good performance proved to be simply training for longer and longer, ignoring the usual wisdom that dictated you should avoid training smaller models for too long. Right now even these very small models end up being trained on multiple trillions of tokens, just like their larger counterparts. Another element of the recipe they discovered was to anneal the data, which means keeping a special set of high-quality training data for the last part of the training.
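Here is a minimal sketch of what such a two-phase data schedule could look like. The step counts, mixture names and proportions are made up for illustration and are not the actual SmolLM training mixture.

```python
import random

# Two-phase data schedule: train on the broad mixture for most of the run,
# then "anneal" on a held-out, high-quality mixture for the final steps.
TOTAL_STEPS = 1_000_000
ANNEAL_START = int(0.9 * TOTAL_STEPS)  # last ~10% of training

main_mixture = {"filtered_web": 0.7, "filtered_code": 0.2, "synthetic": 0.1}
anneal_mixture = {"high_quality_math": 0.5, "high_quality_code": 0.5}

def pick_source(step: int) -> str:
    """Choose which dataset the next batch is sampled from."""
    mixture = anneal_mixture if step >= ANNEAL_START else main_mixture
    sources, weights = zip(*mixture.items())
    return random.choices(sources, weights=weights, k=1)[0]

for step in (0, ANNEAL_START - 1, ANNEAL_START, TOTAL_STEPS - 1):
    print(step, "->", pick_source(step))
```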

This led to training, last week, a 360M model (more than 1000 times smaller than current frontier models like Llama-3.1-405B) which showed amazing performance on the benchmarks, beating all models under 500M parameters and even some larger ones.

So what’s next? Alignment and benchmarks


SmolLM models are starting to conquer the world

Given their small size, these models are still struggling to answer very complex or graduate-level math/code questions. That’s perfectly fine, because you don’t really need a model that can solve math olympiads in your daily life.

But one problem is that our evals usually contain a mix of complex and simpler questions, which adds noise to how we evaluate these small models.

Another problem is alignment, i.e. how to fine-tune these models to follow instructions. We’ve been developing datasets and techniques which work really well for larger models (SFT, DPO, PPO, etc.), but if you try the “Instant Smol” demo you’ll see that the aligned Smol models are still lacking on this aspect. This likely comes from the alignment datasets for LLMs, which contain many concepts too complex for small models (math, reasoning, etc.) and lack the simpler tasks they are well suited for (grammar correction, translation, etc.).
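One way to act on that observation, sketched below, would be to filter an instruction-tuning set toward tasks a small model can realistically handle: short prompts, no heavy math or multi-step reasoning. The keyword heuristic and the toy examples are illustrative assumptions, not the actual SmolLM alignment recipe.

```python
# Keep only "small-model-friendly" instructions: short prompts without heavy
# math or algorithmic content, favoring tasks like grammar fixes or rewriting.
COMPLEX_MARKERS = (
    "prove", "integral", "derivative", "theorem",
    "time complexity", "dynamic programming",
)

def is_simple_task(example: dict) -> bool:
    prompt = example["prompt"].lower()
    short_enough = len(prompt.split()) < 120
    no_complex_topic = not any(marker in prompt for marker in COMPLEX_MARKERS)
    return short_enough and no_complex_topic

sft_data = [
    {"prompt": "Fix the grammar in: 'she go to school yesterday'", "response": "..."},
    {"prompt": "Prove that the integral of x^2 from 0 to 1 equals 1/3", "response": "..."},
]

simple_sft_data = [ex for ex in sft_data if is_simple_task(ex)]
print(len(simple_sft_data), "example(s) kept")  # -> 1
```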

So what’s next for SmolLM? It’s going to be a really exciting year for them. A 360M-parameter model is basically 360 MB in size, which is tiny by today’s web standards (much smaller than many videos), and it gives basically instantaneous responses (50-70 tok/s in the browser) since it runs locally. With the knowledge around these models being progressively uncovered, I can see them being used more and more, everywhere, locally, with private data that doesn’t leave your computer, with instant responses, with small size and more energy efficiency versus larger models.
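The demo runs in the browser, but the same checkpoint can also be tried locally in a few lines of Python with `transformers`; the model id below is assumed to match the public SmolLM checkpoints on the Hub, and the prompt is just an example.

```python
from transformers import pipeline

# Run a ~360M-parameter model fully locally; the weights are only a few hundred MB.
# "HuggingFaceTB/SmolLM-360M" is assumed to be the base checkpoint discussed here.
generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM-360M")

prompt = "A synthetic dataset is"
output = generator(prompt, max_new_tokens=60, do_sample=False)
print(output[0]["generated_text"])
```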

An exciting year for Smol LMs!
