Learnings from FineWeb
The fine folks @huggingface recently published their guide to building FineWeb, a fully open-source training dataset for LLMs. It makes for a fun and educational read. Here's what I found interesting:
Extracting from Common Crawl
Should you use the text that Common Crawl has already extracted by default (the WET files), or take the raw WARC data and extract the text yourself with trafilatura? It turns out that doing it yourself, while more expensive, gives you better performance (the red line in their chart).
"While the resulting dataset is about 25% larger for the WET data (around 254 billion tokens), it proves to be of much worse quality than the one that used trafilatura to extract text from WARC files (which is around 200 billion tokens). Visual inspection of some samples confirms that many of these additional tokens on the WET files are unnecessary page boilerplate."
Removing duplicates
Why is removing duplicates in the training dataset important? Better performance:
"Removing these duplicates (deduplicating) has been correlated with improvements in model performance and a reduction in memorization of pretraining data, which might allow for better generalization. Additionally, the performance uplift obtained through deduplication can be equated to increased training efficiency: by removing duplicated content, a model can reach the same performance level with fewer training iterations – or equivalently, for a given number of training tokens, a model will have seen more diverse data"
But don't dedup too much!
"Initially, we were operating under the assumption that more deduplication is always better, so our first approach was to take the entire dataset (all 90+ dumps) and deduplicate them together as one big dataset using MinHash." "Deduplicating the dataset in this manner resulted in 4 trillion tokens of data, but, quite surprisingly to us, when training on a randomly sampled 350 billion tokens subset, our ablation models showed next to no improvement over a model trained on the non deduplicated data, scoring far below its predecessor RefinedWeb on our aggregate of tasks (see graph below)."
Below is a great example of how LLM training is still more "find out through empirical exploration" than "I know what will happen because theory proves it", which sits oddly with how confidently everyone talks about "scaling laws" these days.
"These results show that, for this older dump taken in isolation, the data that was kept (10% of the original data) was actually worse than the 90% of data we removed 12 . This is also confirmed by visual inspection: originally kept data contains far more ads, lists of keywords and generally badly formatted text than originally removed data."
FineWeb trained models versus other pretraining datasets
So how does a 1.8B-parameter model trained on 350B tokens of FineWeb compare with the same model trained on 350B tokens from other pretraining datasets?
FineWeb it is!
Using models to generate training data
The team went further by using llama-3-70b-instruct to annotate 500k samples from FineWeb, scoring each for its educational quality on a scale from 0 to 5. Similar techniques were used in the Llama 3 paper and for Microsoft's Phi-3. We wrote about this approach too in our recent post on data acquisition strategies for AI startups:
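For a flavour of what this kind of annotation looks like in practice, here's a rough sketch that asks an instruct model to grade a sample's educational value from 0 to 5. The prompt wording and the use of huggingface_hub's InferenceClient are my assumptions, not the team's exact setup:

```python
# Rough sketch of LLM-based quality annotation in the spirit of FineWeb-Edu.
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3-70B-Instruct")

def score_educational_quality(sample: str) -> str:
    """Ask the model for a 0-5 educational-value score for one web sample."""
    prompt = (
        "Rate the educational value of the following web page extract "
        "on a scale from 0 (none) to 5 (excellent). Reply with the score only.\n\n"
        f"Extract:\n{sample[:3000]}"
    )
    response = client.chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=5,
    )
    return response.choices[0].message.content.strip()
```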
Findings:
CommonCrawl vintages
How useful are different vintages of Common Crawl dumps for model performance? Like wine, some years are better than others. But why the inflection in performance from 2023 onwards? Could it be due to LLM-generated synthetic data populating the web?
But how can you detect synthetic data on the web?
"there is no foolproof method to detect synthetic data, we opted to use a proxy metric: we measured the frequency of the following words in each crawl: "delve", "as a large language model", "it's important to note", "rich tapestry", "intertwined", "certainly!", "dive into", all of which are commonly used by ChatGPT.
Hard to be sure, but it looks like more synthetic data contamination in pretraining datasets doesn't harm the performance of resulting models - it might improve them:
Full post here: https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1