Learnings from FineWeb
The fine folks @huggingface recently published their guide to building FineWeb, a fully open-source training dataset for LLMs. It makes for a fun and educational read. Here's what I found interesting:
Extracting from Common Crawl
Should you use the text that Common Crawl has already extracted by default (the WET files), or take the raw WARC data and extract the text yourself with trafilatura? It turns out that doing it yourself, while more expensive, gives you better performance (the red line in their chart).
"While the resulting dataset is about 25% larger for the WET data (around 254 billion tokens), it proves to be of much worse quality than the one that used trafilatura to extract text from WARC files (which is around 200 billion tokens). Visual inspection of some samples confirms that many of these additional tokens on the WET files are unnecessary page boilerplate."
Removing duplicates
Why is removing duplicates in the training dataset important? Better performance:
"Removing these duplicates (deduplicating) has been correlated with improvements in model performance and a reduction in memorization of pretraining data, which might allow for better generalization. Additionally, the performance uplift obtained through deduplication can be equated to increased training efficiency: by removing duplicated content, a model can reach the same performance level with fewer training iterations – or equivalently, for a given number of training tokens, a model will have seen more diverse data"
But don't dedup too much!
"Initially, we were operating under the assumption that more deduplication is always better, so our first approach was to take the entire dataset (all 90+ dumps) and deduplicate them together as one big dataset using MinHash." "Deduplicating the dataset in this manner resulted in 4 trillion tokens of data, but, quite surprisingly to us, when training on a randomly sampled 350 billion tokens subset, our ablation models showed next to no improvement over a model trained on the non deduplicated data, scoring far below its predecessor RefinedWeb on our aggregate of tasks (see graph below)."
Below is a great example of how LLM training is still more "find out through empirical exploration" than "I know what will happen because theory proves it", which sits oddly with how confidently everyone talks about "scaling laws" these days.
"These results show that, for this older dump taken in isolation, the data that was kept (10% of the original data) was actually worse than the 90% of data we removed 12 . This is also confirmed by visual inspection: originally kept data contains far more ads, lists of keywords and generally badly formatted text than originally removed data."
FineWeb trained models versus other pretraining datasets
So how does a 1.8B-parameter model trained on 350B tokens of FineWeb compare with the same model trained on 350B tokens from other pretraining datasets?
FineWeb it is!
Using models to generate training data
The team went further by using llama-3-70b-instruct to annotate 500k samples from FineWeb, scoring each for its educational quality on a scale from 0 to 5. Similar techniques were used in the Llama 3 paper and for Microsoft's Phi-3. We wrote about this approach too in our recent post on data acquisition strategies for AI startups:
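For a flavour of what this kind of annotation looks like in practice, here's a rough sketch that asks an instruct model to grade a sample's educational value from 0 to 5. The prompt wording and the use of huggingface_hub's InferenceClient are my assumptions, not the team's exact setup:

```python
# Rough sketch of LLM-based quality annotation in the spirit of FineWeb-Edu.
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3-70B-Instruct")

def score_educational_quality(sample: str) -> str:
    """Ask the model for a 0-5 educational-value score for one web sample."""
    prompt = (
        "Rate the educational value of the following web page extract "
        "on a scale from 0 (none) to 5 (excellent). Reply with the score only.\n\n"
        f"Extract:\n{sample[:3000]}"
    )
    response = client.chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=5,
    )
    return response.choices[0].message.content.strip()
```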
Findings:
CommonCrawl vintages
How useful are different vintages of Common Crawl dumps for model performance? Like wine, some years are better than others. But why the inflection in performance from 2023 onwards? Could it be due to LLM-generated synthetic data populating the web?
But how can you detect synthetic data on the web?
"there is no foolproof method to detect synthetic data, we opted to use a proxy metric: we measured the frequency of the following words in each crawl: "delve", "as a large language model", "it's important to note", "rich tapestry", "intertwined", "certainly!", "dive into", all of which are commonly used by ChatGPT.
Hard to be sure, but it looks like more synthetic data contamination in pretraining datasets doesn't harm the performance of resulting models - it might improve them:
Full post here: https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1