Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale
Credit: https://arxiv.org/pdf/2409.17115


Today's paper introduces PROX (Programming Every Example), a new framework for refining pre-training data for large language models using smaller language models. The method generates and executes fine-grained operations to improve data quality at scale, leading to significant performance gains across various benchmarks while reducing computational costs. This approach offers a promising path for more efficient and effective language model pre-training.

Method Overview

PROX works by treating data refinement as a programming task. It uses a small language model (with as few as 0.3B parameters) to generate programs that refine individual examples in the pre-training corpus.

The process involves two main stages:

  1. Document-level programming: The model decides whether to keep or discard entire documents.
  2. Chunk-level programming: For retained documents, the model performs more fine-grained operations like removing specific lines or normalizing text.

These operations are represented as function calls (e.g., keep_doc(), remove_lines(), normalize()) which are then executed by a pre-defined executor to produce the refined corpus.
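To make this concrete, below is a minimal Python sketch of what such an executor might look like. The function names follow the operations described in the paper (keep_doc, remove_lines, normalize, plus a drop_doc signal), but the exact argument formats and the toy string-based program parser are illustrative assumptions, not the paper's implementation.

```python
import ast

def remove_lines(lines, start, end):
    """Drop the lines in the half-open range [start, end)."""
    return lines[:start] + lines[end:]

def normalize(lines, source, target):
    """Replace a noisy substring (e.g. boilerplate markup) in every line."""
    return [line.replace(source, target) for line in lines]

def execute_program(document: str, program: str):
    """Apply a model-generated refinement program to a single document.

    `program` is plain text containing one function call per line, e.g.:
        remove_lines(1, 2)
        normalize('!!!', '!')
    Returns the refined document, or None if the document is dropped.
    """
    lines = document.splitlines()
    for call in program.strip().splitlines():
        call = call.strip()
        if call.startswith("drop_doc"):
            return None                      # document-level: discard entirely
        if call.startswith("keep_doc"):
            continue                         # document-level: keep as-is
        if call.startswith("remove_lines"):
            args = ast.literal_eval(call[len("remove_lines"):])
            lines = remove_lines(lines, *args)
        elif call.startswith("normalize"):
            args = ast.literal_eval(call[len("normalize"):])
            lines = normalize(lines, *args)
    return "\n".join(lines)

# Hypothetical example: a program emitted by the small refining model.
doc = "Breaking news!!!\nclick here to subscribe\nUseful paragraph about physics."
prog = "remove_lines(1, 2)\nnormalize('!!!', '!')"
print(execute_program(doc, prog))
```

Because the model only emits short programs rather than rewritten text, the heavy lifting of editing the corpus stays in cheap, deterministic code like the executor above.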

To adapt the small language model for this task, PROX first uses a larger language model (like LLAMA-3) to generate annotated examples. These examples are then used to fine-tune the smaller model, enabling it to generate appropriate refinement programs for new examples.
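The sketch below illustrates how such annotated examples could be collected and stored as supervised fine-tuning data. The prompt wording, the generic `generate_fn` callable standing in for the larger model, and the JSONL record format are assumptions for illustration, not the paper's exact setup.

```python
import json

# Prompt wording is an illustrative assumption, not the paper's exact template.
ANNOTATION_PROMPT = """You are a data-refining expert. Given the document below,
write a refinement program using keep_doc(), drop_doc(),
remove_lines(start, end), and normalize(source, target).

Document:
{document}

Program:"""

def build_sft_records(sample_documents, generate_fn):
    """Annotate a small sample of documents with refinement programs using a
    strong model (e.g. a LLaMA-3-class model), accessed here through a generic
    `generate_fn(prompt) -> str` callable."""
    records = []
    for doc in sample_documents:
        program = generate_fn(ANNOTATION_PROMPT.format(document=doc))
        records.append({"prompt": doc, "completion": program})
    return records

def save_sft_dataset(records, path="prox_sft_data.jsonl"):
    """Write the (document, program) pairs as JSONL, ready for standard
    supervised fine-tuning of the small (~0.3B) refining model."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```

Once fine-tuned on these pairs, the small model can generate refinement programs for the full corpus at a fraction of the cost of querying the larger model directly.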

The refined corpus produced by PROX is then used to pre-train language models, resulting in improved performance across various downstream tasks.

Results

The paper demonstrates several key results:

  1. Models pre-trained on PROX-curated data outperform those trained on the original data, or on data filtered by other selection methods, by more than 2% across various downstream benchmarks.
  2. PROX is effective across different model sizes (from 0.3B to 1.7B parameters) and various pre-training corpora (including C4, RedPajama-V2, and FineWeb).
  3. In domain-specific continual pre-training (e.g., for mathematical tasks), PROX-refined data leads to significant improvements. For instance, it improves average accuracy by 7.6% for MISTRAL-7B, 14.6% for LLAMA-2-7B, and 20.3% for CODELLAMA-7B, all within 10B tokens of training.
  4. PROX significantly reduces the number of training FLOPs required to achieve comparable performance, offering a more efficient path for LLM pre-training.

Conclusion

PROX demonstrates that using small language models to refine pre-training data can lead to significant improvements in model performance and training efficiency. By enabling fine-grained, automated data refinement at scale, PROX offers a promising approach to enhance the quality of pre-training corpora and reduce the computational costs associated with training large language models. For more information, please consult the full paper.

Congrats to the authors for their work!

Zhou, Fan, et al. "Programming Every Example: Lifting Pre-Training Data Quality Like Experts at Scale." arXiv preprint arXiv:2409.17115 (2024).

