Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale
Credit: https://arxiv.org/pdf/2409.17115


Today's paper introduces PROX (Programming Every Example), a new framework for refining pre-training data for large language models using smaller language models. The method generates and executes fine-grained operations to improve data quality at scale, leading to significant performance gains across various benchmarks while reducing computational costs. This approach offers a promising path for more efficient and effective language model pre-training.

Method Overview

PROX works by treating data refinement as a programming task. It uses a small language model (with as few as 0.3B parameters) to generate programs that refine individual examples in the pre-training corpus.

The process involves two main stages:

  1. Document-level programming: The model decides whether to keep or discard entire documents.
  2. Chunk-level programming: For retained documents, the model performs more fine-grained operations like removing specific lines or normalizing text.

These operations are represented as function calls (e.g., keep_doc(), remove_lines(), normalize()) which are then executed by a pre-defined executor to produce the refined corpus.
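To make this concrete, below is a minimal Python sketch of what such an executor might look like. The function names follow the operations described in the paper (keep_doc, remove_lines, normalize, plus a drop_doc signal), but the exact argument formats and the toy string-based program parser are illustrative assumptions, not the paper's implementation.

```python
import ast

def remove_lines(lines, start, end):
    """Drop the lines in the half-open range [start, end)."""
    return lines[:start] + lines[end:]

def normalize(lines, source, target):
    """Replace a noisy substring (e.g. boilerplate markup) in every line."""
    return [line.replace(source, target) for line in lines]

def execute_program(document: str, program: str):
    """Apply a model-generated refinement program to a single document.

    `program` is plain text containing one function call per line, e.g.:
        remove_lines(1, 2)
        normalize('!!!', '!')
    Returns the refined document, or None if the document is dropped.
    """
    lines = document.splitlines()
    for call in program.strip().splitlines():
        call = call.strip()
        if call.startswith("drop_doc"):
            return None                      # document-level: discard entirely
        if call.startswith("keep_doc"):
            continue                         # document-level: keep as-is
        if call.startswith("remove_lines"):
            args = ast.literal_eval(call[len("remove_lines"):])
            lines = remove_lines(lines, *args)
        elif call.startswith("normalize"):
            args = ast.literal_eval(call[len("normalize"):])
            lines = normalize(lines, *args)
    return "\n".join(lines)

# Hypothetical example: a program emitted by the small refining model.
doc = "Breaking news!!!\nclick here to subscribe\nUseful paragraph about physics."
prog = "remove_lines(1, 2)\nnormalize('!!!', '!')"
print(execute_program(doc, prog))
```

Because the model only emits short programs rather than rewritten text, the heavy lifting of editing the corpus stays in cheap, deterministic code like the executor above.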

To adapt the small language model for this task, PROX first uses a larger language model (like LLAMA-3) to generate annotated examples. These examples are then used to fine-tune the smaller model, enabling it to generate appropriate refinement programs for new examples.
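The sketch below illustrates how such annotated examples could be collected and stored as supervised fine-tuning data. The prompt wording, the generic `generate_fn` callable standing in for the larger model, and the JSONL record format are assumptions for illustration, not the paper's exact setup.

```python
import json

# Prompt wording is an illustrative assumption, not the paper's exact template.
ANNOTATION_PROMPT = """You are a data-refining expert. Given the document below,
write a refinement program using keep_doc(), drop_doc(),
remove_lines(start, end), and normalize(source, target).

Document:
{document}

Program:"""

def build_sft_records(sample_documents, generate_fn):
    """Annotate a small sample of documents with refinement programs using a
    strong model (e.g. a LLaMA-3-class model), accessed here through a generic
    `generate_fn(prompt) -> str` callable."""
    records = []
    for doc in sample_documents:
        program = generate_fn(ANNOTATION_PROMPT.format(document=doc))
        records.append({"prompt": doc, "completion": program})
    return records

def save_sft_dataset(records, path="prox_sft_data.jsonl"):
    """Write the (document, program) pairs as JSONL, ready for standard
    supervised fine-tuning of the small (~0.3B) refining model."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```

Once fine-tuned on these pairs, the small model can generate refinement programs for the full corpus at a fraction of the cost of querying the larger model directly.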

The refined corpus produced by PROX is then used to pre-train language models, resulting in improved performance across various downstream tasks.

Results

The paper demonstrates several key results:

  1. Models pre-trained on PROX-curated data outperform those trained on the original data, or on data filtered by other selection methods, by more than 2% across various downstream benchmarks.
  2. PROX is effective across different model sizes (from 0.3B to 1.7B parameters) and various pre-training corpora (including C4, RedPajama-V2, and FineWeb).
  3. In domain-specific continual pre-training (e.g., for mathematical tasks), PROX-refined data leads to significant improvements. For instance, it improves average accuracy by 7.6% for MISTRAL-7B, 14.6% for LLAMA-2-7B, and 20.3% for CODELLAMA-7B, all within 10B tokens of training.
  4. PROX significantly reduces the number of training FLOPs required to achieve comparable performance, offering a more efficient path for LLM pre-training.

Conclusion

PROX demonstrates that using small language models to refine pre-training data can lead to significant improvements in model performance and training efficiency. By enabling fine-grained, automated data refinement at scale, PROX offers a promising approach to enhance the quality of pre-training corpora and reduce the computational costs associated with training large language models. For more information, please consult the full paper.

Congrats to the authors for their work!

Zhou, Fan, et al. "Programming Every Example: Lifting Pre-Training Data Quality Like Experts at Scale." arXiv preprint arXiv:2409.17115 (2024).

