Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale
Today's paper introduces PROX (Programming Every Example), a new framework for refining pre-training data for large language models using smaller language models. The method generates and executes fine-grained operations to improve data quality at scale, leading to significant performance gains across various benchmarks while reducing computational costs. This approach offers a promising path for more efficient and effective language model pre-training.
Method Overview
PROX works by treating data refinement as a programming task. It uses a small language model (with as few as 0.3B parameters) to generate programs that refine individual examples in the pre-training corpus.
The process involves two main stages:
- Document-level programming: the small model decides whether an entire document should be kept or dropped.
- Chunk-level programming: for retained documents, the model generates finer-grained operations, such as removing noisy lines or normalizing problematic strings.
These operations are represented as function calls (e.g., keep_doc(), remove_lines(), normalize()), which a pre-defined executor then runs to produce the refined corpus.
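To make the executor idea concrete, here is a minimal Python sketch of what running such a program over one document could look like. The operation names follow the function calls mentioned above (plus an assumed drop_doc() counterpart), but the program representation, argument names, and executor logic are simplified illustrations, not the paper's actual implementation.

```python
# Minimal sketch of a refinement-program executor. The (op, args) tuple
# representation, the drop_doc operation, and the argument conventions are
# assumptions for illustration; PROX emits textual function calls that its
# own executor parses and runs.

def execute_program(document, program):
    """Apply a sequence of (operation, arguments) calls to one document.

    Returns the refined document text, or None if the program drops it.
    """
    lines = document.split("\n")
    keep = True

    for op, args in program:
        if op == "drop_doc":            # discard the whole document
            keep = False
        elif op == "keep_doc":          # keep the document
            keep = True
        elif op == "remove_lines":      # delete a span of noisy lines
            start, end = args["start"], args["end"]
            lines = lines[:start] + lines[end + 1:]
        elif op == "normalize":         # replace a noisy substring everywhere
            lines = [l.replace(args["source"], args["target"]) for l in lines]

    return "\n".join(lines) if keep else None


# Example: drop a boilerplate header line and strip a tracking token.
doc = "Cookie notice: accept all\nPROX refines pre-training data.\nutm_source=feed"
program = [("remove_lines", {"start": 0, "end": 0}),
           ("normalize", {"source": "utm_source=feed", "target": ""})]
print(execute_program(doc, program))
```

Keeping the operations this small and deterministic is what lets the refinement run cheaply over billions of documents: the language model only has to emit a short program, and the heavy lifting is plain string manipulation.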
To adapt the small language model for this task, PROX first uses a larger language model (like LLAMA-3) to generate annotated examples. These examples are then used to fine-tune the smaller model, enabling it to generate appropriate refinement programs for new examples.
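This adaptation step can be pictured as ordinary supervised fine-tuning: sampled documents are annotated with refinement programs by the larger model, and each (document, program) pair becomes one training example for the small model. The sketch below assumes a generic prompt format and a hypothetical annotate_with_large_model helper; the paper's exact prompts and data layout may differ.

```python
# Hypothetical sketch of assembling the seed fine-tuning data. The prompt
# wording, JSONL format, and annotate_with_large_model callable are
# assumptions, not the paper's exact setup.

import json

def build_sft_example(document, program_text):
    """Turn one annotated document into a prompt/completion pair."""
    prompt = (
        "Write a refinement program for the following document.\n"
        f"Document:\n{document}\n"
        "Program:"
    )
    return {"prompt": prompt, "completion": program_text}

def build_sft_dataset(documents, annotate_with_large_model, path="prox_sft.jsonl"):
    """Annotate sampled documents with a large model and write SFT pairs to JSONL."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in documents:
            # e.g., program_text == 'remove_lines(0, 0)\nkeep_doc()'
            program_text = annotate_with_large_model(doc)
            f.write(json.dumps(build_sft_example(doc, program_text)) + "\n")
```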
The refined corpus produced by PROX is then used to pre-train language models, resulting in improved performance across various downstream tasks.
Results
The paper reports several key results:
- Models pre-trained on PROX-refined data outperform those trained on the original corpora across a wide range of downstream benchmarks.
- Because the refined data is higher quality, comparable performance can be reached with substantially less training compute.
- The approach works even when the refining model is small (as few as 0.3B parameters), keeping the refinement cost modest relative to pre-training.
Conclusion
PROX demonstrates that using small language models to refine pre-training data can lead to significant improvements in model performance and training efficiency. By enabling fine-grained, automated data refinement at scale, PROX offers a promising approach to enhance the quality of pre-training corpora and reduce the computational costs associated with training large language models. For more information, please consult the full paper.
Congrats to the authors for their work!
Zhou, Fan, et al. "Programming Every Example: Lifting Pre-Training Data Quality Like Experts at Scale." arXiv preprint arXiv:2409.17115 (2024).