Synthetic data won the AI Mathematical Olympiad 2024 Progress Prize!
-> Good data is all you need!
-> Note: I'll be integrating the data creation pipeline into https://lnkd.in/gd2siUtP soon.
Overall system used by NuminaMath for winning:
1. Used the DeepSeekMath-Base 7B model from deepseek_ai
2. Custom synthetic datasets:
> Stage 1: 100k+ CoT dataset from Math PDFs
> Stage 2: 60k problems (Python REPL), GPT-4 generated w/ ToRA
3. Two-stage full fine-tuning (following the MuMath-Code paper), no LoRA/DoRA, sequences packed to 2048 tokens:
> Stage 1: SFT on CoT data
> Stage 2: SFT on TIR (tool-integrated reasoning) data w/ Python REPL
4. Quantized the model to 8-bit to run on T4 GPUs
5. SC-TIR (self-consistency with tool-integrated reasoning) for inference:
> Generate 48 candidates w/ Python REPL
> Repeat gen/exec up to 3x for errors
> Majority vote for final answer
6. Multiple validation sets to avoid overfitting.
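To make step 5 concrete, here is a minimal sketch of the SC-TIR loop. It is not NuminaMath's actual code: `generate` is a hypothetical stand-in for the model call, the program's printed output is treated as the candidate answer, and the names `run_python`, `solve_one`, and `sc_tir` are made up for illustration.

```python
import contextlib
import io
import re
from collections import Counter

FENCE = "`" * 3  # markdown code fence, built here to keep this block readable
CODE_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def run_python(code):
    """Execute a program and return (ok, printed output or error message)."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})  # assumption: a real system would sandbox this
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, buf.getvalue().strip()


def solve_one(problem, generate, max_rounds=3):
    """One candidate: generate code, execute it, feed errors back, retry."""
    context = problem
    for _ in range(max_rounds):
        completion = generate(context)  # hypothetical model call
        match = CODE_RE.search(completion)
        if match is None:
            return None  # no program produced, drop this candidate
        ok, output = run_python(match.group(1))
        if ok:
            return output or None
        # Append the failed attempt and its error so the next round can fix it.
        context += f"\n{completion}\n{FENCE}output\n{output}\n{FENCE}\n"
    return None


def sc_tir(problem, generate, n_candidates=48, max_rounds=3):
    """Sample candidates, then majority-vote over the answers that survived."""
    answers = [solve_one(problem, generate, max_rounds) for _ in range(n_candidates)]
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None
```

Candidates that never produce a working program are dropped before the vote, so a few broken generations cannot outvote the consistent ones.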
Synthetic dataset generation:
1. Collect a diverse dataset of natural-language math problems and solutions -> template each solution with CoT reasoning. The base model is fine-tuned on this.
2. Then, a synthetic dataset with tool-integrated reasoning is created. Each math problem is decomposed into a sequence of rationales, Python programs, and their outputs. This is the same format as in Microsoft’s ToRA paper. GPT-4 was prompted to produce solutions in the ToRA format with code execution feedback. The stage 1 model is then fine-tuned on this dataset.
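A rough sketch of what one tool-integrated training sample could look like: each step pairs a rationale with a Python program, and the program's real output is appended after it. The exact text layout, the `Step` class, and the helper names are my assumptions for illustration, not the format NuminaMath or ToRA actually ship.

```python
import contextlib
import io
from dataclasses import dataclass

FENCE = "`" * 3  # markdown code fence


@dataclass
class Step:
    rationale: str  # natural-language reasoning for this step
    program: str    # Python code that carries out the step


def execute(program):
    """Run a program and capture what it prints (assumes trusted code)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(program, {})
    return buf.getvalue().strip()


def to_tir_sample(problem, steps, answer):
    """Interleave rationale -> program -> executed output into one text sample."""
    parts = [f"Problem: {problem}"]
    for step in steps:
        parts.append(step.rationale)
        parts.append(f"{FENCE}python\n{step.program}\n{FENCE}")
        parts.append(f"{FENCE}output\n{execute(step.program)}\n{FENCE}")
    parts.append(f"Final answer: {answer}")
    return "\n".join(parts)


sample = to_tir_sample(
    "What is the sum of the first 10 positive integers?",
    [Step("Sum the integers 1 through 10 directly.", "print(sum(range(1, 11)))")],
    "55",
)
```

Because the output block comes from actually running the program, the sample stays consistent with its own code, which is the point of generating with execution feedback.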
#AI #Mathematics #DataScience #Innovation #DataCreation #ArtificialIntelligence #SyntheticData