DeepSeek-R1: A Pure RL-based Reasoning Model
Jayant Kumar
Principal ML Scientist at Adobe | Technical Advisor at Preffect | Multimodal AI | Large language models and Knowledge Graph applications
I summarize the key steps involved in creating the DeepSeek reasoning models, from the pure-RL training of DeepSeek-R1-Zero, through the multi-stage pipeline behind DeepSeek-R1, to the distillation process that produced the DeepSeek-R1-Distill-Qwen models.
Available here: https://ollama.com/library/deepseek-r1
Training DeepSeek-R1-Zero
Begin with the DeepSeek-V3-Base model as the foundation; it can self-evolve via RL without needing any curated supervised datasets.
Employ Group Relative Policy Optimization (GRPO), which estimates the reward baseline from a group of sampled outputs rather than training a separate critic model, cutting training cost. For each input question, generate a group of outputs, score them, and use the group-relative rewards to update the policy (a minimal sketch follows below). For example, on a math problem, sampling multiple answers per question and rewarding each relative to the group encourages a diverse understanding of solutions.
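A minimal sketch of the group-relative advantage computation at the heart of GRPO, assuming one scalar reward per sampled answer. The function name and reward values are illustrative, not DeepSeek's actual implementation; in full GRPO these advantages feed a PPO-style clipped objective with a KL penalty to a reference policy, with no critic network needed.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within a group of outputs sampled for the same question.

    rewards: shape (group_size,) -- one scalar reward per sampled answer.
    Returns advantages of the same shape: (r_i - mean) / std.
    """
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)

# Illustrative example: 4 sampled answers to one math question,
# scored 1.0 if the final answer is correct, 0.0 otherwise.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
advantages = group_relative_advantages(rewards)
print(advantages)  # correct answers get a positive advantage, incorrect ones negative
```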
Use accuracy rewards to evaluate correctness on tasks with deterministic answers (e.g., math problems) via rule-based verification, such as checking a final boxed answer or running test cases for code. Incorporate format rewards to ensure generated content follows a clear reasoning-process format. Together these encourage both correctness and format consistency. For example, a reasoning problem like "Prove the Pythagorean theorem" would be rewarded for correct proof steps as well as clear structure.
Structure outputs using a fixed template: <think> reasoning process </think> <answer> final answer </answer>. Keeping responses in this format improves readability and makes failures easier to debug.
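A hedged sketch of what rule-based accuracy and format rewards might look like for the <think>/<answer> template. The regex, the exact-match answer check, and the 0/1 scoring are assumptions for illustration, not the paper's reward code; a real verifier would parse math expressions or execute test cases.

```python
import re

TEMPLATE = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the output follows the <think>...</think><answer>...</answer> template."""
    return 1.0 if TEMPLATE.search(completion) else 0.0

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """1.0 if the extracted final answer matches the reference (plain string comparison
    here; a real rule-based verifier would normalize math or run code tests)."""
    match = TEMPLATE.search(completion)
    if match is None:
        return 0.0
    predicted = match.group(2).strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0

completion = "<think>3^2 + 4^2 = 9 + 16 = 25, so the hypotenuse is 5.</think> <answer>5</answer>"
print(format_reward(completion), accuracy_reward(completion, "5"))  # 1.0 1.0
```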
Allow the model to evolve its reasoning capabilities naturally by iteratively refining its predictions through RL, focusing on reasoning-heavy domains such as coding and mathematics. Over training, the model learns to chain reasoning steps for multi-step problems like solving quadratic equations.
Continuously evaluate model performance on reasoning benchmarks (e.g., AIME 2024) during RL training to monitor improvements.
Training DeepSeek-R1
Collect thousands of high-quality long Chain-of-Thought (CoT) reasoning examples, curated through few-shot prompting, human annotation, and refinement of DeepSeek-R1-Zero outputs. Fine-tune the DeepSeek-V3-Base model with this "cold start" dataset to initialize the RL actor, addressing the readability and language-mixing issues seen in DeepSeek-R1-Zero. Starting from readable, curated data avoids the instability of launching RL directly from the base model.
Conduct reasoning-focused RL using a reward that combines accuracy rewards (for correctness in coding/math) with language-consistency rewards (to maintain coherence and prevent language mixing in the chain of thought). Train until the model converges on reasoning tasks; a sketch of such a combined reward follows below.
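A simple sketch of how accuracy and language-consistency rewards could be combined, reusing the TEMPLATE and accuracy_reward helpers from the reward sketch above. The proportion-of-ASCII-letters heuristic and the weights are illustrative assumptions; the paper describes measuring the proportion of target-language words in the CoT but does not publish the exact formula.

```python
def language_consistency_reward(cot_text: str) -> float:
    """Fraction of alphabetic characters in the chain of thought that are ASCII Latin.
    A crude proxy for 'stays in English'; purely an assumption for illustration."""
    letters = [c for c in cot_text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isascii() for c in letters) / len(letters)

def combined_reward(completion: str, reference_answer: str,
                    w_acc: float = 1.0, w_lang: float = 0.1) -> float:
    """Weighted sum of accuracy and language-consistency terms (weights are made up)."""
    match = TEMPLATE.search(completion)
    cot = match.group(1) if match else completion
    return (w_acc * accuracy_reward(completion, reference_answer)
            + w_lang * language_consistency_reward(cot))
```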
Use the RL checkpoint to generate data for supervised fine-tuning: curate reasoning data through rejection sampling (sketched below), retaining only high-quality responses, and include diverse domain-specific tasks (e.g., writing, QA, and role-playing) from DeepSeek-V3 datasets. Assemble a dataset of roughly 800k samples and fine-tune the model for two epochs. Filtering outputs this way secures high-quality reasoning as well as non-reasoning capabilities.
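A sketch of rejection sampling to curate SFT data from an RL checkpoint, reusing format_reward and accuracy_reward from the earlier sketch. The generate argument is a placeholder callable (prompt in, completion out), and the keep-only-verified-and-well-formatted criterion is a simplification of the paper's filtering.

```python
def rejection_sample(question: str, reference_answer: str, generate,
                     n: int = 16, max_keep: int = 1) -> list:
    """Sample n completions for one question and keep only those that are verifiably
    correct and well-formatted, up to max_keep examples for the SFT dataset."""
    kept = []
    for _ in range(n):
        completion = generate(question)  # placeholder: call the RL checkpoint
        if format_reward(completion) == 1.0 and accuracy_reward(completion, reference_answer) == 1.0:
            kept.append(completion)
            if len(kept) >= max_keep:
                break
    return kept
```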
Implement a secondary RL stage using a mix of rule-based rewards (for verifiable reasoning prompts) and model-based, generative rewards (for broader tasks). Optimize for reasoning, helpfulness, and harmlessness across diverse prompts and scenarios; this refines the model to perform well across tasks while aligning with human preferences.
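A tiny sketch of how the two reward sources might be dispatched per prompt in that final stage; the prompt_type field and reward_model_score callable are hypothetical names, and the real system likely blends several signals rather than switching between two.

```python
def final_stage_reward(prompt_type, completion, reference_answer, reward_model_score):
    """Rule-based reward for verifiable reasoning prompts; a learned reward model's
    preference score for open-ended helpfulness/harmlessness prompts.
    `reward_model_score` is a placeholder callable returning a scalar."""
    if prompt_type == "reasoning" and reference_answer is not None:
        return accuracy_reward(completion, reference_answer)
    return reward_model_score(completion)
```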
Evaluate performance on benchmarks such as MMLU, AIME, and Codeforces. Ensure the model aligns with human preferences while excelling in reasoning tasks.
Distilling Models Like Qwen-7B Using DeepSeek-R1
Use the trained DeepSeek-R1 model as the teacher to generate training data.
Generate a diverse dataset (~800k samples) with reasoning and non-reasoning tasks: Reasoning Tasks: Curate prompts and use rejection sampling from the teacher model's outputs to retain only high-quality reasoning examples. Ensure examples cover domains like math, coding, and logical reasoning.
Non-Reasoning Tasks: Use outputs from the DeepSeek-V3 pipeline for tasks such as writing, factual QA, and translation. Include structured Chain-of-Thought (CoT) reasoning only when beneficial (e.g., for complex tasks).
Choose a compact open-source model like Qwen-7B or Llama-8B as the base model. Prefer models with reasonable performance on reasoning benchmarks to ensure compatibility with the distilled knowledge.
Perform Supervised Fine-Tuning (SFT) on the chosen base model using the curated dataset (a minimal training-loop sketch follows below). Train for multiple epochs to incorporate both reasoning and general-purpose capabilities.
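A minimal single-GPU sketch of that SFT step with Hugging Face transformers and a plain PyTorch loop. The checkpoint name, the toy dataset, and the hyperparameters are placeholders; a real run would use the ~800k teacher samples, distributed training, and typically mask prompt tokens out of the loss rather than training on the full sequence as done here.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "Qwen/Qwen2.5-7B"  # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)

def collate(batch):
    # Each item is a full teacher-generated text: prompt + <think>...</think><answer>...</answer>
    enc = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=4096)
    enc["labels"] = enc["input_ids"].clone()  # causal-LM loss over the whole sequence (simplification)
    return enc

train_texts = ["..."]  # curated teacher outputs would go here
loader = DataLoader(train_texts, batch_size=1, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for epoch in range(2):  # multiple epochs, as described above
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```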
Focus on transferring reasoning patterns and capabilities from the teacher to the smaller model. Ensure distilled outputs align closely with the teacher model's performance, particularly in reasoning-intensive tasks.
Test the distilled model on reasoning benchmarks such as AIME, MATH-500, and Codeforces (a bare-bones pass@1 harness is sketched below). Compare its performance to both the teacher model (DeepSeek-R1) and other models of comparable size.
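A bare-bones pass@1 evaluation harness, assuming placeholder generate and extract_answer callables. This greedy one-sample-per-problem version is a simplification; the reported numbers average pass@1 over multiple sampled completions per problem.

```python
def pass_at_1(problems, generate, extract_answer) -> float:
    """problems: iterable of (prompt, reference_answer) pairs.
    Scores 1 per problem if the extracted final answer matches the reference."""
    correct = 0
    total = 0
    for prompt, reference in problems:
        completion = generate(prompt)          # placeholder: query the distilled model
        if extract_answer(completion) == reference:
            correct += 1
        total += 1
    return correct / total if total else 0.0

# e.g. problems = [("AIME-style question ...", "25"), ...]
```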
Refine the distilled model with additional SFT using newly generated data if performance gaps are identified.