DeepSeek-R1: A Pure RL-based Reasoning Model

This article summarizes the key steps involved in creating the DeepSeek reasoning models, from the RL-only training of DeepSeek-R1-Zero and the multi-stage pipeline behind DeepSeek-R1 to the distillation process that produced the DeepSeek-R1-Distill-Qwen models.

Paper: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Available here: https://ollama.com/library/deepseek-r1

Training DeepSeek-R1-Zero

  • Base Model Initialization:

Start from the DeepSeek-V3-Base model as the foundation; it can begin self-evolving via RL without any curated supervised datasets.

  • Reinforcement Learning (RL) Framework:

Employ Group Relative Policy Optimization (GRPO), which estimates the advantage baseline from group scores instead of training a separate critic model, reducing training cost. For each input question, generate a group of outputs, score them, and optimize the policy against the group-relative rewards. For example, for a math problem, sampling multiple answers per question and rewarding them relative to one another exposes the model to a diverse set of solution paths. A minimal sketch of the group-relative advantage appears below.
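
The sketch below shows only the group-relative advantage normalization at the core of GRPO; the clipped policy-ratio objective and KL penalty from the paper are omitted, and the group size and reward values are purely illustrative.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Normalize each sampled output's reward against its group's mean and std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon avoids division by zero

# Example: 4 sampled answers to one math question, scored 1.0 if correct, else 0.0.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # -> roughly [ 1., -1., -1.,  1.]
```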

  • Reward Modeling:

Use rule-based accuracy rewards to check correctness on verifiable tasks (e.g., comparing a math answer against the ground truth, or running test cases for code), and format rewards to ensure the generated content follows the expected reasoning template. Together these encourage correctness and format consistency: a math problem, for instance, is rewarded both for the right final answer and for a cleanly tagged reasoning trace. A sketch of such rule-based rewards follows.
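
A hedged sketch of what such rule-based rewards could look like; the regular expressions, tag matching, and exact-match check are assumptions for illustration, not the paper's exact rules.

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the <think>...</think> <answer>...</answer> template."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, completion, flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the tagged final answer matches a verifiable reference answer."""
    m = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return 1.0 if m and m.group(1).strip() == reference.strip() else 0.0

# Toy usage: a correct, well-formatted completion scores on both terms.
sample = "<think> 12 * 13 = 156 </think> <answer> 156 </answer>"
print(accuracy_reward(sample, "156"), format_reward(sample))  # 1.0 1.0
```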

  • Training Template:

Structure every response with a fixed template: <think> reasoning process </think> <answer> final answer </answer>. Separating the reasoning from the answer improves readability and makes outputs easy to parse and debug; the template can be rendered as sketched below.
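
A reconstruction of the template as a Python prompt string; the wording of the instruction line is paraphrased for illustration, not quoted verbatim from the paper.

```python
TEMPLATE = (
    "A conversation between User and Assistant. The Assistant first thinks through "
    "the reasoning process in its mind and then provides the final answer.\n"
    "User: {question}\n"
    "Assistant: <think> {reasoning} </think> <answer> {answer} </answer>"
)

print(TEMPLATE.format(
    question="What is 12 * 13?",
    reasoning="12 * 13 = 12 * 10 + 12 * 3 = 120 + 36 = 156",
    answer="156",
))
```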

  • Self-Evolution through RL:

Allow the model to evolve its reasoning capabilities naturally by iteratively refining its outputs through RL, with an emphasis on verifiable domains such as coding and mathematics. Over training, the model learns to chain reasoning steps, for example when solving quadratic equations.

  • Benchmark and Evaluate:

Continuously evaluate model performance on reasoning benchmarks (e.g., AIME 2024) during RL training to monitor improvements.


Training DeepSeek-R1

  • Cold Start with Supervised Fine-Tuning (SFT)

Collect thousands of high-quality, long chain-of-thought (CoT) reasoning examples, curated through few-shot prompting, human annotation, and post-processing of DeepSeek-R1-Zero outputs. Fine-tune the DeepSeek-V3-Base model on this dataset to initialize the RL actor, addressing the readability and language-mixing issues seen in R1-Zero. Starting from readable, curated data avoids the instability of launching RL directly from the base model. A sketch of one possible readability filter follows.
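
A minimal sketch of a readability filter over R1-Zero outputs. The heuristics, thresholds, and field names are assumptions for illustration; the paper relies on few-shot prompting and human post-processing rather than a specific published filter.

```python
import re

def is_readable_cot(sample: dict) -> bool:
    """Keep samples with a well-formed template, no language mixing, and bounded length."""
    text = sample["completion"]
    if not re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", text, flags=re.DOTALL):
        return False                                # malformed or missing template
    if sample["language"] == "en" and re.search(r"[\u4e00-\u9fff]", text):
        return False                                # language mixing: CJK inside an English sample
    return len(text.split()) < 4000                 # drop run-away generations

raw_r1_zero_outputs = [
    {"completion": "<think> 2 + 2 = 4 </think> <answer> 4 </answer>", "language": "en"},
    {"completion": "partial reasoning with no answer tag", "language": "en"},
]
cold_start = [s for s in raw_r1_zero_outputs if is_readable_cot(s)]
print(len(cold_start))  # 1
```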

  • Reasoning-Oriented Reinforcement Learning

Conduct reasoning-focused RL with a reward that combines accuracy rewards (for correctness in coding and math) with a language-consistency reward (to keep the CoT coherent and prevent language mixing). Train until the model converges on reasoning tasks. Aligning rewards with the target tasks further strengthens reasoning performance; a sketch of the consistency term follows.
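
A sketch of a language-consistency term and its combination with an accuracy score. The whitespace-token split, CJK-codepoint heuristic, and 0.2 weight are assumptions; the paper describes this reward as the proportion of target-language words in the CoT.

```python
import re

def language_consistency_reward(cot: str, target_lang: str = "en") -> float:
    """Fraction of whitespace-separated tokens written in the target language."""
    tokens = cot.split()
    if not tokens:
        return 0.0
    has_cjk = [bool(re.search(r"[\u4e00-\u9fff]", t)) for t in tokens]
    in_target = sum(not c for c in has_cjk) if target_lang == "en" else sum(has_cjk)
    return in_target / len(tokens)

def combined_reward(accuracy: float, cot: str) -> float:
    """Accuracy (e.g., from a rule-based check) plus a weighted consistency bonus."""
    return accuracy + 0.2 * language_consistency_reward(cot)

print(combined_reward(1.0, "First compute 12 * 13, 然后 add the parts"))  # penalized for mixing
```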

  • Rejection Sampling and SFT (Round 2)

Use the RL checkpoint to generate data for a second round of supervised fine-tuning: curate reasoning data through rejection sampling, retaining only high-quality responses, and add diverse non-reasoning tasks (e.g., writing, QA, and role-playing) drawn from DeepSeek-V3 datasets. Assemble a dataset of roughly 800k samples and fine-tune the model for two epochs. Filtering outputs this way secures both strong reasoning and solid general-purpose capabilities; a rejection-sampling sketch follows.
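
A schematic of rejection sampling; `generate` and `is_high_quality` are placeholders standing in for the actual RL-checkpoint sampling call and the rule- or model-based quality filter.

```python
import random

def generate(prompt: str, n: int) -> list[str]:
    """Placeholder for sampling n completions from the RL checkpoint."""
    return [f"<think> attempt {i} for: {prompt} </think> <answer> 42 </answer>" for i in range(n)]

def is_high_quality(completion: str, reference: str) -> bool:
    """Placeholder quality filter: correct final answer inside well-formed tags."""
    return f"<answer> {reference} </answer>" in completion

def rejection_sample(prompts_with_refs, samples_per_prompt=16, keep_per_prompt=1):
    dataset = []
    for prompt, reference in prompts_with_refs:
        candidates = [c for c in generate(prompt, samples_per_prompt)
                      if is_high_quality(c, reference)]
        keep = random.sample(candidates, min(keep_per_prompt, len(candidates)))
        dataset.extend({"prompt": prompt, "completion": c} for c in keep)
    return dataset

print(rejection_sample([("What is 6 * 7?", "42")]))
```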

  • RL for All Scenarios

Implement a secondary RL stage that mixes rule-based rewards (for verifiable reasoning prompts) with generative reward-model scores (for broader tasks), optimizing for reasoning, helpfulness, and harmlessness across a wide range of prompts. This stage refines the model to perform well on varied tasks while aligning with human preferences; a sketch of the reward dispatch is shown below.
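
A toy sketch of dispatching between the two reward types; both scoring functions are placeholders (a real setup would call verifiers for reasoning prompts and a learned reward model for general ones).

```python
def rule_based_reward(completion: str, reference: str) -> float:
    """Verifiable reasoning prompts: exact-match of the final answer (placeholder check)."""
    return 1.0 if reference in completion else 0.0

def generative_reward(prompt: str, completion: str) -> float:
    """General prompts: stand-in for a learned preference/reward model score."""
    return min(len(completion.split()) / 100.0, 1.0)  # dummy heuristic, not a real judge

def stage_reward(sample: dict) -> float:
    """Dispatch on task type, mirroring the mixed-reward stage described above."""
    if sample["type"] == "reasoning":
        return rule_based_reward(sample["completion"], sample["reference"])
    return generative_reward(sample["prompt"], sample["completion"])

print(stage_reward({"type": "reasoning", "completion": "<answer> 42 </answer>", "reference": "42"}))
```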

  • Final Evaluation:

Evaluate performance on benchmarks such as MMLU, AIME, and Codeforces. Ensure the model aligns with human preferences while excelling in reasoning tasks.


Distilling Models Like Qwen-7B Using DeepSeek-R1

  • Start with the Teacher Model:

Use the trained DeepSeek-R1 model as the teacher to generate training data.

  • Data Generation:

Generate a diverse dataset (~800k samples) covering both reasoning and non-reasoning tasks.

Reasoning tasks: curate prompts and apply rejection sampling to the teacher model's outputs, retaining only high-quality reasoning examples that span domains like math, coding, and logical reasoning.

Non-reasoning tasks: reuse outputs from the DeepSeek-V3 pipeline for tasks such as writing, factual QA, and translation, including structured chain-of-thought (CoT) reasoning only when it is beneficial (e.g., for complex tasks).

  • Base Model Selection:

Choose a compact open-source model such as Qwen-7B or Llama-8B as the base. Prefer models that already perform reasonably on reasoning benchmarks so the distilled knowledge transfers well.

  • Fine-Tune the Base Model:

Perform supervised fine-tuning (SFT) on the chosen base model using the curated dataset, training for multiple epochs so it absorbs both reasoning and general-purpose capabilities. A minimal SFT sketch follows.
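
A minimal SFT sketch using Hugging Face Transformers. The model id, data file name, sequence length, and hyperparameters are placeholders rather than the paper's exact setup; notably, the distilled models are trained with SFT only, without an RL stage on the student.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "Qwen/Qwen2.5-Math-7B"   # assumed student base; swap for any compact open model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# distill_data.jsonl is assumed to hold {"text": prompt + teacher CoT + answer} rows
# generated by DeepSeek-R1 (reasoning) and the DeepSeek-V3 pipeline (non-reasoning).
dataset = load_dataset("json", data_files="distill_data.jsonl", split="train")
tokenized = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=4096),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="r1-distill-qwen-7b", num_train_epochs=2,
                           per_device_train_batch_size=1, learning_rate=1e-5, bf16=True),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```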

  • Distillation Optimization:

Focus on transferring reasoning patterns and capabilities from the teacher to the smaller model. Ensure distilled outputs align closely with the teacher model's performance, particularly in reasoning-intensive tasks.

  • Evaluation and Benchmarking:

Test the distilled model on reasoning benchmarks such as AIME, MATH-500, and Codeforces. Compare its performance to both the teacher model (DeepSeek-R1) and other comparable models.

  • Iterative Improvement (Optional):

Refine the distilled model with additional SFT using newly generated data if performance gaps are identified.



