DeepSeek-R1: A Pure RL-based Reasoning Model
Jayant Kumar
Principal ML Scientist at Adobe | Technical Advisor at Preffect | Multimodal AI | Large language models and Knowledge Graph applications
I summarize the key steps involved in creating the DeepSeek reasoning models, from the pure-RL training of DeepSeek-R1-Zero, through the multi-stage pipeline behind DeepSeek-R1, to the distillation process that produced the DeepSeek-R1-Distill-Qwen models.
Available here: https://ollama.com/library/deepseek-r1
Training DeepSeek-R1-Zero
Begin with the DeepSeek-V3-Base model as the foundation; it can self-evolve via RL without needing any curated supervised datasets.
Employ Group Relative Policy Optimization (GRPO), which estimates the reward baseline from a group of sampled outputs rather than training a separate critic model, cutting training cost. For each input question, generate a group of outputs, score them, and use the group-relative rewards to update the policy (a minimal sketch follows below). For example, on a math problem, sampling multiple answers per question and rewarding each relative to the group encourages a diverse understanding of solutions.
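A minimal sketch of the group-relative advantage computation at the heart of GRPO, assuming one scalar reward per sampled answer. The function name and reward values are illustrative, not DeepSeek's actual implementation; in full GRPO these advantages feed a PPO-style clipped objective with a KL penalty to a reference policy, with no critic network needed.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within a group of outputs sampled for the same question.

    rewards: shape (group_size,) -- one scalar reward per sampled answer.
    Returns advantages of the same shape: (r_i - mean) / std.
    """
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)

# Illustrative example: 4 sampled answers to one math question,
# scored 1.0 if the final answer is correct, 0.0 otherwise.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
advantages = group_relative_advantages(rewards)
print(advantages)  # correct answers get a positive advantage, incorrect ones negative
```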
Use accuracy rewards to evaluate correctness on tasks with deterministic answers (e.g., math problems) via rule-based verification, such as checking a final boxed answer or running test cases for code. Incorporate format rewards to ensure generated content follows a clear reasoning-process format. Together these encourage both correctness and format consistency. For example, a reasoning problem like "Prove the Pythagorean theorem" would be rewarded for correct proof steps as well as clear structure.
Structure outputs using a fixed template: <think> reasoning process </think> <answer> final answer </answer>. Keeping responses in this format improves readability and makes failures easier to debug.
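A hedged sketch of what rule-based accuracy and format rewards might look like for the <think>/<answer> template. The regex, the exact-match answer check, and the 0/1 scoring are assumptions for illustration, not the paper's reward code; a real verifier would parse math expressions or execute test cases.

```python
import re

TEMPLATE = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the output follows the <think>...</think><answer>...</answer> template."""
    return 1.0 if TEMPLATE.search(completion) else 0.0

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """1.0 if the extracted final answer matches the reference (plain string comparison
    here; a real rule-based verifier would normalize math or run code tests)."""
    match = TEMPLATE.search(completion)
    if match is None:
        return 0.0
    predicted = match.group(2).strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0

completion = "<think>3^2 + 4^2 = 9 + 16 = 25, so the hypotenuse is 5.</think> <answer>5</answer>"
print(format_reward(completion), accuracy_reward(completion, "5"))  # 1.0 1.0
```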
Allow the model to evolve its reasoning capabilities naturally by iteratively refining its predictions through RL, focusing on reasoning-heavy domains such as coding and mathematics. Over training, the model learns to chain reasoning steps for multi-step problems like solving quadratic equations.
Continuously evaluate model performance on reasoning benchmarks (e.g., AIME 2024) during RL training to monitor improvements.
Training DeepSeek-R1
Collect thousands of high-quality long Chain-of-Thought (CoT) reasoning examples, curated through few-shot prompting, human annotation, and refinement of DeepSeek-R1-Zero outputs. Fine-tune the DeepSeek-V3-Base model with this "cold start" dataset to initialize the RL actor, addressing the readability and language-mixing issues seen in DeepSeek-R1-Zero. Starting from readable, curated data avoids the instability of launching RL directly from the base model.
Conduct reasoning-focused RL using a reward that combines accuracy rewards (for correctness in coding/math) with language-consistency rewards (to maintain coherence and prevent language mixing in the chain of thought). Train until the model converges on reasoning tasks; a sketch of such a combined reward follows below.
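A simple sketch of how accuracy and language-consistency rewards could be combined, reusing the TEMPLATE and accuracy_reward helpers from the reward sketch above. The proportion-of-ASCII-letters heuristic and the weights are illustrative assumptions; the paper describes measuring the proportion of target-language words in the CoT but does not publish the exact formula.

```python
def language_consistency_reward(cot_text: str) -> float:
    """Fraction of alphabetic characters in the chain of thought that are ASCII Latin.
    A crude proxy for 'stays in English'; purely an assumption for illustration."""
    letters = [c for c in cot_text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isascii() for c in letters) / len(letters)

def combined_reward(completion: str, reference_answer: str,
                    w_acc: float = 1.0, w_lang: float = 0.1) -> float:
    """Weighted sum of accuracy and language-consistency terms (weights are made up)."""
    match = TEMPLATE.search(completion)
    cot = match.group(1) if match else completion
    return (w_acc * accuracy_reward(completion, reference_answer)
            + w_lang * language_consistency_reward(cot))
```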
Use the RL checkpoint to generate data for supervised fine-tuning: curate reasoning data through rejection sampling (sketched below), retaining only high-quality responses, and include diverse domain-specific tasks (e.g., writing, QA, and role-playing) from DeepSeek-V3 datasets. Assemble a dataset of roughly 800k samples and fine-tune the model for two epochs. Filtering outputs this way secures high-quality reasoning as well as non-reasoning capabilities.
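A sketch of rejection sampling to curate SFT data from an RL checkpoint, reusing format_reward and accuracy_reward from the earlier sketch. The generate argument is a placeholder callable (prompt in, completion out), and the keep-only-verified-and-well-formatted criterion is a simplification of the paper's filtering.

```python
def rejection_sample(question: str, reference_answer: str, generate,
                     n: int = 16, max_keep: int = 1) -> list:
    """Sample n completions for one question and keep only those that are verifiably
    correct and well-formatted, up to max_keep examples for the SFT dataset."""
    kept = []
    for _ in range(n):
        completion = generate(question)  # placeholder: call the RL checkpoint
        if format_reward(completion) == 1.0 and accuracy_reward(completion, reference_answer) == 1.0:
            kept.append(completion)
            if len(kept) >= max_keep:
                break
    return kept
```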
Implement a secondary RL stage using a mix of rule-based rewards (for verifiable reasoning prompts) and model-based, generative rewards (for broader tasks). Optimize for reasoning, helpfulness, and harmlessness across diverse prompts and scenarios; this refines the model to perform well across tasks while aligning with human preferences.
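A tiny sketch of how the two reward sources might be dispatched per prompt in that final stage; the prompt_type field and reward_model_score callable are hypothetical names, and the real system likely blends several signals rather than switching between two.

```python
def final_stage_reward(prompt_type, completion, reference_answer, reward_model_score):
    """Rule-based reward for verifiable reasoning prompts; a learned reward model's
    preference score for open-ended helpfulness/harmlessness prompts.
    `reward_model_score` is a placeholder callable returning a scalar."""
    if prompt_type == "reasoning" and reference_answer is not None:
        return accuracy_reward(completion, reference_answer)
    return reward_model_score(completion)
```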
Evaluate performance on benchmarks such as MMLU, AIME, and Codeforces. Ensure the model aligns with human preferences while excelling in reasoning tasks.
Distilling Models Like Qwen-7B Using DeepSeek-R1
Use the trained DeepSeek-R1 model as the teacher to generate training data.
Generate a diverse dataset (~800k samples) with reasoning and non-reasoning tasks: Reasoning Tasks: Curate prompts and use rejection sampling from the teacher model's outputs to retain only high-quality reasoning examples. Ensure examples cover domains like math, coding, and logical reasoning.
Non-Reasoning Tasks: Use outputs from the DeepSeek-V3 pipeline for tasks such as writing, factual QA, and translation. Include structured Chain-of-Thought (CoT) reasoning only when beneficial (e.g., for complex tasks).
Choose a compact open-source model like Qwen-7B or Llama-8B as the base model. Prefer models with reasonable performance on reasoning benchmarks to ensure compatibility with the distilled knowledge.
Perform Supervised Fine-Tuning (SFT) on the chosen base model using the curated dataset (a minimal training-loop sketch follows below). Train for multiple epochs to incorporate both reasoning and general-purpose capabilities.
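A minimal single-GPU sketch of that SFT step with Hugging Face transformers and a plain PyTorch loop. The checkpoint name, the toy dataset, and the hyperparameters are placeholders; a real run would use the ~800k teacher samples, distributed training, and typically mask prompt tokens out of the loss rather than training on the full sequence as done here.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "Qwen/Qwen2.5-7B"  # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)

def collate(batch):
    # Each item is a full teacher-generated text: prompt + <think>...</think><answer>...</answer>
    enc = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=4096)
    enc["labels"] = enc["input_ids"].clone()  # causal-LM loss over the whole sequence (simplification)
    return enc

train_texts = ["..."]  # curated teacher outputs would go here
loader = DataLoader(train_texts, batch_size=1, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for epoch in range(2):  # multiple epochs, as described above
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```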
Focus on transferring reasoning patterns and capabilities from the teacher to the smaller model. Ensure distilled outputs align closely with the teacher model's performance, particularly in reasoning-intensive tasks.
Test the distilled model on reasoning benchmarks such as AIME, MATH-500, and Codeforces (a bare-bones pass@1 harness is sketched below). Compare its performance to both the teacher model (DeepSeek-R1) and other models of comparable size.
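A bare-bones pass@1 evaluation harness, assuming placeholder generate and extract_answer callables. This greedy one-sample-per-problem version is a simplification; the reported numbers average pass@1 over multiple sampled completions per problem.

```python
def pass_at_1(problems, generate, extract_answer) -> float:
    """problems: iterable of (prompt, reference_answer) pairs.
    Scores 1 per problem if the extracted final answer matches the reference."""
    correct = 0
    total = 0
    for prompt, reference in problems:
        completion = generate(prompt)          # placeholder: query the distilled model
        if extract_answer(completion) == reference:
            correct += 1
        total += 1
    return correct / total if total else 0.0

# e.g. problems = [("AIME-style question ...", "25"), ...]
```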
Refine the distilled model with additional SFT using newly generated data if performance gaps are identified.