DeepSeek-R1: Enhancing LLM Reasoning with Reinforcement Learning
Chander D.
CEO of Cazton, Author, Microsoft AI MVP, Microsoft RD & Google Developer Expert Award
Introduction
Advancing the reasoning capabilities of Large Language Models (LLMs) remains a key challenge in artificial intelligence research. Traditional methods often rely heavily on supervised fine-tuning (SFT) using large amounts of annotated data, which can be resource-intensive and time-consuming. The paper "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" explores an alternative approach: improving reasoning abilities in LLMs through pure reinforcement learning (RL), without initial supervised fine-tuning. This blog post delves into the methods, experiments, and findings of this research, offering insights into how RL can be used to enhance reasoning in LLMs and the implications for future developments.
DeepSeek-R1-Zero: Reinforcement Learning on the Base Model
Overview
The researchers introduced DeepSeek-R1-Zero, a model trained exclusively using RL starting from the base model, without any supervised fine-tuning. The primary goal was to explore whether an LLM could develop reasoning capabilities through self-evolution driven by RL alone. This approach allows the model to naturally develop reasoning behaviors without biases introduced by supervised data, providing a clear view of its learning trajectory.
Reinforcement Learning Algorithm
To efficiently train the model, the team employed the Group Relative Policy Optimization (GRPO) algorithm. GRPO optimizes the policy model by sampling a group of outputs for each question and computing the advantage of each output relative to the others in the group. This eliminates the need for a separate value network (critic), reducing computational overhead.
The objective function for GRPO is as follows:
\[
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}\, A_i,\; \operatorname{clip}\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon \right) A_i \right) - \beta\, D_{\mathrm{KL}}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right) \right]
\]
Here, the ratio \( \pi_\theta(o_i \mid q) / \pi_{\theta_{\mathrm{old}}}(o_i \mid q) \) compares each sampled output's probability under the new policy with its probability under the old policy, and the advantage \( A_i \) is obtained by normalizing that output's reward against the rest of its group: \( A_i = \bigl(r_i - \mathrm{mean}(\{r_1, \ldots, r_G\})\bigr) / \mathrm{std}(\{r_1, \ldots, r_G\}) \). The clipping term and the KL penalty against the reference policy \( \pi_{\mathrm{ref}} \) keep policy updates stable.
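As a rough illustration (not the paper's actual implementation), here is a minimal NumPy sketch of the group-relative advantage and the clipped surrogate; the sequence-level log-probabilities, the toy reward values, and the omission of the KL penalty are simplifying assumptions.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each reward by the mean and
    standard deviation of its own sampled group (no learned critic)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_surrogate(new_logprobs, old_logprobs, advantages, eps=0.2):
    """Clipped policy-gradient surrogate averaged over the group
    (the KL penalty against the reference policy is omitted here)."""
    ratio = np.exp(np.asarray(new_logprobs) - np.asarray(old_logprobs))
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

# Toy example: one question, a group of G = 4 sampled outputs.
rewards = [1.0, 0.0, 0.0, 1.0]                 # rule-based rewards per output
advantages = grpo_advantages(rewards)
objective = grpo_surrogate(
    new_logprobs=[-1.1, -2.0, -1.8, -0.9],     # log-probs under the new policy
    old_logprobs=[-1.2, -1.9, -1.9, -1.0],     # log-probs under the old policy
    advantages=advantages,
)
print(advantages, objective)
```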
Reward Modeling
In the absence of supervised data, reward modeling becomes crucial for guiding the RL process. The team designed a rule-based reward system comprising:
- Accuracy rewards, which check whether the final answer is correct, for example by verifying a math answer provided in a required format or by running test cases against generated code.
- Format rewards, which require the model to place its reasoning between <think> and </think> tags and its final answer between <answer> and </answer> tags.
By using rule-based rewards, the model avoids issues like reward hacking that can arise with neural reward models, which might require additional retraining and complicate the training pipeline.
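A minimal sketch of what such rule-based checks could look like is shown below; the tag format follows the training template in the next section, while the exact-match answer check is a simplifying assumption standing in for the paper's task-specific verifiers.

```python
import re

THINK_ANSWER = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def format_reward(response: str) -> float:
    """Reward the model for wrapping reasoning and answer in the required tags."""
    return 1.0 if THINK_ANSWER.search(response) else 0.0

def accuracy_reward(response: str, reference: str) -> float:
    """Reward correct final answers; a normalized string match stands in for
    task-specific verifiers (boxed math answers, code test cases, etc.)."""
    match = THINK_ANSWER.search(response)
    if not match:
        return 0.0
    predicted = match.group(2).strip().lower()
    return 1.0 if predicted == reference.strip().lower() else 0.0

def total_reward(response: str, reference: str) -> float:
    """Combine the two rule-based signals into a single scalar reward."""
    return format_reward(response) + accuracy_reward(response, reference)
```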
Training Template
The training involved prompting the model with a specific template to guide its responses. The template is as follows:
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>.
User: [prompt]
Assistant:
This template encourages the model to structure its responses clearly, separating the reasoning process from the final answer, which is essential for evaluating and improving its reasoning capabilities.
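As a small illustration, a helper that wraps a question in this template might look like the following sketch; the instruction text is taken verbatim from the template above, while the function itself is just an assumed convenience wrapper.

```python
SYSTEM_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and the "
    "Assistant solves it. The assistant first thinks about the reasoning process in "
    "the mind and then provides the user with the answer. The reasoning process and "
    "answer are enclosed within <think> </think> and <answer> </answer> tags, "
    "respectively, i.e., <think> reasoning process here </think> "
    "<answer> answer here </answer>."
)

def build_prompt(question: str) -> str:
    """Assemble the training prompt: fixed instructions, the user's question,
    and an open 'Assistant:' turn for the model to complete."""
    return f"{SYSTEM_TEMPLATE}\nUser: {question}\nAssistant:"

print(build_prompt("What is 17 * 24?"))
```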
Performance, Self-evolution Process, and Aha Moment
Performance of DeepSeek-R1-Zero
Throughout the RL training process, DeepSeek-R1-Zero showed significant improvement in reasoning tasks. For instance, on the American Invitational Mathematics Examination 2024 (AIME 2024) benchmark, the model's pass@1 score increased from an initial 15.6% to an impressive 71.0%, achieving performance levels comparable to OpenAI's o1-0912 model. When using majority voting, the score further improved to 86.7%, surpassing the OpenAI model.
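For readers unfamiliar with these metrics, the sketch below shows one way to compute pass@1 (averaged over several samples per question, as the paper does) and majority-vote accuracy (often written cons@k); exact-match grading is assumed for simplicity.

```python
from collections import Counter

def pass_at_1(samples_per_question, references):
    """Estimate pass@1 by averaging the correctness of every sampled answer
    for each question, then averaging over questions."""
    per_question = [
        sum(sample == ref for sample in samples) / len(samples)
        for samples, ref in zip(samples_per_question, references)
    ]
    return sum(per_question) / len(per_question)

def majority_vote_accuracy(samples_per_question, references):
    """Accuracy when the most frequent answer among k samples is taken
    as the prediction (majority voting, e.g. cons@64)."""
    correct = 0
    for samples, ref in zip(samples_per_question, references):
        voted, _ = Counter(samples).most_common(1)[0]
        correct += int(voted == ref)
    return correct / len(references)
```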
Self-evolution Process
An intriguing aspect of the training was observing how DeepSeek-R1-Zero evolved its reasoning capabilities autonomously. The model began to allocate more "thinking time" to solve complex problems by generating longer chains of thought. This self-directed increase in reasoning depth allowed the model to handle more challenging tasks effectively.
Moreover, the model spontaneously developed sophisticated behaviors such as:
- Reflection: revisiting and re-evaluating its earlier reasoning steps.
- Exploration of alternative approaches when an initial line of attack failed.
These emergent behaviors highlight the potential of RL to induce advanced reasoning skills in LLMs without explicit programming.
Aha Moment of DeepSeek-R1-Zero
During training, the researchers witnessed an "aha moment" where the model displayed human-like insight. For example, when solving a complex math problem, the model paused and acknowledged a flaw in its reasoning:
...
Wait, wait. Wait. That's an aha moment I can flag here. Let's reevaluate this step-by-step to identify if the correct sum can be ...
...
This spontaneous expression of realizing a mistake and deciding to rework the problem demonstrates the model's advanced reasoning development through RL.
Drawbacks of DeepSeek-R1-Zero
Despite its impressive reasoning capabilities, DeepSeek-R1-Zero faced challenges:
- Poor readability: chains of thought were often hard to follow and lacked clear formatting or summaries.
- Language mixing: responses frequently mixed languages (for example, English and Chinese) within a single chain of thought.
These issues indicated the need for further refinement to make the model's outputs more user-friendly.
DeepSeek-R1: Reinforcement Learning with Cold Start
Addressing DeepSeek-R1-Zero's Limitations
To overcome the challenges observed in DeepSeek-R1-Zero, the researchers developed DeepSeek-R1. This model incorporates a small amount of high-quality, supervised data as a cold start before applying RL. The goal was to enhance the model's readability, prevent language mixing, and further improve reasoning performance.
Cold Start
The cold start phase involved fine-tuning the base model using a curated set of thousands of examples containing long chains of thought. The supervised data was designed to be reader-friendly and to establish a strong foundation for the model's reasoning patterns. Key considerations during this phase included:
- Readability: responses followed a consistent, reader-friendly format with the detailed reasoning followed by a concise summary, and outputs that were hard to read were filtered out.
- Potential: by building human priors about good output structure into the cold-start data, the researchers expected the resulting model to outperform DeepSeek-R1-Zero after RL.
Reasoning-oriented Reinforcement Learning
Following the cold start, the model underwent a reasoning-focused RL training phase similar to that of DeepSeek-R1-Zero. During this phase, an additional measure was implemented to address language mixing: a language consistency reward, computed as the proportion of target-language words in the chain of thought.
By combining accuracy rewards with language consistency rewards, the model was guided to produce correct and coherent reasoning in the desired language.
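The paper describes the language consistency reward as the proportion of target-language words in the chain of thought; the rough sketch below captures that idea, using a crude ASCII-only word check as a stand-in for real language identification and an assumed additive combination with the accuracy reward.

```python
import re

def language_consistency_reward(chain_of_thought: str) -> float:
    """Fraction of words in the chain of thought that look like the target
    language (English here, via a crude ASCII heuristic)."""
    words = re.findall(r"\S+", chain_of_thought)
    if not words:
        return 0.0
    target_like = sum(1 for word in words if word.isascii())
    return target_like / len(words)

def combined_reward(accuracy: float, chain_of_thought: str, weight: float = 1.0) -> float:
    """Add the (weighted) language consistency reward to the accuracy reward;
    the weight is an assumption, not a value reported in the paper."""
    return accuracy + weight * language_consistency_reward(chain_of_thought)
```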
Rejection Sampling and Supervised Fine-Tuning
Upon convergence of the reasoning-oriented RL phase, the researchers collected new supervised data to further refine the model:
- Reasoning data: roughly 600,000 samples were gathered by rejection sampling from the RL checkpoint, keeping only responses with correct answers and filtering out unreadable ones (for example, those with mixed languages or rambling, overly long passages).
- Non-reasoning data: roughly 200,000 samples covering writing, factual question answering, self-cognition, and translation were reused or generated from the DeepSeek-V3 pipeline.
The combined dataset included approximately 800,000 samples and was used to fine-tune the model over two epochs. This stage aimed to improve both reasoning and non-reasoning abilities while addressing issues like readability and format consistency.
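A condensed sketch of the rejection-sampling step: sample several candidates per prompt from the RL checkpoint and keep only those that pass correctness and readability filters. The `generate` and `is_correct` callables below are hypothetical stand-ins for the actual model call and task-specific verifier.

```python
def rejection_sample(prompts, generate, is_correct, n_samples=16):
    """Build a supervised fine-tuning dataset from verified, readable completions.

    generate(prompt, n)   -> list of n candidate responses (hypothetical model call)
    is_correct(prompt, r) -> bool, task-specific answer verifier (hypothetical)
    """
    dataset = []
    for prompt in prompts:
        for response in generate(prompt, n_samples):
            if not is_correct(prompt, response):
                continue                      # reject incorrect answers
            if "<think>" not in response:     # simple readability/format filter
                continue
            dataset.append({"prompt": prompt, "response": response})
    return dataset
```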
Reinforcement Learning for All Scenarios
In the final training stage, the model underwent another RL phase to align it with human preferences across various scenarios. This involved:
- Continuing to use rule-based rewards for reasoning data such as math, code, and logic problems.
- Using reward models to capture helpfulness (judged primarily on the final summary) and harmlessness (judged on the entire response, including the reasoning) for general data.
- Training on a diverse distribution of prompts spanning both reasoning and general-purpose tasks.
By integrating these elements, DeepSeek-R1 achieved a balanced performance, excelling in reasoning tasks while maintaining general usefulness and safety.
Distillation: Empowering Small Models with Reasoning Capability
Recognizing the importance of making advanced reasoning capabilities accessible in smaller, more efficient models, the team explored distillation techniques. They fine-tuned smaller models based on Qwen and Llama architectures using the supervised data generated by DeepSeek-R1. This approach allowed them to transfer the reasoning patterns from the larger model to smaller ones, resulting in efficient models with strong reasoning abilities.
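The sketch below shows the shape of such distillation by supervised fine-tuning using Hugging Face transformers; the model name, the tiny `teacher_data` list, and the hyperparameters are placeholders, and a real pipeline would add batching, prompt masking, and multiple epochs.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Math-7B"   # placeholder for the student base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

teacher_data = [  # placeholder; in practice ~800K samples generated by DeepSeek-R1
    {"prompt": "User: What is 2 + 2?\nAssistant:",
     "response": " <think>2 + 2 = 4</think> <answer>4</answer>"},
]

for example in teacher_data:
    text = example["prompt"] + example["response"] + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
    # Standard causal-LM objective: the student learns to reproduce the
    # teacher's chain of thought and final answer token by token.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```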
Key findings include:
- The distilled models inherit strong reasoning behavior: even the smaller variants outperform much larger general-purpose models on reasoning benchmarks such as AIME 2024 and MATH-500.
- The larger distilled variants, based on Qwen-32B and Llama-70B, surpass OpenAI's o1-mini on most of the reasoning benchmarks evaluated.
- Only supervised fine-tuning on DeepSeek-R1's outputs was applied; the authors note that adding an RL stage could improve the distilled models further.
Experiment
DeepSeek-R1 Evaluation
The team conducted comprehensive evaluations of DeepSeek-R1 across various benchmarks. Results showed that:
- On math benchmarks, DeepSeek-R1 achieved 79.8% pass@1 on AIME 2024 and 97.3% on MATH-500, on par with OpenAI's o1-1217.
- On coding tasks, it reached expert-level performance, with a Codeforces rating that exceeds the vast majority of human participants.
- On knowledge benchmarks such as MMLU and GPQA Diamond, it significantly outperformed DeepSeek-V3 while trailing o1-1217 slightly.
- On general tasks such as creative writing, open-domain question answering, and long-context understanding, it also delivered strong results.
Distilled Model Evaluation
Evaluation of the distilled models revealed that:
- The smaller distilled models, such as DeepSeek-R1-Distill-Qwen-7B, already outperform much larger non-reasoning models like GPT-4o on math-focused benchmarks.
- The larger distilled models, DeepSeek-R1-Distill-Qwen-32B and the Llama-70B variant, substantially exceed o1-mini on most reasoning benchmarks, setting strong results among openly released dense models.
Discussion
Distillation vs. Reinforcement Learning
The researchers explored whether smaller models could achieve similar reasoning performance through RL without distillation. They trained a 32B parameter base model using RL, similar to DeepSeek-R1-Zero. However, the distilled model outperformed the RL-trained smaller model across all benchmarks.
This indicates that while RL is effective for larger models, distillation from a powerful reasoning model like DeepSeek-R1 is a more efficient method for enhancing reasoning in smaller models.
Unsuccessful Attempts
The paper also discusses strategies that were less successful, providing valuable insights:
Process Reward Model (PRM)
PRM involves providing rewards based on intermediate reasoning steps. Challenges with PRM included:
- It is difficult to explicitly define fine-grained reasoning steps for general problems.
- Judging whether an intermediate step is correct is hard: automated annotation proved unsatisfactory, and manual annotation does not scale.
- A learned reward model invites reward hacking and must itself be retrained, complicating the training pipeline.
While PRM can be useful for reranking responses or guiding search algorithms, it was not effective in large-scale RL training for this research.
Monte Carlo Tree Search (MCTS)
Inspired by successes in games like Go, the team experimented with MCTS to enhance test-time compute scalability. However, they faced challenges:
- The search space for token generation is exponentially larger than in board games, and capping the number of expansions per node pushed the model toward local optima.
- The value model guiding the search is difficult to train well, which made it hard to improve the policy iteratively through search.
While MCTS can improve inference when paired with a pre-trained value model, it was not suitable for self-improvement through search in this context.
Conclusion
The research demonstrates that reinforcement learning can significantly enhance reasoning capabilities in LLMs, even without supervised fine-tuning. DeepSeek-R1-Zero exhibited advanced reasoning behaviors purely through RL, although it faced challenges in readability and language consistency. By incorporating a small amount of cold-start data and multi-stage training, DeepSeek-R1 achieved both high performance and user-friendly outputs, performing on par with leading models like OpenAI's o1-1217.
The successful distillation of reasoning capabilities into smaller models underscores the potential for making advanced reasoning accessible in more efficient architectures. Open-sourcing these models provides valuable resources for the research community to further explore and develop reasoning in LLMs.
The discussion highlights that while RL is powerful, distillation from a strong reasoning model is more effective for smaller models. Additionally, the insights from unsuccessful attempts provide guidance on the limitations of certain methods in this domain.
Limitations and Future Work
While the results are promising, the research acknowledges several limitations and areas for future exploration:
- General capability: DeepSeek-R1 still trails DeepSeek-V3 on tasks such as function calling, multi-turn dialogue, complex role-playing, and structured (e.g., JSON) output.
- Language mixing: the model is optimized for English and Chinese and may mix languages when handling queries in other languages.
- Prompt sensitivity: few-shot prompting degrades performance, so zero-shot prompts with a clear problem statement and output format are recommended.
- Software engineering: limited RL was applied to engineering tasks because of long evaluation times, leaving room for improvement in future versions.
The research opens up new avenues for enhancing reasoning in LLMs and highlights the potential of reinforcement learning and distillation in advancing AI capabilities.
Acknowledgments
The researchers express gratitude to all contributors and collaborators involved in this work. The open-sourcing of the models and sharing of insights aim to benefit the wider AI research community.
DeepSeek is NOT What You Think It Is
In the AI world, misinformation runs rampant. Remember 2023? Every week, it seemed like a new "better" model than GPT-4 was being touted. But here’s the truth—most of those claims were far from reality. Yes, certain models might have shown promise in specific scenarios or research papers, but not a single one came close to matching GPT-4's overall performance.
Unfortunately, many influencers and so-called experts doubled down on these unverified claims, either believing the hype, falling for cherry-picked data, or, in some cases, being paid to promote these models as superior.
At Cazton, we take a different approach. We don’t just follow trends—we critically evaluate the technology. Our clients trust us because we go beyond the surface, understanding the strengths and weaknesses of each AI model. This deep expertise is why enterprise clients love working with us. They know we do the hard work of figuring out exactly which technology to use, when to use it, and how to implement it for the best results.
An AI model is just a model. It’s part of a comprehensive AI system. And that system is built on more than just a single model. A model is only about 5% of the solution. The real magic comes from integrating the right tools, infrastructure, and processes to create an AI solution that actually delivers.
Ready to work with a team that knows the difference? Let’s talk about how we can leverage DeepSeek and other cutting-edge technologies to power your business.
Reference: DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning," arXiv:2501.12948. https://arxiv.org/abs/2501.12948