DeepSeek-R1: Reasoning Capability with Reinforcement Learning


I’ve been reading DeepSeek’s R1 paper, and what strikes me most isn’t just the technical bits but the way the team has laid bare the thought process, challenges and solutions behind their work.

In an era where AI advancements are often shrouded in secrecy, these researchers have chosen a transparent approach that offers us a glimpse into the intricate process of creating such models: how they emerged, evolved and are already challenging the status quo set by the likes of OpenAI.

Loved the humility and transparency they have exhibited.

Here are my takeaways.


What sets it apart?

DeepSeek-R1 represents a shift in reasoning models by tweaking the traditional training process.

The traditional approach relies heavily on supervised fine-tuning (SFT). In contrast, both DeepSeek-R1-Zero and DeepSeek-R1 leverage reinforcement learning (RL) as the primary mechanism for developing their reasoning capabilities.


Here is a brief look at how they evolved their reasoning models.

DeepSeek-R1-Zero

The concept behind the RL-only approach:

DeepSeek-R1-Zero skipped the SFT step entirely, reducing the reliance on costly SFT datasets. The model developed its reasoning capabilities through self-evolution.

If you think about it, this approach mirrors how learning often happens in the real world, and it explores the potential for more autonomous, resource-efficient model training.

The training breakdown:

1. The base model (DeepSeek-V3-Base) was picked as the foundation for training.

2. SFT was skipped entirely; instead, the focus was purely on RL to develop reasoning capabilities from scratch.

3. Group Relative Policy Optimization (GRPO) was used to reduce training cost by avoiding the need for a separate critic model; instead, it uses group scores to estimate the baseline for optimization (see the sketch after this list; for details on GRPO, refer to https://arxiv.org/pdf/2402.03300).

4. Two types of reward were used:

Accuracy Rewards: Evaluated whether the model’s answers are correct using rule-based criteria (e.g. for math and coding tasks).

Format Rewards: Encouraged structured responses to ensure the model’s thinking process is clearly outlined (e.g. enclosing thoughts within specific tags like <think> and </think>).

5. A structured training template was used to keep responses consistent. This was a two-step approach: first the model generates its reasoning process, then it synthesizes the final answer based on that reasoning.

6. The magic moment: self-evolution through RL. The model’s reasoning improved as RL training progressed, leading to longer and better chains of thought (CoT). This showed up as the model spending more "thinking time" on complex problems, followed by reflection, where it learned to revisit its own solutions autonomously!

7. Emergent behaviors were observed. Over the course of training, the model developed reasoning skills like self-verification and exploring alternatives.

8. Finally, it hit notable evaluation milestones: a jump in AIME 2024 benchmark performance from 15.6% to 71.0%, comparable to OpenAI-o1-0912.
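To make steps 3-5 a bit more concrete, here is a minimal, illustrative sketch (in Python) of group-relative advantages plus the two rule-based rewards. The function names, the tag format and the equal reward weighting are my own assumptions for illustration; the actual objective is defined in the DeepSeek-R1 and GRPO papers.

```python
import re
import numpy as np

# Hypothetical rule-based rewards, loosely following the paper's description.

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """1.0 if the final answer inside <answer>...</answer> matches the reference, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == reference_answer.strip():
        return 1.0
    return 0.0

def format_reward(completion: str) -> float:
    """Reward the <think>...</think><answer>...</answer> structure the training template asks for."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion.strip(), re.DOTALL) else 0.0

def group_relative_advantages(rewards: list[float]) -> np.ndarray:
    """GRPO's core idea: normalize each sampled output's reward against its own group,
    so no separate critic (value) model is needed to estimate a baseline."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Usage: sample a group of completions for one prompt, score them, compute advantages.
completions = [
    "<think>2 + 2 equals 4.</think><answer>4</answer>",
    "<think>Guessing.</think><answer>5</answer>",
    "The answer is 4.",  # no tags, so it earns neither reward in this toy setup
]
rewards = [accuracy_reward(c, "4") + format_reward(c) for c in completions]
print(group_relative_advantages(rewards))  # highest for the correct, well-formatted output
```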

Figure: DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time.


However, DeepSeek-R1-Zero suffered from major drawbacks, such as poor readability and language mixing.

To address these, the enhanced DeepSeek-R1 incorporated multi-stage training with cold-start data and ultimately claims performance comparable to OpenAI-o1-1217 across reasoning benchmarks.

DeepSeek-R1 - the next step: how it evolved


DeepSeek-R1 incorporates a combination of cold-start data, RL and a layered training approach to address the above issues.

The training breakdown:

1. Cold-start preparation: a checkpoint was created by fine-tuning the DeepSeek-V3-Base model on cold-start data, a set of carefully curated CoT examples that included:

  • Few-shot prompts with examples of detailed reasoning,
  • Detailed answers with reflection and verification,
  • Refined outputs from DeepSeek-R1-Zero, cleaned up through human post-processing.

2. Reasoning-oriented reinforcement learning: RL similar to the DeepSeek-R1-Zero stage was performed on top of the cold-start checkpoint.

3. Rejection sampling and supervised fine-tuning (SFT): as the RL process neared convergence, new SFT data was synthesized via rejection sampling on the RL checkpoint (see the sketch after this list). Only accurate and readable outputs were retained using:

  • Rejection sampling.
  • Rule-based & generative reward models.

4. Dataset generation for SFT: the above dataset was enriched with additional supervised data generated from DeepSeek-V3, adding a large number of samples for tasks like writing, factual QA, self-cognition and translation.

5. SFT with combined data: the model was then further trained on the combined dataset created in the previous step, which improved both its reasoning and its general capabilities.

6. Final RL layer: a final, additional RL stage used prompts from diverse scenarios to align the model with a broad range of human preferences.

Note: before you get confused, remember that this was a secondary RL stage, aimed at further improving the model’s HHH (helpfulness, harmlessness and honesty).

A wide variety of training prompts and reward signals were used in this step.
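The rejection-sampling step (3) is easy to picture in code. Below is only a rough sketch of the general recipe, not DeepSeek’s actual pipeline: generate, is_correct and is_readable are hypothetical stand-ins for the RL checkpoint’s sampler, the rule-based accuracy check and the readability/quality filter (which in the paper also involves generative reward judgments and language-consistency checks).

```python
from typing import Callable

def rejection_sample_sft_data(
    prompts: list[str],
    generate: Callable[[str, int], list[str]],   # RL checkpoint: (prompt, k) -> k sampled completions
    is_correct: Callable[[str, str], bool],      # rule-based accuracy check: (prompt, completion) -> bool
    is_readable: Callable[[str], bool],          # readability / language-consistency filter
    samples_per_prompt: int = 16,
    keep_per_prompt: int = 1,
) -> list[dict]:
    """Build an SFT dataset by sampling many completions per prompt from the RL checkpoint
    and keeping only those that pass both the accuracy and the readability filters."""
    sft_rows = []
    for prompt in prompts:
        candidates = generate(prompt, samples_per_prompt)
        accepted = [c for c in candidates if is_correct(prompt, c) and is_readable(c)]
        for completion in accepted[:keep_per_prompt]:
            sft_rows.append({"prompt": prompt, "completion": completion})
    return sft_rows
```

The rows produced this way, together with the general-purpose data generated from DeepSeek-V3 in step 4, form the combined dataset used for the SFT pass in step 5.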


What they ended up with at each step:

  • Improved readability: the cold-start data ensured responses became clearer, better structured and more user-friendly.
  • Enhanced performance: RL built on this foundation enabled exceptional performance on reasoning tasks.
  • Broader capabilities: SFT added versatility, allowing the model to excel at general-purpose tasks in addition to reasoning tasks.
  • User-centric design: the final RL layer’s focus on HHH ensured the model’s outputs were not only accurate but also safe and easy to understand.

Figure: Benchmark performance of DeepSeek-R1.

They didn’t stop there. Read on...

The power of Distillation from R1 to smaller models:

The large, computationally intensive DeepSeek-R1 was then distilled into smaller, dense models, effectively transferring its advanced reasoning skills into much lighter models. (Clever!)
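Mechanically, this distillation is not the classic logit-matching kind; as described in the paper, it is plain supervised fine-tuning of a smaller model on reasoning traces curated from R1. Here is a minimal sketch of a single training step, assuming a Hugging Face-style causal LM where model(input_ids, labels=...) returns the cross-entropy loss; the function and argument names are my own.

```python
def distillation_sft_step(student_model, tokenizer, prompt: str, teacher_trace: str, optimizer) -> float:
    """One SFT step of the 'distillation as supervised fine-tuning' recipe:
    the student is trained with an ordinary next-token loss on a reasoning trace
    produced by the larger teacher (here, DeepSeek-R1)."""
    # Concatenate the prompt and the teacher-generated reasoning/answer into one sequence.
    text = prompt + teacher_trace
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    # Standard causal-LM objective: predict each token from the ones before it.
    # (A common refinement, not shown here, is masking the prompt tokens out of the loss.)
    loss = student_model(input_ids, labels=input_ids).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Notably, the paper reports applying only SFT to the distilled students, leaving further RL on top of them to future work.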


What happened then was amazing:

With Qwen2.5-32B as the base model, distillation from DeepSeek-R1 outperformed the model created by applying RL directly to the base model itself.

This was strong evidence that the reasoning patterns discovered by larger models are critical for improving smaller models’ performance.

DeepSeek has open-sourced the distilled models (based on Qwen and Llama) for the AI community. The distilled 14B model surpasses QwQ-32B-Preview by a huge margin, demonstrating how valuable model distillation can be.


Conclusion

DeepSeek-R1 is a pioneering achievement in LLM development, proving the untapped potential of RL for reasoning tasks.

Its combination of RL, the cold-start trick and an efficient distillation methodology provides a scalable and accessible (cheaper) alternative to offerings from big players like OpenAI and Google (Gemini).

By challenging the status quo and emphasizing openness, DeepSeek-R1 is not only setting new benchmarks in reasoning but also paving the way for more collaborative and equitable advancements in AI.

This marks a paradigm shift in how we approach and democratize advanced AI capabilities.

