DeepSeek-R1: Reasoning Capability with Reinforcement Learning


I’ve been reading DeepSeek’s R1 paper, and what strikes me most isn’t just the technical bits but the way the team has laid bare the thought process, challenges and solutions behind their work.

In an era where AI advancements are often shrouded in secrecy, these researchers have chosen a transparent approach that offers us a glimpse into the intricate process of creating such models: how they emerged, evolved and are already challenging the status quo set by the likes of OpenAI.

Loved the humility and transparency they have exhibited.

Here are my takeaways.


What sets it apart?

DeepSeek-R1 represents a shift in reasoning models by tweaking the traditional training process.

The traditional approach relies heavily on supervised fine-tuning (SFT). In contrast, both DeepSeek-R1-Zero and DeepSeek-R1 leverage reinforcement learning (RL) as the primary mechanism for developing their reasoning capabilities.


Here is a brief look at how they evolved their reasoning models.

DeepSeek-R1-Zero

The concept behind the RL-only approach:

DeepSeek-R1-Zero skipped the SFT step entirely, reducing the reliance on costly SFT datasets. The model developed its reasoning capabilities through self-evolution.

If you think about it, this approach mirrors how learning often happens in the real world, and it explores the potential for more autonomous, resource-efficient model training.

The training breakdown:

1. The base model (DeepSeek-V3-Base) was picked as the foundation for training.

2. SFT was skipped entirely; instead, the focus was purely on RL to develop reasoning capabilities from scratch.

3. Group Relative Policy Optimization (GRPO) was used to reduce training cost by avoiding the need for a separate critic model; instead, it uses group scores to estimate the baseline for optimization (see the sketch after this list; for details on GRPO, refer to https://arxiv.org/pdf/2402.03300).

4. Two types of reward were used:

Accuracy Rewards: Evaluated whether the model’s answers are correct using rule-based criteria (e.g. for math and coding tasks).

Format Rewards: Encouraged structured responses to ensure the model’s thinking process is clearly outlined (e.g. enclosing thoughts within specific tags like <think> and </think>).

5. A structured training template was used to keep responses consistent. This was a two-step approach: first the model generates its reasoning process, then it synthesizes the final answer based on that reasoning.

6. The magic moment: self-evolution through RL. The model’s reasoning improved as RL training progressed, leading to longer and better chains of thought (CoT). This showed up as the model spending more "thinking time" on complex problems, followed by reflection, where it learned to revisit its own solutions autonomously!

7. Emergent behaviors were observed. Over the course of training, the model developed reasoning skills like self-verification and exploring alternatives.

8. Finally, it hit notable evaluation milestones: a jump in AIME 2024 benchmark performance from 15.6% to 71.0%, comparable to OpenAI-o1-0912.
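To make steps 3-5 a bit more concrete, here is a minimal, illustrative sketch (in Python) of group-relative advantages plus the two rule-based rewards. The function names, the tag format and the equal reward weighting are my own assumptions for illustration; the actual objective is defined in the DeepSeek-R1 and GRPO papers.

```python
import re
import numpy as np

# Hypothetical rule-based rewards, loosely following the paper's description.

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """1.0 if the final answer inside <answer>...</answer> matches the reference, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == reference_answer.strip():
        return 1.0
    return 0.0

def format_reward(completion: str) -> float:
    """Reward the <think>...</think><answer>...</answer> structure the training template asks for."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion.strip(), re.DOTALL) else 0.0

def group_relative_advantages(rewards: list[float]) -> np.ndarray:
    """GRPO's core idea: normalize each sampled output's reward against its own group,
    so no separate critic (value) model is needed to estimate a baseline."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Usage: sample a group of completions for one prompt, score them, compute advantages.
completions = [
    "<think>2 + 2 equals 4.</think><answer>4</answer>",
    "<think>Guessing.</think><answer>5</answer>",
    "The answer is 4.",  # no tags, so it earns neither reward in this toy setup
]
rewards = [accuracy_reward(c, "4") + format_reward(c) for c in completions]
print(group_relative_advantages(rewards))  # highest for the correct, well-formatted output
```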

Figure: DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time.


However, DeepSeek-R1-Zero suffered from major drawbacks, such as poor readability and language mixing.

To address these, the enhanced DeepSeek-R1 incorporated multi-stage training with cold-start data and ultimately claims performance comparable to OpenAI-o1-1217 across reasoning benchmarks.

DeepSeek-R1 - the next step: how it evolved


DeepSeek-R1 incorporates a combination of cold-start data, RL and a layered training approach to address the above issues.

The training breakdown:

1. Cold-start preparation: a checkpoint was created by fine-tuning the DeepSeek-V3-Base model on cold-start data, a set of carefully curated CoT examples that included:

  • Few-shot prompts with examples of detailed reasoning,
  • Detailed answers with reflection and verification,
  • Refined outputs from DeepSeek-R1-Zero, cleaned up through human post-processing.

2. Reasoning-oriented reinforcement learning: RL similar to the DeepSeek-R1-Zero stage was performed on top of the cold-start checkpoint.

3. Rejection sampling and supervised fine-tuning (SFT): as the RL process neared convergence, new SFT data was synthesized via rejection sampling on the RL checkpoint (see the sketch after this list). Only accurate and readable outputs were retained using:

  • Rejection sampling.
  • Rule-based & generative reward models.

4. Dataset generation for SFT: the above dataset was enriched with additional supervised data generated from DeepSeek-V3, adding a large number of samples for tasks like writing, factual QA, self-cognition and translation.

5. SFT with combined data: the model was then further trained on the combined dataset created in the previous step, which improved both its reasoning and its general capabilities.

6. Final RL layer: a final, additional RL stage used prompts from diverse scenarios to align the model with a broad range of human preferences.

Note: before you get confused, remember that this was a secondary RL stage, aimed at further improving the model’s HHH (helpfulness, harmlessness and honesty).

A wide variety of training prompts and reward signals were used in this step.
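The rejection-sampling step (3) is easy to picture in code. Below is only a rough sketch of the general recipe, not DeepSeek’s actual pipeline: generate, is_correct and is_readable are hypothetical stand-ins for the RL checkpoint’s sampler, the rule-based accuracy check and the readability/quality filter (which in the paper also involves generative reward judgments and language-consistency checks).

```python
from typing import Callable

def rejection_sample_sft_data(
    prompts: list[str],
    generate: Callable[[str, int], list[str]],   # RL checkpoint: (prompt, k) -> k sampled completions
    is_correct: Callable[[str, str], bool],      # rule-based accuracy check: (prompt, completion) -> bool
    is_readable: Callable[[str], bool],          # readability / language-consistency filter
    samples_per_prompt: int = 16,
    keep_per_prompt: int = 1,
) -> list[dict]:
    """Build an SFT dataset by sampling many completions per prompt from the RL checkpoint
    and keeping only those that pass both the accuracy and the readability filters."""
    sft_rows = []
    for prompt in prompts:
        candidates = generate(prompt, samples_per_prompt)
        accepted = [c for c in candidates if is_correct(prompt, c) and is_readable(c)]
        for completion in accepted[:keep_per_prompt]:
            sft_rows.append({"prompt": prompt, "completion": completion})
    return sft_rows
```

The rows produced this way, together with the general-purpose data generated from DeepSeek-V3 in step 4, form the combined dataset used for the SFT pass in step 5.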


What they ended up with at each step:

  • Improved readability: the cold-start data ensured responses became clearer, better structured and more user-friendly.
  • Enhanced performance: RL built on this foundation enabled exceptional performance on reasoning tasks.
  • Broader capabilities: SFT added versatility, allowing the model to excel at general-purpose tasks in addition to reasoning tasks.
  • User-centric design: the final RL layer’s focus on HHH ensured the model’s outputs were not only accurate but also safe and easy to understand.

Figure: Benchmark performance of DeepSeek-R1.

They didn’t stop there. Read on...

The power of Distillation from R1 to smaller models:

The large, computationally intensive DeepSeek-R1 was then distilled into smaller, dense models, effectively transferring its advanced reasoning skills into much lighter models. (Clever!)
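Mechanically, this distillation is not the classic logit-matching kind; as described in the paper, it is plain supervised fine-tuning of a smaller model on reasoning traces curated from R1. Here is a minimal sketch of a single training step, assuming a Hugging Face-style causal LM where model(input_ids, labels=...) returns the cross-entropy loss; the function and argument names are my own.

```python
def distillation_sft_step(student_model, tokenizer, prompt: str, teacher_trace: str, optimizer) -> float:
    """One SFT step of the 'distillation as supervised fine-tuning' recipe:
    the student is trained with an ordinary next-token loss on a reasoning trace
    produced by the larger teacher (here, DeepSeek-R1)."""
    # Concatenate the prompt and the teacher-generated reasoning/answer into one sequence.
    text = prompt + teacher_trace
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    # Standard causal-LM objective: predict each token from the ones before it.
    # (A common refinement, not shown here, is masking the prompt tokens out of the loss.)
    loss = student_model(input_ids, labels=input_ids).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Notably, the paper reports applying only SFT to the distilled students, leaving further RL on top of them to future work.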


What happened then was amazing:

With Qwen2.5-32B as the base model, distillation from DeepSeek-R1 outperformed the model created by applying RL directly to the base model itself.

This was strong evidence that the reasoning patterns discovered by larger models are critical for improving smaller models’ performance.

DeepSeek has open-sourced the distilled models (based on Qwen and Llama) for the AI community. The distilled 14B model surpasses QwQ-32B-Preview by a huge margin, demonstrating how valuable model distillation can be.


Conclusion

DeepSeek-R1 is a pioneering achievement in LLM development, proving the untapped potential of RL for reasoning tasks.

Its combination of RL, the cold-start trick and an efficient distillation methodology provides a scalable and accessible (cheaper) alternative to offerings from big players like OpenAI and Google (Gemini).

By challenging the status quo and emphasizing openness, DeepSeek-R1 is not only setting new benchmarks in reasoning but also paving the way for more collaborative and equitable advancements in AI.

This marks a paradigm shift in how we approach and democratize advanced AI capabilities.

