Group Relative Policy Optimization (GRPO) in Reinforcement Learning from Human Feedback (RLHF): Insights from DeepSeek

1. Introduction to the Buzz About DeepSeek

DeepSeek-R1-Zero has been making waves in the AI research community with its novel approach to reinforcement learning (RL). It stands out due to its ability to self-evolve without explicit supervision, achieving remarkable results on benchmarks such as AIME 2024. The introduction of Group Relative Policy Optimization (GRPO) has played a crucial role in this success, offering an alternative to existing reinforcement learning techniques like Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO).

2. Novel Approach in RLHF with GRPO

DeepSeek-R1 leverages GRPO in its reinforcement learning from human feedback (RLHF) process to optimize its policy without requiring an extensive supervised fine-tuning phase. Unlike traditional RLHF methods that rely on extensive human-annotated datasets, DeepSeek-R1 utilizes self-evolution and rule-based reward models to enhance its reasoning capabilities autonomously. The results demonstrate that RL, when properly structured, can lead to impressive model performance improvements without direct human intervention at every step.

3. What Are PPO and DPO in Technical Details?

  • Proximal Policy Optimization (PPO): PPO is a reinforcement learning algorithm that updates policies in a stable and efficient manner. It uses a clipped surrogate objective to ensure that the updated policy does not deviate too much from the previous policy, thus maintaining training stability. The objective function involves maximizing the expected advantage while ensuring that the probability ratio between the old and new policies remains within a defined threshold.

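To make the clipped surrogate objective concrete, here is a minimal PyTorch-style sketch. The tensor names (logprobs_new, logprobs_old, advantages) and the clipping range eps are illustrative assumptions, not DeepSeek's or any particular library's implementation.

```python
import torch

def ppo_clipped_loss(logprobs_new, logprobs_old, advantages, eps=0.2):
    """PPO clipped surrogate objective, negated so it can be minimized as a loss.

    logprobs_new / logprobs_old: log-probabilities of the sampled actions under the
    current and old policies; advantages: estimated advantages for those actions.
    """
    # Probability ratio computed in log space for numerical stability.
    ratio = torch.exp(logprobs_new - logprobs_old)
    # Unclipped and clipped surrogate terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Take the elementwise minimum, so overly large policy shifts earn no extra credit.
    return -torch.min(unclipped, clipped).mean()
```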

  • Direct Preference Optimization (DPO): DPO simplifies RLHF by directly optimizing preferences rather than relying on reward models. Instead of estimating rewards through a learned function, DPO reformulates the optimization process as a classification problem between preferred and non-preferred responses. This method can be more sample-efficient but may lack the flexibility of reward-driven approaches like PPO.

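For comparison, a minimal sketch of the DPO loss on a single preference pair. The variable names and the temperature beta are illustrative assumptions; the only fixed idea is that preference learning is framed as classification between the preferred and non-preferred response, with the reference policy acting as an implicit reward.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss: push the policy to prefer the chosen response over the rejected one,
    measured relative to a frozen reference policy."""
    # Log-ratio of policy vs. reference for each response in the pair.
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    # Maximize the margin between the preferred and non-preferred responses.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```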

4. Explaining GRPO in Technical Details

GRPO is a refinement of PPO that integrates relative comparisons across multiple generated outputs to optimize policy updates. It seeks to balance exploration and exploitation more effectively by considering the relative rankings of different model responses rather than absolute preference scores. This leads to a more structured reinforcement signal, improving sample efficiency and stability during training.

5. The GRPO Formula and Key Components

Formula

The GRPO objective function, in its simplified sequence-level form, is:

$$
J_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\; \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(O \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \left( \min\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)} A_i,\; \operatorname{clip}\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\, 1-\epsilon,\, 1+\epsilon \right) A_i \right) - \beta\, D_{\text{KL}}\!\left( \pi_\theta \,\|\, \pi_{\text{ref}} \right) \right) \right]
$$

Where:

  • π_θ is the policy being optimized, and π_θ_old is the policy that sampled the group of G outputs {o_1, …, o_G} for the question q.
  • A_i is the advantage of output o_i, computed relative to the other outputs in the same group.
  • clip(·, 1−ε, 1+ε) bounds the probability ratio, as in PPO, to keep policy updates stable.
  • ε and β are hyperparameters controlling the clipping range and the strength of the KL divergence penalty against the reference policy π_ref.
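
A minimal PyTorch-style sketch of this objective follows. It assumes the group-relative advantages and the per-output KL penalty are passed in precomputed; the function and argument names, and the default values of eps and beta, are illustrative rather than DeepSeek's actual code.

```python
import torch

def grpo_loss(logprobs_new, logprobs_old, group_advantages, kl_penalty,
              eps=0.2, beta=0.04):
    """Sequence-level GRPO objective, negated so it can be minimized.

    logprobs_new / logprobs_old: summed log-probabilities of each of the G sampled
    outputs under the current and sampling policies, shape (G,).
    group_advantages: advantage of each output relative to its group, shape (G,).
    kl_penalty: estimated KL divergence from the reference policy, shape (G,).
    """
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * group_advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * group_advantages
    # Clipped surrogate minus the KL regularization term, averaged over the group.
    objective = torch.min(unclipped, clipped) - beta * kl_penalty
    return -objective.mean()
```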

Rewards and Penalties

GRPO assigns rewards based on both absolute correctness and relative ranking within a group of outputs sampled for the same prompt. This ensures that the model is optimized on comparative signals across candidate answers rather than on isolated samples.

Logarithmic Tuning & Clipping

Working in log-probability space keeps updates numerically stable: the importance ratio between the new and old policies is computed as the exponential of a log-probability difference, preventing extreme policy shifts. Clipping mechanisms similar to PPO then constrain this ratio to a safe range, as in the loss sketch above.

Advantage Parameter

The advantage function estimates how much better one sampled output is than the alternatives. GRPO uses a simplified, group-relative advantage computation, removing the dependency on the separately trained value (critic) network that PPO requires.
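
A minimal sketch of the group-relative advantage computation commonly described for GRPO: each output's reward is standardized against the other rewards in its group. The small eps term is an illustrative guard against division by zero.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Standardize the rewards of G outputs sampled for the same prompt.

    rewards: shape (G,), one scalar reward per sampled output.
    Returns advantages of shape (G,): positive for above-average outputs,
    negative for below-average ones, with no learned value network involved.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```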

KL Divergence Regularization

A KL penalty term keeps the new policy from deviating too far from the reference policy. This prevents over-optimization that may lead to reward hacking, where the model exploits the reward function instead of genuinely improving.
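
A sketch of one common low-variance KL estimator (sometimes called the k3 estimator) that GRPO-style training often uses for this penalty; treating it as the estimator here is an assumption, not a claim about DeepSeek's exact code.

```python
import torch

def kl_penalty_estimate(logprobs_policy, logprobs_ref):
    """Unbiased, non-negative estimate of KL(pi_theta || pi_ref).

    Both inputs are log-probabilities of the same sampled tokens (or sequences)
    under the current policy and the frozen reference policy.
    """
    log_ratio = logprobs_ref - logprobs_policy
    # exp(log r) - log r - 1 >= 0, with expectation equal to the KL divergence.
    return torch.exp(log_ratio) - log_ratio - 1.0
```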

6. Reward Modeling and Rule-Based Reward System

Traditional RLHF uses learned reward models, but DeepSeek-R1 employs a rule-based reward system. This system assigns rewards based on predefined evaluation criteria, ensuring stability and interpretability. For example:

  • Mathematical problem solving: A solution is rewarded if it matches the correct answer.
  • Code generation: The output is rewarded if it compiles and executes successfully.
  • Reasoning and structure: The model receives additional rewards for following structured reasoning formats (e.g., presenting logical steps before the final answer).

This rule-based approach reduces the biases that can arise from learned, human-annotated preference models and provides more objective, verifiable evaluation criteria.
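
A minimal sketch of how such a rule-based reward could be composed from an accuracy check and a format check. The specific tags, weights, and regular expressions are illustrative assumptions, not DeepSeek's published reward code.

```python
import re

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Combine an accuracy reward (correct final answer) with a format reward
    (structured reasoning before the answer), as described above."""
    reward = 0.0
    # Accuracy reward: the final answer must match the reference exactly.
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0
    # Format reward: reasoning steps are enclosed in the expected tags.
    if re.search(r"<think>.*?</think>", response, re.DOTALL):
        reward += 0.1
    return reward
```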

7. Training Template

DeepSeek-R1 follows a structured training template where the base model is guided to output reasoning steps before producing the final answer. This structure encourages systematic problem-solving without enforcing specific heuristics or biases. The training process includes:

  1. Generating multiple responses for each input.
  2. Comparing and ranking responses using a rule-based reward model.
  3. Applying GRPO updates to optimize policy decisions iteratively.
  4. Using KL divergence constraints to ensure gradual learning without drastic behavioral shifts.
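
A minimal end-to-end sketch of steps 1–4, reusing the helper functions sketched earlier (rule_based_reward, group_relative_advantages, kl_penalty_estimate, grpo_loss). The policy object and its generate/logprob methods are hypothetical placeholders, not a real API.

```python
import torch

def grpo_training_step(policy, ref_policy, old_policy, prompts, references,
                       group_size=8):
    """One GRPO update: sample a group per prompt, score with rules, optimize."""
    total_loss = 0.0
    for prompt, reference in zip(prompts, references):
        # 1. Generate multiple responses for each input.
        outputs = [old_policy.generate(prompt) for _ in range(group_size)]
        # 2. Score and rank responses with the rule-based reward model.
        rewards = torch.tensor([rule_based_reward(o, reference) for o in outputs])
        advantages = group_relative_advantages(rewards)
        # 3./4. GRPO update with a KL constraint against the reference policy.
        logp_new = torch.stack([policy.logprob(prompt, o) for o in outputs])
        logp_old = torch.stack([old_policy.logprob(prompt, o) for o in outputs])
        logp_ref = torch.stack([ref_policy.logprob(prompt, o) for o in outputs])
        kl = kl_penalty_estimate(logp_new, logp_ref)
        total_loss = total_loss + grpo_loss(logp_new, logp_old, advantages, kl)
    return total_loss / len(prompts)
```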

Conclusion

DeepSeek-R1's use of GRPO demonstrates the power of structured reinforcement learning in fine-tuning language models. By leveraging relative comparisons, rule-based reward systems, and structured reasoning templates, DeepSeek has set a new benchmark for efficient and scalable RLHF methodologies. The insights from GRPO provide valuable directions for future research in reinforcement learning and AI alignment.

