Group Relative Policy Optimization (GRPO) in Reinforcement Learning from Human Feedback (RLHF): Insights from DeepSeek

1. Introduction to the Buzz About DeepSeek

DeepSeek-R1-Zero has been making waves in the AI research community with its novel approach to reinforcement learning (RL). It stands out due to its ability to self-evolve without explicit supervision, achieving remarkable results on benchmarks such as AIME 2024. The introduction of Group Relative Policy Optimization (GRPO) has played a crucial role in this success, offering an alternative to existing reinforcement learning techniques like Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO).

2. Novel Approach in RLHF with GRPO

DeepSeek-R1 leverages GRPO in its reinforcement learning from human feedback (RLHF) process to optimize its policy without requiring an extensive supervised fine-tuning phase. Unlike traditional RLHF methods that rely on extensive human-annotated datasets, DeepSeek-R1 utilizes self-evolution and rule-based reward models to enhance its reasoning capabilities autonomously. The results demonstrate that RL, when properly structured, can lead to impressive model performance improvements without direct human intervention at every step.

3. What Are PPO and DPO in Technical Details?

  • Proximal Policy Optimization (PPO): PPO is a reinforcement learning algorithm that updates policies in a stable and efficient manner. It uses a clipped surrogate objective to ensure that the updated policy does not deviate too much from the previous policy, thus maintaining training stability. The objective function involves maximizing the expected advantage while ensuring that the probability ratio between the old and new policies remains within a defined threshold.

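To make the clipped surrogate objective concrete, here is a minimal PyTorch-style sketch. The tensor names (logprobs_new, logprobs_old, advantages) and the clipping range eps are illustrative assumptions, not DeepSeek's or any particular library's implementation.

```python
import torch

def ppo_clipped_loss(logprobs_new, logprobs_old, advantages, eps=0.2):
    """PPO clipped surrogate objective, negated so it can be minimized as a loss.

    logprobs_new / logprobs_old: log-probabilities of the sampled actions under the
    current and old policies; advantages: estimated advantages for those actions.
    """
    # Probability ratio computed in log space for numerical stability.
    ratio = torch.exp(logprobs_new - logprobs_old)
    # Unclipped and clipped surrogate terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Take the elementwise minimum, so overly large policy shifts earn no extra credit.
    return -torch.min(unclipped, clipped).mean()
```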

  • Direct Preference Optimization (DPO): DPO simplifies RLHF by directly optimizing preferences rather than relying on reward models. Instead of estimating rewards through a learned function, DPO reformulates the optimization process as a classification problem between preferred and non-preferred responses. This method can be more sample-efficient but may lack the flexibility of reward-driven approaches like PPO.

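For comparison, a minimal sketch of the DPO loss on a single preference pair. The variable names and the temperature beta are illustrative assumptions; the only fixed idea is that preference learning is framed as classification between the preferred and non-preferred response, with the reference policy acting as an implicit reward.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss: push the policy to prefer the chosen response over the rejected one,
    measured relative to a frozen reference policy."""
    # Log-ratio of policy vs. reference for each response in the pair.
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    # Maximize the margin between the preferred and non-preferred responses.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```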

4. Explaining GRPO in Technical Details

GRPO is a refinement of PPO that integrates relative comparisons across multiple generated outputs to optimize policy updates. It seeks to balance exploration and exploitation more effectively by considering the relative rankings of different model responses rather than absolute preference scores. This leads to a more structured reinforcement signal, improving sample efficiency and stability during training.

5. The GRPO Formula and Key Components

Formula

The GRPO objective function, in its simplified sequence-level form, is:

$$
J_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\; \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(O \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \left( \min\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)} A_i,\; \operatorname{clip}\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\, 1-\epsilon,\, 1+\epsilon \right) A_i \right) - \beta\, D_{\text{KL}}\!\left( \pi_\theta \,\|\, \pi_{\text{ref}} \right) \right) \right]
$$

Where:

  • π_θ is the policy being optimized, and π_θ_old is the policy that sampled the group of G outputs {o_1, …, o_G} for the question q.
  • A_i is the advantage of output o_i, computed relative to the other outputs in the same group.
  • clip(·, 1−ε, 1+ε) bounds the probability ratio, as in PPO, to keep policy updates stable.
  • ε and β are hyperparameters controlling the clipping range and the strength of the KL divergence penalty against the reference policy π_ref.
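
A minimal PyTorch-style sketch of this objective follows. It assumes the group-relative advantages and the per-output KL penalty are passed in precomputed; the function and argument names, and the default values of eps and beta, are illustrative rather than DeepSeek's actual code.

```python
import torch

def grpo_loss(logprobs_new, logprobs_old, group_advantages, kl_penalty,
              eps=0.2, beta=0.04):
    """Sequence-level GRPO objective, negated so it can be minimized.

    logprobs_new / logprobs_old: summed log-probabilities of each of the G sampled
    outputs under the current and sampling policies, shape (G,).
    group_advantages: advantage of each output relative to its group, shape (G,).
    kl_penalty: estimated KL divergence from the reference policy, shape (G,).
    """
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * group_advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * group_advantages
    # Clipped surrogate minus the KL regularization term, averaged over the group.
    objective = torch.min(unclipped, clipped) - beta * kl_penalty
    return -objective.mean()
```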

Rewards and Penalties

GRPO assigns rewards based on both absolute correctness and relative ranking within a group of outputs sampled for the same prompt. This ensures that the model is optimized on comparative signals across candidate answers rather than on isolated samples.

Logarithmic Tuning & Clipping

Working in log-probability space keeps updates numerically stable: the importance ratio between the new and old policies is computed as the exponential of a log-probability difference, preventing extreme policy shifts. Clipping mechanisms similar to PPO then constrain this ratio to a safe range, as in the loss sketch above.

Advantage Parameter

The advantage function estimates how much better one sampled output is than the alternatives. GRPO uses a simplified, group-relative advantage computation, removing the dependency on the separately trained value (critic) network that PPO requires.
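
A minimal sketch of the group-relative advantage computation commonly described for GRPO: each output's reward is standardized against the other rewards in its group. The small eps term is an illustrative guard against division by zero.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Standardize the rewards of G outputs sampled for the same prompt.

    rewards: shape (G,), one scalar reward per sampled output.
    Returns advantages of shape (G,): positive for above-average outputs,
    negative for below-average ones, with no learned value network involved.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```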

KL Divergence Regularization

A KL penalty term keeps the new policy from deviating too far from the reference policy. This prevents over-optimization that may lead to reward hacking, where the model exploits the reward function instead of genuinely improving.
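
A sketch of one common low-variance KL estimator (sometimes called the k3 estimator) that GRPO-style training often uses for this penalty; treating it as the estimator here is an assumption, not a claim about DeepSeek's exact code.

```python
import torch

def kl_penalty_estimate(logprobs_policy, logprobs_ref):
    """Unbiased, non-negative estimate of KL(pi_theta || pi_ref).

    Both inputs are log-probabilities of the same sampled tokens (or sequences)
    under the current policy and the frozen reference policy.
    """
    log_ratio = logprobs_ref - logprobs_policy
    # exp(log r) - log r - 1 >= 0, with expectation equal to the KL divergence.
    return torch.exp(log_ratio) - log_ratio - 1.0
```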

6. Reward Modeling and Rule-Based Reward System

Traditional RLHF uses learned reward models, but DeepSeek-R1 employs a rule-based reward system. This system assigns rewards based on predefined evaluation criteria, ensuring stability and interpretability. For example:

  • Mathematical problem solving: A solution is rewarded if it matches the correct answer.
  • Code generation: The output is rewarded if it compiles and executes successfully.
  • Reasoning and structure: The model receives additional rewards for following structured reasoning formats (e.g., presenting logical steps before the final answer).

This rule-based approach reduces the biases that can arise from learned, human-annotated preference models and provides more objective, verifiable evaluation criteria.
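
A minimal sketch of how such a rule-based reward could be composed from an accuracy check and a format check. The specific tags, weights, and regular expressions are illustrative assumptions, not DeepSeek's published reward code.

```python
import re

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Combine an accuracy reward (correct final answer) with a format reward
    (structured reasoning before the answer), as described above."""
    reward = 0.0
    # Accuracy reward: the final answer must match the reference exactly.
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0
    # Format reward: reasoning steps are enclosed in the expected tags.
    if re.search(r"<think>.*?</think>", response, re.DOTALL):
        reward += 0.1
    return reward
```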

7. Training Template

DeepSeek-R1 follows a structured training template where the base model is guided to output reasoning steps before producing the final answer. This structure encourages systematic problem-solving without enforcing specific heuristics or biases. The training process includes:

  1. Generating multiple responses for each input.
  2. Comparing and ranking responses using a rule-based reward model.
  3. Applying GRPO updates to optimize policy decisions iteratively.
  4. Using KL divergence constraints to ensure gradual learning without drastic behavioral shifts.
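
A minimal end-to-end sketch of steps 1–4, reusing the helper functions sketched earlier (rule_based_reward, group_relative_advantages, kl_penalty_estimate, grpo_loss). The policy object and its generate/logprob methods are hypothetical placeholders, not a real API.

```python
import torch

def grpo_training_step(policy, ref_policy, old_policy, prompts, references,
                       group_size=8):
    """One GRPO update: sample a group per prompt, score with rules, optimize."""
    total_loss = 0.0
    for prompt, reference in zip(prompts, references):
        # 1. Generate multiple responses for each input.
        outputs = [old_policy.generate(prompt) for _ in range(group_size)]
        # 2. Score and rank responses with the rule-based reward model.
        rewards = torch.tensor([rule_based_reward(o, reference) for o in outputs])
        advantages = group_relative_advantages(rewards)
        # 3./4. GRPO update with a KL constraint against the reference policy.
        logp_new = torch.stack([policy.logprob(prompt, o) for o in outputs])
        logp_old = torch.stack([old_policy.logprob(prompt, o) for o in outputs])
        logp_ref = torch.stack([ref_policy.logprob(prompt, o) for o in outputs])
        kl = kl_penalty_estimate(logp_new, logp_ref)
        total_loss = total_loss + grpo_loss(logp_new, logp_old, advantages, kl)
    return total_loss / len(prompts)
```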

Conclusion

DeepSeek-R1's use of GRPO demonstrates the power of structured reinforcement learning in fine-tuning language models. By leveraging relative comparisons, rule-based reward systems, and structured reasoning templates, DeepSeek has set a new benchmark for efficient and scalable RLHF methodologies. The insights from GRPO provide valuable directions for future research in reinforcement learning and AI alignment.

