Group Relative Policy Optimization (GRPO) in Reinforcement Learning from Human Feedback (RLHF): Insights from DeepSeek
Zahir Shaikh
1. Introduction to the Buzz About DeepSeek
DeepSeek-R1-Zero has been making waves in the AI research community with its novel approach to reinforcement learning (RL). It stands out due to its ability to self-evolve without explicit supervision, achieving remarkable results on benchmarks such as AIME 2024. The introduction of Group Relative Policy Optimization (GRPO) has played a crucial role in this success, offering an alternative to existing reinforcement learning techniques like Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO).
2. Novel Approach in RLHF with GRPO
DeepSeek-R1 leverages GRPO in its reinforcement learning from human feedback (RLHF) process to optimize its policy without requiring an extensive supervised fine-tuning phase. Unlike traditional RLHF methods that rely on extensive human-annotated datasets, DeepSeek-R1 utilizes self-evolution and rule-based reward models to enhance its reasoning capabilities autonomously. The results demonstrate that RL, when properly structured, can lead to impressive model performance improvements without direct human intervention at every step.
3. What Are PPO and DPO in Technical Detail?
Proximal Policy Optimization (PPO) is the workhorse of classical RLHF. It maximizes a clipped surrogate objective: the policy is pushed in the direction of an advantage estimate, but the probability ratio between the new and old policy is clipped to a small interval so that no single update moves the policy too far. In practice, PPO needs a separate value network (critic) to estimate advantages, a learned reward model, and a KL penalty against a reference policy, which makes the full pipeline memory- and compute-intensive.
Direct Preference Optimization (DPO) removes the explicit RL loop entirely. Starting from pairwise preference data (a chosen and a rejected response for the same prompt), DPO derives a closed-form objective in which the language model itself implicitly plays the role of the reward model. Training reduces to a classification-style loss on log-probability ratios against a frozen reference model, so no reward model, critic, or online sampling is required, at the cost of being tied to the available preference pairs.
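For intuition, here is a minimal PyTorch-style sketch of the two losses. The function and variable names are my own illustration, not taken from any specific library:

```python
import torch
import torch.nn.functional as F

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss for one batch of sampled actions.

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and the behaviour (old) policy; advantages: advantage estimates,
    typically produced by a learned critic.
    """
    ratio = torch.exp(logp_new - logp_old)            # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()      # maximize => negate

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss on a batch of preference pairs.

    Each argument is the summed log-probability of the chosen / rejected
    response under the policy or the frozen reference model.
    """
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```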
4. Explaining GRPO in Technical Detail
GRPO is a refinement of PPO that integrates relative comparisons across multiple generated outputs to optimize policy updates. It seeks to balance exploration and exploitation more effectively by considering the relative rankings of different model responses rather than absolute preference scores. This leads to a more structured reinforcement signal, improving sample efficiency and stability during training.
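At a high level, one GRPO training step samples a group of responses per prompt, scores them, converts the scores into group-relative advantages, and applies a clipped policy update. Below is a schematic sketch; `sample_responses`, `reward_fn`, and `grpo_update` are hypothetical placeholders standing in for the sampling, scoring, and optimization routines of a real trainer:

```python
import statistics

def grpo_step(policy, ref_policy, prompts, sample_responses,
              reward_fn, grpo_update, group_size=8):
    """One schematic GRPO step over a batch of prompts."""
    for q in prompts:
        # 1. Sample a group of candidate responses from the current policy.
        group = sample_responses(policy, q, group_size)

        # 2. Score each response independently (e.g., with rule-based rewards).
        rewards = [reward_fn(q, o) for o in group]

        # 3. Turn raw rewards into group-relative advantages: each response
        #    is judged against its siblings rather than in isolation.
        mu = statistics.mean(rewards)
        sigma = statistics.pstdev(rewards) or 1e-8
        advantages = [(r - mu) / sigma for r in rewards]

        # 4. Apply a PPO-style clipped update with a KL penalty toward the
        #    reference policy (see the objective in the next section).
        grpo_update(policy, ref_policy, q, group, advantages)
```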
5. The GRPO Formula and Key Components
Formula
The GRPO objective function, as introduced in the DeepSeekMath paper and reused for DeepSeek-R1, is:

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min\!\left( \rho_{i,t}(\theta)\, \hat{A}_{i,t},\ \mathrm{clip}\!\left( \rho_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon \right) \hat{A}_{i,t} \right) - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left( \pi_{\theta} \,\|\, \pi_{\mathrm{ref}} \right) \right) \right]
$$

with the per-token probability ratio and the group-relative advantage

$$
\rho_{i,t}(\theta) = \frac{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}
\qquad \text{and} \qquad
\hat{A}_{i,t} = \frac{r_i - \mathrm{mean}(\{r_1, \dots, r_G\})}{\mathrm{std}(\{r_1, \dots, r_G\})}.
$$

Where:
- $q$ is a prompt drawn from the training distribution $P(Q)$, and $o_1, \dots, o_G$ are $G$ responses sampled for that prompt from the old policy $\pi_{\theta_{\mathrm{old}}}$.
- $\pi_{\theta}$ is the policy being optimized and $\pi_{\mathrm{ref}}$ is the frozen reference (SFT) policy.
- $r_i$ is the scalar reward of response $o_i$, and $\hat{A}_{i,t}$ is its group-relative advantage, shared by every token of that response.
- $\varepsilon$ is the clipping range and $\beta$ controls the strength of the KL penalty.
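The objective translates almost line-for-line into code. Below is a minimal PyTorch-style sketch of the loss for the group of responses sampled for a single prompt; the tensor names, shapes, and default coefficients are my own illustration rather than DeepSeek's implementation:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, mask,
              clip_eps=0.2, kl_beta=0.04):
    """GRPO loss for the G responses sampled for one prompt.

    logp_new, logp_old, logp_ref: per-token log-probs of the sampled tokens
    under the current, old, and reference policies, shape (G, T).
    rewards: one scalar reward per response, shape (G,).
    mask: 1 for response tokens, 0 for padding, shape (G, T).
    """
    # Group-relative advantage: normalize rewards within the group and
    # broadcast the same advantage to every token of the response.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    adv = adv.unsqueeze(-1)                                    # (G, 1)

    # PPO-style clipped surrogate on the policy ratio (computed in log space).
    ratio = torch.exp(logp_new - logp_old)                     # (G, T)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)

    # Unbiased, non-negative per-token KL estimate toward the reference policy.
    log_ref_ratio = logp_ref - logp_new
    kl = torch.exp(log_ref_ratio) - log_ref_ratio - 1.0

    per_token = surrogate - kl_beta * kl
    per_response = (per_token * mask).sum(dim=1) / mask.sum(dim=1)
    return -per_response.mean()   # negate: optimizers minimize
```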
Rewards and Penalties
GRPO scores each sampled output with an absolute reward (for example, whether the answer is correct), but the learning signal an output receives depends on how its reward compares with the other outputs sampled for the same prompt. A correct answer in a group of mostly wrong ones is reinforced strongly, while the same answer in a group where every sample is correct contributes little. The model is therefore optimized through comparison within the group rather than on isolated samples.
Logarithmic Tuning & Clipping
The policy ratio is computed in log space: the difference between the new and old log-probabilities is exponentiated, which keeps the computation numerically stable and prevents extreme policy shifts from very small or very large probabilities. As in PPO, the resulting ratio is clipped to the range [1 − ε, 1 + ε], so a single batch of samples cannot push the policy far from the one that generated the data.
Advantage Parameter
The advantage function estimates how much better one sampled response is than the alternatives. Instead of training a separate value network (critic) as PPO does, GRPO uses a simplified computation: each response's reward is normalized by the mean and standard deviation of the rewards in its group, and the resulting score serves as the advantage for every token of that response. This removes the memory and compute cost of a critic that would otherwise be comparable in size to the policy model.
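As a concrete illustration with invented numbers, here is the group-relative advantage for four sampled answers: two correct, one partially credited, one wrong.

```python
import numpy as np

rewards = np.array([1.0, 1.0, 0.5, 0.0])   # hypothetical rewards for G = 4 samples

advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages.round(3))   # [ 0.905  0.905 -0.302 -1.508]
```

The two correct answers receive the same positive advantage, the wrong one a strongly negative advantage, and no value network is needed to produce these signals.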
KL Divergence Regularization
A KL penalty term keeps the new policy from deviating too far from the reference policy. This prevents over-optimization that may lead to reward hacking, where the model exploits the reward function instead of genuinely improving.
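Concretely, in the DeepSeekMath formulation the penalty is applied per token using an unbiased, always non-negative estimator of the KL divergence rather than the exact expectation:

$$
\mathbb{D}_{\mathrm{KL}}\!\left( \pi_{\theta} \,\|\, \pi_{\mathrm{ref}} \right) = \frac{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})} - \log \frac{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})} - 1
$$

Because this term sits directly in the loss rather than being folded into the reward (as in PPO-based RLHF), the strength of the regularization is controlled explicitly by the coefficient β in the objective above.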
6. Reward Modeling and Rule-Based Reward System
Traditional RLHF uses learned reward models, but DeepSeek-R1 employs a rule-based reward system. This system assigns rewards based on predefined evaluation criteria, ensuring stability and interpretability. For example, accuracy rewards check whether the final answer is actually correct (a math result compared against the ground truth, or code verified by compiling and running test cases), while format rewards check that the model encloses its reasoning and final answer in the required template tags.
This rule-based approach eliminates biases that might arise from human-annotated preferences and allows for more objective evaluation criteria.
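A toy version of such a rule-based reward might look like the following sketch. The regex, tag names, and score weights here are illustrative assumptions, not DeepSeek's internal rules:

```python
import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Toy rule-based reward: format check plus exact-match accuracy check."""
    reward = 0.0

    # Format rule: reasoning must be wrapped in <think>...</think> and the
    # final result in <answer>...</answer>.
    has_format = bool(re.search(r"<think>.*?</think>\s*<answer>.*?</answer>",
                                response, flags=re.DOTALL))
    if has_format:
        reward += 0.2   # illustrative weight

    # Accuracy rule: compare the extracted answer with the ground truth.
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match and match.group(1).strip() == ground_truth.strip():
        reward += 1.0   # illustrative weight

    return reward

# A correctly formatted, correct answer earns 1.2; a correct answer without
# the required tags cannot be extracted and earns 0.0.
print(rule_based_reward("<think>2 + 2 = 4</think> <answer>4</answer>", "4"))  # 1.2
```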
7. Training Template
DeepSeek-R1 follows a structured training template where the base model is guided to output its reasoning steps before producing the final answer. This structure encourages systematic problem-solving without enforcing specific heuristics or biases. The training process includes a fixed conversational prompt that asks the model to place its reasoning inside <think> ... </think> tags before giving the final answer inside <answer> ... </answer> tags, sampling a group of completions per prompt, scoring them with the rule-based rewards described above, and updating the policy with GRPO.
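A template in this spirit, paraphrased rather than quoted verbatim from DeepSeek, can be expressed as a simple prompt string:

```python
# Paraphrase of an R1-Zero-style training template; the exact wording DeepSeek
# uses may differ.
TRAINING_TEMPLATE = """A conversation between User and Assistant. The User asks a
question, and the Assistant solves it. The Assistant first thinks about the
reasoning process and then provides the answer. The reasoning process and the
answer are enclosed within <think> </think> and <answer> </answer> tags,
respectively.
User: {question}
Assistant:"""

prompt = TRAINING_TEMPLATE.format(question="What is 17 * 24?")
```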
Conclusion
DeepSeek-R1's use of GRPO demonstrates the power of structured reinforcement learning in fine-tuning language models. By leveraging relative comparisons, rule-based reward systems, and structured reasoning templates, DeepSeek has set a new benchmark for efficient and scalable RLHF methodologies. The insights from GRPO provide valuable directions for future research in reinforcement learning and AI alignment.