DeepSeek: From PPO to GRPO, Transforming RL Fine-Tuning for Large Language Models
Abhishake Yadav
When it comes to Reinforcement Learning (RL) for large language models (LLMs), Proximal Policy Optimization (PPO) has been the go-to method. However, DeepSeek has introduced a more recent approach called Group Relative Policy Optimization (GRPO), which aims to reduce complexity and resource usage, particularly by cutting out the separate “value network” that PPO traditionally requires. This article provides a thorough, plain-language comparison between PPO and GRPO, explaining each step in detail.
1. Quick Refresher on Reinforcement Learning (RL)
In Reinforcement Learning, an agent interacts with an environment and learns how to take actions that maximize a numerical reward. Think of training a dog to do a trick: the dog tries different behaviors, and you give it a treat whenever it gets closer to the trick you want.
In RL terms: the dog is the agent, the training setting is the environment, the dog’s behaviors are the actions, and the treats are the reward.
2. The Basics of Policy Gradient Methods
A policy in RL is simply a function that returns a probability distribution over possible actions in a given state. Policy gradient methods directly adjust (via gradient ascent) the parameters of this policy in order to improve performance (i.e., to get higher total reward).
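To make this concrete, here is a minimal sketch of a vanilla policy-gradient (REINFORCE-style) update in PyTorch. The tiny network, the state and action sizes, and the optimizer settings are all illustrative assumptions, not part of any particular library’s API.

```python
import torch
import torch.nn as nn

# Toy policy: maps a 4-dimensional state to logits over 2 actions (sizes are illustrative).
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def policy_gradient_step(states, actions, returns):
    """One vanilla policy-gradient update: increase the log-probability of each
    taken action in proportion to the return that followed it."""
    log_probs = torch.log_softmax(policy(states), dim=-1)          # (batch, n_actions)
    taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)   # log pi(a | s)
    loss = -(taken * returns).mean()                               # ascend expected return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Notice that nothing in this update limits how far a single step can move the policy, which is exactly the issue raised next.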
One key challenge, however, is that large, unconstrained updates to the policy can be destructive. If you drastically change the agent’s behavior in one step, it can forget what it learned or wander into bad exploration patterns.
3. Proximal Policy Optimization (PPO)
3.1 Why “Proximal”?
PPO aims to keep each new version of the policy close to the old version (the “old policy”) so as to avoid huge, destabilising jumps. This “proximity” is enforced by a clipped objective: when PPO updates the policy, it prevents the probability of certain actions from changing too much in one update cycle.
3.2 Clipped Objective (in Words)
In PPO, we compare the probability the new policy assigns to an action with the probability the old policy assigned to that same action. We form a ratio (new policy probability / old policy probability). If the ratio is above 1, the new policy is more likely to take that action; if below 1, it’s less likely.
We also compute an advantage function, which tells us how much better (or worse) a specific action was compared to an average action in that situation. If the advantage is positive, we want to slightly increase the policy’s probability of that action. If it’s negative, we want to slightly decrease it.
However, to avoid huge spikes, PPO clips that ratio to a small range (for instance, between 0.8 and 1.2). The final PPO objective takes the minimum between the unclipped version (which might encourage a big update) and the clipped version (which holds the update in check). This ensures the new policy stays in a "trusted" zone, not straying too far from the old policy.
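In code, the clipped surrogate comes down to a few lines. This sketch assumes you already have the log-probabilities of the chosen actions under the new and old policies, plus an advantage estimate; the 0.2 clip range matches the 0.8 to 1.2 window mentioned above.

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO's clipped surrogate objective, returned as a loss to minimize."""
    ratio = torch.exp(logp_new - logp_old)          # new prob / old prob
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the more pessimistic of the two, then negate for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```

When the advantage is positive, the clip removes any extra reward for pushing the ratio above 1 + eps; when it is negative, it removes any extra reward for pushing the ratio below 1 - eps.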
3.3 Key Advantages of PPO
PPO’s main appeal is stability: the clipped ratio keeps each update close to the previous policy, so a single bad batch rarely wrecks training. It is also conceptually simple, relying on one first-order surrogate objective, and has become the standard, well-tested choice for RL fine-tuning of language models.
3.4 Common PPO Setup for Language Models
When fine-tuning Large Language Models with PPO, people often train a separate value (critic) network alongside the policy to supply the baseline for advantage estimation, score each generated output with a reward model, and add a per-token KL penalty against a frozen reference model.
But this extra value model can be expensive, especially with large LLMs, since it can be almost as big as the main policy.
4. Motivation for Group Relative Policy Optimization (GRPO)
GRPO emerges as a direct response to the overhead of training a large value network. Instead of learning a value function for advantage estimation, GRPO proposes using group-based comparisons of different outputs (answers) to the same prompt.
Two Big Motivations:
First, cost: PPO’s value network can be nearly as large as the policy itself, so removing it saves a large amount of memory and compute during RL fine-tuning. Second, fit: for LLM tasks it is natural to sample several candidate answers to the same prompt and let the reward model score them, so a group-based comparison uses signal you are already generating.
5. How GRPO Works
5.1 Group Sampling
Suppose you have a question q. GRPO samples a group of G different outputs (answers) to that same question from the current policy, then scores each output with the reward model.
5.2 Group-Based Baseline (No Value Network!)
In PPO, you typically subtract a value estimate (baseline) to stabilize gradient updates. In GRPO, you instead compute the mean of all the rewards in the group and optionally the standard deviation. Each output’s reward is then normalised (subtract the mean, divide by the std). This becomes its advantage: “How much better or worse than average did this output do?”
If your answer is better than the average of your group, you get a positive advantage. If it’s worse, you get a negative advantage.
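A minimal sketch of that computation, assuming you already have scalar rewards for the G sampled outputs; the small epsilon is just a guard against dividing by zero and is an implementation detail, not something the method prescribes.

```python
import torch

def group_relative_advantages(rewards, eps=1e-8):
    """Turn the raw rewards of G outputs to the same prompt into advantages:
    subtract the group mean, then divide by the group standard deviation."""
    rewards = torch.as_tensor(rewards, dtype=torch.float32)  # shape (G,)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled answers to one prompt, scored by a reward model.
advantages = group_relative_advantages([0.1, 0.7, 0.4, 0.9])
# Answers above the group average get positive advantages, the rest negative.
```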
5.3 Objective: Still “Proximal,” but No Critic
GRPO keeps a ratio-based update (new vs. old policy probabilities) to avoid big jumps. Then it applies the same kind of clipping as in PPO to keep updates within a trust region. Instead of adding the KL penalty token-by-token to the reward, GRPO often includes a separate term that penalizes divergence from a reference model overall.
In short: GRPO’s “surrogate objective” looks a lot like PPO’s, but there’s no term from a learned value network. Instead, the advantage is derived from group-relative rewards.
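Putting the pieces together, a GRPO-style loss for one group of outputs might look like the sketch below. The tensor shapes, the KL coefficient, and the particular low-variance KL estimator (exp(ref - new) - (ref - new) - 1) are assumptions made for illustration, not a claim about DeepSeek’s exact implementation.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, clip_eps=0.2, kl_coef=0.04):
    """PPO-style clipped surrogate with group-relative advantages and a separate
    KL penalty toward a frozen reference model (no value network anywhere).

    logp_new / logp_old / logp_ref: log-probs of the sampled tokens under the
    current, old (sampling-time), and reference policies, shape (tokens,).
    advantages: group-relative advantage of the output each token belongs to.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped)

    # Low-variance estimate of KL(current policy || reference model), added to
    # the loss directly rather than folded into the per-token reward.
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    return -(surrogate - kl_coef * kl).mean()
```

Compare this with the PPO loss above: the clipping is identical, but the advantage comes from group statistics rather than a learned critic, and the KL term sits in the loss instead of the per-token reward.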
5.4 Outcome vs. Process Supervision
GRPO works with either style of reward signal. Under outcome supervision, the reward model scores only the final answer, and that single group-normalized score serves as the advantage for the whole output. Under process supervision, intermediate reasoning steps are scored as well, so different parts of the same answer can receive different, finer-grained advantages.
5.5 Iterative Updates
Over time, you might retrain or continually update the reward model to keep it in sync with your improving policy. The policy, reference model, and reward model can leapfrog each other’s improvements.
6. PPO vs. GRPO: Detailed Comparison
A. Value Function vs. Group Baseline
PPO trains a separate value network to provide the baseline for advantage estimation. GRPO replaces it with the mean (and optionally the standard deviation) of the rewards earned by a group of outputs sampled for the same prompt.
B. KL Penalty
In the usual PPO setup for LLMs, a per-token KL penalty against a reference model is folded into the reward. GRPO instead adds a separate KL term to the objective that penalizes divergence from the reference model directly.
C. Advantage Estimation
PPO derives advantages from its learned value function. GRPO derives them by normalizing each output’s reward against its group, asking how much better or worse than average that output did.
D. Resource Requirements
PPO has to store and train a critic that can be almost as large as the policy. GRPO drops that network entirely, cutting memory and compute during RL fine-tuning.
E. Performance & Stability
F. When Might PPO Still Be Preferred?
7. Strengths, Weaknesses, and Final Takeaways
Strengths of PPO
Its clipped updates keep the new policy close to the old one, which makes training stable, and it is the established, battle-tested method for RL fine-tuning of LLMs.
Weaknesses of PPO
It requires a separate learned value network that can be almost as large as the policy itself, which is expensive in memory and compute.
Strengths of GRPO
It keeps PPO-style clipping and a KL constraint against a reference model while dropping the critic entirely, deriving advantages from group-relative rewards at much lower cost.
Weaknesses of GRPO
It has to sample a whole group of outputs for every prompt, and its advantage estimates are only as informative as that group-based comparison and the reward model behind it.
8. Putting It All Together
PPO and GRPO share the same core idea: improve the policy with ratio-based, clipped updates that never stray too far from the previous version. The difference is where the baseline comes from. PPO learns it with a separate value network; GRPO gets it almost for free by sampling a group of answers to each prompt and normalizing their rewards. For large language models, that single change removes one of the most expensive components of the PPO pipeline while keeping the familiar, stable training dynamics.