DeepSeek: From PPO to GRPO, Transforming RL Fine-Tuning for Large Language Models

When it comes to Reinforcement Learning (RL) for large language models (LLMs), Proximal Policy Optimization (PPO) has been a go-to method. However, DeepSeek introduced a more recent approach called Group Relative Policy Optimization (GRPO), which aims to reduce complexity and resource usage, particularly by cutting out the separate “value network” that PPO traditionally requires. This article provides a thorough, plain-language comparison between PPO and GRPO, explaining each step in detail.


1. Quick Refresher on Reinforcement Learning (RL)

In Reinforcement Learning, an agent interacts with an environment and learns how to take actions that maximize a numerical reward. Think of training a dog to do a trick:

  • The environment is your home or training field.
  • The dog observes the environment (seeing a treat in your hand).
  • It takes an action (sit, bark, stand, etc.).
  • You give a reward (or not) based on how good the action was.
  • Over time, the dog figures out which actions get the most treats.

In RL terms:

  • We call each decision point a “state.”
  • The action is the choice the agent makes (e.g., next token to generate in an LLM).
  • The reward is feedback from the environment (or from a reward model).
  • The policy is the function that maps states to actions—basically a model that tells us the probability of each possible action in any given state.


2. The Basics of Policy Gradient Methods

A policy in RL is simply a function that returns a probability distribution over possible actions in a given state. Policy gradient methods directly adjust (via gradient ascent) the parameters of this policy in order to improve performance (i.e., to get higher total reward).

One key challenge, however, is that large, unconstrained updates to the policy can be destructive. If you drastically change the agent’s behavior in one step, it can forget what it learned or wander into bad exploration patterns.
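To make this concrete, here is a minimal PyTorch sketch of a plain policy-gradient loss. It assumes the log-probabilities of the sampled actions and their advantage estimates have already been computed; the function and variable names are purely illustrative.

```python
import torch

def policy_gradient_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    # log_probs: log pi_theta(a_t | s_t) for the actions actually sampled.
    # advantages: how much better each action was than average (treated as a constant).
    # Minimising this loss performs gradient ascent on E[log pi * A].
    return -(log_probs * advantages.detach()).mean()
```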


3. Proximal Policy Optimization (PPO)

[Figure: PPO clipped objective]

3.1 Why “Proximal”?

PPO aims to keep each new version of the policy close to the old version (the “old policy”) so as to avoid huge, destabilising jumps. This “proximity” is enforced by a clipped objective: when PPO updates the policy, it prevents the probability of certain actions from changing too much in one update cycle.


3.2 Clipped Objective (in Words)

In PPO, we compare:

  • The “new policy’s” probability of taking a certain action at a certain time
  • Versus the “old policy’s” probability of taking that same action

We form a ratio (new policy probability / old policy probability). If the ratio is above 1, the new policy is more likely to take that action; if below 1, it’s less likely.

We also compute an advantage function, which tells us how much better (or worse) a specific action was compared to an average action in that situation. If the advantage is positive, we want to slightly increase the policy’s probability of that action. If it’s negative, we want to slightly decrease it.

However, to avoid huge spikes, PPO clips that ratio to a small range (for instance, between 0.8 and 1.2). The final PPO objective takes the minimum between the unclipped version (which might encourage a big update) and the clipped version (which holds the update in check). This ensures the new policy stays in a "trusted" zone, not straying too far from the old policy.
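Here is a minimal PyTorch sketch of that clipped surrogate. It assumes the per-action log-probabilities under the new and old policies, and the advantages, are already available; the function name and the 0.2 clip range are illustrative choices.

```python
import torch

def ppo_clipped_loss(new_log_probs: torch.Tensor,
                     old_log_probs: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space for stability.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    # Clip the ratio to [1 - eps, 1 + eps], e.g. [0.8, 1.2] for eps = 0.2.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (minimum) surrogate, then negate for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```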


3.3 Key Advantages of PPO

  • Stability: By preventing large updates, PPO often trains more robustly.
  • Simplicity: Requires only gradient-based updates (no fancy second-order optimisation).
  • Sample Efficiency: You can reuse collected data for multiple gradient updates, saving on how many new samples from the environment are needed.


3.4 Common PPO Setup for Language Models

When fine-tuning Large Language Models, people often:

  • Use a reward model that scores how “good” a generated response is.
  • Include a reference model to penalize divergence (via KL divergence) from a baseline (often a supervised fine-tuned model).
  • Train a separate value network (the “critic”) to estimate how much future reward to expect at each token, which helps compute the advantage more accurately (often via something like Generalized Advantage Estimation, GAE).

But this extra value model can be expensive—especially with large LLMs—since it can be almost as big as the main policy.
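To show what the critic is used for, here is a common textbook-style sketch of GAE, assuming per-token rewards and critic value estimates are given. Termination masks and other practical details are omitted, and the names are illustrative.

```python
import torch

def gae_advantages(rewards: torch.Tensor,
                   values: torch.Tensor,
                   gamma: float = 0.99,
                   lam: float = 0.95) -> torch.Tensor:
    # rewards: per-step rewards, shape (T,).
    # values: critic estimates, shape (T + 1,); the extra entry bootstraps the final state.
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```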


4. Motivation for Group Relative Policy Optimization (GRPO)

[Figure: GRPO objective]

GRPO emerges as a direct response to the overhead of training a large value network. Instead of learning a value function for advantage estimation, GRPO proposes using group-based comparisons of different outputs (answers) to the same prompt.


Two Big Motivations:

  1. Less Compute/Memory: No separate value network.
  2. Aligns with Comparison-Based Rewards: Many LLM reward models are trained to compare pairs (or sets) of outputs—“Which answer is better?”—so grouping multiple outputs for the same prompt fits nicely.


5. How GRPO Works

[Figure: GRPO algorithm]

5.1 Group Sampling

Suppose you have a question q. Here is what GRPO does:

  • Samples a group of outputs (G outputs) from the old policy for the same question.
  • Each output gets a scalar score (reward) from the reward model. For instance, you might have G different candidate answers, each assigned a reward score like 1.2, 0.8, -0.3, etc.


5.2 Group-Based Baseline (No Value Network!)

In PPO, you typically subtract a value estimate (baseline) to stabilize gradient updates. In GRPO, you instead compute the mean of all the rewards in the group and optionally the standard deviation. Each output’s reward is then normalised (subtract the mean, divide by the std). This becomes its advantage: “How much better or worse than average did this output do?”

If your answer is better than the average of your group, you get a positive advantage. If it’s worse, you get a negative advantage.
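A minimal sketch of this group-relative advantage computation, assuming the reward model has already scored each of the G outputs (the small epsilon guarding against a zero standard deviation is an illustrative safeguard):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # rewards: scalar reward-model scores for the G outputs sampled for one prompt.
    # Each output's advantage is its reward normalised against the group statistics.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: three candidate answers scored 1.2, 0.8 and -0.3 by the reward model.
print(group_relative_advantages(torch.tensor([1.2, 0.8, -0.3])))
```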


5.3 Objective: Still “Proximal,” but No Critic

GRPO keeps a ratio-based update (new vs. old policy probabilities) to avoid big jumps. Then it applies the same kind of clipping as in PPO to keep updates within a trust region. Instead of adding the KL penalty token-by-token to the reward, GRPO often includes a separate term that penalizes divergence from a reference model overall.

In short: GRPO’s “surrogate objective” looks a lot like PPO’s, but there’s no term from a learned value network. Instead, the advantage is derived from group-relative rewards.
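Putting the pieces together, here is a rough PyTorch sketch of a GRPO-style loss. The clip range, the KL coefficient, and the particular KL estimator shown are illustrative choices, not a definitive reproduction of any specific implementation.

```python
import torch

def grpo_loss(new_log_probs: torch.Tensor,
              old_log_probs: torch.Tensor,
              ref_log_probs: torch.Tensor,
              advantages: torch.Tensor,
              clip_eps: float = 0.2,
              kl_coef: float = 0.04) -> torch.Tensor:
    # Same clipped-ratio surrogate as PPO, but `advantages` come from the
    # group-relative normalisation above rather than a learned value network.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped).mean()
    # KL divergence from the reference model, kept as a separate penalty term
    # rather than being folded into each token's reward.
    log_ratio = ref_log_probs - new_log_probs
    kl = (torch.exp(log_ratio) - log_ratio - 1.0).mean()
    return -(surrogate - kl_coef * kl)
```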


5.4 Outcome vs. Process Supervision

  • Outcome Supervision: You give a final reward for the entire generated answer. Every token in that answer shares the same advantage score. Simpler, but less precise if the reasoning is multi-step.
  • Process Supervision: You can assign partial rewards to each step in the chain of thought, so tokens that led to correct reasoning get positive signals. More complex but potentially more informative (both variants are sketched below).
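The sketch below shows how the two styles of supervision turn rewards into per-token advantages. It is a deliberately simplified illustration (step scores and step lengths are assumed inputs), not the exact aggregation used in any particular paper.

```python
import torch

def outcome_token_advantages(group_advantage: float, num_tokens: int) -> torch.Tensor:
    # Outcome supervision: every token of the answer shares one group-relative advantage.
    return torch.full((num_tokens,), group_advantage)

def process_token_advantages(step_scores: list[float], step_lengths: list[int]) -> torch.Tensor:
    # Process supervision (simplified): normalise the per-step scores, then give
    # each token the score of the reasoning step it belongs to.
    scores = torch.tensor(step_scores)
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)
    return torch.cat([torch.full((n,), s.item()) for s, n in zip(scores, step_lengths)])
```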


5.5 Iterative Updates

Over time, you might retrain or continually update the reward model to keep it in sync with your improving policy. The policy, reference model, and reward model can leapfrog each other’s improvements.


6. PPO vs. GRPO: Detailed Comparison


[Figure: PPO vs. GRPO comparison]



A. Value Function vs. Group Baseline

  • PPO: Requires a big “critic / value” network to estimate baseline values. This can be almost as large as the main policy, costing extra memory and compute.
  • GRPO: No separate critic. The “baseline” is simply the group’s average reward. If your reward model inherently scores or compares multiple outputs, this synergy is very natural.

B. KL Penalty

  • PPO (in LLMs): Often lumps the KL penalty into each token’s reward.
  • GRPO: Separates the KL penalty as a standalone regulariser. Advantage calculation stays “pure” (just from group scores), and the KL term is subtracted at the end (see the sketch below).
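A small sketch of that contrast, with assumed names and coefficients:

```python
import torch

def ppo_token_rewards_with_kl(env_rewards: torch.Tensor,
                              new_log_probs: torch.Tensor,
                              ref_log_probs: torch.Tensor,
                              kl_coef: float = 0.1) -> torch.Tensor:
    # Common PPO-for-LLMs recipe: subtract a per-token KL estimate from the
    # reward before advantages are computed.
    per_token_kl = new_log_probs - ref_log_probs
    return env_rewards - kl_coef * per_token_kl

# GRPO instead leaves the rewards untouched and subtracts a KL penalty directly
# from the objective (see the grpo_loss sketch in Section 5.3).
```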

C. Advantage Estimation

  • PPO: Often uses GAE (Generalised Advantage Estimation). This can be tricky to tune (requires discount factors, etc.) and can be noisy for long sequences.
  • GRPO: Advantage is simply how your reward stacks up relative to the other outputs in your group. Straightforward, especially if you only have a final reward.

D. Resource Requirements

  • PPO: Double model overhead (policy + value).
  • GRPO: Single big model (the policy) plus a reference model. The reward model is used for scoring but typically was needed anyway for reward shaping. You skip the large critic architecture.

E. Performance & Stability

  • PPO: Has been a workhorse in many RL tasks and is well-tested, but learning a value function for LLM tasks can be high-variance and slow.
  • GRPO: In tasks like math problem-solving (GSM8K, MATH), GRPO has shown good results without heavy overhead. By relying on direct reward comparisons, it can be more stable when you only have final or sparse rewards.

F. When Might PPO Still Be Preferred?

  • If you have very dense rewards or many intermediate signals that a value network can learn from, PPO might generalize better.
  • If generating multiple outputs per prompt (to form a group) is harder in your setup, PPO might be simpler (though, for many LLM applications, generating multiple answers is often standard for comparison-based data anyway).


7. Strengths, Weaknesses, and Final Takeaways

Strengths of PPO

  • Well-established in RL
  • Clipped updates improve stability
  • Reusable code and well-tested in practice

Weaknesses of PPO

  • Requires a separate value network, which is memory- and compute-intensive
  • Still needs careful tuning (discount rates, advantage estimation hyperparameters)

Strengths of GRPO

  • No big critic: Lower resource usage
  • Leverages group comparisons, aligning well with how many LLM reward models are trained (i.e., pairwise or group-wise preference comparisons)
  • Potentially simpler advantage calculation

Weaknesses of GRPO

  • You must generate multiple outputs per prompt to get a meaningful group average
  • If your setting provides dense intermediate rewards, the group-based approach might capture less nuance than a well-trained value model


8. Putting It All Together

  • PPO was groundbreaking for stabilising policy gradients by limiting how much you can change the policy each step. It’s widely used in RL tasks, including language modelling, but it typically carries the cost of training a large value model.
  • GRPO adopts the same “proximal” idea but replaces the separate value model with group-based advantage computation. If you’re already generating multiple candidate answers for each query and using a reward model that compares them, GRPO might streamline your training pipeline significantly.






