DeepSeek: From PPO to GRPO, Transforming RL Fine-Tuning for Large Language Models

When it comes to Reinforcement Learning (RL) for large language models (LLMs), Proximal Policy Optimization (PPO) has been a go-to method. However, DeepSeek introduced a more recent approach called Group Relative Policy Optimization (GRPO), which aims to reduce complexity and resource usage, particularly by cutting out the separate “value network” that PPO traditionally requires. This article provides a thorough, plain-language comparison between PPO and GRPO, explaining each step in detail.


1. Quick Refresher on Reinforcement Learning (RL)

In Reinforcement Learning, an agent interacts with an environment and learns how to take actions that maximize a numerical reward. Think of training a dog to do a trick:

  • The environment is your home or training field.
  • The dog observes the environment (seeing a treat in your hand).
  • It takes an action (sit, bark, stand, etc.).
  • You give a reward (or not) based on how good the action was.
  • Over time, the dog figures out which actions get the most treats.

In RL terms:

  • We call each decision point a “state.”
  • The action is the choice the agent makes (e.g., next token to generate in an LLM).
  • The reward is feedback from the environment (or from a reward model).
  • The policy is the function that maps states to actions—basically a model that tells us the probability of each possible action in any given state.


2. The Basics of Policy Gradient Methods

A policy in RL is simply a function that returns a probability distribution over possible actions in a given state. Policy gradient methods directly adjust (via gradient ascent) the parameters of this policy in order to improve performance (i.e., to get higher total reward).

One key challenge, however, is that large, unconstrained updates to the policy can be destructive. If you drastically change the agent’s behavior in one step, it can forget what it learned or wander into bad exploration patterns.
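To make this concrete, here is a minimal PyTorch sketch of a plain policy-gradient loss. It assumes the log-probabilities of the sampled actions and their advantage estimates have already been computed; the function and variable names are purely illustrative.

```python
import torch

def policy_gradient_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    # log_probs: log pi_theta(a_t | s_t) for the actions actually sampled.
    # advantages: how much better each action was than average (treated as a constant).
    # Minimising this loss performs gradient ascent on E[log pi * A].
    return -(log_probs * advantages.detach()).mean()
```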


3. Proximal Policy Optimization (PPO)

[Figure: PPO clipped objective]

3.1 Why “Proximal”?

PPO aims to keep each new version of the policy close to the old version (the “old policy”) so as to avoid huge, destabilising jumps. This “proximity” is enforced by a clipped objective: when PPO updates the policy, it prevents the probability of certain actions from changing too much in one update cycle.


3.2 Clipped Objective (in Words)

In PPO, we compare:

  • The “new policy’s” probability of taking a certain action at a certain time
  • Versus the “old policy’s” probability of taking that same action

We form a ratio (new policy probability / old policy probability). If the ratio is above 1, the new policy is more likely to take that action; if below 1, it’s less likely.

We also compute an advantage function, which tells us how much better (or worse) a specific action was compared to an average action in that situation. If the advantage is positive, we want to slightly increase the policy’s probability of that action. If it’s negative, we want to slightly decrease it.

However, to avoid huge spikes, PPO clips that ratio to a small range (for instance, between 0.8 and 1.2). The final PPO objective takes the minimum between the unclipped version (which might encourage a big update) and the clipped version (which holds the update in check). This ensures the new policy stays in a "trusted" zone, not straying too far from the old policy.
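Here is a minimal PyTorch sketch of that clipped surrogate. It assumes the per-action log-probabilities under the new and old policies, and the advantages, are already available; the function name and the 0.2 clip range are illustrative choices.

```python
import torch

def ppo_clipped_loss(new_log_probs: torch.Tensor,
                     old_log_probs: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space for stability.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    # Clip the ratio to [1 - eps, 1 + eps], e.g. [0.8, 1.2] for eps = 0.2.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (minimum) surrogate, then negate for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```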


3.3 Key Advantages of PPO

  • Stability: By preventing large updates, PPO often trains more robustly.
  • Simplicity: Requires only gradient-based updates (no fancy second-order optimisation).
  • Sample Efficiency: You can reuse collected data for multiple gradient updates, saving on how many new samples from the environment are needed.


3.4 Common PPO Setup for Language Models

When fine-tuning Large Language Models, people often:

  • Use a reward model that scores how “good” a generated response is.
  • Include a reference model to penalize divergence (via KL divergence) from a baseline (often a supervised fine-tuned model).
  • Train a separate value network (the “critic”) to estimate how much future reward to expect at each token, which helps compute the advantage more accurately (often via something like Generalized Advantage Estimation, GAE).

But this extra value model can be expensive—especially with large LLMs—since it can be almost as big as the main policy.
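To show what the critic is used for, here is a common textbook-style sketch of GAE, assuming per-token rewards and critic value estimates are given. Termination masks and other practical details are omitted, and the names are illustrative.

```python
import torch

def gae_advantages(rewards: torch.Tensor,
                   values: torch.Tensor,
                   gamma: float = 0.99,
                   lam: float = 0.95) -> torch.Tensor:
    # rewards: per-step rewards, shape (T,).
    # values: critic estimates, shape (T + 1,); the extra entry bootstraps the final state.
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```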


4. Motivation for Group Relative Policy Optimization (GRPO)

[Figure: GRPO objective]

GRPO emerges as a direct response to the overhead of training a large value network. Instead of learning a value function for advantage estimation, GRPO proposes using group-based comparisons of different outputs (answers) to the same prompt.


Two Big Motivations:

  1. Less Compute/Memory: No separate value network.
  2. Aligns with Comparison-Based Rewards: Many LLM reward models are trained to compare pairs (or sets) of outputs—“Which answer is better?”—so grouping multiple outputs for the same prompt fits nicely.


5. How GRPO Works

[Figure: GRPO algorithm]

5.1 Group Sampling

Suppose you have a question q. Here is what GRPO does:

  • Samples a group of outputs (G outputs) from the old policy for the same question.
  • Each output gets a scalar score (reward) from the reward model. For instance, you might have G different candidate answers, each assigned a reward score like 1.2, 0.8, -0.3, etc.


5.2 Group-Based Baseline (No Value Network!)

In PPO, you typically subtract a value estimate (baseline) to stabilize gradient updates. In GRPO, you instead compute the mean of all the rewards in the group and optionally the standard deviation. Each output’s reward is then normalised (subtract the mean, divide by the std). This becomes its advantage: “How much better or worse than average did this output do?”

If your answer is better than the average of your group, you get a positive advantage. If it’s worse, you get a negative advantage.
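A minimal sketch of this group-relative advantage computation, assuming the reward model has already scored each of the G outputs (the small epsilon guarding against a zero standard deviation is an illustrative safeguard):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # rewards: scalar reward-model scores for the G outputs sampled for one prompt.
    # Each output's advantage is its reward normalised against the group statistics.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: three candidate answers scored 1.2, 0.8 and -0.3 by the reward model.
print(group_relative_advantages(torch.tensor([1.2, 0.8, -0.3])))
```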


5.3 Objective: Still “Proximal,” but No Critic

GRPO keeps a ratio-based update (new vs. old policy probabilities) to avoid big jumps. Then it applies the same kind of clipping as in PPO to keep updates within a trust region. Instead of adding the KL penalty token-by-token to the reward, GRPO often includes a separate term that penalizes divergence from a reference model overall.

In short: GRPO’s “surrogate objective” looks a lot like PPO’s, but there’s no term from a learned value network. Instead, the advantage is derived from group-relative rewards.
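Putting the pieces together, here is a rough PyTorch sketch of a GRPO-style loss. The clip range, the KL coefficient, and the particular KL estimator shown are illustrative choices, not a definitive reproduction of any specific implementation.

```python
import torch

def grpo_loss(new_log_probs: torch.Tensor,
              old_log_probs: torch.Tensor,
              ref_log_probs: torch.Tensor,
              advantages: torch.Tensor,
              clip_eps: float = 0.2,
              kl_coef: float = 0.04) -> torch.Tensor:
    # Same clipped-ratio surrogate as PPO, but `advantages` come from the
    # group-relative normalisation above rather than a learned value network.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped).mean()
    # KL divergence from the reference model, kept as a separate penalty term
    # rather than being folded into each token's reward.
    log_ratio = ref_log_probs - new_log_probs
    kl = (torch.exp(log_ratio) - log_ratio - 1.0).mean()
    return -(surrogate - kl_coef * kl)
```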


5.4 Outcome vs. Process Supervision

  • Outcome Supervision: You give a final reward for the entire generated answer. Every token in that answer shares the same advantage score. Simpler, but less precise if the reasoning is multi-step.
  • Process Supervision: You can assign partial rewards to each step in the chain of thought, so tokens that led to correct reasoning get positive signals. More complex but potentially more informative (both variants are sketched below).
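The sketch below shows how the two styles of supervision turn rewards into per-token advantages. It is a deliberately simplified illustration (step scores and step lengths are assumed inputs), not the exact aggregation used in any particular paper.

```python
import torch

def outcome_token_advantages(group_advantage: float, num_tokens: int) -> torch.Tensor:
    # Outcome supervision: every token of the answer shares one group-relative advantage.
    return torch.full((num_tokens,), group_advantage)

def process_token_advantages(step_scores: list[float], step_lengths: list[int]) -> torch.Tensor:
    # Process supervision (simplified): normalise the per-step scores, then give
    # each token the score of the reasoning step it belongs to.
    scores = torch.tensor(step_scores)
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)
    return torch.cat([torch.full((n,), s.item()) for s, n in zip(scores, step_lengths)])
```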


5.5 Iterative Updates

Over time, you might retrain or continually update the reward model to keep it in sync with your improving policy. The policy, reference model, and reward model can leapfrog each other’s improvements.


6. PPO vs. GRPO: Detailed Comparison


[Figure: PPO vs. GRPO comparison]



A. Value Function vs. Group Baseline

  • PPO: Requires a big “critic / value” network to estimate baseline values. This can be almost as large as the main policy, costing extra memory and compute.
  • GRPO: No separate critic. The “baseline” is simply the group’s average reward. If your reward model inherently scores or compares multiple outputs, this synergy is very natural.

B. KL Penalty

  • PPO (in LLMs): Often lumps the KL penalty into each token’s reward.
  • GRPO: Separates the KL penalty as a standalone regulariser. Advantage calculation stays “pure” (just from group scores), and the KL term is subtracted at the end (see the sketch below).
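A small sketch of that contrast, with assumed names and coefficients:

```python
import torch

def ppo_token_rewards_with_kl(env_rewards: torch.Tensor,
                              new_log_probs: torch.Tensor,
                              ref_log_probs: torch.Tensor,
                              kl_coef: float = 0.1) -> torch.Tensor:
    # Common PPO-for-LLMs recipe: subtract a per-token KL estimate from the
    # reward before advantages are computed.
    per_token_kl = new_log_probs - ref_log_probs
    return env_rewards - kl_coef * per_token_kl

# GRPO instead leaves the rewards untouched and subtracts a KL penalty directly
# from the objective (see the grpo_loss sketch in Section 5.3).
```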

C. Advantage Estimation

  • PPO: Often uses GAE (Generalised Advantage Estimation). This can be tricky to tune (requires discount factors, etc.) and can be noisy for long sequences.
  • GRPO: Advantage is simply how your reward stacks up relative to the other outputs in your group. Straightforward, especially if you only have a final reward.

D. Resource Requirements

  • PPO: Double model overhead (policy + value).
  • GRPO: Single big model (the policy) plus a reference model. The reward model is used for scoring but typically was needed anyway for reward shaping. You skip the large critic architecture.

E. Performance & Stability

  • PPO: Has been a workhorse in many RL tasks and is well-tested, but learning a value function for LLM tasks can be high-variance and slow.
  • GRPO: In tasks like math problem-solving (GSM8K, MATH), GRPO has shown good results without heavy overhead. By relying on direct reward comparisons, it can be more stable when you only have final or sparse rewards.

F. When Might PPO Still Be Preferred?

  • If you have very dense rewards or many intermediate signals that a value network can learn from, PPO might generalize better.
  • If generating multiple outputs per prompt (to form a group) is harder in your setup, PPO might be simpler (though, for many LLM applications, generating multiple answers is often standard for comparison-based data anyway).


7. Strengths, Weaknesses, and Final Takeaways

Strengths of PPO

  • Well-established in RL
  • Clipped updates improve stability
  • Reusable code and well-tested in practice

Weaknesses of PPO

  • Requires a separate value network, which is memory- and compute-intensive
  • Still needs careful tuning (discount rates, advantage estimation hyperparameters)

Strengths of GRPO

  • No big critic: Lower resource usage
  • Leverages group comparisons, aligning well with how many LLM reward models are trained (i.e., pairwise or group-wise preference comparisons)
  • Potentially simpler advantage calculation

Weaknesses of GRPO

  • You must generate multiple outputs per prompt to get a meaningful group average
  • If your setting provides dense intermediate rewards, the group-based approach might capture less nuance than a well-trained value model


8. Putting It All Together

  • PPO was groundbreaking for stabilising policy gradients by limiting how much you can change the policy each step. It’s widely used in RL tasks, including language modelling, but it typically carries the cost of training a large value model.
  • GRPO adopts the same “proximal” idea but replaces the separate value model with group-based advantage computation. If you’re already generating multiple candidate answers for each query and using a reward model that compares them, GRPO might streamline your training pipeline significantly.






