The computation of the reward within RLHF settings utilizing the TRL library

The Hugging Face Transformer Reinforcement Learning (TRL) library simplifies Reinforcement Learning from Human Feedback (RLHF) pipelines. A typical RLHF pipeline consists of supervised fine-tuning, reward model training, and fine-tuning with Proximal Policy Optimization (PPO).

The main goal of this blog post is to explain how the rewards produced by a pre-trained reward model are used inside the PPO optimization step.

In the RLHF setting, the standard RL concepts map onto language generation tasks as follows.

  • Action Space: In RLHF for language generation tasks, the action space is the vocabulary of the Large Language Model (LLM); taking an action corresponds to generating a single token.
  • Time Steps: Each time step in language generation tasks represents generating a single token, such as a word or subword unit.
  • Episodes: In RLHF, an episode refers to a single example generation, where the agent produces a complete piece of text for a given input or prompt (illustrated in the short sketch after this list).
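
The snippet below makes this mapping concrete; the checkpoint and example strings are arbitrary choices for illustration only.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # checkpoint chosen purely for illustration

    prompt = "The movie was"
    response = " surprisingly good and well acted."

    # Each generated token is one action, taken at one time step;
    # the full response to this prompt is one episode.
    actions = tokenizer(response).input_ids
    print(actions)       # token ids, one per time step
    print(len(actions))  # number of time steps in this episode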


PPO loss function

[Figure: PPO loss function. Paper: Proximal Policy Optimization Algorithms - https://arxiv.org/abs/1707.06347]

Given the abundance of resources available on the basics of the PPO algorithm, this section will not cover them. Instead, the focus is on the key idea of applying PPO and the essential step of computing the advantage function at each time step, as depicted in the above figure.

To compute the advantage function at each step, a common approach is to use an estimate based on the TD (Temporal Difference) error. The TD error quantifies the gap between the value predicted for a state and the reward plus discounted value actually observed one step later.

[Figure: Advantage estimation from TD errors. Hugging Face blog - Proximal Policy Optimization (PPO): https://huggingface.co/blog/deep-rl-]

In the above figure, "r" represents the reward for each time step, which is equivalent to the reward assigned to each generated token. Hence, it is necessary to compute the reward for each time step in language generation tasks.
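
To make this concrete, here is a minimal sketch of estimating per-token advantages from TD errors via Generalized Advantage Estimation (GAE). The function name and the gamma/lam values are illustrative choices, not TRL's exact internals.

    import torch

    def estimate_advantages(rewards, values, gamma=1.0, lam=0.95):
        """Per-token advantages from TD errors (GAE).

        rewards: tensor of shape (seq_len,) - reward for each generated token
        values:  tensor of shape (seq_len,) - value estimate V(s_t) for each token
        """
        seq_len = rewards.shape[0]
        advantages = torch.zeros_like(rewards)
        last_gae = 0.0
        for t in reversed(range(seq_len)):
            # After the final token there is no next state, so V(s_{t+1}) = 0
            next_value = values[t + 1] if t < seq_len - 1 else 0.0
            # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
            delta = rewards[t] + gamma * next_value - values[t]
            # GAE: A_t = delta_t + gamma * lam * A_{t+1}
            last_gae = delta + gamma * lam * last_gae
            advantages[t] = last_gae
        # Returns (value targets) for the critic loss
        returns = advantages + values
        return advantages, returns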


A key question arises: How do we compute rewards per token when our reward model provides a scalar value for the entire generation?


To tackle this challenge, RLHF with LLMs adopts a distinctive approach: the scalar reward for the entire output sequence is reused to construct a reward for each individual token in that sequence.

However, this approach could have unintended consequences, such as the model attempting to generate gibberish or incoherent outputs solely to fool the reward model.

To mitigate this, we introduce a KL penalty for each token, computed as the divergence between the current model's policy and the reference model's policy. The reference model is a frozen copy of the language model we start PPO training from.

This KL penalty ensures that our model does not deviate excessively from the original model.

Typically, we compute the log probabilities of the generated tokens under the current model and compare them to the log probabilities of the exact same tokens under the frozen reference model.
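
As a rough sketch of that comparison (the function and variable names are illustrative and `model` is assumed to be a standard transformers causal LM; this is not TRL's internal code):

    import torch
    import torch.nn.functional as F

    def per_token_logprobs(model, input_ids):
        """Log-probability of each generated token under `model`.

        input_ids: (batch, seq_len) - prompt plus generated tokens.
        """
        logits = model(input_ids).logits              # (batch, seq_len, vocab)
        logprobs = F.log_softmax(logits, dim=-1)
        # Logits at position t predict token t + 1, so shift before gathering
        return torch.gather(
            logprobs[:, :-1, :], dim=2, index=input_ids[:, 1:].unsqueeze(-1)
        ).squeeze(-1)                                 # (batch, seq_len - 1)

    # Per-token KL penalty, approximated by the log-probability difference:
    # kl = per_token_logprobs(policy_model, ids) - per_token_logprobs(ref_model, ids)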

The final per-token reward is a combination of the scalar sequence reward and the KL divergence. In practice, the scalar reward is added only to the last token's KL penalty value; every other token's reward comes solely from the per-token KL divergence between the current model and the frozen model.
[Figure: Learning to summarize from human feedback - https://arxiv.org/pdf/2009.01325.pdf]


The TRL library accomplishes this per-token reward computation with a pleasingly simple piece of code.

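The following is a minimal sketch in the spirit of how TRL's PPOTrainer combines the KL penalty with the scalar score; the function name, signature, and the kl_coef value are illustrative and vary across TRL versions.

    import torch

    def compute_per_token_rewards(score, logprobs, ref_logprobs, kl_coef=0.2):
        """Per-token rewards for one generated response.

        score:        scalar reward from the reward model for the whole response
        logprobs:     (resp_len,) log-probs of the response tokens under the policy
        ref_logprobs: (resp_len,) log-probs of the same tokens under the frozen model
        kl_coef:      weight of the KL penalty (illustrative value)
        """
        # Per-token KL penalty, approximated by the log-probability difference
        kl = logprobs - ref_logprobs
        rewards = -kl_coef * kl
        # The scalar reward from the reward model is added only to the last token
        rewards[-1] += score
        return rewards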


Moreover, in the RLHF with LLM approach, the per-token value function is computed using AutoModelForCausalLMWithValueHead. This class attaches a value head (a small fully connected layer) on top of the language model, which takes the final hidden representation of each token in the sequence and outputs a scalar value for it. This yields a value estimate for every individual token, which feeds into the advantage computation used by PPO within the RLHF framework.

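A rough usage sketch follows; the checkpoint is arbitrary, and the tuple return format (lm_logits, loss, values) matches older TRL releases, so check the documentation for your version.

    from transformers import AutoTokenizer
    from trl import AutoModelForCausalLMWithValueHead

    # Checkpoint chosen purely for illustration
    model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    input_ids = tokenizer("RLHF makes language models", return_tensors="pt").input_ids

    # The value head produces one scalar estimate per token in the sequence
    lm_logits, _, values = model(input_ids)
    print(lm_logits.shape)  # (batch, seq_len, vocab_size)
    print(values.shape)     # (batch, seq_len)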


Given an episode, which represents a generated output for a given input, we now have per-token rewards, value function estimations, and action probabilities. With this information, we can seamlessly apply PPO.
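
Putting it all together, a condensed sketch of a single TRL PPO step might look like the following. The checkpoint, prompt, generation settings, and hard-coded reward are placeholders, and the PPOTrainer interface shown here matches older TRL releases (newer versions have reworked it), so treat this as illustrative rather than a complete training script.

    import torch
    from transformers import AutoTokenizer
    from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

    model_name = "gpt2"  # placeholder checkpoint
    model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
    ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)  # frozen reference
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1), model, ref_model, tokenizer)

    query = tokenizer("Explain RLHF in one sentence:", return_tensors="pt").input_ids[0]
    generation_kwargs = {"max_new_tokens": 20, "do_sample": True, "pad_token_id": tokenizer.eos_token_id}
    response = ppo_trainer.generate([query], return_prompt=False, **generation_kwargs)[0]

    # Scalar score that would normally come from the reward model (hard-coded here)
    reward = torch.tensor(1.0)

    # One PPO step: per-token KL penalties are computed, the scalar reward is added
    # to the last response token, advantages are estimated, and the policy is updated.
    stats = ppo_trainer.step([query], [response], [reward])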

And that is the magic behind the per-token reward computation. Cheers to the future of RLHF!
