The computation of the reward within RLHF settings utilizing the TRL library

The Hugging Face Transformer Reinforcement Learning (TRL) library simplifies Reinforcement Learning from Human Feedback (RLHF) pipelines. A typical RLHF pipeline consists of supervised fine-tuning, reward model training, and fine-tuning with Proximal Policy Optimization (PPO).

The main goal of this blog post is to explain how the rewards produced by a pre-trained reward model are used inside the PPO optimization step.

In the RLHF setting, the standard RL concepts map onto language generation tasks as follows.

  • Action Space: In RLHF for language generation tasks, the action space is the vocabulary of the Large Language Model (LLM); taking an action corresponds to generating a single token.
  • Time Steps: Each time step in language generation tasks represents generating a single token, such as a word or subword unit.
  • Episodes: In RLHF, an episode refers to a single example generation, where the agent produces a complete piece of text for a given input or prompt (illustrated in the short sketch after this list).
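
The snippet below makes this mapping concrete; the checkpoint and example strings are arbitrary choices for illustration only.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # checkpoint chosen purely for illustration

    prompt = "The movie was"
    response = " surprisingly good and well acted."

    # Each generated token is one action, taken at one time step;
    # the full response to this prompt is one episode.
    actions = tokenizer(response).input_ids
    print(actions)       # token ids, one per time step
    print(len(actions))  # number of time steps in this episode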


PPO loss function

[Figure: PPO loss function. Paper: Proximal Policy Optimization Algorithms - https://arxiv.org/abs/1707.06347]

Given the abundance of resources available on the basics of the PPO algorithm, this section will not cover them. Instead, the focus is on the key idea of applying PPO and the essential step of computing the advantage function at each time step, as depicted in the above figure.

To compute the advantage function at each step, a common approach is to use an estimate based on the TD (Temporal Difference) error. The TD error quantifies the gap between the value predicted for a state and the reward plus discounted value actually observed one step later.

[Figure: Advantage estimation from TD errors. Hugging Face blog - Proximal Policy Optimization (PPO): https://huggingface.co/blog/deep-rl-]

In the above figure, "r" represents the reward for each time step, which is equivalent to the reward assigned to each generated token. Hence, it is necessary to compute the reward for each time step in language generation tasks.
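
To make this concrete, here is a minimal sketch of estimating per-token advantages from TD errors via Generalized Advantage Estimation (GAE). The function name and the gamma/lam values are illustrative choices, not TRL's exact internals.

    import torch

    def estimate_advantages(rewards, values, gamma=1.0, lam=0.95):
        """Per-token advantages from TD errors (GAE).

        rewards: tensor of shape (seq_len,) - reward for each generated token
        values:  tensor of shape (seq_len,) - value estimate V(s_t) for each token
        """
        seq_len = rewards.shape[0]
        advantages = torch.zeros_like(rewards)
        last_gae = 0.0
        for t in reversed(range(seq_len)):
            # After the final token there is no next state, so V(s_{t+1}) = 0
            next_value = values[t + 1] if t < seq_len - 1 else 0.0
            # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
            delta = rewards[t] + gamma * next_value - values[t]
            # GAE: A_t = delta_t + gamma * lam * A_{t+1}
            last_gae = delta + gamma * lam * last_gae
            advantages[t] = last_gae
        # Returns (value targets) for the critic loss
        returns = advantages + values
        return advantages, returns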


A key question arises: How do we compute rewards per token when our reward model provides a scalar value for the entire generation?


To tackle this challenge, RLHF with LLMs adopts a distinctive approach: the scalar reward for the entire output sequence is reused to construct a reward for each individual token in that sequence.

However, this approach could have unintended consequences, such as the model attempting to generate gibberish or incoherent outputs solely to fool the reward model.

To mitigate this, we introduce a KL penalty for each token, computed as the divergence between the current model's policy and the reference model's policy. The reference model is a frozen copy of the language model we start PPO training from.

This KL penalty ensures that our model does not deviate excessively from the original model.

Typically, we compute the log probabilities of the generated tokens under the current model and compare them to the log probabilities of the exact same tokens under the frozen reference model.
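
As a rough sketch of that comparison (the function and variable names are illustrative and `model` is assumed to be a standard transformers causal LM; this is not TRL's internal code):

    import torch
    import torch.nn.functional as F

    def per_token_logprobs(model, input_ids):
        """Log-probability of each generated token under `model`.

        input_ids: (batch, seq_len) - prompt plus generated tokens.
        """
        logits = model(input_ids).logits              # (batch, seq_len, vocab)
        logprobs = F.log_softmax(logits, dim=-1)
        # Logits at position t predict token t + 1, so shift before gathering
        return torch.gather(
            logprobs[:, :-1, :], dim=2, index=input_ids[:, 1:].unsqueeze(-1)
        ).squeeze(-1)                                 # (batch, seq_len - 1)

    # Per-token KL penalty, approximated by the log-probability difference:
    # kl = per_token_logprobs(policy_model, ids) - per_token_logprobs(ref_model, ids)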

The final per-token reward is a combination of the scalar sequence reward and the KL divergence. In practice, the scalar reward is added only to the last token's KL penalty value; every other token's reward comes solely from the per-token KL divergence between the current model and the frozen model.
[Figure: Learning to summarize from human feedback - https://arxiv.org/pdf/2009.01325.pdf]


The TRL library accomplishes this per-token reward computation with a pleasingly simple piece of code.

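The following is a minimal sketch in the spirit of how TRL's PPOTrainer combines the KL penalty with the scalar score; the function name, signature, and the kl_coef value are illustrative and vary across TRL versions.

    import torch

    def compute_per_token_rewards(score, logprobs, ref_logprobs, kl_coef=0.2):
        """Per-token rewards for one generated response.

        score:        scalar reward from the reward model for the whole response
        logprobs:     (resp_len,) log-probs of the response tokens under the policy
        ref_logprobs: (resp_len,) log-probs of the same tokens under the frozen model
        kl_coef:      weight of the KL penalty (illustrative value)
        """
        # Per-token KL penalty, approximated by the log-probability difference
        kl = logprobs - ref_logprobs
        rewards = -kl_coef * kl
        # The scalar reward from the reward model is added only to the last token
        rewards[-1] += score
        return rewards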


Moreover, in the RLHF with LLM approach, the per-token value function is computed using AutoModelForCausalLMWithValueHead. This class attaches a value head (a small fully connected layer) on top of the language model, which takes the final hidden representation of each token in the sequence and outputs a scalar value for it. This yields a value estimate for every individual token, which feeds into the advantage computation used by PPO within the RLHF framework.

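A rough usage sketch follows; the checkpoint is arbitrary, and the tuple return format (lm_logits, loss, values) matches older TRL releases, so check the documentation for your version.

    from transformers import AutoTokenizer
    from trl import AutoModelForCausalLMWithValueHead

    # Checkpoint chosen purely for illustration
    model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    input_ids = tokenizer("RLHF makes language models", return_tensors="pt").input_ids

    # The value head produces one scalar estimate per token in the sequence
    lm_logits, _, values = model(input_ids)
    print(lm_logits.shape)  # (batch, seq_len, vocab_size)
    print(values.shape)     # (batch, seq_len)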


Given an episode, which represents a generated output for a given input, we now have per-token rewards, value function estimations, and action probabilities. With this information, we can seamlessly apply PPO.
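
Putting it all together, a condensed sketch of a single TRL PPO step might look like the following. The checkpoint, prompt, generation settings, and hard-coded reward are placeholders, and the PPOTrainer interface shown here matches older TRL releases (newer versions have reworked it), so treat this as illustrative rather than a complete training script.

    import torch
    from transformers import AutoTokenizer
    from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

    model_name = "gpt2"  # placeholder checkpoint
    model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
    ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)  # frozen reference
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1), model, ref_model, tokenizer)

    query = tokenizer("Explain RLHF in one sentence:", return_tensors="pt").input_ids[0]
    generation_kwargs = {"max_new_tokens": 20, "do_sample": True, "pad_token_id": tokenizer.eos_token_id}
    response = ppo_trainer.generate([query], return_prompt=False, **generation_kwargs)[0]

    # Scalar score that would normally come from the reward model (hard-coded here)
    reward = torch.tensor(1.0)

    # One PPO step: per-token KL penalties are computed, the scalar reward is added
    # to the last response token, advantages are estimated, and the policy is updated.
    stats = ppo_trainer.step([query], [response], [reward])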

And that is the magic behind the per-token reward computation. Cheers to the future of RLHF!
