Step-wise Rewards in RLHF: Could This Be the Breakthrough Behind OpenAI's Strawberry Models?


OpenAI's latest update hints at exciting advances in how reinforcement learning (RL) is applied to large language models (LLMs). While using RL during training isn't new, the idea of having the model spend more time thinking during inference appears to be a novel direction worth exploring.

In its current form, RLHF assigns a single reward to the entire generated sequence, treating the whole response as one step: the episode length is effectively 1. This structure makes it difficult to apply traditional RL techniques, which distribute step-wise rewards across the many steps of a trajectory, and it limits the model's ability to refine its output progressively. The episode length ends up at 1 because reward models are trained to score complete sentences or sequences, which in turn follows from how the preference data is annotated: human annotators give feedback on whole responses, not individual tokens, so token-level rewards are impractical and current systems settle for a single sequence-level reward.
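To make this concrete, here is a minimal, hypothetical sketch (the tensor names and shapes are my own, purely for illustration) of how a sequence-level reward typically ends up credited to the final token only, with zeros everywhere else:

```python
import torch

# Toy setup: a batch of generated responses, each T tokens long (made-up shapes).
batch_size, T = 4, 16

# A sequence-level reward model scores each *whole* response once.
sequence_rewards = torch.randn(batch_size)  # one scalar per response

# In the usual RLHF setup, that scalar is placed on the final token only;
# every earlier token gets zero immediate reward, so the "episode" is
# effectively a single rewarded step.
per_token_rewards = torch.zeros(batch_size, T)
per_token_rewards[:, -1] = sequence_rewards
```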

However, OpenAI may have discovered a way to integrate rewards at each generation step, allowing for fine-grained learning. If successful, this would enable LLMs to refine their outputs with each token rather than waiting for a final reward at the end of the sequence. Such an approach would mirror the step-wise rewards used in traditional RL, providing a continuous feedback loop that improves the model’s learning process over time.
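A rough sketch of what such step-wise credit assignment could look like, assuming some hypothetical per-token (or per-reasoning-step) reward signal; the numbers and the discount factor here are toy values, not anything OpenAI has described:

```python
import torch

gamma = 0.99  # discount factor, chosen only for illustration

# Hypothetical per-step rewards, e.g. from a process/step-wise reward model
# scoring each generated token or reasoning step.
step_rewards = torch.tensor([0.1, 0.0, 0.3, -0.2, 0.5])

# Discounted rewards-to-go: each step's return sums the (discounted) rewards
# that follow it, giving the policy a dense learning signal at every step
# instead of one terminal reward.
rewards_to_go = torch.zeros_like(step_rewards)
running = 0.0
for t in reversed(range(len(step_rewards))):
    running = step_rewards[t] + gamma * running
    rewards_to_go[t] = running

print(rewards_to_go)  # roughly [0.68, 0.59, 0.59, 0.30, 0.50]
```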

This might also be the "secret" OpenAI hinted at when referring to models spending more time thinking during inference. By exploring longer trajectories and making use of multi-step memory, the model could refine its reasoning in real time, leading to more accurate decision-making and more data-efficient learning.

In traditional RL, rewards are computed for each action within a trajectory, as in the PPO algorithm. The red-highlighted sections of the PPO pseudocode figure below illustrate how rewards-to-go are computed step by step. If OpenAI can apply this step-wise reward mechanism to LLMs, it could bridge the gap between classic RL domains and the sequence-level reward structure currently used for LLMs.


PPO Pseudocode (figure)
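For reference, here is a short, self-contained sketch of the step-wise computation that PPO-style pseudocode describes: per-step TD errors (Sutton, 1988) accumulated into advantages via Generalized Advantage Estimation, the estimator commonly paired with PPO. The function name and toy numbers are mine, not from any particular library:

```python
import torch

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards: per-step rewards, shape (T,)
    values:  value estimates, shape (T + 1,), including a bootstrap value
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD error at step t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + values[:T]  # targets for the value function
    return advantages, returns

# Toy usage with made-up numbers
rewards = torch.tensor([0.0, 0.0, 1.0])
values = torch.tensor([0.2, 0.4, 0.6, 0.0])  # last entry is the bootstrap value
adv, ret = compute_gae(rewards, values)
```

With today's sequence-level rewards, only the final step carries a non-zero reward, so these per-step quantities collapse toward a single terminal signal; with genuine step-wise rewards they would differ meaningfully at every token.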


To be clear, we don't know what OpenAI has actually built, since nothing specific has been disclosed. These are just my thoughts and speculation based on the limited information available; more details are needed to see the full picture.


Here are some references:

  1. Proximal Policy Optimization (PPO): Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347. https://arxiv.org/abs/1707.06347
  2. Temporal Difference (TD) Error: Sutton, R. S. (1988). Learning to Predict by the Methods of Temporal Differences. Machine Learning, 3(1), 9-44. https://link.springer.com/article/10.1007/BF00115009
  3. Deep Reinforcement Learning from Human Preferences: Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep Reinforcement Learning from Human Preferences. In Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/1706.03741
  4. Reinforcement Learning from Human Feedback (RLHF): Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. arXiv preprint arXiv:2203.02155. https://arxiv.org/abs/2203.02155


