Step-wise Rewards in RLHF: Could This Be the Breakthrough Behind OpenAI's Strawberry Models?


OpenAI's latest update hints at exciting advances in how reinforcement learning (RL) is applied to large language models (LLMs). While using RL during training isn't new, the idea of having the model spend more time thinking during inference appears to be a novel direction worth exploring.

In its current form, RLHF assigns a single reward to the entire generated sequence, treating the whole response as one step: the episode length is effectively 1. This structure makes it difficult to apply traditional RL techniques, which distribute step-wise rewards across the many steps of a trajectory, and it limits the model's ability to refine its output progressively. The episode length ends up at 1 because reward models are trained to score complete sentences or sequences, which in turn follows from how the preference data is annotated: human annotators give feedback on whole responses, not individual tokens, so token-level rewards are impractical and current systems settle for a single sequence-level reward.
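To make this concrete, here is a minimal, hypothetical sketch (the tensor names and shapes are my own, purely for illustration) of how a sequence-level reward typically ends up credited to the final token only, with zeros everywhere else:

```python
import torch

# Toy setup: a batch of generated responses, each T tokens long (made-up shapes).
batch_size, T = 4, 16

# A sequence-level reward model scores each *whole* response once.
sequence_rewards = torch.randn(batch_size)  # one scalar per response

# In the usual RLHF setup, that scalar is placed on the final token only;
# every earlier token gets zero immediate reward, so the "episode" is
# effectively a single rewarded step.
per_token_rewards = torch.zeros(batch_size, T)
per_token_rewards[:, -1] = sequence_rewards
```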

However, OpenAI may have discovered a way to integrate rewards at each generation step, allowing for fine-grained learning. If successful, this would enable LLMs to refine their outputs with each token rather than waiting for a final reward at the end of the sequence. Such an approach would mirror the step-wise rewards used in traditional RL, providing a continuous feedback loop that improves the model’s learning process over time.
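A rough sketch of what such step-wise credit assignment could look like, assuming some hypothetical per-token (or per-reasoning-step) reward signal; the numbers and the discount factor here are toy values, not anything OpenAI has described:

```python
import torch

gamma = 0.99  # discount factor, chosen only for illustration

# Hypothetical per-step rewards, e.g. from a process/step-wise reward model
# scoring each generated token or reasoning step.
step_rewards = torch.tensor([0.1, 0.0, 0.3, -0.2, 0.5])

# Discounted rewards-to-go: each step's return sums the (discounted) rewards
# that follow it, giving the policy a dense learning signal at every step
# instead of one terminal reward.
rewards_to_go = torch.zeros_like(step_rewards)
running = 0.0
for t in reversed(range(len(step_rewards))):
    running = step_rewards[t] + gamma * running
    rewards_to_go[t] = running

print(rewards_to_go)  # roughly [0.68, 0.59, 0.59, 0.30, 0.50]
```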

This might also be the "secret" OpenAI hinted at when referring to models spending more time thinking during inference. By exploring longer trajectories and making use of multi-step memory, the model could refine its reasoning in real time, leading to more accurate decision-making and more data-efficient learning.

In traditional RL, rewards are computed for each action within a trajectory, as in the PPO algorithm. The red-highlighted sections of the PPO pseudocode figure below illustrate how rewards-to-go are computed step by step. If OpenAI can apply this step-wise reward mechanism to LLMs, it could bridge the gap between classic RL domains and the sequence-level reward structure currently used for LLMs.


PPO Pseudocode (figure)
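For reference, here is a short, self-contained sketch of the step-wise computation that PPO-style pseudocode describes: per-step TD errors (Sutton, 1988) accumulated into advantages via Generalized Advantage Estimation, the estimator commonly paired with PPO. The function name and toy numbers are mine, not from any particular library:

```python
import torch

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards: per-step rewards, shape (T,)
    values:  value estimates, shape (T + 1,), including a bootstrap value
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD error at step t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + values[:T]  # targets for the value function
    return advantages, returns

# Toy usage with made-up numbers
rewards = torch.tensor([0.0, 0.0, 1.0])
values = torch.tensor([0.2, 0.4, 0.6, 0.0])  # last entry is the bootstrap value
adv, ret = compute_gae(rewards, values)
```

With today's sequence-level rewards, only the final step carries a non-zero reward, so these per-step quantities collapse toward a single terminal signal; with genuine step-wise rewards they would differ meaningfully at every token.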


To be clear, we don't know what OpenAI has actually built, since nothing specific has been disclosed. These are just my thoughts and speculation based on the limited information available; more details are needed to see the full picture.


Here are some references:

  1. Proximal Policy Optimization (PPO): Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347. https://arxiv.org/abs/1707.06347
  2. Temporal Difference (TD) Error: Sutton, R. S. (1988). Learning to Predict by the Methods of Temporal Differences. Machine Learning, 3(1), 9-44. https://link.springer.com/article/10.1007/BF00115009
  3. Deep Reinforcement Learning from Human Preferences: Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep Reinforcement Learning from Human Preferences. In Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/1706.03741
  4. Reinforcement Learning from Human Feedback (RLHF): Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. arXiv preprint arXiv:2203.02155. https://arxiv.org/abs/2203.02155


