Unraveling the Secret Behind ChatGPT's Success: A Deep Dive into Reinforcement Learning from Human Feedback (RLHF)
Kumar Saurav
Co-Founder & CTO @ Vodex.ai | Building Generative AI-Powered Virtual Sales Agents
Ever since OpenAI launched ChatGPT, there's been a buzz around the significant advancements in large language models (LLMs). Although ChatGPT is approximately the same scale as other top-tier LLMs, it outperforms them, offering the potential to introduce new uses or disrupt existing ones. One major factor behind ChatGPT's exceptional performance is its training method known as reinforcement learning from human feedback (RLHF). The application of RLHF predates the first GPT, and its initial use was not for natural language processing.
At its core, reinforcement learning is a machine learning domain where an agent learns a policy through its interaction with the environment. The agent performs actions that influence the environment, causing it to transition to a new state and provide a reward. These rewards serve as feedback signals, enabling the RL agent to adjust its action policy. As the agent undergoes training episodes, it refines its policy to take action sequences that maximize its reward.
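The loop described above (act, receive a reward, adjust the policy) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the environment is a made-up two-action game with no state transitions, the payout probabilities are invented, and the epsilon-greedy rule is one common choice rather than anything specific to ChatGPT.

```python
import random


def run_bandit(steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a hypothetical two-action environment."""
    rng = random.Random(seed)
    win_prob = [0.2, 0.8]  # hidden from the agent: chance that each action pays 1.0
    value = [0.0, 0.0]     # the agent's running estimate of each action's reward
    count = [0, 0]         # how many times each action has been taken

    for _ in range(steps):
        # Policy: usually pick the action that currently looks best,
        # but explore a random action with probability epsilon.
        if rng.random() < epsilon:
            action = rng.randrange(2)
        else:
            action = max(range(2), key=lambda a: value[a])

        # The environment answers the action with a reward.
        reward = 1.0 if rng.random() < win_prob[action] else 0.0

        # The reward is the feedback signal: update the estimate (incremental mean).
        count[action] += 1
        value[action] += (reward - value[action]) / count[action]

    return value, count


if __name__ == "__main__":
    value, count = run_bandit()
    print("estimated rewards:", value)
    print("times each action was chosen:", count)
```

After enough steps the agent should favor the second action, because its estimated reward ends up higher. That is the "refining the policy to maximize reward" idea in miniature.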
Designing an effective reward system is a critical challenge in reinforcement learning. Sometimes, the reward is significantly delayed. For instance, in chess, the RL agent only receives a positive reward after winning, which may require numerous moves. Other times, the reward can't be quantified using a mathematical or logical formula. This is where RLHF comes in: it enhances the RL agent's training by including human involvement in the training process, accounting for elements that can't be measured in the reward system.
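To make the delayed-reward problem concrete, here is a small sketch of one standard way RL systems spread a late reward back over the earlier moves: the discounted return. The article does not say which method any particular system uses, so treat the function name and the `gamma` value as assumptions for illustration.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return, for each step, the sum of future rewards discounted by gamma per step."""
    returns = []
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


if __name__ == "__main__":
    # A four-move "game" where only the final move is rewarded, as in the chess example.
    print(discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.5))
```

With `gamma=0.5`, the final move gets the full credit of 1.0 and each earlier move gets half the credit of the move after it, so early moves still receive a (smaller) learning signal.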
However, RLHF is not always practical because it does not scale well. Although machine learning generally scales with computational resources, human involvement in training RL systems becomes a bottleneck. Thus, most RLHF systems blend automated and human-provided reward signals, with the computational reward system providing primary feedback to the RL agent. The human supervisor then occasionally offers an extra reward/punishment signal or supplies the data required to train a reward model.
To illustrate, let's consider a robot designed to cook pizza. You can incorporate some measurable elements into the automated reward system, such as crust thickness and the amount of sauce and cheese. However, to ensure the pizza tastes good, a human needs to taste and score the pizzas the robot makes during the training phase.
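The pizza example can be sketched as a blended reward: an automated score computed from measurable properties, mixed with a human taste score whenever one is available. Everything specific here is hypothetical, including the target values, the `human_weight` of 0.5, and the function names; the point is only to show the shape of the idea.

```python
def automated_reward(crust_mm, sauce_g, cheese_g):
    """Score measurable properties against hypothetical targets; 1.0 is a perfect match."""
    targets = {"crust_mm": 5.0, "sauce_g": 80.0, "cheese_g": 120.0}
    measured = {"crust_mm": crust_mm, "sauce_g": sauce_g, "cheese_g": cheese_g}
    # Relative error per property, capped at 1.0 so one bad value cannot dominate.
    errors = [min(abs(measured[k] - t) / t, 1.0) for k, t in targets.items()]
    return 1.0 - sum(errors) / len(errors)


def blended_reward(auto_score, human_score=None, human_weight=0.5):
    """Mix the automated score with an occasional human score (both in [0, 1])."""
    if human_score is None:
        # Most of the time no human has tasted the pizza: use the automated signal alone.
        return auto_score
    return (1.0 - human_weight) * auto_score + human_weight * human_score


if __name__ == "__main__":
    auto = automated_reward(crust_mm=5.5, sauce_g=80.0, cheese_g=100.0)
    print("automated only:", blended_reward(auto))
    print("with a human taste score of 0.3:", blended_reward(auto, human_score=0.3))
```

Keeping the human score optional reflects the scaling problem described earlier: the automated signal is available for every pizza, while the human signal arrives only occasionally.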
Applying RLHF to Language Models
Large language models excel at various tasks, such as text summarization, question answering, text generation, code generation, and protein folding. They can even do zero- and few-shot learning, tackling tasks they have not been specifically trained for. Despite their remarkable achievements, LLMs are essentially extensive prediction machines designed to guess the next token in a sequence (the prompt). When trained on a large text corpus, LLMs develop a mathematical model that can generate long passages of text that are mostly coherent and consistent.
However, language presents a unique challenge: there are often multiple correct responses to a prompt. Moreover, not all of them are desirable, depending on the user, the application, and the context of the LLM. Fortunately, RLHF can guide LLMs in the right direction. When we frame language as an RL problem, the language model serves as the RL agent, the action space is the set of possible language outputs the LLM can generate, and the state space includes the user prompts and the LLM's outputs. The reward measures how closely the LLM's response aligns with the application's context and the user's intent.
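Here is a toy sketch of that framing. It assumes a token-level view (each action is the next token, and the state is the prompt plus the tokens generated so far), which is one common way to set the problem up. The vocabulary, `toy_policy`, and `toy_reward` are stand-ins invented for illustration; a real system would use an LLM as the policy and a learned reward model.

```python
import random

VOCAB = ["yes", "no", "maybe", "<eos>"]  # toy vocabulary (hypothetical)


def toy_policy(state, rng):
    """Stand-in for the LLM: picks the next token given the current state."""
    return rng.choice(VOCAB)


def toy_reward(prompt, response):
    """Stand-in for the reward: prefers short responses that contain 'yes'."""
    return (1.0 if "yes" in response else 0.0) - 0.1 * len(response)


def rollout(prompt, max_tokens=5, seed=0):
    """Generate one response (an episode) and score it once at the end."""
    rng = random.Random(seed)
    state = list(prompt)  # state = prompt tokens + tokens generated so far
    response = []
    for _ in range(max_tokens):
        action = toy_policy(state, rng)  # action = the next token
        if action == "<eos>":
            break
        response.append(action)
        state.append(action)  # the state transition is simply appending the token
    return response, toy_reward(prompt, response)


if __name__ == "__main__":
    response, score = rollout(["is", "it", "ok", "?"])
    print(response, score)
```

Note that the reward arrives only once the whole response is finished, which ties back to the delayed-reward problem discussed earlier.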
Here are some of the key techniques OpenAI has used to improve the ChatGPT model:
3. ChatGPT's Use of RLHF: ChatGPT uses the general RLHF framework described above, with a few modifications. In the first phase, the engineers performed "supervised fine-tuning" on a pre-trained GPT-3.5 model. They hired a group of human writers and asked them to write answers to a set of prompts. They used the resulting dataset of prompt-answer pairs to fine-tune the LLM. In the second phase, they created their reward model based on the standard procedure, generating multiple answers to prompts and having them ranked by human evaluators. In the final phase, they used the reward model to further train the LLM with reinforcement learning; OpenAI reports using the proximal policy optimization (PPO) algorithm for this step.
In the supervised learning phase, an initial model is trained on a large corpus of text data. This involves the use of powerful computer hardware and can take a significant amount of time, possibly on the order of weeks or months, depending on the size of the model and the computational resources available.
In the RLHF phase, human evaluators rank different responses to the same prompt, and these rankings are used to train a reward model that the main model then learns from. The process involves iteratively adjusting the main model so that its outputs score higher on the reward model. This also requires significant computational resources and time, and the involvement of human evaluators adds another layer of complexity.
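A minimal sketch of how rankings can become a reward model: each ranking is split into (preferred, rejected) pairs, and the model is trained so that the preferred response scores higher. The pairwise logistic loss used here is a common choice for this kind of training. The linear model, the two-number feature vectors, and all function names are assumptions made to keep the example small; a production reward model is a neural network that reads the actual text.

```python
import math
from itertools import combinations


def reward(weights, features):
    """Linear stand-in for a reward model: a weighted sum of response features."""
    return sum(w * x for w, x in zip(weights, features))


def ranking_to_pairs(ranked):
    """Turn a best-first ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked, 2))


def pairwise_loss(weights, preferred, rejected):
    """-log(sigmoid(r_preferred - r_rejected)): small when the preferred one scores higher."""
    margin = reward(weights, preferred) - reward(weights, rejected)
    return math.log1p(math.exp(-margin))


def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    """Plain gradient descent on the pairwise loss."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in pairs:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Gradient of the loss w.r.t. weights is -(1 - sigmoid(margin)) * (preferred - rejected).
            scale = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            for i in range(dim):
                weights[i] += lr * scale * (preferred[i] - rejected[i])
    return weights


if __name__ == "__main__":
    # Hypothetical features per response: [helpfulness, rudeness], ranked best-first by a human.
    ranked = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
    weights = train_reward_model(ranking_to_pairs(ranked), dim=2)
    print([round(reward(weights, f), 3) for f in ranked])
```

Once trained, the reward model can score new responses without a human in the loop, which is what lets the final reinforcement learning phase run at scale.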
However, the exact number of people involved and the time it takes can vary depending on many factors, including the resources available, the size and complexity of the model, and the specific training methodology used. The process is likely a significant undertaking involving a team of engineers and evaluators over a period of time. It should be noted that the involvement of human labor in training the model does create a bottleneck in the process, limiting the ability to scale up the training.