Unraveling the Secret Behind ChatGPT's Success: A Deep Dive into Reinforcement Learning from Human Feedback (RLHF)
Credit: huggingface.co


Ever since OpenAI launched ChatGPT, there's been a buzz around the significant advancements in large language models (LLMs). Although ChatGPT is approximately the same scale as other top-tier LLMs, it outperforms them, offering the potential to introduce new uses or disrupt existing ones. One major factor behind ChatGPT's exceptional performance is its training method known as reinforcement learning from human feedback (RLHF). The application of RLHF predates the first GPT, and its initial use was not for natural language processing.

At its core, reinforcement learning is a machine learning domain where an agent learns a policy through its interaction with the environment. The agent performs actions that influence the environment, causing it to transition to a new state and provide a reward. These rewards serve as feedback signals, enabling the RL agent to adjust its action policy. As the agent undergoes training episodes, it refines its policy to take action sequences that maximize its reward.

Designing an effective reward system is a critical challenge in reinforcement learning. Sometimes, the reward is significantly delayed. For instance, in chess, the RL agent only receives a positive reward after winning, which may require numerous moves. Other times, the reward can't be quantified using a mathematical or logical formula. This is where RLHF comes in - it enhances the RL agent's training by including human involvement in the training process, accounting for elements that can't be measured in the reward system.

However, RLHF is not always practical because it does not scale well. Although machine learning generally scales with computational resources, human involvement in training RL systems becomes a bottleneck. Thus, most RLHF systems blend automated and human-provided reward signals, with the computational reward system providing primary feedback to the RL agent. The human supervisor then occasionally offers an extra reward/punishment signal or supplies the data required to train a reward model.

To illustrate, let's consider a robot designed to cook pizza. You can incorporate some measurable elements into the automated reward system, such as crust thickness and the amount of sauce and cheese. However, to ensure the pizza tastes good, a human needs to taste and score the pizzas the robot makes during the training phase.
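To make the idea of blending automated and human-provided reward signals concrete, here is a minimal Python sketch. The pizza attributes, thresholds, and weighting below are hypothetical illustrations, not taken from any real system:

    # Hypothetical sketch: combining an automated reward with occasional human feedback.
    def automated_reward(pizza):
        """Score measurable properties of the pizza (illustrative thresholds only)."""
        score = 0.0
        score += 1.0 if 0.4 <= pizza["crust_thickness_cm"] <= 1.2 else -1.0
        score += 1.0 if 80 <= pizza["sauce_grams"] <= 120 else -0.5
        score += 1.0 if 100 <= pizza["cheese_grams"] <= 150 else -0.5
        return score

    def combined_reward(pizza, human_taste_score=None, human_weight=2.0):
        """Automated signal every step; a human taste score only when one is available."""
        reward = automated_reward(pizza)
        if human_taste_score is not None:  # e.g. a 0-10 rating from a human taster
            reward += human_weight * (human_taste_score / 10.0)
        return reward

    # Most training steps use only the automated signal...
    print(combined_reward({"crust_thickness_cm": 0.8, "sauce_grams": 95, "cheese_grams": 120}))
    # ...while occasionally a human taster adds feedback.
    print(combined_reward({"crust_thickness_cm": 0.8, "sauce_grams": 95, "cheese_grams": 120},
                          human_taste_score=7))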

Applying RLHF to Language Models

Large language models excel at various tasks, such as text summarization, question answering, text generation, code generation, and even protein folding. They can also do zero- and few-shot learning, tackling tasks they have not been specifically trained for. Despite their remarkable achievements, LLMs are essentially extensive prediction machines: given a sequence of tokens (the prompt and any text generated so far), they predict the next token. When trained on a large text corpus, LLMs develop a mathematical model that can generate long passages of text that are mostly coherent and consistent.
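For intuition, the prediction loop looks roughly like this when run with an off-the-shelf pre-trained model through the Hugging Face transformers library; "gpt2" is used here only as a small, convenient example model, not the model behind ChatGPT:

    # Illustrative next-token generation with a small pre-trained causal language model.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")   # example model, for illustration only
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "Reinforcement learning from human feedback is"
    inputs = tokenizer(prompt, return_tensors="pt")

    # The model repeatedly predicts the most likely next token, extending the prompt.
    output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=False)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))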

However, language presents a unique challenge: there are often multiple correct responses to a prompt, and not all of them are desirable, depending on the user, the application, and the context in which the LLM is used. Fortunately, RLHF can guide LLMs in the right direction. When we frame language generation as an RL problem, the language model serves as the RL agent, the action space is the set of possible outputs the LLM can generate, and the state space includes the user prompts and the LLM's outputs so far. The reward measures how closely the LLM's response aligns with the application's context and the user's intent.
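Expressed as code, the mapping looks roughly like the toy sketch below; the policy and reward model are simple stand-ins introduced only for illustration:

    # Hypothetical sketch of language generation framed as an RL problem.
    # state  = the prompt (plus any text generated so far)
    # action = the response the LLM emits
    # reward = how well the response matches human preferences

    def llm_policy(prompt: str) -> str:
        """Stand-in for the LLM (the RL agent): maps a state (prompt) to an action (response)."""
        return "Paris is the capital of France."

    def reward_model(prompt: str, response: str) -> float:
        """Stand-in for a learned preference model: scores how well the response fits the prompt."""
        return 1.0 if "Paris" in response else -1.0

    # One RL "episode": state -> action -> reward.
    prompt = "What is the capital of France?"
    response = llm_policy(prompt)             # the agent acts in state `prompt`
    reward = reward_model(prompt, response)   # feedback signal used to update the policy
    print(response, reward)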

Here are some of the key techniques OpenAI has used to improve the ChatGPT model:

  1. Reinforcement Learning from Human Feedback (RLHF): As described above, RLHF is one of the main reasons behind ChatGPT's performance. It augments standard reinforcement learning, in which an agent learns a policy from reward signals returned by its environment, with human feedback that captures qualities a hand-crafted reward function cannot express (just as a human might taste and score the pizzas a robot cooks during training). Because human feedback scales poorly, most RLHF systems combine an automated reward signal with a human supervisor who occasionally provides an extra reward/punishment signal or supplies the data needed to train a reward model.
  2. RLHF for Large Language Models (LLMs): RLHF for language models consists of three phases:

  • Phase 1: Start with a pre-trained language model. Training large language models from scratch with human feedback is virtually impossible. A language model that is pre-trained through unsupervised learning will already have a solid model of the language and will create coherent outputs, though some or many of them might not align with the goals and intents of users.
  • Phase 2: Create a reward model for the RL system. In this phase, a second machine learning model is trained to take the text generated by the main model and produce a quality score. This model is usually another LLM, modified to output a scalar value instead of a sequence of text tokens. To train it, a dataset of LLM-generated text labeled for quality is created: the main LLM is given a prompt, it generates several outputs, and human evaluators rank the generated responses from best to worst. The reward model is then trained to predict the ranking score from the LLM's text, giving it a mathematical representation of human preferences (a minimal sketch of the ranking objective used for this appears after this list).
  • Phase 3: Create the reinforcement learning loop. A copy of the main LLM becomes the RL agent. In each training episode, the LLM takes several prompts from a training dataset and generates text. Its output is then passed to the reward model, which provides a score that evaluates its alignment with human preferences. The LLM is then updated to create outputs that score higher on the reward model.

  3. ChatGPT's Use of RLHF: ChatGPT uses the general RLHF framework described above, with a few modifications. In the first phase, the engineers performed "supervised fine-tuning" on a pre-trained GPT-3.5 model: they hired a group of human writers, asked them to write answers to a set of prompts, and used the resulting dataset of prompt-answer pairs to fine-tune the LLM. In the second phase, they created the reward model following the standard procedure, generating multiple answers to each prompt and having human evaluators rank them.
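As mentioned in Phase 2 above, the reward model is trained so that responses humans ranked higher receive higher scalar scores. A common way to express this, for instance in InstructGPT-style training, is a pairwise ranking loss; the following is a minimal PyTorch sketch in which the score tensors stand in for the reward model's outputs on preferred and less-preferred responses to the same prompts (the numbers are made up):

    # Minimal sketch of the pairwise ranking loss commonly used to train RLHF reward models.
    import torch
    import torch.nn.functional as F

    score_chosen = torch.tensor([1.3, 0.2, 2.1])     # reward model scores for preferred responses
    score_rejected = torch.tensor([0.4, -0.5, 1.8])  # scores for the less-preferred responses

    # The loss is low when each preferred response scores higher than its rejected counterpart.
    loss = -F.logsigmoid(score_chosen - score_rejected).mean()
    print(loss.item())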

In the supervised learning phase, an initial model is trained on a large corpus of text data. This involves the use of powerful computer hardware and can take a significant amount of time, possibly on the order of weeks or months, depending on the size of the model and the computational resources available.

In the RLHF phase, human evaluators rank different responses to prompts, and these rankings are used to create a reward model that the main model learns from. The process involves iteratively adjusting the model to produce outputs that score higher on the reward model. This also requires significant computational resources and time, and the involvement of human evaluators adds another layer of complexity.
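The iterative loop described here can be summarized with the toy sketch below. Every component is a stand-in, and the policy update is left as a placeholder; in practice this step uses a policy-gradient method such as PPO, typically with a penalty that keeps the fine-tuned model from drifting too far from the original:

    # Toy sketch of the RLHF fine-tuning loop; all components are simple stand-ins.
    import random

    prompts = ["Explain RLHF briefly.", "Summarize this article.", "Write a haiku about pizza."]

    def policy_generate(prompt):
        """Stand-in for the LLM being fine-tuned (the RL agent)."""
        return f"Draft answer to: {prompt}"

    def reward_model(prompt, response):
        """Stand-in for the learned preference model: returns a scalar score."""
        return random.uniform(-1.0, 1.0)

    def update_policy(prompt, response, reward):
        """Placeholder for a policy-gradient step (e.g. PPO) that raises the
        probability of high-reward responses."""
        pass

    for step in range(3):                        # each iteration ~ one training episode
        prompt = random.choice(prompts)
        response = policy_generate(prompt)       # the agent acts
        reward = reward_model(prompt, response)  # alignment score from the reward model
        update_policy(prompt, response, reward)  # nudge the LLM toward higher-scoring outputs
        print(step, round(reward, 2), response)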

However, the exact number of people involved and the time it takes can vary depending on many factors, including the resources available, the size and complexity of the model, and the specific training methodology used. The process is likely a significant undertaking involving a team of engineers and evaluators over a period of time. It should be noted that the involvement of human labor in training the model does create a bottleneck in the process, limiting the ability to scale up the training.



