Google's Training Language Models to Self-Correct via Reinforcement Learning & Iteration of Thought - Autonomous Large Language Model Reasoning
Aditi Khare
AWS & AI Research Specialist - Principal Machine Learning Scientist & AI Architect | IIM-A | Author | AI Research [Portfolio] Build Production-Grade AI Products from Scratch | Vision Transformers | Open-Source Contributor
#ai #airesearch #airesearchpapers #genai #rl #llm
Google's Training Language Models to Self-Correct via Reinforcement Learning
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision.
This paper develops SCoRe, a multi-turn online reinforcement learning (RL) approach that significantly improves an LLM’s self-correction ability using entirely self-generated data. To build SCoRe, the authors first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior.
In particular, they observe that training via SFT either suffers from a distribution mismatch between the training data and the model’s own responses, or implicitly prefers only a certain mode of correction behavior that is often not effective at test time.
SCoRe addresses these challenges by training under the model’s own distribution of self-generated correction traces and by using appropriate regularization to steer the learning process toward a self-correction strategy that is effective at test time, as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on the base model to generate a policy initialization that is less susceptible to collapse, and then using a reward bonus to amplify self-correction during training.
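To make the two-stage recipe above concrete, here is a minimal sketch of how a shaped reward for a single two-attempt rollout could be computed. It is not the paper's implementation; the function name, the KL-penalty weight beta, the bonus weight alpha, and the stage flag are illustrative assumptions.

```python
# Illustrative sketch of SCoRe-style training rewards for a two-attempt rollout.
# All names and coefficients are assumptions for exposition, not the paper's code.

def shaped_reward(r1: float, r2: float, kl_turn1: float,
                  alpha: float = 1.0, beta: float = 0.1,
                  stage: int = 2) -> float:
    """Return a scalar training reward for one (attempt-1, attempt-2) rollout.

    r1, r2   : task rewards for the first and second attempts
               (e.g. answer correctness in {0, 1}).
    kl_turn1 : KL divergence of the first-attempt policy from the base model.
    """
    if stage == 1:
        # Stage I: optimize only the second attempt while a KL penalty keeps
        # the first-attempt distribution close to the base model, giving a
        # policy initialization that is less susceptible to collapse.
        return r2 - beta * kl_turn1
    # Stage II: multi-turn RL on both attempts, plus a shaping bonus that pays
    # extra for genuine correction (r2 > r1) and penalizes regressions
    # (r2 < r1), biasing training toward learning a self-correction strategy
    # rather than producing a good first answer and leaving it unchanged.
    return r1 + r2 + alpha * (r2 - r1)
```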
When applied to Gemini 1.0 Pro and 1.5 Flash models, SCoRe achieves state-of-the-art self-correction performance, improving the base models’ self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.
Large language models (LLMs) have proven to be a useful tool in reasoning and scientific domains such as mathematical problem-solving and coding. An aspirational property of LLMs in such settings is to be able to implement algorithms: strategies that help the LLM use computation and interaction to improve its response to the test-time query. Modern LLMs largely do not implement algorithms reliably: for instance, consider a problem setting that requires models to detect and revise (or “self-correct”) their own responses to a given test-time query, so as to eventually arrive at the best possible final response.
This sort of self-correction capability has been shown by several recent works to be severely lacking in current LLMs, especially in the absence of external input (also referred to as intrinsic self-correction). To make progress towards the eventual goal of teaching LLMs to implement algorithms to handle challenging inputs, the paper studies a special instance: training LLMs to implement self-correction strategies that fix their mistakes “on-the-fly”.
This should be possible because, on many queries where current LLMs fail, they still contain the underlying “knowledge” needed to arrive at the correct response but are unable to correctly elicit it and draw inferences about their own knowledge when needed.
For example, strong LLMs can often successfully complete a sub-part of a math proof when prompted with the remainder, but may not be able to complete it from scratch. In a similar vein, leveraging their previous responses should, in principle, enable LLMs to improve their subsequent ones.
Prompting for intrinsic self-correction. Recent work demonstrates that LLMs struggle to self-correct their reasoning errors without external feedback, and that naïvely running self-correction can degrade performance.
For example, some studies use oracle ground-truth answers during self-correction that may not be available in general, or use weak prompts for initial responses, thereby perhaps overestimating the improvement possible through self-correction. This indicates that there is no major work showing successful intrinsic self-correction via prompting alone. In the context of code self-repair, it has been shown that even when strong models are prompted with some form of partial feedback (e.g., showing test cases but not the desired outcomes on those test cases), they are often unable to correct their mistakes; sampling multiple responses in parallel attains much better results.
Fine-tuning for intrinsic self-correction. To address the issues with prompting off-the-shelf models alone, several works run supervised fine-tuning (SFT) or weighted SFT on the LLM to generate a revision given an initial response. Nonetheless, typical works in this literature rely on oracle feedback: e.g., obtaining revisions directly from human annotators (Saunders et al., 2022) or from stronger models.
This paper aims to train for self-correction entirely without the use of bigger models or humans: the learner itself is asked to generate its own training data. Similar to these prior works, the authors assume access to a reward function for evaluating model-generated outputs. Other approaches build pipelines with multiple models for self-correction (e.g., Self-Correct). While this can lead to good results, such pipelines do not quite tackle intrinsic self-correction and require system design for serving multiple models at deployment.
Qualitative Analysis of SCoRe - The paper performs a qualitative investigation into how SCoRe addresses the self-repair shortcomings of base LLMs and provides several examples in Appendix B. SCoRe is able to refine its own responses in a variety of ways: rewriting the entire solution when necessary, or reproducing the correct parts of the solution while revising the incorrect ones.
SCoRe is especially adept at revising its computational mistakes, and it even demonstrates a bias towards showing more steps in certain computations and manipulations in order to increase its probability of producing a correct answer. Additionally, it is observed that the model occasionally learns to self-correct within a turn.
Summary -
In this work, the authors investigated how to imbue LLMs with a self-correction strategy that enables them to correct their own responses on the fly at test time.
The paper proposes SCoRe, a multi-turn online reinforcement learning (RL) approach for training language models to correct their own mistakes, and demonstrates through extensive evaluations that it is the first method to attain significantly positive intrinsic self-correction performance. To motivate the design of SCoRe, the authors rigorously analyze the behavior of various fine-tuning baselines and identify failure modes in which the model learns a non-correcting strategy (e.g., learning to make no edits) under these approaches.
SCoRe is designed to elicit a self-correcting strategy by utilizing a two-stage structure and reward shaping, both of which help prevent the model from collapsing into behavior that makes no effective self-improvement.
References -
Reference Reading Link -
Iteration of Thought - Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning -
Iterative human engagement is a common and effective means of leveraging the advanced language processing power of large language models (LLMs). Using well-structured prompts in a conversational manner, human users can effectively influence an LLM to develop more thoughtful and accurate responses.
This paper proposes the Iteration of Thought (IoT) framework for enhancing LLM responses by generating "thought"-provoking prompts vis-à-vis an input query and the current iteration of an LLM's response. Unlike static or semi-static approaches, e.g. Chain of Thought (CoT) or Tree of Thoughts (ToT), IoT adapts its reasoning path dynamically based on evolving context, and without generating alternate explorative thoughts that are ultimately discarded. The three components of the IoT framework are:
(1) an Inner Dialogue Agent (IDA), responsible for generating instructive, context-specific prompts;
(2) an LLM Agent (LLMA) that processes these prompts to refine its responses; and
(3) an iterative prompting loop that implements a conversation between the former two components.
This paper introduces two variants of the framework -
Autonomous Iteration of Thought (AIoT), where the LLM decides when to stop iterating, and Guided Iteration of Thought (GIoT), which always forces a fixed number of iterations. The authors investigate the performance of IoT across various datasets, spanning complex reasoning tasks from the GPQA dataset, explorative problem-solving in Game of 24, puzzle solving in Mini Crosswords, and multi-hop question answering from the HotpotQA dataset.
The results show that IoT represents a viable paradigm for autonomous response refinement in LLMs, showcasing significant improvements over CoT and thereby enabling more adaptive and efficient reasoning systems that minimize human intervention.
A human user’s interaction with an LLM often proceeds as follows - The user poses a question to the LLM, receives an initial response, and, if the answer is incomplete or suboptimal, provides additional guidance to the LLM by reiterating contextual clues (e.g. by reminding the LLM of its role, suggesting additional information to consider, or highlighting specific parts of the response that need refinement).
This back-and-forth process helps narrow the focus of the LLM while reducing the research effort required from the user, since the LLM is responsible for the bulk of the reasoning and information retrieval.
IoT utilizes an Inner Dialogue Agent (IDA) to adjust and refine its reasoning path during each iteration. This enables adaptive exploration across different reasoning trees, fostering a more flexible and context-aware response generation process. A comparison to existing methods is shown schematically in Figure 1.
The core IoT framework is composed of three main components (further details are provided in Section 2 of the paper):
• Inner dialogue agent (IDA): The IDA functions as a "guide" that dynamically generates context-sensitive prompts based on the original user query and the LLM’s previous response. The adjusted prompts serve to iteratively lead the LLM toward more refined and accurate answers. Mathematically, the IDA can be represented as a function C : Q × R × K′ → P, where Q is the space of possible queries, R is the space of potential LLM responses, K′ is the IDA’s own internal knowledge, and P is the space of generated prompts. At each step, it takes the current query q ∈ Q and the previous response r ∈ R to generate a new prompt p ∈ P. This process makes prompt generation dynamic, differentiating IoT from more rigid approaches like CoT and allowing it to adapt to an evolving context.
• LLM agent (LLMA): The LLMA embodies the core reasoning capabilities of an LLM and processes the IDA’s dynamically generated prompts. It uses the LLM’s internal knowledge base K to refine its responses. Formally, the LLMA can be modeled as a function L : Q × P × K → R: it takes as input a query q, a prompt p, and a knowledge base K, then generates a refined response r. The LLMA also identifies areas of uncertainty or gaps in its own reasoning, providing feedback for the IDA to adjust prompts accordingly. This interaction creates a closed-loop system that continuously improves the quality of answers without external inputs.
• Iterative prompting loop: The iterative process in IoT involves a back-and-forth between the IDA and LLMA. At each iteration i, the IDA generates a new prompt pᵢ = C(q, rᵢ₋₁) based on the original query q and the LLM’s previous response rᵢ₋₁. The LLMA then responds to pᵢ with rᵢ = L(q, pᵢ, K). This loop continues until a satisfactory answer r* is found or the maximum iteration count is reached.
This back-and-forth approach allows IoT to navigate complex reasoning paths and efficiently explore various potential solutions. Moreover, introducing distinct LLMs for the IDA and LLMA respectively can allow each agent to function as an open system where internal knowledge is exchanged. In this scenario, the overall system behaves as a closed system with a combined knowledge base, enhancing internal reasoning without external input.
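The loop just described is simple to prototype. The following is a minimal sketch under stated assumptions: call_llm stands in for any chat-completion call, the prompt wording and the "FINAL:" stopping convention are invented for illustration, and the autonomous flag switches between AIoT-style (model-decided) and GIoT-style (fixed-count) termination. It is not the authors' implementation.

```python
# Minimal sketch of the IoT loop: the Inner Dialogue Agent (IDA) proposes a new
# prompt pᵢ from (q, rᵢ₋₁); the LLM Agent (LLMA) answers it to produce rᵢ.
# `call_llm`, the prompt wording, and the "FINAL:" convention are assumptions.
from typing import Callable


def iterate_of_thought(query: str,
                       call_llm: Callable[[str], str],
                       max_iterations: int = 5,
                       autonomous: bool = True) -> str:
    response = call_llm(query)  # initial LLMA answer r₀
    for _ in range(max_iterations):
        # IDA step: build a context-specific guiding prompt, pᵢ = C(q, rᵢ₋₁).
        guidance = call_llm(
            "You are an inner dialogue agent. Given the question and the "
            "current draft answer, write one instructive prompt that would "
            f"help refine the draft.\nQuestion: {query}\nDraft: {response}"
        )
        # LLMA step: refine the answer using the guidance, rᵢ = L(q, pᵢ, K).
        response = call_llm(
            f"Question: {query}\nGuidance: {guidance}\n"
            "Revise your answer. If you are confident the answer is complete, "
            "begin your reply with 'FINAL:'."
        )
        # AIoT: the model signals termination; GIoT: always run all iterations.
        if autonomous and response.startswith("FINAL:"):
            return response.removeprefix("FINAL:").strip()
    return response
```

With autonomous=True the LLMA's own signal ends the loop (AIoT-style); with autonomous=False the loop always runs max_iterations times (GIoT-style).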
Summary -
This paper introduces the Iteration of Thought (IoT) framework, in which an Inner Dialogue Agent (IDA) iteratively converses with an LLM Agent (LLMA) to perform various complex reasoning tasks, such as solving puzzles (Game of 24, Mini Crosswords) and answering difficult questions (GPQA, HotpotQA).
The authors employed two variants of this framework in their experiments, qualified as "autonomous" (AIoT) and "guided" (GIoT) respectively, to compare iteration-terminating mechanisms across these tasks. GIoT, the variant that always performs a fixed number of iterations, was seen to perform better than AIoT, the variant that self-determines termination, on Game of 24.
On the other hand, AIoT had superior performance on GPQA. Both variants performed similarly on Mini Crosswords and always performed better than the well-known Chain of Thought (CoT) framework wherever compared.
This paper also compares the IoT framework against the hierarchical AgentLite framework on the multi-context HotpotQA task, finding improvements of approximately 35% in F1 score and 44% in EM (exact match) score over AgentLite. Altogether, the results demonstrate that IoT can successfully introduce productive dynamism into low-complexity agentic frameworks.
Determining the scale and diversity of the IDA’s knowledge base represents a promising direction for future work aiming to maximize the real-world utility of IoT. In pursuit of strictly framework-to-framework comparisons, the authors used only off-the-shelf, general-purpose LLMs in all experiments to establish IoT. Moving forward, specialized language models, such as fine-tuned LLMs or LLMs equipped with additional tools and/or data sources, could yield further performance gains, whether by increasing the effective knowledge base or by directly addressing challenges like hallucination and the premature termination of iterations.
References -
Reference Reading Link -
Paper Reading Link -
Github Link -
For more information on AI Research Papers you can visit my GitHub Profile -
For receiving the latest updates on advancements in AI Research, Gen-AI, Quantum AI & Computer Vision, you can subscribe to my AI Research Papers Summaries Newsletter using the link below -
Thank you & Happy Reading !!