Google's Training Language Models to Self-Correct via Reinforcement Learning & Iteration of Thought - Autonomous Large Language Model Reasoning
Aditi Khare
AWS & AI Research Specialist - Principal Machine Learning Scientist & AI Architect | IIM-A | Author | AI Research [Portfolio] Build Production-Grade AI Products from Scratch | Vision Transformers | Open-Source Contributor
#ai #airesearch #airesearchpapers #genai #rl #llm
Google's Training Language Models to Self-Correct via Reinforcement Learning
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision.
This paper develops SCoRe, a multi-turn online reinforcement learning (RL) approach that significantly improves an LLM’s self-correction ability using entirely self-generated data. To build SCoRe, the authors first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior.
In particular, they observe that training via SFT either suffers from a distribution mismatch between the training data and the model’s own responses, or implicitly prefers only a certain mode of correction behavior that is often not effective at test time.
SCoRe addresses these challenges by training under the model’s own distribution of self-generated correction traces and by using appropriate regularization to steer the learning process toward a self-correction strategy that is effective at test time, as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on the base model to generate a policy initialization that is less susceptible to collapse, and then using a reward bonus to amplify self-correction during training.
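To make the two-stage recipe above concrete, here is a minimal sketch of how a shaped reward for a single two-attempt rollout could be computed. It is not the paper's implementation; the function name, the KL-penalty weight beta, the bonus weight alpha, and the stage flag are illustrative assumptions.

```python
# Illustrative sketch of SCoRe-style training rewards for a two-attempt rollout.
# All names and coefficients are assumptions for exposition, not the paper's code.

def shaped_reward(r1: float, r2: float, kl_turn1: float,
                  alpha: float = 1.0, beta: float = 0.1,
                  stage: int = 2) -> float:
    """Return a scalar training reward for one (attempt-1, attempt-2) rollout.

    r1, r2   : task rewards for the first and second attempts
               (e.g. answer correctness in {0, 1}).
    kl_turn1 : KL divergence of the first-attempt policy from the base model.
    """
    if stage == 1:
        # Stage I: optimize only the second attempt while a KL penalty keeps
        # the first-attempt distribution close to the base model, giving a
        # policy initialization that is less susceptible to collapse.
        return r2 - beta * kl_turn1
    # Stage II: multi-turn RL on both attempts, plus a shaping bonus that pays
    # extra for genuine correction (r2 > r1) and penalizes regressions
    # (r2 < r1), biasing training toward learning a self-correction strategy
    # rather than producing a good first answer and leaving it unchanged.
    return r1 + r2 + alpha * (r2 - r1)
```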
When applied to Gemini 1.0 Pro and 1.5 Flash models, SCoRe achieves state-of-the-art self-correction performance, improving the base models’ self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.
Large language models (LLMs) have proven to be a useful tool in reasoning and scientific domains such as mathematical problem-solving and coding. An aspirational property of LLMs in such settings is to be able to implement algorithms: strategies that help the LLM use computation and interaction to improve its response to the test-time query. Modern LLMs largely do not implement algorithms reliably: for instance, consider a problem setting that requires models to detect and revise (or “self-correct”) their own responses to a given test-time query, so as to eventually arrive at the best possible final response.
This sort of self-correction capability has been shown by several recent works to be severely lacking in current LLMs, especially in the absence of external input (also referred to as intrinsic self-correction). To make progress towards the eventual goal of teaching LLMs to implement algorithms to handle challenging inputs, the paper studies a special instance: training LLMs to implement self-correction strategies that fix their mistakes “on-the-fly”.
This should be possible because, on many queries where current LLMs fail, they still contain the underlying “knowledge” needed to arrive at the correct response but are unable to correctly elicit it and draw inferences about their own knowledge when needed.
For example, strong LLMs can often successfully complete a sub-part of a math proof when prompted with the remainder, but may not be able to complete it from scratch. In a similar vein, leveraging their previous responses should, in principle, enable LLMs to improve their subsequent ones.
Prompting for intrinsic self-correction. Recent work demonstrates that LLMs struggle to self-correct their reasoning errors without external feedback, and that naïvely running self-correction can degrade performance.
For example, some studies use oracle ground-truth answers during self-correction that may not be available in general, or use weak prompts for initial responses, thereby perhaps overestimating the improvement possible through self-correction. This indicates that there is no major work showing successful intrinsic self-correction via prompting alone. In the context of code self-repair, it has been shown that even when strong models are prompted with some form of partial feedback (e.g., showing test cases but not the desired outcomes on those test cases), they are often unable to correct their mistakes; sampling multiple responses in parallel attains much better results.
Fine-tuning for intrinsic self-correction. To address the issues with prompting off-the-shelf models alone, several works run supervised fine-tuning (SFT) or weighted SFT on the LLM to generate a revision given an initial response. Nonetheless, typical works in this literature rely on oracle feedback: e.g., obtaining revisions directly from human annotators (Saunders et al., 2022) or from stronger models.
This paper aims to train for self-correction entirely without the use of bigger models or humans: the learner itself is asked to generate its own training data. Similar to these prior works, the authors assume access to a reward function for evaluating model-generated outputs. Other approaches build pipelines with multiple models for self-correction (e.g., Self-Correct). While this can lead to good results, such pipelines do not quite tackle intrinsic self-correction and require system design for serving multiple models at deployment.
Qualitative Analysis of SCoRe - The paper performs a qualitative investigation into how SCoRe addresses the self-repair shortcomings of base LLMs and provides several examples in Appendix B. SCoRe is able to refine its own responses in a variety of ways: rewriting the entire solution when necessary, or reproducing the correct parts of the solution while revising the incorrect ones.
SCoRe is especially adept at revising its computational mistakes, and it even demonstrates a bias towards showing more steps in certain computations and manipulations in order to increase its probability of producing a correct answer. Additionally, it is observed that the model occasionally learns to self-correct within a turn.
Summary -
In this work, the authors investigated how to imbue LLMs with a self-correction strategy that enables them to correct their own responses on the fly at test time.
The paper proposes SCoRe, a multi-turn online reinforcement learning (RL) approach for training language models to correct their own mistakes, and demonstrates through extensive evaluations that it is the first method to attain significantly positive intrinsic self-correction performance. To motivate the design of SCoRe, the authors rigorously analyze the behavior of various fine-tuning baselines and identify failure modes in which the model learns a non-correcting strategy (e.g., learning to make no edits) under these approaches.
SCoRe is designed to elicit a self-correcting strategy by utilizing a two-stage structure and reward shaping, both of which help prevent the model from collapsing into behavior that makes no effective self-improvement.
References -
Reference Reading Link -
Iteration of Thought - Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning -
Iterative human engagement is a common and effective means of leveraging the advanced language processing power of large language models (LLMs). Using well-structured prompts in a conversational manner, human users can effectively influence an LLM to develop more thoughtful and accurate responses.
This paper proposes the Iteration of Thought (IoT) framework for enhancing LLM responses by generating "thought"-provoking prompts vis-à-vis an input query and the current iteration of an LLM's response. Unlike static or semi-static approaches, e.g. Chain of Thought (CoT) or Tree of Thoughts (ToT), IoT adapts its reasoning path dynamically based on evolving context, and without generating alternate explorative thoughts that are ultimately discarded. The three components of the IoT framework are:
(1) an Inner Dialogue Agent (IDA), responsible for generating instructive, context-specific prompts;
(2) an LLM Agent (LLMA) that processes these prompts to refine its responses; and
(3) an iterative prompting loop that implements a conversation between the former two components.
This paper introduces two variants of the framework -
Autonomous Iteration of Thought (AIoT), where the LLM decides when to stop iterating, and Guided Iteration of Thought (GIoT), which always forces a fixed number of iterations. The authors investigate the performance of IoT across various datasets, spanning complex reasoning tasks from the GPQA dataset, explorative problem-solving in Game of 24, puzzle solving in Mini Crosswords, and multi-hop question answering from the HotpotQA dataset.
The results show that IoT represents a viable paradigm for autonomous response refinement in LLMs, showcasing significant improvements over CoT and thereby enabling more adaptive and efficient reasoning systems that minimize human intervention.
A human user’s interaction with an LLM often proceeds as follows - The user poses a question to the LLM, receives an initial response, and, if the answer is incomplete or suboptimal, provides additional guidance to the LLM by reiterating contextual clues (e.g. by reminding the LLM of its role, suggesting additional information to consider, or highlighting specific parts of the response that need refinement).
This back-and-forth process helps narrow the focus of the LLM while reducing the research effort required from the user, since the LLM is responsible for the bulk of the reasoning and information retrieval.
IoT utilizes an Inner Dialogue Agent (IDA) to adjust and refine its reasoning path during each iteration. This enables adaptive exploration across different reasoning trees, fostering a more flexible and context-aware response generation process. A comparison to existing methods is shown schematically in Figure 1.
The core IoT framework is composed of three main components (further details are provided in Section 2 of the paper):
• Inner dialogue agent (IDA): The IDA functions as a "guide" that dynamically generates context-sensitive prompts based on the original user query and the LLM’s previous response. The adjusted prompts serve to iteratively lead the LLM toward more refined and accurate answers. Mathematically, the IDA can be represented as a function C : Q × R × K′ → P, where Q is the space of possible queries, R is the space of potential LLM responses, K′ is the IDA’s own internal knowledge, and P is the space of generated prompts. At each step, it takes the current query q ∈ Q and the previous response r ∈ R to generate a new prompt p ∈ P. This process makes prompt generation dynamic, differentiating IoT from more rigid approaches like CoT and allowing it to adapt to an evolving context.
• LLM agent (LLMA): The LLMA embodies the core reasoning capabilities of an LLM and processes the IDA’s dynamically generated prompts. It uses the LLM’s internal knowledge base K to refine its responses. Formally, the LLMA can be modeled as a function L : Q × P × K → R: it takes as input a query q, a prompt p, and a knowledge base K, then generates a refined response r. The LLMA also identifies areas of uncertainty or gaps in its own reasoning, providing feedback for the IDA to adjust prompts accordingly. This interaction creates a closed-loop system that continuously improves the quality of answers without external inputs.
• Iterative prompting loop: The iterative process in IoT involves a back-and-forth between the IDA and LLMA. At each iteration i, the IDA generates a new prompt pᵢ = C(q, rᵢ₋₁) based on the original query q and the LLM’s previous response rᵢ₋₁. The LLMA then responds to pᵢ with rᵢ = L(q, pᵢ, K). This loop continues until a satisfactory answer r* is found or the maximum iteration count is reached.
This back-and-forth approach allows IoT to navigate complex reasoning paths and efficiently explore various potential solutions. Moreover, introducing distinct LLMs for the IDA and LLMA respectively can allow each agent to function as an open system where internal knowledge is exchanged. In this scenario, the overall system behaves as a closed system with a combined knowledge base, enhancing internal reasoning without external input.
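The loop just described is simple to prototype. The following is a minimal sketch under stated assumptions: call_llm stands in for any chat-completion call, the prompt wording and the "FINAL:" stopping convention are invented for illustration, and the autonomous flag switches between AIoT-style (model-decided) and GIoT-style (fixed-count) termination. It is not the authors' implementation.

```python
# Minimal sketch of the IoT loop: the Inner Dialogue Agent (IDA) proposes a new
# prompt pᵢ from (q, rᵢ₋₁); the LLM Agent (LLMA) answers it to produce rᵢ.
# `call_llm`, the prompt wording, and the "FINAL:" convention are assumptions.
from typing import Callable


def iterate_of_thought(query: str,
                       call_llm: Callable[[str], str],
                       max_iterations: int = 5,
                       autonomous: bool = True) -> str:
    response = call_llm(query)  # initial LLMA answer r₀
    for _ in range(max_iterations):
        # IDA step: build a context-specific guiding prompt, pᵢ = C(q, rᵢ₋₁).
        guidance = call_llm(
            "You are an inner dialogue agent. Given the question and the "
            "current draft answer, write one instructive prompt that would "
            f"help refine the draft.\nQuestion: {query}\nDraft: {response}"
        )
        # LLMA step: refine the answer using the guidance, rᵢ = L(q, pᵢ, K).
        response = call_llm(
            f"Question: {query}\nGuidance: {guidance}\n"
            "Revise your answer. If you are confident the answer is complete, "
            "begin your reply with 'FINAL:'."
        )
        # AIoT: the model signals termination; GIoT: always run all iterations.
        if autonomous and response.startswith("FINAL:"):
            return response.removeprefix("FINAL:").strip()
    return response
```

With autonomous=True the LLMA's own signal ends the loop (AIoT-style); with autonomous=False the loop always runs max_iterations times (GIoT-style).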
Summary -
This paper introduces the Iteration of Thought (IoT) framework, in which an Inner Dialogue Agent (IDA) iteratively converses with an LLM Agent (LLMA) to perform various complex reasoning tasks, such as solving puzzles (Game of 24, Mini Crosswords) and answering difficult questions (GPQA, HotpotQA).
The authors employed two variants of this framework in their experiments, qualified as "autonomous" (AIoT) and "guided" (GIoT) respectively, to compare iteration-terminating mechanisms across these tasks. GIoT, the variant that always performs a fixed number of iterations, was seen to perform better than AIoT, the variant that self-determines termination, on Game of 24.
On the other hand, AIoT had superior performance on GPQA. Both variants performed similarly on Mini Crosswords and always performed better than the well-known Chain of Thought (CoT) framework wherever compared.
This paper also compares the IoT framework against the hierarchical AgentLite framework on the multi-context HotpotQA task, finding improvements of approximately 35% in F1 score and 44% in EM (exact match) score over AgentLite. Altogether, the results demonstrate that IoT can successfully introduce productive dynamism into low-complexity agentic frameworks.
Determining the scale and diversity of the IDA’s knowledge base represents a promising direction for future work aiming to maximize the real-world utility of IoT. In pursuit of strictly framework-to-framework comparisons, the authors used only off-the-shelf, general-purpose LLMs in all experiments to establish IoT. Moving forward, specialized language models, such as fine-tuned LLMs or LLMs equipped with additional tools and/or data sources, could yield further performance gains, whether by increasing the effective knowledge base or by directly addressing challenges like hallucination and the premature termination of iterations.
References -
Reference Reading Link -
Paper Reading Link -
Github Link -
For more information on AI Research Papers you can visit my GitHub Profile -
For receiving the latest updates on advancements in AI Research, Gen-AI, Quantum AI & Computer Vision, you can subscribe to my AI Research Papers Summaries Newsletter using the link below -
Thank you & Happy Reading !!