Role of RAG Noise in Large Language Models & Strategic Chain-of-Thought
Aditi Khare
AWS & AI Research Scientist-Full Stack Applied AI Product Research Engineer & Enterprise Architect | IIM-A | Quantum AI Open-Source | Computer Vision | Author | AI Product Design Strategies |
#ai #airesearchpapers #airesearch
Retrieval-Augmented Generation (RAG) has emerged as a crucial method for addressing hallucinations in large language models (LLMs). While recent research has extended RAG models to complex noisy scenarios, these explorations often confine themselves to limited noise types and presuppose that noise is inherently detrimental to LLMs, potentially deviating from real-world retrieval environments and restricting practical applicability.
In this paper, we define seven distinct noise types from a linguistic perspective and establish a Noise RAG Benchmark (NoiserBench), a comprehensive evaluation framework encompassing multiple datasets and reasoning tasks.
Through empirical evaluation of eight representative LLMs with diverse architectures and scales, we reveal that these noises can be further categorized into two practical groups: noise that is beneficial to LLMs (aka beneficial noise) and noise that is harmful to LLMs (aka harmful noise). While harmful noise generally impairs performance, beneficial noise may enhance several aspects of model capabilities and overall performance.
This analysis offers insights for developing more robust, adaptable RAG solutions and mitigating hallucinations across diverse retrieval scenarios.
Large language models (LLMs) (OpenAI 2023; Meta AI 2024) have demonstrated remarkable proficiency across various tasks. Despite these impressive capabilities, LLMs face challenges such as reliance on outdated knowledge and hallucination.
Retrieval-Augmented Generation (RAG) has recently emerged as a promising approach to mitigate these limitations. RAG enhances LLM performance by augmenting inputs with additional information retrieved from external sources during inference.
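To make the augmentation step concrete, here is a minimal sketch in Python. The function name and prompt wording are illustrative assumptions on my part, not from the paper or any specific library:

```python
# Minimal sketch of RAG prompt assembly: retrieved evidence is prepended
# to the user query before the LLM call. Illustrative only.

def build_rag_prompt(query: str, retrieved_docs: list[str]) -> str:
    """Join retrieved documents into a context block and attach the question."""
    context = "\n\n".join(f"[Doc {i + 1}] {doc}" for i, doc in enumerate(retrieved_docs))
    return (
        "Answer the question using the documents below.\n\n"
        f"{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```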
However, the internet is filled with various kinds of non-standard noise, including AI-generated fake news, outdated content, spelling errors, and data contamination, which may influence model performance.
It is crucial to explore how noise affects RAG systems and to understand the underlying mechanisms. Recently, several studies have attempted to extend RAG systems to complex real-world scenarios, investigating the impact of noisy documents and strategies to enhance system robustness. One such study defines three types of noise in retrieved documents and examines their impact on LLMs; despite highlighting one noise type's positive effect, it lacks a comprehensive noise definition and an in-depth investigation of the underlying principles.
This paper categorizes RAG noise into seven types from a linguistic perspective. These are further divided into beneficial noise (semantic, datatype, and illegal sentence) and harmful noise (counterfactual, supportive, orthographic, and prior) for practical applications; the reasoning behind this classification is explained in the Experiments section. A code sketch illustrating a few of these noise types follows the definitions below.

Semantic Noise (SeN) - Retrieved documents may contain content with low semantic relevance to the query, often off-topic or deviating from the intended meaning. Warren Weaver originally defined semantic noise as "the perturbations or distortions of sentence meaning"; following this, low-semantic-relevance documents are treated as semantic noise.

Datatype Noise (DN) - This type of noise refers to the mixing of different data types on the web, such as the blending of links and text on Wikipedia. The paper considers three types of data: text, URLs, and code.

Illegal Sentence Noise (ISN) - Web content may include fragments that do not form grammatically correct sentences, such as "history transform cover managed that hand black". This type of noise is defined as illegal sentence noise.

Counterfactual Noise (CN) - The internet contains abundant false information, including fake news and outdated knowledge, which poses significant challenges to RAG systems. Drawing from linguistics, where "counterfactual" denotes statements contrary to fact, the term "counterfactual noise" characterizes factual errors.

Supportive Noise (SuN) - Supportive evidence, also known as positive evidence, is highly semantically relevant to a hypothesis and provides the information necessary to support it (Kertész and Rákosi 2012). The term "supportive noise" describes documents that exhibit high semantic relevance but lack the corresponding answer information.

Orthographic Noise (ON) - The word "orthography" originates from the Greek orthos (meaning "correct") and graphein (meaning "to write"), and in linguistics refers to the way words are written. Orthographic noise refers to writing errors such as spelling mistakes and word lengthening.

Prior Noise (PN) - In linguistics, prior knowledge refers to what a learner already knows before solving a problem.
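As a rough illustration of how a few of these noise types might be synthesized for a benchmark, here is a sketch in Python. The heuristics below are simplified assumptions on my part, not the paper's implementation:

```python
import random

def illegal_sentence_noise(sentence: str) -> str:
    """ISN sketch: shuffle word order so the fragment is no longer grammatical."""
    words = sentence.split()
    random.shuffle(words)
    return " ".join(words)

def orthographic_noise(sentence: str, rate: float = 0.1) -> str:
    """ON sketch: corrupt spelling by swapping adjacent letters at a given rate."""
    chars = list(sentence)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def datatype_noise(sentence: str) -> str:
    """DN sketch: blend a URL into plain text, mimicking mixed web content."""
    return f"{sentence} (see https://example.com/page)"
```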
Testbeds Construction - After obtaining high-quality QA instances and diverse retrieval documents, the authors build testbeds to evaluate model performance under various noise conditions. Given the challenges of automatically assessing LLM responses to open-ended QA tasks, free-form QA is converted into a multiple-choice format. This constrains the response space and enables more accurate evaluation. Specifically, for each QA pair, LLMs choose from four options: the correct answer, two counterfactual alternatives, and "Uncertain". The position of the gold option is randomized to avoid LLM sensitivity to option order.
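A minimal sketch of how such a multiple-choice item could be assembled, with the gold position randomized; the structure is inferred from the description above, not taken from the paper's code:

```python
import random

def build_mc_item(question: str, gold: str, counterfactuals: list[str]) -> dict:
    """Assemble a 4-option item: the gold answer, two counterfactual
    alternatives, and an explicit 'Uncertain' option."""
    options = [gold, counterfactuals[0], counterfactuals[1], "Uncertain"]
    random.shuffle(options)  # randomize where the gold option lands
    labels = ["A", "B", "C", "D"]
    return {
        "question": question,
        "options": dict(zip(labels, options)),
        "gold_label": labels[options.index(gold)],
    }
```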
References -
Research paper link - https://arxiv.org/abs/2408.13533
Strategic Chain-of-Thought - Guiding Accurate Reasoning in LLMs through Strategy Elicitation
The Chain-of-Thought (CoT) paradigm has emerged as a critical approach for enhancing the reasoning capabilities of large language models (LLMs). However, despite their widespread adoption and success, CoT methods often exhibit instability due to their inability to consistently ensure the quality of generated reasoning paths, leading to suboptimal reasoning performance.
This paper proposes Strategic Chain-of-Thought (SCoT), a novel methodology designed to refine LLM performance by integrating strategic knowledge prior to generating intermediate reasoning steps. SCoT employs a two-stage approach within a single prompt: it first elicits an effective problem-solving strategy, which is then used to guide the generation of high-quality CoT paths and final answers.
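As a concrete illustration, a single prompt might interleave the two stages roughly as follows. The exact wording is my own sketch, and `llm` stands in for any text-generation callable:

```python
# Sketch of a single-prompt, two-stage SCoT template. Illustrative only.

SCOT_PROMPT = """Solve the problem below in two stages.

Stage 1 - Strategy: State an effective problem-solving strategy for this
problem (which method or principle applies, and why).

Stage 2 - Solution: Following that strategy, write out the step-by-step
reasoning and give the final answer.

Problem: {question}
"""

def scot_query(llm, question: str) -> str:
    """One call: the model first elicits a strategy, then reasons under it."""
    return llm(SCOT_PROMPT.format(question=question))
```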
Our experiments across eight challenging reasoning datasets demonstrate significant improvements, including a 21.05% increase on the GSM8K dataset and a 24.13% increase on the Tracking Objects dataset, using the Llama3-8b model. Additionally, we extend the SCoT framework to develop a few-shot method with automatically matched demonstrations, yielding even stronger results. These findings underscore the efficacy of SCoT, highlighting its potential to substantially enhance LLM performance in complex reasoning tasks.
Few-shot Strategic Chain-of-Thought -
The SCoT method is extended to a few-shot version by leveraging strategic knowledge to select demonstrations. The approach is structured into two stages: constructing a strategy-based demonstration corpus and performing model inference.
Stage 1: Strategic Knowledge-Based Demonstration Corpus Construction -
This stage involves the following two steps, as shown in Figure 2(b); a code sketch follows the list:

1. SCoT Answer Generation - The zero-shot SCoT method is applied to the training set to generate a corresponding SCoT answer for each question in the dataset.

2. Demonstration Corpus Construction - The generated answers are compared with the ground truth, and only the accurate question-SCoT answer pairs are retained. This step assumes that the strategic knowledge used in these problems is both correct and relevant. The validated question-SCoT answer pairs are then compiled into a strategic knowledge-based demonstration corpus.
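A sketch of this filtering loop, reusing the `scot_query` helper from the earlier sketch; `extract_final_answer` is a hypothetical helper, shown here with a deliberately naive implementation:

```python
def extract_final_answer(text: str) -> str:
    """Hypothetical helper: naively take whatever follows the last
    'answer' marker in a SCoT response."""
    return text.rsplit("answer", 1)[-1].strip(" :.\n")

def build_demo_corpus(llm, train_set: list[dict]) -> list[dict]:
    """Stage 1 sketch: keep only question-SCoT answer pairs whose final
    answer matches the ground truth."""
    corpus = []
    for ex in train_set:  # each ex: {"question": ..., "answer": ...}
        scot_answer = scot_query(llm, ex["question"])
        if extract_final_answer(scot_answer) == ex["answer"]:
            corpus.append({"question": ex["question"], "scot_answer": scot_answer})
    return corpus
```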
Stage 2: Model Inference - This stage involves the following three steps in a two-query process, as shown on the right of Figure 2(a); a sketch of the full process follows the list:

1. Strategic Knowledge Generation - The LLM generates strategic knowledge relevant to the problem, focusing on understanding the problem rather than producing the final answer.

2. Demonstration Matching - The generated strategic knowledge is used to search the demonstration corpus created in Stage 1. The system identifies the most relevant demonstrations and retrieves the SCoT answers of the most similar examples.

3. Few-shot Inference - The selected demonstrations are integrated as few-shot examples into the input prompt (Figure 3(b)). This guides the model to generate the final prediction based on the provided examples.
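Putting the three steps together, here is a sketch of the two-query process. Embedding-based matching is an assumption on my part (the paper's exact similarity mechanism may differ), and `embed` is a placeholder for any sentence-embedding model:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def few_shot_scot(llm, embed, corpus: list[dict], question: str, k: int = 2) -> str:
    """Stage 2 sketch: strategy generation, demonstration matching,
    then few-shot inference (two LLM queries in total)."""
    # 1. Strategic knowledge generation: understand the problem, don't answer yet.
    strategy = llm(f"State a problem-solving strategy for: {question}")

    # 2. Demonstration matching: rank corpus entries by similarity between
    #    the generated strategy and each stored SCoT answer.
    strategy_vec = embed(strategy)
    demos = sorted(
        corpus,
        key=lambda d: cosine(strategy_vec, embed(d["scot_answer"])),
        reverse=True,
    )[:k]

    # 3. Few-shot inference with the matched demonstrations in the prompt.
    shots = "\n\n".join(f"Q: {d['question']}\nA: {d['scot_answer']}" for d in demos)
    return llm(f"{shots}\n\nQ: {question}\nA:")
```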
Summary -
This paper introduces Strategic Chain-of-Thought (SCoT), a method that enables LLMs to autonomously generate an optimal Chain-of-Thought path. By integrating a structured workflow for eliciting and applying strategic knowledge, SCoT enhances the model's ability to produce high-quality outputs.
SCoT is further extended to a few-shot version by matching demonstrations through strategic knowledge drawn from a predefined strategic knowledge-based corpus.
Experimental results demonstrate the effectiveness of both zero-shot SCoT and few-shot SCoT. Overall, SCoT offers a promising framework for improving the quality of reasoning paths in LLMs.
Future research will focus on evaluating its effectiveness with more complex problems and exploring further applications.
References -
Research paper link -
For more information on AI research papers, you can visit my GitHub profile -
For receiving the latest updates on advancements in AI research, Gen-AI, Quantum AI & Computer Vision, you can subscribe to my AI Research Papers Summaries newsletter using the link below -
Thank you & Happy Reading!