Extending RAG


Overview of Large Language Models (LLMs)

In recent years, Large Language Models (LLMs) have revolutionized the field of artificial intelligence. These advanced AI systems can understand and generate human-like text based on the data they were trained on. Examples include GPT-4, BERT, and Llama.

LLMs excel in several key areas:

  • Text generation: They can comprehend and produce contextually relevant, coherent text, making them useful for content creation, marketing, and journalism.
  • Question answering: By leveraging extensive training data, LLMs provide detailed answers to user queries, which is particularly useful in customer service, technical support, and educational applications.
  • Translation: They can translate text between multiple languages with high accuracy, facilitating cross-language communication and helping businesses operate in global markets.
  • Summarization: LLMs can condense long documents into concise summaries, extracting the main points and essential information, which is valuable in fields like research.
  • Sentiment analysis and classification: They can determine sentiment and classify text, useful for monitoring social media, customer feedback, and market research.
  • Creative and personalized content: They can generate creative content such as stories and poems, and personalize content based on user preferences, enhancing user engagement.

Despite their impressive capabilities, LLMs have several limitations that can affect their performance:

  • Memory and Context Length Limitations: LLMs have a fixed maximum context length they can handle, which limits the amount of text they can process at one time. For instance, if a document exceeds this limit, the model may only consider a portion of it, ignoring potentially critical information from the rest. This limitation is particularly problematic in scenarios requiring comprehension of long texts, such as legal documents, research papers, or lengthy customer service interactions. The model might lose track of earlier parts of the conversation or document, leading to disjointed or incomplete responses.
  • Static Nature of Pretrained Models: LLMs are typically trained on large datasets up to a certain point in time and do not inherently update their knowledge after training. This means that they lack the ability to incorporate new information post-training. For example, an LLM trained in 2021 would not be aware of events, advancements, or changes that occurred after its training data cutoff. This can result in outdated or irrelevant responses, especially in fast-moving fields like technology, medicine, or current events.
  • Computational Costs of Customization: Adapting an LLM to your specific data involves significant computational resources. Fine-tuning an LLM requires specialized hardware like GPUs or TPUs and considerable training time, which can be prohibitively expensive. Even updating the model with new data periodically to keep it relevant involves substantial computational costs. For small businesses or individual developers, these costs can be a major barrier, limiting their ability to customize and maintain an LLM that meets their specific needs.
  • Hallucinations: LLMs sometimes produce outputs that are factually incorrect or nonsensical, a phenomenon known as "hallucination." This occurs because the models generate text based on patterns learned during training, without a true understanding of the information. For instance, an LLM might generate a convincing but entirely fictional biography of a person or provide inaccurate answers to factual questions. This poses significant risks in applications requiring high accuracy, such as medical advice, legal information, or academic research.


Given these constraints, it's clear that LLMs cannot be used as an all-knowing, infallible source of information. They are powerful tools, but their limitations necessitate enhancements for more reliable and versatile applications. This is where Retrieval-Augmented Generation (RAG) comes into play, addressing some of these challenges and further extending the capabilities of LLMs.


Retrieval-Augmented Generation (RAG)


Originating from the "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" paper, RAG transforms how LLMs process and generate text by incorporating external knowledge sources into their architecture.

RAG operates on a hybrid architecture that combines generative models with robust retrieval mechanisms. While traditional LLMs generate responses based solely on learned patterns, RAG first retrieves relevant information from external knowledge bases or the web before generating a response. These information retrieval systems access structured databases or unstructured sources to gather documents or passages that are contextually relevant to the input query.

Key to RAG's operation is its ability to formulate queries dynamically based on the input received and employ sophisticated ranking algorithms to select the most pertinent information. This ensures that the retrieved data aligns closely with the intent and requirements of the generated response. Once relevant documents or passages are retrieved, they are seamlessly integrated into the generative model's framework. The model then utilizes this augmented knowledge to produce responses that are more informed, coherent, and accurate.
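To make this retrieve-then-generate flow concrete, here is a minimal Python sketch. The toy corpus, the overlap-based retriever, the prompt wording, and the model name are illustrative assumptions rather than a reference implementation.

```python
# Minimal retrieve-then-generate sketch. The toy corpus, the overlap-based
# retriever, the prompt wording, and the model name are illustrative
# assumptions, not part of the original RAG paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CORPUS = [
    "RAG combines a retriever with a generative language model.",
    "Vector databases store embeddings for fast similarity search.",
    "CRAG adds a retrieval evaluator that can trigger web search.",
]

def retrieve_passages(query: str, k: int = 2) -> list[str]:
    """Toy lexical retriever; a real system would use embeddings."""
    q_terms = set(query.lower().split())
    ranked = sorted(CORPUS, key=lambda doc: -len(q_terms & set(doc.lower().split())))
    return ranked[:k]

def rag_answer(query: str) -> str:
    context = "\n\n".join(retrieve_passages(query))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(rag_answer("What does RAG combine?"))
```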


  • Reducing Computational Costs

One of the significant advantages of RAG is its applicability to user-specific data without the need for extensive model retraining. Organizations and developers can leverage RAG to integrate their own proprietary data or domain-specific knowledge bases directly into the generation process. This approach not only enhances response accuracy but also reduces the computational costs associated with training and maintaining a custom LLM.

  • Addressing Context Limitations

Traditional Large Language Models (LLMs) often struggle with limited context sizes, which restrict their ability to comprehend lengthy or multifaceted documents effectively. RAG addresses these challenges by selectively retrieving specific documents or relevant excerpts to supplement the model's internal knowledge. This approach enables the model to augment its understanding with targeted external information, enhancing the relevance and coherence of its responses within the constraints of the context window.

  • Dynamic Information Retrieval

One of the defining features of RAG is its capability for dynamic information retrieval. Unlike traditional LLMs that rely solely on pre-trained data, RAG can fetch up-to-date information from databases or the web in real-time. This agility allows the model to stay current with evolving trends, events, or developments, ensuring that the responses it generates are timely and reflective of the latest information available. By integrating real-time data into its generation process, RAG enhances the relevance and reliability of its outputs across various applications and domains.

  • Minimizing Hallucinations

Traditional LLMs face a significant challenge known as "hallucinations," where they generate inaccurate or nonsensical outputs. RAG addresses this issue by grounding its responses in retrieved documents and factual information. By incorporating external knowledge sources, RAG enhances the accuracy and reliability of its outputs. This grounding process ensures that the model's responses are based on verifiable information, thereby reducing the likelihood of producing misleading or erroneous content.


RAG Architecture

Generative Model: At the core of the RAG architecture lies a generative language model. This model forms the basis for generating text based on patterns learned from extensive training on large datasets.

Retrieval Mechanisms: RAG systems incorporate retrieval mechanisms to supplement the generative capabilities of the model. These mechanisms retrieve relevant information from external knowledge bases, databases, or the web to enrich the context and accuracy of generated responses.

Embeddings and Vector Databases: Text inputs are transformed into numerical representations known as embeddings, capturing semantic meaning and context. Vector databases (VectorDB) store embeddings of relevant documents or passages, facilitating efficient retrieval and comparison during the augmentation process.
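As a concrete illustration of this component, the following sketch indexes a few passages in an in-memory ChromaDB collection and queries it. The collection name and documents are placeholders, and Chroma's default embedding function is used here.

```python
# Minimal VectorDB sketch with ChromaDB (collection name and documents
# are placeholders; Chroma applies its default embedding function here).
import chromadb

client = chromadb.Client()  # in-memory instance; use PersistentClient for disk storage
collection = client.create_collection(name="rag_passages")

# Index a few documents; each gets an embedding computed on insertion.
collection.add(
    ids=["doc1", "doc2", "doc3"],
    documents=[
        "RAG retrieves external passages before generating an answer.",
        "Embeddings map text to vectors that capture semantic meaning.",
        "Ranking algorithms select the most relevant retrieved passages.",
    ],
)

# Retrieve the passages most similar to a query.
results = collection.query(query_texts=["How does RAG use external data?"], n_results=2)
print(results["documents"][0])
```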

Query Formulation and Ranking: RAG systems formulate queries based on the input text and use ranking algorithms to select the most pertinent information from retrieved documents. This ensures that the information integrated into the generative model's framework aligns closely with the intent and context of the input.

Integration and Output Generation: Retrieved information is integrated into the generative model's framework to enhance the accuracy, relevance, and coherence of generated responses. The integrated approach allows RAG systems to produce contextually informed and natural-sounding text outputs that surpass the capabilities of standalone generative models.


RAG architecture diagram (adapted from Medium)


Improvements on RAG

CRAG

Corrective Retrieval Augmented Generation (CRAG) [paper] is a method to improve the robustness of generation by addressing the issue of inaccurate and misleading knowledge being exposed to generative Large Language Models (LLMs). CRAG proposes a lightweight retrieval evaluator to assess the overall quality of retrieved documents for a query and trigger different knowledge retrieval actions. It also utilizes large-scale web searches to augment the retrieval results and a decompose-then-recompose algorithm to focus on key information while filtering out irrelevant content.

How it works

CRAG is a corrective strategy designed to improve the robustness of generation in RAG-based approaches. It addresses the issue of inaccurate and misleading knowledge being retrieved and utilized by LLMs.

CRAG introduces a retrieval evaluator to assess the relevance of retrieved documents to the input query. This evaluator calculates a confidence degree, triggering one of three actions: Correct, Incorrect, or Ambiguous.

When the evaluator triggers the Correct action, it indicates that the retrieved documents are relevant.

If the Incorrect action is triggered, it means the retrieved documents are deemed irrelevant, so they are discarded, and CRAG resorts to large-scale web searches to find complementary knowledge sources for corrections.

In the case of the Ambiguous action, both internal and external knowledge sources are combined to balance and strengthen the system's robustness.

CRAG also utilizes web searches as an extension to overcome the limitations of static and limited corpora. It employs a web search API to generate relevant URL links for queries, transcribing the content of these web pages. The same knowledge refinement process is then applied to derive relevant web knowledge.

The knowledge refinement process in CRAG further refines the relevant retrieved documents through a decompose-then-recompose algorithm. This process segments the documents into fine-grained knowledge strips; the retrieval evaluator calculates a relevance score for each strip, filters out the irrelevant ones, and recombines the relevant strips into more precise internal knowledge.
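The control flow described above can be sketched roughly as follows. The thresholds, the toy term-overlap scorer, and the web_search stub stand in for the paper's trained retrieval evaluator and its search API; they are assumptions for illustration only.

```python
# Sketch of CRAG's action routing. The thresholds, the toy grade_document
# scorer, and the web_search stub stand in for the paper's trained
# retrieval evaluator and its web search API.

def grade_document(query: str, document: str) -> float:
    """Toy relevance score in [0, 1] based on term overlap; CRAG instead
    trains a lightweight evaluator (an LLM grading prompt also works)."""
    q, d = set(query.lower().split()), set(document.lower().split())
    return len(q & d) / max(len(q), 1)

def web_search(query: str) -> list[str]:
    """Placeholder for a web search API call returning page contents."""
    return []

def refine(query: str, documents: list[str], threshold: float = 0.5) -> list[str]:
    """Decompose-then-recompose: split documents into knowledge strips,
    score each strip, and keep only the relevant ones."""
    strips = [s for doc in documents for s in doc.split("\n\n") if s.strip()]
    return [s for s in strips if grade_document(query, s) >= threshold]

def crag_retrieve(query: str, documents: list[str]) -> list[str]:
    scores = [grade_document(query, d) for d in documents]
    upper, lower = 0.7, 0.3  # assumed confidence thresholds
    if all(s < lower for s in scores):        # "Incorrect": discard and search the web
        return refine(query, web_search(query))
    if any(s >= upper for s in scores):       # "Correct": refine the retrieved docs
        return refine(query, documents)
    # "Ambiguous": combine internal documents with external web results
    return refine(query, documents + web_search(query))
```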

Benefits

By implementing these techniques, CRAG improves the precision and relevance of generated responses while enhancing control over the information retrieval process. It corrects irrelevant documents and optimizes the extraction of key insights, making it a robust and adaptable solution for RAG-based approaches.

CRAG DAG


Self-RAG

Self-Reflective Retrieval-Augmented Generation (Self-RAG) [paper] is a framework designed to enhance the quality and factual accuracy of LLMs by incorporating retrieval and self-reflection capabilities. Despite their remarkable capabilities, LLMs often produce responses with factual inaccuracies due to their sole reliance on parametric knowledge.

Self-RAG aims to improve the generation process of LLMs by introducing two key components: retrieval on demand and self-reflection. It trains a single arbitrary LM to determine when to retrieve relevant knowledge, generate text, and critique its own output using special tokens called reflection tokens. This approach enhances the LM's versatility and adaptability, ensuring it can tailor its behavior to diverse task requirements.


How it works

Self-RAG utilizes reflection tokens to control the retrieval process and guide the generation of text. These tokens include "Retrieve" to indicate the need for knowledge retrieval, "Relevant" to assess the relevance of retrieved passages, "Supported" to evaluate if the output is supported by evidence, and "Useful" to determine the overall utility of the response. The LM is trained to generate these tokens along with textual output, allowing for fine-grained control and customization during inference.

During training, Self-RAG employs a critic model to generate reflection tokens for evaluating retrieved passages and the quality of the LM's output. This critic model is then used to update the training corpus by inserting reflection tokens offline, eliminating the need for a critic model during inference. The LM is trained on a diverse collection of text interleaved with reflection tokens and retrieved passages.

At inference, Self-RAG dynamically decides when to retrieve text passages based on the probability of generating the "Retrieve" token. It processes multiple passages in parallel and uses reflection tokens to enforce soft constraints or hard control over the generated output. This allows users to customize the LM's behavior by adjusting weights assigned to different reflection token types, prioritizing factual accuracy or creativity as needed.
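One way to picture how reflection tokens steer inference is a weighted segment score, as in the sketch below. The token names follow the paper, but the weights and example probabilities are illustrative; in Self-RAG these probabilities come from the trained LM's own token distribution.

```python
# Sketch of combining reflection-token probabilities into a single segment
# score at inference time. The weighting scheme and the example numbers are
# illustrative; Self-RAG derives these probabilities from the trained LM.

REFLECTION_WEIGHTS = {
    "ISREL": 1.0,   # is the retrieved passage relevant?
    "ISSUP": 1.0,   # is the output supported by the passage?
    "ISUSE": 0.5,   # is the output useful overall?
}

def segment_score(token_probs: dict[str, float], weights=REFLECTION_WEIGHTS) -> float:
    """Weighted sum of reflection-token probabilities for one candidate
    continuation; raising the ISSUP weight prioritizes factual support."""
    return sum(weights[name] * token_probs.get(name, 0.0) for name in weights)

# Rank candidate continuations generated from different retrieved passages.
candidates = {
    "passage_a": {"ISREL": 0.9, "ISSUP": 0.8, "ISUSE": 0.7},
    "passage_b": {"ISREL": 0.6, "ISSUP": 0.3, "ISUSE": 0.9},
}
best = max(candidates, key=lambda name: segment_score(candidates[name]))
print(best)  # "passage_a" under these illustrative numbers
```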


Benefits

Self-RAG offers several advantages over traditional RAG approaches. It improves the quality and factual accuracy of LLMs by selectively retrieving relevant knowledge and critiquing the generated output. This adaptive retrieval process enhances the LM's versatility and ensures that the output aligns closely with available evidence. Additionally, Self-RAG enables users to tailor the LM's behavior during inference, making it widely applicable and more robust.


Simplification of the method

While the traditional Self-RAG framework involves training a dedicated LM, we can simplify the method by harnessing the power of pre-trained Large Language Models (LLMs).

The first step is to evaluate the faithfulness of LLM generations to the retrieved documents. Faithfulness refers to the alignment between the generated response and the content of the retrieved passages, ensuring no hallucinations or inaccurate information are presented. To achieve this, we can employ fact-checking mechanisms, where the LLM's output is verified against the knowledge sources, ensuring that the generated response maintains logical coherence with the context of the retrieved passages. If the answer is judged to be a hallucination, the system triggers a regeneration of the model's response.

In addition to faithfulness, we evaluate usefulness: the relevance and value of the generated response in answering the given question. For this, we can use a relevance scoring mechanism, where the LLM assigns scores to the output based on factors like topic alignment, context matching, and the presence of key information. If the generation passes the faithfulness check but fails this step, the solution is to re-retrieve data from a new source and repeat the process.

By combining these evaluation techniques, we can simplify the Self-RAG framework while harnessing the strengths of LLMs. This approach ensures that the generated responses are not only accurate and faithful to the retrieved passages but also useful and relevant to the user's query. By addressing both faithfulness and usefulness, we enhance the reliability and informativeness of the LLM's output, making it a more robust and trusted tool for language generation tasks. Paired with CRAG, we can ensure that the generated response is precise, relevant, and faithful to our knowledge pool.
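A rough sketch of this simplified loop, using the LLM itself as a yes/no grader, might look like the following. The prompt wording, the retry limit, and the retrieve/web_search helpers are assumptions, not a fixed recipe.

```python
# Sketch of the simplified Self-RAG loop: grade faithfulness, regenerate on
# hallucination, grade usefulness, re-retrieve on failure. Prompt wording,
# the retry limit, and the retrieve/web_search helpers are assumptions.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def grade(question: str, context: str, answer: str, criterion: str) -> bool:
    """Use the LLM as a binary grader for a given criterion."""
    verdict = ask(
        f"Question: {question}\nContext: {context}\nAnswer: {answer}\n"
        f"Is the answer {criterion}? Reply with a single word: yes or no."
    )
    return verdict.lower().startswith("yes")

def self_rag_answer(question: str, retrieve, web_search, max_tries: int = 3) -> str:
    context = "\n\n".join(retrieve(question))
    answer = ""
    for _ in range(max_tries):
        answer = ask(f"Context:\n{context}\n\nQuestion: {question}")
        if not grade(question, context, answer, "grounded in the context"):
            continue  # hallucination detected: regenerate with the same context
        if grade(question, context, answer, "useful for answering the question"):
            return answer
        context = "\n\n".join(web_search(question))  # not useful: re-retrieve
    return answer
```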


CRAG + Self RAG DAG



Adaptive RAG

Retrieval-Augmented Large Language Models (LLMs) have emerged as a promising approach to enhancing response accuracy in several tasks, such as Question-Answering (QA). However, existing approaches either handle simple queries with unnecessary computational overhead or fail to adequately address complex multi-step queries. To address this, Adaptive RAG [paper] proposes a novel adaptive QA framework that can dynamically select the most suitable strategy for LLMs based on the query complexity.


How it Works

At its core, Adaptive RAG, similar to Self-RAG, introduces a classifier: a smaller language model (LM) trained to assess the complexity level of incoming queries. This classifier is a crucial component that enables the framework to select the most appropriate strategy for handling each query. The classifier is trained using automatically collected labels, which are obtained from the actual predicted outcomes of models and inherent inductive biases in datasets.

The classifier categorizes queries into three complexity levels: 'A', 'B', and 'C'.

'A' indicates a straightforward query that can be answered by the LLM itself, 'B' represents a query of moderate complexity requiring at least a single-step approach, and 'C' denotes a complex query that demands the most extensive solution involving multiple retrieval steps.

By pre-defining the query complexity with this classifier, Adaptive RAG can seamlessly adapt between different retrieval-augmented strategies, from the simplest non-retrieval approach for straightforward queries to the most comprehensive multi-step approach for complex queries. This adaptability ensures that resources are efficiently allocated, providing a balanced and effective solution.
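In code, the routing step might look like the sketch below. The prompt-based classifier and the three handler functions are stand-ins for the paper's trained complexity classifier and the underlying retrieval strategies.

```python
# Sketch of Adaptive RAG's routing step. The prompt-based classifier and the
# three handler stubs stand in for the paper's trained complexity classifier
# and its retrieval strategies.
from openai import OpenAI

client = OpenAI()

def classify_complexity(query: str) -> str:
    """Return 'A' (no retrieval), 'B' (single-step), or 'C' (multi-step).
    The paper trains a small LM for this; a prompted LLM is a stand-in."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                "Classify the complexity of this question as A (answerable "
                "without retrieval), B (needs one retrieval step), or C "
                f"(needs multi-step retrieval). Reply with one letter.\n\n{query}"
            ),
        }],
    )
    label = resp.choices[0].message.content.strip().upper()[:1]
    return label if label in {"A", "B", "C"} else "B"  # default to single-step

def adaptive_answer(query: str, answer_directly, single_step_rag, multi_step_rag) -> str:
    """Dispatch the query to the strategy matching its complexity label."""
    route = {"A": answer_directly, "B": single_step_rag, "C": multi_step_rag}
    return route[classify_complexity(query)](query)
```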


Benefits

Adaptive RAG has been validated on a set of open-domain QA datasets, covering multiple query complexities. The results demonstrate that Adaptive RAG enhances the overall efficiency and accuracy of QA systems compared to relevant baselines, including adaptive retrieval approaches. It provides a robust middle ground, adapting between iterative, single-step, and no-retrieval methods, based on the complexity of the query. This adaptability makes it highly effective and efficient, allocating resources efficiently to handle complex queries while simplifying the process for simpler queries.



Adaptive RAG + Self RAG + CRAG


Implementation

An example of how to use these three methodologies can be found on my GitHub.

In this repo, I took a page from the Mistral and LangChain RAG cookbooks and adapted it to my needs.

This system leverages the Arxiv API to source research papers on various topics. The documents are embedded using OpenAI embeddings and stored in ChromaDB, facilitating efficient retrieval. The core model for this project is GPT-3.5-turbo.

A significant modification I introduced is in the Adaptive RAG process. Specifically, I used the abstracts from the extracted papers to generate a set of keywords using the base model. These keywords serve as labels for adaptive data source selection. When a user query is processed, the model classifies it to determine relevance to the stored documents' keywords. If the query is relevant, the system retrieves information directly from the vector database. Otherwise, the system falls back to web search to obtain the necessary documents. This adaptive retrieval mechanism optimizes response accuracy and efficiency by intelligently selecting the most appropriate data source.
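The adaptive source selection described above could be sketched roughly as follows. The prompts, the keyword extraction step, and the two retrieval helpers are placeholders and not the exact code from the repository.

```python
# Rough sketch of the keyword-based source selection described above.
# The prompts, the keyword extraction step, and the two retrieval helpers
# are placeholders rather than the exact code from the repository.
from openai import OpenAI

client = OpenAI()

def extract_keywords(abstract: str) -> list[str]:
    """Ask the base model for keywords that label an indexed paper."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"List 5 short keywords for this abstract, comma-separated:\n{abstract}",
        }],
    )
    return [k.strip().lower() for k in resp.choices[0].message.content.split(",")]

def route_query(query: str, stored_keywords: list[str], vectorstore_search, web_search):
    """Use the vector database when the query matches the stored keywords,
    otherwise fall back to web search."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                f"Keywords of the indexed papers: {', '.join(stored_keywords)}\n"
                f"Question: {query}\n"
                "Is this question related to those keywords? Answer yes or no."
            ),
        }],
    )
    related = resp.choices[0].message.content.strip().lower().startswith("yes")
    return vectorstore_search(query) if related else web_search(query)
```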



Conclusion

As we navigate the rapidly evolving field of artificial intelligence, Large Language Models like GPT-4, BERT, and Llama have become pivotal. They excel in generating coherent text, translating languages, summarizing documents, and more. However, they face significant challenges, including memory and context length limitations, static nature post-training, high computational costs for customization, and the risk of producing hallucinations. Retrieval-Augmented Generation (RAG) addresses these challenges by integrating external knowledge sources, enhancing the relevance and accuracy of the responses.

RAG mitigates these issues by:

  • Enriching context with external information
  • Retrieving up-to-date knowledge dynamically
  • Reducing retraining costs
  • Grounding responses in factual data


Overview of Optimization Strategies

Innovations like Corrective RAG (CRAG), Self-Reflective RAG (Self-RAG), and Adaptive RAG (ARAG) further refine this approach:

  • CRAG enhances retrieval accuracy and robustness by correcting irrelevant information and optimizing knowledge extraction.
  • Self-RAG improves factual accuracy and quality by incorporating self-reflection and dynamic retrieval.
  • Adaptive RAG efficiently handles queries of varying complexities by selecting the most suitable retrieval strategy.


I encourage you to explore and implement RAG optimizations in your AI projects. Let's leverage these advancements to overcome the limitations of traditional LLMs and unlock new possibilities.

Join the discussion and share your experiences and insights on LinkedIn.
