Retrieval Augmented Generation (RAG) v/s Long-Context (LC) reasoning tradeoffs in Transformer based Language Models


NOTE - In this article, I share my personal views on both approaches, drawing on recent research papers and my own experience building Gen-AI products and POCs such as apnaAI.

For the context of this article:

  • RAG-based reasoning: Pass a subset (chunk) of a document to the language model for reasoning, such as one or more sentences or paragraphs that are semantically similar to the input query.
  • LC-based reasoning: Pass a much longer context to an LLM, which can be the full source document or a set of documents. The term "long-context" (LC) has become more prominent with the development of models like GPT-4 and Gemini 1.5, which can directly process far more tokens (e.g., up to 1 million), calling into question the need for sophisticated RAG pipelines.

Introduction

RAG, specifically vector-based RAG (vRAG), retrieves data chunks semantically related to the query using measures such as cosine similarity or Euclidean distance. This lets language models (SLMs or LLMs) generate responses from relevant data retrieved dynamically, reducing the need to process large contexts directly. Long-context (LC) LLMs such as GPT-4, Gemini 1.5 Pro, or Llama 3.1-70B, on the other hand, can process vast text sequences by incorporating everything into their input context. The size of the model supports its ability to filter out distractions from such long contexts and align the reasoning with the context of the query prompt.

However, as much as LC LLMs provide seamless context comprehension, processing extremely long contexts carries a cost: focus can degrade and computational complexity rises, an inherent characteristic of the self-attention and positional-encoding architecture of the underpinning transformer-decoder model.

Understanding the Key Differences and Tradeoffs

Self-Attention and Positional Encoding: The Transformer’s Core Mechanisms

The self-attention mechanism, introduced in the original Transformer paper by Vaswani et al., lets LLMs compute relationships between tokens at any position, which is crucial for handling long dependencies within a text. Positional encoding, meanwhile, gives the model a sense of sequence, ensuring that it can maintain the order of words in a sentence. However, as context length increases, the ability of these mechanisms to distinguish relevant from irrelevant information diminishes. This is where the concept of thresholds comes into play: both self-attention and positional encoding have inherent limitations when dealing with excessive token lengths. The challenge is balancing focus and reasoning, and that balance becomes harder to maintain as the context grows. In practice, LC models begin to struggle with excessively long texts, often losing the ability to differentiate important from unimportant data.
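
For intuition on why the cost grows with context length, below is a minimal NumPy sketch of scaled dot-product attention as defined by Vaswani et al.; the (n × n) score matrix is what makes compute and memory grow quadratically with the number of tokens. This is an illustrative toy, not how production LLMs implement attention.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention (after Vaswani et al., 2017).

    Q, K, V: arrays of shape (n_tokens, d_k). The score matrix is
    (n_tokens x n_tokens), so compute and memory grow quadratically
    with context length -- the root of the LC cost discussed above.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n) pairwise token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V                                # weighted sum of values

# Toy usage: 8 tokens with 4-dimensional representations
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(8, 4))
out = scaled_dot_product_attention(Q, K, V)           # shape (8, 4)
```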

Vector RAG Implementation Approaches

Step 1: Modelling

  1. Chunk size, overlap, and chunking strategy (simple vs. logical vs. semantic): The size of the text chunks and how they overlap affect retrieval precision, compute costs, and completeness. Simple chunking splits based on length, logical chunking splits on natural breaks in content like paragraphs or sections, and semantic chunking splits a document based on the semantic relatedness of adjacent chunks (a minimal sketch follows this list).
  2. Metadata for filtering and augmenting LM input: Adding metadata, such as source or timestamp or other information, helps refine the retrieval process and ensures the model generates more contextually accurate outputs.
  3. Embedding model – size of dimensions, finetuned or standard models: The dimensionality of the embedding vectors plays a role in retrieval accuracy. While standard embedding models like OpenAI’s perform well for general-purpose use, fine-tuned models are crucial for domain-specific queries, yielding higher accuracy by aligning the similarity scores with domain-specific semantics.
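
To make the chunk-size and overlap tradeoff concrete, here is a minimal sketch of simple (length-based) chunking with overlap. The chunk_size and overlap values are illustrative defaults, not recommendations, and the metadata example is hypothetical.

```python
def simple_chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Simple length-based chunking: fixed-size windows with overlap.

    Logical or semantic chunking would instead split on paragraphs, headings,
    or embedding-similarity boundaries; this is only the baseline strategy.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Each chunk would then be embedded and stored alongside metadata, e.g.:
# {"text": chunk, "source": "doc_42.pdf", "timestamp": "2024-09-01"}
```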

Step 2: Retrieval Approaches

Deciding between vector search alone or hybrid approaches that combine vector search with symbolic or rule-based retrieval methods is key. Hybrid approaches may strike a better balance between accuracy and speed, especially in complex query scenarios.
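
As a rough illustration of a hybrid scheme, the sketch below blends a dense vector-similarity score with a simple keyword-overlap score. The 0.7/0.3 weights, the scoring functions, and the chunk dictionary shape are assumptions for illustration, not a prescribed recipe.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def keyword_overlap(query: str, text: str) -> float:
    """Fraction of query words that appear in the chunk (sparse signal)."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / (len(q) or 1)

def hybrid_score(query: str, query_emb: np.ndarray, chunk: dict,
                 w_vec: float = 0.7, w_kw: float = 0.3) -> float:
    """Blend dense (vector) and sparse (keyword) evidence for one chunk.

    chunk = {"text": str, "embedding": np.ndarray}; weights are illustrative.
    """
    return (w_vec * cosine(query_emb, chunk["embedding"])
            + w_kw * keyword_overlap(query, chunk["text"]))
```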

Step 3: Context Selection for Model Input

  1. Cutoff or sorting using semantic score thresholds: Set a threshold for how closely chunks must match the query before they are included in the model's context (a combined sketch of items 1 and 3 follows this list). For systems where information loss doesn't carry a heavy penalty, you can also cut off based on the number of chunks retrieved.
  2. Re-ranking algorithms: Use a more computationally heavy model, like a cross-encoder, to re-evaluate the relevance of each retrieved chunk and re-rank them. This improves answer quality by emphasizing semantic similarity.
  3. Small child chunks linked to bigger chunks: Small chunks can be tied to larger chunks that provide a broader context for the model. These larger chunks can either be pre-indexed or dynamically selected using a window method algorithm to extract surrounding sentences around the retrieved chunks.
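
Putting items 1 and 3 together, here is a minimal sketch of score-threshold filtering followed by "small-to-big" expansion to pre-indexed parent chunks. The threshold value, the chunk cap, and the parent lookup structure are assumptions for illustration.

```python
def select_context(retrieved: list[dict], parents: dict[str, str],
                   min_score: float = 0.75, max_chunks: int = 8) -> list[str]:
    """Filter retrieved chunks by semantic score, then expand to parent chunks.

    retrieved: [{"id", "parent_id", "text", "score"}, ...]
    parents:   parent_id -> larger pre-indexed chunk giving broader context.
    The 0.75 threshold and max_chunks cap are illustrative knobs only.
    """
    kept = sorted((c for c in retrieved if c["score"] >= min_score),
                  key=lambda c: c["score"], reverse=True)[:max_chunks]
    context, seen = [], set()
    for c in kept:
        pid = c.get("parent_id")
        if pid is not None and pid in parents:
            if pid not in seen:
                seen.add(pid)
                context.append(parents[pid])   # broader parent chunk, deduplicated
        else:
            context.append(c["text"])          # fall back to the child chunk itself
    return context
```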

RAG’s Efficiency vs. LC’s All-in Approach

While LC LLMs excel at keeping everything in memory, RAG relies on a more targeted approach, retrieving and passing only relevant data to the model. In doing so, RAG avoids the noise that comes with processing lengthy contexts. However, RAG isn't without flaws: it struggles with multi-step reasoning, general queries, and questions that require full-context comprehension. That's why query splitting and context routing become crucial. My experience implementing context routing, similar to the SELF-ROUTE approach proposed by Li et al., shows how dividing a query into shorter segments and routing them through different pipelines (vRAG or LC LLMs) can yield better results for short, specific queries, but this quickly becomes costly for more complex ones.
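
A hedged sketch of the routing idea, in the spirit of SELF-ROUTE: first try to answer from the retrieved chunks, and fall back to the full long context only if the model declares the chunks insufficient. The prompt wording and the call_llm helper are hypothetical placeholders, not the paper's exact implementation.

```python
UNANSWERABLE = "UNANSWERABLE"

def self_route(query: str, chunks: list[str], full_document: str, call_llm) -> str:
    """Two-step routing: cheap RAG pass first, expensive LC pass only if needed.

    call_llm(prompt) -> str is a hypothetical wrapper around whichever model
    you use; the prompts below are illustrative, not the paper's exact ones.
    """
    context = "\n---\n".join(chunks)
    rag_prompt = (
        "Answer the question using ONLY the context below. "
        f"If the context is insufficient, reply exactly '{UNANSWERABLE}'.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    answer = call_llm(rag_prompt)
    if UNANSWERABLE in answer:
        # Fall back to long-context reasoning over the full document.
        answer = call_llm(f"Document:\n{full_document}\n\nQuestion: {query}")
    return answer
```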

The paper from Li et al. evaluates their proposed SELF-ROUTE approach against standard LC and RAG approaches, showing a 65% cost reduction for Gemini-1.5 and 39% for GPT-4o, with minimal performance loss.

Reference: Li et al.


When building RAG approaches for apnaAI, we also experimented with query splitting, breaking long or multi-step questions into smaller ones to pass through a RAG pipeline. Although this technique can lead to more precise answers, it adds complexity and risks losing the overall context. We observed marginal improvements in quality at a much higher computational cost, so this wasn't explored further.
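
For completeness, a minimal sketch of the query-splitting idea: decompose a multi-step question into sub-questions, answer each through the RAG pipeline, then synthesize. The decomposition prompt and the rag_answer / call_llm helpers are hypothetical, not what we shipped.

```python
def split_and_answer(query: str, rag_answer, call_llm) -> str:
    """Naive query splitting: decompose, answer each part via RAG, then synthesize.

    rag_answer(q) -> str runs the RAG pipeline for one sub-question;
    call_llm(prompt) -> str is a hypothetical model wrapper.
    Prompts are illustrative only.
    """
    sub_questions = call_llm(
        "Split the following question into independent sub-questions, "
        f"one per line:\n{query}"
    ).splitlines()
    partials = [f"Q: {q}\nA: {rag_answer(q)}" for q in sub_questions if q.strip()]
    return call_llm(
        "Using the partial answers below, answer the original question.\n\n"
        + "\n\n".join(partials)
        + f"\n\nOriginal question: {query}"
    )
```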

Order-Preserve RAG (OP-RAG): A new approach proposed by authors from NVIDIA

In a recent paper, authors from NVIDIA introduced the concept of Order-Preserve RAG (OP-RAG). This method maintains the original order of retrieved chunks, allowing the model to see the document's structure with respect to the retrieved chunks. Unlike traditional RAG approaches that re-order retrieved chunks by relevance score, OP-RAG delivers superior answer quality by preserving coherence. In experiments on two benchmark datasets, OP-RAG outperformed LC LLM reasoning while requiring fewer tokens to generate high-quality responses, illustrating how focused retrieval combined with attention to information structure can sometimes outperform brute-force LC models.
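
The core of OP-RAG is easy to express: select the top-k chunks by relevance, but hand them to the model in their original document order rather than score order. A minimal sketch, assuming each retrieved chunk carries its original position index:

```python
def op_rag_context(retrieved: list[dict], top_k: int = 16) -> list[str]:
    """Order-preserving context assembly, after the OP-RAG idea in Yu et al.

    retrieved: [{"text": str, "score": float, "position": int}, ...] where
    "position" is the chunk's index in the original document. Select by
    relevance first, then restore document order before building the prompt.
    """
    top = sorted(retrieved, key=lambda c: c["score"], reverse=True)[:top_k]
    ordered = sorted(top, key=lambda c: c["position"])   # preserve document order
    return [c["text"] for c in ordered]
```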


Reference: Yu et al. (Nvidia paper)


Understanding and Balancing the Tradeoffs

As shown by recent research and practice, LC LLM reasoning has a threshold beyond which its ability to reason falters under the overwhelming amount of input. This threshold depends on the size and type of model; it is higher for a Llama 70B model than for an 8B one. The inverted U-curve pattern observed in the OP-RAG study highlights this clearly: as the number of retrieved chunks grows, performance improves up to a point, then declines due to information overload. This reflects the challenge faced by the transformer's self-attention mechanism, which keeps path lengths between tokens short in theory, but becomes overwhelmed with noise and clutter in long contexts.


Reference: Yu et al. (Nvidia paper)


While OP-RAG presents an interesting RAG approach by maintaining the order of retrieved chunks, the study does have its limitations. The experiments primarily focused on one specific setup (chunked retrieval within contexts of up to 128K tokens), and the results were evaluated on two public datasets, which may not fully reflect the complexity of real-world scenarios.

In practical applications, the order of retrieved chunks may not always correlate directly with the knowledge dependencies between them. For instance, when retrieving from multiple documents, the order of the retrieved data does not necessarily reflect logical or conceptual continuity, which can cause OP-RAG techniques to underperform, as maintaining strict chunk order may miss important cross-references or contextual links.

A potential improvement to OP-RAG would involve modifying the retrieval process to not only maintain chunk order but also dynamically reorder chunks based on their relative position and contextual similarity within the document. This approach could better reflect the real-life complexity of documents, where concepts are interrelated but not always sequentially arranged. Such a method could provide a more flexible way to handle long contexts, while also preserving key knowledge relationships across chunks.
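
One way to sketch that idea: keep document order as the primary key, but allow a chunk to be pulled forward so it sits next to an earlier chunk it is highly similar to. This is a speculative illustration of the proposal above, not an evaluated method; the similarity threshold and chunk fields are assumptions.

```python
import numpy as np

def position_and_similarity_order(chunks: list[dict], sim_threshold: float = 0.85) -> list[dict]:
    """Speculative ordering: document order first, with similarity-based regrouping.

    chunks: [{"text", "position", "embedding"}, ...]; the 0.85 threshold is
    illustrative. A chunk is moved to sit right after the first already-placed
    chunk whose embedding it closely matches; otherwise it keeps document order.
    """
    ordered = sorted(chunks, key=lambda c: c["position"])
    result = []
    for c in ordered:
        insert_at = len(result)                     # default: append in document order
        for i, prev in enumerate(result):
            sim = float(np.dot(c["embedding"], prev["embedding"]) /
                        (np.linalg.norm(c["embedding"]) *
                         np.linalg.norm(prev["embedding"]) + 1e-9))
            if sim >= sim_threshold:
                insert_at = i + 1                   # group with its closest related chunk
                break
        result.insert(insert_at, c)
    return result
```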

From my own experience, I've learned that context length, information quality, and the sequence of chunks fed into the LLM all matter. Short, simple queries tend to benefit from RAG pipelines, especially when all relevant chunks can be retrieved and ordered efficiently. Long, complex queries, on the other hand, benefit from LC reasoning but require careful handling to avoid information overload and hallucinations. RAGs also perform well when a fine-tuned embedding model is used, as retrievals become more efficient and contextually relevant. I will be experimenting more with Graph-RAG and Graph-LC techniques to see whether there are notable differences in output quality.

Thank you for reading. Please leave your comments below.

References

Li, Z., Li, C., Zhang, M., Mei, Q., & Bendersky, M. (2024). Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach. Google DeepMind.

Yu, T., Xu, A., & Akkiraju, R. (2024). In Defense of RAG in the Era of Long-Context Language Models. NVIDIA.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is All You Need.

