Retrieval Augmented Generation (RAG) vs. Long-Context (LC) reasoning tradeoffs in Transformer-based Language Models
Arnab Dutta
Innovative Engineering Leader | Expert in AI Engineering & Data Analytics Solutions | Full-stack Web and Mobile Generative AI Product Development | Ex-Deloitte and Wipro
NOTE - In this article, I will share my personal views on both these approaches, leveraging insights from recent research papers and my own experience of building Gen-AI products such as apnaAI and POCs.
Introduction
RAG, specifically vector-based RAG (vRAG), is a method where the system retrieves data chunks semantically related to the query using techniques like cosine similarity or Euclidean distance. This approach allows language models (SLMs or LLMs) to generate responses based on relevant data retrieved dynamically, reducing the need to process large contexts directly. On the other hand, long-context (LC) LLMs like GPT-4, Gemini 1.5 Pro, or Llama 3.1-70B can process vast text sequences by incorporating everything into their input context window. The size of the model supports its ability to filter out distractions from such long contexts and align the reasoning with the context of the query prompt.
However, as much as LC LLMs provide seamless context comprehension, there is a cost associated with processing extremely long contexts: a potential loss of focus and an increase in computational complexity, both inherent characteristics of the self-attention and positional encoding mechanisms of the underpinning transformer-decoder architecture.
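To make the vRAG retrieval step concrete, here is a minimal NumPy sketch of cosine-similarity retrieval over pre-computed chunk embeddings. The embedding model and vector store are deliberately left out; `query_emb` and `chunk_embs` are assumed to come from whatever embedding model and index you actually use.

```python
import numpy as np

def cosine_similarity(query_emb: np.ndarray, chunk_embs: np.ndarray) -> np.ndarray:
    # query_emb: (d,), chunk_embs: (n, d) -> one similarity score per chunk, shape (n,)
    q = query_emb / (np.linalg.norm(query_emb) + 1e-10)
    c = chunk_embs / (np.linalg.norm(chunk_embs, axis=1, keepdims=True) + 1e-10)
    return c @ q

def retrieve_top_k(query_emb, chunk_embs, chunks, k=4):
    # Rank pre-embedded chunks by semantic similarity to the query and keep the top-k
    scores = cosine_similarity(query_emb, chunk_embs)
    top_idx = np.argsort(scores)[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in top_idx]
```

The retrieved chunks are then injected into the prompt, so the model only ever sees a few relevant passages instead of the full corpus.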
Understanding the Key Differences and Tradeoffs
Self-Attention and Positional Encoding: The Transformer’s Core Mechanisms
The self-attention mechanism, as introduced in the original Transformer paper by Vaswani et al., allows LLMs to compute relationships between tokens at any position, which is crucial for handling long dependencies within a text. Meanwhile, positional encoding provides the model with a sense of sequence, ensuring that it can maintain the order of words in a sentence. However, as context length increases, the ability of these mechanisms to distinguish relevant from irrelevant information diminishes. This is where the concept of thresholds comes into play. Both self-attention and positional encoding have inherent limitations when dealing with excessive token lengths. The challenge is in balancing focus and reasoning, and this balance becomes harder to maintain as the context grows. In practice, LC models begin to struggle when faced with excessively long texts, often losing their ability to differentiate important from unimportant data.
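As a reference point, here is a minimal NumPy sketch of the two mechanisms discussed above, scaled dot-product attention and sinusoidal positional encoding. It omits multi-head projections, masking, and batching, but the (seq_len x seq_len) score matrix makes the quadratic cost of long contexts explicit.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k), V: (seq_len, d_v). The (seq_len, seq_len) score
    # matrix is what makes cost grow quadratically with context length.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

def sinusoidal_positional_encoding(seq_len, d_model):
    # Fixed sinusoidal encoding from Vaswani et al. (2017): sin on even
    # dimensions, cos on odd dimensions.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
```

Every extra thousand tokens adds a full row and column of attention scores per layer, which is where the focus and compute pressure at long context lengths comes from.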
Vector RAG Implementation Approaches
Step 1: Modelling
Step 2: Retrieval Approaches
Deciding between using vector search alone or employing hybrid approaches that combine vector search with symbolic or rule-based retrieval methods is key. Hybrid approaches may provide a better balance between accuracy and speed, especially in complex query scenarios.
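As an illustration of the hybrid idea, here is a rough sketch that blends vector similarity with a naive keyword-overlap score. The weighting parameter `alpha` and the lexical scorer are my own placeholder choices (a production system would more likely use BM25), not a prescription.

```python
import numpy as np

def keyword_score(query: str, chunk: str) -> float:
    # Simple lexical-overlap score between query terms and chunk terms.
    q_terms, c_terms = set(query.lower().split()), set(chunk.lower().split())
    return len(q_terms & c_terms) / (len(q_terms) + 1e-10)

def hybrid_retrieve(query, query_emb, chunks, chunk_embs, k=4, alpha=0.7):
    # Blend semantic (vector) and lexical (keyword) relevance:
    # alpha weights the vector score, (1 - alpha) the keyword score.
    vec_scores = chunk_embs @ query_emb / (
        np.linalg.norm(chunk_embs, axis=1) * np.linalg.norm(query_emb) + 1e-10)
    kw_scores = np.array([keyword_score(query, c) for c in chunks])
    combined = alpha * vec_scores + (1 - alpha) * kw_scores
    top_idx = np.argsort(combined)[::-1][:k]
    return [chunks[i] for i in top_idx]
```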
Step 3: Context Selection for Model Input
RAG’s Efficiency vs. LC’s All-in Approach
While LC LLMs excel at keeping everything in memory, RAG relies on a more targeted approach: retrieving and passing only relevant data to the models. In doing so, RAG avoids the noise that comes with processing lengthy contexts. However, RAG isn’t without flaws. It struggles with multi-step reasoning, general queries, and questions that require full context comprehension. That’s why splitting queries and context routing become crucial. My experience implementing context routing, similar to the SELF-ROUTE approach proposed by Li et al., demonstrates how dividing a query into shorter segments and routing them through different pipelines (vRAG or LC LLMs) can yield better results for short, specific queries, but this quickly becomes costly for more complex ones.
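Below is a rough sketch of the routing idea, inspired by how Li et al. describe SELF-ROUTE: the model first tries to answer from the retrieved chunks and is given an explicit option to declare them insufficient, in which case the query falls back to the long-context path. `rag_answer` and `lc_answer` are hypothetical callables wrapping your own RAG pipeline and LC model, not part of any library.

```python
UNANSWERABLE = "unanswerable"

def route_query(query: str, rag_answer, lc_answer) -> str:
    # Cheap path first: answer from retrieved chunks, with an explicit escape
    # hatch when they are insufficient (the prompt inside rag_answer is assumed
    # to instruct the model to reply "unanswerable" in that case).
    draft = rag_answer(query)
    if UNANSWERABLE in draft.lower():
        # Expensive fallback: feed the full long context to the LC model.
        return lc_answer(query)
    return draft
```

Because most queries are answered on the cheap path, the expensive LC call is only paid for the fraction of queries that genuinely need full-context reasoning.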
The paper from Li et al. evaluates their proposed SELF-ROUTE approach against standard LC and RAG approaches, showing a 65% cost reduction for Gemini-1.5 and 39% for GPT-4o, with minimal performance loss.
When building RAG approaches for apnaAI, we also experimented with query splitting, breaking long or multi-step questions into smaller ones to pass through a RAG pipeline. Although this technique can lead to more precise answers, it adds complexity and the risk of losing the overall context. We observed marginal improvements in performance at a much higher computational cost, so this wasn’t explored further.
Order-Preserve RAG (OP-RAG): A new approach proposed by authors from NVIDIA
In a recent paper, authors from NVIDIA introduced the concept of Order-Preserve RAG (OP-RAG). This method maintains the original order of retrieved chunks, allowing the model to see the document’s structure as reflected in the retrieved chunks. Unlike traditional RAG approaches that re-order the retrieved chunks based on relevance scores, OP-RAG delivers superior answer quality by preserving coherence. In experiments on two benchmark datasets, OP-RAG outperformed LC LLM reasoning while requiring fewer tokens to generate high-quality responses, illustrating how focused retrieval combined with attention to information structure can sometimes outperform brute-force LC models.
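Here is a minimal sketch of the order-preserving idea as I read it from the paper: selection is still by relevance, but presentation follows document order. It assumes chunks are stored in the order they appear in the source document, so the chunk index doubles as its position.

```python
import numpy as np

def op_rag_retrieve(query_emb, chunk_embs, chunks, k=8):
    # Score chunks by similarity and pick the top-k as usual, but present them
    # in their original document order (chunk index == position in document)
    # rather than by descending relevance score.
    scores = chunk_embs @ query_emb / (
        np.linalg.norm(chunk_embs, axis=1) * np.linalg.norm(query_emb) + 1e-10)
    top_idx = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in sorted(top_idx)]
```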
Understanding and Balancing the Tradeoffs
As shown by recent research and practice, LC LLM reasoning has a threshold where the ability to reason falters due to the overwhelming amount of input. This threshold scales with the size and type of model: it is higher for a Llama 70B model than for an 8B one. The inverted U-curve pattern observed in the OP-RAG study highlights this clearly: as the number of retrieved chunks grows, performance improves up to a point, but then declines due to information overload. This reflects the challenge faced by the transformer's self-attention mechanism, which keeps path lengths between tokens short in theory, but becomes overwhelmed with noise and clutter in long contexts.
While OP-RAG presents an interesting RAG approach by maintaining the order of retrieved chunks, the study does have its limitations. The experiments focused on a specific implementation and chunking configuration for 128K-token contexts, and the results were evaluated on two public datasets, which may not fully reflect the complexity of real-life scenarios.
In practical applications, the order of retrieved chunks may not always correlate directly with the knowledge dependencies between them. For instance, retrieval may span multiple documents, where the order of the retrieved data is not necessarily reflective of logical or conceptual continuity. In such cases OP-RAG techniques can underperform, as maintaining strict chunk order could miss important cross-references or contextual links.
A potential improvement to OP-RAG would involve modifying the retrieval process to not only maintain chunk order but also dynamically reorder chunks based on their relative position and contextual similarity within the document. This approach could better reflect the real-life complexity of documents, where concepts are interrelated but not always sequentially arranged. Such a method could provide a more flexible way to handle long contexts, while also preserving key knowledge relationships across chunks.
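One way such a reordering could look is sketched below, under the assumption that each chunk carries a `doc_ids` entry identifying its source document. This is my own rough interpretation of the idea above, not something proposed in the OP-RAG paper: documents are ranked by their most relevant chunk, while the original chunk order is preserved inside each document.

```python
from collections import defaultdict
import numpy as np

def grouped_order_retrieve(query_emb, chunk_embs, chunks, doc_ids, k=8):
    # Pick the top-k chunks by similarity, group them by source document,
    # rank documents by their best chunk score, and preserve the original
    # chunk order inside each document.
    scores = chunk_embs @ query_emb / (
        np.linalg.norm(chunk_embs, axis=1) * np.linalg.norm(query_emb) + 1e-10)
    top_idx = np.argsort(scores)[::-1][:k]

    by_doc = defaultdict(list)
    for i in top_idx:
        by_doc[doc_ids[i]].append(i)

    ranked_docs = sorted(by_doc, key=lambda d: max(scores[i] for i in by_doc[d]),
                         reverse=True)
    return [chunks[i] for d in ranked_docs for i in sorted(by_doc[d])]
```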
From my own experience, I’ve learned that context length, information quality, and the sequence of chunks fed into the LLM all matter. Short, simple queries tend to benefit from RAG pipelines, especially when all relevant chunks can be retrieved and ordered efficiently. Long, complex queries, on the other hand, benefit from LC reasoning but require careful handling to avoid information overload and hallucinations. RAG pipelines also perform well when a fine-tuned embedding model is used, as retrievals become more efficient and contextually relevant. I will be experimenting more with Graph-RAG and Graph-LC techniques to see whether they yield notable differences in output quality.
Thank you for reading. Please leave your comments below.
References
Li, Z., Li, C., Zhang, M., Mei, Q., & Bendersky, M. (2024). Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach. Google DeepMind.
Yu, T., Xu, A., & Akkiraju, R. (2024). In Defense of RAG in the Era of Long-Context Language Models. NVIDIA.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS 2017).