Retrieval Augmented Generation (RAG) vs Large Context Window in Large Language Models (LLMs)

The first few months of 2024 marked a pivotal moment in the AI field, as Google and Anthropic released models (Gemini 1.5 Pro and Claude 3) capable of accepting inputs of up to, and in some cases beyond, 1 million tokens.

To put things into perspective, a context window of 1 million tokens could analyse the entire Harry Potter collection (roughly 750,000 words) in a single prompt.

Then, at Google I/O 2024 on 14th May 2024, Google announced that Gemini 1.5 Pro will get a massive 2 million token context window, while Nous-Capybara-34B V1.9, a new open-source model, also offers a 2 million token context window.

The availability of such massive context windows in LLMs has sparked a global debate, with many people asking: what is the need or value of spending time and effort building and deploying RAG anymore?

To answer this question, we need to first take a deeper look at the high-level design and workings of RAG and Context Windows in LLMs.

RAG

RAG combines the power of LLMs with external knowledge sources to produce more informed and accurate responses. When a user query is received, the RAG system first processes it to understand its context and intent. It then retrieves data relevant to the query from a specified knowledge base or database (which is, in most cases, internal to the organisation) and passes it, along with the query, to the LLM as "chunks" of context – enabling more accurate, relevant, and up-to-date responses (as depicted in the diagram below).

High-level design and functioning of a typical RAG system
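
For readers who prefer code to diagrams, here is a minimal, self-contained sketch of that retrieve-then-generate flow in Python. The bag-of-words "embedding" and the hard-coded knowledge base are toy stand-ins of my own, added purely for illustration; a real deployment would use a proper embedding model, a vector database, and an actual LLM endpoint.

```python
# Minimal RAG flow: embed the query, rank knowledge-base chunks by similarity,
# and build a prompt containing only the top-ranked chunks plus the question.
# The bag-of-words "embedding" below is a toy stand-in for a real embedding model.
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would call an embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    # Rank all chunks by similarity to the query and keep only the top-k.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]

def build_prompt(query: str, context_chunks: list[str]) -> str:
    # Only the retrieved chunks (not the whole knowledge base) go into the prompt.
    return ("Answer the question using only the context below.\n\nContext:\n"
            + "\n---\n".join(context_chunks)
            + f"\n\nQuestion: {query}\nAnswer:")

knowledge_base = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The head office is located in Singapore.",
    "Support is available 24/7 via chat and email.",
]
question = "How many days do I have to return a purchase?"
print(build_prompt(question, retrieve(question, knowledge_base)))
```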

Context Windows

The context window of an LLM refers to the amount of text (measured in tokens) that users can enter as a query/input in a single prompt when generating a response. An LLM uses the context window to pick up the nuances of language and contextual information needed to produce the most appropriate response. The longer/larger the context window, the more data can be added to the prompt as input. Stuffing more data into a long context window should therefore generate more precise and accurate responses, by increasing the LLM's "short-term memory" and its in-context learning ability when generating outputs. An extended context window can potentially enhance the model's ability to understand comprehensive narratives and complex ideas, acting as a "lens" that provides additional or deeper context to the LLM and enabling it to grasp and connect information from parts of the text that are far apart – thereby potentially improving the overall quality and relevance of the generated responses.
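
To make the "stuffing more data into the prompt" idea concrete, here is a rough sketch of fitting documents into a fixed token budget. The 4-characters-per-token rule of thumb and the window sizes are assumptions for illustration only; a real system would count tokens with the model's own tokenizer.

```python
# Rough token-budget sketch. The ~4 characters-per-token rule of thumb is only
# an approximation; a production system would count tokens with the model's
# own tokenizer rather than this heuristic.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude heuristic for English text

def fit_to_window(documents: list[str], context_window: int, reserved_for_answer: int = 1024) -> list[str]:
    """Greedily keep whole documents until the prompt's token budget is exhausted."""
    budget = context_window - reserved_for_answer
    kept, used = [], 0
    for doc in documents:
        cost = estimate_tokens(doc)
        if used + cost > budget:
            break
        kept.append(doc)
        used += cost
    return kept

# A 1M-token window fits far more source material into a single prompt
# than, say, a 128k-token window does.
docs = [f"chapter {i} " + "lorem ipsum " * 200 for i in range(2000)]
print(len(fit_to_window(docs, context_window=128_000)))    # only part of the corpus fits
print(len(fit_to_window(docs, context_window=1_000_000)))  # far more of it fits
```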

LLMs with a large context window can also "remember" their initial instructions and the content of the most recent conversation more effectively, allowing them to deliver responses that are more relevant to the ongoing conversation and task.

This is particularly valuable in tasks such as document summarization, extended AI conversations, or complex problem-solving, where previous input and context are vital.

Pros & Cons of RAG and Large Context Windows

While a larger context window gives the LLM access to more contextual information, it also increases the amount of data the model has to process. This can make it harder for the LLM to separate the useful/relevant information from the irrelevant "noise" in the large amount of data fed in by users, and may sometimes overwhelm the model (much like the "information overload" we humans often face) – resulting in "distraction" and a greater chance of "hallucination". Processing large inputs also requires more computational resources, resulting in higher costs and potentially slower performance.
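
As a purely illustrative back-of-the-envelope comparison (the per-token price below is a made-up placeholder, not any provider's actual rate), the difference in input size translates directly into a difference in per-query cost:

```python
# Back-of-the-envelope input-cost comparison. The per-token price below is a
# purely hypothetical placeholder, not any provider's actual rate.
HYPOTHETICAL_PRICE_PER_1K_INPUT_TOKENS = 0.005  # illustrative figure only

def prompt_cost(input_tokens: int) -> float:
    return input_tokens / 1000 * HYPOTHETICAL_PRICE_PER_1K_INPUT_TOKENS

# Stuffing a 1M-token corpus into every prompt vs. sending only a few
# retrieved chunks (~4k tokens) with each question:
print(f"full-context prompt: ${prompt_cost(1_000_000):.2f} per question")
print(f"RAG-style prompt:    ${prompt_cost(4_000):.2f} per question")
```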

In a recent study titled "Lost in the Middle", Nelson F. Liu and colleagues from Stanford University demonstrated that advanced LLMs frequently struggle to extract significant information from their context windows, particularly when the important information is buried in the middle of the context window (current LLMs appear to prioritise data at the beginning and end of the window). According to their findings, LLMs perform best when given fewer, more relevant pieces of data in the context, rather than huge amounts of information. Moreover, in its technical report, Google shows how Gemini 1.5 Pro performs on the "Needle in a Haystack" test, which asks the model to retrieve a few given sentences from a provided text. As can be seen in the image below, Gemini 1.5 Pro has higher recall at shorter context lengths and a small decrease in recall toward 1M tokens, where recall tends to be around 60%. This means that around 40% of the relevant sentences are "lost" to the model.
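
For those curious how such a test is set up, the sketch below captures the basic idea: plant a known "needle" sentence at varying depths in a long filler text, ask the model to retrieve it, and record whether it succeeds. The stub model here is my own toy stand-in that deliberately mimics the "lost in the middle" effect; it is not how Gemini or any real model works.

```python
# Simplified needle-in-a-haystack harness. The stub_llm below is a toy model
# that only "sees" the start and end of its input, to mimic the
# lost-in-the-middle effect; a real evaluation would call an actual LLM here.

def build_haystack(needle: str, filler: str, total_sentences: int, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    sentences = [filler] * total_sentences
    sentences.insert(int(total_sentences * depth), needle)
    return " ".join(sentences)

def needle_found(depth: float, llm_complete, needle: str = "The secret code is 4711.") -> bool:
    haystack = build_haystack(needle, "The sky was grey and the wind was calm.", 200, depth)
    answer = llm_complete(haystack + "\n\nWhat is the secret code mentioned above?")
    return "4711" in answer

def stub_llm(prompt: str) -> str:
    # Pretends to pay attention only to the first and last ~2000 characters.
    visible = prompt[:2000] + prompt[-2000:]
    return "The secret code is 4711." if "4711" in visible else "I don't know."

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"needle at depth {depth:.2f}: retrieved = {needle_found(depth, stub_llm)}")
```

Run against the stub, the needle is only found at the very beginning and end of the text; a real evaluation sweeps both the insertion depth and the total context length, which is essentially what the recall heat maps in these reports visualise.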

RAG, on the other hand, allows the LLM to search for and retrieve relevant information from an external knowledge base/database before generating a response, without being overwhelmed by a massive context window. In essence, RAG takes a more targeted approach and retrieves only the most relevant information, making it faster and more cost-effective. Moreover, focusing on relevant information reduces the risk of "hallucinations" and improves factual accuracy.

Proponents of long/large context windows in LLMs, however, argue that techniques such as prioritisation and attention mechanisms, strategic truncation, caching mechanisms, good prompt engineering, and fine-tuning can go a long way toward improving an LLM's ability to keep track of longer/larger inputs, allowing it to grasp or "learn" the context and its finer details more effectively – resulting in more accurate and contextually relevant responses. Google, in fact, showcased that Gemini 1.5 Pro can learn new skills from information in long/large prompts without any additional fine-tuning – something that currently cannot be achieved so simply with traditional RAG.

In my view, however, especially when it comes to Enterprise use cases, RAG certainly offers some specific advantages (as of now, at least) over large/long context windows in LLMs, such as the ones mentioned in the table below:

Comparison between RAG and Large/Long Context Windows in LLMs

What does the future look like?

The landscape of RAG versus large/long-context models will keep evolving over the coming months and years. Google is already experimenting with expanding Gemini's context window to 10 million tokens, but with current hardware and model architectures this option is not yet viable for production.

As hardware evolves and research progresses, we can anticipate a reduction in the latency, costs and other downsides associated with large context windows. This trend could potentially lead to a shift away from RAG for many use cases in the future. As LLMs become more efficient and capable of handling extensive contexts on their own, the reliance on RAG to supplement LLMs with external knowledge bases may decrease, signalling a significant evolution in how AI systems are designed and utilised for complex tasks.

However, for now, Enterprises can consider combining RAG with long/large-context models to push the boundaries of AI's capabilities. This leverages the strengths of both technologies: the deep understanding and comprehensive processing power of long-context models, and the dynamic, up-to-date knowledge retrieval of RAG systems. Such integration could lead to AI outputs that are not only coherent and accurate over long/large contexts but also factually precise and relevant – by virtue of pulling in the most up-to-date and accurate information.
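
One simple way to picture such a combination (purely my own sketch, not a description of any specific product): keep retrieval in the loop to rank material by relevance, but let a generous context budget decide how much of that material the model actually sees. The retrieve_ranked and llm_complete callables below are placeholders for a real retriever and model endpoint.

```python
# Hybrid sketch: keep retrieval in the loop, but let a large context budget
# absorb many more relevance-ranked chunks than a classic small-context RAG
# setup would. retrieve_ranked() and llm_complete() are placeholders for a
# real retriever and model endpoint; the stubs at the bottom just make it run.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # same crude heuristic as in the earlier sketch

def hybrid_answer(query: str, retrieve_ranked, llm_complete,
                  context_window: int = 1_000_000, reserved_for_answer: int = 2_000) -> str:
    budget = context_window - reserved_for_answer
    context, used = [], 0
    for chunk in retrieve_ranked(query):          # chunks arrive ranked by relevance
        cost = estimate_tokens(chunk)
        if used + cost > budget:                  # stop only when the big window is full
            break
        context.append(chunk)
        used += cost
    prompt = ("Context:\n" + "\n---\n".join(context)
              + f"\n\nQuestion: {query}\nAnswer:")
    return llm_complete(prompt)

# Tiny stubs so the sketch runs end-to-end.
demo_chunks = ["Policy: returns accepted within 30 days.", "Office hours are 9 to 5."]
print(hybrid_answer("What is the return window?",
                    retrieve_ranked=lambda q: demo_chunks,
                    llm_complete=lambda p: f"(model would answer using {p.count('---') + 1} chunks)"))
```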

I am really interested in this topic (which is also a hotly debated one at my workplace at the moment) and would love to hear your thoughts and views.


Chandan B.

Digital & Sustainability Professional

10 months ago

Well written. While it is horses for courses, my view - while creating larger tokens seems like an arms race, the possibility of Real-Time updates can be made possible with RAG, case in point financial news analysis, which is where the differentiator lies for me.
