Phased Approach | Reports of the death of RAG have been greatly exaggerated


Is it just me, or does every feature released by OpenAI and Anthropic lead to breathless headlines in the AI press proclaiming the end of RAG?

The latest feature to trigger these headlines is prompt caching, released by Anthropic for Claude. My theory is that most of these AI YouTubers and writers have never actually used RAG for anything more than chatting with a PDF, and so have no idea what its business and enterprise applications are.

So today I want to talk about this new feature from Anthropic and why it is great, but I also want to talk about what RAG is, what it does, and how features like these are at best complementary to RAG in an actual business context.




But before we dive into why prompt caching isn’t the silver bullet that some might think it is, let’s take a step back and talk about Retrieval-Augmented Generation (RAG)—what it really is and why it’s so crucial, especially for businesses dealing with large databases of documents.

What Exactly is RAG?

At its core, RAG is a hybrid approach that combines two powerful capabilities: retrieval and generation. Imagine you’re running a company with a vast knowledge base—think thousands of documents, technical manuals, customer interactions, or legal contracts. Finding the right piece of information quickly and accurately is a massive challenge. This is where RAG comes into play.

RAG uses a retrieval mechanism to sift through all that data and pull out the most relevant chunks of information based on the query. It doesn’t stop there, though. Once the relevant data is retrieved, it hands that over to a language model (the generation part) to craft a coherent, contextually relevant response. This means the AI isn’t just spitting out pre-existing text—it’s creating a nuanced answer based on the most pertinent data available.
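To make that concrete, here is a minimal sketch of what a RAG pipeline can look like in Python. The embedding model, the tiny in-memory document store, and the Claude model name are illustrative assumptions rather than a recommended production setup.

```python
# Minimal RAG sketch: embed documents, retrieve the chunks most relevant
# to a query, then hand them to an LLM to generate a grounded answer.
# The embedding model, documents, and model name are illustrative only.
import numpy as np
from sentence_transformers import SentenceTransformer
import anthropic

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

documents = [
    "Refunds are processed within 14 days of a return being received.",
    "Enterprise contracts renew automatically unless cancelled 60 days in advance.",
    "The API rate limit is 1,000 requests per minute per organisation.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k document chunks most similar to the query."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q  # cosine similarity, since vectors are normalised
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def answer(query: str) -> str:
    """Generate a response grounded in the retrieved chunks."""
    context = "\n\n".join(retrieve(query))
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # placeholder model name
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"Answer using only this context:\n{context}\n\nQuestion: {query}",
        }],
    )
    return response.content[0].text

print(answer("How long do refunds take?"))
```

In a real deployment the three hard-coded documents would be replaced by a proper vector store holding millions of chunks, but the shape of the pipeline stays the same: retrieve first, then generate.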




Why is RAG So Important for Businesses?

Now, let’s connect the dots to why this matters in a business context. If you’re managing a large-scale operation, your data isn’t just big—it’s massive. We’re talking millions of tokens worth of information, spread across various systems and formats. In such environments, RAG becomes indispensable because:

  1. Scalability: RAG is built to handle the vast amounts of data that businesses accumulate. It doesn’t just rely on a static, pre-defined context; it actively retrieves the latest, most relevant information whenever it’s needed, making it perfect for dynamic, ever-evolving data landscapes.
  2. Accuracy: Businesses can’t afford to get things wrong. Whether it’s providing customer support, making legal decisions, or analysing technical data, accuracy is key. RAG’s retrieval component ensures that the generated responses are based on the most up-to-date and relevant data, which is critical in high-stakes environments.
  3. Contextual Understanding: In a business setting, understanding the context is everything. RAG’s ability to pull from a vast pool of data and generate responses that are contextually aware means that the AI can deliver insights that are not just accurate but also highly relevant to the specific query or problem at hand.

So, while prompt caching is an exciting development, it's important to understand its true capabilities and limitations—especially when comparing it to something as robust as Retrieval-Augmented Generation (RAG).

What Does Claude’s Prompt Caching Really Do?

Claude’s prompt caching is a feature designed to make your interactions with AI models more efficient, particularly in scenarios where you need to repeatedly access the same information within a short period. The basic idea is this: if you’re working with a large document or a complex set of instructions, you can cache that information so that it doesn’t have to be reprocessed every time you interact with the model. This can dramatically reduce costs—by up to 90%—and improve response times by up to 85%, according to Anthropic.



Here’s how it works: when you cache a prompt, Claude stores that information and keeps it ready for future use. If your session remains active, the cache persists, allowing for quick and cost-effective responses to subsequent queries. However, there’s a catch—this cache will only stay alive for five minutes of idle time. This means that if you don’t use the cached data within five minutes, the cache is cleared, and you’ll have to reload the information, which incurs additional costs and time.
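For illustration, here is a minimal sketch of what this looks like with the Anthropic Python SDK: a large, stable document is marked as cacheable so repeated questions against it reuse the cached prefix instead of reprocessing it. The file name and model name are placeholders, older SDK versions exposed this behind a prompt-caching beta flag, and very small prompts fall below the minimum cacheable length, so caching only pays off on genuinely large, stable content.

```python
# Sketch of Claude prompt caching: mark a large, stable system block as
# cacheable so repeated calls within the cache lifetime reuse it.
# File name and model name are placeholders.
import anthropic

client = anthropic.Anthropic()

big_reference_document = open("technical_manual.txt").read()  # large, stable corpus

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=500,
        system=[
            {"type": "text",
             "text": "You are an assistant answering questions about the manual below."},
            {
                "type": "text",
                "text": big_reference_document,
                # This block is cached; later calls within the cache window
                # read it from the cache rather than paying to reprocess it.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

print(ask("What is the recommended maintenance interval?"))
print(ask("Which parts are covered by the warranty?"))  # hits the cache if asked soon enough
```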

The Limitations of Prompt Caching in Business Contexts

Now, this feature is fantastic for specific, short-term tasks where you’re dealing with stable, reusable content over a brief period—like debugging a codebase or analysing a single document. But when we talk about large-scale enterprise environments, things get a bit more complicated.

In a business setting, you’re often dealing with vast amounts of data—millions of tokens spread across countless documents, databases, and knowledge repositories. The 200,000-token context window that Claude offers is substantial, but it’s still limited. In real-world applications, you’ll often need to pull from multiple sources that far exceed this limit. This is where the limitations of prompt caching become evident.

Why RAG Remains Essential

RAG excels in scenarios where you need to gather information from multiple, disparate sources. Instead of trying to fit everything into a single context window, RAG allows the model to retrieve relevant chunks of data from a vast knowledge base and combine them into the prompt to generate a comprehensive, contextually relevant response. This makes RAG incredibly powerful for enterprises where the data landscape is both vast and varied.

While Claude’s prompt caching can be a game-changer for tasks that require multiple prompts on the same corpus of data, it simply can’t handle the scale and complexity that RAG is built for. For instance, if you’re interacting with a large database of customer service logs, legal documents, or technical manuals, you’re likely dealing with far more information than can be cached or processed in a single context window. RAG’s ability to retrieve and integrate multiple pieces of data from various sources ensures that you get a complete and accurate response, regardless of the size or complexity of the data involved.

The Bottom Line: Complementary, Not a Replacement

So, while prompt caching is a valuable tool—especially for reducing costs and speeding up interactions in specific scenarios—it’s not a replacement for RAG. Instead, it serves as a complementary feature that can enhance the efficiency of your AI systems when used appropriately. In cases where you’re working with a single document or a stable set of data points, prompt caching can save time and money. But for the larger, more complex tasks that define enterprise AI applications, RAG remains indispensable.

In essence, prompt caching is a smart way to optimise performance in the short term, but when it comes to handling the sprawling, interconnected data environments typical of large businesses, RAG’s comprehensive retrieval and generation capabilities are still the gold standard. So, next time you hear that prompt caching is the end of RAG, remember that it’s just one piece of a much larger puzzle in the world of AI.
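One way the two can work together in practice: keep the large, stable parts of the prompt (detailed instructions, schemas, few-shot examples) cached, and feed the per-query chunks returned by your retrieval layer in the uncached user turn. The sketch below assumes the same Anthropic SDK as above; the instruction text and helper names are hypothetical, and in a real system the cached block would be long enough to clear the minimum cacheable length.

```python
# Sketch of prompt caching and RAG used together: stable instructions are
# cached, while per-query retrieved chunks go in the uncached user turn.
# Names and contents are illustrative.
import anthropic

client = anthropic.Anthropic()

# In practice this would be a long block (instructions, schemas, examples)
# so that it exceeds the minimum cacheable prompt length.
STABLE_INSTRUCTIONS = "You are a support assistant. Answer only from the provided context."

def answer_with_rag_and_cache(query: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n".join(retrieved_chunks)  # produced by your retrieval layer
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=400,
        system=[{
            "type": "text",
            "text": STABLE_INSTRUCTIONS,
            "cache_control": {"type": "ephemeral"},  # reused across queries
        }],
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}",
        }],
    )
    return response.content[0].text
```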

Why are people always Ragging on RAG?

I have a couple of theories on this.

Theory 1: As I said above, the AI press is filled with a lot of hobbyists who don't actually use these tools in business. They can load a VERY LARGE PDF into a vector database and ask it questions, and then two months later they can load the same PDF directly into the context window of an LLM, so they miss the fact that businesses are not just doing small POCs but have millions of documents.

Theory 2: Wishful thinking. RAG is very hard work. It's not just a vector DB and a LangChain integration; there are so many variables and tweaks needed to get a RAG pipeline to work consistently. People in the industry would love it if one of the Gen AI companies could magically solve this issue so that they didn't have to worry about embedding models, chunk sizes, semantic search algorithms, graph database integrations, etc.
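Just to give a flavour of how many knobs there are, here is a hypothetical configuration object for a RAG pipeline. Every field name and default below is illustrative; the point is simply that each one is a decision that has to be made and then tuned.

```python
# Hypothetical RAG pipeline configuration: each field is a knob that
# typically has to be chosen and tuned before results become consistent.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RagConfig:
    embedding_model: str = "all-MiniLM-L6-v2"  # which embedding model to use
    chunk_size: int = 512                      # tokens per chunk
    chunk_overlap: int = 64                    # overlap between adjacent chunks
    top_k: int = 8                             # chunks retrieved per query
    search_mode: str = "hybrid"                # dense, keyword, or hybrid search
    reranker: Optional[str] = None             # optional cross-encoder reranker
    graph_store: Optional[str] = None          # optional graph database integration
```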

If you made it this far in the Newsletter


As always, if you are curious about how the topics discussed in this newsletter relate to your business or projects that you are working on, please send a message and I'd be happy to have a chat and advise.


