Cache Augmented Generation
Introduction
There's been a lot of buzz lately around Cache-Augmented Generation. In a recent arXiv paper, researchers introduced a new LLM response generation technique called Cache-Augmented Generation (CAG). The idea sounds promising, and there are numerous system design problems where we know caching improves latency and helps scale to millions of active users, so no surprise there. But let's understand Cache-Augmented Generation, its pros and cons, where it makes sense to implement this technique, and why. Before getting into CAG and its use cases, it is also important to understand where RAG falls short.
RAG works, so why an alternative?
Let's take a step back and understand where RAG falls behind and why it makes sense to look for an alternative. Let's also understand some important LLM serving metrics.
Some common LLM Serving Metrics
a. Time to First Token (TTFT) - How quickly the user starts seeing the LLM's response
b. Time per Output Token (TPOT) - Time required to generate each output token for a user
c. Latency - The overall time it takes for the LLM to generate the full response
d. Throughput - Number of output tokens per second an inference server can generate
Latency = Time to First Token + (Time per Output Token × Number of Tokens Generated)
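As a quick worked example (the numbers below are illustrative assumptions, not benchmarks), a request with a 200 ms time to first token, 30 ms per output token, and 500 generated tokens ends up at roughly 15.2 seconds end to end:

```python
# Illustrative latency calculation; all numbers are assumed, not measured.
ttft_s = 0.200            # Time to First Token, in seconds
tpot_s = 0.030            # Time per Output Token, in seconds
num_output_tokens = 500   # tokens generated for this response

latency_s = ttft_s + tpot_s * num_output_tokens
print(f"End-to-end latency: {latency_s:.1f} s")  # -> 15.2 s
```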
One of the major purposes Cache-Augmented Generation serves is to improve inference latency and throughput and to reduce the retrieval error rate, while keeping the overall architecture simple and the system design low maintenance.
But is this enough to replace RAG with CAG? Not quite. There are several other factors to consider when choosing between RAG and CAG.
Factors Influencing the choice
Retrieval Latency - Vector search introduces latency, especially when multiple users are querying in parallel. Scaling also becomes a challenge with many concurrent queries hitting the Vector Database.
Token Management - Retrieved chunks from the Vector Database often bloat the context window.
Scalability Challenges - As the number of users and search queries grows, scaling the retrieval layer becomes a challenge.
Document Selection - Irrelevant document retrieval can lead to suboptimal answers, reducing the system's reliability. RAG often requires additional steps like reranking.
System Complexity - Integrating retrieval and generation requires careful tuning, additional infrastructure, and ongoing maintenance, complicating workflows and increasing overhead.
Hot vs Cold Data
Hot data is data that changes frequently, while cold data stays constant over time. In scenarios where the LLM serves mostly cold data, that data can be served from a cache instead of being retrieved from a Vector Database on every query. RAG can become an expensive technique over time when it is mostly serving cold data.
Modern LLM Context Window
Modern LLMs have enormous context lengths that only keep increasing, and tokens are becoming cheaper and cheaper. This is a perfect scenario to look at retrieval-free generation, since preloaded knowledge can now be passed as input from a cache.
Inference Latency and Real Time Use Cases
The retrieval step adds extra retrieval latency and severely affects TTFT (Time to First Token), significantly degrading the user experience of a real-time chatbot.
Cache Augmented Generation - What and Why?
Cache-Augmented Generation (CAG) is a Retrieval Free approach that eliminates the need for real-time retrieval by using preloaded knowledge and precomputed inference states. Instead of retrieving knowledge during inference, CAG integrates all relevant knowledge into a large language model’s (LLM) extended context beforehand. It utilizes a precomputed key-value (KV) cache to store and reuse the model’s inference states.
At its core, Cache-Augmented Generation can be broken down into three steps:
a. Precomputing External Knowledge as Cache
All of the relevant external knowledge is first preprocessed and precomputed into a key-value (KV) cache. This KV cache is stored on disk or in memory for future use.
This keeps computational costs low, since the preprocessing of the knowledge happens only once, regardless of the number of user queries.
Precomputing the KV cache also gives the LLM a more holistic and coherent understanding of the documents, which results in improved response quality.
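A minimal sketch of this step, assuming a Hugging Face transformers causal LM. The model name, file path, and the load_documents() helper are placeholders, and this is not the paper's exact code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM with a large enough context window works in principle.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Concatenate all the (cold) external knowledge into one long prompt.
knowledge = "\n\n".join(load_documents())  # load_documents() is a hypothetical helper
inputs = tokenizer(knowledge, return_tensors="pt").to(model.device)

# One forward pass over the knowledge fills the KV cache; this happens once,
# regardless of how many user queries arrive later.
with torch.no_grad():
    outputs = model(**inputs, use_cache=True)
kv_cache = outputs.past_key_values
knowledge_len = inputs["input_ids"].shape[1]  # remembered so the cache can be reset later

# Persist the precomputed cache and its length to disk for reuse
# (assumes the cache object is picklable in your transformers version).
torch.save({"kv_cache": kv_cache, "knowledge_len": knowledge_len}, "knowledge_cache.pt")
```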
b. Use this Cache along with Query at the time of Inference to improve Latency
During inference, the precomputed KV cache is loaded along with the user's query, and the LLM uses both to generate the response.
Both retrieval latency and retrieval errors are reduced, since the LLM already has the preloaded knowledge and the query within its context.
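Continuing the same sketch, under the same assumptions and reusing the objects from the previous block: only the query and the newly generated tokens pass through the model, because the knowledge is already sitting in the cache.

```python
# Load the precomputed cache from disk (or keep it resident in memory).
saved = torch.load("knowledge_cache.pt")
kv_cache, knowledge_len = saved["kv_cache"], saved["knowledge_len"]

query = "What does the refund policy say about damaged items?"  # example query
query_ids = tokenizer(query, return_tensors="pt").input_ids.to(model.device)

# Greedy decoding on top of the preloaded knowledge.
generated = []
next_input = query_ids
with torch.no_grad():
    for _ in range(256):  # max new tokens
        out = model(input_ids=next_input, past_key_values=kv_cache, use_cache=True)
        kv_cache = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        if next_token.item() == tokenizer.eos_token_id:
            break
        generated.append(next_token.item())
        next_input = next_token

print(tokenizer.decode(generated, skip_special_tokens=True))
```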
c. Reset the Cache
The KV cache grows sequentially during inference, with new tokens appended to the previous ones. To maintain system performance across inference sessions, the KV cache can be reset by truncating these newly appended tokens.
This enables fast reinitialization since the complete cache doesn't need to be reloaded from disk.
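A sketch of the reset step under the same assumptions: the cached tensors are sliced back to the length of the preloaded knowledge. The attribute names key_cache and value_cache are those of transformers' DynamicCache and may differ across versions.

```python
def reset_cache(kv_cache, knowledge_len):
    """Truncate the KV cache back to the preloaded knowledge prefix,
    dropping the query and answer tokens appended during the last session."""
    for layer in range(len(kv_cache.key_cache)):
        # Cached tensors are shaped (batch, num_heads, seq_len, head_dim);
        # slicing the sequence dimension removes everything after the knowledge.
        kv_cache.key_cache[layer] = kv_cache.key_cache[layer][:, :, :knowledge_len, :]
        kv_cache.value_cache[layer] = kv_cache.value_cache[layer][:, :, :knowledge_len, :]
    return kv_cache

# After answering a query, reset instead of reloading the full cache from disk.
kv_cache = reset_cache(kv_cache, knowledge_len)
```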
When to Use CAG:
a. The knowledge base is mostly cold data that changes infrequently.
b. The entire knowledge base fits comfortably within the model's context window.
c. Latency matters, for example in real-time chatbots where the retrieval step's hit on TTFT is unacceptable.
d. You want a simple, low-maintenance architecture without a Vector Database, reranking, and retrieval infrastructure to operate.
Conclusion
CAG is a more efficient, accurate, and simple way of doing retrieval-free generation. We can also come up with a hybrid approach, combining CAG's preloading capabilities with RAG's selective retrieval. Combining the best of both worlds may offer significant benefits in knowledge-intensive workflows.
Here's the original paper and GitHub repo for Cache-Augmented Generation: https://arxiv.org/pdf/2412.15605v1 and https://github.com/hhhuang/CAG