Cache Augmented Generation

Introduction

There's been a lot of buzz lately around Cache-Augmented Generation. In a recent arXiv paper, researchers introduced this LLM response-generation technique called Cache-Augmented Generation (CAG). The idea sounds promising: in countless system design problems, caching is the standard tool for improving latency and scaling to millions of active users, so applying it to LLM serving is no surprise. Let's understand Cache-Augmented Generation, its pros and cons, where it makes sense to implement this technique, and why. Before digging into CAG and its use cases, it is also important to understand where RAG falls short.


RAG Works, So Why an Alternative?

Let's take a step back and understand where RAG falls behind and why it makes sense to look for an alternative. Let's also cover some important LLM serving metrics.

Some common LLM Serving Metrics

a. Time to First Token (TTFT) - how quickly the user starts seeing the LLM's response

b. Time per Output Token (TPOT) - the average time required to generate each output token for a user

c. Latency - the overall time it takes for the LLM to generate the full response

d. Throughput - the number of output tokens per second an inference server can generate

Latency = Time to First Token + Time per Output Token × Number of Tokens Generated
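To make the formula concrete, here is a tiny worked example; the TTFT, TPOT, and token-count values below are made up purely for illustration, not measurements:

```python
# Illustrative numbers only, not measurements.
time_to_first_token = 0.4      # seconds until the first token appears (TTFT)
time_per_output_token = 0.02   # seconds per subsequent token (TPOT)
tokens_generated = 300         # length of the generated response

latency = time_to_first_token + time_per_output_token * tokens_generated
print(f"End-to-end latency: {latency:.2f} s")              # 0.4 + 0.02 * 300 = 6.40 s

throughput = tokens_generated / latency
print(f"Effective throughput: {throughput:.1f} tokens/s")  # ~46.9 tokens/s
```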

One of the major purposes Cache-Augmented Generation serves is improving inference latency and throughput and reducing the retrieval error rate, while keeping the overall architecture simple and the system design low-maintenance.

But is this enough to replace RAG with CAG? Not quite. There are several other factors to consider when choosing between RAG and CAG.

Factors Influencing the choice

Retrieval Latency - Vector search introduces latency, especially when multiple users are querying in parallel. Scaling also becomes a challenge with many concurrent queries hitting the vector database.

Token Management - Retrieved chunks from Vector Database often bloat the context window.

Scalability Challenges - As the number of users and search queries grows, scaling becomes a challenge.

Document Selection - Irrelevant document retrieval can lead to suboptimal answers, reducing the system’s reliability. RAG often requires additional steps such as re-ranking.

System Complexity - Integrating retrieval and generation requires careful tuning, additional infrastructure, and ongoing maintenance, complicating workflows and increasing overhead.

Hot vs Cold Data

Hot data changes frequently, while cold data stays constant over time. In scenarios where the LLM serves mostly cold data, that data can be served from a cache instead of being retrieved from a vector database on every request. RAG can become an expensive technique over time when it is mostly serving cold data.

Modern LLM Context Window

Modern LLMs have enormous context lengths that keep increasing over time, and tokens keep getting cheaper. This is a perfect scenario to look at retrieval-free generation, since the preloaded knowledge can now be passed as input from a cache.

Inference Latency and Real Time Use Cases

The retrieval step adds extra latency and severely affects TTFT (Time to First Token), significantly degrading the user experience of a real-time chatbot.


Cache Augmented Generation - What and Why?

Cache-Augmented Generation (CAG) is a Retrieval Free approach that eliminates the need for real-time retrieval by using preloaded knowledge and precomputed inference states. Instead of retrieving knowledge during inference, CAG integrates all relevant knowledge into a large language model’s (LLM) extended context beforehand. It utilizes a precomputed key-value (KV) cache to store and reuse the model’s inference states.


At its core, Cache-Augmented Generation can be broken down into three steps:

a. Precompute External Knowledge as a Cache

All of the relevant external knowledge is first preprocessed and precomputed into a key-value (KV) cache. This KV cache is stored on disk or in memory for future use.

This keeps computational cost low, because preprocessing the knowledge happens just once, regardless of the number of user queries.

(Figure: KV Cache)

Precomputing the KV cache gives the LLM a more holistic and coherent understanding of the documents, which results in improved response quality.
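Below is a minimal sketch of what this precomputation step can look like with Hugging Face transformers and its DynamicCache. The model name, the knowledge_base.txt file, and the prompt template are illustrative assumptions, not the exact setup from the paper or its repository.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder: any long-context chat model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Concatenate all the "cold" knowledge into one long prompt prefix.
knowledge = open("knowledge_base.txt").read()      # hypothetical knowledge file
prefix = f"Answer questions using only the documents below.\n\n{knowledge}\n\n"
prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids.to(model.device)

# A single forward pass over the prefix fills the KV cache. This happens once,
# offline, regardless of how many user queries arrive later.
with torch.no_grad():
    outputs = model(input_ids=prefix_ids, past_key_values=DynamicCache(), use_cache=True)
kv_cache = outputs.past_key_values

prefix_len = prefix_ids.shape[1]      # remembered so the cache can be reset later (step c)
torch.save(kv_cache, "kv_cache.pt")   # persist the precomputed states for reuse at inference
```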


b. Use this Cache along with the Query at Inference Time to Improve Latency

During inference, the precomputed KV cache is loaded alongside the user’s query, and the LLM uses both to generate the response.

(Figure: Response Generation)

Both retrieval latency and retrieval errors are reduced, since the LLM already holds the preloaded knowledge and the query together within its context.
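Continuing the sketch above for the inference step, a simple greedy decoding loop can reuse the precomputed cache. It assumes the model, tokenizer, kv_cache, and prefix_len variables from the previous snippet, and the query text is hypothetical.

```python
query = "What is the refund policy for annual plans?"   # hypothetical user query
query_ids = tokenizer(
    "\nQuestion: " + query + "\nAnswer:",
    return_tensors="pt",
    add_special_tokens=False,   # avoid injecting a second BOS token mid-sequence
).input_ids.to(model.device)

generated = []
next_input = query_ids
with torch.no_grad():
    for _ in range(256):                                   # cap on new tokens
        out = model(input_ids=next_input, past_key_values=kv_cache, use_cache=True)
        kv_cache = out.past_key_values                     # cache grows with each new token
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy decoding
        if next_id.item() == tokenizer.eos_token_id:
            break
        generated.append(next_id.item())
        next_input = next_id                               # feed back only the newest token

print(tokenizer.decode(generated, skip_special_tokens=True))
```

Because the knowledge tokens are never re-encoded per query, the prefill work at inference time is limited to the short query itself, which is where the TTFT savings should come from.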


c. Reset the Cache

The KV cache grows sequentially during inference, with new tokens appended to the previous ones. To maintain system performance across inference sessions, the KV cache can be reset by truncating these newly appended tokens.

(Figure: Truncate Tokens and Cache Reset)

This enables fast reinitialization since the complete cache doesn't need to be reloaded from disk.
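Here is a minimal sketch of the reset step, assuming a transformers version where DynamicCache exposes crop() and get_seq_length(); if it does not, the per-layer key and value tensors can be sliced back to prefix_len manually.

```python
def reset_cache(kv_cache, prefix_len):
    # Drop every key/value entry appended after the preloaded knowledge,
    # instead of reloading the full cache from disk.
    kv_cache.crop(prefix_len)
    return kv_cache

kv_cache = reset_cache(kv_cache, prefix_len)
print(kv_cache.get_seq_length())   # back to the length of the preloaded prefix
```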


(Figure: Cache Augmented Generation, image from the original paper)


When to Use CAG:

  • LLM chatbots and applications where the knowledge base is limited and constrained, and can fit into the LLM's context window.
  • Need for fast, accurate, and contextually rich responses.
  • System Design with Low Maintenance Overhead, Easy to Deploy and Manage.


Conclusion

CAG is a more efficient, accurate, and simple way of doing retrieval-free generation. We can also come up with a hybrid approach, combining CAG’s preloading capabilities with RAG's selective retrieval. Combining the best of both worlds may offer significant benefits in knowledge-intensive workflows.











References: the original paper (https://arxiv.org/pdf/2412.15605v1) and the GitHub repo (https://github.com/hhhuang/CAG) for Cache-Augmented Generation.
