Cache Augmented Generation

Introduction

There's been a lot of buzz lately around Cache-Augmented Generation. In a recent arXiv paper, researchers introduced this LLM response-generation technique called Cache-Augmented Generation (CAG). The idea sounds promising: in countless system design problems, caching is the standard tool for improving latency and scaling to millions of active users, so applying it to LLM serving is no surprise. Let's understand Cache-Augmented Generation, its pros and cons, where it makes sense to implement this technique, and why. Before digging into CAG and its use cases, it is also important to understand where RAG falls short.


RAG Works, So Why an Alternative?

Let's take a step back and understand where RAG falls behind and why it makes sense to look for an alternative. Let's also cover some important LLM serving metrics.

Some common LLM Serving Metrics

a. Time to First Token (TTFT) - how quickly the user starts seeing the LLM's response

b. Time per Output Token (TPOT) - the average time required to generate each output token for a user

c. Latency - the overall time it takes for the LLM to generate the full response

d. Throughput - the number of output tokens per second an inference server can generate

Latency = Time to First Token + Time per Output Token × Number of Tokens Generated
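To make the formula concrete, here is a tiny worked example; the TTFT, TPOT, and token-count values below are made up purely for illustration, not measurements:

```python
# Illustrative numbers only, not measurements.
time_to_first_token = 0.4      # seconds until the first token appears (TTFT)
time_per_output_token = 0.02   # seconds per subsequent token (TPOT)
tokens_generated = 300         # length of the generated response

latency = time_to_first_token + time_per_output_token * tokens_generated
print(f"End-to-end latency: {latency:.2f} s")              # 0.4 + 0.02 * 300 = 6.40 s

throughput = tokens_generated / latency
print(f"Effective throughput: {throughput:.1f} tokens/s")  # ~46.9 tokens/s
```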

One of the major purposes Cache-Augmented Generation serves is improving inference latency and throughput and reducing the retrieval error rate, while keeping the overall architecture simple and the system design low-maintenance.

But is this enough to replace RAG with CAG? Not quite. There are several other factors to consider when choosing between RAG and CAG.

Factors Influencing the choice

Retrieval Latency - Vector search introduces latency, especially when multiple users are querying in parallel. Scaling also becomes a challenge with many concurrent queries hitting the vector database.

Token Management - Retrieved chunks from Vector Database often bloat the context window.

Scalability Challenges - As the number of users and search queries grows, scaling becomes a challenge.

Document Selection - Irrelevant document retrieval can lead to suboptimal answers, reducing the system’s reliability. RAG often requires additional steps such as re-ranking.

System Complexity - Integrating retrieval and generation requires careful tuning, additional infrastructure, and ongoing maintenance, complicating workflows and increasing overhead.

Hot vs Cold Data

Hot data changes frequently, while cold data stays constant over time. In scenarios where the LLM serves mostly cold data, that data can be served from a cache instead of being retrieved from a vector database on every request. RAG can become an expensive technique over time when it is mostly serving cold data.

Modern LLM Context Window

Modern LLMs have enormous context lengths that keep increasing over time, and tokens keep getting cheaper. This is a perfect scenario to look at retrieval-free generation, since the preloaded knowledge can now be passed as input from a cache.

Inference Latency and Real Time Use Cases

The retrieval step adds extra latency and severely affects TTFT (Time to First Token), significantly degrading the user experience of a real-time chatbot.


Cache Augmented Generation - What and Why?

Cache-Augmented Generation (CAG) is a Retrieval Free approach that eliminates the need for real-time retrieval by using preloaded knowledge and precomputed inference states. Instead of retrieving knowledge during inference, CAG integrates all relevant knowledge into a large language model’s (LLM) extended context beforehand. It utilizes a precomputed key-value (KV) cache to store and reuse the model’s inference states.


At its core, Cache-Augmented Generation can be broken down into three steps:

a. Precompute External Knowledge as a Cache

All of the relevant external knowledge is first preprocessed and precomputed into a key-value (KV) cache. This KV cache is stored on disk or in memory for future use.

This keeps computational cost low, because preprocessing the knowledge happens just once, regardless of the number of user queries.

(Figure: KV Cache)

Precomputing the KV cache gives the LLM a more holistic and coherent understanding of the documents, which results in improved response quality.
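Below is a minimal sketch of what this precomputation step can look like with Hugging Face transformers and its DynamicCache. The model name, the knowledge_base.txt file, and the prompt template are illustrative assumptions, not the exact setup from the paper or its repository.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder: any long-context chat model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Concatenate all the "cold" knowledge into one long prompt prefix.
knowledge = open("knowledge_base.txt").read()      # hypothetical knowledge file
prefix = f"Answer questions using only the documents below.\n\n{knowledge}\n\n"
prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids.to(model.device)

# A single forward pass over the prefix fills the KV cache. This happens once,
# offline, regardless of how many user queries arrive later.
with torch.no_grad():
    outputs = model(input_ids=prefix_ids, past_key_values=DynamicCache(), use_cache=True)
kv_cache = outputs.past_key_values

prefix_len = prefix_ids.shape[1]      # remembered so the cache can be reset later (step c)
torch.save(kv_cache, "kv_cache.pt")   # persist the precomputed states for reuse at inference
```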


b. Use this Cache along with the Query at Inference Time to Improve Latency

During inference, the precomputed KV cache is loaded alongside the user’s query, and the LLM uses both to generate the response.

(Figure: Response Generation)

Both retrieval latency and retrieval errors are reduced, since the LLM already holds the preloaded knowledge and the query together within its context.
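Continuing the sketch above for the inference step, a simple greedy decoding loop can reuse the precomputed cache. It assumes the model, tokenizer, kv_cache, and prefix_len variables from the previous snippet, and the query text is hypothetical.

```python
query = "What is the refund policy for annual plans?"   # hypothetical user query
query_ids = tokenizer(
    "\nQuestion: " + query + "\nAnswer:",
    return_tensors="pt",
    add_special_tokens=False,   # avoid injecting a second BOS token mid-sequence
).input_ids.to(model.device)

generated = []
next_input = query_ids
with torch.no_grad():
    for _ in range(256):                                   # cap on new tokens
        out = model(input_ids=next_input, past_key_values=kv_cache, use_cache=True)
        kv_cache = out.past_key_values                     # cache grows with each new token
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy decoding
        if next_id.item() == tokenizer.eos_token_id:
            break
        generated.append(next_id.item())
        next_input = next_id                               # feed back only the newest token

print(tokenizer.decode(generated, skip_special_tokens=True))
```

Because the knowledge tokens are never re-encoded per query, the prefill work at inference time is limited to the short query itself, which is where the TTFT savings should come from.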


c. Reset the Cache

The KV cache grows sequentially during inference, with new tokens appended to the previous ones. To maintain system performance across inference sessions, the KV cache can be reset by truncating these newly appended tokens.

(Figure: Truncate Tokens and Cache Reset)

This enables fast reinitialization since the complete cache doesn't need to be reloaded from disk.
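Here is a minimal sketch of the reset step, assuming a transformers version where DynamicCache exposes crop() and get_seq_length(); if it does not, the per-layer key and value tensors can be sliced back to prefix_len manually.

```python
def reset_cache(kv_cache, prefix_len):
    # Drop every key/value entry appended after the preloaded knowledge,
    # instead of reloading the full cache from disk.
    kv_cache.crop(prefix_len)
    return kv_cache

kv_cache = reset_cache(kv_cache, prefix_len)
print(kv_cache.get_seq_length())   # back to the length of the preloaded prefix
```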


(Figure: Cache Augmented Generation, image from the original paper)


When to Use CAG:

  • LLM chatbots and applications where the knowledge base is limited and constrained, and can fit into the LLM's context window.
  • Need for fast, accurate, and contextually rich responses.
  • System Design with Low Maintenance Overhead, Easy to Deploy and Manage.


Conclusion

CAG is a more efficient, accurate, and simple way of doing retrieval-free generation. We can also come up with a hybrid approach, combining CAG’s preloading capabilities with RAG's selective retrieval. Combining the best of both worlds may offer significant benefits in knowledge-intensive workflows.











References: the original paper (https://arxiv.org/pdf/2412.15605v1) and the GitHub repo (https://github.com/hhhuang/CAG) for Cache-Augmented Generation.
