How Does Retrieval-Based Speculative Decoding Improve RAG Performance?
A couple of weeks ago, a team from Princeton published a paper on Retrieval-Based Speculative Decoding (REST, don’t ask me how they came up with the acronym; paper here: https://arxiv.org/abs/2311.08252). In the spirit of sharing what we’ve learned about optimizing RAG for the enterprise, here are some notes on how we thought about applying ideas from the paper to our own pipeline.
What is speculative decoding?
In short, speculative decoding is a way to speed up inference by running a small language model (LM) alongside a large model. The small model drafts the next several tokens in sequence, and the large model only needs to verify that the drafted tokens match what it would have generated. Because the large model can check an entire draft in a single forward pass, accepted predictions let it emit several tokens per pass instead of one, which significantly speeds up decoding.
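To make the accept/reject mechanics concrete, here is a minimal Python sketch of greedy speculative decoding. The `draft_model.greedy_generate` and `target_model.next_token_predictions` interfaces are hypothetical stand-ins for whatever serving stack you use; the point is only that the large model scores the whole draft in one forward pass and keeps the longest matching prefix.

```python
# Minimal sketch of greedy speculative decoding.
# `draft_model` and `target_model` interfaces are hypothetical placeholders.

def speculative_decode(target_model, draft_model, prompt_ids, num_draft=4, max_new=128):
    """Generate tokens with a small draft model, verified in batches by the target model."""
    tokens = list(prompt_ids)
    while len(tokens) - len(prompt_ids) < max_new:
        # 1. The small model cheaply proposes the next few tokens, one at a time.
        draft = draft_model.greedy_generate(tokens, n=num_draft)

        # 2. The large model scores prompt + draft in ONE forward pass;
        #    preds[j] is its greedy prediction for the token following position j.
        preds = target_model.next_token_predictions(tokens + draft)

        # 3. Accept drafted tokens until the first mismatch...
        accepted = []
        for i, d in enumerate(draft):
            if d == preds[len(tokens) + i - 1]:
                accepted.append(d)
            else:
                break

        # ...then take the target model's own token at the first unmatched position,
        # so every iteration makes progress even if no draft token is accepted.
        accepted.append(preds[len(tokens) + len(accepted) - 1])
        tokens.extend(accepted)
    return tokens
```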
How does REST improve speculative decoding?
The proposed structure with REST is to use a retrieval datastore instead of a small language model to draft tokens alongside the large model. Draft tokens are selected by looking up the current context in the datastore and walking a prefix trie built over the retrieved continuations. The intuition is that, during retrieval-augmented generation, many of the generated tokens will exactly match snippets already stored in the datastore. There appears to be a moderate speedup versus standard speculative decoding for 7B-parameter models and a smaller speedup for 13B-parameter models. The main advantage is that REST’s token drafting runs primarily on CPU (a prefix lookup in a trie), rather than a small draft LM competing with the large model for GPU cycles.
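Here is a simplified, in-memory sketch of that CPU-side drafting step. The suffix-keyed index, window sizes, and helper names are illustrative assumptions rather than the paper’s actual datastore machinery, which is more involved; the shape of the idea is the same: look up the current context, then draft the most common continuation token by token.

```python
from collections import Counter, defaultdict

# Simplified sketch of REST-style CPU-side drafting: look up the current
# context suffix in a datastore of token sequences and draft the most common
# continuation. Index structure and parameters here are illustrative only.

def build_suffix_index(corpus_token_seqs, suffix_len=3, cont_len=8):
    """Map each length-`suffix_len` n-gram to the continuations that follow it."""
    index = defaultdict(list)
    for seq in corpus_token_seqs:
        for i in range(len(seq) - suffix_len):
            key = tuple(seq[i:i + suffix_len])
            index[key].append(tuple(seq[i + suffix_len:i + suffix_len + cont_len]))
    return index

def draft_from_datastore(index, context_tokens, suffix_len=3, num_draft=8):
    """Draft tokens by walking the most frequent continuation, position by position."""
    key = tuple(context_tokens[-suffix_len:])
    continuations = index.get(key, [])
    draft = []
    for pos in range(num_draft):
        counts = Counter(c[pos] for c in continuations if len(c) > pos)
        if not counts:
            break
        token, _ = counts.most_common(1)[0]
        draft.append(token)
        # Keep only continuations consistent with the tokens drafted so far.
        continuations = [c for c in continuations if len(c) > pos and c[pos] == token]
    return draft  # verified by the LLM exactly as in standard speculative decoding
```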
Why does it matter?
Faster inference. In asynchronous environments, this might not matter so much (e.g., when you hand off an RFP for a knowledge assistant to fill out), but in synchronous live environments (e.g., when having a conversation with the agent), every millisecond matters.
Quilt has an “Answer Bank”: a set of pre-approved question-and-answer pairs that a customer has uploaded from previously completed questionnaires or created by correcting incorrect answers. In some cases, it is very important to the customer that certain questions be answered verbatim. The embeddings for these question-and-answer pairs are stored alongside document chunks. There is an easy case (the “default” case for pre-LLM RFP products) where the verbatim answer is emitted on an exact question match, or perhaps when that snippet ranks highest; here we can skip the inference step in answer construction altogether. In the more complex case, where we can affect the completion stream of the model, verbatim answer “runs” might be several sentences, or a couple dozen tokens, leading to what is likely a much more significant speedup than what is shown in the paper.
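A hedged sketch of that easy case is below. The `answer_bank` structure, the `embed`/`generate` callables, and the 0.95 threshold are illustrative placeholders, not Quilt’s actual implementation.

```python
import numpy as np

# Illustrative sketch of the Answer Bank "easy case": if the incoming question
# matches a pre-approved entry closely enough, emit the approved answer verbatim
# and skip inference entirely. All names and thresholds are placeholders.

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_question(question, answer_bank, embed, generate, threshold=0.95):
    q_vec = embed(question)
    best = max(answer_bank, key=lambda e: cosine(q_vec, e["question_vec"]))

    if cosine(q_vec, best["question_vec"]) >= threshold:
        # Near-exact question match: return the approved answer verbatim,
        # skipping the LLM altogether.
        return best["answer"]

    # Otherwise answer with the LLM, passing the retrieved entry as context;
    # its text can also seed speculative drafts during generation (see below).
    return generate(question, context=best["answer"])
```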
Here’s a concrete example. A user receives a questionnaire that asks:
Question: Do you store user data in the United States, and in what regions?
Because this is a common question, the Answer Bank contains a very similar question with a nearly perfect answer.
Question: In what regions do you store user data?
Answer: In AWS us-east-1 and us-west-2
Quilt searches the Answer Bank, finds this item, and passes it to the LLM along with the original question. The answer we found doesn’t exactly match what was asked, so we can’t simply respond without using the LLM. Fortunately, the answer is so similar that we can generate just a few tokens, then guess the rest of the LLM output!
Question: Do you store user data in the United States, and in what regions?
Answer: Yes, <Answer bank answer pasted here>
Now instead of computing every token in the answer one at a time, we can compute all of the tokens in parallel, reducing latency dramatically. This is only possible because we were able to correctly guess the LLM output. There are also obvious immediate applications to multipart questions.
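Here is a rough sketch of that drafting step, reusing the hypothetical `target_model.next_token_predictions` interface from the speculative decoding sketch above. The lead-in string, prompt format, and tokenizer interface are assumptions for illustration.

```python
# Sketch of drafting the verbatim Answer Bank text and verifying it in one pass.
# `tokenizer` and `target_model` are hypothetical interfaces, as above.

def draft_verbatim_answer(question, bank_answer, tokenizer, target_model, lead_in="Yes, "):
    prompt = tokenizer.encode(f"Question: {question}\nAnswer: ")
    draft = tokenizer.encode(lead_in + bank_answer)

    # One forward pass over prompt + draft yields the target model's prediction
    # at every drafted position simultaneously, instead of one pass per token.
    preds = target_model.next_token_predictions(prompt + draft)

    accepted = []
    for i, token in enumerate(draft):
        if token == preds[len(prompt) + i - 1]:
            accepted.append(token)
        else:
            break  # fall back to normal decoding from the first mismatch

    return tokenizer.decode(accepted)
```

If the guess is right, the entire verbatim run is accepted in a single pass; if it diverges, we only pay for the tokens up to the first mismatch and continue decoding normally from there.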
More generally, optimizations around token prediction mirror things we are experimenting with in other environments (not LLM token prediction, but something else). The intuition that tokens generated during RAG-based inference will often exactly match text already stored in the RAG datastore is a nice one, with broader applications.