How Does Retrieval-Based Speculative Decoding Improve RAG Performance?
A couple of weeks ago, a team from Princeton published a paper on Retrieval-Based Speculative Decoding (REST, don’t ask me how they came up with the acronym; paper here: https://arxiv.org/abs/2311.08252). In the spirit of sharing what we’ve learned about optimizing RAG for the enterprise, here are some notes on how we thought about applying ideas from the paper to our own pipeline.
What is speculative decoding?
In short, speculative decoding is a way to speed up inference by running a small language model (LM) alongside a large model. The small model drafts the next several tokens in sequence, and the large model only needs to verify that the drafted tokens match what it would have generated. Because the large model can check an entire draft in a single forward pass, accepted predictions let it emit several tokens per pass instead of one, which significantly speeds up decoding.
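To make the accept/reject mechanics concrete, here is a minimal Python sketch of greedy speculative decoding. The `draft_model.greedy_generate` and `target_model.next_token_predictions` interfaces are hypothetical stand-ins for whatever serving stack you use; the point is only that the large model scores the whole draft in one forward pass and keeps the longest matching prefix.

```python
# Minimal sketch of greedy speculative decoding.
# `draft_model` and `target_model` interfaces are hypothetical placeholders.

def speculative_decode(target_model, draft_model, prompt_ids, num_draft=4, max_new=128):
    """Generate tokens with a small draft model, verified in batches by the target model."""
    tokens = list(prompt_ids)
    while len(tokens) - len(prompt_ids) < max_new:
        # 1. The small model cheaply proposes the next few tokens, one at a time.
        draft = draft_model.greedy_generate(tokens, n=num_draft)

        # 2. The large model scores prompt + draft in ONE forward pass;
        #    preds[j] is its greedy prediction for the token following position j.
        preds = target_model.next_token_predictions(tokens + draft)

        # 3. Accept drafted tokens until the first mismatch...
        accepted = []
        for i, d in enumerate(draft):
            if d == preds[len(tokens) + i - 1]:
                accepted.append(d)
            else:
                break

        # ...then take the target model's own token at the first unmatched position,
        # so every iteration makes progress even if no draft token is accepted.
        accepted.append(preds[len(tokens) + len(accepted) - 1])
        tokens.extend(accepted)
    return tokens
```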
How does REST improve speculative decoding?
The proposed structure with REST is to use a retrieval datastore instead of a small language model to draft tokens alongside the large model. Draft tokens are selected by looking up the current context in the datastore and walking a prefix trie built over the retrieved continuations. The intuition is that, during retrieval-augmented generation, many of the generated tokens will exactly match snippets already stored in the datastore. There appears to be a moderate speedup versus standard speculative decoding for 7B-parameter models and a smaller speedup for 13B-parameter models. The main advantage is that REST’s token drafting runs primarily on CPU (a prefix lookup in a trie), rather than a small draft LM competing with the large model for GPU cycles.
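Here is a simplified, in-memory sketch of that CPU-side drafting step. The suffix-keyed index, window sizes, and helper names are illustrative assumptions rather than the paper’s actual datastore machinery, which is more involved; the shape of the idea is the same: look up the current context, then draft the most common continuation token by token.

```python
from collections import Counter, defaultdict

# Simplified sketch of REST-style CPU-side drafting: look up the current
# context suffix in a datastore of token sequences and draft the most common
# continuation. Index structure and parameters here are illustrative only.

def build_suffix_index(corpus_token_seqs, suffix_len=3, cont_len=8):
    """Map each length-`suffix_len` n-gram to the continuations that follow it."""
    index = defaultdict(list)
    for seq in corpus_token_seqs:
        for i in range(len(seq) - suffix_len):
            key = tuple(seq[i:i + suffix_len])
            index[key].append(tuple(seq[i + suffix_len:i + suffix_len + cont_len]))
    return index

def draft_from_datastore(index, context_tokens, suffix_len=3, num_draft=8):
    """Draft tokens by walking the most frequent continuation, position by position."""
    key = tuple(context_tokens[-suffix_len:])
    continuations = index.get(key, [])
    draft = []
    for pos in range(num_draft):
        counts = Counter(c[pos] for c in continuations if len(c) > pos)
        if not counts:
            break
        token, _ = counts.most_common(1)[0]
        draft.append(token)
        # Keep only continuations consistent with the tokens drafted so far.
        continuations = [c for c in continuations if len(c) > pos and c[pos] == token]
    return draft  # verified by the LLM exactly as in standard speculative decoding
```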
Why does it matter?
Faster inference. In asynchronous environments, this might not matter so much (e.g., when you hand off an RFP for a knowledge assistant to fill out), but in synchronous live environments (e.g., when having a conversation with the agent), every millisecond matters.
Quilt has an “Answer Bank”: a set of pre-approved question-and-answer pairs that a customer has uploaded from previously completed questionnaires or created by correcting incorrect answers. In some cases, it is very important to the customer that certain questions be answered verbatim. The embeddings for these question-and-answer pairs are stored alongside document chunks. There is an easy case (the “default” case for pre-LLM RFP products) where the verbatim answer is emitted on an exact question match, or perhaps when that snippet ranks highest; here we can skip the inference step in answer construction altogether. In the more complex case, where we can affect the completion stream of the model, verbatim answer “runs” might be several sentences, or a couple dozen tokens, leading to what is likely a much more significant speedup than what is shown in the paper.
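A hedged sketch of that easy case is below. The `answer_bank` structure, the `embed`/`generate` callables, and the 0.95 threshold are illustrative placeholders, not Quilt’s actual implementation.

```python
import numpy as np

# Illustrative sketch of the Answer Bank "easy case": if the incoming question
# matches a pre-approved entry closely enough, emit the approved answer verbatim
# and skip inference entirely. All names and thresholds are placeholders.

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_question(question, answer_bank, embed, generate, threshold=0.95):
    q_vec = embed(question)
    best = max(answer_bank, key=lambda e: cosine(q_vec, e["question_vec"]))

    if cosine(q_vec, best["question_vec"]) >= threshold:
        # Near-exact question match: return the approved answer verbatim,
        # skipping the LLM altogether.
        return best["answer"]

    # Otherwise answer with the LLM, passing the retrieved entry as context;
    # its text can also seed speculative drafts during generation (see below).
    return generate(question, context=best["answer"])
```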
Here’s a concrete example. A user receives a questionnaire that asks:
Question: Do you store user data in the United States, and in what regions?
Because this is a common question, the Answer Bank contains a very similar question with a nearly perfect answer.
Question: In what regions do you store user data?
Answer: In AWS us-east-1 and us-west-2
Quilt searches the Answer Bank, finds this item, and passes it to the LLM along with the original question. The answer we found doesn’t exactly match what was asked, so we can’t simply respond without using the LLM. Fortunately, the answer is so similar that we can generate just a few tokens, then guess the rest of the LLM output!
Question: Do you store user data in the United States, and in what regions?
Answer: Yes, <Answer bank answer pasted here>
Now instead of computing every token in the answer one at a time, we can compute all of the tokens in parallel, reducing latency dramatically. This is only possible because we were able to correctly guess the LLM output. There are also obvious immediate applications to multipart questions.
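Here is a rough sketch of that drafting step, reusing the hypothetical `target_model.next_token_predictions` interface from the speculative decoding sketch above. The lead-in string, prompt format, and tokenizer interface are assumptions for illustration.

```python
# Sketch of drafting the verbatim Answer Bank text and verifying it in one pass.
# `tokenizer` and `target_model` are hypothetical interfaces, as above.

def draft_verbatim_answer(question, bank_answer, tokenizer, target_model, lead_in="Yes, "):
    prompt = tokenizer.encode(f"Question: {question}\nAnswer: ")
    draft = tokenizer.encode(lead_in + bank_answer)

    # One forward pass over prompt + draft yields the target model's prediction
    # at every drafted position simultaneously, instead of one pass per token.
    preds = target_model.next_token_predictions(prompt + draft)

    accepted = []
    for i, token in enumerate(draft):
        if token == preds[len(prompt) + i - 1]:
            accepted.append(token)
        else:
            break  # fall back to normal decoding from the first mismatch

    return tokenizer.decode(accepted)
```

If the guess is right, the entire verbatim run is accepted in a single pass; if it diverges, we only pay for the tokens up to the first mismatch and continue decoding normally from there.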
More generally, optimizations around token prediction mirror things we are experimenting with in other environments (not LLM token prediction, but something else). The intuition that tokens generated during RAG-based inference will often exactly match text already stored in the RAG datastore is a nice one, with broader applications.