SwiftKV: Accelerating Enterprise LLM Workloads with Knowledge Preserving Compute Reduction
Today, Snowflake AI Research published a new approach that significantly reduces computational costs for enterprise LLM workloads: SwiftKV. SwiftKV is open source and available on Hugging Face. In this newsletter, we’ll break down why SwiftKV is groundbreaking and what it means for you.
TL;DR: SwiftKV reduces compute costs on input processing (aka prefill computation).
What are tokens and input processing, and how do they impact compute cost?
Enterprise LLM workloads often involve significantly more input tokens (prompts) than output tokens (generations). Input tokens are the text provided to the model, such as instructions or context, while output tokens are the model's responses. A token is simply a unit of text, such as a word or part of a word, that the model processes. Since input processing (prefill computation) dominates computational costs, optimizing it is crucial for efficiency.
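To make the distinction concrete, here is a minimal sketch that counts prompt and response tokens with a Hugging Face tokenizer; the tokenizer and example text are illustrative only, not tied to SwiftKV:

```python
# Minimal sketch: counting input (prompt) vs. output (generation) tokens with a
# Hugging Face tokenizer. The tokenizer and text below are illustrative only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = "Summarize the following support ticket in one sentence: ..."  # input tokens (prefill)
response = "The customer reports a recurring billing error."            # output tokens (generation)

print("input tokens: ", len(tokenizer.encode(prompt)))
print("output tokens:", len(tokenizer.encode(response)))
```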
Many common enterprise LLM tasks typically use long prompts but generate only a small number of output tokens.
Snowflake’s AI Research team observed that many enterprise LLM use cases exhibit a 10:1 ratio between prompt tokens and generated tokens, meaning that for every 10 tokens in the input prompt, the model generates 1 token in response. As a result, a significant portion of LLM compute cost is often associated with processing the input prompt.
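As a rough back-of-the-envelope illustration (the request sizes below are hypothetical, chosen to match the 10:1 ratio):

```python
# Back-of-the-envelope illustration of why prefill dominates at a 10:1 ratio.
# The numbers are hypothetical; real costs depend on model, hardware, and batching.
prompt_tokens = 2000      # input tokens processed during prefill
generated_tokens = 200    # output tokens produced during decoding

prefill_share = prompt_tokens / (prompt_tokens + generated_tokens)
print(f"Share of tokens processed in prefill: {prefill_share:.0%}")  # ~91%
```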
How does SwiftKV work?
LLMs are composed of multiple layers of transformer modules. A key component of these layers is the key-value (KV) cache, which stores intermediate outputs (called keys and values) from each transformer layer. During input processing, the KV cache is computed for every token in the input prompt; it is then stored and reused during output token generation.

The KV cache at each transformer layer is produced by a lightweight matrix multiplication with a projection matrix. While this projection itself is cheap, its input depends on the output of the previous transformer layer, and computing that output is very costly. Since prompts are on average 10x larger than outputs, producing the KV cache for input prompts dominates the overall inference computation.

With SwiftKV we introduce SingleInputKV, a technique that reduces KV cache computation by leveraging a well-known observation: the outputs of transformer layers change less and less as we go deeper into the model. Based on this observation, SingleInputKV reuses the output of an earlier layer to generate the KV cache for multiple subsequent layers, using only the lightweight projections, as illustrated in the sketch below.
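The following is a minimal, simplified sketch of the idea in PyTorch. It is not the actual SwiftKV implementation: the toy block, layer counts, and shapes are assumptions chosen for illustration, and attention is omitted for brevity.

```python
# Simplified sketch of the SingleInputKV idea (illustrative, not the actual
# SwiftKV implementation): hidden states produced by an early layer are reused
# to compute the KV cache of all later layers via their cheap K/V projections.
import torch
import torch.nn as nn

DIM, NUM_LAYERS, SWIFTKV_START = 64, 8, 4   # toy sizes (assumptions)

class ToyBlock(nn.Module):
    """Stand-in for a transformer block: cheap K/V projections plus a costly body."""
    def __init__(self):
        super().__init__()
        self.k_proj = nn.Linear(DIM, DIM)
        self.v_proj = nn.Linear(DIM, DIM)
        self.body = nn.Sequential(nn.Linear(DIM, 4 * DIM), nn.GELU(), nn.Linear(4 * DIM, DIM))

    def forward(self, hidden):
        return hidden + self.body(hidden)   # attention omitted for brevity

def prefill_single_input_kv(hidden, layers):
    """Compute the prompt KV cache while skipping the costly body of later layers."""
    kv_cache = {}
    for i, layer in enumerate(layers):
        # The K/V projections are cheap, so they run for every layer.
        kv_cache[i] = (layer.k_proj(hidden), layer.v_proj(hidden))
        # The expensive part of prefill runs only for the early layers; every
        # later layer reuses the hidden states available at SWIFTKV_START.
        if i < SWIFTKV_START:
            hidden = layer(hidden)
    return kv_cache

layers = [ToyBlock() for _ in range(NUM_LAYERS)]
prompt_hidden = torch.randn(1, 16, DIM)      # [batch, seq, dim] prompt embeddings
cache = prefill_single_input_kv(prompt_hidden, layers)
print({i: k.shape for i, (k, v) in cache.items()})
```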
By skipping the compute-heavy operations in the later transformer layers, SingleInputKV reduces compute costs during the input prompt processing phase by 50%. Prompt processing thus becomes significantly faster and more cost-effective.
The architecture figure (Figure 2) visualizes the components and their relationships. For a more detailed explanation, please see our SwiftKV Blog.
What is the end-to-end improvement from SwiftKV?
By applying SingleInputKV to 50% of the transformer layers, we achieve nearly a 2x reduction in prefill computation. Given that input processing dominates inference workloads, this translates to up to a 2x improvement in end-to-end throughput.
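As a rough sanity check on that figure (the prefill share below is a hypothetical assumption, not a measured number):

```python
# Rough illustration (hypothetical numbers): if prefill is ~90% of inference
# compute and SingleInputKV halves it, overall compute drops to ~55% of the
# baseline, i.e. close to 2x more throughput for prefill-dominated workloads.
prefill_fraction = 0.90   # share of compute spent on input processing (assumption)
prefill_speedup = 2.0     # SingleInputKV roughly halves prefill compute

remaining = prefill_fraction / prefill_speedup + (1 - prefill_fraction)
print(f"end-to-end compute: {remaining:.0%} of baseline, ~{1 / remaining:.1f}x throughput")
```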
How do I use it?
We’re excited to make SwiftKV accessible to the community. Here’s how you can get started:
1. Model checkpoints: SwiftKV model checkpoints are available on Hugging Face.
2. Inference with vLLM: We’ve integrated SwiftKV optimizations into vLLM to enhance inference efficiency. To try these models with vLLM, please use our dedicated vLLM branch (currently being upstreamed), which includes getting-started instructions; a minimal usage sketch follows this list.
3. Train Your Own SwiftKV Models: Interested in customizing SwiftKV for your specific workloads? Check out the SwiftKV knowledge distillation recipe in our new post-training library, ArcticTraining (coming soon).
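To put items 1 and 2 together, here is a minimal sketch of running a SwiftKV checkpoint with vLLM’s standard offline-inference API. It assumes the SwiftKV-enabled vLLM branch is installed, and the repo ID shown is illustrative; check the Snowflake organization on Hugging Face for the currently published checkpoints.

```python
# Minimal sketch: run a SwiftKV checkpoint with vLLM's offline-inference API.
# Assumptions: the SwiftKV-enabled vLLM branch is installed, and the repo ID
# below is illustrative -- see the Snowflake org on Hugging Face for the
# currently published SwiftKV checkpoints.
from vllm import LLM, SamplingParams

llm = LLM(model="Snowflake/Llama-3.1-SwiftKV-8B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=128)

outputs = llm.generate(
    ["Summarize the key findings of the attached quarterly report: ..."],
    params,
)
print(outputs[0].outputs[0].text)
```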
For more information on SwiftKV—its memory optimization features, the techniques used to distill SwiftKV while preserving accuracy, its implementation in vLLM, and how it can be combined with other system optimizations—visit our engineering blog for an in-depth exploration.
For even more technical insights, check out the arXiv paper on SwiftKV!
Thought Leader for Data & Analytics | Author | Speaker | TDWI Expert
2 个月Check this, Dr. Tina Klaus.
PGT Computer Science/ICT Teacher, MCA, B.Sc. CS, PGCE , TESOL/TEFL Diploma,3 + years CBSE Experience
2 个月waiting to see the output.
Data Engineer | Crafting Scalable Data Architectures & Driving Data-Driven Decisions. SnowPro Core Certified
2 个月Reusing the computed output is great way to reduce a cost while using only important output tokens. Eagerly waiting to work on this to experience the reduced cost and the way by which the cost is reduced without affecting the processed output.