SwiftKV: Accelerating Enterprise LLM Workloads with Knowledge Preserving Compute Reduction

Today, Snowflake AI Research published a new approach that significantly reduces computational costs for enterprise LLM workloads: SwiftKV. SwiftKV is open source and available on Hugging Face. In this newsletter, we’ll break down why SwiftKV is groundbreaking and what it means for you.

TL;DR: SwiftKV reduces compute costs on input processing (aka prefill computation).

What are tokens and input processing, and how do they impact compute cost?

Enterprise LLM workloads often involve significantly more input tokens (prompts) than output tokens (generations). Input tokens are the text provided to the model, such as instructions or context, while output tokens are the model's responses. A token is simply a unit of text, such as a word or part of a word, that the model processes. Since input processing (prefill computation) dominates computational costs, optimizing it is crucial for efficiency.
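To make the idea of tokens concrete, here is a minimal sketch that counts prompt and response tokens with a Hugging Face tokenizer. The model name is only an arbitrary example for illustration; SwiftKV itself does not change tokenization.

```python
from transformers import AutoTokenizer

# Any tokenizer works for illustration; "gpt2" is used here purely as an example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = (
    "You are a SQL assistant. Given the table orders(id, customer_id, total, created_at), "
    "write a query that returns the ten largest orders from the last 30 days."
)
response = "SELECT * FROM orders ORDER BY total DESC LIMIT 10;"

input_tokens = tokenizer.encode(prompt)
output_tokens = tokenizer.encode(response)

# In enterprise workloads the input side is typically much longer than the output side.
print(f"input tokens:  {len(input_tokens)}")
print(f"output tokens: {len(output_tokens)}")
```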

The following common enterprise LLM tasks typically use long prompts but generate only a small number of output tokens:

  • Code completion: providing a model with a part of a code snippet or function (input) and asking it to generate the rest (output)
  • Text-to-SQL: asking a natural language question in English (user input) along with the semantic information of the relevant table (e.g., schema, column descriptions as system inputs). The model generates a corresponding SQL query as output.
  • Summarization: providing longer text, such as a document (input), and requesting a concise summary (output)
  • Retrieval-augmented generation (RAG): asking a question (user input), scanning various knowledge sources during retrieval (retrieval input), and generating a response based on the retrieved information (output)

Snowflake’s AI Research team observed that many enterprise LLM use cases exhibit a 10:1 ratio between prompt tokens and generated tokens, meaning that for every 10 tokens in the input prompt, the model generates 1 token in response. As a result, a significant portion of LLM compute cost is often spent processing the input prompt.

Figure 1: Input vs. output token length across various LLM inference workloads running on Snowflake Cortex, showing that inputs are over 10x longer than outputs.

How does SwiftKV work?

LLMs are composed of multiple transformer layers. A key component of these layers is the key-value (KV) cache, which stores intermediate outputs (called keys and values) from each layer. During input processing, the KV cache is computed for every token in the input prompt, then stored and reused during output token generation. At each transformer layer, the keys and values are produced by a lightweight matrix multiplication with a projection matrix. The projection itself is cheap, but its input is the output of the previous transformer layer, and computing those layer outputs is expensive. Since prompts are roughly 10x larger than outputs on average, producing the KV cache for input prompts dominates the overall inference computation.

With SwiftKV we introduce SingleInputKV, a technique that reduces KV cache computation by leveraging a well-known observation: the outputs of transformer layers change little as we go deeper into the model. Building on this observation, SingleInputKV reuses the output of an earlier layer to generate the KV cache for multiple subsequent layers, using just the lightweight projections and skipping the expensive layer computation for those tokens.
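To illustrate the idea (not the exact implementation, which lives in the vLLM integration and the paper), here is a minimal PyTorch sketch: the hidden state from an assumed cut-off layer is fed through each later layer's KV projections directly, so the deeper layers' attention and MLP blocks never run over the prompt tokens. Names, shapes, and the cut-off choice are illustrative.

```python
import torch
import torch.nn as nn

# Conceptual sketch of SingleInputKV; all names and sizes are illustrative.
hidden_size, num_layers, cutoff = 512, 8, 4   # reuse layer 4's output for layers 5..8

# Each layer's K/V projections are cheap linear maps (they already exist in the model).
k_proj = nn.ModuleList([nn.Linear(hidden_size, hidden_size, bias=False) for _ in range(num_layers)])
v_proj = nn.ModuleList([nn.Linear(hidden_size, hidden_size, bias=False) for _ in range(num_layers)])

def prefill_kv(prompt_states):
    """prompt_states[i] is the hidden state at layer i for the prompt tokens,
    available only for the first `cutoff` layers (the only ones actually run).

    Standard prefill would run all num_layers transformer layers over the prompt.
    With SingleInputKV, layers beyond `cutoff` are skipped for prompt tokens: their
    keys and values are projected from the earlier layer's output using only the
    cheap linear projections, avoiding the expensive attention/MLP blocks.
    """
    kv_cache = {}
    for layer in range(num_layers):
        x = prompt_states[min(layer, cutoff - 1)]   # deep layers reuse the earlier output
        kv_cache[layer] = (k_proj[layer](x), v_proj[layer](x))
    return kv_cache

# Toy usage: 100 prompt tokens, only the first `cutoff` layers were actually computed.
prompt_states = [torch.randn(100, hidden_size) for _ in range(cutoff)]
cache = prefill_kv(prompt_states)
print(len(cache), cache[num_layers - 1][0].shape)   # 8 torch.Size([100, 512])
```

In the actual model, the reused projections are fine-tuned with lightweight distillation (the knowledge-recovery step shown in Figure 2) so the deeper layers still receive useful keys and values.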

By skipping compute-heavy operations in the later transformer layers, SingleInputKV reduces compute costs by roughly 50% during the input prompt processing (prefill) phase, making prompt processing significantly faster and more cost-effective.

The architecture figure (Figure 2) visualizes the components and their relationships. For a more detailed explanation, please see our SwiftKV Blog.

Figure 2. SwiftKV re-wires transformer-based LLMs to reduce prefill computation using SingleInputKV, reduce KV cache size using AcrossKV, and recover accuracy with lightweight distillation. Specifically, in this example, SingleInputKV reuses the output of the 4th layer to generate the KV cache for layers 5 through 8, significantly reducing prefill computation.

What is the end-to-end improvement from SwiftKV?

By applying SingleInputKV to 50% of the transformer layers, we achieve nearly a 2x reduction in prefill computation. Given that input processing dominates inference workloads, this translates to up to a 2x improvement in end-to-end throughput.
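As a rough back-of-the-envelope check (the numbers below are assumptions for illustration, not measurements), here is the arithmetic behind that claim:

```python
# Back-of-the-envelope estimate; the 10:1 token ratio comes from the workload
# observation above, the 50% layer skip from SingleInputKV. Illustrative only.
prompt_tokens, output_tokens = 1000, 100          # ~10:1 input/output ratio
skipped_layer_fraction = 0.5                      # SingleInputKV applied to 50% of layers

# Treat per-token cost as proportional to the number of layers actually run.
baseline_cost = prompt_tokens + output_tokens                    # all layers for every token
swiftkv_prefill = prompt_tokens * (1 - skipped_layer_fraction)   # ~2x cheaper prefill
swiftkv_cost = swiftkv_prefill + output_tokens                   # decode is unchanged

print(f"compute reduction: {baseline_cost / swiftkv_cost:.2f}x")  # ~1.83x
```

In the limit where prompt processing dominates entirely, the reduction approaches the full 2x.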

Figure 3: Inference throughput improvement of SwiftKV
Despite this significant throughput gain, the accuracy tradeoff remains minimal, as demonstrated in Table 1.

How do I use it?

We’re excited to make SwiftKV accessible to the community. Here’s how you can get started:

1. Model Checkpoints: SwiftKV model checkpoints are available on Hugging Face.

2. Inference with vLLM: We’ve integrated SwiftKV optimizations into vLLM to enhance inference efficiency. To try these models with vLLM, please use our dedicated vLLM branch (currently being upstreamed), which includes getting-started instructions (see the short usage sketch after this list).

3. Train Your Own SwiftKV Models: Interested in customizing SwiftKV for your specific workloads? Check out the SwiftKV knowledge distillation recipe in our new post-training library, ArcticTraining (coming soon).
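For orientation, here is a minimal sketch of offline inference using vLLM's standard Python API. The model id below is only an example of how a SwiftKV checkpoint might be named, so check the Hugging Face collection for the exact id, and use the SwiftKV-enabled vLLM branch mentioned in item 2:

```python
from vllm import LLM, SamplingParams

# Example model id; confirm the exact SwiftKV checkpoint name on Hugging Face.
llm = LLM(model="Snowflake/Llama-3.1-SwiftKV-8B-Instruct")

prompts = [
    "Summarize the following ticket in one sentence: the customer reports that "
    "nightly ingestion jobs have failed twice this week with timeout errors...",
]
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

# SwiftKV's savings show up in the prefill stage of generate(); the API itself is unchanged.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```

As background on the distillation step mentioned in item 3 (the actual recipe will live in ArcticTraining), here is a minimal sketch of the kind of logit-distillation loss commonly used for knowledge recovery, with the unmodified model acting as teacher and the SwiftKV-rewired model as student; this is a generic recipe, not necessarily the exact SwiftKV objective:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Standard KL-divergence logit distillation (generic sketch, not the exact
    SwiftKV objective). Teacher: the unmodified model. Student: the SwiftKV-rewired
    model, typically with only the re-wired projections left trainable."""
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature ** 2

# Toy check with random logits: batch of 4 positions over a 32k-token vocabulary.
student = torch.randn(4, 32_000, requires_grad=True)
teacher = torch.randn(4, 32_000)
loss = distillation_loss(student, teacher)
loss.backward()
print(loss.item())
```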

For more information on SwiftKV—its memory optimization features, the techniques used to distill SwiftKV while preserving accuracy, its implementation in vLLM, and how it can be combined with other system optimizations—visit our engineering blog for an in-depth exploration.

For even more technical insights, check out the arXiv paper on SwiftKV!


Daniel Eiduzzis

Thought Leader for Data & Analytics | Author | Speaker | TDWI Expert

2 months ago

Check this, Dr. Tina Klaus.

jasna hussain

PGT Computer Science/ICT Teacher, MCA, B.Sc. CS, PGCE , TESOL/TEFL Diploma,3 + years CBSE Experience

2 months ago

waiting to see the output.

Madhur Singh

Data Engineer | Crafting Scalable Data Architectures & Driving Data-Driven Decisions. SnowPro Core Certified

2 months ago

Reusing the computed output is a great way to reduce cost while using only the important output tokens. Eagerly waiting to work with this and experience the reduced cost, achieved without affecting the processed output.
