AI Performance 2.7.2024

Hello Friends,

I am back with another edition of the AI performance newsletter. Today's edition highlights some exciting improvements in attention mechanisms - HippoAttention, RadixAttention, and FlashInfer (Cascade Inference). The pace of improvement in optimizing attention in LLMs is breathtaking.

Beyond that, this newsletter takes you back to some of the basics of LLM inference and KV cache management mechanisms.

As always, all the articles are accompanied by AI-generated summaries for easy consumption of the material.

Happy reading, and if you have any feedback please send it my way.

Attention Mechanism Improvements

8bit HippoAttention: Up to 3X Faster Compared to FlashAttentionV2 | by HippoML Blog

The text discusses HippoML's development of 8bit post-training quantization (PTQ) for generative AI models. HippoML has found that their comprehensive approach to 8bit PTQ allows models like Stable Diffusion to run entirely in 8bit without compromising quality.

The text specifically focuses on HippoML's FP8 Multi-Head Attention module, HippoAttention. Key points:

- Developing high-performance fused FP8 attention is challenging because FP8 instructions differ from their FP16 counterparts; even recent libraries such as NVIDIA cuDNN lack FP8 attention.

- Benchmark results show HippoAttention is 1.55-3X faster than the FP16 FlashAttentionV2 baseline in various configurations, for both causal and non-causal attention.

- HippoAttention supports functionality needed for state-of-the-art models, like Multi-Query Attention, variable sequence lengths, etc.

- HippoML is working on FP8 benchmarks for full models, not just attention. Preliminary AMD MI-300X results show potential for up to 1 PFLOPS on attention workloads.

In summary, the text highlights HippoML's progress in fast and accurate 8bit quantization through innovations like the HippoAttention module, demonstrating speedups over FP16 baselines and compatibility with advanced models. Their work could enable much faster 8bit inference across generative AI.
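To make the 8bit PTQ idea concrete, here is a minimal sketch of symmetric per-tensor quantization to signed 8-bit integers, the kind of transform a PTQ pipeline applies to weights and activations. It illustrates the general technique only and is not HippoML's implementation; HippoAttention's FP8 kernels and scale calibration are considerably more involved.

    import torch

    def quantize_per_tensor_int8(x: torch.Tensor):
        # Symmetric per-tensor quantization: map [-max|x|, +max|x|] onto [-127, 127].
        scale = x.abs().max().clamp(min=1e-8) / 127.0
        q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
        return q, scale

    def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        # Recover an approximation of the original higher-precision tensor.
        return q.to(torch.float32) * scale

    x = torch.randn(64, 64)
    q, scale = quantize_per_tensor_int8(x)
    print("max abs round-trip error:", (x - dequantize(q, scale)).abs().max().item())

The round-trip error printed at the end is what calibration in a real PTQ pipeline tries to keep small enough that end-to-end model quality is preserved.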

Fast and Expressive LLM Inference with RadixAttention and SGLang | LMSYS Org

The text introduces SGLang, a new system to improve the efficiency of complex programs utilizing large language models (LLMs). SGLang consists of optimizations in both the backend runtime and frontend language:

Backend Optimization - RadixAttention

- Automatically reuses key-value (KV) caches across LLM requests that share common prefixes, using a radix tree data structure for efficient prefix search and insertion. This enables caching and sharing of intermediate computations (a toy prefix-cache sketch follows this list).

- Implements least-recently-used (LRU) eviction policy in the radix tree along with cache-aware scheduling to maximize cache hit rates.

- Accommodates diverse KV cache reuse patterns seen in LLM workloads that confound existing systems.
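As a rough illustration of the prefix-reuse idea, the toy cache below keys per-token KV entries in a trie so that a new request can reuse whatever prefix of its prompt has already been computed. The class and method names are invented for this sketch; RadixAttention's actual radix tree additionally compresses chains of nodes, tracks reference counts, and evicts leaves according to the LRU policy described above.

    class TrieNode:
        def __init__(self):
            self.children = {}   # token id -> TrieNode
            self.kv = None       # placeholder for this token's cached KV entry

    class ToyPrefixCache:
        # One node per token; a radix tree would merge unbranched chains of nodes.
        def __init__(self):
            self.root = TrieNode()

        def match_prefix(self, tokens):
            # Return cached KV entries for the longest cached prefix of `tokens`.
            node, cached = self.root, []
            for t in tokens:
                if t not in node.children:
                    break
                node = node.children[t]
                cached.append(node.kv)
            return cached

        def insert(self, tokens, kv_entries):
            # Store KV entries along the token path so later requests can reuse them.
            node = self.root
            for t, kv in zip(tokens, kv_entries):
                node = node.children.setdefault(t, TrieNode())
                node.kv = kv

    cache = ToyPrefixCache()
    cache.insert([1, 2, 3, 4], ["kv1", "kv2", "kv3", "kv4"])
    print(len(cache.match_prefix([1, 2, 3, 9])))  # 3 tokens of KV are reused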

Frontend Language - SGLang

- Domain-specific language embedded in Python for controlling LLM generation (an example appears after this section's summary).

- Supports advanced prompting techniques, control flow, multi-modality, and decoding constraints.

- Flexible execution via interpreter or compilation to a dataflow graph.

- New primitives for parallelism and batching compared to prior work.

Together these components improve throughput by up to 5x over baseline systems across a range of LLM benchmark tasks. The system is open-sourced.
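For a flavor of the frontend, a program in the style of the examples from the SGLang announcement might look like the sketch below. It assumes the sglang package is installed and an SGLang server is already running at http://localhost:30000; the exact primitives and endpoint shown here are taken on trust from the release materials and may differ between versions.

    import sglang as sgl

    @sgl.function
    def multi_turn(s, question_1, question_2):
        s += sgl.system("You are a helpful assistant.")
        s += sgl.user(question_1)
        s += sgl.assistant(sgl.gen("answer_1", max_tokens=128))
        s += sgl.user(question_2)
        s += sgl.assistant(sgl.gen("answer_2", max_tokens=128))

    # Assumes a local SGLang runtime is serving a model at this endpoint.
    sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
    state = multi_turn.run(question_1="What is KV caching?",
                           question_2="Why does prefix sharing help?")
    print(state["answer_2"])

Because both turns extend the same conversation state, the runtime can serve the second generation from the KV cache built while answering the first.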

FlashInfer - Kernel Library for LLM Serving

The text introduces FlashInfer, an open-source library for accelerating large language model (LLM) serving. FlashInfer provides optimized attention kernels that cover common LLM serving use cases, achieving state-of-the-art performance.

Key ideas:

1. The text analyzes the performance bottlenecks in LLM serving attention computation, identifying that decode attention is IO-bound while prefill/append attention can be compute-bound. Different optimization strategies are needed.

2. FlashInfer incorporates optimizations like FlashAttention and Flash-Decoding which fuse multi-head attention into a single kernel to avoid materializing the full attention matrix. Additional optimizations like split-KV and allowing mixed precision in attention phases are added.

3. FlashInfer accelerates grouped-query attention and fused Rotary Positional Embedding attention for compressed LLM serving. It also implements quantized attention kernels to support low-precision LLM serving.

4. For batch decoding, FlashInfer optimizes paged attention using techniques like prefetching page indices into shared memory, which minimizes the performance impact of small page sizes (a toy page-table sketch follows this summary).

5. FlashInfer aims to cover attention computation needs across diverse LLM serving systems and hardware platforms. It provides easy integration for existing systems via its C++ and PyTorch APIs.

The key focus is providing performant attention kernels to accelerate real-world LLM serving workloads, adapting to hardware constraints and emerging model compression techniques. Community involvement is welcomed to expand platform and model support.
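To ground the paged-KV idea from point 4, here is a minimal, hypothetical page-table sketch: each sequence's logical token positions map to fixed-size pages drawn from a shared pool, which is the layout a paged attention kernel walks at decode time. The names and structure are invented for illustration; FlashInfer's real page data structure and the shared-memory prefetching of page indices live inside its CUDA kernels.

    PAGE_SIZE = 16  # tokens per KV page; small pages reduce memory fragmentation

    class ToyPagedKVCache:
        def __init__(self, num_pages):
            self.free_pages = list(range(num_pages))   # shared physical page pool
            self.page_tables = {}                      # sequence id -> list of page ids

        def append_token(self, seq_id, token_index):
            # Allocate a fresh page whenever a sequence crosses a page boundary.
            table = self.page_tables.setdefault(seq_id, [])
            if token_index % PAGE_SIZE == 0:
                table.append(self.free_pages.pop())
            page_id = table[token_index // PAGE_SIZE]
            offset = token_index % PAGE_SIZE
            return page_id, offset  # where this token's K/V vectors would be stored

    cache = ToyPagedKVCache(num_pages=8)
    for i in range(20):
        slot = cache.append_token("req0", i)
    print(cache.page_tables["req0"], slot)  # req0 occupies two pages after 16 tokens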

Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding | FlashInfer

The text proposes a method called "Cascade Inference" to accelerate batch decoding of text generation tasks that involve a shared prefix (prompt) across requests.

Background:

- The GPU memory hierarchy means global memory access is much slower than on-chip shared memory.

- Multi-query kernels process multiple requests in one thread block to maximize reuse of shared KV cache.

- Single-query kernels process one request per thread block to ensure parallelism.

Issue:

- Multi-query kernels don't work when KV caches differ across requests.

- Single-query kernels are inefficient for shared prefixes.

Solution - Cascade Inference:

- Use multi-query kernel for attention between queries and shared prefix. Store shared KV cache in fast GPU shared memory.

- Use single-query kernel for attention between queries and unique suffixes.

- Merge the attention states from the two steps using a commutative/associative operator (sketched numerically below).

This decomposition allows maximizing reuse for the shared prefix while retaining parallelism.
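The merge step can be written down concretely: each partial attention pass returns its output vector together with the log-sum-exp of the attention logits it covered, and two such states combine via a softmax-weighted average. Because this operator is commutative and associative, the shared-prefix and unique-suffix passes can be merged in any order. Below is a minimal NumPy sketch of the math (an illustration, not FlashInfer's kernel).

    import numpy as np

    def merge_attention_states(o1, lse1, o2, lse2):
        # o_i: partial attention output; lse_i: log-sum-exp of the attention logits
        # covered by that pass. The result equals attention over the union of both KV sets.
        lse = np.logaddexp(lse1, lse2)
        w1, w2 = np.exp(lse1 - lse), np.exp(lse2 - lse)
        return w1 * o1 + w2 * o2, lse

    def partial_attention(scores, values):
        # Numerically stable softmax-weighted sum over one chunk of the KV cache.
        m = scores.max()
        w = np.exp(scores - m)
        return (w[:, None] * values).sum(0) / w.sum(), np.log(w.sum()) + m

    scores = np.array([0.3, 1.2, -0.5, 2.0])
    values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]])

    o_full, _ = partial_attention(scores, values)
    o_merged, _ = merge_attention_states(*partial_attention(scores[:2], values[:2]),
                                         *partial_attention(scores[2:], values[2:]))
    print(np.allclose(o_full, o_merged))  # True: merging two halves matches full attention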

Results:

- Up to 31x speedup over baseline on an NVIDIA H100 for long shared prompts and large batch size.

- Larger speedups when shared prefix dominates computation time.

The idea can be extended to multiple cascade levels and multiple shared prefixes.

In summary, Cascade Inference exploits the GPU memory hierarchy by handling computation for the shared prompt separately from the unique suffixes to maximize performance. The recursive merge operator enables seamlessly combining the attention states.

LLM Inference and the KV Cache

The KV Cache: Memory Usage in Transformers

LLM Inference Series: 2. The two-phase process behind LLMs’ responses | by Pierre Lienhart | Dec, 2023 | Medium

LLM Inference Series: 3. KV caching unveiled | by Pierre Lienhart | Dec, 2023 | Medium

LLM Inference Series: 4. KV caching, a deeper look | by Pierre Lienhart | Jan, 2024 | Medium

LLM Inference Series: 5. Dissecting model performance | by Pierre Lienhart | Feb, 2024 | Medium

The posts explain the two-phase process behind LLM text generation: the initiation phase, where the prompt is processed, and the decoding/generation phase, where tokens are generated one by one in an auto-regressive manner.

A key optimization called KV caching is introduced: the keys and values computed for past tokens in each attention layer are cached during decoding to avoid redundant computation. However, the growing KV cache consumes significant GPU memory. Various techniques to reduce the KV cache size are discussed, including attention architecture variants such as MQA and GQA, cache compression strategies, and efficient memory management with PagedAttention.
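A self-contained toy decode loop makes the mechanism visible: the keys and values computed for earlier tokens are appended to a cache, so each new token only computes its own projections and attends over the cached tensors instead of reprocessing the whole sequence. This is a didactic single-head sketch, not any particular framework's implementation.

    import torch

    d = 16
    Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
    k_cache, v_cache = [], []   # grows by one entry per generated token

    def decode_step(x_t):
        # x_t: embedding of the newest token, shape (d,)
        q = x_t @ Wq
        k_cache.append(x_t @ Wk)            # cache K/V instead of recomputing them later
        v_cache.append(x_t @ Wv)
        K = torch.stack(k_cache)            # (t, d): all past keys, including the current one
        V = torch.stack(v_cache)
        attn = torch.softmax(q @ K.T / d ** 0.5, dim=-1)
        return attn @ V                     # attention output for the new token only

    for _ in range(5):
        out = decode_step(torch.randn(d))
    print(out.shape, len(k_cache))          # torch.Size([16]) 5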

The posts analyze different performance bottlenecks - compute-bound, memory bandwidth-bound, communications-bound, and overhead-bound. Identifying the primary bottleneck is critical to applying the right optimization technique. The decoding phase is typically memory bandwidth-bound.

The concept of arithmetic intensity is introduced - the number of operations per byte of memory accessed. Higher intensity correlates with higher throughput. Depending on the factors influencing intensity, it may be possible to increase it to improve throughput, potentially reaching peak hardware compute limits. However, such intensity gains can adversely impact latency.
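A back-of-the-envelope calculation shows why batching raises arithmetic intensity in the decode phase: applying a d x d weight matrix to a batch of B single-token activations costs FLOPs proportional to B, while the dominant memory traffic (reading the weight matrix once per step) does not grow with B. The numbers below are illustrative only; a real estimate must also count activations, the KV cache, and the hardware's actual peak FLOPs and memory bandwidth.

    def decode_arithmetic_intensity(d_model=4096, batch_size=1, bytes_per_param=2):
        # One (B, d) x (d, d) matmul during decode: 2*B*d^2 FLOPs,
        # dominated by reading the d*d weight matrix (FP16 -> 2 bytes per parameter).
        flops = 2 * batch_size * d_model * d_model
        bytes_moved = d_model * d_model * bytes_per_param
        return flops / bytes_moved  # FLOPs per byte of memory traffic

    for b in (1, 8, 64, 256):
        print(b, decode_arithmetic_intensity(batch_size=b))
    # Intensity grows roughly linearly with batch size; throughput gains flatten once it
    # crosses the hardware's compute-to-bandwidth ratio (the knee of the roofline).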

The posts apply these concepts to analyze the arithmetic intensity of Transformer decoder blocks and how factors like batch size affect intensity. The key ideas revolve around identifying performance bottlenecks based on arithmetic intensity, and applying techniques like KV caching, attention architecture variants, kernel optimizations and quantization to strike a balance between latency and throughput.
