Next-Level AI on Standard GPUs: Discover PowerInfer's Innovation in Language Model Inference
Chander D.
CEO of Cazton, Author, Microsoft AI MVP, Microsoft RD & Google Developer Expert Award
Introduction
Large Language Models (LLMs) have gained significant attention in recent years due to their remarkable capabilities in creative writing, advanced code generation, and sophisticated natural language processing. However, deploying LLMs on consumer-grade GPUs remains challenging because of their substantial memory requirements. In this blog post, we discuss PowerInfer, a high-speed LLM inference engine designed to serve LLMs efficiently on personal computers equipped with a single consumer-grade GPU. We delve into the key insights, design principles, and performance evaluation of PowerInfer, which substantially outperforms existing solutions such as llama.cpp while retaining model accuracy.
Key Insights into Locality in LLM Inference
PowerInfer exploits two key insights into the locality of LLM inference:
1. Power-law Activation: LLM inference exhibits a high degree of locality: a small subset of neurons consistently contributes to the majority of activations (hot-activated neurons), while the remaining activations are spread across the majority of neurons, whose activation depends on the input at runtime (cold-activated neurons); the sketch after this list illustrates how such a split can be derived.
2. Fast In-CPU Computation: If activated neurons reside in CPU memory, computing them on the CPU is faster than transferring them to the GPU, especially given the small number of activated neurons and the small batch sizes typical of local deployments. Modern CPUs with vector extensions can handle such small matrix computations efficiently.
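To make the hot/cold distinction concrete, here is a minimal Python sketch of how neurons might be partitioned from profiled activation statistics. It is illustrative only and not PowerInfer's actual profiler; the activation_counts input, the 90% coverage threshold, and the function names are hypothetical.

```python
import numpy as np

def split_hot_cold(activation_counts: np.ndarray, hot_fraction: float = 0.9):
    """Split neurons into 'hot' and 'cold' sets from profiled activation counts.

    activation_counts: 1-D array where entry i is how many times neuron i fired
    over a calibration corpus (a hypothetical input to this sketch).
    hot_fraction: cumulative share of activations the hot set should cover.
    """
    order = np.argsort(activation_counts)[::-1]      # most active neurons first
    cumulative = np.cumsum(activation_counts[order])
    total = cumulative[-1]
    # Smallest prefix of neurons that covers `hot_fraction` of all activations.
    cutoff = np.searchsorted(cumulative, hot_fraction * total) + 1
    hot = order[:cutoff]    # few neurons, most activations -> candidates for the GPU
    cold = order[cutoff:]   # many neurons, few activations -> keep on the CPU
    return hot, cold

# Toy example: a power-law-like activation profile over 1,000 neurons.
rng = np.random.default_rng(0)
counts = rng.zipf(a=2.0, size=1000).astype(np.int64)
hot, cold = split_hot_cold(counts)
print(f"{len(hot)} hot neurons cover 90% of activations; {len(cold)} are cold")
```

With a power-law profile like the Zipf-distributed toy data above, a small prefix of neurons covers the bulk of activations, and it is this small hot subset that PowerInfer preloads onto the GPU.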
PowerInfer Overview
PowerInfer is an efficient LLM inference system optimized for local deployment on a single consumer-grade GPU. It exploits the locality in LLM inference by assigning the small set of hot-activated neurons to the GPU, while the cold-activated neurons, which constitute the majority, are handled by the CPU. PowerInfer preselects and preloads hot-activated neurons onto the GPU offline and leverages online predictors at runtime to identify which neurons will activate. This approach allows the GPU and CPU to process their respective sets of neurons independently, minimizing costly PCIe data transfers.
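The division of labor during inference can be sketched as follows. This is a simplified numpy illustration under stated assumptions, not PowerInfer's implementation: predict_active stands in for the online activation predictor, and plain array arithmetic stands in for the real GPU and CPU kernels.

```python
import numpy as np

def hybrid_ffn_layer(x, W, hot_ids, cold_ids, predict_active):
    """Illustrative GPU/CPU split for one sparsely activated layer.

    x: input vector of shape (d_in,)
    W: weight matrix of shape (n_neurons, d_in)
    hot_ids / cold_ids: neuron indices preloaded on the GPU / kept on the CPU
    predict_active: hypothetical online predictor returning the indices of
                    neurons expected to activate for this input.
    """
    active = set(predict_active(x))

    gpu_ids = np.array([i for i in hot_ids if i in active], dtype=np.int64)
    cpu_ids = np.array([i for i in cold_ids if i in active], dtype=np.int64)

    out = np.zeros(W.shape[0], dtype=x.dtype)
    # The "GPU" computes its resident activated neurons...
    if gpu_ids.size:
        out[gpu_ids] = W[gpu_ids] @ x
    # ...while the "CPU" computes the activated cold neurons; only the small
    # per-token results (not the weights) would need to be merged afterwards.
    if cpu_ids.size:
        out[cpu_ids] = W[cpu_ids] @ x
    return out
```

The key point is that only the small per-token activation results need to be combined (on the GPU, in PowerInfer's case), while the large weight matrices never move across PCIe during generation.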
Design of PowerInfer
1. Adaptive Sparsity Predictors: PowerInfer trains a non-fixed-size activation predictor for each Transformer layer using an iterative method. The process begins by establishing a baseline predictor size from the layer's sparsity profile; the size is then iteratively adjusted, taking internal activation skewness into account, to maintain prediction accuracy.
2. Neuron Placement and Management: PowerInfer utilizes an offline profiler and policy solver to guide the allocation of each neuron to either the GPU or CPU based on whether the neuron is hot-activated. The online inference engine of PowerInfer loads the model into the CPU and GPU memory as per the policy.
3. GPU-CPU Hybrid Execution: PowerInfer implements a GPU-CPU hybrid execution model, wherein both units independently compute their respective activated neurons and then combine the results on the GPU. This method effectively balances the computational workload, leveraging the strengths of each unit while reducing transfer time inefficiencies.
4. Neuron-aware Operator: PowerInfer introduces neuron-aware operators that directly compute activated neurons and their weights on both GPU and CPU without the need for runtime conversion to dense format. These operators focus on individual row/column vectors within a matrix rather than the entire matrix.
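As a rough illustration of the neuron-aware operators described in point 4, the sketch below computes only the rows and columns of the FFN weight matrices that correspond to activated neurons and checks the result against the dense computation. It assumes a ReLU-style FFN for simplicity; the function names are hypothetical, and this is not PowerInfer's kernel code.

```python
import numpy as np

def neuron_aware_up_proj(x, W_up, active_ids):
    """Row-wise sparse up-projection: only the rows of W_up belonging to
    activated neurons are read and multiplied; inactive rows are never touched."""
    h = np.zeros(W_up.shape[0], dtype=x.dtype)
    h[active_ids] = W_up[active_ids] @ x
    return h

def neuron_aware_down_proj(h, W_down, active_ids):
    """Column-wise sparse down-projection: only the columns of W_down that
    correspond to activated neurons contribute to the output."""
    return W_down[:, active_ids] @ h[active_ids]

# Toy check against the dense computation (ReLU-style sparsity assumed).
rng = np.random.default_rng(0)
d, n = 16, 64
x = rng.standard_normal(d)
W_up = rng.standard_normal((n, d))
W_down = rng.standard_normal((d, n))

h_dense = np.maximum(W_up @ x, 0.0)      # dense FFN hidden state
active = np.nonzero(h_dense > 0)[0]      # neurons that actually fired
h_sparse = np.maximum(neuron_aware_up_proj(x, W_up, active), 0.0)
y_sparse = neuron_aware_down_proj(h_sparse, W_down, active)
np.testing.assert_allclose(y_sparse, W_down @ h_dense, atol=1e-6)
```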
Performance Evaluation
PowerInfer was evaluated on two distinct PC configurations, representing high-end and low-end hardware scenarios. On a PC equipped with a single NVIDIA RTX 4090 GPU, PowerInfer delivers an average generation speed of 13.20 tokens/s for quantized models and 8.32 tokens/s for non-quantized models while maintaining model accuracy. These results significantly surpass llama.cpp, with up to 8.00× and 11.69× improvements for quantized and non-quantized models, respectively. Notably, the inference speed achieved on the RTX 4090 is only 18% lower than that of a top-tier, server-grade NVIDIA A100 GPU.
Conclusion
PowerInfer is a groundbreaking solution that bridges the gap between consumer-grade GPUs and server-grade GPUs for LLM inference. By exploiting the locality in LLM inference and designing a GPU-CPU hybrid execution model, PowerInfer significantly outperforms existing solutions like llama.cpp while retaining model accuracy. As LLMs continue to grow in size and complexity, PowerInfer's innovative approach to LLM inference will play a crucial role in making these powerful models more accessible and efficient on consumer-grade hardware.