vLLM: Efficient Caching for Large Language Model Serving


Disclaimer: the opinions I share are solely my own and do not reflect those of my employer.

Large Language Models (LLMs) are increasingly prevalent, but deploying them efficiently can be complex. vLLM simplifies this process with a user-friendly library that optimizes memory use and maximizes throughput. This article explores vLLM's architecture, setup, and deployment, demonstrating how it empowers developers to harness the power of LLMs easily.

Initially developed in the Sky Computing Lab at UC Berkeley, vLLM has since grown into a community-driven project. Its core architecture and key features, discussed below, are based on the paper:

"Efficient Memory Management for Large Language Model Serving with PagedAttention" paper published at 2023 Sep - https://arxiv.org/abs/2309.06180

To describe vLLM in simpler terms:

Imagine your brain is like a computer; you use it to understand and talk about things. Now, think about huge brains, like super-smart computers, that can do amazing things like write stories or answer any question you ask. These are called Large Language Models, or LLMs for short. vLLM is a special tool that helps these big brains think and respond much faster and more efficiently. It's like giving them a supercharger!

Here's an example:

  • Let's say you want one of these super-smart computers to write a poem about your cat. Without vLLM, it might take a little while to come up with the poem. But with vLLM, it's like the computer gets a turbo boost, and the poem is ready much faster. It's like the difference between walking to the store and driving a car – vLLM gets you there much quicker!

System Design -

How does it work?

vLLM utilizes sophisticated techniques like paged attention. This approach allows vLLM to efficiently handle long text sequences and numerous simultaneous requests by optimizing the usage of the KV Cache. vLLM creates different logical KV blocks for a request and fills them from left to right as the new cache is generated. The KV cache manager maintains block tables that map the logical and physical addresses of the KV blocks for each request. This is similar to how operating systems handle virtual memory and paging.
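To make the block-table analogy concrete, here is a deliberately simplified Python sketch of the idea (not vLLM's actual implementation): each request's logical blocks are filled from left to right and mapped, page-table style, to whatever physical blocks happen to be free.

# Simplified illustration of a KV-cache block table (not vLLM's real code).
BLOCK_SIZE = 16  # tokens stored per KV block

class KVBlockManager:
    def __init__(self, num_physical_blocks: int):
        # Pool of free physical block IDs in GPU memory.
        self.free_blocks = list(range(num_physical_blocks))
        # Per-request block table: logical block index -> physical block ID.
        self.block_tables = {}

    def block_for_token(self, request_id: str, token_index: int) -> int:
        """Return the physical block that holds this token's KV entry."""
        table = self.block_tables.setdefault(request_id, [])
        logical_block = token_index // BLOCK_SIZE
        if logical_block == len(table):            # next logical block needed
            table.append(self.free_blocks.pop())   # map it to any free physical block
        return table[logical_block]

    def release(self, request_id: str) -> None:
        """Return all physical blocks to the free pool when a request finishes."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))

manager = KVBlockManager(num_physical_blocks=8)
for t in range(40):                          # one request generating 40 tokens
    physical_block = manager.block_for_token("req-1", t)
manager.release("req-1")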


Core Architecture and Key Features

  • PagedAttention: vLLM uses PagedAttention to manage the KV cache and serve large language models efficiently. Instead of recomputing the keys and values of every previous token at each step of autoregressive generation, the key-value tensors are cached in GPU VRAM, and PagedAttention stores this cache in fixed-size blocks rather than one large contiguous buffer. Each request sees a contiguous sequence of logical blocks that are mapped to potentially non-contiguous physical blocks in GPU memory, similar to how operating systems manage virtual memory. This allows vLLM to efficiently handle long text sequences and numerous simultaneous requests by minimizing KV cache fragmentation and waste.
  • Continuous Batching: vLLM supports dynamic request batching for higher throughput. vLLM can preempt requests to free up KV cache space for other requests and recomputes preempted requests when sufficient KV cache space becomes available.
  • Quantization: vLLM supports quantization methods like GPTQ, AWQ, INT4, INT8, and FP8, which help reduce memory usage and improve performance.
  • CUDA/HIP Graph: vLLM enables fast model execution with CUDA/HIP graph.
  • Optimized CUDA Kernels: vLLM includes optimized CUDA kernels and integration with FlashAttention and FlashInfer for faster performance.
  • Prefix Caching: vLLM supports prefix caching, which allows common prompt prefixes to be cached across requests so that new requests can reuse the cached prefix without recomputation. vLLM also provides automatic prefix caching, which automatically finds opportunities to reuse the cache of previous requests (a configuration sketch follows this list).
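Several of these features are exposed as simple flags on vLLM's Python API. Here is a minimal sketch, assuming a recent vLLM version; the model name is just an arbitrary example of an AWQ-quantized checkpoint:

from vllm import LLM

# Illustrative only: an AWQ-quantized checkpoint plus a few feature flags.
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",   # example quantized model
    quantization="awq",                # select the AWQ kernels
    enable_prefix_caching=True,        # reuse KV cache for shared prompt prefixes
    gpu_memory_utilization=0.9,        # fraction of GPU VRAM for weights + KV cache
)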

vLLM's architecture components

  • LLM Class: This class provides the primary Python interface for offline inference (see the usage sketch after this list).
  • OpenAI-Compatible API Server: vLLM offers an OpenAI-compatible API server that can be started using the vllm serve command.
  • LLMEngine and AsyncLLMEngine: These classes are central to vLLM, handling model inference and asynchronous request processing. The LLMEngine is responsible for receiving requests and generating outputs, while AsyncLLMEngine is an asynchronous wrapper around LLMEngine designed for online serving.
  • Worker: A worker is a process that runs the model inference. One worker process controls one accelerator device (e.g., GPU).
  • Model Runner: Each worker has a model runner object responsible for loading and running the model.
  • Model: Every model runner object has one model object: the actual torch.nn.Module instance.
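To tie these components together, here is a minimal offline-inference sketch built around the LLM class described above (the prompts and sampling values are arbitrary; this assumes vLLM is installed and a GPU is available):

from vllm import LLM, SamplingParams

# Load a small model for offline (batch) inference.
llm = LLM(model="gpt2")

# Sampling settings applied to every prompt in the batch.
params = SamplingParams(temperature=0.8, max_tokens=50)

prompts = ["Once upon a time,", "The capital of France is"]
outputs = llm.generate(prompts, params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)

Under the hood, the LLM class drives the LLMEngine, while the vllm serve command wraps the same engine in an AsyncLLMEngine for online serving.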


Setting Up and Running vLLM (Simple Setup)

  • Environment Setup:

Ensure you have Python 3.9 or higher installed (check the vLLM documentation for the exact versions currently supported).

Install necessary libraries:

pip install torch transformers        

  • Hardware Requirements:

Please make sure you have access to a GPU (NVIDIA preferred) since vLLM is optimized for GPU performance.

  • Clone vLLM Repository:

Clone the vLLM GitHub repository

git clone https://github.com/vllm-project/vllm.git
cd vllm        

  • Install vLLM

From the cloned repository, install vLLM (alternatively, you can install the released package directly with pip install vllm):

pip install .        

  • Start the Inference Server

You can start the vLLM inference server for a specific model. For example, to use the GPT-2 model:

vllm serve gpt2

  • Send Requests to the Server

The server runs locally and exposes an OpenAI-compatible API (the default port is 8000). You can send requests using a simple Python script or curl.

Here's a sample Python script to send a request to the vLLM server:

import requests

# The vLLM OpenAI-compatible server listens on port 8000 by default
server_url = "http://localhost:8000/v1/completions"

# Define the payload for the request (OpenAI-style completion parameters)
payload = {
    "model": "gpt2",
    "prompt": "Once upon a time,",
    "max_tokens": 50,
    "n": 1
}

# Send a POST request to the vLLM server
response = requests.post(server_url, json=payload)

# Check if the request was successful
if response.status_code == 200:
    generated_text = response.json()["choices"][0]["text"]
    print("Generated Text:", generated_text)
else:
    print("Error:", response.status_code, response.text)

  • Accessing Results:

The generated text will be returned in the response. You can extract and use this text as needed.

You can adjust various parameters, such as max_tokens, temperature, and n, in your payload to customize the output according to your needs.
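For example, a payload that turns on sampling controls might look like the following; the values are arbitrary illustrations, and the fields follow the OpenAI completions schema that the vLLM server accepts:

payload = {
    "model": "gpt2",
    "prompt": "Once upon a time,",
    "max_tokens": 80,        # upper bound on generated tokens
    "temperature": 0.7,      # higher values produce more varied output
    "top_p": 0.9,            # nucleus sampling cutoff
    "n": 2,                  # number of completions to return
    "stop": ["\n\n"]         # stop generating at a blank line
}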

Monitor GPU usage and optimize requests to get the best performance without running into memory issues.


Deploying vLLM

vLLM can be deployed in various environments (a client-side sketch follows this list), including:

  • Local Deployment: vLLM can be run locally to serve models. Make sure the GPU meets the minimum compute capability requirements.
  • Docker: vLLM offers Docker support for deployment, which ensures consistent behavior across environments.
  • Kubernetes: vLLM can be deployed on Kubernetes for scalable and reliable LLM serving.
  • Cloud Deployment: vLLM can be deployed on any cloud infrastructure, which can improve inference efficiency and reduce costs.
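Whichever environment you choose, clients interact with the deployment through the same OpenAI-compatible API. Here is a minimal client sketch, assuming the GPT-2 server from the setup section is reachable at localhost:8000 and the official openai Python package (v1 or later) is installed:

from openai import OpenAI

# The api_key is required by the client but ignored by a default vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="gpt2",
    prompt="Once upon a time,",
    max_tokens=50,
)
print(completion.choices[0].text)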


Comparing architecture and design with other frameworks

vLLM

  • Architecture: vLLM is designed for efficient large language model serving. It uses PagedAttention and KV-cache block sharing (for example, across parallel samples of the same prompt) to reduce memory overhead.
  • Design: It supports continuous (dynamic) batching out of the box, allowing multiple requests to be processed together, which increases throughput.
  • Ideal for: Research environments or production scenarios needing to serve multiple requests simultaneously while efficiently utilizing GPU resources.

Hugging Face Transformers

  • Architecture: This modular library makes it easy to load different transformer models. It's built with Python and provides a simple API for working with various models.
  • Design: It's designed to facilitate a "one-stop-shop" for NLP tasks, allowing you to train, fine-tune, and serve models with just a few lines of code.
  • Ideal for: Prototyping, academic research, and applications that require diverse NLP models, such as sentiment analysis or text generation.

TensorFlow Serving

  • Architecture: It is built to deploy TensorFlow models, focusing on robustness and high performance. It supports gRPC and RESTful APIs.
  • Design: It enables version control for models, allowing you to seamlessly roll back to previous versions and manage multiple models.
  • Ideal for: TensorFlow models that need to be integrated into more extensive applications, especially those requiring real-time inference and scalability.

Triton Inference Server

  • Architecture: Triton is framework-agnostic, meaning it can serve models from various frameworks (like TensorFlow, PyTorch, etc.) using a single server.
  • Design: It offers features like dynamic batching and ensemble models, allowing different models to work together efficiently.
  • Ideal for: Complex production environments that need efficient inference across multiple model types and frameworks.

Ollama

  • Architecture: Ollama is designed for local model deployment. Its streamlined command-line interface focuses on simplicity and ease of use.
  • Design: It allows users to run and interact with language models locally without extensive setup or configuration.
  • Ideal for: Developers who want to quickly run and experiment with language models, especially in local or development environments.


AIBrix: A Scalable, Effective Control Plane for vLLM

AIBrix is a toolkit that helps vLLM work better, especially when running in production, and significantly improves performance in real-world scenarios. Think of AIBrix as a control plane for vLLM, making it more scalable and cost-effective.

Here's how AIBrix helps vLLM:

  • Manages LoRA models efficiently: AIBrix helps load and unload LoRA models dynamically, optimizing the use of computing resources and reducing costs.
  • Improves traffic management: It uses an advanced LLM gateway to send user requests smartly, considering factors like token patterns and memory usage, which reduces delays.
  • Acts as a bridge: AIBrix provides a unified AI runtime that allows different components to communicate seamlessly, making it easier to manage models and resources.
  • Optimizes autoscaling: It adjusts computing resources automatically based on the workload, ensuring efficient performance and reducing latency.
  • Enhances memory usage: AIBrix uses a distributed KV cache to optimize network and memory efficiency, improving overall performance.
  • Provides tools for various scenarios: It offers features like a GPU optimizer for heterogeneous GPU serving and diagnostic tools for identifying and addressing issues.


In summary, vLLM emerges as a powerful solution for addressing the challenges of deploying and serving Large Language Models (LLMs). Through its innovative architecture, including PagedAttention for efficient memory management and continuous batching for higher throughput, vLLM optimizes resource utilization and enhances inference speed.

Reference:

vLLM: https://blog.vllm.ai/

Code Repo: https://github.com/vllm-project
