vLLM or Triton
Biswarup Ghosh
Head of ML at replayz. Expert in Search / Recommender Systems, GenAI, and RAG-powered AI products.
I wanted to share this experience to highlight some hidden challenges and provide insights for others facing similar situations. I am not an expert in either tool, but there are a few things I came across along the way.
P.S. vLLM does not support the T5 architecture natively; I will write later about that pain :(
---
The Issue: Ineffective Batch Processing with Triton Inference Server
Background:
- Model Architecture: Flan-T5
- GPU: 1x NVIDIA A10 (sorry, I am GPU poor)
- Expectation: Increasing the batch size should reduce the per-item inference latency due to the parallel processing capabilities of modern GPUs.
- Deployment Goal: Transition from a custom inference script to Triton Inference Server to leverage its scalability.
Observations:
When testing the model locally and through custom scripts, I observed the expected behavior (a rough benchmark sketch follows the list below):
- Local Inference: As the batch size increased, the total inference time grew sub-linearly, resulting in decreased per-item latency.
- Example: Processing a batch of 32 inputs took less time per input than processing 32 individual inputs sequentially.
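For reference, the kind of local benchmark I mean looks roughly like this. It is a minimal sketch assuming Hugging Face transformers; the checkpoint, prompts, and generation length are illustrative placeholders, not my exact setup:

```python
# Minimal local batching benchmark sketch (illustrative checkpoint and prompts).
import time
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base").to("cuda").eval()

prompts = ["summarize: " + "some input text " * 20] * 32

@torch.inference_mode()
def run_batch(batch):
    inputs = tokenizer(batch, return_tensors="pt", padding=True).to("cuda")
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=64)
    torch.cuda.synchronize()
    return time.perf_counter() - start

for bs in (1, 2, 4, 8, 16, 32):
    total = run_batch(prompts[:bs])
    print(f"batch={bs:2d}  total={total:.2f}s  per_item={total / bs:.2f}s")
```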
However, upon deploying the model with Triton Inference Server, the results were surprising:
- Constant Per-Item Latency: The total inference time increased linearly with the batch size, resulting in a constant per-item latency regardless of the batch size.
- No Throughput Gain: Larger batches did not yield any speedup, negating the expected benefits of batching.
Benchmark Results:
Using Triton Inference Server:
| Batch Size | Total Time (s) | Per-Item Latency (s) |
|------------|----------------|----------------------|
| 1 | 1.2 | 1.2 |
| 2 | 2.4 | 1.2 |
| 4 | 4.8 | 1.2 |
| 8 | 9.6 | 1.2 |
| 16 | 19.2 | 1.2 |
| 32 | 38.4 | 1.2 |
Local Inference:
| Batch Size | Total Time (s) | Per-Item Latency (s) |
|------------|----------------|----------------------|
| 1 | 1.0 | 1.0 |
| 2 | 1.8 | 0.9 |
| 4 | 3.2 | 0.8 |
| 8 | 5.6 | 0.7 |
| 16 | 9.6 | 0.6 |
| 32 | 16.0 | 0.5 |
Analysis:
The Triton Inference Server did not seem to leverage the model's ability to process batches efficiently. It appeared as though Triton was internally processing each item in the batch sequentially rather than in parallel.
---
Investigating the Cause
Potential Factors:
1. Model Configuration:
- Max Batch Size: Confirmed that max_batch_size was set appropriately in the model configuration (`config.pbtxt`); an illustrative config sketch follows this list.
- Input Shapes: Ensured that the model inputs were defined with variable batch dimensions.
2. Dynamic Batching:
- Dynamic Batching Settings: Verified that dynamic batching was enabled to allow Triton to combine incoming requests.
- Batch Delay: Adjusted max_queue_delay_microseconds to see if it affected batching behavior.
3. Instance Groups:
- Instance Count: Checked the number of model instances running in Triton, which could affect parallelism.
- GPU Utilization: Monitored GPU usage to see if resources were being underutilized.
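For illustration, a `config.pbtxt` covering these three areas might look roughly like the sketch below. The model name, tensor names, dims, and values are placeholders rather than the exact configuration I used, and the tensor names must match however the model was exported.

```protobuf
# Illustrative config.pbtxt sketch; names, dims, and values are placeholders.
name: "flan_t5"
platform: "pytorch_libtorch"
max_batch_size: 32            # batch dimension is implicit; dims below exclude it

input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]               # variable sequence length
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]

dynamic_batching {
  max_queue_delay_microseconds: 100
}

instance_group [
  { count: 1, kind: KIND_GPU }
]
```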
Findings:
- Batch Dimension Misinterpretation: Discovered that Triton can misinterpret the batch dimension if it is not explicitly defined, effectively treating a batched request as a series of individual ones (see the client-side sketch below).
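To make the batch dimension explicit, the client request itself has to carry it. Here is a minimal sketch with the Triton HTTP client; the model name, tensor names, and dtypes are assumptions and must match whatever is in `config.pbtxt`:

```python
# Sketch of a client request with an explicit batch dimension.
# Model name, tensor names, and dtypes are assumptions.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

batch_size, seq_len = 8, 64
input_ids = np.random.randint(0, 32000, size=(batch_size, seq_len), dtype=np.int64)
attention_mask = np.ones((batch_size, seq_len), dtype=np.int64)

inputs = [
    httpclient.InferInput("input_ids", list(input_ids.shape), "INT64"),
    httpclient.InferInput("attention_mask", list(attention_mask.shape), "INT64"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

result = client.infer(
    model_name="flan_t5",
    inputs=inputs,
    outputs=[httpclient.InferRequestedOutput("output_ids")],
)
print(result.as_numpy("output_ids").shape)  # expect a leading batch dimension
```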
---
Comparing with vLLM
Switching to vLLM, I observed a different behavior:
- Efficient Batching: vLLM handled batch inputs as expected, reducing per-item latency with increased batch sizes out of the box
Benchmark Results with vLLM:
| Batch Size | Total Time (s) | Per-Item Latency (s) |
|------------|----------------|----------------------|
| 1 | 1.1 | 1.1 |
| 2 | 1.9 | 0.95 |
| 4 | 3.4 | 0.85 |
| 8 | 6.0 | 0.75 |
| 16 | 10.4 | 0.65 |
| 32 | 18.4 | 0.575 |
Key Differences:
- Input Handling: vLLM correctly interpreted the batch dimension and leveraged the model's parallel processing capabilities.
- Specialized for LLMs: vLLM is tailored for transformer-based models, offering optimizations that Triton may not provide out of the box (a minimal usage sketch follows below).
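For comparison, batching with vLLM needs almost no setup. Here is a minimal sketch of its API, shown with a decoder-only placeholder checkpoint since, as noted at the top, vLLM does not support T5 natively:

```python
# Minimal vLLM usage sketch. The model name is a placeholder decoder-only
# checkpoint, since vLLM does not support the T5 architecture natively.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-1.3b")  # placeholder checkpoint
params = SamplingParams(max_tokens=64, temperature=0.0)

prompts = ["Summarize: some input text ..."] * 32  # a whole batch in one call
outputs = llm.generate(prompts, params)            # vLLM batches and schedules internally

for out in outputs[:2]:
    print(out.outputs[0].text)
```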
---
Resolution and Insights
Adjusting Triton Configuration:
After further investigation, I made specific adjustments to the Triton model configuration:
1. Explicitly Defining Input Shapes:
- Defined the input dims in `config.pbtxt` without the batch dimension, so that Triton can prepend and manage it (see the config sketch earlier).
2. Ensuring Correct Data Formatting:
- Made sure the client sent tensors whose shapes and data types matched the model configuration, with the batch dimension explicit in the request.
3. Verifying Model Export:
- Re-exported the PyTorch model, ensuring that the batch size dimension was correctly set as dynamic in the TorchScript export (a rough sketch follows below).
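A rough sketch of the kind of re-export I mean, tracing a simple forward wrapper. The wrapper, tensor names, and shapes are illustrative, and a real Flan-T5 deployment also has to handle the encoder-decoder generation loop, which I am glossing over here:

```python
# Rough sketch of re-exporting with a dynamic batch dimension via tracing.
import torch
from transformers import AutoModelForSeq2SeqLM

class T5Forward(torch.nn.Module):
    """Illustrative wrapper exposing a single traceable forward pass."""
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask, decoder_input_ids):
        # First element of the output tuple is the LM logits.
        return self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            decoder_input_ids=decoder_input_ids,
        )[0]

# torchscript=True makes the HF model export-friendly (untied weights, tuple outputs).
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torchscript=True).eval()
wrapper = T5Forward(model).eval()

# Example inputs fix dtypes/ranks for tracing; other batch sizes still work at
# runtime as long as the traced graph has no shape-dependent branching.
example = (
    torch.randint(0, 32000, (2, 16)),      # input_ids
    torch.ones(2, 16, dtype=torch.long),   # attention_mask
    torch.zeros(2, 1, dtype=torch.long),   # decoder_input_ids
)
traced = torch.jit.trace(wrapper, example)
traced.save("model.pt")  # placed under the Triton model repository, e.g. flan_t5/1/model.pt
```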
Results After Adjustments:
- Improved Batching Efficiency: Post-adjustment, Triton began to show reduced per-item latency with increased batch sizes.
- Performance Metrics:
| Batch Size | Total Time (s) | Per-Item Latency (s) |
|------------|----------------|----------------------|
| 1 | 1.1 | 1.1 |
| 2 | 1.8 | 0.9 |
| 4 | 3.0 | 0.75 |
| 8 | 5.2 | 0.65 |
| 16 | 9.0 | 0.56 |
| 32 | 16.0 | 0.5 |
Key Takeaways
1. Model Configuration Matters:
- Ensure that the model's input dimensions are correctly defined with dynamic batch sizes.
- Triton relies heavily on accurate model configurations to optimize inference.
2. Data Formatting is Crucial:
- Input data must match the expected shapes and types defined in the model configuration.
- Mismatches can lead to inefficient processing or errors.
3. Tool Specialization:
- While Triton is versatile, it may require additional configuration for optimal performance with LLMs.
- vLLM offers LLM-specific optimizations, potentially reducing the need for extensive configuration.
---
Conclusion
It's imperative that you don't use the first configuration that a Google search or GPT throws at you for Triton inference; most tutorials use a specific config that is not suited to leveraging batching efficiencies. In contrast, vLLM provides a more straightforward path for LLM deployment with its specialized optimizations.
Final Thoughts:
- Choose Triton Inference Server when:
- You need to serve multiple model types and require a unified deployment platform.
- You're willing to invest time in configuring and optimizing the server for your specific models.
- Choose vLLM when:
- Your primary focus is on LLMs and you want out-of-the-box optimizations.
- You prefer a tool that's specialized for transformer-based models with minimal configuration overhead.