vLLM or Triton
Biswarup Ghosh
Head of ML at replayz. Expert in Search / Recommender Systems, GenAI, and RAG-powered AI products.
I wanted to share this experience to highlight some hidden challenges and provide insights for others facing similar situations. I am not an expert in either tool, but there are a few things I came across along the way.
P.S. vLLM does not support the T5 architecture natively; I will write later about that pain :(
---
The Issue: Ineffective Batch Processing with Triton Inference Server
Background:
- Model Architecture: Flan-T5
- GPU: 1x NVIDIA A10 (sorry, I am GPU poor)
- Expectation: Increasing the batch size should reduce the per-item inference latency due to the parallel processing capabilities of modern GPUs.
- Deployment Goal: Transition from a custom inference script to Triton Inference Server to leverage its scalability.
Observations:
When testing the model locally and through custom scripts, I observed the expected behavior (a rough benchmark sketch follows the list below):
- Local Inference: As the batch size increased, the total inference time grew sub-linearly, resulting in decreased per-item latency.
- Example: Processing a batch of 32 inputs took less time per input than processing 32 individual inputs sequentially.
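For reference, the kind of local benchmark I mean looks roughly like this. It is a minimal sketch assuming Hugging Face transformers; the checkpoint, prompts, and generation length are illustrative placeholders, not my exact setup:

```python
# Minimal local batching benchmark sketch (illustrative checkpoint and prompts).
import time
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base").to("cuda").eval()

prompts = ["summarize: " + "some input text " * 20] * 32

@torch.inference_mode()
def run_batch(batch):
    inputs = tokenizer(batch, return_tensors="pt", padding=True).to("cuda")
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=64)
    torch.cuda.synchronize()
    return time.perf_counter() - start

for bs in (1, 2, 4, 8, 16, 32):
    total = run_batch(prompts[:bs])
    print(f"batch={bs:2d}  total={total:.2f}s  per_item={total / bs:.2f}s")
```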
However, upon deploying the model with Triton Inference Server, the results were surprising:
- Constant Per-Item Latency: The total inference time increased linearly with the batch size, resulting in a constant per-item latency regardless of the batch size.
- No Throughput Gain: Larger batches did not yield any speedup, negating the expected benefits of batching.
Benchmark Results:
Using Triton Inference Server:
| Batch Size | Total Time (s) | Per-Item Latency (s) |
|------------|----------------|----------------------|
| 1 | 1.2 | 1.2 |
| 2 | 2.4 | 1.2 |
| 4 | 4.8 | 1.2 |
| 8 | 9.6 | 1.2 |
| 16 | 19.2 | 1.2 |
| 32 | 38.4 | 1.2 |
Local Inference:
| Batch Size | Total Time (s) | Per-Item Latency (s) |
|------------|----------------|----------------------|
| 1 | 1.0 | 1.0 |
| 2 | 1.8 | 0.9 |
| 4 | 3.2 | 0.8 |
| 8 | 5.6 | 0.7 |
| 16 | 9.6 | 0.6 |
| 32 | 16.0 | 0.5 |
Analysis:
The Triton Inference Server did not seem to leverage the model's ability to process batches efficiently. It appeared as though Triton was internally processing each item in the batch sequentially rather than in parallel.
---
Investigating the Cause
Potential Factors:
1. Model Configuration:
- Max Batch Size: Confirmed that max_batch_size was set appropriately in the model configuration (`config.pbtxt`); an illustrative config sketch follows this list.
- Input Shapes: Ensured that the model inputs were defined with variable batch dimensions.
2. Dynamic Batching:
- Dynamic Batching Settings: Verified that dynamic batching was enabled to allow Triton to combine incoming requests.
- Batch Delay: Adjusted max_queue_delay_microseconds to see if it affected batching behavior.
3. Instance Groups:
- Instance Count: Checked the number of model instances running in Triton, which could affect parallelism.
- GPU Utilization: Monitored GPU usage to see if resources were being underutilized.
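For illustration, a `config.pbtxt` covering these three areas might look roughly like the sketch below. The model name, tensor names, dims, and values are placeholders rather than the exact configuration I used, and the tensor names must match however the model was exported.

```protobuf
# Illustrative config.pbtxt sketch; names, dims, and values are placeholders.
name: "flan_t5"
platform: "pytorch_libtorch"
max_batch_size: 32            # batch dimension is implicit; dims below exclude it

input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]               # variable sequence length
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]

dynamic_batching {
  max_queue_delay_microseconds: 100
}

instance_group [
  { count: 1, kind: KIND_GPU }
]
```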
Findings:
- Batch Dimension Misinterpretation: Discovered that Triton can misinterpret the batch dimension if it is not explicitly defined, effectively treating a batched request as a series of individual ones (see the client-side sketch below).
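To make the batch dimension explicit, the client request itself has to carry it. Here is a minimal sketch with the Triton HTTP client; the model name, tensor names, and dtypes are assumptions and must match whatever is in `config.pbtxt`:

```python
# Sketch of a client request with an explicit batch dimension.
# Model name, tensor names, and dtypes are assumptions.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

batch_size, seq_len = 8, 64
input_ids = np.random.randint(0, 32000, size=(batch_size, seq_len), dtype=np.int64)
attention_mask = np.ones((batch_size, seq_len), dtype=np.int64)

inputs = [
    httpclient.InferInput("input_ids", list(input_ids.shape), "INT64"),
    httpclient.InferInput("attention_mask", list(attention_mask.shape), "INT64"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

result = client.infer(
    model_name="flan_t5",
    inputs=inputs,
    outputs=[httpclient.InferRequestedOutput("output_ids")],
)
print(result.as_numpy("output_ids").shape)  # expect a leading batch dimension
```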
---
Comparing with vLLM
Switching to vLLM, I observed a different behavior:
- Efficient Batching: vLLM handled batch inputs as expected, reducing per-item latency with increased batch sizes out of the box
Benchmark Results with vLLM:
| Batch Size | Total Time (s) | Per-Item Latency (s) |
|------------|----------------|----------------------|
| 1 | 1.1 | 1.1 |
| 2 | 1.9 | 0.95 |
| 4 | 3.4 | 0.85 |
| 8 | 6.0 | 0.75 |
| 16 | 10.4 | 0.65 |
| 32 | 18.4 | 0.575 |
Key Differences:
- Input Handling: vLLM correctly interpreted the batch dimension and leveraged the model's parallel processing capabilities.
- Specialized for LLMs: vLLM is tailored for transformer-based models, offering optimizations that Triton may not provide out of the box (a minimal usage sketch follows below).
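For comparison, batching with vLLM needs almost no setup. Here is a minimal sketch of its API, shown with a decoder-only placeholder checkpoint since, as noted at the top, vLLM does not support T5 natively:

```python
# Minimal vLLM usage sketch. The model name is a placeholder decoder-only
# checkpoint, since vLLM does not support the T5 architecture natively.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-1.3b")  # placeholder checkpoint
params = SamplingParams(max_tokens=64, temperature=0.0)

prompts = ["Summarize: some input text ..."] * 32  # a whole batch in one call
outputs = llm.generate(prompts, params)            # vLLM batches and schedules internally

for out in outputs[:2]:
    print(out.outputs[0].text)
```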
---
Resolution and Insights
Adjusting Triton Configuration:
After further investigation, I made specific adjustments to the Triton model configuration:
1. Explicitly Defining Input Shapes:
- Defined the input dims in `config.pbtxt` without the batch dimension, so that Triton can prepend and manage it (see the config sketch earlier).
2. Ensuring Correct Data Formatting:
- Made sure the client sent tensors whose shapes and data types matched the model configuration, with the batch dimension explicit in the request.
3. Verifying Model Export:
- Re-exported the PyTorch model, ensuring that the batch size dimension was correctly set as dynamic in the TorchScript export (a rough sketch follows below).
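A rough sketch of the kind of re-export I mean, tracing a simple forward wrapper. The wrapper, tensor names, and shapes are illustrative, and a real Flan-T5 deployment also has to handle the encoder-decoder generation loop, which I am glossing over here:

```python
# Rough sketch of re-exporting with a dynamic batch dimension via tracing.
import torch
from transformers import AutoModelForSeq2SeqLM

class T5Forward(torch.nn.Module):
    """Illustrative wrapper exposing a single traceable forward pass."""
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask, decoder_input_ids):
        # First element of the output tuple is the LM logits.
        return self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            decoder_input_ids=decoder_input_ids,
        )[0]

# torchscript=True makes the HF model export-friendly (untied weights, tuple outputs).
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torchscript=True).eval()
wrapper = T5Forward(model).eval()

# Example inputs fix dtypes/ranks for tracing; other batch sizes still work at
# runtime as long as the traced graph has no shape-dependent branching.
example = (
    torch.randint(0, 32000, (2, 16)),      # input_ids
    torch.ones(2, 16, dtype=torch.long),   # attention_mask
    torch.zeros(2, 1, dtype=torch.long),   # decoder_input_ids
)
traced = torch.jit.trace(wrapper, example)
traced.save("model.pt")  # placed under the Triton model repository, e.g. flan_t5/1/model.pt
```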
Results After Adjustments:
- Improved Batching Efficiency: Post-adjustment, Triton began to show reduced per-item latency with increased batch sizes.
- Performance Metrics:
| Batch Size | Total Time (s) | Per-Item Latency (s) |
|------------|----------------|----------------------|
| 1 | 1.1 | 1.1 |
| 2 | 1.8 | 0.9 |
| 4 | 3.0 | 0.75 |
| 8 | 5.2 | 0.65 |
| 16 | 9.0 | 0.56 |
| 32 | 16.0 | 0.5 |
Key Takeaways
1. Model Configuration Matters:
- Ensure that the model's input dimensions are correctly defined with dynamic batch sizes.
- Triton relies heavily on accurate model configurations to optimize inference.
2. Data Formatting is Crucial:
- Input data must match the expected shapes and types defined in the model configuration.
- Mismatches can lead to inefficient processing or errors.
3. Tool Specialization:
- While Triton is versatile, it may require additional configuration for optimal performance with LLMs.
- vLLM offers LLM-specific optimizations, potentially reducing the need for extensive configuration.
---
Conclusion
It's imperative that you don't use the first configuration that a Google search or GPT throws at you for Triton inference; most tutorials use a specific config that is not suited to leveraging batching efficiencies. In contrast, vLLM provides a more straightforward path for LLM deployment with its specialized optimizations.
Final Thoughts:
- Choose Triton Inference Server when:
- You need to serve multiple model types and require a unified deployment platform.
- You're willing to invest time in configuring and optimizing the server for your specific models.
- Choose vLLM when:
- Your primary focus is on LLMs and you want out-of-the-box optimizations.
- You prefer a tool that's specialized for transformer-based models with minimal configuration overhead.