vLLM or Triton


I wanted to share this experience to highlight some hidden challenges and provide insights for others facing similar situations. I am not an expert in either tool, but there are a few things I came across.


P.S. vLLM does not support the T5 architecture natively; I will write later about that pain :(

---

The Issue: Ineffective Batch Processing with Triton Inference Server

Background:

- Model Architecture: Flan-T5

- GPU: 1 × A10 (sorry, I am GPU poor)

- Expectation: Increasing the batch size should reduce the per-item inference latency due to parallel processing capabilities of modern GPUs.

- Deployment Goal: Transition from a custom inference script to Triton Inference Server to leverage its scalability.

Observations:

When testing the model locally and through custom scripts, I observed the expected behavior:

- Local Inference: As the batch size increased, the total inference time grew sub-linearly, resulting in decreased per-item latency.

- Example: Processing a batch of 32 inputs took less time per input than processing 32 individual inputs sequentially.

However, upon deploying the model with Triton Inference Server, the results were surprising:

- Constant Per-Item Latency: The total inference time increased linearly with the batch size, resulting in a constant per-item latency regardless of the batch size.

- No Throughput Gain: Larger batches did not yield any speedup, negating the benefits of batching.

Benchmark Results:

Using Triton Inference Server:

| Batch Size | Total Time (s) | Per-Item Latency (s) |
|------------|----------------|----------------------|
| 1 | 1.2 | 1.2 |
| 2 | 2.4 | 1.2 |
| 4 | 4.8 | 1.2 |
| 8 | 9.6 | 1.2 |
| 16 | 19.2 | 1.2 |
| 32 | 38.4 | 1.2 |

Local Inference:

| Batch Size | Total Time (s) | Per-Item Latency (s) |
|------------|----------------|----------------------|
| 1 | 1.0 | 1.0 |
| 2 | 1.8 | 0.9 |
| 4 | 3.2 | 0.8 |
| 8 | 5.6 | 0.7 |
| 16 | 9.6 | 0.6 |
| 32 | 16.0 | 0.5 |

Analysis:

The Triton Inference Server did not seem to leverage the model's ability to process batches efficiently. It appeared as though Triton was internally processing each item in the batch sequentially rather than in parallel.

---

Investigating the Cause

Potential Factors:

1. Model Configuration:

- Max Batch Size: Confirmed that `max_batch_size` was set appropriately in the model configuration (`config.pbtxt`).

- Input Shapes: Ensured that the model inputs were defined with variable batch dimensions.

2. Dynamic Batching:

- Dynamic Batching Settings: Verified that dynamic batching was enabled to allow Triton to combine incoming requests.

- Batch Delay: Adjusted `max_queue_delay_microseconds` to see if it affected batching behavior.

3. Instance Groups:

- Instance Count: Checked the number of model instances running in Triton, which could affect parallelism.

- GPU Utilization: Monitored GPU usage to see if resources were being underutilized.

Findings:

- Batch Dimension Misinterpretation: Discovered that Triton can misinterpret the batch dimension if it is not explicitly defined, effectively treating each batch as a series of individual requests (a configuration sketch follows).
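For reference, this is roughly the shape of `config.pbtxt` I ended up iterating towards. Treat it as a sketch rather than a drop-in file: the tensor names, data types, and delay value are assumptions for a TorchScript export of the model's forward pass, and the `__<index>` suffixes follow Triton's PyTorch backend convention of mapping inputs by position.

```
name: "flan_t5"
platform: "pytorch_libtorch"   # TorchScript model (1/model.pt)
max_batch_size: 32             # > 0 enables the implicit batch dimension

# dims exclude the batch dimension; -1 keeps the sequence length variable
input [
  {
    name: "INPUT__0"           # input_ids
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "INPUT__1"           # attention_mask
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "INPUT__2"           # decoder_input_ids
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT__0"          # logits
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]

# let Triton merge small concurrent requests into larger batches
dynamic_batching {
  max_queue_delay_microseconds: 100
}

instance_group [
  { count: 1, kind: KIND_GPU }
]
```

The key detail for this story is `max_batch_size` plus dims that omit the batch dimension: that is what tells Triton the first axis is a real batch axis rather than part of each request's shape.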

---

Comparing with vLLM

Switching to vLLM, I observed a different behavior:

- Efficient Batching: vLLM handled batch inputs as expected, reducing per-item latency with increased batch sizes out of the box.

- **Benchmark Results with vLLM:**

| Batch Size | Total Time (s) | Per-Item Latency (s) |
|------------|----------------|----------------------|
| 1 | 1.1 | 1.1 |
| 2 | 1.9 | 0.95 |
| 4 | 3.4 | 0.85 |
| 8 | 6.0 | 0.75 |
| 16 | 10.4 | 0.65 |
| 32 | 18.4 | 0.575 |

Key Differences:

- Input Handling: vLLM correctly interpreted the batch dimension and leveraged the model's parallel processing capabilities.

- Specialized for LLMs: vLLM is tailored for transformer-based models, offering optimizations that Triton may not provide out of the box (see the sketch below).
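For comparison, batching in vLLM needs roughly the following. This is a minimal sketch; per the P.S. above, T5 is not natively supported, so the checkpoint here is a placeholder decoder-only model rather than Flan-T5.

```python
from vllm import LLM, SamplingParams

# placeholder decoder-only checkpoint; swap in whatever model you actually serve
llm = LLM(model="facebook/opt-1.3b")
params = SamplingParams(temperature=0.0, max_tokens=64)

prompts = [f"Summarize item {i} in one sentence." for i in range(32)]

# the whole list is scheduled and batched internally (continuous batching)
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```

There is no batch-size plumbing to get wrong here, which is a large part of why the numbers above came "for free".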

---

Resolution and Insights

Adjusting Triton Configuration:

After further investigation, I made specific adjustments to the Triton model configuration:

1. Explicitly Defining Input Shapes:

- Updated `config.pbtxt` so the inputs used variable dimensions, with `max_batch_size` supplying the implicit batch dimension.

2. Ensuring Correct Data Formatting:

- Made sure client requests sent tensors whose shapes and data types matched the model configuration.

3. Verifying Model Export:

- Re-exported the PyTorch model, ensuring that the batch dimension was kept dynamic in the TorchScript (a sketch of what I mean is below).
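Since the export code isn't shown above, here is a minimal sketch of that verification step, assuming a TorchScript trace of the model's forward pass (logits only, not full autoregressive generation): trace at one batch size, then confirm the traced module still accepts a different one. The checkpoint name, wrapper class, and fixed padding length are illustrative.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

MODEL = "google/flan-t5-base"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
# torchscript=True makes the model return plain tuples, which tracing prefers
model = T5ForConditionalGeneration.from_pretrained(MODEL, torchscript=True).eval()

class ForwardOnly(torch.nn.Module):
    """A single encoder/decoder forward pass that returns logits."""
    def __init__(self, m):
        super().__init__()
        self.m = m

    def forward(self, input_ids, attention_mask, decoder_input_ids):
        return self.m(input_ids=input_ids,
                      attention_mask=attention_mask,
                      decoder_input_ids=decoder_input_ids)[0]

def make_inputs(texts, max_len=32):
    enc = tokenizer(texts, return_tensors="pt", padding="max_length",
                    max_length=max_len, truncation=True)
    # T5 uses the pad token (id 0) as the decoder start token
    dec = torch.zeros((len(texts), 1), dtype=torch.long)
    return enc["input_ids"], enc["attention_mask"], dec

with torch.no_grad():
    traced = torch.jit.trace(ForwardOnly(model), make_inputs(["hello", "world"]))
    # sanity check: same sequence length, different batch size
    logits = traced(*make_inputs(["a", "b", "c", "d"]))
    print(logits.shape)  # the leading dimension should be 4

traced.save("model.pt")  # this file goes into the Triton model repository
```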

Results After Adjustments:

- Improved Batching Efficiency: Post-adjustment, Triton began to show reduced per-item latency with increased batch sizes.

- Performance Metrics:

| Batch Size | Total Time (s) | Per-Item Latency (s) |
|------------|----------------|----------------------|
| 1 | 1.1 | 1.1 |
| 2 | 1.8 | 0.9 |
| 4 | 3.0 | 0.75 |
| 8 | 5.2 | 0.65 |
| 16 | 9.0 | 0.56 |
| 32 | 16.0 | 0.5 |


Key Takeaways

1. Model Configuration Matters:

- Ensure that the model's input dimensions are correctly defined with dynamic batch sizes.

- Triton relies heavily on accurate model configurations to optimize inference.

2. Data Formatting is Crucial:

- Input data must match the expected shapes and types defined in the model configuration (see the client sketch after this list).

- Mismatches can lead to inefficient processing or errors.

3. Tool Specialization:

- While Triton is versatile, it may require additional configuration for optimal performance with LLMs.

- vLLM offers LLM-specific optimizations, potentially reducing the need for extensive configuration.
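To make the data-formatting point concrete, here is a sketch of a client request shaped to match the example `config.pbtxt` above; the tensor names, shapes, and random token IDs are placeholders, not the exact payload from my deployment.

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

batch, seq_len = 8, 32
input_ids = np.random.randint(0, 32000, size=(batch, seq_len), dtype=np.int64)  # fake token ids
attention_mask = np.ones((batch, seq_len), dtype=np.int64)
decoder_input_ids = np.zeros((batch, 1), dtype=np.int64)

# names, shapes, and dtypes must line up with what config.pbtxt declares
inputs = []
for name, arr in [("INPUT__0", input_ids),
                  ("INPUT__1", attention_mask),
                  ("INPUT__2", decoder_input_ids)]:
    t = httpclient.InferInput(name, list(arr.shape), "INT64")
    t.set_data_from_numpy(arr)
    inputs.append(t)

result = client.infer(model_name="flan_t5", inputs=inputs)
print(result.as_numpy("OUTPUT__0").shape)  # leading dimension should equal the batch size
```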

---

Conclusion

It's imperative that you don't use the first configuration that a Google search or GPT throws at you for your Triton inference; most tutorials use a specific config that is not suited to leveraging batching efficiencies. In contrast, vLLM provides a more straightforward path for LLM deployment with its specialized optimizations.

Final Thoughts:

- Choose Triton Inference Server when:

  - You need to serve multiple model types and require a unified deployment platform.

  - You're willing to invest time in configuring and optimizing the server for your specific models.

- Choose vLLM when:

  - Your primary focus is on LLMs and you want out-of-the-box optimizations.

  - You prefer a tool that's specialized for transformer-based models with minimal configuration overhead.
