Deploy High-Performance Models at Scale With TensorRT and Triton Inference Server
Real-world AI models contain millions of parameters. For example, BERT, a state-of-the-art (SOTA) model for natural language processing (NLP), contains 340 million parameters, and some of these models need to run in a few milliseconds (ms). Running these complex inference requests through trained neural networks on CPUs or in-framework doesn't meet the throughput and latency requirements that modern AI applications demand.
NVIDIA TensorRT and Triton Inference Server can help your business solve this problem and provide the ability to deploy high-performance models with resilience at scale.
Take a look at our top inference sessions coming up at GTC next week. You won’t want to miss out!
Presented by: NVIDIA
Learn how to accelerate PyTorch inference without leaving the framework with Torch-TensorRT. Torch-TensorRT makes the performance of NVIDIA’s TensorRT GPU optimizations available in PyTorch for any model. You'll learn about the key capabilities of Torch-TensorRT, how to use them, and the performance benefits you can expect. We'll walk you through how to easily transition from a trained model to an inference deployment fine-tuned for your specific hardware, all with just a few lines of familiar code.
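As a rough illustration of that workflow, the sketch below compiles a torchvision ResNet-50 with Torch-TensorRT and enables FP16 kernels. The model choice, input shape, and precision settings are illustrative assumptions, and exact API details can vary across Torch-TensorRT releases.

```python
import torch
import torch_tensorrt
import torchvision.models as models

# Start from an ordinary PyTorch model (ResNet-50 here, randomly initialized
# purely to demonstrate the flow).
model = models.resnet50().eval().cuda()

# Compile with Torch-TensorRT: declare the expected input shape and allow
# FP16 kernels alongside FP32.
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224), dtype=torch.float32)],
    enabled_precisions={torch.float32, torch.half},
)

# The optimized module is used exactly like the original PyTorch model.
x = torch.randn(1, 3, 224, 224, device="cuda")
with torch.no_grad():
    print(trt_model(x).shape)  # torch.Size([1, 1000])
```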
Presented by: NVIDIA
TensorRT is an SDK for high-performance deep learning inference, used in production to minimize latency and maximize throughput. The latest generation of TensorRT provides a new compiler that accelerates specific workloads optimized for NVIDIA GPUs. Deep learning compilers need a robust method to import, optimize, and deploy models. We'll show a workflow that accelerates models from frameworks and formats including PyTorch, TensorFlow, and ONNX. New users can learn the standard workflow, while experienced users can pick up tips and tricks to optimize specific use cases.
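For example, the ONNX path of that workflow can look roughly like the following sketch, which parses an exported model and builds an FP16 engine. The file names are placeholders, and the calls reflect the TensorRT 8.x Python API, so details may differ in other releases.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

# Explicit-batch network definition plus an ONNX parser.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# "model.onnx" is a placeholder for a model exported from PyTorch or TensorFlow.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

# Build an engine with FP16 enabled and save it for deployment.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
serialized_engine = builder.build_serialized_network(network, config)

with open("model.plan", "wb") as f:
    f.write(serialized_engine)
```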
Presented by: NVIDIA, Microsoft
Learn how Microsoft and NVIDIA are working together to simplify production deployment of AI models at scale using the Triton Inference Server and maximize inference performance using ONNX Runtime and TensorRT.
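One hedged sketch of that combination: ONNX Runtime can be asked to place a model on its TensorRT execution provider and fall back to CUDA or CPU where needed. The model path and input shape below are placeholders.

```python
import numpy as np
import onnxruntime as ort

# Prefer the TensorRT execution provider, falling back to CUDA or CPU for
# any operators it cannot handle.
session = ort.InferenceSession(
    "model.onnx",  # placeholder path to an exported ONNX model
    providers=[
        "TensorrtExecutionProvider",
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)

# Run a dummy request; the (1, 3, 224, 224) shape is only an example.
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)
```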
Presented by: NVIDIA
NVIDIA Triton is an open-source inference serving software that simplifies the deployment of AI models at scale in production. Deploy deep learning and machine learning models from any framework (TensorFlow, NVIDIA TensorRT, PyTorch, OpenVINO, ONNX Runtime, XGBoost, or custom) on any GPU- or CPU-based infrastructure with Triton. We'll discuss some of the new backends, support for embedded devices, new integrations on the public cloud, model ensembles, and other new features.
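As a minimal sketch of sending a request to a running Triton server over HTTP, assuming a model named resnet50 is already loaded and that the input and output tensor names below match its configuration (both names are hypothetical):

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server exposing its HTTP endpoint on the default port.
client = httpclient.InferenceServerClient(url="localhost:8000")

# The model name and tensor names are placeholders; they must match the
# model's config.pbtxt in your Triton model repository.
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input__0", list(image.shape), "FP32")
infer_input.set_data_from_numpy(image)

response = client.infer(model_name="resnet50", inputs=[infer_input])
print(response.as_numpy("output__0").shape)
```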
Presented by: Artisight, Arm, Arcturus
NVIDIA Triton Inference Server simplifies the deployment of AI models at scale in production. With the extended Arm NN custom backend, we can orchestrate the execution of multiple ML models and enable optimized CPU/GPU/NPU inference configurations on embedded systems, including NVIDIA's Jetson family of devices and Raspberry Pi boards. We'll introduce the Triton Inference Server Arm NN backend architecture and present accelerated embedded use cases enabled by it.
Register Now to take your inference deployments to the next level.