Deploy High-Performance Models at Scale With TensorRT and Triton Inference Server

Real-world AI models contain millions of parameters—for example, BERT, a state-of-the-art (SOTA) model for natural language processing (NLP), contains 340 million parameters—and some of these models need to run in a few milliseconds (ms). Running these complex inference requests through trained neural networks on CPUs, or in-framework, doesn't meet the throughput and latency requirements that modern AI applications demand.

NVIDIA TensorRT and Triton Inference Server can help your business solve this problem and provide the ability to deploy high-performance models with resilience at scale.

Take a look at our top inference sessions coming up at GTC next week. You won’t want to miss out!

Accelerate PyTorch Inference with TensorRT

Presented by: NVIDIA

Learn how to accelerate PyTorch inference without leaving the framework with Torch-TensorRT. Torch-TensorRT makes the performance of NVIDIA’s TensorRT GPU optimizations available in PyTorch for any model. You'll learn about the key capabilities of Torch-TensorRT, how to use them, and the performance benefits you can expect. We'll walk you through how to easily transition from a trained model to an inference deployment fine-tuned for your specific hardware, all with just a few lines of familiar code.
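
As a rough idea of what those "few lines of familiar code" can look like (this sketch is not from the session itself), the snippet below compiles a torchvision ResNet-50, used purely as a stand-in model, with Torch-TensorRT; the input shape and FP16 precision setting are illustrative assumptions:

```python
import torch
import torchvision
import torch_tensorrt  # Torch-TensorRT package

# Stand-in model: any eager or TorchScript PyTorch model works here.
model = torchvision.models.resnet50().eval().cuda()

# Compile with Torch-TensorRT. The input shape and enabled precisions
# below are example values chosen for illustration only.
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.float16},  # allow FP16 TensorRT kernels
)

# Inference looks the same as with the original PyTorch module.
x = torch.randn(1, 3, 224, 224, device="cuda")
with torch.no_grad():
    output = trt_model(x)
print(output.shape)
```

The compiled module is still a PyTorch module, so it drops into existing inference code without changing the surrounding pipeline.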

Accelerate Deep Learning Inference in Production with TensorRT

Presented by: NVIDIA

TensorRT is an SDK for high-performance deep learning inference, used in production to minimize latency and maximize throughput. The latest generation of TensorRT provides a new compiler to accelerate specific workloads optimized for NVIDIA GPUs. Deep learning compilers need a robust method to import, optimize, and deploy models. We'll show a workflow to accelerate models coming from PyTorch, TensorFlow, and ONNX. New users can learn the standard workflow, while experienced users can pick up tips and tricks to optimize specific use cases.
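
To make the ONNX import path concrete, here is a minimal sketch using the TensorRT Python API; "model.onnx" and "model.plan" are placeholder file names, and the explicit-batch flag follows TensorRT 8.x conventions rather than anything prescribed by the session:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# Explicit-batch network definition (standard for ONNX models).
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Parse an ONNX model exported from PyTorch, TensorFlow, etc.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # opt in to FP16 where supported

# Build and serialize the optimized engine for deployment.
serialized_engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(serialized_engine)
```

The resulting .plan file can then be loaded by a TensorRT runtime or dropped into a Triton model repository.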

Deploy AI Models at Scale Using the Triton Inference Server and ONNX Runtime and Maximize Performance with TensorRT

Presented by: NVIDIA, Microsoft

Learn how Microsoft and NVIDIA are working together to simplify production deployment of AI models at scale using the Triton Inference Server and maximize inference performance using ONNX Runtime and TensorRT.
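
The abstract doesn't include code, but one common pattern in this space, ONNX Runtime delegating work to TensorRT through an execution provider, can be sketched roughly as follows; the model path and the input tensor name "input" are assumptions for illustration:

```python
import numpy as np
import onnxruntime as ort

# Prefer the TensorRT execution provider, falling back to CUDA and then
# CPU for any operators TensorRT cannot handle.
session = ort.InferenceSession(
    "model.onnx",  # placeholder path to an exported ONNX model
    providers=[
        "TensorrtExecutionProvider",
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)

x = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {"input": x})  # "input" is an assumed tensor name
print(outputs[0].shape)
```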

Maximize AI Inference Serving Performance with NVIDIA Triton Inference Server

Presented by: NVIDIA

NVIDIA Triton is open-source inference serving software that simplifies the deployment of AI models at scale in production. With Triton, you can deploy deep learning and machine learning models from any framework (TensorFlow, NVIDIA TensorRT, PyTorch, OpenVINO, ONNX Runtime, XGBoost, or custom) on any GPU- or CPU-based infrastructure. We'll discuss some of the new backends, support for embedded devices, new integrations on the public cloud, model ensembles, and other new features.
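
To give a feel for what calling a Triton deployment looks like from the client side (a sketch, not material from the session), the snippet below uses the tritonclient HTTP API; the server address, model name, and tensor names are placeholders for whatever your model repository actually defines:

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server assumed to be listening on its default HTTP port.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the request; "resnet50", "input", and "output" are placeholder names.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input", list(x.shape), "FP32")
infer_input.set_data_from_numpy(x)

result = client.infer(
    model_name="resnet50",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("output")],
)
print(result.as_numpy("output").shape)
```

The same request could be sent over gRPC with tritonclient.grpc; the server side only needs the model files and a config.pbtxt in its model repository.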

Scalable, Accelerated, Hardware-Agnostic ML Inference with NVIDIA Triton and Arm NN

Presented by: Artisight, Arm, Arcturus

NVIDIA Triton Inference Server simplifies the deployment of AI models at scale in production. With the extended Arm NN custom backend, we can orchestrate the execution of multiple ML models and enable optimized CPU/GPU/NPU inference configurations on embedded systems, including NVIDIA's Jetson family of devices and the Raspberry Pi. We'll introduce the Triton Inference Server Arm NN backend architecture and present accelerated embedded use cases enabled with it.

Register Now to take your inference deployments to the next level.
