Deploy High-Performance Models at Scale With TensorRT and Triton Inference Server

Real-world AI models contain millions of parameters—for example, BERT, a state-of-the-art (SOTA) model for natural language processing (NLP), contains 340 million parameters—and some of these models need to run in a few milliseconds (ms). Running these complex inference requests through trained neural networks on CPUs, or in-framework, doesn't meet the throughput and latency requirements that modern AI applications demand.

NVIDIA TensorRT and Triton Inference Server can help your business solve this problem and provide the ability to deploy high-performance models with resilience at scale.

Take a look at our top inference sessions coming up at GTC next week. You won’t want to miss out!

Accelerate PyTorch Inference with TensorRT

Presented by: NVIDIA

Learn how to accelerate PyTorch inference without leaving the framework with Torch-TensorRT. Torch-TensorRT makes the performance of NVIDIA’s TensorRT GPU optimizations available in PyTorch for any model. You'll learn about the key capabilities of Torch-TensorRT, how to use them, and the performance benefits you can expect. We'll walk you through how to easily transition from a trained model to an inference deployment fine-tuned for your specific hardware, all with just a few lines of familiar code.
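
As a rough idea of what those "few lines of familiar code" can look like (this sketch is not from the session itself), the snippet below compiles a torchvision ResNet-50, used purely as a stand-in model, with Torch-TensorRT; the input shape and FP16 precision setting are illustrative assumptions:

```python
import torch
import torchvision
import torch_tensorrt  # Torch-TensorRT package

# Stand-in model: any eager or TorchScript PyTorch model works here.
model = torchvision.models.resnet50().eval().cuda()

# Compile with Torch-TensorRT. The input shape and enabled precisions
# below are example values chosen for illustration only.
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.float16},  # allow FP16 TensorRT kernels
)

# Inference looks the same as with the original PyTorch module.
x = torch.randn(1, 3, 224, 224, device="cuda")
with torch.no_grad():
    output = trt_model(x)
print(output.shape)
```

The compiled module is still a PyTorch module, so it drops into existing inference code without changing the surrounding pipeline.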

Accelerate Deep Learning Inference in Production with TensorRT

Presented by: NVIDIA

TensorRT is an SDK for high-performance deep learning inference, used in production to minimize latency and maximize throughput. The latest generation of TensorRT provides a new compiler to accelerate specific workloads optimized for NVIDIA GPUs. Deep learning compilers need a robust method to import, optimize, and deploy models. We'll show a workflow to accelerate models coming from PyTorch, TensorFlow, and ONNX. New users can learn the standard workflow, while experienced users can pick up tips and tricks to optimize specific use cases.
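
To make the ONNX import path concrete, here is a minimal sketch using the TensorRT Python API; "model.onnx" and "model.plan" are placeholder file names, and the explicit-batch flag follows TensorRT 8.x conventions rather than anything prescribed by the session:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# Explicit-batch network definition (standard for ONNX models).
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Parse an ONNX model exported from PyTorch, TensorFlow, etc.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # opt in to FP16 where supported

# Build and serialize the optimized engine for deployment.
serialized_engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(serialized_engine)
```

The resulting .plan file can then be loaded by a TensorRT runtime or dropped into a Triton model repository.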

Deploy AI Models at Scale Using the Triton Inference Server and ONNX Runtime and Maximize Performance with TensorRT

Presented by: NVIDIA, Microsoft

Learn how Microsoft and NVIDIA are working together to simplify production deployment of AI models at scale using the Triton Inference Server and maximize inference performance using ONNX Runtime and TensorRT.
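
The abstract doesn't include code, but one common pattern in this space, ONNX Runtime delegating work to TensorRT through an execution provider, can be sketched roughly as follows; the model path and the input tensor name "input" are assumptions for illustration:

```python
import numpy as np
import onnxruntime as ort

# Prefer the TensorRT execution provider, falling back to CUDA and then
# CPU for any operators TensorRT cannot handle.
session = ort.InferenceSession(
    "model.onnx",  # placeholder path to an exported ONNX model
    providers=[
        "TensorrtExecutionProvider",
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)

x = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {"input": x})  # "input" is an assumed tensor name
print(outputs[0].shape)
```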

Maximize AI Inference Serving Performance with NVIDIA Triton Inference Server

Presented by: NVIDIA

NVIDIA Triton is open-source inference serving software that simplifies the deployment of AI models at scale in production. With Triton, you can deploy deep learning and machine learning models from any framework (TensorFlow, NVIDIA TensorRT, PyTorch, OpenVINO, ONNX Runtime, XGBoost, or custom) on any GPU- or CPU-based infrastructure. We'll discuss some of the new backends, support for embedded devices, new integrations on the public cloud, model ensembles, and other new features.
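
To give a feel for what calling a Triton deployment looks like from the client side (a sketch, not material from the session), the snippet below uses the tritonclient HTTP API; the server address, model name, and tensor names are placeholders for whatever your model repository actually defines:

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server assumed to be listening on its default HTTP port.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the request; "resnet50", "input", and "output" are placeholder names.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input", list(x.shape), "FP32")
infer_input.set_data_from_numpy(x)

result = client.infer(
    model_name="resnet50",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("output")],
)
print(result.as_numpy("output").shape)
```

The same request could be sent over gRPC with tritonclient.grpc; the server side only needs the model files and a config.pbtxt in its model repository.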

Scalable, Accelerated, Hardware-Agnostic ML Inference with NVIDIA Triton and Arm NN

Presented by: Artisight, Arm, Arcturus

NVIDIA Triton Inference Server simplifies the deployment of AI models at scale in production. With the extended Arm NN custom backend, we can orchestrate the execution of multiple ML models and enable optimized CPU/GPU/NPU inference configurations on embedded systems, including NVIDIA's Jetson family of devices and the Raspberry Pi. We'll introduce the Triton Inference Server Arm NN backend architecture and present accelerated embedded use cases enabled with it.

Register Now to take your inference deployments to the next level.
