PTQ and QAT: Best Practices for Performing Hybrid and Selective Quantization

While quantization can address poor runtime performance, memory and model size constraints, and hardware limitations, it comes with its own challenges. Some architectures are difficult to quantize, accuracy can degrade, and calibration, a required step in INT8 quantization, introduces complications of its own.
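
To make the calibration step concrete, here is a minimal PTQ sketch in eager-mode PyTorch; the toy model and batch count are illustrative assumptions. Calibration runs representative data through the prepared model so observers can record activation ranges, which are then used to pick INT8 scales and zero-points.

```python
import torch
import torch.nn as nn

# Toy model (illustrative): QuantStub/DeQuantStub mark where tensors
# enter and leave the quantized region of the graph.
model = nn.Sequential(
    torch.quantization.QuantStub(),
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    torch.quantization.DeQuantStub(),
)
model.eval()

# Attach a default INT8 config and insert observers.
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
prepared = torch.quantization.prepare(model)

# Calibration: run representative batches so observers record activation ranges.
with torch.no_grad():
    for _ in range(8):  # a handful of representative batches is typical
        prepared(torch.randn(4, 3, 32, 32))

# Convert to an actual INT8 model using the calibrated ranges.
quantized = torch.quantization.convert(prepared)
```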

This is where hybrid and selective quantization come in. Unlike naïve quantization, which applies the same quantization method to every layer of the network, hybrid and selective quantization add extra steps that deliver better speed without the usual hit to accuracy. You can apply both approaches during post-training quantization (PTQ) and quantization-aware training (QAT).
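
As a rough illustration of the selective idea, in eager-mode PyTorch you can exclude accuracy-sensitive layers from quantization simply by clearing their qconfig before calibration. The model below and its "classifier" submodule are hypothetical stand-ins for whichever layers prove sensitive in practice.

```python
import torch
import torch.nn as nn

# Hypothetical model with a backbone and a quantization-sensitive classifier head.
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.dequant = torch.quantization.DeQuantStub()
        self.classifier = nn.Linear(16 * 32 * 32, 10)  # kept in FP32 below

    def forward(self, x):
        x = self.quant(x)
        x = self.backbone(x)
        x = self.dequant(x)  # leave the quantized region before the head
        return self.classifier(x.flatten(1))

model = Net().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
model.classifier.qconfig = None  # selective: skip the sensitive layer
prepared = torch.quantization.prepare(model)
# ... calibrate as usual, then torch.quantization.convert(prepared)
```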

PTQ is a quantization technique in which the model is quantized after it has been trained. QAT then fine-tunes the PTQ model, training it further with quantization in mind.
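
Here is a minimal QAT sketch along the same lines; `model`, `train_loader`, and `loss_fn` are assumed to exist. Fake-quantization ops are inserted so the fine-tuning loop sees INT8 rounding effects and the weights learn to compensate.

```python
import torch

# QAT sketch: assumes `model`, `train_loader`, and `loss_fn` are defined elsewhere.
model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
prepared = torch.quantization.prepare_qat(model)  # inserts fake-quant ops

# Short fine-tuning pass; QAT typically needs far fewer epochs than training
# from scratch, since it only adapts the weights to quantization noise.
optimizer = torch.optim.SGD(prepared.parameters(), lr=1e-4)
for images, targets in train_loader:
    optimizer.zero_grad()
    loss = loss_fn(prepared(images), targets)
    loss.backward()
    optimizer.step()

prepared.eval()
quantized = torch.quantization.convert(prepared)  # final INT8 model
```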

Here are rules of thumb for applying hybrid and selective quantization during PTQ and QAT:

[Image: table of rules of thumb for applying hybrid and selective quantization during PTQ and QAT]

Now consider the results on the STDC semantic segmentation model on Pascal VOC. Throughput, model size, and latency are close across naïve quantization, selective PTQ, and selective QAT. But look at the accuracy: with selective PTQ there was only a very small decrease, and with selective QAT the accuracy actually improved.

[Image: benchmark results comparing naïve quantization, selective PTQ, and selective QAT on the STDC model, Pascal VOC]

You can easily do hybrid and selective PTQ and QAT using SuperGradients, our open-source library for training PyTorch-based computer vision models.
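
For a sense of what this looks like, here is a rough sketch using SuperGradients' SelectiveQuantizer utility. The import path, model name, and argument names below are assumptions based on the library's quantization docs and may differ across versions, so treat this as a starting point and check the documentation for the exact API.

```python
# Assumed API: SelectiveQuantizer from SuperGradients' quantization utilities.
# Import path, model name, and arguments are assumptions; verify against the docs.
from super_gradients.training import models
from super_gradients.training.utils.quantization.selective_quantization_utils import SelectiveQuantizer

model = models.get("stdc1_seg50", pretrained_weights="cityscapes")  # hypothetical example

quantizer = SelectiveQuantizer(
    default_per_channel_quant_weights=True,  # per-channel INT8 weights
    default_learn_amax=False,                # static activation ranges
)
quantizer.quantize_module(model)  # replaces supported layers with quantized versions
# ... calibrate or fine-tune, then export and benchmark as usual
```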

To take a deeper dive into quantization and learn how you can improve your model’s speed without reducing its accuracy, watch the webinar or read the ultimate guide.


Get ahead with the latest deep learning content

  1. GPT-4 is here! It adds the ability to process image inputs and much longer text.
  2. Microsoft introduces Visual ChatGPT. It incorporates different visual foundation models so users can interact with ChatGPT using images.
  3. Train to 94% on CIFAR-10 in under 10 seconds on a single A100, the current world record, or to ~95.77% in ~188 seconds.
  4. A playbook for systematically maximizing the performance of deep learning models. Put differently: a long list of interrelated hyperparameters whose tuning has to be a continuous process.
  5. Generative AI drives demand for performance tools. Along with the growth of chatbots and LLMs comes the need to improve the performance of these large and complex AI models.
  6. A new open-source version of ChatGPT. OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and can retrieve information dynamically. Here's another one called OpenChatKit.


Catch Deci at NVIDIA GTC 2023

Join our CEO & Co-Founder, Yonatan Geifman, at his GTC session, “How to Accelerate NLP Performance on GPU with Neural Architecture Search.” He'll take a deep dive into NLP inference performance optimization, covering the main challenges to address and the tools and best practices to adopt to achieve the best possible results without sacrificing the model's accuracy. Register here.


Can you solve this riddle? Comment your answer below.

[Image: this month's riddle]

Don't forget to catch next month's newsletter to get the answer to the riddle.


Enjoyed these deep learning tips? Help us make our newsletter bigger and better by sharing it with your colleagues and friends!
