Cost Efficiency in IT Enterprises: Leveraging Quantization for Generative AI



Upendra Sharma, Arun Ayachitula


Generative AI models like GPT-4 require powerful GPUs for training and inference. High-end GPUs, such as NVIDIA GH200s, H100s, or A100s, can cost tens of thousands of dollars each, and large-scale deployments may require hundreds or thousands of them, resulting in substantial initial investments [8]. In addition to GPUs, companies might invest in specialized AI hardware such as Google's TPUs (Tensor Processing Units) or custom AI chips designed for specific tasks. These specialized units are optimized for AI workloads and add to the capital expenditure [9].

Electricity and cooling costs for generative AI workloads are also high. Running and cooling AI hardware is energy intensive, and data centers hosting AI workloads consume large amounts of electricity; training a large language model can consume as much electricity as a small town over the course of the training run. The ongoing power costs for operating the servers and cooling the data center are significant [8]. Effective cooling is crucial to prevent overheating and maintain the performance of AI hardware, and advanced solutions like liquid cooling or immersion cooling are often required. While efficient, these are expensive to install and maintain [10].

Why Quantize?

1. Lower Hardware Requirements & Energy Consumption: Quantized models require less computational power, which can significantly reduce the need for expensive high-end hardware during the development and testing phases. Quantized models also consume less energy, lowering operational costs for development and test environments.

2. Prototype Testing: During development, a less precise model is often sufficient to identify major issues or validate concepts. Quantized models are usually adequate for these purposes and help teams iterate quickly over prototypes.

3. Scalability Testing: Quantized models can help test a system's scalability without the high costs associated with running full-precision models. This is useful for assessing performance and identifying bottlenecks before deploying the full model in production.

4. Faster Inference: Quantized models perform inference faster, accelerating the testing cycle and speeding up development iterations. This is particularly useful when running extensive tests or multiple experiments.

LLM Quantization

LLM quantization is a technique that reduces the computational and memory requirements of large language models (LLMs) by representing their weights and activations with lower-precision data types. This is particularly useful for deploying LLMs on hardware with limited resources and for improving the efficiency of inference and training processes on more powerful hardware.

Large Language Models (LLMs) have millions or billions of parameters, typically represented in 32-bit or 16-bit formats. For instance, a model with 1 billion parameters, each stored as a 32-bit floating-point number, would require 4 bytes per parameter, totaling 4GB of memory. Training such a model requires additional memory for gradients, activations, optimizer states, and other temporary variables, which can add about 20 extra bytes per parameter. Consequently, training this model would require approximately 24GB of GPU RAM plus additional memory for data, making it unsuitable for edge devices.

Quantization reduces the precision of weights and activations from the standard 32-bit floating point to lower-precision formats. Common types include float16 and bfloat16, which use the same or higher precision for accumulation, and int16 and int8, which use int32 for accumulation. This decreases the computational requirements and memory usage of models.
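To make the arithmetic above concrete, here is a small back-of-the-envelope helper; the 20-byte training overhead is the rough per-parameter figure quoted above, not an exact accounting.

    def estimate_memory_gb(num_params, bytes_per_param=4, training_overhead_bytes=20):
        """Back-of-the-envelope memory estimate matching the figures above.

        bytes_per_param: 4 for fp32, 2 for fp16/bf16, 1 for int8.
        training_overhead_bytes: rough per-parameter cost of gradients,
        optimizer states, and temporaries (an approximation, not an exact figure).
        """
        inference_gb = num_params * bytes_per_param / 1e9
        training_gb = num_params * (bytes_per_param + training_overhead_bytes) / 1e9
        return inference_gb, training_gb

    # 1B fp32 parameters: ~4 GB just for weights, ~24 GB to train
    print(estimate_memory_gb(1_000_000_000))                          # (4.0, 24.0)
    # The same 1B parameters quantized to int8: ~1 GB for weights
    print(estimate_memory_gb(1_000_000_000, bytes_per_param=1)[0])    # 1.0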


Here are some key points about LLM quantization:

  1. Precision Reduction: Typically, neural network weights and activations are stored as 32-bit floating-point numbers (FP32). Quantization reduces this precision to lower-bit formats, such as 16-bit floating-point (FP16), 8-bit integers (INT8), or even lower. This reduces the memory required to store the model and the computational resources needed for arithmetic operations.
  2. Post-Training Quantization: This approach quantizes a pre-trained model. It usually requires calibration with a representative dataset to ensure that the reduced precision does not significantly degrade the model's performance.
  3. Quantization-Aware Training: This method involves initially training the model with quantization in mind. During training, the model simulates the effects of quantization, allowing it to learn how to maintain performance despite the reduced precision.
  4. Static vs. Dynamic Quantization: In static quantization, weights and activations are quantized before inference begins; this requires a calibration step using a representative dataset. In dynamic quantization, activations are quantized on the fly during inference; this is simpler and needs no calibration, but may not achieve as high an efficiency as static quantization (a minimal dynamic-quantization sketch follows this list).
  5. Mixed-Precision Training: This technique combines different levels of precision within the same model, typically using higher precision for critical parts of the network and lower precision elsewhere. It can strike a balance between performance and resource efficiency.
  6. Hardware Support: Effective quantization often depends on the underlying hardware's support for low-precision arithmetic operations. Modern processors, GPUs, and specialized AI accelerators increasingly offer optimized instructions for low-precision computations.
  7. Trade-offs: While quantization can significantly reduce resource requirements, it may also introduce quantization noise, potentially reducing model accuracy. Careful calibration and quantization-aware training can mitigate these effects, but there is usually a trade-off between efficiency and accuracy.
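As an illustration of dynamic quantization (point 4 above), here is a minimal PyTorch sketch; it quantizes a toy network rather than a full LLM, and the models discussed later in this article are quantized with GGUF and TensorRT-LLM rather than this API.

    import torch
    import torch.nn as nn

    # A toy two-layer block standing in for part of a transformer.
    model = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768)).eval()

    # Dynamic quantization: weights are stored as int8, activations are
    # quantized on the fly at inference time, so no calibration set is needed.
    quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

    x = torch.randn(1, 768)
    with torch.no_grad():
        print(quantized(x).shape)   # torch.Size([1, 768])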

Various calibration techniques are used for both post-training static quantization and quantization-aware training. The min-max technique computes the range between the minimum and maximum observed values, which is suitable for weights. The moving-average min-max technique smooths this range with moving averages, which works better for activations. The histogram approach records a histogram of observed values and derives the range either from a criterion such as entropy or mean squared error (to minimize quantization loss) or from a percentile, so that a specified fraction of the data falls within the range; the exact behavior varies with the quantization type.

Optimum, a toolset from huggingface.co that extends Transformers, provides performance-optimization tools for training and running models efficiently on targeted hardware. Because of their massive size, transformer-based large language models may require multiple performant GPUs, which limits their usability. GPTQ [1] is a post-training quantization approach, but it is primarily focused on GPU inference and performance gains.
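The min-max technique described above can be illustrated with a few lines of NumPy; this is a plain sketch of the idea for asymmetric int8 quantization, not the calibration code used by Optimum or any other toolkit.

    import numpy as np

    def minmax_int8_params(tensor):
        # Min-max calibration: map [min, max] of the observed values onto [-128, 127].
        t_min, t_max = float(tensor.min()), float(tensor.max())
        qmin, qmax = -128, 127
        scale = (t_max - t_min) / (qmax - qmin)
        zero_point = int(round(qmin - t_min / scale))
        return scale, zero_point

    def quantize_int8(tensor, scale, zero_point):
        q = np.round(tensor / scale) + zero_point
        return np.clip(q, -128, 127).astype(np.int8)

    weights = np.random.randn(4, 4).astype(np.float32)
    scale, zp = minmax_int8_params(weights)
    q = quantize_int8(weights, scale, zp)
    reconstructed = (q.astype(np.float32) - zp) * scale   # approximate original weights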

We used GGUF quantization, introduced by the llama.cpp team. GGUF is a quantization format designed for large language models that lets users run LLMs on a CPU while offloading some layers to the GPU for a speed boost. It is particularly useful for running models on CPUs or Apple devices.
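A minimal sketch of running a GGUF model with partial GPU offload through the llama-cpp-python bindings is shown below; the model path and layer count are placeholders, and constructor arguments may differ across versions of the bindings.

    from llama_cpp import Llama  # Python bindings for llama.cpp

    # Load an 8-bit GGUF model, offloading 20 transformer layers to the GPU
    # (set n_gpu_layers=0 for CPU-only inference).
    llm = Llama(
        model_path="./models/llama-2-q8_0.gguf",   # placeholder path
        n_gpu_layers=20,
        n_ctx=2048,
    )

    out = llm("Summarize why quantization lowers inference cost.", max_tokens=64)
    print(out["choices"][0]["text"])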

We ran our queries on two versions of the llama-2-1b model: an fp16 model and an 8-bit GGUF-quantized one. Some of the performance measurements are shown below:

Experimenting with NVIDIA TensorRT-LLM

NVIDIA TensorRT-LLM is an open-source library that accelerates and optimizes the inference performance of the latest large language models (LLMs) on the NVIDIA AI platform. It builds on NVIDIA TensorRT, an SDK for high-performance deep learning inference that includes a deep learning inference optimizer and runtime delivering low latency and high throughput for inference applications [1]. TensorRT-LLM lets developers experiment with new LLMs, offering high performance and quick customization without requiring deep knowledge of C++ or CUDA.
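As a rough sketch of how a TensorRT-LLM engine can be exercised from Python, the snippet below assumes the high-level LLM API that ships with recent TensorRT-LLM releases; the import paths, argument names, and model identifier are assumptions and may differ by version.

    # Assumes the high-level LLM API from recent TensorRT-LLM releases.
    from tensorrt_llm import LLM, SamplingParams

    # Build (or load) an engine from a Hugging Face model; quantization options
    # are configured at engine-build time rather than here.
    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

    params = SamplingParams(max_tokens=64)
    for output in llm.generate(["Why quantize large language models?"], params):
        print(output.outputs[0].text)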

The benefits of quantizing, specifically to 8-bit, on NVIDIA GPUs are many [4]:

  • When processing 8-bit integer data, NVIDIA GPUs employ the faster and cheaper 8-bit Tensor Cores to compute convolution and matrix-multiplication operations. This yields more compute throughput, which is particularly effective on compute-limited layers.
  • Moving data from memory to computing elements (streaming multiprocessors in NVIDIA GPUs) takes time and energy and produces heat. Reducing the precision of activation and parameter data from 32-bit floats to 8-bit integers results in a 4x data reduction, which saves power and reduces the produced heat.
  • Some layers are bandwidth-bound (memory-limited). That means that their implementation spends most of its time reading and writing data, and therefore, reducing their computation time does not reduce their overall runtime. Bandwidth-bound layers benefit most from reduced bandwidth requirements.
  • A reduced memory footprint means the model requires less storage space, smaller parameter updates, higher cache utilization, etc.


We experimented with different quantization approaches and compared them with the unquantized version of the model along three dimensions: i) accuracy, ii) peak memory utilization, and iii) throughput (tokens per second).


For accuracy, we assess common LLM performance and accuracy metrics on the MMLU (Massive Multitask Language Understanding) dataset [6]. For each quantization we run two scripts: one for accuracy (mmlu.py) and one for LLM metrics (benchmark.py). From the benchmark's output, we extract peak memory utilization and throughput (tokens per second).
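For reference, the following sketch shows one simple way to obtain the same two measurements (peak GPU memory and tokens per second) for a Hugging Face model; it is a simplified stand-in for the benchmark.py script mentioned above, not the script we ran, and it assumes a CUDA device is available.

    import time
    import torch

    def measure_generation(model, tokenizer, prompt, max_new_tokens=128):
        # Simple throughput / peak-memory probe for a Hugging Face causal LM.
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        torch.cuda.reset_peak_memory_stats()
        start = time.perf_counter()
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=max_new_tokens)
        elapsed = time.perf_counter() - start
        new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
        peak_gb = torch.cuda.max_memory_allocated() / 1e9
        return new_tokens / elapsed, peak_gb   # tokens/sec, peak GPU memory in GB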

The quantization algorithms adopted are i) AWQ [5], ii) int8_sq, iii) int8_wo, iv) int4_sq, and v) int4_wo.
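To illustrate what a weight-only format such as int8_wo does conceptually, here is a per-channel NumPy sketch; TensorRT-LLM's actual kernels and AWQ's activation-aware scaling are more involved than this.

    import numpy as np

    def weight_only_int8(w):
        # Symmetric per-output-channel quantization: one scale per weight row.
        scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
        q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
        return q, scales

    def dequantize(q, scales):
        # At inference the int8 weights are expanded back to higher precision
        # (or consumed directly by int8 kernels on supporting GPUs).
        return q.astype(np.float32) * scales

    w = np.random.randn(8, 16).astype(np.float32)
    q, s = weight_only_int8(w)
    print(np.abs(w - dequantize(q, s)).max())   # quantization error stays small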

We ran experiments using Llama-3-8B-Instruct and followed the instructions for checkpointing and converting the Hugging Face image to a TensorRT-LLM image.

The process of checkpointing and quantization is outlined here; the actual commands executed are shown in the appendix.


Other Quantization Toolkits

  • TensorFlow Lite
  • PyTorch Quantization
  • ONNX Runtime Quantization
  • Intel Neural Compressor
  • Apache TVM Quantization
  • Hugging Face Optimum


Acknowledgments

Thanks to Tanuj Agarwal and Kim Ng from NVIDIA for the partnership!


Appendix

16-bit quantization


8-bit quantization
4-bit quantization


References:

[1] Frantar, Elias, et al. "GPTQ: Accurate post-training quantization for generative pre-trained transformers." arXiv preprint arXiv:2210.17323 (2022).

[2] https://huggingface.co/blog/gptq-integration#a-gentle-summary-of-the-gptq-paper

[3] https://huggingface.co/docs/optimum/en/concept_guides/quantization

[4] https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/

[5] AWQ: Activation-Aware Weight Quantization for On-Device LLM Compression and Acceleration. https://github.com/mit-han-lab/llm-awq/blob/main/README.md

[6] Hendrycks, Dan, et al. "Measuring Massive Multitask Language Understanding." arXiv preprint arXiv:2009.03300 (2020).

[7] Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation.

[8] Andrew Miller, The hidden costs of generative AI

[9] The State of Generative AI in the Enterprise

[10] New Deloitte survey finds expectations for Gen AI remain high, but many are feeling pressure to quickly realize value while managing risks


Comments

Tony Perez
Technology Advisor, Kyndryl
8 months ago

This article lays the foundation for another quantization value benefit related to AIOps. If you "forward think" what AIOps might become very soon, it will most likely include micro LLMs embedded directly in the server operating system. These will become "agentic" because they can perform routine administrative tasks automatically and have specific fine-tuned knowledge of the server OS they are running on. So, optimizing the LLM to run on CPUs, as discussed here, will become an enabler for higher value levels of AIOps.
GANESH JEE
Cloud Operation || Data ML & AI || Product Management || Process Management || ITIL & SIX Sigma
9 months ago

Insightful.
