Cost Efficiency in IT Enterprises: Leveraging Quantization for Generative AI



Upendra Sharma, Arun Ayachitula


Generative AI models like GPT-4 require powerful GPUs for training and inference. High-end GPUs, such as NVIDIA GH200s, H100s, or A100s, can cost tens of thousands of dollars each, and large-scale deployments may require hundreds or thousands of them, resulting in substantial initial investments [8]. In addition to GPUs, companies might invest in specialized AI hardware such as Google's TPUs (Tensor Processing Units) or custom AI chips designed for specific tasks. These specialized units are optimized for AI workloads and add to the capital expenditure [9].

Electricity and cooling costs for generative AI workloads are also high. Running and cooling AI hardware is energy intensive, and data centers hosting AI workloads consume large amounts of electricity; training a large language model can consume as much electricity as a small town over the course of the training run. The ongoing power costs for operating the servers and cooling the data center are significant [8]. Effective cooling is crucial to prevent overheating and maintain the performance of AI hardware, and advanced solutions like liquid cooling or immersion cooling are often required. While efficient, these are expensive to install and maintain [10].

Why Quantize?

1. Lower Hardware Requirements & Energy Consumption: Quantized models require less computational power, which can significantly reduce the need for expensive high-end hardware during the development and testing phases. Quantized models also consume less energy, lowering operational costs for development and test environments.

2. Prototype Testing: During development, a less precise model is often sufficient to identify major issues or validate concepts. Quantized models are usually adequate for these purposes and help teams iterate quickly over prototypes.

3. Scalability Testing: Quantized models can help test a system's scalability without the high costs associated with running full-precision models. This is useful for assessing performance and identifying bottlenecks before deploying the full model in production.

4. Faster Inference: Quantized models perform inference faster, accelerating the testing cycle and speeding up development iterations. This is particularly useful when running extensive tests or multiple experiments.

LLM Quantization

LLM quantization is a technique that reduces the computational and memory requirements of large language models (LLMs) by representing their weights and activations with lower-precision data types. This is particularly useful for deploying LLMs on hardware with limited resources and for improving the efficiency of inference and training processes on more powerful hardware.

Large Language Models (LLMs) have millions or billions of parameters, typically represented in 32-bit or 16-bit formats. For instance, a model with 1 billion parameters, each stored as a 32-bit floating-point number, would require 4 bytes per parameter, totaling 4GB of memory. Training such a model requires additional memory for gradients, activations, optimizer states, and other temporary variables, which can add about 20 extra bytes per parameter. Consequently, training this model would require approximately 24GB of GPU RAM plus additional memory for data, making it unsuitable for edge devices.

Quantization reduces the precision of weights and activations from the standard 32-bit floating point to lower-precision formats. Common types include float16 and bfloat16, which use the same or higher precision for accumulation, and int16 and int8, which use int32 for accumulation. This decreases the computational requirements and memory usage of models.
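To make the arithmetic above concrete, here is a small back-of-the-envelope helper; the 20-byte training overhead is the rough per-parameter figure quoted above, not an exact accounting.

    def estimate_memory_gb(num_params, bytes_per_param=4, training_overhead_bytes=20):
        """Back-of-the-envelope memory estimate matching the figures above.

        bytes_per_param: 4 for fp32, 2 for fp16/bf16, 1 for int8.
        training_overhead_bytes: rough per-parameter cost of gradients,
        optimizer states, and temporaries (an approximation, not an exact figure).
        """
        inference_gb = num_params * bytes_per_param / 1e9
        training_gb = num_params * (bytes_per_param + training_overhead_bytes) / 1e9
        return inference_gb, training_gb

    # 1B fp32 parameters: ~4 GB just for weights, ~24 GB to train
    print(estimate_memory_gb(1_000_000_000))                          # (4.0, 24.0)
    # The same 1B parameters quantized to int8: ~1 GB for weights
    print(estimate_memory_gb(1_000_000_000, bytes_per_param=1)[0])    # 1.0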


Here are some key points about LLM quantization:

  1. Precision Reduction: Typically, neural network weights and activations are stored as 32-bit floating-point numbers (FP32). Quantization reduces this precision to lower-bit formats, such as 16-bit floating-point (FP16), 8-bit integers (INT8), or even lower. This reduces the memory required to store the model and the computational resources needed for arithmetic operations.
  2. Post-Training Quantization: This approach quantizes a pre-trained model. It usually requires calibration with a representative dataset to ensure that the reduced precision does not significantly degrade the model's performance.
  3. Quantization-Aware Training: This method involves initially training the model with quantization in mind. During training, the model simulates the effects of quantization, allowing it to learn how to maintain performance despite the reduced precision.
  4. Static vs. Dynamic Quantization: In static quantization, weights and activations are quantized before inference begins; this requires a calibration step using a representative dataset. In dynamic quantization, activations are quantized on the fly during inference; this is simpler and needs no calibration, but may not achieve as high an efficiency as static quantization (a minimal dynamic-quantization sketch follows this list).
  5. Mixed-Precision Training: This technique combines different levels of precision within the same model, typically using higher precision for critical parts of the network and lower precision elsewhere. It can strike a balance between performance and resource efficiency.
  6. Hardware Support: Effective quantization often depends on the underlying hardware's support for low-precision arithmetic operations. Modern processors, GPUs, and specialized AI accelerators increasingly offer optimized instructions for low-precision computations.
  7. Trade-offs: While quantization can significantly reduce resource requirements, it may also introduce quantization noise, potentially reducing model accuracy. Careful calibration and quantization-aware training can mitigate these effects, but there is usually a trade-off between efficiency and accuracy.
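As an illustration of dynamic quantization (point 4 above), here is a minimal PyTorch sketch; it quantizes a toy network rather than a full LLM, and the models discussed later in this article are quantized with GGUF and TensorRT-LLM rather than this API.

    import torch
    import torch.nn as nn

    # A toy two-layer block standing in for part of a transformer.
    model = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768)).eval()

    # Dynamic quantization: weights are stored as int8, activations are
    # quantized on the fly at inference time, so no calibration set is needed.
    quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

    x = torch.randn(1, 768)
    with torch.no_grad():
        print(quantized(x).shape)   # torch.Size([1, 768])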

Various calibration techniques are used for both post-training static quantization and quantization-aware training. The min-max technique computes the range between the minimum and maximum observed values, which is suitable for weights. The moving-average min-max technique smooths this range with moving averages, which works better for activations. The histogram approach records a histogram of observed values and derives the range either from a criterion such as entropy or mean squared error (to minimize quantization loss) or from a percentile, so that a specified fraction of the data falls within the range; the exact behavior varies with the quantization type.

Optimum, a toolset from huggingface.co that extends Transformers, provides performance-optimization tools for training and running models efficiently on targeted hardware. Because of their massive size, transformer-based large language models may require multiple performant GPUs, which limits their usability. GPTQ [1] is a post-training quantization approach, but it is primarily focused on GPU inference and performance gains.
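The min-max technique described above can be illustrated with a few lines of NumPy; this is a plain sketch of the idea for asymmetric int8 quantization, not the calibration code used by Optimum or any other toolkit.

    import numpy as np

    def minmax_int8_params(tensor):
        # Min-max calibration: map [min, max] of the observed values onto [-128, 127].
        t_min, t_max = float(tensor.min()), float(tensor.max())
        qmin, qmax = -128, 127
        scale = (t_max - t_min) / (qmax - qmin)
        zero_point = int(round(qmin - t_min / scale))
        return scale, zero_point

    def quantize_int8(tensor, scale, zero_point):
        q = np.round(tensor / scale) + zero_point
        return np.clip(q, -128, 127).astype(np.int8)

    weights = np.random.randn(4, 4).astype(np.float32)
    scale, zp = minmax_int8_params(weights)
    q = quantize_int8(weights, scale, zp)
    reconstructed = (q.astype(np.float32) - zp) * scale   # approximate original weights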

We used GGUF quantization, introduced by the llama.cpp team. GGUF is a quantization format designed for large language models that lets users run LLMs on a CPU while offloading some layers to the GPU for a speed boost. It is particularly useful for running models on CPUs or Apple devices.
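A minimal sketch of running a GGUF model with partial GPU offload through the llama-cpp-python bindings is shown below; the model path and layer count are placeholders, and constructor arguments may differ across versions of the bindings.

    from llama_cpp import Llama  # Python bindings for llama.cpp

    # Load an 8-bit GGUF model, offloading 20 transformer layers to the GPU
    # (set n_gpu_layers=0 for CPU-only inference).
    llm = Llama(
        model_path="./models/llama-2-q8_0.gguf",   # placeholder path
        n_gpu_layers=20,
        n_ctx=2048,
    )

    out = llm("Summarize why quantization lowers inference cost.", max_tokens=64)
    print(out["choices"][0]["text"])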

We ran our queries on two versions of the llama-2-1b model: an fp16 model and an 8-bit GGUF-quantized one. Some of the performance measurements are shown below:

Experimenting with NVIDIA TensorRT-LLM

NVIDIA TensorRT-LLM is an open-source library that accelerates and optimizes the inference performance of the latest large language models (LLMs) on the NVIDIA AI platform. It builds on NVIDIA TensorRT, an SDK for high-performance deep learning inference that includes a deep learning inference optimizer and runtime delivering low latency and high throughput for inference applications [1]. TensorRT-LLM lets developers experiment with new LLMs, offering high performance and quick customization without requiring deep knowledge of C++ or CUDA.
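As a rough sketch of how a TensorRT-LLM engine can be exercised from Python, the snippet below assumes the high-level LLM API that ships with recent TensorRT-LLM releases; the import paths, argument names, and model identifier are assumptions and may differ by version.

    # Assumes the high-level LLM API from recent TensorRT-LLM releases.
    from tensorrt_llm import LLM, SamplingParams

    # Build (or load) an engine from a Hugging Face model; quantization options
    # are configured at engine-build time rather than here.
    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

    params = SamplingParams(max_tokens=64)
    for output in llm.generate(["Why quantize large language models?"], params):
        print(output.outputs[0].text)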

The benefits of quantizing, specifically to 8-bit, on NVIDIA GPUs are many [4]:

  • When processing 8-bit integer data, NVIDIA GPUs employ the faster and cheaper 8-bit Tensor Cores to compute convolution and matrix-multiplication operations. This yields more compute throughput, which is particularly effective on compute-limited layers.
  • Moving data from memory to computing elements (streaming multiprocessors in NVIDIA GPUs) takes time and energy and produces heat. Reducing the precision of activation and parameter data from 32-bit floats to 8-bit integers results in a 4x data reduction, which saves power and reduces the produced heat.
  • Some layers are bandwidth-bound (memory-limited). That means that their implementation spends most of its time reading and writing data, and therefore, reducing their computation time does not reduce their overall runtime. Bandwidth-bound layers benefit most from reduced bandwidth requirements.
  • A reduced memory footprint means the model requires less storage space, smaller parameter updates, higher cache utilization, etc.


We experimented with different quantization approaches and compared them with the unquantized version of the model along three dimensions: i) accuracy, ii) peak memory utilization, and iii) throughput (tokens per second).


For accuracy, we assess common LLM performance and accuracy metrics on the MMLU (Massive Multitask Language Understanding) dataset [6]. For each quantization we run two scripts: one for accuracy (mmlu.py) and one for LLM metrics (benchmark.py). From the benchmark's output, we extract peak memory utilization and throughput (tokens per second).
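For reference, the following sketch shows one simple way to obtain the same two measurements (peak GPU memory and tokens per second) for a Hugging Face model; it is a simplified stand-in for the benchmark.py script mentioned above, not the script we ran, and it assumes a CUDA device is available.

    import time
    import torch

    def measure_generation(model, tokenizer, prompt, max_new_tokens=128):
        # Simple throughput / peak-memory probe for a Hugging Face causal LM.
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        torch.cuda.reset_peak_memory_stats()
        start = time.perf_counter()
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=max_new_tokens)
        elapsed = time.perf_counter() - start
        new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
        peak_gb = torch.cuda.max_memory_allocated() / 1e9
        return new_tokens / elapsed, peak_gb   # tokens/sec, peak GPU memory in GB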

The quantization algorithms adopted are i) AWQ [5], ii) int8_sq, iii) int8_wo, iv) int4_sq, and v) int4_wo.
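To illustrate what a weight-only format such as int8_wo does conceptually, here is a per-channel NumPy sketch; TensorRT-LLM's actual kernels and AWQ's activation-aware scaling are more involved than this.

    import numpy as np

    def weight_only_int8(w):
        # Symmetric per-output-channel quantization: one scale per weight row.
        scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
        q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
        return q, scales

    def dequantize(q, scales):
        # At inference the int8 weights are expanded back to higher precision
        # (or consumed directly by int8 kernels on supporting GPUs).
        return q.astype(np.float32) * scales

    w = np.random.randn(8, 16).astype(np.float32)
    q, s = weight_only_int8(w)
    print(np.abs(w - dequantize(q, s)).max())   # quantization error stays small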

We ran experiments using Llama-3-8B-Instruct and followed the instructions for checkpointing and converting the Hugging Face image to a TensorRT-LLM image.

The process of checkpointing and quantization is outlined here; the actual commands executed are shown in the appendix.


Other Quantization Toolkits

  • TensorFlow Lite
  • PyTorch Quantization
  • ONNX Runtime Quantization
  • Intel Neural Compressor
  • Apache TVM Quantization
  • Hugging Face Optimum


Acknowledgments

Thanks to Tanuj Agarwal and Kim Ng from NVIDIA for the partnership!


Appendix

16-bit quantization


8-bit quantization
4-bit quantization


References:

[1] Frantar, Elias, et al. "GPTQ: Accurate post-training quantization for generative pre-trained transformers." arXiv preprint arXiv:2210.17323 (2022).

[2] https://huggingface.co/blog/gptq-integration#a-gentle-summary-of-the-gptq-paper

[3] https://huggingface.co/docs/optimum/en/concept_guides/quantization

[4] https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/

[5] AWQ: Activation-Aware Weight Quantization for On-Device LLM Compression and Acceleration. https://github.com/mit-han-lab/llm-awq/blob/main/README.md

[6] Hendrycks, Dan, et al. "Measuring Massive Multitask Language Understanding." arXiv preprint arXiv:2009.03300 (2020).

[7] Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation.

[8] Andrew Miller, The hidden costs of generative AI

[9] The State of Generative AI in the Enterprise

[10] New Deloitte survey finds expectations for Gen AI remain high, but many are feeling pressure to quickly realize value while managing risks


Comments

Tony Perez
Technology Advisor, Kyndryl
8 months ago

This article lays the foundation for another quantization value benefit related to AIOps. If you "forward think" what AIOps might become very soon, it will most likely include micro LLMs embedded directly in the server operating system. These will become "agentic" because they can perform routine administrative tasks automatically and have specific fine-tuned knowledge of the server OS they are running on. So, optimizing the LLM to run on CPUs, as discussed here, will become an enabler for higher value levels of AIOps.
GANESH JEE
Cloud Operation || Data ML & AI || Product Management || Process Management || ITIL & SIX Sigma
9 months ago

Insightful.
