Cost Efficiency in IT Enterprises: Leveraging Quantization for Generative AI
Naga (Arun) Ayachitula
Vice President, AIOps Engineering (Data/Analytics & AI/ML) and Distinguished Engineer
Upendra Sharma, Arun Ayachitula
Generative AI models like GPT-4 require powerful GPUs for training and inference. High-end GPUs, such as NVIDIA GH200s, H100s, or A100s, can cost tens of thousands of dollars each, and large-scale deployments may require hundreds or thousands of them, resulting in substantial initial investments [8]. In addition to GPUs, companies may invest in specialized AI hardware such as Google's TPUs (Tensor Processing Units) or custom AI chips designed for specific tasks. These specialized units are optimized for AI workloads and add to the capital expenditure [9].

Electricity and cooling costs for generative AI workloads are also high. Running and cooling AI hardware is energy intensive, and data centers hosting AI workloads consume large amounts of electricity; training a large language model can consume as much electricity as a small town does over the course of the training run. The ongoing power costs of operating the servers and cooling the data center are significant [8]. Effective cooling systems are crucial to prevent overheating and maintain the performance of AI hardware, and the advanced solutions often required, such as liquid cooling or immersion cooling, are efficient but expensive to install and maintain [10].
Why Quantize?
1. Lower Hardware Requirements & Energy Consumption: Quantized models require less computational power, which can significantly reduce the need for expensive high-end hardware during the development and testing phases. They also consume less energy, lowering operational costs for development and test environments.
2. Prototype Testing: During development, a less precise model is often sufficient to identify major issues or validate concepts. Quantized models are usually adequate for these purposes and help teams iterate quickly over prototypes.
3. Scalability Testing: Quantized models can help test a system's scalability without the high costs associated with running full-precision models. This is useful for assessing performance and identifying bottlenecks before deploying the full model in production.
4. Faster Inference: Quantized models can perform faster inference, accelerating the testing cycle and speeding up development iterations. This is particularly useful when running extensive tests or multiple experiments.
LLM Quantization
LLM quantization is a technique that reduces the computational and memory requirements of large language models (LLMs) by representing their weights and activations with lower-precision data types. This is particularly useful for deploying LLMs on hardware with limited resources and for improving the efficiency of inference and training processes on more powerful hardware.
Large Language Models (LLMs) have millions or billions of parameters, typically represented using 32-bit or 16-bit formats. For instance, a model with 1 billion parameters, each stored as a 32-bit floating point number, would require 4 bytes per parameter, totaling 4GB of memory. To train such a model, additional memory is needed for gradients, activations, optimizer states, and other temporary variables, which can add about 20 extra bytes per parameter. Consequently, training this model would require approximately 24GB of GPU RAM plus additional memory for data, making it unsuitable for edge devices. Quantization is a technique that reduces the precision of weights and activations in machine learning models from the standard 32-bit floating-point to lower precision formats. Common types include float16 and bfloat16, which use the same or higher precision for accumulation, and int16 and int8, which use int32 for accumulation. This approach helps to decrease the computational requirements and memory usage of models.
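To make the arithmetic above concrete, here is a minimal back-of-the-envelope sketch in plain Python. The byte counts and the ~20-byte training overhead are the illustrative figures from the paragraph above, not measurements of any particular runtime:

```python
# Back-of-the-envelope memory estimate for a 1B-parameter model at
# different precisions (illustrative constants only).
PARAMS = 1_000_000_000

BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16/bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}

# Rough extra cost per parameter during training (gradients, activations,
# optimizer states); ~20 bytes is the figure used in the text above.
TRAINING_OVERHEAD_BYTES = 20.0

for dtype, b in BYTES_PER_PARAM.items():
    print(f"{dtype:>10}: ~{PARAMS * b / 1e9:.1f} GB for weights alone")

train_gb = PARAMS * (BYTES_PER_PARAM["fp32"] + TRAINING_OVERHEAD_BYTES) / 1e9
print(f"fp32 training: ~{train_gb:.0f} GB (weights + gradient/optimizer state)")
```

Running this reproduces the numbers quoted above: roughly 4GB to hold the fp32 weights, about 24GB to train, and a 4x reduction in weight storage when moving from fp32 to int8.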
Here are some key points about LLM quantization:

Various calibration techniques are used for both post-training static quantization and quantization-aware training. The "min-max" technique sets the quantization range from the minimum and maximum observed values and is well suited to weights (a minimal sketch of this technique is shown below). The "moving average min-max" technique smooths this range with moving averages, which works better for activations. The "histogram" approach records the distribution of observed values and selects a range that minimizes quantization loss according to a criterion such as entropy or mean squared error, or by fitting a specified percentage of the data within the range; the details vary with the quantization type.

Optimum, a toolset from huggingface.co that extends Transformers, provides performance-optimization tools for training and running models efficiently on targeted hardware. Due to their massive size, large language models built on transformers may require multiple performant GPUs, which limits their usability. GPTQ [1] is a post-training quantization approach, primarily focused on GPU inference and performance gains.
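Returning to the min-max technique above, here is a minimal sketch of symmetric post-training int8 quantization of a single weight tensor using NumPy. The tensor, the symmetric scheme, and the scale handling are simplified assumptions for illustration, not a production calibration pipeline:

```python
import numpy as np

def quantize_minmax_int8(w: np.ndarray):
    """Symmetric min-max quantization of a float32 tensor to int8."""
    # Min-max calibration: the range is set by the largest observed magnitude.
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0  # map the observed range onto [-127, 127]
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 values back to approximate float32 values."""
    return q.astype(np.float32) * scale

# Example: quantize a random weight matrix and measure the round-trip error.
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_minmax_int8(w)
err = np.abs(w - dequantize(q, scale)).mean()
print(f"scale={scale:.5f}, mean abs quantization error={err:.5f}")
```

The storage drops from 4 bytes to 1 byte per weight, at the cost of the small reconstruction error printed at the end; the moving-average and histogram techniques differ only in how the range (and hence the scale) is chosen.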
We used GGUF quantization, introduced by the llama.cpp team, a quantization format designed for large language models. It allows users to run LLMs on a CPU while offloading some layers to the GPU for speed improvements, which makes GGUF particularly useful for those running models on CPUs or Apple devices.
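For example, with the llama-cpp-python bindings, a GGUF model can be loaded on the CPU with a configurable number of layers offloaded to the GPU. This is a minimal sketch; the model path and layer count below are placeholder assumptions:

```python
from llama_cpp import Llama

# Load an 8-bit GGUF model; n_gpu_layers controls how many transformer
# layers are offloaded to the GPU (0 = pure CPU inference).
llm = Llama(
    model_path="./llama-2-1b.Q8_0.gguf",  # hypothetical local GGUF file
    n_gpu_layers=20,                      # offload 20 layers to the GPU
    n_ctx=2048,                           # context window
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```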
We ran our queries on two GGUF versions of the llama-2-1b model: an fp16 baseline and an 8-bit quantized variant. Find below some of the performance measurements:
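A minimal sketch of how such throughput numbers can be collected with the llama-cpp-python bindings follows; the two GGUF file names are placeholder assumptions for the fp16 and 8-bit variants, and the timing is deliberately crude:

```python
import time
from llama_cpp import Llama

def tokens_per_second(model_path: str, prompt: str, n_tokens: int = 128) -> float:
    """Crude throughput estimate: requested tokens / wall-clock seconds."""
    llm = Llama(model_path=model_path, n_ctx=2048, verbose=False)
    start = time.perf_counter()
    llm(prompt, max_tokens=n_tokens)  # may stop early at EOS, so this is approximate
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

prompt = "Summarize the benefits of model quantization."
for path in ["./llama-2-1b.F16.gguf", "./llama-2-1b.Q8_0.gguf"]:  # placeholders
    print(f"{path}: ~{tokens_per_second(path, prompt):.1f} tokens/sec")
```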
Experimenting with NVIDIA TensorRT-LLM
NVIDIA TensorRT-LLM is an open-source library that accelerates and optimizes the inference performance of the latest large language models (LLMs) on the NVIDIA AI platform. NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference; it includes a deep learning inference optimizer and runtime that deliver low latency and high throughput for inference applications. TensorRT-LLM lets developers experiment with new LLMs, offering high performance and quick customization without requiring deep knowledge of C++ or CUDA.
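The exact Python surface varies by release; as a minimal sketch, assuming the high-level LLM API documented in recent TensorRT-LLM releases (the model id below is an example, not the checkpoint used in our experiments):

```python
from tensorrt_llm import LLM, SamplingParams

# Build (or load) an optimized engine directly from a Hugging Face model id.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

params = SamplingParams(max_tokens=64, temperature=0.7)
for output in llm.generate(["What is LLM quantization?"], params):
    print(output.outputs[0].text)
```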
Quantizing to 8-bit on NVIDIA GPUs offers many benefits [4], including a smaller memory footprint, reduced memory-bandwidth pressure, and higher arithmetic throughput on INT8-capable Tensor Cores, which together translate into lower latency and higher token throughput.
We have experimented with several quantization approaches, comparing each against the unquantized version of the model along three dimensions: i) accuracy, ii) peak memory utilization, and iii) throughput (tokens per second).
For accuracy, we assess common LLM performance and accuracy metrics on the MMLU (Massive Multitask Language Understanding) dataset [6]. For each quantization, we run two scripts: one for accuracy (mmlu.py) and one for LLM metrics (benchmark.py). From the benchmark's output, we extract peak memory utilization and throughput (tokens per second).
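A minimal sketch of how these two scripts can be invoked is below; the script paths and flag names are assumptions based on the TensorRT-LLM examples directory and may differ between releases, and the local paths are placeholders:

```python
import subprocess

ENGINE_DIR = "./llama3-8b-int8-engine"          # hypothetical built engine
HF_MODEL_DIR = "./Meta-Llama-3-8B-Instruct"     # hypothetical HF checkpoint

# Accuracy on MMLU for the built engine.
subprocess.run([
    "python", "examples/mmlu.py",
    "--hf_model_dir", HF_MODEL_DIR,
    "--engine_dir", ENGINE_DIR,
    "--test_trt_llm",
], check=True)

# Latency/throughput benchmark; peak memory and tokens/sec are read
# from its output. The model name passed to -m is an assumption.
subprocess.run([
    "python", "benchmarks/python/benchmark.py",
    "-m", "llama_8b",
    "--engine_dir", ENGINE_DIR,
    "--batch_size", "1",
    "--input_output_len", "128,128",
], check=True)
```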
The quantization algorithms adopted are AWQ [5], int8_sq, int8_wo, int4_sq, and int4_wo.
We have run experiments using Llama-3-8B-Instruct and followed the instructions for checkpointing and for converting the Hugging Face checkpoint into a TensorRT-LLM checkpoint. The checkpointing and quantization process is outlined below; the actual commands executed are shown in the appendix.
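A minimal sketch of the flow, assuming the quantize.py script and trtllm-build tool shipped with the TensorRT-LLM examples (the paths and the int4_awq format choice are illustrative, and flags may differ between releases):

```python
import subprocess

HF_MODEL = "./Meta-Llama-3-8B-Instruct"    # Hugging Face checkpoint (assumed path)
CKPT_DIR = "./tllm-ckpt-int4-awq"          # quantized TensorRT-LLM checkpoint
ENGINE_DIR = "./llama3-8b-int4-awq-engine" # final optimized engine

# 1) Quantize the HF checkpoint into TensorRT-LLM checkpoint format
#    (here: int4 AWQ; int8_sq, int8_wo, etc. are selected the same way).
subprocess.run([
    "python", "examples/quantization/quantize.py",
    "--model_dir", HF_MODEL,
    "--qformat", "int4_awq",
    "--output_dir", CKPT_DIR,
], check=True)

# 2) Build the optimized inference engine from the quantized checkpoint.
subprocess.run([
    "trtllm-build",
    "--checkpoint_dir", CKPT_DIR,
    "--output_dir", ENGINE_DIR,
], check=True)
```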
Other Quantization Toolkits
Acknowledgments
Thanks to Tanuj Agarwal and Kim Ng from NVIDIA for the partnership!
Appendix
References:
[1] Frantar, Elias, et al. "GPTQ: Accurate post-training quantization for generative pre-trained transformers." arXiv preprint arXiv:2210.17323 (2022).
[5] AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. https://github.com/mit-han-lab/llm-awq/blob/main/README.md
[8] Andrew Miller, "The hidden costs of generative AI," Kyndryl.