Unlock the Power of Model Compression with Intel® Neural Compressor
Zac Zacharia
Lead Solution Architect - Data & AI | Cloud-Native Architectures | AI/ML Operationalization | Kafka, AWS, TensorFlow | Driving Scalable Innovation
In the rapidly evolving field of machine learning and AI, efficient model deployment is crucial. Intel® Neural Compressor is a versatile open-source tool that offers popular model compression techniques such as quantization, pruning (sparsity), distillation, and neural architecture search. It supports mainstream frameworks like TensorFlow, PyTorch, ONNX Runtime, and MXNet, along with Intel-specific extensions such as Intel Extension for TensorFlow and Intel Extension for PyTorch.
Key Features and Benefits
Broad Hardware Support
Intel® Neural Compressor supports a wide range of Intel hardware, including:
- Intel Xeon Scalable Processors
- Intel Xeon CPU Max Series
- Intel Data Center GPU Flex Series
- Intel Data Center GPU Max Series
- Intel Gaudi2 AI Accelerators
Additionally, it offers limited support for AMD CPUs, ARM CPUs, and NVIDIA GPUs via ONNX Runtime, as sketched below.
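As a hedged sketch of that ONNX Runtime path: in the Neural Compressor 2.x API, the execution provider is chosen through the config's backend option. The model path below is a placeholder, and the exact backend/device combinations that work depend on your ONNX Runtime build, so treat this as illustrative rather than definitive:
import onnx
from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.quantization import fit
# "model.onnx" is a placeholder path for any ONNX model you want to compress.
float_model = onnx.load("model.onnx")
# Dynamic quantization needs no calibration data; selecting the ONNX Runtime
# CUDA execution provider targets an NVIDIA GPU.
conf = PostTrainingQuantConfig(approach="dynamic", device="gpu", backend="onnxrt_cuda_ep")
quantized_model = fit(model=float_model, conf=conf)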
Extensive Model Validation
The tool has validated popular large language models (LLMs) such as Llama 2, Falcon, GPT-J, Bloom, and OPT, as well as more than 10,000 other models such as Stable Diffusion, BERT-Large, and ResNet50. Validation is performed with Neural Coder, a zero-code optimization solution, together with automatic accuracy-driven quantization strategies.
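To illustrate the zero-code path, Neural Coder can inject quantization code into an existing script automatically. A minimal sketch, assuming Neural Coder's enable API and the pytorch_inc_static_quant_fx feature name from its documentation (the script name is hypothetical):
from neural_coder import enable
# Auto-inject INC static quantization (FX backend) into an existing script.
# "run_inference.py" is a placeholder for your own inference script.
enable(
    code="run_inference.py",
    features=["pytorch_inc_static_quant_fx"],
)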
Getting Started with Intel® Neural Compressor
Installation
Install Neural Compressor from PyPI:
pip install neural-compressor
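To verify the installation, print the package version (this assumes a standard pip environment):
python -c "import neural_compressor; print(neural_compressor.__version__)"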
Setting Up the Environment
Set up the environment with the necessary packages:
pip install "neural-compressor>=2.3" "transformers>=4.34.0" torch torchvision
Weight-Only Quantization for LLMs
Here’s a demonstration of Weight-Only Quantization for LLMs. This method supports Intel CPUs, Intel Gaudi2 AI Accelerators, and NVIDIA GPUs, automatically selecting the best device.
For Intel Gaudi2, using a Docker image with the Gaudi Software Stack is recommended. Below is the script for environment setup:
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.14.0/ubuntu22.04/habanalabs/pytorch-installer-2.1.1:latest
# Check the container ID
docker ps
# Log into the container
docker exec -it <container_id> bash
# Install optimum-habana
pip install --upgrade-strategy eager optimum[habana]
# Install INC/auto_round
pip install neural-compressor auto_round
Run the example:
from transformers import AutoModel, AutoTokenizer
from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.quantization import fit
from neural_compressor.adaptor.torch_utils.auto_round import get_dataloader

# Load the FP32 model and its tokenizer
model_name = "EleutherAI/gpt-neo-125m"
float_model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Build a calibration dataloader for the AutoRound algorithm
dataloader = get_dataloader(tokenizer, seqlen=2048)

# Configure 4-bit weight-only quantization using AutoRound for all ops
woq_conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # match all ops
            "weight": {
                "dtype": "int",
                "bits": 4,
                "algorithm": "AUTOROUND",
            },
        }
    },
)

# Run quantization
quantized_model = fit(model=float_model, conf=woq_conf, calib_dataloader=dataloader)
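In Neural Compressor 2.x, the returned object wraps the quantized model and can be written to disk; the output directory below is an example:
# Persist the quantized model and its quantization configuration
quantized_model.save("./saved_results")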
Note: For INT4 model inference, use Intel Extension for Transformers, which leverages Intel Neural Compressor for model quantization.
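As a hedged sketch of that inference path, Intel Extension for Transformers exposes a Transformers-like API with 4-bit loading (the load_in_4bit flag comes from that project's documentation, not from this article):
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_name = "EleutherAI/gpt-neo-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Quantizes weights to INT4 on load and runs optimized inference
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

inputs = tokenizer("The future of AI is", return_tensors="pt").input_ids
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))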
Static Quantization for Non-LLMs
Here's an example of Static Quantization using a ResNet18 model:
from torchvision import models
from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.data import DataLoader, Datasets
from neural_compressor.quantization import fit

# Load a ResNet18 model in FP32
float_model = models.resnet18()

# Create a dummy dataset and dataloader for calibration
dataset = Datasets("pytorch")["dummy"](shape=(1, 3, 224, 224))
calib_dataloader = DataLoader(framework="pytorch", dataset=dataset)

# The default configuration performs static post-training quantization
static_quant_conf = PostTrainingQuantConfig()
quantized_model = fit(model=float_model, conf=static_quant_conf, calib_dataloader=calib_dataloader)
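The same entry point also supports the accuracy-driven tuning mentioned earlier: pass an evaluation function and a tolerance, and Neural Compressor searches quantization configurations until the accuracy criterion is met. A minimal sketch reusing float_model and calib_dataloader from above (the 1% relative-loss tolerance and the placeholder eval_func are illustrative assumptions):
from neural_compressor.config import PostTrainingQuantConfig, TuningCriterion, AccuracyCriterion

# Stop after at most 100 trials; accept at most 1% relative accuracy loss
tuning = TuningCriterion(max_trials=100)
accuracy = AccuracyCriterion(criterion="relative", tolerable_loss=0.01)
conf = PostTrainingQuantConfig(tuning_criterion=tuning, accuracy_criterion=accuracy)

def eval_func(model):
    # Placeholder: return your model's accuracy on a validation set
    return 1.0

quantized_model = fit(model=float_model, conf=conf, calib_dataloader=calib_dataloader, eval_func=eval_func)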
By leveraging Intel® Neural Compressor, you can achieve efficient model compression, enhancing performance and reducing latency across a wide range of hardware. Start optimizing your models today and unlock new potential in AI and machine learning deployment! GitHub repo: https://github.com/intel/neural-compressor