Unlock the Power of Model Compression with Intel® Neural Compressor

In the rapidly evolving field of machine learning and AI, efficient model deployment is crucial. Intel® Neural Compressor is a versatile tool that offers popular model compression techniques such as quantization, pruning (sparsity), distillation, and neural architecture search. It supports mainstream frameworks like TensorFlow, PyTorch, ONNX Runtime, and MXNet, along with Intel-specific extensions such as Intel Extension for TensorFlow and Intel Extension for PyTorch.
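
To give a taste of the non-quantization features, here is a minimal, hypothetical sketch of magnitude pruning on a toy PyTorch model using Neural Compressor 2.x's training callbacks. The model, sparsity target, and step counts are illustrative assumptions, not values from this article.

import torch
from neural_compressor.config import WeightPruningConfig
from neural_compressor.training import prepare_compression

# Toy model and optimizer; substitute your own network and training loop.
model = torch.nn.Sequential(torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Prune to 80% sparsity with magnitude pruning over the first 100 steps (illustrative values).
prune_conf = WeightPruningConfig(pruning_type="magnitude", target_sparsity=0.8, start_step=0, end_step=100)
compression_manager = prepare_compression(model, prune_conf)
compression_manager.callbacks.on_train_begin()

for step in range(100):
    compression_manager.callbacks.on_step_begin(step)
    x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    compression_manager.callbacks.on_before_optimizer_step()
    optimizer.step()
    compression_manager.callbacks.on_after_optimizer_step()
    compression_manager.callbacks.on_step_end()

compression_manager.callbacks.on_train_end()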

Key Features and Benefits

Broad Hardware Support

Intel® Neural Compressor supports a wide range of Intel hardware, including:

  • Intel Xeon Scalable Processors
  • Intel Xeon CPU Max Series
  • Intel Data Center GPU Flex Series
  • Intel Data Center GPU Max Series

Additionally, it offers limited support for AMD CPUs, ARM CPUs, and NVIDIA GPUs via ONNX Runtime.
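
To illustrate the ONNX Runtime path, here is a hedged sketch of post-training quantization of an ONNX model. The model path, dummy dataset shape, and framework keys are assumptions made for illustration and are not taken from this article.

from neural_compressor import quantization
from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.data import DataLoader, Datasets

# "model_fp32.onnx" is a placeholder path to an existing FP32 ONNX model.
dataset = Datasets("onnxrt_qlinearops")["dummy"](shape=(1, 3, 224, 224))
calib_dataloader = DataLoader(framework="onnxruntime", dataset=dataset)

conf = PostTrainingQuantConfig()
q_model = quantization.fit(model="model_fp32.onnx", conf=conf, calib_dataloader=calib_dataloader)
q_model.save("model_int8.onnx")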

Extensive Model Validation

The tool validates popular large language models (LLMs) such as Llama 2, Falcon, GPT-J, Bloom, and OPT, as well as more than 10,000 other models such as Stable Diffusion, BERT-Large, and ResNet50. Validation relies on Neural Coder, a zero-code optimization solution, together with automatic accuracy-driven quantization strategies, as sketched below.
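
To show what accuracy-driven tuning looks like, here is a small sketch built on the same config API the later examples use. The dummy dataset and the placeholder eval_func (which should return your real validation accuracy) are assumptions for demonstration only.

from torchvision import models
from neural_compressor.config import AccuracyCriterion, PostTrainingQuantConfig, TuningCriterion
from neural_compressor.data import DataLoader, Datasets
from neural_compressor.quantization import fit

float_model = models.resnet18()
dataset = Datasets("pytorch")["dummy"](shape=(1, 3, 224, 224))
calib_dataloader = DataLoader(framework="pytorch", dataset=dataset)

def eval_func(model):
    # Placeholder: evaluate the candidate model and return its accuracy.
    return 0.76

conf = PostTrainingQuantConfig(
    tuning_criterion=TuningCriterion(max_trials=100, timeout=0),
    accuracy_criterion=AccuracyCriterion(tolerable_loss=0.01),  # allow at most a 1% relative drop
)
quantized_model = fit(model=float_model, conf=conf, calib_dataloader=calib_dataloader, eval_func=eval_func)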

Getting Started with Intel® Neural Compressor

Installation

Install Neural Compressor from PyPI:

pip install neural-compressor        

Setting Up the Environment

Set up the environment with the necessary packages:

pip install "neural-compressor>=2.3" "transformers>=4.34.0" torch torchvision        
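
As an optional sanity check, confirm the packages are importable and print the installed versions:

import neural_compressor
import torch
import transformers

print("neural-compressor:", neural_compressor.__version__)
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)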

Weight-Only Quantization for LLMs

Here’s a demonstration of Weight-Only Quantization for LLMs. This method supports Intel CPUs, Intel Gaudi2 AI Accelerators, and NVIDIA GPUs, automatically selecting the best device.

For Intel Gaudi2, using a Docker image with the Gaudi Software Stack is recommended. Below is the script for environment setup:

docker run -it --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  --cap-add=sys_nice --net=host --ipc=host \
  vault.habana.ai/gaudi-docker/1.14.0/ubuntu22.04/habanalabs/pytorch-installer-2.1.1:latest

# Check the container ID
docker ps

# Log into the container
docker exec -it <container_id> bash

# Install optimum-habana
pip install --upgrade-strategy eager optimum[habana]

# Install Intel Neural Compressor and auto-round
pip install neural-compressor auto_round
        

Run the example:

from transformers import AutoModel, AutoTokenizer
from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.quantization import fit
from neural_compressor.adaptor.torch_utils.auto_round import get_dataloader

model_name = "EleutherAI/gpt-neo-125m"
float_model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
dataloader = get_dataloader(tokenizer, seqlen=2048)

woq_conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # match all ops
            "weight": {
                "dtype": "int",
                "bits": 4,
                "algorithm": "AUTOROUND",
            },
        }
    },
)
quantized_model = fit(model=float_model, conf=woq_conf, calib_dataloader=dataloader)
        

Note: For INT4 model inference, use Intel Extension for Transformers, which leverages Intel Neural Compressor for model quantization.
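
For reference, here is a minimal sketch of that INT4 inference path with Intel Extension for Transformers, assuming the intel-extension-for-transformers package is installed via pip; the model name and prompt are illustrative.

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "EleutherAI/gpt-neo-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# load_in_4bit applies weight-only INT4 quantization under the hood via Neural Compressor.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

inputs = tokenizer("Model compression makes deployment", return_tensors="pt").input_ids
outputs = model.generate(inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))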

Static Quantization for Non-LLMs

Here's an example of Static Quantization using a ResNet18 model:

from torchvision import models
from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.data import DataLoader, Datasets
from neural_compressor.quantization import fit

float_model = models.resnet18()
dataset = Datasets("pytorch")["dummy"](shape=(1, 3, 224, 224))
calib_dataloader = DataLoader(framework="pytorch", dataset=dataset)
static_quant_conf = PostTrainingQuantConfig()
quantized_model = fit(model=float_model, conf=static_quant_conf, calib_dataloader=calib_dataloader)
        
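If you want to keep the result, the model object returned by fit can be saved to disk; the directory name below is just an example.

quantized_model.save("./saved_results")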

By leveraging Intel® Neural Compressor, you can compress models efficiently, improving performance and reducing latency across a wide range of hardware. Start optimizing your models today and unlock new potential in AI and machine learning deployment! The project is available on GitHub at https://github.com/intel/neural-compressor.

