Running Llama-3.2-based Chatbot on Intel Core Ultra Processor using OpenVINO-GenAI
Ramesh Perumal PhD
AI Solution Architect | SMIEEE | Edge AI | Computer Vision | GenAI | MLOps | Taiwan Employment Gold Card Recipient | Healthcare & Life Sciences
This article presents the steps to quantize the Llama-3.2-3B-Instruct model using Optimum-Intel and run a chatbot with OpenVINO-GenAI on the integrated GPU (iGPU) of an Intel Core Ultra 7 165H processor running Windows 11.
Steps
1. Request access to the Llama-3.2 models with your account on Hugging Face (the models are gated)
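Because the models are gated, the export in step 5 needs an authenticated Hugging Face session. Once the environment from the next steps is set up, log in; a minimal sketch, where the token value is a placeholder for your own access token:
# Interactive login (huggingface-cli ships with the huggingface_hub dependency)
huggingface-cli login
# Or non-interactively from Python; <YOUR_HF_TOKEN> is a placeholder
python -c "from huggingface_hub import login; login(token='<YOUR_HF_TOKEN>')"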
2. Create a Python virtual environment to install the required dependencies
# Create the virtual environment
python -m venv ov_genai_venv
# Activate the virtual environment
.\ov_genai_venv\Scripts\activate
# Upgrade pip inside the environment
python -m pip install pip --upgrade
3. Install the openvino-genai package
pip install openvino-genai==2024.4.0.0
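A quick sanity check that the package is importable; this assumes the wheel exposes __version__, as the 2024.4 release does:
# Verify the install; note the Python module is named openvino_genai
python -c "import openvino_genai; print(openvino_genai.__version__)"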
4. Clone the OpenVINO-GenAI repository and install the sample dependencies
git clone https://github.com/openvinotoolkit/openvino.genai.git
cd openvino.genai\samples
pip install -r requirements.txt
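The samples' requirements pull in Optimum-Intel, which provides the exporter used in the next step; you can confirm it is on the path before exporting (if it is missing, pip install optimum[openvino] adds it):
# List the exporter's options to confirm Optimum-Intel is installed
optimum-cli export openvino --help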
5. Convert the model into OpenVINO intermediate representation (IR) format and quantize its weights to INT4 precision
optimum-cli export openvino --model meta-llama/Llama-3.2-3B-Instruct --task text-generation-with-past --weight-format int4 --group-size 64 --ratio 1.0 --sym --awq --scale-estimation --dataset "wikitext2" --all-layers llama-3.2\Llama-3.2-3B-Instruct-INT4
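For reference, the same export can be scripted from Python with Optimum-Intel. This is a hedged sketch, assuming OVWeightQuantizationConfig accepts parameters mirroring the CLI flags above; names may differ slightly across optimum-intel versions:
# Hedged sketch: Python-side equivalent of the optimum-cli export above.
# Verify the parameter names against your installed optimum-intel version.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

quant_config = OVWeightQuantizationConfig(
    bits=4,                 # --weight-format int4
    sym=True,               # --sym
    group_size=64,          # --group-size 64
    ratio=1.0,              # --ratio 1.0
    all_layers=True,        # --all-layers
    quant_method="awq",     # --awq
    scale_estimation=True,  # --scale-estimation
    dataset="wikitext2",    # --dataset "wikitext2"
)

model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    export=True,
    quantization_config=quant_config,
)
model.save_pretrained("llama-3.2/Llama-3.2-3B-Instruct-INT4")
Note that the CLI route also converts the tokenizer into the OpenVINO format the chat sample expects, so the command above remains the simpler path.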
6. Replace the tokenizer_config.json file in Llama-3.2-3B-Instruct-INT4 with this patch to run the chatbot sample
7. Run the chatbot with chat_sample.py on the iGPU
# Change the device variable in chat_sample.py to 'GPU'
device = 'GPU'
# Run the chatbot
cd python\chat_sample
python chat_sample.py ..\..\llama-3.2\Llama-3.2-3B-Instruct-INT4
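For context, chat_sample.py is a thin wrapper around openvino_genai.LLMPipeline; the sketch below mirrors its chat loop (the 256-token limit is illustrative):
# Minimal sketch of the chat loop inside chat_sample.py
import openvino_genai

# Load the INT4 model on the integrated GPU
pipe = openvino_genai.LLMPipeline("llama-3.2/Llama-3.2-3B-Instruct-INT4", "GPU")

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 256  # illustrative limit

def streamer(subword):
    # Print tokens as they are generated; returning False keeps generation going
    print(subword, end="", flush=True)
    return False

pipe.start_chat()
while True:
    try:
        prompt = input("question:\n")
    except EOFError:
        break
    pipe.generate(prompt, config, streamer)
    print("\n----------")
pipe.finish_chat()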
Sample Output on iGPU