Running Llama-3.2-based Chatbot on Intel Core Ultra Processor using OpenVINO-GenAI

This article presents the steps to quantize the Llama-3.2-3B-Instruct model with Optimum-Intel and run the resulting chatbot with OpenVINO-GenAI on the integrated GPU (iGPU) of an Intel Core Ultra 7 165H processor under Windows 11.

Steps

1. Request access to the gated Llama-3.2 models with your account on Hugging Face
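
Once access is granted, log in to Hugging Face from the command line so that the export step below can download the gated weights. A minimal example using the standard huggingface_hub CLI:

pip install huggingface_hub
# Log in with a Hugging Face access token (created under Settings -> Access Tokens)
huggingface-cli login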

2. Create a Python virtual environment to install the required dependencies

# Create the virtual environment
python -m venv ov_genai_venv
# Activate the virtual environment (Windows)
.\ov_genai_venv\Scripts\activate
# Upgrade pip
python -m pip install --upgrade pip

3. Install the openvino-genai package

pip install openvino-genai==2024.4.0.0         
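
A quick way to confirm the installation before proceeding:

# Verify that the package imports without errors
python -c "import openvino_genai"
# Display the installed version
pip show openvino-genai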

4. Clone the OpenVINO-GenAI repository and install the dependencies for the samples

git clone https://github.com/openvinotoolkit/openvino.genai.git
# git clone creates a folder named openvino.genai (not openvino.genai-master)
cd openvino.genai\samples
pip install -r requirements.txt

5. Convert the model to OpenVINO Intermediate Representation (IR) format and quantize its weights to INT4

optimum-cli export openvino --model meta-llama/Llama-3.2-3B-Instruct --task text-generation-with-past --weight-format int4 --group-size 64 --ratio 1.0 --sym --awq --scale-estimation --dataset "wikitext2" --all-layers llama-3.2\Llama-3.2-3B-Instruct-INT4        
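
Before wiring the model into the chat sample, it can be worth sanity-checking the exported IR with Optimum-Intel. A minimal sketch, assuming the output folder from the command above and an illustrative test prompt:

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

# Output folder produced by the optimum-cli command above
model_dir = "llama-3.2/Llama-3.2-3B-Instruct-INT4"

model = OVModelForCausalLM.from_pretrained(model_dir)  # loads the INT4 OpenVINO IR
tokenizer = AutoTokenizer.from_pretrained(model_dir)

inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))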

6. Replace the tokenizer_config.json file in Llama-3.2-3B-Instruct-INT4 with this patch to run the chatbot sample
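
To confirm the replacement file is intact before running the sample, a short hypothetical check (the chat sample relies on the tokenizer configuration, typically its chat_template entry, so this only verifies that the file parses and carries that key):

import json

# Hypothetical sanity check on the patched file; path assumed from the steps above
with open("llama-3.2/Llama-3.2-3B-Instruct-INT4/tokenizer_config.json", encoding="utf-8") as f:
    cfg = json.load(f)
print("chat_template present:", "chat_template" in cfg)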

7. Run the chatbot with chat_sample.py on the iGPU

# In python\chat_sample\chat_sample.py, set the device variable to the iGPU
device = 'GPU'
# Run the chatbot from the samples folder
cd python\chat_sample
python chat_sample.py ..\..\llama-3.2\Llama-3.2-3B-Instruct-INT4
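
For reference, the heart of chat_sample.py is a small loop around openvino_genai.LLMPipeline. A minimal sketch of the same flow, assuming the INT4 model folder from the steps above (generation settings are illustrative):

import openvino_genai as ov_genai

# Load the INT4 model on the integrated GPU
pipe = ov_genai.LLMPipeline("llama-3.2/Llama-3.2-3B-Instruct-INT4", "GPU")

config = ov_genai.GenerationConfig()
config.max_new_tokens = 256  # illustrative cap on the response length

def streamer(subword):
    # Print tokens as they arrive; returning False lets generation continue
    print(subword, end="", flush=True)
    return False

pipe.start_chat()
while True:
    try:
        prompt = input("\nquestion:\n")
    except EOFError:
        break
    pipe.generate(prompt, config, streamer)
    print("\n----------")
pipe.finish_chat()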

Sample Output on iGPU

