Maximizing AI Performance with Intel Arc A770 GPU on Windows
Synopsis
This article introduces the Intel Arc A770 GPU as a competitive option for intensive AI tasks, especially for those working within the Windows ecosystem. Traditionally, NVIDIA GPUs and CUDA have dominated this space, but Intel's latest offering provides a robust alternative. This article shows how to work with the Arc A770 GPU natively on Windows, bypassing the need for the Windows Subsystem for Linux (WSL).
Through practical steps and detailed insights, we explore how to set up and optimize the Arc A770 GPU for various AI models, including Llama2, Llama3, and Phi3. The article also includes performance metrics and memory usage statistics, providing a comprehensive overview of the GPU’s capabilities. Whether you are a developer or researcher, this post will equip you with the knowledge to leverage Intel’s GPU for your AI projects efficiently and effectively.
Introduction
Intel recently provided me with the opportunity to test their Arc A770 GPU for AI tasks. While detailed specifications can be found here, one feature immediately stood out: 16GB of VRAM. This is 4GB more than its natural competitor, the NVIDIA RTX 3060, making it a compelling option for AI computations at a similar price point.
Intel Arc A770 GPU used for tests
At Plain Concepts, where we predominantly work with Microsoft technologies, I decided to explore the GPU's capabilities on a Windows platform. Given my usual work with PyTorch, I began by utilizing the Intel Extension for PyTorch to see if it could run models like Llama2, Llama3, and Phi3, and to evaluate its performance.
Initially, I considered using the Windows Subsystem for Linux (WSL) based on suggestions from various blog posts and videos that indicated native Windows support might not be fully ready. However, I chose to first experiment with a native Windows setup, and after a few tweaks and adjustments, I was pleased to discover that everything worked seamlessly!
In this article, I will share my experiences and the steps I took to run Llama2, Llama3, and Phi3 models on the Intel Arc A770 GPU natively in Windows. I will also present performance metrics, including execution time and memory usage for each model. The goal is to provide a comprehensive overview of how the Intel Arc A770 GPU can be effectively used for intensive AI tasks on Windows.
Setup on Windows
Intel provides a comprehensive guide for installing the Python extension for the Arc GPU.
Intel extension for Pytorch install guide
However, setting up the Arc A770 GPU on Windows required some initial adjustments and troubleshooting. Here's a brief summary of those adjustments. For detailed instructions, refer to the samples repository.
Using the Intel extension for Pytorch
As stated in its GitHub repository, “Intel® Extension for PyTorch extends PyTorch with up-to-date features optimizations for an extra performance boost on Intel hardware”. Specifically, it “provides easy GPU acceleration for Intel discrete GPUs through the PyTorch xpu device”. This means that, by using this extension, you can leverage the Intel Arc A770 GPU for AI tasks without relying on CUDA/NVIDIA, and that you can get an even greater performance boost when using one of the optimized models.
Luckily, the extension follows the same API as PyTorch, so in general only a few changes are needed to get code running on the Intel GPU. Here is a brief summary of those changes; a combined sketch follows the list.
Add the Intel Extension for PyTorch import and check that the GPU is correctly detected.
This check is not strictly needed, but it is good practice to confirm the GPU is detected before running the model.
Once the model is loaded, move it to the GPU.
Finally, when using the model, ensure the input data is also on the GPU.
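A minimal sketch of these three changes, assuming the Hugging Face transformers interface rather than Meta's original sample scripts (the model name and prompt below are illustrative):

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device with PyTorch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Check that the Arc GPU is detected as an XPU device (optional, but good practice)
assert torch.xpu.is_available(), "No Intel XPU device detected"
print("Using device:", torch.xpu.get_device_name(0))

# 2. Load the model and move it to the GPU
model_id = "meta-llama/Llama-2-7b-hf"  # illustrative; any causal LM follows the same pattern
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("xpu")

# 3. Make sure the input tensors live on the same device before running inference
inputs = tokenizer("Intel Arc GPUs are", return_tensors="pt").to("xpu")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```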
Other changes for performance measurement
In order to measure performance accurately, I also added some extra code to retrieve the total inference time and max memory allocation. It mainly consists of a warm-up of each model before the actual inference, plus some extra code that waits for the model to finish and prints the results in a human-readable way. Check the samples repository for more information and to replicate the results on your own machine.
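A minimal sketch of that measurement logic, assuming the torch.xpu namespace mirrors the familiar CUDA utilities (synchronize, reset_peak_memory_stats, max_memory_allocated); the helper name is my own:

```python
import time
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401 — enables the torch.xpu namespace

def timed_generate(model, inputs, max_new_tokens=128, warmup_rounds=1):
    """Warm up the model, then measure total inference time and peak GPU memory."""
    # Warm-up: the first pass includes kernel compilation and cache population,
    # which would otherwise skew the measurement.
    for _ in range(warmup_rounds):
        model.generate(**inputs, max_new_tokens=max_new_tokens)

    torch.xpu.synchronize()              # make sure the warm-up work has finished
    torch.xpu.reset_peak_memory_stats()
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.xpu.synchronize()              # wait for the GPU before stopping the clock
    elapsed = time.perf_counter() - start

    peak_gb = torch.xpu.max_memory_allocated() / 1024**3
    print(f"Inference time: {elapsed:.2f} s | Peak memory: {peak_gb:.2f} GB")
    return output
```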
Llama2
Llama2 is the second iteration of the popular open-source Llama LLM model by Meta. After preparing the environment and applying the changes described in the previous section to the official Llama2 samples, I was able to run the Llama2 model on the Intel Arc A770 GPU, both for plain inference and for instruction tasks.
Running Llama2 7B on Intel Arc A770 GPU
The Llama2 7B model takes approximately 14GB of memory in float16 precision (7B parameters × 2 bytes per weight). As the GPU has 16GB available, we can run it without any issues. Below you can see the results of the inference sample, using a maximum of 128 tokens in the output.
Running Llama2 7B Chat on Intel Arc A770 GPU
Similarly, the Llama2 7B chat results were impressive, with the model generating human-like responses in a conversational tone. The chat sample ran smoothly on the Intel Arc A770 GPU, showcasing its capabilities for chat applications. In this case, the sample runs with 512 tokens in the output to further stress the hardware.
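For reference, a hedged sketch of driving the chat variant through the tokenizer's chat template (the official sample builds the [INST] prompt by hand; the messages below are illustrative, and the model is assumed to be the 7B chat checkpoint loaded as in the earlier sketch):

```python
# Assumes model/tokenizer were loaded as in the previous sketch,
# but from the chat checkpoint, e.g. "meta-llama/Llama-2-7b-chat-hf".
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what the PyTorch xpu device is."},
]

# apply_chat_template wraps the messages in Llama2's [INST]/<<SYS>> chat format
prompt_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to("xpu")
outputs = model.generate(prompt_ids, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][prompt_ids.shape[-1]:], skip_special_tokens=True))
```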
Llama3
Llama3 is the latest iteration of the Llama LLM model by Meta, released a couple of months ago. Luckily, the Intel team was quick to include the model's optimizations in the extension, so it was possible to leverage the full power of the Intel Arc A770 GPU. The process was very similar to the one used for Llama2, using the same environment and official samples.
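Recent releases of the extension expose an LLM-specific optimization entry point (ipex.llm.optimize); the sketch below shows how it can be applied to a Llama3 checkpoint, but treat the exact keyword arguments as an assumption to verify against the version you install:

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # illustrative checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).eval().to("xpu")

# Apply the extension's LLM-specific optimizations (argument names may vary by release)
model = ipex.llm.optimize(model, dtype=torch.float16, device="xpu")

inputs = tokenizer("The Intel Arc A770 is", return_tensors="pt").to("xpu")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```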
Running Llama3 8B on Intel Arc A770 GPU
The Llama3 8B model takes a little more than 15GB of memory in float16 precision (8B parameters × 2 bytes per weight). As the GPU has 16GB available, we can run it without any issues. Below you can see the results of the inference sample, using a maximum of 64 tokens in the output.
Running Llama3 8B Instruct on Intel Arc A770 GPU
Following the Llama2 samples, I also tested the chat capabilities of the Llama3 8B model, increasing the output to 256 tokens.
Phi3
Phi3 is the latest model from Microsoft, released on April 24th and designed for instruction tasks. It is smaller than Llama2 and Llama3 (the smallest version has 3.8B parameters), but it is still quite powerful and is trained to provide detailed and informative responses.
While Phi3 optimizations for Intel hardware are not yet included in the Intel Extension for PyTorch, we can use a third-party library, ipex-llm, to optimize the model. Because Phi3 is quite new, I had to install the prerelease version, which implements the optimizations for all of Phi3's kernel operations. Note that ipex-llm is not a formal Intel library but a community-driven one, so it is not officially supported by Intel.
Once the model is optimized, the rest of the code modifications are the same as for Llama2 and Llama3, so I was able to run the Phi3 model on the Intel Arc A770 GPU without any issues.
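A minimal sketch of that approach using ipex-llm's drop-in transformers wrapper; the load_in_4bit path and the checkpoint name are assumptions based on the library's documented usage, so verify them against the prerelease you install:

```python
import torch
from transformers import AutoTokenizer
# ipex-llm ships a drop-in AutoModelForCausalLM that quantizes the weights on load
from ipex_llm.transformers import AutoModelForCausalLM

model_id = "microsoft/Phi-3-mini-4k-instruct"  # illustrative checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# load_in_4bit quantizes the model to 4-bit precision while loading,
# bringing its footprint down to a few GB of GPU memory
model = AutoModelForCausalLM.from_pretrained(
    model_id, load_in_4bit=True, trust_remote_code=True
).to("xpu")

messages = [{"role": "user", "content": "Summarize what makes the Arc A770 interesting for AI."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("xpu")

with torch.inference_mode():
    outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```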
Running Phi3 4K Instruct on Intel Arc A770 GPU
The Phi3 4K Instruct model takes around 2.5GB of memory in 4-bit precision. As it has far fewer parameters than the Llama models, it is much faster to run. Below you can see the results of the inference sample, using a maximum of 512 tokens in the output.
Performance Comparison
To offer a thorough evaluation of the Intel Arc A770 GPU's performance, I conducted a comparative analysis of execution time and memory usage for each model on both the Intel Arc A770 GPU and the NVIDIA RTX 3080 Ti. The metrics were obtained using identical code samples and environment settings for both GPUs, ensuring a fair and accurate comparison.
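For clarity, one plausible reading of the per-token figure used below is simply the total inference time divided by the number of generated tokens (the max_new_tokens value of each sample); the numbers in this sketch are hypothetical and only illustrate the arithmetic:

```python
def seconds_per_token(total_inference_time_s: float, generated_tokens: int) -> float:
    """Normalize a run's total inference time by the number of generated tokens."""
    return total_inference_time_s / generated_tokens

# Hypothetical example: a 128-token run that took 6.4 seconds end to end
print(f"{seconds_per_token(6.4, 128):.3f} s/token")  # -> 0.050 s/token
```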
Performance Comparison Chart
The graph below illustrates the normalized execution time per token for each model on both the Intel Arc A770 and NVIDIA RTX 3080 Ti GPUs.
*Margin of error: less than 0.1 seconds.
As illustrated, the Intel Arc A770 GPU performed exceptionally well across all models, demonstrating competitive execution times. Notably, the Intel Arc A770 GPU outperformed the NVIDIA RTX 3080 Ti by a factor of two or more in most cases.
Conclusion
The Intel Arc A770 GPU has proven to be a remarkable option for AI computation on a local Windows machine, offering an alternative to the CUDA/NVIDIA ecosystem. The GPU’s ability to efficiently run models like Llama2, Llama3, and Phi3 demonstrates its potential and robust performance capabilities. Despite initial setup challenges, the process was relatively straightforward, and the results were impressive.
In essence, the Intel Arc A770 GPU is a powerful tool for AI applications on Windows. With some initial setup and code adjustments, it handled the inference and chat workloads efficiently. This opens up new opportunities for developers and researchers who prefer or need to work within the Windows environment without relying on NVIDIA GPUs and CUDA. As Intel continues to enhance its GPU offerings and software support, the Arc A770 and future models are poised to become significant players in the AI community.
Useful links
The code samples used in this article can be found in the IntelArcA770 GitHub repository.
Below are also some resources that I find fundamental for diving deeper into the Intel hardware and libraries ecosystem for AI tasks.
References
Javier Carnero | Research Manager at Plain Concepts