A step-by-step guide to install Intel Advisor and analyze a sample application and find out where Vectorization matters the most

Arun GK

Backend Developer | Intel Innovator | Fintech | AI

Published: June 8, 2023

In today's fast-paced computing landscape, optimizing application performance is paramount for developers seeking to harness the full potential of modern hardware. One crucial aspect of optimization is vectorization, which enables parallel execution of operations on multiple data elements simultaneously. By leveraging vector instructions, applications can achieve significant performance gains. Intel Advisor, a powerful profiling and optimization tool, comes to the rescue by providing developers with actionable insights into the areas of their codebase where vectorization can make the most impact.

In this step-by-step guide, we will walk you through installing Intel Advisor and using it to analyze a sample application. Whether you are a seasoned developer or just getting started, this guide will help you identify and optimize the critical sections of your code where vectorization can yield substantial performance improvements. We have also provided a visual reference in the form of a YouTube video. So, let's dive in and unlock the true potential of your applications by harnessing the power of Intel Advisor!

To install and use Intel Advisor for analyzing vectorization in your application, follow these steps:

Unpack and build your application.
Establish a performance baseline.
Disambiguate pointers.
Generate instructions for the highest instruction set architecture.

Prerequisites:

Install Intel Advisor either standalone or as part of Intel® oneAPI Base Toolkit.
Install Intel® C++ Compiler Classic either standalone or as part of Intel® oneAPI HPC Toolkit.
Set up environment variables for Intel Advisor and Intel® C++ Compiler Classic.

Note: This guide assumes default installation locations. If you installed the tools in a different location, adjust the paths accordingly in the commands provided.

By following these steps, you'll be able to leverage the power of Intel Advisor to identify the areas in your code where vectorization can have the most significant impact on performance. Let's optimize your application and unleash its true potential!

Unpacking and Building Your Application:

Open the command prompt and navigate to the directory: C:\Program Files (x86)\Intel\oneAPI\advisor\latest\samples\en\C++.
Copy the file "vec_samples.zip" to a writable directory or network share on your system.

Extract the sample from the .zip archive.
Change the directory to the unzipped location, specifically the "vec_samples" directory.
Build the sample application in release mode using the following command:

build.bat baseline

Run the sample application to verify that the build succeeded: vec_samples.exe
You should see output similar to the following:

ROW:47 COL: 47

Execution time is 6.020 seconds

GigaFlops = 0.733887

Sum of result = 254364.540283
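The figures in this output fit together arithmetically. As a rough sketch, assume each matrix-vector product performs one multiply and one add per matrix element, and that the 47 x 47 product is repeated one million times. The repetition count is inferred from the numbers above, not stated in this article, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Estimated GFLOPS for `repeats` dense matrix-vector products of size
   rows x cols, assuming one multiply and one add per matrix element. */
double matvec_gflops(int rows, int cols, long repeats, double seconds)
{
    assert(rows > 0 && cols > 0 && repeats > 0 && seconds > 0.0);
    double flops = 2.0 * rows * cols * (double)repeats;
    return flops / seconds / 1e9;
}

/* Hypothetical helper: prints the estimate for the figures shown above
   (ROW = 47, COL = 47, 6.020 seconds, assumed 1,000,000 repetitions). */
void print_baseline_estimate(void)
{
    printf("GigaFlops = %f\n", matvec_gflops(47, 47, 1000000L, 6.020));
}
```

Under these assumptions the estimate comes to about 0.7339, which agrees with the GigaFlops value reported by the sample. Your own execution time, and therefore the GigaFlops value, will differ by machine.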

Establishing Performance Baseline:

Launch the Intel Advisor GUI from the terminal or command prompt using the command: advisor-gui.
Create a project for the recently built "vec_samples" application, following the instructions in the "Before You Begin" section.
In the Project Properties dialog box, ensure that the "Inherit settings from Survey Hotspots Analysis Type" checkbox is selected for the Trip Counts and FLOP Analysis, Dependencies Analysis, and Memory Access Patterns Analysis types.
Choose the "Vectorization and Code Insights" perspective in the Perspective Selector window.
In the Analysis Workflow pane, set the data collection accuracy level to Low.
Click the "collect" button to run the perspective. At this accuracy level, Intel Advisor performs a Survey analysis and collects performance metrics to identify under- and non-vectorized hotspots.
Now you're ready to proceed with the Intel Advisor GUI and analyze the vectorization and code insights of your application.

Examining Results:

After opening the Vectorization and Code Insights result in the Intel Advisor GUI, you'll be presented with the Summary tab, which serves as a dashboard providing essential information about your application's execution and performance issues. Here's what to notice in the Summary window:

Assess the application's performance using the Elapsed Time metric in the Program Metrics pane. Each improvement made to under- and unvectorized functions/loops contributes to enhancing this metric. Consider reevaluating program elapsed time after each iteration of running the perspective.
In the Program Metrics pane, observe that the Time in scalar code is 100%, indicating that there are no vectorized loops in the application. The Vectorization Gain/Efficiency section is empty. This highlights the need for vectorization improvements.
Note that the Vector Instruction Set metric shows SSE2 and SSE, marked in red. Hovering over the value will display a warning indicating the availability of a higher instruction set architecture. This warning is also presented in the Per Program Recommendations pane. Consider generating instructions for the higher architecture and recompiling your application to enhance performance.
Explore the Top Time-Consuming Loops pane to view the most critical hotspots for optimization. Clicking on the largest hotspot will provide detailed metrics in the Survey Report, allowing for a deeper analysis.
Switch to the Survey & Roofline tab to analyze performance for each loop or function in the application.

In addition to the Summary tab, Intel Advisor offers the ability to create a read-only snapshot for the baseline result. This snapshot can be shared or compared with other results. To create a snapshot:

Click the camera icon.
Enter "snapshot_baseline" in the Result name field.
Enable the Pack into archive checkbox to specify a Result path.
Browse to the desired location and click OK to save the read-only snapshot of the current result.
If the Survey Report remains grayed out after the snapshot process, click anywhere on the report.

To review performance improvements, open the saved result snapshots and compare the metrics with those in the "snapshot_baseline" snapshot.

By carefully examining the Summary window and creating snapshots, you can gain valuable insights into your application's vectorization issues and track performance improvements over time.

Disambiguating Pointers:

In the Multiply.c file, the compiler generates runtime checks to determine if the pointer "b" in the function matvec(FTYPE a[][COLWIDTH], FTYPE b[], FTYPE x[]) is aliased to either "a" or "x". This check is necessary for safe vectorization. However, if we know that the pointers do not alias, we can inform the compiler by using the restrict qualifier and the NOALIAS macro. This allows the compiler to avoid the runtime check and generate a single vectorized code path.
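As a rough illustration of the idea, and not the sample's actual source, the sketch below shows how a macro could switch the restrict qualifier on for an output pointer. The names matvec_sketch and NO_OVERLAP, the flat float array layout, and the choice to qualify only b are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro mirroring the idea of a NOALIAS build: when NOALIAS
   is defined, the pointer is declared restrict (C99); otherwise it is not. */
#ifdef NOALIAS
#define NO_OVERLAP restrict
#else
#define NO_OVERLAP
#endif

/* Accumulates b[i] += a[i*cols + j] * x[j] for every row i.
   Because b is written inside the inner loop, the compiler must assume
   that a store to b[i] could change a[] or x[] unless told otherwise.
   With restrict, the caller promises that b overlaps neither of them. */
void matvec_sketch(size_t rows, size_t cols,
                   const float *a, float *NO_OVERLAP b, const float *x)
{
    assert(a != NULL && b != NULL && x != NULL);
    for (size_t i = 0; i < rows; i++) {
        for (size_t j = 0; j < cols; j++) {
            b[i] += a[i * cols + j] * x[j];
        }
    }
}
```

Built with NOALIAS defined, the compiler no longer needs a runtime overlap check before vectorizing the inner loop. The promise is the caller's responsibility: passing overlapping arrays to a restrict-qualified parameter is undefined behavior.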

To observe the impact of the NOALIAS macro on performance, follow these steps:

Navigate to the vec_samples directory from the same terminal window.
Rebuild the target application with the NOALIAS macro using the command:
build.bat noalias
This command applies the compiler options: /O2 /Qstd=c99 /fp:fast /Isrc /Zi /Qopenmp /DNOALIAS.

Now let's view the results:

Run the Vectorization and Code Insights perspective with the same configuration as the baseline result.

Check the changes in the Summary window:

In the Program Metrics pane, a new metric "Time in 2 Vectorized Loops" appears, indicating that the compiler successfully vectorized two loops. The time spent in these vectorized loops accounts for 36.6% of the application execution time.
Examine the Vectorization Gain/Efficiency section. The loops are vectorized with 60% efficiency, resulting in a 2.39x speedup compared to their scalar versions. However, there is still room for further improvement. The entire application shows a 1.51x speedup compared to the fully scalar version.
Notice a substantial improvement in the Elapsed time metric.
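The three figures reported here (36.6% of time in vectorized loops, a 2.39x loop speedup, and a 1.51x program speedup) are consistent with each other. The sketch below shows one way to relate them; whether Intel Advisor computes the program-level figure with exactly this formula is an assumption, and the function name is hypothetical.

```c
#include <assert.h>

/* vectorized_fraction: share of the vectorized run's time spent in
                        vectorized loops (0.0 to 1.0).
   loop_gain:           speedup of those loops over their scalar versions.
   Returns the estimated speedup of the whole program over a fully scalar
   build: the scalar build spends the same time in the unvectorized part,
   and loop_gain times longer in the loops that were vectorized. */
double program_speedup(double vectorized_fraction, double loop_gain)
{
    assert(vectorized_fraction >= 0.0 && vectorized_fraction <= 1.0);
    assert(loop_gain > 0.0);
    return (1.0 - vectorized_fraction) + vectorized_fraction * loop_gain;
}
```

With a fraction of 0.366 and a gain of 2.39 this gives roughly 0.634 + 0.875, or about 1.51. The formula also shows why the remaining scalar code limits the overall gain: the 63.4% of time outside the vectorized loops is not sped up at all.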

Open the Survey & Roofline tab to assess the changes in application performance. In the report, observe the following:

The compiler successfully vectorized two loops, specifically in matvec at Multiply.c:69 and in matvec at Multiply.c:60.
The loop in matvec at Multiply.c:60 has a high efficiency of 99% and an estimated gain of 3.96x. However, the efficiency of the loop in matvec at Multiply.c:69 is lower (25%), and the bar is gray, indicating that the achieved vectorization efficiency is lower than the original scalar loop efficiency. Hover over a bar in the Efficiency column to view an explanation for the estimated efficiency.

Click the icon next to the two vectorized loops. Note that both loops have a remainder loop present. Click the icon in the Trip Counts column to expand it. The remainder loops exist because the trip count values for these loops are not multiples of the VL (Vector Length) value.
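Both the efficiency figures and the remainder loops depend on the vector length. The sketch below assumes a VL of 4, which is inferred from the reported numbers (3.96x at 99% and 2.39x at 60%) and not stated in this article, and a trip count of 47, taken from the ROW and COL values in the sample output. It also assumes efficiency is the estimated gain divided by VL and ignores any peeled iterations a compiler may add for alignment. All names are hypothetical.

```c
#include <assert.h>

/* How a loop's iterations divide between the vectorized body and the
   remainder loop for a given vector length. */
typedef struct {
    long vector_iterations;    /* full vector iterations, each covering vl elements */
    long remainder_iterations; /* leftover elements handled by the remainder loop */
} loop_split;

loop_split split_trip_count(long trip_count, int vl)
{
    assert(trip_count >= 0 && vl > 0);
    loop_split s = { trip_count / vl, trip_count % vl };
    return s;
}

/* Vectorization efficiency as a fraction: estimated gain divided by VL. */
double vector_efficiency(double gain, int vl)
{
    assert(gain >= 0.0 && vl > 0);
    return gain / vl;
}
```

For a trip count of 47 and a VL of 4, the split is 11 vector iterations plus 3 remainder iterations, so a remainder loop is needed. A trip count of 48 would leave no remainder. Under the same assumptions, a gain of 3.96x corresponds to 99% efficiency.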

Finally, create a read-only snapshot of the current result to compare with other snapshots or share with others for further analysis.

By disambiguating pointers and informing the compiler about their non-aliasing nature, you can improve the efficiency and performance of vectorized loops in your application.

Generating Instructions for the Highest Instruction Set Architecture:

To further improve performance, you can generate code optimized for the highest instruction set available on your compilation host processor. The QxHost option instructs the compiler to generate instructions for the highest available instruction set.
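One way to confirm which instruction set a build actually targets is to inspect the predefined macros the compiler sets at compile time. The following is a minimal sketch with hypothetical names (target_isa, compiled_target_isa, lanes_for). Which of these macros a particular compiler defines under a given option is an assumption to verify against that compiler's documentation.

```c
#include <assert.h>
#include <stddef.h>

/* Name and register width of the widest vector ISA the compiler was told
   to target for this translation unit. */
typedef struct {
    const char *name;
    int vector_bits;
} target_isa;

target_isa compiled_target_isa(void)
{
#if defined(__AVX512F__)
    target_isa t = { "AVX-512", 512 };
#elif defined(__AVX2__)
    target_isa t = { "AVX2", 256 };
#elif defined(__AVX__)
    target_isa t = { "AVX", 256 };
#elif defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    target_isa t = { "SSE2", 128 };
#else
    target_isa t = { "unknown", 0 };
#endif
    return t;
}

/* Number of elements of the given size that fit in one vector register. */
int lanes_for(target_isa t, size_t element_size)
{
    assert(element_size > 0);
    return (int)((size_t)t.vector_bits / 8 / element_size);
}
```

This reports the compile-time target, not what the CPU running the program supports. A binary built for a higher instruction set may fail on an older processor, which is why this kind of build is best run on the machine it was compiled on.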

To assess the impact of this option on performance, follow these steps:

Rebuild the target application using the following command:

build.bat xhost

This command applies the compiler options: /O2 /Qstd=c99 /fp:fast /Isrc /Zi /Qopenmp /DNOALIAS /QxHost.
After rebuilding, run the Vectorization and Code Insights perspective.

Running Vectorization and Code Insights:

Open the project in the GUI:
advisor-gui .\vec_samples
In the Analysis Workflow pane for the Vectorization and Code Insights perspective, set the data collection accuracy level to Medium.
At this accuracy level, Intel Advisor collects Survey and Characterization (Trip Counts) data.
Run the perspective.

Viewing the Results:

Check the changes in the Summary window and open the Survey Report to assess the application's performance. Note the following observations:

The Elapsed time metric is likely to improve, indicating better overall performance.
The values in the Vector ISA (Instruction Set Architecture) and VL (Vector Length) columns in the top pane are expected to change, reflecting the utilization of the highest available instruction set and its corresponding vector length.

Creating a Read-only Snapshot:

Click the camera icon in the GUI and save a snapshot with the name "snapshot_xhost" to preserve the current result for future reference or comparison.

By generating instructions for the highest instruction set architecture available on your compilation host processor, you can potentially unlock additional performance improvements in your application.

By following the steps outlined in this guide, you can effectively install Intel Advisor, analyze a sample application, and identify areas where vectorization can significantly impact performance. Through unpacking and building the application, establishing a performance baseline, disambiguating pointers, and generating instructions for the highest instruction set architecture, you can optimize your code for improved vectorization.
