High-performance Computing in C++
Single Instruction Multiple Data (SIMD)
Multi-core CPUs and multithreading: declarative (OpenMP), imperative (std::thread and pthreads), Intel TBB, and Microsoft PPL
With these, common mathematical functions can be executed with maximum efficiency.
Machine clusters: MPI (Message Passing Interface)
Custom hardware: GPGPU with OpenCL, CUDA, or C++ AMP
Hardware accelerators: Intel Xeon Phi, ASIC/FPGA architectures
Single Instruction Multiple Data (SIMD)
Motivation:
Different processor instructions take different numbers of CPU cycles to execute, and the cost also depends on the data type used in the calculation. Ordered by increasing cycle count:
ADD < MULTIPLY < DIVIDE < SQUARE ROOT
Consider the equation ax² + bx + x(a + b): evaluating it consumes a significant number of CPU cycles, and a lot of effort goes into optimizing such expressions. Operations performed on arrays or vectors consume a substantial number of cycles as well.
Registers: SIMD gains its speed from large registers, e.g. 128-bit registers. One such register can be split into 16 bytes, 8 shorts, 4 ints/floats, or 2 doubles to hold the values for a computation. CPUs provide several registers for SIMD, varying in size from 128 to 512 bits; with larger registers, more data can be packed for effective, fast computation. This comes at a cost, however: we have to be careful, and we may run into incompatibility issues, since different CPU vendors ship different configurations and different library support.
SIMD Technology:
AMD: 3DNow!; Intel: MMX, SSE; both: AVX
Streaming SIMD Extensions (SSE): in this architecture, the 128-bit registers are named xmm0 through xmm7.
We can refer to each register by name and access it accordingly.
AVX (Advanced Vector Extensions): extends the 128-bit xmm registers to 256-bit ymm registers (bits 128 to 255 form the new upper half), and AVX-512 later added 512-bit zmm registers (bits 256 to 511).
Instructions:
SIMD is supported through special instructions. These instructions generally cover integral and floating-point operations, concentrating mainly on mathematical operations such as multiplication.
We can distinguish between scalar and packed-data instructions. Scalar instructions operate on a single value, such as one single-precision float; when the input is an array of values, packed-data instructions process several of them at once.
There are three ways to use these instructions: 1) inline assembly, 2) intrinsics, and 3) compiler vectorization.
Let's consider the following example to illustrate inline assembly usage.
#include &lt;cstdint&gt;
#include &lt;iostream&gt;
#include &lt;string&gt;
using namespace std;

// MSVC x86 inline assembly: cpuid with eax = 0 returns the
// CPU vendor string in ebx, edx, ecx (in that order).
string get_cpu_name() {
    uint32_t data[4] = {0};   // data[3] stays 0 to null-terminate the string
    _asm {
        mov eax, 0;           // leaf 0: vendor string
        cpuid;
        mov data[0], ebx;     // offsets are in bytes in MASM syntax
        mov data[4], edx;
        mov data[8], ecx;
    }
    return string((const char*)data);
}
void assembler() {
    cout << "\n CPU Name: " << get_cpu_name();
}
int main(int argc, char* argv[]) {
    assembler();
    getchar();
    return 0;
}
Let's consider an example of packed data and packed-data instructions.
void assembler() {
    float f1[] = {1.f, 2.f, 3.f, 4.f};
    float f2[] = {4.f, 3.f, 2.f, 1.f};
    float result[4] = {0.f};
    _asm {
        movups xmm1, f1;      // move the unaligned packed data
        movups xmm2, f2;
        mulps  xmm1, xmm2;    // multiply all four lanes at once
        movups result, xmm1;  // store the products back to memory
    }
    for (size_t i = 0; i < 4; i++) {
        cout << result[i] << endl;
    }
}
Inline assembly is painful: such code is difficult to write and to maintain. Intrinsics solve this by providing a C++ wrapper around the instructions and registers.
They provide data types of the form __m&lt;bits&gt;&lt;type postfix&gt;, for example __m128i or __m256d.
The type postfix is i for integer data, d for double precision, and empty for single-precision floats.
The instructions themselves are wrapped in functions. Ordinary arithmetic operators are not supported on these types, so you call the wrapper functions instead.
The following example provides some details about it.
void intrinsics()
{
    auto a = _mm_set_ps(1, 2, 3, 4);
    auto b = _mm_set_ps(4, 3, 2, 1);
    auto result = _mm_add_ps(a, b);  // packed addition of all four lanes
    // To read the first lane (MSVC-specific union member):
    float f = result.m128_f32[0];    // first value after the addition
}
Let's consider optimizing a calculation over a huge data set, using random numbers as input.
void simple_mad(float *a, float *b, float *c, float *result, int length)
{
    for (int i = 0; i < length; i++)
    {
        result[i] = a[i] * b[i] + c[i];
    }
}
void optimization()
{
const int length = 1024 * 1024 * 64;
float *a = new float[length];
float *b = new float[length];
float *c = new float[length];
float *result = new float[length];
mt19937_64 rng(random_device{}());
uniform_real_distribution<float> dist(0, 1);
for (size_t i = 0; i < length; i++)
{
a[i] = dist(rng);
b[i] = dist(rng);
c[i] = dist(rng);
}
// call the custom simple_mad() function
// and measure its execution time
using namespace std::chrono;
auto begin = high_resolution_clock::now();
simple_mad(a, b, c, result, length);
auto end = high_resolution_clock::now();
cout << "\n Time : " << duration_cast<milliseconds>(end - begin).count() << " ms";
// delete the allocated space
delete[] a;
delete[] b;
delete[] c;
delete[] result;
}
The above code took 686 ms to execute with optimization disabled; inspecting the generated .asm file showed no vectorized operations, just plain scalar assembly. I then enabled optimization (the O3 level provided by the Intel compiler) and set loop unrolling to zero in my Visual Studio settings so the compiler could vectorize the loops for better performance. The Intel compiler also emits optimization reports, which are very interesting to read for more information. Generate the .asm file again and check the add instruction in our custom function: if packed instructions are used, the code has been vectorized and takes considerably less time than the earlier run. For more information, please refer to the Intel developer guide.
(Screenshots of the compiler configuration used for the compilation were shown here.)
Next: OpenMP, MPI and C++ AMP.
Please share your suggestions, and if you have fixed any performance-related issues in a C++ project, provide some details.