High-performance Computing in C++


Single Instruction Multiple Data (SIMD)

Multi-core CPUs and multithreading: declarative (OpenMP), imperative (std::thread and pthread), Intel TBB, and Microsoft PPL

Common mathematical functions can be executed with maximum efficiency.

Machine cluster: MPI - Message Passing Interface

Custom Hardware: GPGPU: OpenCL, CUDA, C++ AMP

Hardware accelerators: Intel Xeon Phi, ASIC/FPGA architectures

Single Instruction Multiple Data (SIMD)

Motivation:

Different processor instructions take different numbers of CPU cycles to execute, and the cost also depends on the data type used in the calculation. Let's order the common operations by the number of CPU cycles they need.

ADD < MULTIPLY < DIVIDE < SQUARE ROOT

Consider the equation ax² + bx + x(a + b): evaluating it consumes a fair number of CPU cycles, and a lot of effort is spent optimizing such expressions. Operations performed on whole arrays or vectors consume even more cycles.
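As a small side note (simple algebra, nothing SIMD-specific), the expression can be rearranged into x(a(x + 1) + 2b), which reduces the multiplications from four to two at the cost of an extra addition. A minimal sketch; the function names are only for illustration:

#include <iostream>

// Naive evaluation of a*x^2 + b*x + x*(a + b): four multiplications, two additions.
float naive(float a, float b, float x) {
    return a * x * x + b * x + x * (a + b);
}

// Algebraically equivalent form x * (a * (x + 1) + b + b): two multiplications, three additions.
float rearranged(float a, float b, float x) {
    return x * (a * (x + 1) + b + b);
}

int main() {
    std::cout << naive(2.f, 3.f, 5.f) << " == " << rearranged(2.f, 3.f, 5.f) << "\n"; // both print 90
    return 0;
}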

Registers: SIMD gets its speed from large registers, for example 128-bit registers. A 128-bit register can hold 16 bytes, 8 shorts, 4 ints/floats, or 2 doubles for computation. CPUs provide several registers for SIMD, and their size varies from 128 to 512 bits; the larger the register, the more data can be packed for effective, fast computation. This comes with some cost: we have to be careful, or we may end up with incompatibility issues, since different CPU vendors ship different configurations and different library support.
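To visualize this packing, here is a minimal sketch using a plain C++ union (only an illustration of the layout; the real SIMD registers are accessed through intrinsics or assembly, not through this type):

#include <cstdint>
#include <cstdio>

// The same 128 bits (16 bytes) can be viewed as 16 bytes, 8 shorts,
// 4 ints/floats, or 2 doubles -- exactly how a 128-bit SIMD register packs data.
union Register128 {
    int8_t  bytes[16];
    int16_t shorts[8];
    int32_t ints[4];
    float   floats[4];
    double  doubles[2];
};

int main() {
    Register128 r{};
    static_assert(sizeof(r) == 16, "128 bits = 16 bytes");
    std::printf("bytes: %zu, shorts: %zu, floats: %zu, doubles: %zu\n",
                sizeof(r.bytes) / sizeof(r.bytes[0]),
                sizeof(r.shorts) / sizeof(r.shorts[0]),
                sizeof(r.floats) / sizeof(r.floats[0]),
                sizeof(r.doubles) / sizeof(r.doubles[0]));
    return 0;
}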

SIMD Technology:

AMD: 3DNow!, Intel: MMX and SSE, Both: AVX

Streaming SIMD Extensions (SSE): in this architecture, the 128-bit registers have names like xmm0 to xmm7.

We can refer to each register by name and access it accordingly.

AVX: Advanced Vector Extensions

AVX extends the 128-bit xmm registers to 256-bit ymm registers (adding bits 128-255), and AVX-512 extends them further to 512-bit zmm registers (bits 256-511).
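A minimal sketch of touching a 256-bit ymm register through intrinsics (this assumes a CPU with AVX and that AVX code generation is enabled, e.g. /arch:AVX in Visual Studio or -mavx with GCC/Clang):

#include <immintrin.h>      // AVX intrinsics
#include <iostream>

int main() {
    // Eight packed floats fill one 256-bit ymm register.
    __m256 a = _mm256_set_ps(8, 7, 6, 5, 4, 3, 2, 1);
    __m256 b = _mm256_set_ps(1, 2, 3, 4, 5, 6, 7, 8);
    __m256 sum = _mm256_add_ps(a, b);   // one vaddps adds all eight lanes

    float out[8];
    _mm256_storeu_ps(out, sum);         // store the packed result to memory
    for (float f : out)
        std::cout << f << " ";          // prints 9 eight times
    return 0;
}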

Instructions:

SIMD is supported through special instructions. These instructions generally cover integral and floating-point operations and concentrate mainly on mathematical operations such as multiplication.

We can distinguish between scalar and packed-data instructions. Scalar instructions operate on a single value, such as one single-precision float; arrays of values are handled by packed-data instructions.
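To make the scalar/packed distinction concrete, here is a small sketch contrasting the scalar add (the addss instruction, wrapped by _mm_add_ss) with the packed add (addps, wrapped by _mm_add_ps); intrinsics themselves are covered in more detail below:

#include <xmmintrin.h>      // SSE intrinsics
#include <iostream>

int main() {
    __m128 a = _mm_set_ps(40.f, 30.f, 20.f, 10.f);  // lanes: 10, 20, 30, 40
    __m128 b = _mm_set_ps(4.f, 3.f, 2.f, 1.f);      // lanes: 1, 2, 3, 4

    __m128 scalar = _mm_add_ss(a, b);   // addss: only the lowest lane is added -> 11, 20, 30, 40
    __m128 packed = _mm_add_ps(a, b);   // addps: all four lanes are added      -> 11, 22, 33, 44

    float s[4], p[4];
    _mm_storeu_ps(s, scalar);
    _mm_storeu_ps(p, packed);
    for (int i = 0; i < 4; i++)
        std::cout << s[i] << " / " << p[i] << "\n";
    return 0;
}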

There are three ways to use these instructions: 1) inline assembly, 2) intrinsics, and 3) compiler vectorization.

  • Inline Assembly

Let's consider the following example to illustrate inline assembly usage.

#include <iostream>
#include <string>
#include <cstdint>
using namespace std;

string get_cpu_name() {
    uint32_t data[4] = {0};      // the unused fourth element keeps the string null-terminated
    _asm {
        mov eax, 0;              // cpuid leaf 0 returns the CPU vendor string
        cpuid;
        mov data[0], ebx;        // the vendor string is packed into ebx, edx, ecx
        mov data[4], edx;
        mov data[8], ecx;
    }
    return string((const char*)data);
}

void assembler() {
    cout << "\n CPU Name: " << get_cpu_name();
}

int main(int argc, char* argv[]) {
    assembler();
    getchar();
    return 0;
}


Now let's look at an example using packed data and packed-data instructions.

void assembler() {
    float f1[] = {1.f, 2.f, 3.f, 4.f};
    float f2[] = {4.f, 3.f, 2.f, 1.f};
    float result[4] = {0.f};
    _asm {
        movups xmm1, f1;        // move the unaligned packed data into the xmm registers
        movups xmm2, f2;
        mulps xmm1, xmm2;       // multiply all four packed floats in one instruction
        movups result, xmm1;    // store the packed product back to memory
    }
    for (size_t i = 0; i < 4; i++) {
        cout << result[i] << endl;
    }
}

  • Intrinsics:

Inline assembly is painful: it is difficult to write and hard to maintain. Intrinsics provide a C++ wrapper around the SIMD instructions and registers.

Intrinsics come with their own data types of the form __m<bits><type postfix>, for example __m128i or __m256d.

The type postfix can be:

  • i -- the memory is treated as an array of ints
  • d -- treated as an array of doubles
  • (empty) -- treated as an array of floats

The instructions themselves are wrapped in functions. The usual arithmetic operators are not supported on these types, so the wrapper functions are used instead.

The following example provides some details about it.

#include <intrin.h>     // SSE intrinsics (MSVC)

void intrinsics()
{
    auto a = _mm_set_ps(1, 2, 3, 4);
    auto b = _mm_set_ps(4, 3, 2, 1);
    auto result = _mm_add_ps(a, b);     // packed addition of all four floats
    // if we wish to get the first indexed value
    float f = result.m128_f32[0];       // gives the first value after the addition
}
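Note that the m128_f32 member used above is specific to the Microsoft compiler. A more portable way to read the lanes back (a small sketch, assuming the SSE headers are available) is to store the register into an ordinary float array:

#include <xmmintrin.h>

void intrinsics_portable() {
    __m128 a = _mm_set_ps(1, 2, 3, 4);
    __m128 b = _mm_set_ps(4, 3, 2, 1);
    __m128 result = _mm_add_ps(a, b);

    float values[4];
    _mm_storeu_ps(values, result);   // copy all four lanes to memory
    float first = values[0];         // same value as result.m128_f32[0] on MSVC
    (void)first;                     // silence the unused-variable warning
}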

  • Compiler Vectorization:

Let's consider optimizing a calculation over a huge data set filled with random numbers.


#include <iostream>
#include <random>
#include <chrono>
using namespace std;

void simple_mad(float *a, float *b, float *c, float *result, int length)
{
    for (int i = 0; i < length; i++)
    {
        result[i] = a[i] * b[i] + c[i];
    }
}

void optimization()
{
    const int length = 1024 * 1024 * 64;
    float *a = new float[length];
    float *b = new float[length];
    float *c = new float[length];
    float *result = new float[length];

    mt19937_64 rng(random_device{}());
    uniform_real_distribution<float> dist(0, 1);

    for (int i = 0; i < length; i++)
    {
        a[i] = dist(rng);
        b[i] = dist(rng);
        c[i] = dist(rng);
    }

    // call the custom simple_mad() function and measure its execution time
    using namespace std::chrono;
    auto begin = high_resolution_clock::now();
    simple_mad(a, b, c, result, length);
    auto end = high_resolution_clock::now();
    cout << "\n Time : " << duration_cast<milliseconds>(end - begin).count() << " ms";

    // delete the allocated space
    delete[] a;
    delete[] b;
    delete[] c;
    delete[] result;
}

The above code took 686 ms to execute with optimization disabled; checking the generated .asm file shows no vectorized operations, just plain assembly code. I then enabled the O3 optimization level provided by the Intel compiler and set the loop-unrolling option to zero in my Visual Studio settings so the loops are handled for better performance. The Intel compiler also produces optimization reports; they are very interesting to read, and you can refer to them for more information. Generate the .asm file again and check for the packed add instruction in our custom function: if it is present, the code has been vectorized, and it takes less time than the earlier run. For more information, please refer to the Intel developer guide.
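For comparison, the vectorized loop the compiler generates corresponds roughly to what we would write by hand with SSE intrinsics. A sketch (assuming length is a multiple of 4; unaligned loads and stores are used so no alignment guarantee is needed):

#include <xmmintrin.h>

void simple_mad_sse(const float *a, const float *b, const float *c,
                    float *result, int length)
{
    // Process four floats per iteration -- the same shape the compiler
    // produces when it auto-vectorizes simple_mad().
    for (int i = 0; i < length; i += 4)
    {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        __m128 vr = _mm_add_ps(_mm_mul_ps(va, vb), vc);  // a*b + c on four lanes at once
        _mm_storeu_ps(result + i, vr);
    }
}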

Some screenshots showing the compiler configuration used:


Next: OpenMP, MPI, and C++ AMP.

Please share your suggestions, and if you have fixed any performance-related issues in your C++ projects, please provide some details.
