High-performance Computing in C++
Single Instruction Multiple Data (SIMD)
Multi-core CPUs and multithreading: declarative (OpenMP), imperative (std::thread and pthreads), Intel TBB, and Microsoft PPL
With these, common mathematical functions can be executed with maximum efficiency.
Machine clusters: MPI (Message Passing Interface)
Custom hardware: GPGPU with OpenCL, CUDA, or C++ AMP
Hardware accelerators: Intel Xeon Phi, ASIC/FPGA architectures
Single Instruction Multiple Data (SIMD)
Motivation:
Different processor instructions take different numbers of CPU cycles to execute, and the cost also depends on the data type used in the calculation. Ordered by increasing cycle count:
ADD < MULTIPLY < DIVIDE < SQUARE ROOT
Consider the equation ax² + bx + x(a + b): evaluating it consumes a significant number of CPU cycles, and a lot of effort goes into optimizing such expressions. Operations performed on arrays or vectors consume a substantial number of cycles as well.
Registers: SIMD gains its speed from large registers, e.g. 128-bit registers. One such register can be split into 16 bytes, 8 shorts, 4 ints/floats, or 2 doubles to hold the values for a computation. CPUs provide several registers for SIMD, varying in size from 128 to 512 bits; with larger registers, more data can be packed for effective, fast computation. This comes at a cost, however: we have to be careful, and we may run into incompatibility issues, since different CPU vendors ship different configurations and different library support.
SIMD Technology:
AMD: 3DNow!; Intel: MMX, SSE; both: AVX
Streaming SIMD Extensions (SSE): in this architecture, the 128-bit registers are named xmm0 through xmm7.
We can refer to each register by name and access it accordingly.
AVX (Advanced Vector Extensions): extends the 128-bit xmm registers to 256-bit ymm registers (bits 128 to 255 form the new upper half), and AVX-512 later added 512-bit zmm registers (bits 256 to 511).
Instructions:
SIMD is supported through special instructions. These instructions generally cover integral and floating-point operations, concentrating mainly on mathematical operations such as multiplication.
We can distinguish between scalar and packed-data instructions. Scalar instructions operate on a single value, such as one single-precision float; when the input is an array of values, packed-data instructions process several of them at once.
There are three ways to use these instructions: 1) inline assembly, 2) intrinsics, and 3) compiler vectorization.
Let's consider the following example to illustrate inline assembly usage.
#include &lt;cstdint&gt;
#include &lt;iostream&gt;
#include &lt;string&gt;
using namespace std;

// MSVC x86 inline assembly: cpuid with eax = 0 returns the
// CPU vendor string in ebx, edx, ecx (in that order).
string get_cpu_name() {
    uint32_t data[4] = {0};   // data[3] stays 0 to null-terminate the string
    _asm {
        mov eax, 0;           // leaf 0: vendor string
        cpuid;
        mov data[0], ebx;     // offsets are in bytes in MASM syntax
        mov data[4], edx;
        mov data[8], ecx;
    }
    return string((const char*)data);
}
void assembler() {
    cout << "\n CPU Name: " << get_cpu_name();
}
int main(int argc, char* argv[]) {
    assembler();
    getchar();
    return 0;
}
Let's consider an example of packed data and packed-data instructions.
void assembler() {
    float f1[] = {1.f, 2.f, 3.f, 4.f};
    float f2[] = {4.f, 3.f, 2.f, 1.f};
    float result[4] = {0.f};
    _asm {
        movups xmm1, f1;      // move the unaligned packed data
        movups xmm2, f2;
        mulps  xmm1, xmm2;    // multiply all four lanes at once
        movups result, xmm1;  // store the products back to memory
    }
    for (size_t i = 0; i < 4; i++) {
        cout << result[i] << endl;
    }
}
Inline assembly is painful: such code is difficult to write and to maintain. Intrinsics solve this by providing a C++ wrapper around the instructions and registers.
They provide data types of the form __m&lt;bits&gt;&lt;type postfix&gt;, for example __m128i or __m256d.
The type postfix is i for integer data, d for double precision, and empty for single-precision floats.
The instructions themselves are wrapped in functions. Ordinary arithmetic operators are not supported on these types, so you call the wrapper functions instead.
The following example provides some details about it.
void intrinsics()
{
    auto a = _mm_set_ps(1, 2, 3, 4);
    auto b = _mm_set_ps(4, 3, 2, 1);
    auto result = _mm_add_ps(a, b);  // packed addition of all four lanes
    // To read the first lane (MSVC-specific union member):
    float f = result.m128_f32[0];    // first value after the addition
}
Let's consider optimizing a calculation over a huge data set, using random numbers as input.
void simple_mad(float *a, float *b, float *c, float *result, int length)
{
    for (int i = 0; i < length; i++)
    {
        result[i] = a[i] * b[i] + c[i];
    }
}
void optimization()
{
const int length = 1024 * 1024 * 64;
float *a = new float[length];
float *b = new float[length];
float *c = new float[length];
float *result = new float[length];
mt19937_64 rng(random_device{}());
uniform_real_distribution<float> dist(0, 1);
for (size_t i = 0; i < length; i++)
{
a[i] = dist(rng);
b[i] = dist(rng);
c[i] = dist(rng);
}
// call the custom simple_mad() function
// and measure its execution time
using namespace std::chrono;
auto begin = high_resolution_clock::now();
simple_mad(a, b, c, result, length);
auto end = high_resolution_clock::now();
cout << "\n Time : " << duration_cast<milliseconds>(end - begin).count() << " ms";
// delete the allocated space
delete[] a;
delete[] b;
delete[] c;
delete[] result;
}
The above code took 686 ms to execute with optimization disabled; inspecting the generated .asm file showed no vectorized operations, just plain scalar assembly. I then enabled optimization (the O3 level provided by the Intel compiler) and set loop unrolling to zero in my Visual Studio settings so the compiler could vectorize the loops for better performance. The Intel compiler also emits optimization reports, which are very interesting to read for more information. Generate the .asm file again and check the add instruction in our custom function: if packed instructions are used, the code has been vectorized and takes considerably less time than the earlier run. For more information, please refer to the Intel developer guide.
(Screenshots of the compiler configuration used for the compilation were shown here.)
Next: OpenMP, MPI and C++ AMP.
Please share your suggestions, and if you have fixed any performance-related issues in a C++ project, provide some details.