Optimizing C++ - N-Body Problem, Reloaded (with CUDA)
As I promised, here is my first approach to offloading the N-Body problem with CUDA. GPUs are not only for games today; they are also a powerful weapon for solving problems full of floating-point and integer operations that exhibit a lot of parallelism. The N-Body problem is no exception, and it is a good case for feeding the GPU since the computation for each tracked particle is independent of the others. The bad feature of this O(N^2) problem is that the force on every particle depends on the positions of all the other particles at a given instant of time; this means that the GPU must hold the whole problem in global memory, and the problem does not scale easily to several GPUs.
CUDA is a computing platform and programming model that allows expressing parallelism in an elegant and simple way in order to execute code on GPUs. Developed by NVIDIA, it only works on NVIDIA GPUs; OpenCL is an alternative for GPU programming that is suitable for any graphics device. Unlike OpenACC, CUDA allows much finer hardware control, which sometimes makes it possible to achieve higher throughput. Some programmers complain about CUDA in this sense, but that is the price one has to pay in order to get the best performance out of a given piece of hardware.
So, let's take a look at our case. The function integrate() written in main.cpp has the following structure:
void Problem::integrate() {
    // Expensive loop
    for (int pi = 0; pi < mNumParticles; pi++) {
        for (int pj = 0; pj < mNumParticles; pj++) {
            ...
        }
    }

    // Cheap loop
    for (int pi = 0; pi < mNumParticles; pi++) {
        ...
    }
}
What we need now is to understand the working principle of GPUs. Basically, these devices are made of several streaming multiprocessors (SMs), each with tens or hundreds of cores that operate on different data following the Single Instruction Multiple Data (SIMD) model of parallelism. CUDA lets you express this SIMD parallelism through a series of keywords in a C-style way; a program ported to CUDA is, strictly speaking, no longer C nor C++. The CUDA compiler, nvcc, compiles the parts that run on the CPU as C and/or C++ and the parts that run on the GPU to machine code for the graphics device. In the CUDA version, main-cuda.cpp, I replaced the CPU function integrate() with this CUDA kernel called integrateKernel():
__global__ void integrateKernel(Particle *particles, int numParticles,
                                double GM2, double inverseMass, double dt) {
    // Expensive loop: each thread handles a strided subset of the particles
    for (int pi = blockIdx.x * blockDim.x + threadIdx.x; pi < numParticles;
         pi += blockDim.x * gridDim.x) {
        for (int pj = 0; pj < numParticles; pj++) {
            ...
        }
    }
    __syncthreads();

    // Cheap loop
    for (int pi = blockIdx.x * blockDim.x + threadIdx.x; pi < numParticles;
         pi += blockDim.x * gridDim.x) {
        ...
    }
    __syncthreads();
}
Note that I added the __global__ qualifier to the function signature. The keywords blockIdx.x, blockDim.x, and threadIdx.x allow CUDA to map each thread's instructions onto multiple data. With this, a single thread (thousands of them are executed at the same time on the GPU) computes only a small subset of the total number of particles. The bad thing about this implementation is that it does not make good use of local memory, also called shared memory in CUDA terms: each thread in a thread block keeps requesting new data from the GPU's global memory all the time.
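For reference, the usual way to fix that is the classic tiling pattern, where each block stages a chunk of particles in shared memory and every thread reuses it. The sketch below is not this project's code; the TILE size, the integrateKernelTiled name, and the assumption of one thread per particle (i.e. a launch with (numParticles + TILE - 1) / TILE blocks of TILE threads) are mine:

// Sketch of a shared-memory tiled version (standard N-body tiling pattern,
// not the project's kernel). Assumes one thread per particle at launch.
#define TILE 64

__global__ void integrateKernelTiled(Particle *particles, int numParticles,
                                     double GM2, double inverseMass, double dt) {
    __shared__ Particle tile[TILE];

    int pi = blockIdx.x * blockDim.x + threadIdx.x;

    for (int base = 0; base < numParticles; base += TILE) {
        // Every thread of the block loads one particle of the current tile.
        int j = base + threadIdx.x;
        if (j < numParticles) tile[threadIdx.x] = particles[j];
        __syncthreads();

        // Threads that own a real particle accumulate over the tile held
        // in fast shared memory instead of re-reading global memory.
        if (pi < numParticles) {
            int count = min(TILE, numParticles - base);
            for (int pj = 0; pj < count; pj++) {
                // ... accumulate the contribution of tile[pj] on particle pi
            }
        }
        __syncthreads();  // keep the tile valid until all threads are done with it
    }
}

The loop over base has the same trip count for every thread of the block, so the two __syncthreads() calls are reached uniformly, which is required for this pattern to be safe.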
To call this kernel I simply re-created the old function integrate(), but in this case it calls the CUDA kernel from inside:
void Problem::integrate() {
    integrateKernel<<<8 * mNumSMs, 64>>>(d_mParticles, mNumParticles,
                                         mGM2, mInverseMass, mDt);
}
Here mNumSMs is the number of streaming multiprocessors in the GPU. This number depends on the GPU model; in my case, the GeForce GTX 1050 I am using has 6 SMs.
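If you prefer not to hard-code that number for each GPU model, it can be queried at runtime through the CUDA runtime API. A minimal sketch, assuming a single-GPU machine (device 0) and a helper name of my choosing:

#include <cuda_runtime.h>

// Query the number of streaming multiprocessors of device 0 at runtime
// instead of hard-coding it per GPU model.
int queryNumSMs() {
    int numSMs = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
    return numSMs;
}

// Usage sketch: integrateKernel<<<8 * queryNumSMs(), 64>>>(...);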
The setup for this case is the same as in the previous article: 5 time steps and 100K particles. This will allow me to compare the performance of CUDA with OpenACC.
Now, compiling and running with NVIDIA's profiling tool, nvprof, we have:
$ nvcc -x cu main-cuda.cpp -o main-cuda -O3
$ nvprof ./main-cuda
...
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  37.5210s      5  7.50419s  7.45996s  7.52978s  integrateKernel(Particle*, int, double, double, double)
                    0.00%  1.3059ms      1  1.3059ms  1.3059ms  1.3059ms  [CUDA memcpy HtoD]
      API calls:   99.50%  37.5210s      1  37.5210s  37.5210s  37.5210s  cudaFree
                    0.49%  185.88ms      1  185.88ms  185.88ms  185.88ms  cudaMalloc
                    0.00%  1.4589ms      1  1.4589ms  1.4589ms  1.4589ms  cudaMemcpy
...
Note that 100% of the time (rounded) is spent in the integrate() function (i.e. the CUDA kernel). This means that data movement from host to device and vice versa is not the main concern in this particular algorithm. This is explained by the fact that we send N particles, i.e. O(N) data, to perform O(N^2) computations. In a more efficient algorithm, the data movement would probably start contributing more to the total execution time.
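For completeness, the single host-to-device copy in the profile corresponds to moving the particle array to the GPU once before the time loop. The sketch below only illustrates that setup; the host member name mParticles and the exact placement are my assumptions, not necessarily what main-cuda.cpp does:

// Sketch of the one-off device allocation and copy for the particle array.
size_t bytes = mNumParticles * sizeof(Particle);
cudaMalloc((void **)&d_mParticles, bytes);                            // the cudaMalloc call in the profile
cudaMemcpy(d_mParticles, mParticles, bytes, cudaMemcpyHostToDevice);  // the single HtoD copy

// ... run the 5 time steps, each one a launch of integrateKernel ...

cudaMemcpy(mParticles, d_mParticles, bytes, cudaMemcpyDeviceToHost);  // copy results back
cudaFree(d_mParticles);

Since kernel launches are asynchronous, the final cudaFree blocks until all the kernels have finished, which is why it appears to take 37.5 s in the API-call section of the profile.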
The next figure summarizes the results of the CUDA execution in comparison to our single-core CPU, the OpenMP and the OpenACC executions obtained in the previous articles:
Note that the speedup with CUDA is slightly better than with OpenACC. I didn't optimize the CUDA kernel much, since my first objective was to port the code and get performance at least similar to, or better than, OpenACC. Even if I dedicate more time to optimizing the CUDA part, it is not obvious that I will be able to improve it much. OpenACC plays a nice game here, since its performance is almost the same as CUDA's with much less programming effort. We can expect more from OpenACC in the future, and also from OpenMP, which now allows offloading code to GPUs.