Optimizing C++ - N-Body Problem, Reloaded (with CUDA)
As I promised, here is my first approach to offloading the N-Body problem with CUDA. GPUs are not only for games today; they are also a powerful weapon for solving problems full of floating-point and integer operations that exhibit a lot of parallelism. The N-Body problem is no exception, and it is a good case for feeding the GPU since the computation for each tracked particle is independent of the others. The bad feature of this O(N^2) problem is that the force on every particle depends on the positions of all the other particles at a given instant of time; this means that the GPU must hold the whole problem in global memory, and the problem does not scale easily to several GPUs.
CUDA is a computing platform and programming model that allows expressing parallelism in an elegant and simple way in order to execute code on GPUs. Developed by NVIDIA, it only works on NVIDIA GPUs; OpenCL is an alternative for GPU programming that is suitable for any graphics device. Unlike OpenACC, CUDA allows much finer hardware control, which sometimes makes it possible to achieve higher throughput. Some programmers complain about CUDA in this sense, but that is the price one has to pay in order to get the best performance out of a given piece of hardware.
So, let's take a look at our case. The function integrate() written in main.cpp has the following structure:
void Problem::integrate() {
    // Expensive loop
    for (int pi = 0; pi < mNumParticles; pi++) {
        for (int pj = 0; pj < mNumParticles; pj++) {
            ...
        }
    }

    // Cheap loop
    for (int pi = 0; pi < mNumParticles; pi++) {
        ...
    }
}
What we need now is to understand the working principle of GPUs. Basically, these devices are made of several streaming multiprocessors (SMs), each with tens or hundreds of cores that operate on different data following the Single Instruction Multiple Data (SIMD) model of parallelism. CUDA lets you express this SIMD parallelism through a series of keywords in a C-style way; a program ported to CUDA is, strictly speaking, no longer C nor C++. The CUDA compiler, nvcc, compiles the parts that run on the CPU as C and/or C++ and the parts that run on the GPU to machine code for the graphics device. In the CUDA version, main-cuda.cpp, I replaced the CPU function integrate() with this CUDA kernel called integrateKernel():
__global__ void integrateKernel(Particle *particles, int numParticles,
                                double GM2, double inverseMass, double dt) {
    // Expensive loop: each thread handles a strided subset of the particles
    for (int pi = blockIdx.x * blockDim.x + threadIdx.x; pi < numParticles;
         pi += blockDim.x * gridDim.x) {
        for (int pj = 0; pj < numParticles; pj++) {
            ...
        }
    }
    __syncthreads();

    // Cheap loop
    for (int pi = blockIdx.x * blockDim.x + threadIdx.x; pi < numParticles;
         pi += blockDim.x * gridDim.x) {
        ...
    }
    __syncthreads();
}
Note that I added the __global__ qualifier to the function signature. The keywords blockIdx.x, blockDim.x, and threadIdx.x allow CUDA to map each thread's instructions onto multiple data. With this, a single thread (thousands of them are executed at the same time on the GPU) computes only a small subset of the total number of particles. The bad thing about this implementation is that it does not make good use of local memory, also called shared memory in CUDA terms: each thread in a thread block keeps requesting new data from the GPU's global memory all the time.
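For reference, the usual way to fix that is the classic tiling pattern, where each block stages a chunk of particles in shared memory and every thread reuses it. The sketch below is not this project's code; the TILE size, the integrateKernelTiled name, and the assumption of one thread per particle (i.e. a launch with (numParticles + TILE - 1) / TILE blocks of TILE threads) are mine:

// Sketch of a shared-memory tiled version (standard N-body tiling pattern,
// not the project's kernel). Assumes one thread per particle at launch.
#define TILE 64

__global__ void integrateKernelTiled(Particle *particles, int numParticles,
                                     double GM2, double inverseMass, double dt) {
    __shared__ Particle tile[TILE];

    int pi = blockIdx.x * blockDim.x + threadIdx.x;

    for (int base = 0; base < numParticles; base += TILE) {
        // Every thread of the block loads one particle of the current tile.
        int j = base + threadIdx.x;
        if (j < numParticles) tile[threadIdx.x] = particles[j];
        __syncthreads();

        // Threads that own a real particle accumulate over the tile held
        // in fast shared memory instead of re-reading global memory.
        if (pi < numParticles) {
            int count = min(TILE, numParticles - base);
            for (int pj = 0; pj < count; pj++) {
                // ... accumulate the contribution of tile[pj] on particle pi
            }
        }
        __syncthreads();  // keep the tile valid until all threads are done with it
    }
}

The loop over base has the same trip count for every thread of the block, so the two __syncthreads() calls are reached uniformly, which is required for this pattern to be safe.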
To call this kernel I simply re-created the old function integrate(), but in this case it calls the CUDA kernel from inside:
void Problem::integrate() {
    integrateKernel<<<8 * mNumSMs, 64>>>(d_mParticles, mNumParticles,
                                         mGM2, mInverseMass, mDt);
}
Here mNumSMs is the number of streaming multiprocessors in the GPU. This number depends on the GPU model; in my case, the GeForce GTX 1050 I am using has 6 SMs.
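If you prefer not to hard-code that number for each GPU model, it can be queried at runtime through the CUDA runtime API. A minimal sketch, assuming a single-GPU machine (device 0) and a helper name of my choosing:

#include <cuda_runtime.h>

// Query the number of streaming multiprocessors of device 0 at runtime
// instead of hard-coding it per GPU model.
int queryNumSMs() {
    int numSMs = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
    return numSMs;
}

// Usage sketch: integrateKernel<<<8 * queryNumSMs(), 64>>>(...);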
The setup for this case is the same as in the previous article: 5 time steps and 100K particles. This will allow me to compare the performance of CUDA with OpenACC.
Now, compiling and running with NVIDIA's profiling tool, nvprof, we have:
$ nvcc -x cu main-cuda.cpp -o main-cuda -O3
$ nvprof ./main-cuda
...
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  37.5210s      5  7.50419s  7.45996s  7.52978s  integrateKernel(Particle*, int, double, double, double)
                    0.00%  1.3059ms      1  1.3059ms  1.3059ms  1.3059ms  [CUDA memcpy HtoD]
      API calls:   99.50%  37.5210s      1  37.5210s  37.5210s  37.5210s  cudaFree
                    0.49%  185.88ms      1  185.88ms  185.88ms  185.88ms  cudaMalloc
                    0.00%  1.4589ms      1  1.4589ms  1.4589ms  1.4589ms  cudaMemcpy
...
Note that 100% of the time (rounded) is spent in the integrate() function (i.e. the CUDA kernel). This means that data movement from host to device and vice versa is not the main concern in this particular algorithm. This is explained by the fact that we send N particles, i.e. O(N) data, to perform O(N^2) computations. In a more efficient algorithm, the data movement would probably start contributing more to the total execution time.
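For completeness, the single host-to-device copy in the profile corresponds to moving the particle array to the GPU once before the time loop. The sketch below only illustrates that setup; the host member name mParticles and the exact placement are my assumptions, not necessarily what main-cuda.cpp does:

// Sketch of the one-off device allocation and copy for the particle array.
size_t bytes = mNumParticles * sizeof(Particle);
cudaMalloc((void **)&d_mParticles, bytes);                            // the cudaMalloc call in the profile
cudaMemcpy(d_mParticles, mParticles, bytes, cudaMemcpyHostToDevice);  // the single HtoD copy

// ... run the 5 time steps, each one a launch of integrateKernel ...

cudaMemcpy(mParticles, d_mParticles, bytes, cudaMemcpyDeviceToHost);  // copy results back
cudaFree(d_mParticles);

Since kernel launches are asynchronous, the final cudaFree blocks until all the kernels have finished, which is why it appears to take 37.5 s in the API-call section of the profile.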
The next figure summarizes the results of the CUDA execution in comparison to our single-core CPU, the OpenMP and the OpenACC executions obtained in the previous articles:
Note that the speedup with CUDA is slightly better than with OpenACC. I didn't optimize the CUDA kernel much, since my first objective was to port the code and get performance at least similar to, or better than, OpenACC. Even if I dedicate more time to optimizing the CUDA part, it is not obvious that I will be able to improve it much. OpenACC plays a nice game here, since its performance is almost the same as CUDA's with much less programming effort. We can expect more from OpenACC in the future, and also from OpenMP, which now allows offloading code to GPUs.