Follow-up to my GPUs as Processors Article
In my earlier post, where I talked about thinking of GPUs as Processors instead of Co-Processors, I mentioned I would follow up with some of the challenges we faced in our recent project. This is the promised follow-up, along with some thoughts on what I think things should look like in the next few years. I focus on Nvidia's GPU technology, although these thoughts do generalize.
Poor compiler optimizations: The nvcc compiler works best with small kernels that fit within a single .cu file (with everything else in header files). It does a noticeably worse job as the code complexity and/or the number of .cu files grows. For example, I once inserted a no-op loop - i.e., replaced
A; B; C;
with
A; for (int i = 0; i < 1; ++i) { B; } C;
resulting in a program that ran measurably faster. This should never happen if the compiler knows what it is doing. I have a few other interesting examples - all of which encourage us to write multiple simpler kernels, and/or insert PTX/assembly code to compensate for the compiler's limitations. Very similar to how CPU compilers were in the '80s.
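To make the example concrete, here is a sketch of the kind of rewrite I mean. The stage_a/stage_b/stage_c device functions are hypothetical placeholders standing in for the real code (the actual kernel was more complex); the point is only that wrapping the middle stage in a single-iteration loop is semantically a no-op, and so should never change the generated code if the optimizer is doing its job.

    // Hypothetical placeholders standing in for the real device code.
    __device__ void stage_a(float* d) { d[threadIdx.x] += 1.0f; }  // placeholder
    __device__ void stage_b(float* d) { d[threadIdx.x] *= 2.0f; }  // placeholder
    __device__ void stage_c(float* d) { d[threadIdx.x] -= 1.0f; }  // placeholder

    // Original structure: A; B; C;
    __global__ void kernel_original(float* data) {
        stage_a(data);
        stage_b(data);
        stage_c(data);
    }

    // Same kernel with B wrapped in a single-iteration (no-op) loop.
    __global__ void kernel_with_noop_loop(float* data) {
        stage_a(data);
        for (int i = 0; i < 1; ++i) {
            stage_b(data);
        }
        stage_c(data);
    }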
Limited availability of libraries: Pretty much all libraries today are provided as kernels - e.g., cuBLAS. There are a few exceptions such as inplace_vector and barrier. But even very basic functionality such as maps is not easily available. For random number generation we have cuRAND, but its usage differs depending on whether it is called from the CPU or the GPU. What we need is a rich family of libraries from multiple sources that can be used in the same manner whether from CPU or GPU.
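As an illustration of the CPU/GPU split, here is a minimal sketch (my own, not taken from our project) of the same task - filling a device buffer with uniform random floats - done once through cuRAND's host API and once through its device API. The two interfaces share little beyond the library name.

    // Build with something like: nvcc rng_demo.cu -lcurand
    #include <cuda_runtime.h>
    #include <curand.h>         // host API
    #include <curand_kernel.h>  // device API

    // Device-side usage: each thread initializes and owns its own generator state.
    __global__ void fill_random(float* out, int n, unsigned long long seed) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;
        curandState state;
        curand_init(seed, tid, 0, &state);
        out[tid] = curand_uniform(&state);
    }

    int main() {
        const int n = 1 << 20;
        float* d_out;
        cudaMalloc(&d_out, n * sizeof(float));

        // Host-side usage: an opaque generator object that fills device memory.
        curandGenerator_t gen;
        curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
        curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
        curandGenerateUniform(gen, d_out, n);
        curandDestroyGenerator(gen);

        // Device-side usage of the same functionality, through a different API.
        fill_random<<<(n + 255) / 256, 256>>>(d_out, n, 1234ULL);
        cudaDeviceSynchronize();

        cudaFree(d_out);
        return 0;
    }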
The SIMT model needs to be generalized: The CUDA programming model has all blocks and threads of a kernel running the same code. However, the only hard requirement (for good performance) is that all threads within a warp run exactly the same code without any divergence. As we start thinking of GPUs more like processors than co-processors, we need to be able to run heterogeneous tasks at the same time on the GPU. The CUDA Programming Guide already provides such examples (e.g., producer-consumer) that run completely different code in different warps - albeit using if statements to achieve this (which feels like a hack). Why not create an abstraction (something between blocks and threads) for this?
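For reference, here is a stripped-down sketch of that if-statement style of warp specialization, with hypothetical produce()/consume() placeholders rather than the Programming Guide's actual example: one warp acts as a producer, the remaining warps consume, all inside a single kernel.

    // Placeholders standing in for real producer/consumer work.
    __device__ void produce(float* buf) { buf[threadIdx.x % 32] = 1.0f; }
    __device__ void consume(const float* buf) { float x = buf[threadIdx.x % 32]; (void)x; }

    __global__ void warp_specialized_kernel() {
        __shared__ float staging[32];
        int warp_id = threadIdx.x / warpSize;

        if (warp_id == 0) {
            produce(staging);    // the producer warp stages data into shared memory
        }
        __syncthreads();         // hand-off point between producer and consumers
        if (warp_id != 0) {
            consume(staging);    // the remaining warps act as consumers
        }
    }

    // Launched as an ordinary kernel, e.g. warp_specialized_kernel<<<1, 128>>>();

The producer and consumer roles here are carried entirely by the warp_id test; nothing in the programming model names them as distinct tasks.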
Register allocation: Currently, all threads in all blocks of a kernel are allocated the same number of registers. This makes sense only when all threads are doing exactly the same work (perfect SIMT). However, when we use GPUs to perform heterogeneous tasks (as in the producer-consumer example), we will need to optimize each task in a very different way. For example, the producer may require far fewer registers than the consumer. Today, unfortunately, we are constrained to come up with a single register allocation that balances producer and consumer performance.
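Concretely, the knobs available today - __launch_bounds__ on the kernel, or the nvcc -maxrregcount flag - apply to every thread in the launch. A sketch of the constraint (placeholder body, my own illustration) looks like this:

    // The register budget is fixed per kernel: __launch_bounds__ (or -maxrregcount)
    // caps registers for every thread in the launch, so producer and consumer
    // warps must share one compromise budget.
    __global__ void __launch_bounds__(256, 2)  // max 256 threads/block, >= 2 blocks/SM
    pipeline_kernel(float* data) {
        int warp_id = threadIdx.x / warpSize;
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (warp_id == 0) {
            data[idx] = 0.0f;       // lightweight "producer" placeholder
        } else {
            data[idx] += 1.0f;      // register-hungry "consumer" placeholder (in spirit)
        }
    }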
Better scheduling of work across SMs (streaming multiprocessors): We write kernels today to have a certain number of blocks, each of which contains the same number of threads. The warps within a block are scheduled on a single SM, and they can be swapped in and out to make efficient use of the SM (when one warp is blocked on memory reads, another warp that is ready to run can be swapped in). This also allows for easy synchronization across threads within a block (using __syncthreads()). However, this model does not extend to running blocks across multiple SMs. Once blocks are assigned to SMs, they have to run to completion before another pending block can be assigned the same resources. This means one has to be very careful when choosing the right number of blocks for each kind of GPU. For example, if there are too many blocks and we have synchronization across blocks, we will end up with a deadlock. It would be really nice if the programmer did not have to care about how blocks are scheduled onto SMs. Obviously expert programmers will still care - just as we do in CPU programming when crafting the optimal number of threads for best performance.
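To make the deadlock hazard concrete, here is a sketch (my own illustration) of a naive cross-block barrier built from an atomic counter. If the grid contains more blocks than can be resident on the SMs at once, the resident blocks spin forever waiting for blocks the scheduler will never launch, because a block only gives up its SM when it completes.

    // A naive cross-block barrier built from an atomic counter.
    __device__ unsigned int arrived = 0;

    __global__ void naive_grid_barrier_demo(float* data) {
        data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;  // placeholder per-block work

        // One thread per block announces arrival, then spins until every block
        // in the grid has arrived.
        if (threadIdx.x == 0) {
            atomicAdd(&arrived, 1u);
            while (atomicAdd(&arrived, 0u) < gridDim.x) { /* spin */ }
        }
        __syncthreads();

        // Anything placed here assumes all blocks have passed the barrier -
        // which only holds if every block in the grid can be resident at once.
    }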
This is a quick list that I have written up (excluding a few items that are more difficult for me to explain). I do realize there are big challenges in making all of this happen, but if you look at the advances in computing hardware and software over the almost 50 years that I have been actively programming, this is not asking for much. In many ways, a lot of this mirrors my experiences with CPU programming in the '80s - so many innovations and deviations (CISC, RISC, SIMD, MIMD, different kinds of cache coherence, etc.) that then all converged (for the most part) into a much more standardized CPU programming paradigm.