NVIDIA GPU Microarchitecture
Daniel Attali
4th Year Software Engineering Student at JCT | Data Science & AI Specialisation | C++ CUDA Engineer
A Top-Down Approach
Let's look at the microarchitecture of the GPUs that @NVIDIA produces, working top-down from the board all the way to the individual cores. This will give us a better mental model for writing efficient CUDA code.
The GPU
Let's take a look at the GPU. You might know how it looks from the outside, but have you ever looked at the inside?
In the image above we can see the board, which holds the GPU chip along with the VRAM and other supporting electronics. We will focus on the chip.
In the diagram above we zoom in on the center of the board. We can see 2 main components:
VRAM - Global Memory
Global memory is the main memory of the GPU: it is the largest, and also the slowest. The size of the VRAM depends on the specific GPU card, ranging from a few GB on consumer cards to tens of GB on data-center cards (the A100, for example, ships with 40 GB or 80 GB). It is used in every CUDA program. Understanding the speed and constraints of global memory will lead you to better coding decisions and techniques.
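As a minimal sketch of how a program touches global memory (standard CUDA runtime API calls; error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <vector>

int main() {
    const size_t n = 1 << 20;                 // 1M floats = 4 MB
    std::vector<float> host(n, 1.0f);

    // Allocate a buffer in global memory (VRAM) on the device.
    float* device_buf = nullptr;
    cudaMalloc(&device_buf, n * sizeof(float));

    // Copies between host RAM and global memory cross the PCIe/NVLink bus,
    // so they are far slower than on-chip accesses.
    cudaMemcpy(device_buf, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // ... launch kernels that read and write device_buf here ...

    cudaMemcpy(host.data(), device_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(device_buf);
    return 0;
}
```

Minimizing these host-device copies is usually the first optimization in any CUDA program.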
GPU Chip
Zooming in further on the chip itself, we can see that it is divided into 2 parts:
Note - many CPUs have 3 levels of cache, denoted L1, L2, and L3, but GPUs have no L3 cache.
L2 Cache
The L2 cache is a smaller but much faster memory than global memory. It is very limited in size, generally tens of MB (for example, 64 MB on the RTX 4080), depending on the specific card and architecture.
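You can check the L2 size of the card you are actually running on; a small sketch using the standard `cudaGetDeviceProperties` call:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);        // query device 0
    // l2CacheSize is reported in bytes.
    std::printf("L2 cache: %d bytes (%.1f MB)\n",
                prop.l2CacheSize, prop.l2CacheSize / (1024.0 * 1024.0));
    return 0;
}
```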
GPC (Graphics Processing Cluster)
In the diagram above we can see that inside the GPC there are several TPCs (Texture Processing Clusters). Each TPC in turn contains SMs (Streaming Multiprocessors), typically two. The SMs are the building blocks of any CUDA program.
SM (Streaming Multiprocessor)
NVIDIA GPUs contain many SMs, and as we said, they are the building blocks of any CUDA program.
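The SM count is exposed through the same runtime API; a quick sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // multiProcessorCount is the number of SMs on this GPU.
    std::printf("%s has %d SMs\n", prop.name, prop.multiProcessorCount);
    return 0;
}
```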
As we can see the SM contains many parts, let's take a look:
Processing Block
Inside the Processing Block we can see 3 main types of components:
In total, we see that each processing block has 32 cores. These 32 cores execute a group of 32 threads in lockstep, and such a group of threads is called a warp.
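Warps matter in practice because the 32 threads of a warp can exchange data directly through registers, without going through memory. A sketch of a warp-wide sum using the standard `__shfl_down_sync` intrinsic (the kernel and function names here are illustrative):

```cuda
#include <cuda_runtime.h>

// Sums `val` across the 32 lanes of a warp; lane 0 ends up with the total.
__device__ float warp_sum(float val) {
    // 0xffffffff: all 32 lanes of the warp participate.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

__global__ void sum_kernel(const float* in, float* out) {
    float total = warp_sum(in[threadIdx.x]);
    if (threadIdx.x % 32 == 0)   // one result per warp
        atomicAdd(out, total);
}
```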
Summary
The physical hierarchy of cores is:
To find how many CUDA cores a GPU has, the formula is:
CUDA Cores = GPCs per GPU x TPCs per GPC x SMs per TPC x processing blocks per SM x 32
This can vary from architecture to architecture and from generation to generation.
The programming thread hierarchy is:
CUDA is a powerful API that adapts the same program to whatever physical GPU it runs on.
As you can see in the image, depending on the number of SMs present on the physical GPU, the CUDA runtime decides how to distribute the grid of blocks across them.
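The bridge between the thread hierarchy and the data is the usual index computation inside a kernel; a minimal sketch (the `scale` kernel is a made-up example):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* data, float alpha, int n) {
    // Each thread computes its global index from its block and thread IDs.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                        // guard: the grid may overshoot n
        data[i] *= alpha;
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    // Round the block count up so every element gets a thread; the runtime
    // then schedules these blocks across however many SMs the GPU has.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```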
In this article, we took a deeper look at the physical design of a GPU, which gives us a better understanding of how to think about our programming in CUDA.
If you have anything to add or if I made a mistake please make sure to comment and I will try and fix it.
#cuda #nvidia #microarchitecture #gpu #hpc