NVIDIA GPU Microarchitecture

A Top-Down Approach

Let's look at the microarchitecture of the GPUs that @NVIDIA produces. This will give us a few things:


  1. An understanding of the inner workings of the GPU.
  2. An understanding of the different memory types and how we should use them.
  3. An understanding of the thread hierarchy in the CUDA programming model.


The GPU

Let's take a look at the GPU. You might know how it looks from the outside, but have you ever looked at the inside?

In the image above we can see the board, which holds the chip and the other electronics; we will focus on the chip.


In the diagram above we zoom in on the chip in the center of the board. We can see that it has 2 main components:

  1. VRAM - which in the CUDA programming model is called Global Memory.
  2. Chip - This is where the code will run.

VRAM - Global Memory

Global Memory is the main memory of the GPU; it is the biggest memory and also the slowest one. The size of the VRAM depends on the specific GPU card and can range from a few GB to tens of GB (for example, the A100 ships with 40 GB or 80 GB). It is used in every CUDA program, and understanding the speed and constraints of global memory will lead you to better coding decisions and techniques.
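To make this concrete, here is a minimal sketch (illustrative names, error handling omitted) of how a CUDA program allocates global memory and moves data into and out of it:

    #include <cuda_runtime.h>
    #include <cstdlib>

    int main() {
        const int n = 1 << 20;                  // 1M floats (~4 MB)
        const size_t bytes = n * sizeof(float);

        float *h_data = (float *)malloc(bytes); // host (CPU) memory
        float *d_data = nullptr;                // device pointer into global memory (VRAM)

        cudaMalloc(&d_data, bytes);                                 // allocate in global memory
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // CPU -> GPU

        // ... launch kernels that read and write d_data here ...

        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // GPU -> CPU
        cudaFree(d_data);
        free(h_data);
        return 0;
    }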

GPU Chip

Zooming in on the chip itself, we can see that it is divided into 2 parts:

  1. L2 Cache - the level 2 cache for the memory.
  2. GPC (Graphics Processing Cluster) - a grouping of smaller components.

Note - on many CPUs we find 3 levels of cache, denoted L1, L2, and L3, but GPUs don't have an L3 cache.

L2 Cache

The L2 cache is smaller but faster than global memory. Its capacity is very limited, generally in the range of tens of MB (for example, 64 MB on some RTX 40-series cards), depending on the specific card and architecture.
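If you want to check the L2 size on your own card, the CUDA runtime exposes it as a device attribute; a small sketch:

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int l2_bytes = 0;
        // cudaDevAttrL2CacheSize reports the L2 cache size in bytes for device 0.
        cudaDeviceGetAttribute(&l2_bytes, cudaDevAttrL2CacheSize, 0);
        printf("L2 cache: %.1f MB\n", l2_bytes / (1024.0 * 1024.0));
        return 0;
    }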

GPC (Graphics Processing Cluster)


In the diagram above we can see that inside the GPC there are many TPCs (Texture Processing Clusters). Each TPC contains several SMs (Streaming Multiprocessors). The SMs are the building blocks of any CUDA program.

SM (Streaming Multiprocessor)

NVIDIA GPUs contain many SMs, and as we said, they are the building blocks of any CUDA program.

As we can see, the SM contains many parts; let's take a look:

  1. L1 Cache - The level 1 cache sits in each SM and is shared by all the parts of the SM. In the CUDA programming model, part of this on-chip memory is exposed as shared memory, because it is shared between the threads of the same block (a minimal kernel using it is sketched after this list).
  2. Register File - High-speed storage for each thread's data. Since the register file sits outside the processing cores, its finite number of registers has to be shared by all the threads resident on the SM.
  3. Ray Tracing Core - A dedicated core for ray-tracing operations (we will not focus on it).
  4. Processing Blocks - Groups of processing cores. Each processing block executes a warp: a group of 32 threads that run the same instruction together.
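Here is the shared-memory sketch mentioned in item 1. The __shared__ tile lives in the SM's on-chip memory and is visible to every thread of the block; the kernel name and the tile size of 256 are just illustrative:

    // Each block loads a tile of the input into shared memory, then thread 0
    // of the block sums the tile. Assumes the kernel is launched with 256
    // threads per block, e.g. blockSum<<<numBlocks, 256>>>(d_in, d_out);
    __global__ void blockSum(const float *in, float *out) {
        __shared__ float tile[256];               // on-chip, shared by the whole block

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];                // each thread loads one element
        __syncthreads();                          // wait until the whole tile is loaded

        if (threadIdx.x == 0) {
            float sum = 0.0f;
            for (int j = 0; j < blockDim.x; ++j)
                sum += tile[j];
            out[blockIdx.x] = sum;
        }
    }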

Processing Block


Inside the Processing Block we can see 3 main types of components:

  1. Tensor Core - Performs tensor (matrix) operations such as A x B + C (where A, B, and C are matrices).
  2. FP32 - 16 floating-point 32-bit cores that perform operations such as a x b + c on float32 data.
  3. FP32/INT32 - 16 cores that can process either float32 or int32 data.

In total, we have 32 cores per processing block; these 32 cores execute one warp, a group of 32 threads that run the same instruction in lockstep.
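Because the 32 threads of a warp execute together, CUDA exposes warp-level primitives that let them exchange register values directly; a small sketch of a warp-wide sum (assuming a single warp per block, names illustrative):

    // All 32 threads of the warp hold one value each in a register.
    // __shfl_down_sync moves register values between lanes of the same warp,
    // so this reduction needs no shared memory at all.
    __global__ void warpSum(const float *in, float *out) {
        float val = in[threadIdx.x];              // assumes blockDim.x == 32 (one warp)

        // 32 -> 16 -> 8 -> 4 -> 2 -> 1: after 5 steps lane 0 holds the sum.
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(0xffffffff, val, offset);

        if (threadIdx.x == 0)
            *out = val;
    }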


Summary

The physical hierarchy of cores is:

  1. GPU
  2. GPC
  3. TPC
  4. SM
  5. Processing Block
  6. Core

To know how many CUDA cores a GPU has, the formula is:

CUDA Cores = GPCs per GPU x TPCs per GPC x SMs per TPC x processing blocks per SM x 32 cores per block

This can vary from architecture to architecture and from generation to generation.
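In practice, the CUDA runtime already reports the total SM count, so you only need the cores-per-SM figure for your architecture. A sketch, assuming 128 cores per SM (4 processing blocks x 32), which holds for many recent GPUs but not every generation:

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // multiProcessorCount already folds in GPCs, TPCs per GPC and SMs per TPC.
        const int coresPerSM = 128;   // architecture-dependent assumption
        printf("%s: %d SMs, ~%d CUDA cores\n",
               prop.name, prop.multiProcessorCount,
               prop.multiProcessorCount * coresPerSM);
        return 0;
    }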


The programming thread hierarchy is:

  1. Grid
  2. Block
  3. Thread

CUDA is a very powerful API: the runtime maps this logical hierarchy onto the physical GPU, so the same program can run on cards with different numbers of SMs.
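As a small illustration of that hierarchy, here is a sketch of a kernel in which each thread derives its position in the grid from its block and thread indices (the kernel name is just for illustration):

    // Each thread handles one element; its global index is derived from the
    // block it belongs to (blockIdx) and its position inside the block (threadIdx).
    __global__ void addOne(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                       // guard: the last block may be partially full
            data[i] += 1.0f;
    }

    // Host-side launch: the grid size comes from the problem size, and the
    // runtime schedules the resulting blocks onto whatever SMs the GPU has.
    // addOne<<<(n + 255) / 256, 256>>>(d_data, n);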


As you can see in the image, depending on the number of SMs present on the physical GPU the program runs on, CUDA will decide how to distribute the blocks of the grid across them.


In this article, we took a deeper look at the physical design of a GPU, which gives us a better understanding of how to think about our CUDA programs.


If you have anything to add, or if I made a mistake, please leave a comment and I will try to fix it.


#cuda #nvidia #microarchitecture #gpu #hpc

Daniel Attali

4th Year Software Engineering Student at JCT | Data Science & AI Specialisation | C++ CUDA Engineer


Please like and comment to let me know what you think!
