NVIDIA GPU Microarchitecture

A Top-Down Approach

Let's look at the microarchitecture of the GPUs that @NVIDIA produces. This will give us a few things:


  1. An understanding of the inner workings of the GPU.
  2. An understanding of the different memory types and how we should use them.
  3. An understanding of the thread hierarchy in the CUDA programming model.


The GPU

Let's take a look at the GPU. You might know how it looks from the outside, but have you ever looked at the inside?

In the image above we can see the board, which holds the chip and the other electronics; we will focus on the chip.


In the diagram above we zoom in on the chip in the center of the board. We can see that it has 2 main components:

  1. VRAM - which in the CUDA programming model is called Global Memory.
  2. Chip - This is where the code will run.

VRAM - Global Memory

Global Memory is the main memory of the GPU; it is the biggest memory and also the slowest one. The size of the VRAM depends on the specific GPU card and can range from a few GB to tens of GB (for example, the A100 ships with 40 GB or 80 GB). It is used in every CUDA program, and understanding the speed and constraints of global memory will lead you to better coding decisions and techniques.
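To make this concrete, here is a minimal sketch (illustrative names, error handling omitted) of how a CUDA program allocates global memory and moves data into and out of it:

    #include <cuda_runtime.h>
    #include <cstdlib>

    int main() {
        const int n = 1 << 20;                  // 1M floats (~4 MB)
        const size_t bytes = n * sizeof(float);

        float *h_data = (float *)malloc(bytes); // host (CPU) memory
        float *d_data = nullptr;                // device pointer into global memory (VRAM)

        cudaMalloc(&d_data, bytes);                                 // allocate in global memory
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // CPU -> GPU

        // ... launch kernels that read and write d_data here ...

        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // GPU -> CPU
        cudaFree(d_data);
        free(h_data);
        return 0;
    }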

GPU Chip

Zooming in on the chip itself, we can see that it is divided into 2 parts:

  1. L2 Cache - the level 2 cache for the memory.
  2. GPC (Graphics Processing Cluster) - a grouping of smaller components.

Note - on many CPUs we find 3 levels of cache, denoted L1, L2, and L3, but GPUs don't have an L3 cache.

L2 Cache

The L2 cache is smaller but faster than global memory. Its capacity is very limited, generally in the range of tens of MB (for example, 64 MB on some RTX 40-series cards), depending on the specific card and architecture.
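If you want to check the L2 size on your own card, the CUDA runtime exposes it as a device attribute; a small sketch:

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int l2_bytes = 0;
        // cudaDevAttrL2CacheSize reports the L2 cache size in bytes for device 0.
        cudaDeviceGetAttribute(&l2_bytes, cudaDevAttrL2CacheSize, 0);
        printf("L2 cache: %.1f MB\n", l2_bytes / (1024.0 * 1024.0));
        return 0;
    }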

GPC (Graphics Processing Cluster)


In the diagram above we can see that inside the GPC there are many TPCs (Texture Processing Clusters). Each TPC contains several SMs (Streaming Multiprocessors). The SMs are the building blocks of any CUDA program.

SM (Streaming Multiprocessor)

NVIDIA GPUs contain many SMs, and as we said, they are the building blocks of any CUDA program.

As we can see, the SM contains many parts; let's take a look:

  1. L1 Cache - The level 1 cache sits in each SM and is shared by all the parts of the SM. In the CUDA programming model, part of this on-chip memory is exposed as shared memory, because it is shared between the threads of the same block (a minimal kernel using it is sketched after this list).
  2. Register File - High-speed storage for each thread's data. Since the register file sits outside the processing cores, its finite number of registers has to be shared by all the threads resident on the SM.
  3. Ray Tracing Core - A dedicated core for ray-tracing operations (we will not focus on it).
  4. Processing Blocks - Groups of processing cores. Each processing block executes a warp: a group of 32 threads that run the same instruction together.
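Here is the shared-memory sketch mentioned in item 1. The __shared__ tile lives in the SM's on-chip memory and is visible to every thread of the block; the kernel name and the tile size of 256 are just illustrative:

    // Each block loads a tile of the input into shared memory, then thread 0
    // of the block sums the tile. Assumes the kernel is launched with 256
    // threads per block, e.g. blockSum<<<numBlocks, 256>>>(d_in, d_out);
    __global__ void blockSum(const float *in, float *out) {
        __shared__ float tile[256];               // on-chip, shared by the whole block

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];                // each thread loads one element
        __syncthreads();                          // wait until the whole tile is loaded

        if (threadIdx.x == 0) {
            float sum = 0.0f;
            for (int j = 0; j < blockDim.x; ++j)
                sum += tile[j];
            out[blockIdx.x] = sum;
        }
    }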

Processing Block


Inside the Processing Block we can see 3 main types of components:

  1. Tensor Core - Performs tensor (matrix) operations such as A x B + C (where A, B, and C are matrices).
  2. FP32 - 16 floating-point 32-bit cores that perform operations such as a x b + c on float32 data.
  3. FP32/INT32 - 16 cores that can process either float32 or int32 data.

In total, we have 32 cores per processing block; these 32 cores execute one warp, a group of 32 threads that run the same instruction in lockstep.
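Because the 32 threads of a warp execute together, CUDA exposes warp-level primitives that let them exchange register values directly; a small sketch of a warp-wide sum (assuming a single warp per block, names illustrative):

    // All 32 threads of the warp hold one value each in a register.
    // __shfl_down_sync moves register values between lanes of the same warp,
    // so this reduction needs no shared memory at all.
    __global__ void warpSum(const float *in, float *out) {
        float val = in[threadIdx.x];              // assumes blockDim.x == 32 (one warp)

        // 32 -> 16 -> 8 -> 4 -> 2 -> 1: after 5 steps lane 0 holds the sum.
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(0xffffffff, val, offset);

        if (threadIdx.x == 0)
            *out = val;
    }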


Summary

The physical hierarchy of cores is:

  1. GPU
  2. GPC
  3. TPC
  4. SM
  5. Processing Block
  6. Core

To know how many CUDA cores a GPU has, the formula is:

CUDA Cores = GPCs per GPU x TPCs per GPC x SMs per TPC x processing blocks per SM x 32 cores per block

This can vary from architecture to architecture and from generation to generation.
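In practice, the CUDA runtime already reports the total SM count, so you only need the cores-per-SM figure for your architecture. A sketch, assuming 128 cores per SM (4 processing blocks x 32), which holds for many recent GPUs but not every generation:

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // multiProcessorCount already folds in GPCs, TPCs per GPC and SMs per TPC.
        const int coresPerSM = 128;   // architecture-dependent assumption
        printf("%s: %d SMs, ~%d CUDA cores\n",
               prop.name, prop.multiProcessorCount,
               prop.multiProcessorCount * coresPerSM);
        return 0;
    }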


The programming thread hierarchy is:

  1. Grid
  2. Block
  3. Thread

CUDA is a very powerful API: the runtime maps this logical hierarchy onto the physical GPU, so the same program can run on cards with different numbers of SMs.
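As a small illustration of that hierarchy, here is a sketch of a kernel in which each thread derives its position in the grid from its block and thread indices (the kernel name is just for illustration):

    // Each thread handles one element; its global index is derived from the
    // block it belongs to (blockIdx) and its position inside the block (threadIdx).
    __global__ void addOne(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                       // guard: the last block may be partially full
            data[i] += 1.0f;
    }

    // Host-side launch: the grid size comes from the problem size, and the
    // runtime schedules the resulting blocks onto whatever SMs the GPU has.
    // addOne<<<(n + 255) / 256, 256>>>(d_data, n);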


As you can see in the image, depending on the number of SMs present on the physical GPU the program runs on, CUDA will decide how to distribute the blocks of the grid across them.


In this article, we took a deeper look at the physical design of a GPU, which gives us a better understanding of how to think about our CUDA programs.


If you have anything to add, or if I made a mistake, please leave a comment and I will try to fix it.


#cuda #nvidia #microarchitecture #gpu #hpc

Daniel Attali

4th Year Software Engineering Student at JCT | Data Science & AI Specialisation | C++ CUDA Engineer


Please like and comment to let me know what you think!
