NVIDIA GTC Impressions - A Data Center Perspective: "It was my Understanding that there Would be no Math"

There are plenty of thought pieces being written about last week's NVIDIA GTC Conference, but few, if any, come from the perspective of the data centers.

To start with, a quick recap of the new GPU specifications - aka "The Math."

NVIDIA is continuing its evolution from the A100 (last generation) to the H100 (current generation) to the next-generation GB200, a GPU system that combines two B200 Blackwell GPUs with one Grace CPU to provide four times the performance of the last generation at an approximately 30% power density increase, in the same space. On a power basis, that's about a 3.1x improvement in performance per watt.
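The performance-per-watt figure above falls straight out of the two headline numbers. A quick back-of-envelope check (the ratios are the keynote claims as summarized here, not official NVIDIA benchmarks):

```python
# Generational claim: ~4x performance at ~1.3x power density (same space).
relative_performance = 4.0   # GB200 vs. prior generation
relative_power = 1.3         # ~30% power density increase

perf_per_watt_gain = relative_performance / relative_power
print(f"Performance-per-watt gain: {perf_per_watt_gain:.2f}x")  # ~3.08x
```

Rounded, that is the ~3.1x figure quoted throughout this note.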

A single rack would have 36 Grace CPUs and 72 Blackwell GPUs in the standard configuration and would draw 120kW/cabinet, although whispers indicate that in operation it may be a bit less - call it 110kW/cabinet. Within the rack, copper is used as the networking medium - a pretty standard approach, considering that optics are unnecessary at that distance and would add roughly 20kW of power.

Eight racks form a row - 288 Grace CPUs and 576 Blackwell GPUs, connected by 576 NVLink Smart NICs in a fully meshed InfiniBand configuration. That sounds good, except that lead times on Mellanox NICs are increasing and currently stand at 10 to 20 weeks - the critical path in any NVIDIA GPU deployment. Each eight-rack row would be supported by one Cooling Distribution Unit (CDU), connected to building chilled water.

32 racks make a standard-sized deployment pod of 4MW. That being said, I'm skeptical that 4MW is a true unit of deployment - I suspect four of these pods will be deployed together, making the minimum deployment size 16MW. In comparison, the last-generation H100 was deployed in an 8,000-GPU, 15MW pod. Any way you cut it, this is 4x as good per GPU and 3.1x as good per unit of power.
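The rack, row, and pod figures above chain together as simple multiplication. A sketch of that math, using the per-rack power estimate from this note (the 4MW and 16MW figures in the text are rounded up from these):

```python
# Deployment math from the rack/row/pod figures in this note. The per-rack
# power number is this article's estimate, not an NVIDIA specification.
gpus_per_rack = 72          # Blackwell GPUs (plus 36 Grace CPUs)
kw_per_rack = 120           # nameplate; ~110 kW whispered in operation
racks_per_row = 8
rows_per_pod = 4            # 32 racks per pod

racks_per_pod = racks_per_row * rows_per_pod
pod_mw = racks_per_pod * kw_per_rack / 1000
print(f"{racks_per_pod} racks -> {pod_mw:.2f} MW per pod")   # 3.84 MW (~4 MW)
print(f"GPUs per pod: {racks_per_pod * gpus_per_rack}")      # 2304
print(f"Four-pod deployment: {4 * pod_mw:.1f} MW")           # 15.4 MW (~16 MW)
```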

So, what does this do to power projections for GPU hosting data centers? Honestly, not much. The increase in both power and performance was anticipated in my analysis, and we are holding steady on 31,000MW of global GPU data center demand in 2028. That is a highly conservative number - our friends at SemiAnalysis have published a much higher projection of 80,000MW, a power figure that we feel cannot be met with either likely chip shipments or utility power.

NVIDIA is also shipping a slightly "toned down" B100-powered drop-in replacement for H100 racks, with each GPU clocking in at 700W instead of 1,000W. NVIDIA also pushed its Spectrum-X Ethernet network as a possible alternative to Mellanox InfiniBand. Plenty of GPU hosters, however, are wondering why not just deploy 400G Ethernet with less expensive Smart NICs. It's a good question to ask, in my opinion.

So, what’s the data center impact of these titanic announcements?

  • Power density per rack will continue to increase. It's clearly not jumping to levels requiring immersion cooling, however - ~240kW/cabinet.
  • You can deploy these higher-power GPUs in a de-populated configuration - fewer GPUs per rack - to maintain 40kW/cabinet air-cooled environments for applications like inference.
  • Our existing data center portfolio is not obsolete, so long as we are careful. Also, we can future-proof our current designs - again, so long as we are careful.
  • Data center providers and developers must focus on strategy now, not simply business development. Location, latency, power density, and the ability to handle future technologies are far more important than they were 24 months ago.
  • We should assume future power increases will be similar, but that future performance increases will be somewhat more modest. In addition, I'm seriously skeptical of NVIDIA's ability to do this on an annual basis - spinning chips is hard, and it almost broke NVIDIA's team to ship the H200 on schedule. I expect an 18-month cycle instead of a 12-month cycle.
  • For data center sizing… this doesn't really move the needle much. Greater density of GPUs/FLOPS isn't going to make data centers larger in a power sense, nor is it going to make them much smaller in a size sense (excepting that server floors have already been shrinking at 20kW/cabinet with a "cabinet equivalent" of about 30 sq ft). The need for dual-use cloud/GPU data centers is too high for much more "squeezing." We're still seeing average new data center sizing of ~60MW critical, and campuses of ~200MW minimum, 300MW optimum.
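The de-populated configuration in the bullets above is also just arithmetic: shed GPUs per rack until the rack fits the air-cooled envelope. A sketch under a simplifying assumption (rack power divided evenly across GPUs, which ignores the fixed CPU/NIC overhead):

```python
# De-population sketch: how many GPUs fit a 40 kW air-cooled cabinet,
# assuming the 120 kW full rack's power splits evenly across its 72 GPUs.
import math

full_rack_kw = 120
full_rack_gpus = 72
air_cooled_limit_kw = 40

kw_per_gpu = full_rack_kw / full_rack_gpus            # ~1.67 kW all-in
max_gpus = math.floor(air_cooled_limit_kw / kw_per_gpu)
print(f"Max GPUs per 40 kW air-cooled rack: {max_gpus}")  # 24, one third
```

In other words, roughly one third of the standard rack population - acceptable for inference, where network fabric demands are lighter.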

What does all of this mean for the depreciation life of GPUs? A100s had (and have) a clear six-year depreciation life, which has been pretty great for the GPU hosters and others. Will H100s also have that long a life? That really depends on how many of the more modern Blackwell systems NVIDIA can ship, and how quickly. But six years seems unlikely. Many GPU hosters are planning for two years of lifetime. The economics get a lot better at four years, but this would require the ability to shift older GPUs from remote training locations to closer-in inference locations. This is not easy, and it's far outside the wheelhouse of the vast majority of those in this sector. No one has seriously tried lift-and-shift for a decade or more, because it's horribly difficult and expensive. Perhaps it's time to redevelop this capability.
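Why the economics "get a lot better at four years" is straight-line amortization: the same capex spread over more years of revenue. An illustrative sketch with a hypothetical round-number GPU cost (not a quote or a market price):

```python
# Straight-line depreciation at the lifetimes discussed above.
# The capex figure is a hypothetical round number for illustration only.
gpu_capex = 40_000  # assumed all-in cost per deployed GPU, USD

for years in (2, 4, 6):
    annual = gpu_capex / years
    print(f"{years}-year life: ${annual:,.0f}/year")
```

Halving the annual depreciation charge by moving from a two-year to a four-year life is the prize; the cost of lift-and-shift to inference sites is what you pay for it.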

Finally, if you are in the data center business, you should be at GTC. Some very influential data center executives were there, but I was surprised to see how many engineers were not. This isn't optional anymore - it's not just an interesting application. ML is the future for data centers. Next year's GTC will supposedly be in Las Vegas due to the size of the crowds. I hope to see you there - it's vital for your business.


