NVIDIA GTC Impressions - A Data Center Perspective: "It was my Understanding that there Would be no Math"

There are plenty of thought pieces being written about last week's NVIDIA GTC Conference, but few, if any, come from the perspective of the data centers.

To start with, a quick recap of the new GPU specifications - aka "The Math."

NVIDIA is continuing its evolution from the A100 (last generation) to the H100 (current generation) to the next-generation GB200, a GPU system that combines two B200 Blackwell GPUs with one Grace CPU to provide four times the performance of the last generation at an approximately 30% power density increase, in the same space. On a power basis, that's about a 3.1x improvement in performance per watt.
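The performance-per-watt figure above falls straight out of the two headline numbers. A quick back-of-envelope check (the ratios are the keynote claims as summarized here, not official NVIDIA benchmarks):

```python
# Generational claim: ~4x performance at ~1.3x power density (same space).
relative_performance = 4.0   # GB200 vs. prior generation
relative_power = 1.3         # ~30% power density increase

perf_per_watt_gain = relative_performance / relative_power
print(f"Performance-per-watt gain: {perf_per_watt_gain:.2f}x")  # ~3.08x
```

Rounded, that is the ~3.1x figure quoted throughout this note.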

A single rack would have 36 Grace CPUs and 72 Blackwell GPUs in the standard configuration and would draw 120kW/cabinet, although whispers indicate that in operation it may be a bit less - call it 110kW/cabinet. Within the rack, copper is used as the networking medium - a pretty standard approach, considering that optics are unnecessary at that distance and would add roughly 20kW of power.

Eight racks form a row - 288 Grace CPUs and 576 Blackwell GPUs, connected by 576 NVLink Smart NICs in a fully meshed InfiniBand configuration. That sounds good, except that lead times on Mellanox NICs are increasing and currently stand at 10 to 20 weeks - the critical path in any NVIDIA GPU deployment. Each eight-rack row would be supported by one Cooling Distribution Unit (CDU), connected to building chilled water.

32 racks make a standard-sized deployment pod of 4MW. That being said, I'm skeptical that 4MW is a true unit of deployment - I suspect four of these pods will be deployed together, making the minimum deployment size 16MW. In comparison, the last-generation H100 was deployed in an 8,000-GPU, 15MW pod. Any way you cut it, this is 4x as good per GPU and 3.1x as good per unit of power.
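The rack, row, and pod figures above chain together as simple multiplication. A sketch of that math, using the per-rack power estimate from this note (the 4MW and 16MW figures in the text are rounded up from these):

```python
# Deployment math from the rack/row/pod figures in this note. The per-rack
# power number is this article's estimate, not an NVIDIA specification.
gpus_per_rack = 72          # Blackwell GPUs (plus 36 Grace CPUs)
kw_per_rack = 120           # nameplate; ~110 kW whispered in operation
racks_per_row = 8
rows_per_pod = 4            # 32 racks per pod

racks_per_pod = racks_per_row * rows_per_pod
pod_mw = racks_per_pod * kw_per_rack / 1000
print(f"{racks_per_pod} racks -> {pod_mw:.2f} MW per pod")   # 3.84 MW (~4 MW)
print(f"GPUs per pod: {racks_per_pod * gpus_per_rack}")      # 2304
print(f"Four-pod deployment: {4 * pod_mw:.1f} MW")           # 15.4 MW (~16 MW)
```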

So, what does this do to power projections for GPU hosting data centers? Honestly, not much. The increase in both power and performance was anticipated in my analysis, and we are holding steady on 31,000MW of global GPU data center demand in 2028. That is a highly conservative number - our friends at SemiAnalysis have published a much higher projection of 80,000MW, a power figure that we feel cannot be met with either likely chip shipments or utility power.

NVIDIA is also shipping a slightly "toned down" B100-powered drop-in replacement for H100 racks, with each GPU clocking in at 700W instead of 1,000W. NVIDIA also pushed its Spectrum-X Ethernet network as a possible alternative to Mellanox InfiniBand. Plenty of GPU hosters, however, are wondering why not just deploy 400G Ethernet with less expensive Smart NICs. It's a good question to ask, in my opinion.

So, what’s the data center impact of these titanic announcements?

  • Power density per rack will continue to increase. It's clearly not jumping to levels requiring immersion cooling, however - ~240kW/cabinet.
  • You can deploy these higher-power GPUs in a de-populated configuration - fewer GPUs per rack - to maintain 40kW/cabinet air-cooled environments for applications like inference.
  • Our existing data center portfolio is not obsolete, so long as we are careful. Also, we can future-proof our current designs - again, so long as we are careful.
  • Data center providers and developers must focus on strategy now, not simply business development. Location, latency, power density, and the ability to handle future technologies are far more important than they were 24 months ago.
  • We should assume future power increases will be similar, but that future performance increases will be somewhat more modest. In addition, I'm seriously skeptical of NVIDIA's ability to do this on an annual basis - spinning chips is hard, and it almost broke NVIDIA's team to ship the H200 on schedule. I expect an 18-month cycle instead of a 12-month cycle.
  • For data center sizing… this doesn't really move the needle much. Greater density of GPUs/FLOPS isn't going to make data centers larger in a power sense, nor is it going to make them much smaller in a size sense (excepting that server floors have already been shrinking at 20kW/cabinet with a "cabinet equivalent" of about 30 sq ft). The need for dual-use cloud/GPU data centers is too high for much more "squeezing." We're still seeing average new data center sizing of ~60MW critical, and campuses of ~200MW minimum, 300MW optimum.
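The de-populated configuration in the bullets above is also just arithmetic: shed GPUs per rack until the rack fits the air-cooled envelope. A sketch under a simplifying assumption (rack power divided evenly across GPUs, which ignores the fixed CPU/NIC overhead):

```python
# De-population sketch: how many GPUs fit a 40 kW air-cooled cabinet,
# assuming the 120 kW full rack's power splits evenly across its 72 GPUs.
import math

full_rack_kw = 120
full_rack_gpus = 72
air_cooled_limit_kw = 40

kw_per_gpu = full_rack_kw / full_rack_gpus            # ~1.67 kW all-in
max_gpus = math.floor(air_cooled_limit_kw / kw_per_gpu)
print(f"Max GPUs per 40 kW air-cooled rack: {max_gpus}")  # 24, one third
```

In other words, roughly one third of the standard rack population - acceptable for inference, where network fabric demands are lighter.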

What does all of this mean for the depreciation life of GPUs? A100s had (and have) a clear six-year depreciation life, which has been pretty great for the GPU hosters and others. Will H100s also have that long a life? That really depends on how many of the more modern Blackwell systems NVIDIA can ship, and how quickly. But six years seems unlikely. Many GPU hosters are planning for two years of lifetime. The economics get a lot better at four years, but this would require the ability to shift older GPUs from remote training locations to closer-in inference locations. This is not easy, and it's far outside the wheelhouse of the vast majority of those in this sector. No one has seriously tried lift-and-shift for a decade or more, because it's horribly difficult and expensive. Perhaps it's time to redevelop this capability.
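Why the economics "get a lot better at four years" is straight-line amortization: the same capex spread over more years of revenue. An illustrative sketch with a hypothetical round-number GPU cost (not a quote or a market price):

```python
# Straight-line depreciation at the lifetimes discussed above.
# The capex figure is a hypothetical round number for illustration only.
gpu_capex = 40_000  # assumed all-in cost per deployed GPU, USD

for years in (2, 4, 6):
    annual = gpu_capex / years
    print(f"{years}-year life: ${annual:,.0f}/year")
```

Halving the annual depreciation charge by moving from a two-year to a four-year life is the prize; the cost of lift-and-shift to inference sites is what you pay for it.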

Finally, if you are in the data center business, you should be at GTC. Some very influential data center executives were there, but I was surprised to see how many engineers were not. This isn't optional anymore - it's not just an interesting application. ML is the future for data centers. Next year's GTC will supposedly be in Las Vegas due to the size of the crowds. I hope to see you there - it's vital for your business.


