Beyond NVIDIA: Is AMD the only GPU alternative for HPC/AI Workloads

In my last article, I discussed what a GPU was and primarily covered NVIDIA's history and product suite. While NVIDIA GPUs have long been the go-to choice for AI workloads, a host of alternative options, both GPU-based and otherwise, have been simmering away, mostly behind the scenes.

I wanted to shine a light on these, the current state of the market, and where the future might be headed.


So let's get stuck in…


GPU-based: AMD and Intel

AMD GPUs

AMD, a formidable competitor in the CPU market that arguably holds the lion's share of CPU sales in HPC thanks to its core-count density and power efficiency, has also made significant strides in the GPU space with its "Radeon Instinct" series. These GPUs offer an appealing alternative to NVIDIA's dominance, particularly in certain niches. Key points to consider include:

  • Architecture: AMD's RDNA architecture focuses on energy efficiency and scalability, making it suitable for both gaming and professional applications.
  • Heterogeneous Compute: AMD GPUs support a variety of programming models, including OpenCL and ROCm, enabling developers to harness their power for parallel processing tasks. AMD has also joined the PyTorch Foundation to further the development of PyTorch, a Python-based machine learning framework (a minimal ROCm check is sketched just after this list).
  • Datacenter Integration: With initiatives like the AMD CDNA architecture, AMD aims to carve out a space for itself in data centers, providing competition to NVIDIA in HPC and AI workloads.
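
For those curious what this looks like on the software side, here is a minimal sketch, assuming a ROCm build of PyTorch and a supported AMD Instinct card. On ROCm builds, PyTorch exposes the AMD GPU through the familiar torch.cuda namespace (HIP sits underneath), so most code written for NVIDIA GPUs runs unchanged.

```python
# Minimal sketch: confirm a ROCm build of PyTorch can see an AMD GPU.
# ROCm builds reuse the torch.cuda namespace (HIP underneath), so code
# written against NVIDIA GPUs generally runs unchanged.
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")                 # maps to the AMD GPU under ROCm
    print("Device:", torch.cuda.get_device_name(0))
    print("HIP runtime:", torch.version.hip)      # a version string on ROCm, None on CUDA builds
    x = torch.randn(4096, 4096, device=device)
    y = x @ x                                     # the matmul executes on the GPU
    print("Result shape:", y.shape)
else:
    print("No ROCm/CUDA-capable device visible to PyTorch.")
```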

The Instinct line launched in 2017 with the 150W "MI6" card; AMD recently announced its latest 750W, 192GB beast, the "Instinct MI300X", due in production later this year.


Disclaimer: comparing apples to apples is very difficult and workload-dependent, and folks should always be skeptical of vendor-provided performance metrics.

Nvidia accused of cheating in big-data performance test by benchmark's umpires: Workloads 'tweaked' to beat rivals in TPCx-BB | The Register


For comparison, NVIDIA's H100 has 80GB of memory, requiring NVLink to pool GPUs in order to address larger amounts of memory. It will be very interesting to see developers adopt this new platform, as AMD has a significant memory bandwidth and capacity advantage, though TFLOPS/TOPS figures are yet to be announced. As the software stack rapidly improves, you will likely see more AMD GPUs in the wild later this year.
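
To make the pooling point concrete, here is a quick, illustrative sketch (PyTorch again) that enumerates the GPUs a node exposes and their memory. A hypothetical 160GB model would need two 80GB H100s linked together, but would fit on a single 192GB MI300X.

```python
# Illustrative sketch: list the visible GPUs and their memory, to reason about
# how many devices must be pooled (e.g. over NVLink) to hold a given model.
import torch

total_bytes = 0
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    total_bytes += props.total_memory
    print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.0f} GiB")

print(f"Aggregate GPU memory: {total_bytes / 2**30:.0f} GiB")
```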


Intel GPUs

Intel, renowned for its CPUs, has also entered the GPU arena with its Intel Xe architecture. These GPUs bring a new dynamic to the market, offering several unique aspects:

  • Integration: Intel GPUs are designed to work in synergy with Intel CPUs, potentially optimizing system-level performance in HPC and AI setups.
  • OneAPI: Intel's oneAPI initiative strives to provide a unified programming model across its various hardware components, including GPUs, CPUs, and FPGAs, simplifying the development process (a small device-check sketch follows this list).
  • Xe HPC: Intel's Xe HPC GPUs target the high-performance computing market, competing directly with NVIDIA's Tesla series.
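
For a flavour of what targeting Intel GPUs from Python looks like today, here is a sketch. It assumes the Intel Extension for PyTorch (intel_extension_for_pytorch) and Intel GPU drivers are installed; under that stack the GPU shows up as an "xpu" device. Treat the exact API surface as an assumption, as it is moving quickly.

```python
# Sketch, assuming the Intel Extension for PyTorch and Intel GPU drivers are
# installed: Intel discrete GPUs appear to PyTorch as the "xpu" device.
import torch
import intel_extension_for_pytorch as ipex  # registers the xpu backend  # noqa: F401

if torch.xpu.is_available():
    device = torch.device("xpu")
    print("Device:", torch.xpu.get_device_name(0))
    x = torch.randn(4096, 4096, device=device)
    y = x @ x                                   # matmul executes on the Intel GPU
    print("Result shape:", y.shape)
else:
    print("No Intel XPU device visible to PyTorch.")
```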

It's important to note that Intel is a relative newcomer to the high-end GPU market, having almost exclusively offered 'good-enough' integrated graphics in laptop and desktop parts for decades. Their first-generation architecture, codenamed "Alchemist", has a SKU referred to as "Xe-HPC" that was only seen in public two months back (Intel's Ponte Vecchio is Finally in The Wild | Tom's Hardware (tomshardware.com)) after years of delays.



Intel's second-gen architecture, codenamed "Battlemage", has a SKU referred to as "Xe2-HPC" and may be released as "Rialto Bridge", though according to recent reports it is likely based on enhanced Xe-HPC cores rather than Xe2-HPC cores. In parallel, expectations for their consumer Battlemage GPUs have been significantly tempered (Intel rumoured to be scaling back its next-gen Battlemage GPU | PC Gamer), with performance now rumoured to land around NVIDIA's last-gen mid-range GPUs (released early 2022), for a product not due until mid-2024.

Intel has a lot of work in front of them, and it will likely be many years before we see whether they can bridge the gap at the high end of the market, or whether they remain in the low-to-mid performance (and cost) range.


Fringe Alternatives: ASICs (TPUs) and FPGAs

Application-Specific Integrated Circuits (ASICs)

ASICs are custom-designed chips tailored to perform a specific task exceptionally efficiently. In the context of HPC and AI, ASICs can be optimized for specific workloads, yielding substantial performance benefits:

  • Efficiency: ASICs excel in power efficiency and performance for their designated tasks, making them suitable for data-centric applications.
  • Challenges: Developing ASICs requires significant time, effort, and resources. They are not easily reprogrammable, limiting their flexibility for rapidly evolving workloads.


Amazon's AI platforms (Trainium for training and Inferentia for inference) are custom ASICs designed by the Annapurna Labs team Amazon acquired, the same team behind the Nitro chip that offloads all kinds of host tasks. It's reported that every AWS server that ships comes with at least one Nitro chip.

Amazon EC2 server with an Annapurna ASIC (just above the purple handle)
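
For context on what programming these parts looks like, here is a sketch, assuming a Trn1/Inf2 instance with the AWS Neuron SDK's torch-neuronx package installed. Rather than running eagerly, a PyTorch model is traced and ahead-of-time compiled for the Neuron cores.

```python
# Sketch, assuming an AWS Trn1/Inf2 instance with torch-neuronx (Neuron SDK):
# the model is traced and ahead-of-time compiled for the Neuron cores.
import torch
import torch_neuronx

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

example = torch.randn(1, 128)
neuron_model = torch_neuronx.trace(model, example)  # compiles for the accelerator
print(neuron_model(example).shape)                  # runs on the Neuron ASIC
```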


Another alternative ASIC is what Google calls the TPU (Tensor Processing Unit). Designed to address the unique demands of machine learning tasks, TPUs offer a specialized solution that is differentiated from traditional GPUs and other alternatives. Google's TPUs are trailblazers in AI acceleration, emphasizing performance, energy efficiency, and cloud-based accessibility.

A Google TPU on a PCIe card
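
On the software side, TPUs are reached through XLA-based frameworks. A minimal sketch, assuming a Cloud TPU VM with JAX installed: JAX enumerates the TPU cores and XLA compiles ordinary array code onto them.

```python
# Sketch, assuming a Cloud TPU VM with JAX installed: jax.devices() lists the
# TPU cores, and jax.jit has XLA compile ordinary array code onto them.
import jax
import jax.numpy as jnp

print(jax.devices())       # e.g. a list of TpuDevice objects on a TPU VM

@jax.jit                   # XLA-compiled for the TPU's matrix units
def matmul(a, b):
    return a @ b

a = jnp.ones((2048, 2048))
b = jnp.ones((2048, 2048))
print(matmul(a, b).shape)
```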


Field-Programmable Gate Arrays (FPGAs)

FPGAs are reconfigurable hardware components that can be programmed to perform various tasks, offering a balance between flexibility and performance:

  • Customizability: FPGAs can be reprogrammed for different workloads, making them adaptable to changing requirements.
  • Parallelism: FPGAs excel at parallel processing, which is highly beneficial for certain AI and HPC tasks.
  • Learning Curve: Working with FPGAs often requires specialized expertise in hardware design and programming, potentially lengthening the development cycle.

FPGAs tend to be used in local, embedded solutions utilizing OpenCL, and don't tend to appear in HPC-centric workloads. Think self-driving cars, medical imaging, and machine vision. Intel actually makes an FPGA, the Stratix 10 GX (released 2018), which achieves 143 INT8 TOPS at up to 225W, or around half of a last-gen AMD Instinct MI250X.

An Intel Stratix 10 GX FPGA Development Kit
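
To show how an FPGA accelerator typically surfaces to software, here is a short sketch using pyopencl, assuming a board whose vendor ships an OpenCL runtime (as Intel's FPGA SDK does): it simply enumerates the OpenCL platforms and devices.

```python
# Sketch using pyopencl: enumerate OpenCL platforms and devices. An FPGA card
# with a vendor OpenCL runtime typically reports as an ACCELERATOR device.
import pyopencl as cl

for platform in cl.get_platforms():
    print("Platform:", platform.name)
    for device in platform.get_devices():
        kind = cl.device_type.to_string(device.type)
        print(f"  {device.name} ({kind}), {device.global_mem_size / 2**30:.1f} GiB")
```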


In summary…

The HPC and AI landscape is evolving, and whilst the obvious choice for hardware accelerators has overwhelmingly been NVIDIA GPUs, AMD in particular is gaining traction with their GPUs, offering a competitive alternative. Intel is very early in its entry, and more fringe options like ASICs and FPGAs bring unique advantages but also challenges related to customization, programming complexity, and development time.

As the demand for computational power continues to grow, understanding and exploring these alternatives will be crucial for making informed decisions in optimizing HPC and AI workloads.

