OCP Global Summit 2024 Series


For the final piece of the Global Summit wrap-up, I focus on networking, both inside the server and between racks, and take a crystal-ball look at where AI is heading.


Networking

In the area of networking, the conference shed light on some fantastic initiatives that are set to redefine how we connect and scale AI systems. These initiatives promise to give users viable alternatives and real choice, bringing much-needed competition to the space and reducing vendor reliance and supply-chain risk.


SCALE UP with UALink (Ultra Accelerator Link)

As an alternative to NVLink, UALink focuses on scaling up: connecting GPUs together to form a more powerful, unified GPU. The UALink Consortium has the who's who of hyperscalers (Meta, Amazon Web Services (AWS), Microsoft), silicon providers (Intel, AMD) and networking vendors (Cisco, Nokia), plus many others, collaborating to provide a solution for interconnecting non-NVIDIA chips (like AMD's MI300X or Intel's Gaudi 3) in a single node, at high speed and low latency, to create a large logical processor that shares resources (critically, memory) to host GenAI workloads like LLMs.

Essentially, make the biggest GPU possible.
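The pooled-memory angle can be made concrete with some back-of-the-envelope arithmetic. The sketch below uses illustrative numbers of my own choosing (192 GB of HBM per accelerator, roughly MI300X-class, eight accelerators per node, FP16 weights), not figures from the UALink specification:

```python
# Back-of-the-envelope sketch: why scale-up memory pooling matters.
# All numbers are illustrative assumptions, not UALink specifications.
HBM_PER_ACCEL_GB = 192        # roughly MI300X-class HBM capacity
ACCELERATORS_PER_NODE = 8     # a typical single-node scale-up domain

pooled_gb = HBM_PER_ACCEL_GB * ACCELERATORS_PER_NODE  # 1536 GB shared

# A 405B-parameter model at FP16 needs ~2 bytes per parameter for weights alone.
weights_gb = 405 * 2          # ~810 GB

# Too big for any single accelerator, but it fits in the pooled domain.
fits_single = weights_gb <= HBM_PER_ACCEL_GB   # False
fits_pooled = weights_gb <= pooled_gb          # True
```

In other words, the point of "one big logical GPU" is that the model only has to fit in the pooled memory of the scale-up domain, not in any single device.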

SCALE OUT with UEC (Ultra Ethernet Consortium)

One of the key highlights was the progress made on products from the Ultra Ethernet Consortium (UEC). The UEC is spearheading efforts to develop next-generation Ethernet technologies tailored for AI workloads. Not surprisingly, many of the same folks interested in connecting accelerators (GPUs) together within a node are the ones who also want to connect as many of those large GPUs together as possible.

Products based on UEC are promising InfiniBand-like performance and features, using traditional Ethernet tooling.
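To give a feel for how far scale-out Ethernet stretches, the classic three-tier fat-tree (Clos) capacity formula is a useful rule of thumb. This is a rough illustration of topology math, not anything mandated by the UEC specification:

```python
# Rough sketch: endpoint capacity of a 3-tier fat-tree (Clos) fabric,
# the topology commonly used for Ethernet GPU back-end networks.
# Illustrative only; UEC does not mandate a particular topology.
def fat_tree_hosts(k: int) -> int:
    """Classic result: a 3-tier fat-tree of k-port switches supports k**3 // 4 hosts."""
    return k ** 3 // 4

# With 64-port switches, a single fabric can reach 65,536 endpoints,
# which is the scale where Ethernet tooling and operability really pay off.
print(fat_tree_hosts(64))  # 65536
```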

For a little more detail on Ultra Ethernet, be sure to watch this:

It was also great to see AMD's UEC NICs. AMD presented what appears to be their first UEC-compatible NIC, the AMD Pensando Pollara 400GbE card. While the name might be a mouthful, it's exciting to see AMD sampling this card in Q4, with general availability expected around Q2 of 2025. This aligns with AMD's acquisition of Pensando and their push into advanced networking solutions.

Patrick Kennedy and the ServeTheHome team have fantastic coverage on their website: https://www.servethehome.com/amd-pensando-pollara-400-ultraethernet-rdma-nic-launched/

To explore both Ultras (UALink and UEC) at a high level, check out this helpful presentation:

Other Tidbits

• Intel’s New IPU (Infrastructure Processing Unit): Intel introduced a new NIC, which they refer to as an IPU—essentially their take on a DPU (Data Processing Unit). This move signifies Intel’s commitment to offloading and accelerating network functions, improving data center efficiency.

• Dell’s ORv3 Rack with Liquid-Cooled NVLink: Dell showcased their XE9712, an ORv3 rack equipped with liquid-cooled NVLink (specifically the NVIDIA NVL72). This setup supports up to a staggering 180 kW per rack, highlighting the intense power and cooling requirements of modern AI infrastructure.

I talk about this briefly in my OCP podcast recording with Rob Coyle , so stay tuned for that one!

AI's Shift Towards Inference

A significant theme that caught my attention was the industry’s pivot towards AI inference. For years, I’ve been emphasizing the impending dominance of inference workloads over training. Training massive AI models has been the focus, with organizations scrambling to build solutions and supply chains to meet those demands. However, it’s now widely accepted that inference (the deployment and use of these trained models) is the long tail of AI, and enterprise AI adoption really has not started (yet).

This shift is prompting organizations to prepare for the unique challenges and opportunities that inference presents. From optimizing hardware for lower latency and higher throughput to rethinking data center designs for efficiency, the focus is broadening. It’s an exciting time as the industry adjusts to balance both training and inference workloads effectively.
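The latency-versus-throughput tension at the heart of inference serving can be sketched with a toy model. The constants below are invented for illustration, not measurements from any real system:

```python
# Toy model of inference serving: per-batch time = fixed overhead + per-item cost.
# Constants are illustrative assumptions, not measurements.
OVERHEAD_MS = 20.0   # e.g. kernel launch, scheduling, network hop
PER_ITEM_MS = 2.0    # marginal compute per request in the batch

def batch_latency_ms(batch_size: int) -> float:
    """Time to serve one batch; every request in it waits this long."""
    return OVERHEAD_MS + PER_ITEM_MS * batch_size

def throughput_rps(batch_size: int) -> float:
    """Requests completed per second at a given batch size."""
    return batch_size / (batch_latency_ms(batch_size) / 1000.0)

# Larger batches amortise the fixed overhead (higher throughput), but every
# request in the batch waits longer (higher latency):
# batch=1  -> 22 ms latency, ~45 req/s
# batch=32 -> 84 ms latency, ~381 req/s
```

Training mostly cares about the throughput side of this curve; inference serving has to pick an operating point that also respects a latency budget, which is exactly why it drives different hardware and data-center choices.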


Thanks again to the Open Compute Project Foundation team, and all the contributing members, for moving the needle, sharing their work and partnering in a collaborative way. It's a great and thriving community, and a credit to all involved who dedicate so much time, blood, sweat and tears.

Looking forward to 2025!
