OCP Global Summit 2024 Series


For the final piece of the Global Summit wrap-up, I focus on networking, both inside the server and between racks, and take a crystal-ball look at where AI is heading.


Networking

In the area of networking, the conference shed light on some fantastic initiatives that are set to redefine how we connect and scale AI systems. These initiatives promise to give users viable alternatives and real choice, bringing much-needed competition to the space and reducing vendor reliance and supply-chain risk.


SCALE UP with UALink (Ultra Accelerator Link)

As an alternative to NVLink, UALink focuses on scaling up: connecting GPUs together to form a more powerful, unified GPU. The UALink Consortium has the who's who of hyperscalers (Meta, Amazon Web Services (AWS), Microsoft), silicon providers (Intel, AMD) and networking vendors (Cisco, Nokia), plus many others, collaborating to provide a solution for interconnecting non-NVIDIA chips (like AMD's MI300X or Intel's Gaudi 3) in a single node, at high speed and low latency, to create a large logical processor that shares resources (critically, memory) to host GenAI workloads like LLMs.

Essentially, make the biggest GPU possible.
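The pooled-memory angle can be made concrete with some back-of-the-envelope arithmetic. The sketch below uses illustrative numbers of my own choosing (192 GB of HBM per accelerator, roughly MI300X-class, eight accelerators per node, FP16 weights), not figures from the UALink specification:

```python
# Back-of-the-envelope sketch: why scale-up memory pooling matters.
# All numbers are illustrative assumptions, not UALink specifications.
HBM_PER_ACCEL_GB = 192        # roughly MI300X-class HBM capacity
ACCELERATORS_PER_NODE = 8     # a typical single-node scale-up domain

pooled_gb = HBM_PER_ACCEL_GB * ACCELERATORS_PER_NODE  # 1536 GB shared

# A 405B-parameter model at FP16 needs ~2 bytes per parameter for weights alone.
weights_gb = 405 * 2          # ~810 GB

# Too big for any single accelerator, but it fits in the pooled domain.
fits_single = weights_gb <= HBM_PER_ACCEL_GB   # False
fits_pooled = weights_gb <= pooled_gb          # True
```

In other words, the point of "one big logical GPU" is that the model only has to fit in the pooled memory of the scale-up domain, not in any single device.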

SCALE OUT with UEC (Ultra Ethernet Consortium)

One of the key highlights was the progress made on products from the Ultra Ethernet Consortium (UEC). The UEC is spearheading efforts to develop next-generation Ethernet technologies tailored for AI workloads. Not surprisingly, many of the same folks interested in connecting accelerators (GPUs) together within a node are the ones who also want to connect as many of those large GPUs together as possible.

Products based on UEC are promising InfiniBand-like performance and features, using traditional Ethernet tooling.
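To give a feel for how far scale-out Ethernet stretches, the classic three-tier fat-tree (Clos) capacity formula is a useful rule of thumb. This is a rough illustration of topology math, not anything mandated by the UEC specification:

```python
# Rough sketch: endpoint capacity of a 3-tier fat-tree (Clos) fabric,
# the topology commonly used for Ethernet GPU back-end networks.
# Illustrative only; UEC does not mandate a particular topology.
def fat_tree_hosts(k: int) -> int:
    """Classic result: a 3-tier fat-tree of k-port switches supports k**3 // 4 hosts."""
    return k ** 3 // 4

# With 64-port switches, a single fabric can reach 65,536 endpoints,
# which is the scale where Ethernet tooling and operability really pay off.
print(fat_tree_hosts(64))  # 65536
```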

For a little more detail on Ultra Ethernet, be sure to watch this:

It was also great to see AMD's UEC NICs. AMD presented what appears to be their first UEC-compatible NIC, the AMD Pensando Pollara 400GbE card. While the name might be a mouthful, it's exciting to see AMD sampling this card in Q4, with general availability expected around Q2 of 2025. This aligns with AMD's acquisition of Pensando and their push into advanced networking solutions.

Patrick Kennedy and the ServeTheHome team have fantastic coverage on their website: https://www.servethehome.com/amd-pensando-pollara-400-ultraethernet-rdma-nic-launched/

To explore both Ultras (UALink and UEC) at a high level, check out this helpful presentation:

Other Tidbits

• Intel’s New IPU (Infrastructure Processing Unit): Intel introduced a new NIC, which they refer to as an IPU—essentially their take on a DPU (Data Processing Unit). This move signifies Intel’s commitment to offloading and accelerating network functions, improving data center efficiency.

• Dell’s ORv3 Rack with Liquid-Cooled NVLink: Dell showcased their XE9712, an ORv3 rack equipped with liquid-cooled NVLink (specifically the NVIDIA NVL72). This setup supports up to a staggering 180 kW per rack, highlighting the intense power and cooling requirements of modern AI infrastructure.

I talk about this briefly in my OCP podcast recording with Rob Coyle , so stay tuned for that one!

AI's Shift Towards Inference

A significant theme that caught my attention was the industry’s pivot towards AI inference. For years, I’ve been emphasizing the impending dominance of inference workloads over training. Training massive AI models has been the focus, with organizations scrambling to build solutions and supply chains to meet those demands. However, it’s now widely accepted that inference (the deployment and use of these trained models) is the long tail of AI, and enterprise AI adoption really has not started (yet).

This shift is prompting organizations to prepare for the unique challenges and opportunities that inference presents. From optimizing hardware for lower latency and higher throughput to rethinking data center designs for efficiency, the focus is broadening. It’s an exciting time as the industry adjusts to balance both training and inference workloads effectively.
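The latency-versus-throughput tension at the heart of inference serving can be sketched with a toy model. The constants below are invented for illustration, not measurements from any real system:

```python
# Toy model of inference serving: per-batch time = fixed overhead + per-item cost.
# Constants are illustrative assumptions, not measurements.
OVERHEAD_MS = 20.0   # e.g. kernel launch, scheduling, network hop
PER_ITEM_MS = 2.0    # marginal compute per request in the batch

def batch_latency_ms(batch_size: int) -> float:
    """Time to serve one batch; every request in it waits this long."""
    return OVERHEAD_MS + PER_ITEM_MS * batch_size

def throughput_rps(batch_size: int) -> float:
    """Requests completed per second at a given batch size."""
    return batch_size / (batch_latency_ms(batch_size) / 1000.0)

# Larger batches amortise the fixed overhead (higher throughput), but every
# request in the batch waits longer (higher latency):
# batch=1  -> 22 ms latency, ~45 req/s
# batch=32 -> 84 ms latency, ~381 req/s
```

Training mostly cares about the throughput side of this curve; inference serving has to pick an operating point that also respects a latency budget, which is exactly why it drives different hardware and data-center choices.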


Thanks again to the Open Compute Project Foundation team, and all the contributing members, for moving the needle, sharing their work and partnering in a collaborative way. It's a great and thriving community, and a credit to all involved who dedicate so much time, blood, sweat and tears.

Looking forward to 2025!
