AI Requirements for Datacenter Networking
Author: Pete Welcher. Coauthor: Brad Gregory.
This blog is a sequel to Brad Gregory's introduction to Typical AI Network Traffic Patterns. Brad's blog covers, at a high level, what AI training and inferencing each need from datacenter network infrastructure. Think of it as an executive summary.
This blog gets more technical, while briefly covering what the major datacenter switch vendors recommend for infrastructure.
The most stringent near-term datacenter demands come from the high performance needed for LLM training. The recommended designs can support inferencing in the datacenter as well, at least in the short term.
The reason I said "in the short term" is that, as Brad notes, future edge inferencing for real-time applications such as AI control and agentic AI may require fast, low-latency WAN connections, per various sources. Other uses of AI models may not have such stringent timing requirements.
Fun fact: Concerning low-latency WAN/edge, the speed of light is a limiting factor. Stock market trading networks shifted to wireless/microwave rather than fiber optic networking because light travels faster through air than through fiber (roughly 294,000 km/sec versus 200,000 km/sec, per a Google search).
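To put rough numbers on that, here is a quick back-of-the-envelope calculation using the speeds above. The 1,000 km path is just an illustrative round number, not any specific trading route:

```python
# One-way propagation delay over an illustrative 1,000 km path.
# Speeds are the approximate figures cited above.
C_FIBER_KM_S = 200_000  # approx. speed of light in fiber, km/sec
C_AIR_KM_S = 294_000    # approx. speed of light in air (microwave), km/sec

distance_km = 1_000
fiber_ms = distance_km / C_FIBER_KM_S * 1_000
air_ms = distance_km / C_AIR_KM_S * 1_000
print(f"fiber: {fiber_ms:.1f} ms, microwave: {air_ms:.1f} ms one-way")
# Output: fiber: 5.0 ms, microwave: 3.4 ms one-way
```

Roughly 1.6 ms saved each way, which is an eternity in high-frequency trading.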
For what it’s worth, Google search does come up with multiple article titles such as “The Future of Inferencing is at the Edge”.
This blog is a bit long, and is structured as follows: Why Build Your Own?, DeepSeek's Impact, Merged Vendor Notes, Per Vendor Highlights, Design Diagrams Galore, Conclusions, and Links.
For your visual amusement, here’s an image ChatGPT/DALL-E generated from the prompt “Create image showing a datacenter network supporting AI training”.
Why Build Your Own?
Buying AI services from a major vendor (cloud or AI vendor) is the fastest way to get started, especially given how costly and time-consuming it can be to get an AI section of a datacenter built. Even more so when you consider that datacenter power and cooling seem to perennially lag demand, as those requirements continue their rapid growth. I've seen enough sparsely populated racks over the years due to such issues, and due to weight considerations too!
So buying AI datacenter capacity may be needed in the short term anyway, while your datacenter buildout takes place.
It may make good sense to fund AI as OpEx in the short term, to get a better handle on your organization’s needs, funding, etc. Aka “stall for better data”.
How big a role do you think AI will play for your organization going forward? Do the executives agree, and will they fund it?
In the long run, it may be less costly to build your own capacity. But it's a heck of a commitment: AI datacenters and networking are NOT cheap, and they take time to build. Although your needs may not require building or acquiring an entire nuclear power plant, as some of the AI and cloud providers are reportedly doing.
DeepSeek’s Impact
DeepSeek has potentially impacted the business case.
The latest I’ve seen says DeepSeek may or may not have fudged the performance specs. I saw some coverage stating why there was suspicion, but haven’t seen anything I regard as definitive. Perhaps because only the Chinese know for sure?
My impression is that DeepSeek still may have been able to get comparable results at significantly lower cost, just not as spectacularly lower as first claimed. (Cf. online threads starting around 2/2/2025.)
If LLMs can be trained much more cheaply, does that make it more attractive for businesses to do AI in-house rather than with a provider (CapEx vs. OpEx)?
It does reduce the barrier to entry and to competition, as articles have noted.
AI-as-a-Service may still have a lot of impact for those who can't or don't want to hire staff for in-house training, or who don't want to fund the infrastructure for in-house training when they only need inferencing.
Did the US AI firms not optimize, viewing the high cost as a barrier to entry for competitors, plus (here's the cynical part) a way to attract larger amounts of capital, thereby funding later work to optimize performance?
There's a thought I had, then saw more or less reflected in someone's comment about DeepSeek. The thought: will AI discover modular, small, targeted LLMs to replace the one giant LLM that does it all? Apologies, I can't find the blog or article I'd read that said that.
Merged Vendor Notes
There were a lot of similarities across vendors as to AI datacenter requirements. The main difference was each vendor touting its own relevant datacenter switches.
Even if you prefer one vendor, reading the literature from the others might still be informative!
Here are the important common themes, broken out into topical areas (a small congestion management sketch follows the list):
- AI/ML infrastructure requirements
- Use of RoCEv2 (RDMA over Converged Ethernet v2)
- Building lossless networks
- Congestion management
- Network visibility and automation
- Network design examples
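To make the congestion management theme a bit more concrete, here is a minimal, illustrative Python sketch of ECN-style marking of the kind RoCEv2 fabrics rely on (DCQCN being the common scheme): when a switch queue crosses configurable depth thresholds, packets get ECN-marked rather than dropped, and senders cut their rate. The thresholds, probabilities, and rate steps below are made-up example numbers, not any vendor's defaults, and the sender logic is deliberately simplified:

```python
import random

# Illustrative WRED/ECN-style marking: below K_MIN, never mark; above K_MAX,
# always mark; in between, mark with linearly rising probability.
# These are made-up example values, not any vendor's defaults.
K_MIN = 100   # queue depth (packets) where marking begins
K_MAX = 400   # queue depth where every packet is marked
P_MAX = 0.2   # marking probability just below K_MAX

def should_mark_ecn(queue_depth: int) -> bool:
    """Return True if this packet should be ECN-marked (CE bit set)."""
    if queue_depth <= K_MIN:
        return False
    if queue_depth >= K_MAX:
        return True
    p = P_MAX * (queue_depth - K_MIN) / (K_MAX - K_MIN)
    return random.random() < p

def sender_rate_update(rate_gbps: float, saw_mark: bool) -> float:
    """Crude DCQCN-flavored reaction: cut rate on a mark, else ramp back up."""
    if saw_mark:
        return rate_gbps * 0.5           # multiplicative decrease on congestion
    return min(rate_gbps + 10.0, 400.0)  # additive increase toward line rate

if __name__ == "__main__":
    rate = 400.0
    for depth in (50, 150, 250, 350, 450, 300, 120):
        marked = should_mark_ecn(depth)
        rate = sender_rate_update(rate, marked)
        print(f"queue={depth:3d}  marked={marked!s:5}  rate={rate:6.1f} Gbps")
```

Real implementations (DCQCN on the NICs, plus PFC as the last-resort pause mechanism on the switches) are considerably more involved, but the mark-then-back-off feedback loop is the core idea.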
Per Vendor Highlights
Vendor-specific congestion avoidance mechanisms were noted above.
Other per-vendor highlights follow.
Arista:
Arista noted that Meta is using Arista for AI deployments.
Noteworthy: Arista recently announced their "distributed switch" technology, which virtually stacks smaller switches. This seems like a very interesting approach, apparently allowing you to grow AI compute clusters, and it does not require stacking cables. (Juniper also has virtual stacking, but with dedicated cabling between stack members.)
See the Arista 7060X6 switch link below for lots of technical details.
See also the AI Network WP (White Paper) link. It loosely sketches out several fabric designs with different switch sizes and scales.
Cisco:
Cisco's writeups were arguably a bit more deeply technical than the others', though all had a good amount of detail. Cisco Nexus 9000 switches support the above features with intelligent buffering and telemetry.
Cisco had the most design detail, going into sample switch models, port counts, and speeds for a couple of redundant non-blocking Clos fabric designs. I call this "fabric ports and bandwidth math", and intend to write a follow-on blog demonstrating it.
HPE:
HPE didn't seem to have much technical depth to offer about high-speed networking for AI. Their marketing literature has some good business-level discussion of general requirements, e.g. quality of data, security, etc. That is, less technical, more management-directed content. I poked around their website some but did not find more technical content; perhaps I just missed it. Their documents ultimately ended up with more of a compute/storage focus, e.g. their "data fabric". That's outside the present document's scope and perhaps not very relevant to AI, other than for storing massive amounts of AI training materials.
Juniper:
I treated Juniper as separate from HPE, since various pundits think the HPE acquisition will fall through.
Juniper had good details re congestion mechanisms, perhaps in a bit more depth than the others.
Juniper slammed InfiniBand, positioning RoCE/Ethernet as the open technology.
They stated that their solution has three foci. Juniper then elaborated on these, mentioning their chipsets, congestion controls, and Apstra blueprints, and noting Apstra's support for Nvidia rail-optimized designs.
I have provided a second Juniper link below, for a document that goes into a lot more detail, including what I call “fabric ports and bandwidth math”, worked out in detail. That does get rather technical, lots of details!
NOTE: Juniper presented a lot of great AI datacenter content at Cloud Field Day 20, where I was a delegate. Their linked documents below cover some of the same material, but the recorded videos go deeper. Highly recommended: follow the link below.
Nokia:
Nokia was unique in supporting both InfiniBand and Ethernet, likely based on their customer base. The others recommend Ethernet as open, simpler, and more flexible. And it's what they have in their inventory! Some also mention InfiniBand in passing.
Nokia on InfiniBand: it is the traditional RDMA transport, but RoCEv2 provides RDMA over Ethernet by carrying the InfiniBand transport headers and payload in Ethernet/UDP frames, with Ultra Ethernet as a potential future transport.
Nokia also mentioned direct GPU-to-GPU fabrics (Nvidia's NVLink, and AMD's similar Infinity Fabric), versus frames going up to a leaf switch and back (higher latency but cheaper).
Nokia also discussed rail-optimized versus Clos topologies.
Nokia advised that technology is evolving fast in this space: collaborate with friends. Also avoid snowflake designs; look for holistically optimized designs.
Nokia went briefly into some of the non-network planning and design topics, including: preparation for buildout, purchase, power, land, cooling, ease of operations, staffing, automation, toolchain, future growth.
Design Diagrams Galore
To their credit, all the vendors got down to brass tacks with topology diagrams, albeit in varying degrees of detail. As noted above, some even showed various size datacenters with specific switch models (Arista, Cisco).
The good news there is that some (Cisco) even did the port counting for how many spine and leaf switches and how many links between them. I call this “port and bandwidth math”.
Since this blog is already getting long, I'm considering covering "port/bandwidth math" in a follow-on blog. The details are important! Meanwhile, the sketch below gives a taste of it.
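Here is a minimal, illustrative Python sketch of the sizing arithmetic for a two-tier non-blocking leaf-spine fabric. The 64-port switch and 1024 GPU figures are made-up example parameters, not any particular vendor's models:

```python
import math

def size_leaf_spine(gpus: int, ports_per_switch: int) -> dict:
    """Size a non-blocking two-tier leaf-spine (Clos) fabric.

    Assumes all switch ports run at the same speed and each leaf splits
    its ports 50/50: half down to GPU NICs, half up to spines, which is
    what makes the fabric non-blocking (1:1 oversubscription).
    """
    down_per_leaf = ports_per_switch // 2           # GPU-facing ports per leaf
    up_per_leaf = ports_per_switch - down_per_leaf  # spine-facing uplinks

    leaves = math.ceil(gpus / down_per_leaf)
    # One uplink from each leaf to each spine, so spines = uplinks per leaf,
    # and each spine must have a port for every leaf.
    spines = up_per_leaf
    if leaves > ports_per_switch:
        raise ValueError("Each spine needs a port per leaf; "
                         "use bigger spines or add a third tier.")
    return {
        "leaves": leaves,
        "spines": spines,
        "fabric_links": leaves * up_per_leaf,
    }

# Example: 1024 GPUs on 64-port switches (say, 64 x 400G).
print(size_leaf_spine(gpus=1024, ports_per_switch=64))
# -> {'leaves': 32, 'spines': 32, 'fabric_links': 1024}
```

Note how fast the link count grows: a non-blocking fabric needs as many leaf-spine links as GPU-facing ports, which is a big part of why AI back-end networks are expensive.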
Conclusions
Traditional networking and switch vendors prefer fabric topologies and high-speed Ethernet switches. Little surprise there. The key point is consistent, simple design that scales up well.
The Cisco Blueprint link below provides extensive discussion of most of the factors mentioned above, along with several fabric design diagrams. The Arista AI Networking document below has some good topology diagrams and discussion.
Latency considerations mean that at most a two-layer spine-leaf network topology should be used, unless tremendous scale is needed, of course.
Where possible, using a single (possibly very large) switch for the back-end network minimizes latency: a single hop between XPUs.
The front end needs to support control/management traffic and communications into the training cluster, unless out-of-band management is used, of course.
Arista’s virtual stacking technology provides an interesting design alternative to using big chassis switches for the spine role.
Nokia is a bit more agnostic regarding datacenter technologies (InfiniBand versus high-speed and Ultra Ethernet), and perhaps favors chip/hardware-centric approaches as a high-performance alternative. Note their historical telecom focus. The Nokia link below is fairly generic, but their presentation deck goes into a lot more good detail, including fabric diagrams.
Concerning InfiniBand versus high-speed Ethernet: personally, I think simplicity is good. The fewer different technologies used, the simpler buildout and operations will be.
Note: For both Arista and Cisco, I did a small experiment after taking summary notes about their documents. ChatGPT did a very good job of succinctly summarizing the key document in both cases. For Brad’s predecessor blog, we saw a ChatGPT summary that got the main points right but was way too verbose.
Links
Note: The first Nokia link above has links to short videos, including ones about topics like RoCEv2 and PFC/ECN for congestion control.
Somewhat Related Links
Miscellany
Reminder: you may want to check back on my articles on LinkedIn to review any comments or comment threads. They can be a quick way to have a discussion, correct me, or share your perspectives on technology.
Hashtags: #PeterWelcher #BradGregory #CCIE1773 #AINetworking #AIDataCenter
FTC disclosure statement: https://www.dhirubhai.net/pulse/ftc-disclosure-statement-peter-welcher-y8wle/
Twitter: @pjwelcher
LinkedIn: Peter Welcher, https://www.dhirubhai.net/in/pjwelcher/
Mastodon: @[email protected]