Evolution of Data Center Networking Designs and Systems for AI Infrastructure – Part 3
In parts 1 and 2 of this article, I covered changes in data center network designs driven by modern AI training and inference applications. I focused on the scale-out and scale-up portions of the backend network that connects off-the-shelf GPUs from vendors like NVIDIA and AMD. In this part 3 of the series, I share my observations on standardization efforts related to the AI backend network and on the requirements for their success. The design of AI servers, an ecosystem element where backend network components are incorporated, is an important aspect of such standardization efforts; I will cover this and other ecosystem aspects as well.
Scale-out Networking Standardization
In part 1, I discussed standardization efforts related to the scale-out network (see figure 1) and the formation of the Ultra Ethernet Consortium (UEC) in 2023 to enhance Ethernet to better address AI backend networking requirements. Today, InfiniBand solutions from NVIDIA dominate sales in this portion of the backend network; quoting from a recent article by The Register [1]: "…AI systems currently account for significantly less than 10 percent of the total addressable market for network switching, and, of that, about 90 percent of deployments are using NVIDIA/Mellanox's InfiniBand — not Ethernet…". This new AI backend portion of the data center network is already a very sizable market and growing fast; it seems the incumbent Ethernet networking vendors got caught like deer in the headlights of fast-moving AI. The UEC's effort to standardize networking for AI based on Ethernet started only last year, and that seems a bit late.
Never bet against Ethernet … as the old adage in the networking industry goes. Will Ethernet evolve to eliminate all things not Ethernet in AI networking? There is great precedent for Ethernet as a technology and ecosystem. It can boast a proven history of evolving to better meet the needs of new applications, eliminating competing technologies and solutions in the process. For example, it did so with technologies and solutions needed for the Internet backbone in the 1990s and 2000s, and for hyper-scale data centers in the 2010s.
If history repeats itself, the UEC's mission to improve Ethernet for AI networking must apply learnings from successful initiatives like the above. In those examples, the advancements centered on how Ethernet packets are routed and switched in the networking infrastructure, and on how the control and data plane layers for Ethernet and IP processing are implemented. Compared to such past instances, with AI scale-out networking the ecosystem that Ethernet will have to interplay with will be a greater barrier than the advancements in the technology itself (like an improved transport for better congestion management) or the related Ethernet-based products (like a new class of Ethernet-based AI NICs and switches). The following ecosystem elements, which powerful incumbents will continue to improve and extend for their current solutions, will need to be addressed with new Ethernet-based innovations for AI scale-out networking:
Compatibility with APIs and AI Frameworks
I covered the role of RoCE (RDMA over Converged Ethernet) in a previous post [4]. A significant part of the UEC's effort to improve congestion handling in Ethernet for AI networking has to do with improving the transport layer; see figure 2. It is great to see the UEC moving rapidly toward releasing its v1.0 set of specifications in 2024, with a focus on maintaining compatibility with the verbs API layer while adding UEC-specific extensions [5].
With the RoCE v2 specification, which has enjoyed large-scale deployments since its release in 2014, only the InfiniBand (IB) transport remained, with all other underlying layers replaced by Ethernet and IP. The UEC v1.0 specifications will replace the IB transport layer. That layer will be the hardest to replace given the stickiness it has developed over more than a decade: AI frameworks (e.g., PyTorch) and Collective Communication Libraries (e.g., NCCL for NVIDIA GPUs and RCCL for AMD GPUs) have been built and enhanced over the verbs API layer, which today resides on the IB transport layer.
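To make that stickiness concrete, below is a minimal sketch of the verbs call sequence that CCL layers and RDMA applications are written against. It is illustrative only (error handling is omitted, and in practice the buffers would be GPU memory), but every ibv_* call shown is part of the API surface that a UEC transport must continue to serve underneath.

```c
// Minimal sketch of the verbs API setup sequence a CCL layer depends on.
// Illustrative fragment; error handling omitted for brevity.
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void) {
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);                      // protection domain
    struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0); // completion queue

    static char buf[1 << 20];                                   // GPU buffers in practice
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, sizeof buf,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);    // memory registration

    struct ibv_qp_init_attr qpa = {
        .send_cq = cq, .recv_cq = cq,
        .qp_type = IBV_QPT_RC,                                  // reliable connection
        .cap = { .max_send_wr = 128, .max_recv_wr = 128,
                 .max_send_sge = 1, .max_recv_sge = 1 },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &qpa);                // queue pair

    // ... exchange QP info out of band, then ibv_post_send()/ibv_poll_cq() ...

    ibv_destroy_qp(qp); ibv_dereg_mr(mr); ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd); ibv_close_device(ctx); ibv_free_device_list(devs);
    return 0;
}
```

Everything above the transport, NCCL and RCCL included, speaks in these terms, which is precisely why the UEC chose to keep the verbs layer compatible rather than replace it.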
As we have seen over and over in the industry, established incumbents will continue to enhance and extend the current ecosystem built on the IB verbs API and transport layers, thereby extending the life of products that utilize the IB transport protocol.
The UEC proponents, on the other hand, will have to find a way to replace the current IB transport protocol with the UEC transport protocol with minimal disruption to the current AI framework and application ecosystems. For example, acceleration of collective operations implemented using the CCL layers will have to be tuned for new UEC-defined capabilities like packet spraying and load balancing for improved congestion handling (see the sketch below). Given that the CCL layers for NVIDIA and AMD GPUs – sandwiched between the AI framework and verbs API layers – are controlled by the respective GPU vendors, an adequate level of collaboration will be needed from them to achieve this goal.
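As a rough illustration of why packet spraying changes the tuning problem, the toy model below contrasts classic per-flow ECMP hashing with per-packet spraying. This is a simplified sketch of the general technique, not the UEC specification's actual algorithm; all names and constants are the author's.

```c
// Toy contrast of flow-hash load balancing vs. packet spraying.
#include <stdint.h>
#include <stdio.h>

#define NUM_PATHS 4

// Classic ECMP: one hash per flow pins every packet of that flow to a
// single path, so a single elephant flow can congest one link.
static int ecmp_path(uint32_t flow_hash) {
    return flow_hash % NUM_PATHS;
}

// Packet spraying: each packet of a flow may take a different path;
// the receiving transport must then tolerate out-of-order arrival.
static int spray_path(uint32_t flow_hash, uint32_t pkt_seq) {
    return (flow_hash + pkt_seq) % NUM_PATHS;
}

int main(void) {
    uint32_t flow = 0xBEEF;
    for (uint32_t seq = 0; seq < 8; seq++)
        printf("pkt %u: ecmp=%d spray=%d\n", seq,
               ecmp_path(flow), spray_path(flow, seq));
    return 0;
}
```

The out-of-order tolerance implied by the second function is exactly the kind of behavior the CCL layers and the transport beneath them must be co-tuned for.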
Interplay with Scale-up Networking Technologies and Solutions
The scale-up portion of the network has evolved from standards-based PCIe to proprietary GPU fabrics like NVLink and Infinity Fabric. In part 2 of this article, I discussed this evolution and how the scale-out and scale-up networks in the AI backend network are increasingly entwined. The tight interplay between the two is driven by implementations in the CCL layer (used by popular AI frameworks) to provide the highest-bandwidth, congestion-free paths between GPUs in a cluster. The scale-up network implemented using proprietary GPU fabrics is the preferred path for GPU-to-GPU data movement, with fallback to the scale-out network (implemented using InfiniBand or Ethernet) only if necessary; a sketch of this preference follows below. The scale of proprietary GPU fabrics is increasing – from 8-GPU to 72-GPU clusters, as seen with NVLink in the latest GB200 NVL72 rack-scale design by NVIDIA – limiting the use of the scale-out network to larger GPU cluster sizes. Successful standardization and deployment of UEC-defined scale-out networks will require all interplay with the scale-up network via the verbs API and CCL layers to be worked out and optimized. Only then will the underlying layers be able to meet the vital performance and scale needs of AI frameworks and the Generative AI applications that use them.
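Here is a minimal sketch of that path preference. The types and function names are hypothetical (this is not NCCL's or RCCL's API), but it captures the decision CCL layers make: stay on the scale-up fabric when both GPUs share a fabric domain, and fall back to the scale-out NICs otherwise.

```c
// Hypothetical sketch of CCL-style transport path selection between GPUs.
#include <stdio.h>

typedef enum { PATH_SCALE_UP, PATH_SCALE_OUT } path_t;

typedef struct {
    int domain_id;   // which scale-up domain (e.g., an NVL72 rack) the GPU is in
    int gpu_index;   // index within that domain
} gpu_loc_t;

// Prefer the proprietary GPU fabric when both GPUs share a scale-up domain;
// otherwise fall back to the scale-out (InfiniBand/Ethernet) network.
static path_t choose_path(gpu_loc_t a, gpu_loc_t b) {
    if (a.domain_id == b.domain_id)
        return PATH_SCALE_UP;    // highest-bandwidth, congestion-free path
    return PATH_SCALE_OUT;       // RDMA NIC via rail/row switches
}

int main(void) {
    gpu_loc_t g0 = { .domain_id = 0, .gpu_index = 3 };
    gpu_loc_t g1 = { .domain_id = 0, .gpu_index = 7 };
    gpu_loc_t g2 = { .domain_id = 1, .gpu_index = 0 };

    printf("g0->g1: %s\n", choose_path(g0, g1) == PATH_SCALE_UP ? "scale-up" : "scale-out");
    printf("g0->g2: %s\n", choose_path(g0, g2) == PATH_SCALE_UP ? "scale-up" : "scale-out");
    return 0;
}
```

As scale-up domains grow from 8 to 72 GPUs, the first branch absorbs more of the traffic, which is why the two networks can no longer be designed in isolation.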
Compatibility with Server and Rack-level Designs
Servers that contain GPUs for AI processing have evolved over time; see the high-level view of components in such servers in figures 3(a) through 3(d). Modern servers with GPUs, as shown in figures 3(c) and 3(d), are unique, different from anything we have seen in the past. The number of networking components and network types involved in implementing scale-out and scale-up networking is mind-boggling.
Figure 4 shows an example of a server design using NVIDIA H100 GPUs. It is a real-world instance of figure 3(c), just with double the number of components. There is a one-to-one pairing of an RDMA NIC (NVIDIA ConnectX-7 or CX7) to a GPU (NVIDIA H100), with 8 GPUs per server. Combine that with 2 CPUs per server and front-end networking needs (up to 2 additional NICs per server), and the total number of NICs serving GPUs and CPUs in an AI server comes to 10. Many PCIe switches are used to enable such connectivity. The need for higher GPU processing scale, driven by Generative AI applications, is pushing the need for denser rack designs with more GPUs. The latest NVIDIA GB200 NVL72 rack-level design is liquid cooled and houses 72 GPUs [7], with a very large number of networking components to enable data movement at scale between those GPUs. Figure 3(d) is a representation of the GB200 design, where 4 GPUs and 2 CPUs connected using GPU fabric links (the links between the GPUs and the CPUs are chip-to-chip) form a single compute node.
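The NIC arithmetic above is simple enough to restate in a few lines; the sketch below just recomputes the figure-4 counts (the constants are taken from that figure, not from any vendor specification).

```c
// Back-of-the-envelope NIC count for the 8-GPU H100 server in figure 4.
#include <stdio.h>

int main(void) {
    int gpus = 8;
    int rdma_nics = gpus;          // one ConnectX-7 per GPU, 1:1 pairing
    int cpus = 2;
    int frontend_nics = 2;         // up to 2 NICs for front-end traffic
    int total_nics = rdma_nics + frontend_nics;

    printf("GPUs=%d CPUs=%d backend NICs=%d front-end NICs=%d total NICs=%d\n",
           gpus, cpus, rdma_nics, frontend_nics, total_nics);   // total = 10
    return 0;
}
```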
[Figure 4 courtesy: https://community.fs.com/article/nvlink-vs-pcie-selecting-the-ideal-option-for-nvidia-ai-servers.html; the table of network components per AI server on the right-hand side is by the author.]
The complexity of such server designs is likely what drives companies like NVIDIA to create server designs (for example, the MGX reference designs [6] and, most recently, a rack-level design, the GB200 NVL72 [7]) that are beginning to serve as "standard designs" that server OEMs and ODMs simply follow. These are tightly engineered, vertically integrated systems. The entire software stack up to the CCL layer is optimized for those prescribed designs. If a server OEM or ODM chooses to make a change to a prescribed design, they have to ensure the software stack and performance tuning do not break; that may be a tall order for many.
Terms like Row Switch and Rail Switch are common in AI network designs. While these components do not reside inside a server, they are tightly coupled with the backend RDMA NICs, which in turn are tightly coupled with the GPUs in the server; this is shown in figure 5. The one-to-one correspondence between the NICs and the Rail Switches (NIC-0 in all servers communicating via Rail Switch-0, NIC-1 in all servers communicating via Rail Switch-1, and so on) brings these switches closer to the NICs than ever; the sketch below spells out the mapping. The need to improve congestion handling even further could result in a collapsing of these layers, which could impact the design of servers in the future.
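For concreteness, here is a minimal sketch of that rail-optimized wiring rule. The server and NIC counts are taken from the figure-4 example, and the code is an illustrative model, not a vendor configuration tool.

```c
// Minimal sketch of rail-optimized wiring: NIC-i on every server
// connects to Rail Switch-i, regardless of which server it is in.
#include <stdio.h>

#define SERVERS 4
#define NICS_PER_SERVER 8   // one RDMA NIC per GPU, as in figure 4

int main(void) {
    // Under the one-to-one rule, the rail switch index depends only on
    // the NIC index, never on the server index.
    for (int s = 0; s < SERVERS; s++)
        for (int n = 0; n < NICS_PER_SERVER; n++)
            printf("server %d, NIC-%d -> Rail Switch-%d\n", s, n, n);
    return 0;
}
```

Because the mapping is a pure function of the NIC index, rail switch i carries exactly the traffic of NIC i across all servers, which is what makes collapsing the NIC and rail-switch layers conceivable.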
Observations and Food for Thought
Historically, innovations in how packets are routed and switched in the networking infrastructure have helped Ethernet remain dominant versus competing technologies. Will the UEC's efforts to advance Ethernet technology for AI scale-out networking need a more ecosystem-focused approach?
Successful incumbents typically continue to extend the ecosystem capabilities of the current technologies and solutions through which they enjoy a large market share. This can create deployment hurdles for challengers that utilize new technologies and standards [8]. NVIDIA, the gorilla incumbent in this space, has been innovating at breakneck speed, leading the industry with new features and products and driving the ecosystem to move rapidly with it. Will the company take a more prominent role in improving Ethernet for AI?
What is Next?
In the next post in this series, which will be the last, I will revisit the observations and food for thought from the last two articles and this one, and offer my views on the approaches we will need to take to solve the technology and business challenges differently.
References: