Should UEC and UAL Merge?

Is it necessary to have separate consortiums for scale-up and scale-out AI systems, or could the Ultra Ethernet Consortium (UEC) and Ultra Accelerator Link (UAL) merge under a single umbrella?

UEC is working to advance Ethernet technologies for scale-out AI/HPC applications. It aims to enhance Ethernet’s bandwidth, latency, and efficiency for scale-out systems by creating standards and hardware improvements that facilitate high-performance communication across thousands of interconnected nodes.

The UAL Consortium aims to deliver specifications and standards that allow industry players to develop high-speed interconnects for AI accelerators to scale up vertically and act like single, tightly coupled units.

The UAL specification for GPU-to-GPU interconnect was initially influenced by AMD's Infinity Fabric, which uses PCIe-like physical and data link layers to achieve ultra-low latencies. However, this strict low-latency requirement often results in interconnects operating at lower bandwidths than Ethernet counterparts. Nvidia recognized this early on and avoided PCIe semantics in its NVLink protocol; NVLink 5.0 lanes operate at 200 Gbps, three times the bandwidth of PCIe Gen6.
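
As a quick sanity check on that lane-rate comparison (assuming PCIe Gen6's raw per-lane rate of 64 GT/s), a one-line calculation:

```python
# Rough per-lane signaling-rate comparison (raw rates, ignoring encoding/FEC overhead).
nvlink5_lane_gbps = 200    # NVLink 5.0 per-lane rate cited above
pcie_gen6_lane_gtps = 64   # PCIe Gen6 raw per-lane rate (64 GT/s, PAM4) -- assumed here

print(f"NVLink 5.0 vs PCIe Gen6 per lane: {nvlink5_lane_gbps / pcie_gen6_lane_gtps:.1f}x")
# -> about 3.1x, consistent with the "three times" figure above
```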

The scale-up fabric in distributed training/inference workloads primarily carries high-bandwidth tensor-parallel traffic: the partial matrix-multiplication results exchanged between GPUs. Frameworks can pipeline these multiplications or overlap other compute while the results are being communicated. In HPC workloads, another application for scale-up systems, GPU memories are pooled together to form a large unified memory pool. This approach faces challenges like GPU thread stalls due to cache misses when data resides in another GPU. Compilers can mitigate these stalls by overlapping computation with communication using well-known distributed computing techniques, easing latency demands.
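
A minimal sketch of that overlap idea, using PyTorch's asynchronous collectives (the framework, function names, and tensor shapes are illustrative, not tied to any UAL or UEC mechanism):

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called by the launcher.
def overlapped_tensor_parallel_step(partial_result: torch.Tensor,
                                    next_input: torch.Tensor,
                                    weight_shard: torch.Tensor) -> torch.Tensor:
    """Overlap the all-reduce of one partial matmul result with the next matmul."""
    # Start reducing the previous partial result without blocking the GPU.
    work = dist.all_reduce(partial_result, op=dist.ReduceOp.SUM, async_op=True)

    # While the fabric moves data, compute the next shard's partial product.
    next_partial = next_input @ weight_shard

    # Block only when the reduced result is actually needed downstream.
    work.wait()
    return next_partial
```

The more bandwidth the fabric offers, the more of the transfer can hide behind the matmul, which is why bandwidth matters more than raw latency for this traffic.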

The UAL team realized the importance of prioritizing bandwidth over latency and enabled Gen6 operation in extended mode at 128 Gbps, with higher error rates that require stronger FEC. Rumor has it that the UAL team is now shifting away from the PCIe approach toward Ethernet-style physical links to compete with Nvidia's 200 Gbps lane speeds. Ethernet-style SerDes, with longer reach and high bandwidth enabled by PAM4 (and potentially PAM6 for 400G+ SerDes), require robust forward error correction to address higher error rates, which adds latency. However, the increased bandwidth allows more accelerators to connect to a single switch and enables interconnects to span racks using copper cables, as Nvidia demonstrated at GTC 2024 with a 72-GPU system that uses copper cables to connect the servers for scale-up.
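
A rough back-of-envelope on that bandwidth-versus-latency trade, with illustrative numbers (the FEC latency figure is an assumed placeholder, not a measured value):

```python
# Serialization time of a 256 B transfer on a single lane, plus an assumed
# fixed FEC latency for the Ethernet-style link.
FLIT_BITS = 256 * 8

def serialization_ns(lane_gbps: float) -> float:
    return FLIT_BITS / lane_gbps   # bits / (Gbit/s) gives nanoseconds

pcie_style_ns = serialization_ns(128)    # PCIe-style extended-mode lane
eth_style_ns = serialization_ns(200)     # Ethernet-style lane
assumed_fec_ns = 100                     # placeholder for RS-FEC block latency

print(f"128 Gbps lane: {pcie_style_ns:.1f} ns on the wire")
print(f"200 Gbps lane: {eth_style_ns:.1f} ns on the wire + ~{assumed_fec_ns} ns FEC (assumed)")
```

The few nanoseconds saved in serialization are dwarfed by the FEC latency; the trade is a fixed latency tax in exchange for bandwidth, reach, and switch radix.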

This raises the question: why not have a unified scale-up and scale-out mechanism? Can mechanisms developed under the UEC be leveraged for scale-up networks? UEC's new transport protocol runs on Ethernet/IP, but scale-up switching doesn't require IP routing; the routing overhead (about 66B in total, including the transport protocol header) is prohibitive for scale-up traffic, which mainly involves memory read/write and atomic operations. While UEC is considering a compressed header of around 50 bytes for HPC workloads, the overhead is still large. Moreover, using standard UEC-compliant Ethernet switches for scale-up switching is inefficient in area and power, as these switches are co-located with high-power GPUs inside the servers and operate under strict power constraints.
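
To make that overhead concrete, a small worked calculation using the header sizes quoted above (the payload sizes are illustrative):

```python
# Payload efficiency of memory operations under different header overheads.
def efficiency(payload_bytes: int, header_bytes: int) -> float:
    return payload_bytes / (payload_bytes + header_bytes)

for payload in (64, 256):                  # a cache line and a larger flit payload
    full = efficiency(payload, 66)         # Ethernet/IP + UEC transport headers (~66 B)
    compressed = efficiency(payload, 50)   # proposed compressed header (~50 B)
    print(f"{payload:>3} B payload: {full:.0%} with 66 B headers, "
          f"{compressed:.0%} with ~50 B headers")
```

A 64 B read or atomic spends around half the wire on headers either way, which is why the overhead is prohibitive for this traffic class.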

An alternative is to use Ethernet-style SerDes for the physical layer (with robust FEC that can handle higher pre-FEC errors, better equalization techniques, and larger deskew buffers) and with custom transport protocols optimized for memory operations, similar to Nvidia's NVLink protocol. NVLink defines read/write and atomic operations with 64B–256B (up to 1,000B in later generations?) flit transfers, using 16B headers for commands, CRC, and control fields. This results in about 94% efficiency for 256B transfers, compared to 80% efficiency with Ethernet links. CXL has similar semantics for memory operations. Any new protocol would likely adopt similar semantics for flit transfers between GPU memories.
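
The efficiency figures quoted above follow from the same simple model (the 16 B flit overhead and 256 B payload come from the text; the Ethernet case reuses the ~66 B header total):

```python
# Reproducing the quoted link efficiencies for a 256 B transfer.
payload = 256
nvlink_style = payload / (payload + 16)    # 16 B command/CRC/control overhead per flit
ethernet_style = payload / (payload + 66)  # Ethernet/IP + transport headers

print(f"NVLink-style flit: {nvlink_style:.0%}")   # ~94%
print(f"Ethernet framing:  {ethernet_style:.0%}") # ~80%
```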

In addition to its technology lead in scale-up fabric, Nvidia's advantage lies in a unified software framework and APIs extending from scale-up to scale-out, such as the Scalable Hierarchical Aggregation and Reduction Protocol (SHARP). The primary goal of SHARP (version 3 is the latest) is to offload and accelerate complex collective operations directly within the scale-up and scale-out network, reducing the amount of data that needs to be transferred over the network and thus decreasing the overall communication time. SHARP is supported in NVLink switches and in the scale-out Quantum InfiniBand switches, and Nvidia may soon add support for it in its Ethernet switches as well.
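
A toy model of the in-network reduction idea (not SHARP's actual protocol, just the data-flow concept of reducing inside the switch instead of bouncing data between GPUs):

```python
import numpy as np

def switch_side_allreduce(contributions: list[np.ndarray]) -> list[np.ndarray]:
    """Each GPU sends its buffer to the switch once; the switch reduces the
    buffers and multicasts a single result back to every downstream port."""
    reduced = np.sum(contributions, axis=0)          # reduction runs in the switch
    return [reduced.copy() for _ in contributions]   # one copy per GPU

# Four "GPUs" contribute gradient shards; each shard crosses the fabric exactly
# once in each direction, rather than circulating over multiple ring steps.
gpus = [np.full(4, fill_value=i, dtype=np.float32) for i in range(4)]
results = switch_side_allreduce(gpus)
assert all(np.array_equal(r, np.full(4, 6.0, dtype=np.float32)) for r in results)
```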

UEC is developing an In-Network Collectives (INC) specification as an alternative, and UAL may need to define one as well. Having both under the same umbrella would enable unified software APIs for INC and other functions, leveraging similar components across scale-up and scale-out networks. Some UEC hardware features under the HPC profile (link-level retry, credit-based data transmission, and encryption/decryption standards) could also be leveraged for scale-up.
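
As an illustration of one such shareable mechanism, here is a minimal sketch of credit-based link-level flow control (a generic model, not the UEC specification's actual scheme):

```python
from collections import deque

class CreditedLink:
    """Minimal credit-based flow control: the sender may only transmit while it
    holds credits; the receiver returns a credit as each buffer slot frees up."""

    def __init__(self, receiver_buffer_slots: int):
        self.credits = receiver_buffer_slots   # advertised at link bring-up
        self.rx_queue: deque = deque()

    def send(self, flit) -> bool:
        if self.credits == 0:
            return False                       # back-pressure: wait for credits
        self.credits -= 1
        self.rx_queue.append(flit)
        return True

    def receiver_drain(self):
        flit = self.rx_queue.popleft()         # receiver frees one buffer slot
        self.credits += 1                      # credit returned to the sender
        return flit

link = CreditedLink(receiver_buffer_slots=2)
assert link.send("flit-0") and link.send("flit-1")
assert not link.send("flit-2")      # no credits left until the receiver drains
link.receiver_drain()
assert link.send("flit-2")          # credit returned, transmission resumes
```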

Broadcom left UAL, and rumor has it that it did not agree with the initial approach of using PCIe-style interconnects. Now that UAL's direction may have changed, will Broadcom rejoin, or will it develop a scale-up specification of its own, either independently or through UEC? If the latter were to happen, the scale-up space would remain fragmented, delaying widespread adoption.

Competing with Nvidia's integrated approach becomes challenging without industry alignment on scale-up and scale-out specifications. It would benefit both consortiums to either merge into a single consortium or collaborate and accelerate the release of open standards for scale-up and scale-out systems.

It is high time the initial UEC/UAL drafts were released for broader review and discussion!


Disclaimer

I am a Juniper employee, but the opinions expressed here are my own and do not necessarily reflect those of my employer.


Merging consortiums can lead to innovative solutions and faster advancements in technology. It would be interesting to examine the potential synergies and benefits of combining the expertise of UEC and UAL. What specific challenges do you think this merger could address in the networking space?

Siamak Tavallaei

Sr Principal Engineer, System Architecture, Samsung; Ex-President & Advisor to the Board, CXL Consortium; Steering Committee, OCP; Chief Systems Architect, Ex-Google; Ex-Microsoft/Azure; Ex-HP/Compaq

4 days ago

The scale-up and scale-out delineation in this context circles around radix (port count), latency, and "word" size. We can draw corollaries from Little's Law to analyze the consequences. The scale-up world, built on "native" xPU interconnect, would like to reduce latency (bit rate, distance, queuing/pinching delays, congestion delays due to transaction blocking, switch hop count, etc.), and therefore favors "many links" for direct point-to-point links (with modest bandwidth); while the scale-out world uses multiple interconnecting switches to help move data among many more xPU elements. Switch-to-switch links (ISL) tend to be "fat pipes" with much more bandwidth albeit longer latency (because we run out of ports for direct PtP links)! https://en.wikipedia.org/wiki/Little%27s_law

Robert Hormuth

Corporate Vice President, Architecture and Strategy, Data Center Solutions Group at AMD

1 week ago

There is an old saying, "never bet against Ethernet", but that statement is missing a disclaimer: "in a packet world". Ethernet is used to move a packet from A to B, not operate on it. CPUs/GPUs operate in a Load/Store world, actually operating on data, not just passing it along. There is a reason we don't use Ethernet to attach local DRAM. So, "never bet against Ethernet in a packet world" and "never overload the packet world with memory semantics". One world lives in the nanosecond world and the other is trying to stay in the microsecond world. Google put out a great paper, "Attack of the Killer Microseconds", that is worth a read. https://research.google/pubs/attack-of-the-killer-microseconds/ In net, UEC and UALink address totally different problems. Trying to overload one or the other makes them optimal for neither.

Xingjun (Jason) Chu

Ph.D Distinguished Engineer at Huawei Canada

1 week ago

Great post! Thanks Sharada, keep them coming!


Don't know about merging consortia, but it would be nice to have a single common first layer of interconnect topology, at least physically, instead of separate ones like we have now. Should be possible even while keeping different semantics - e.g. by defining tunneling of UEC traffic over UAL and attaching UEC uplinks to UAL switches. This would allow any accelerator to use any/all UEC uplinks from UAL domain to reach any other accelerator in the cluster even with disjoint rails/planes - just like in the normal multi-stage interconnect. Essentially rail-only done right.
