RDMA: Ethernet's Leap Forward
RDMA is increasingly the transport protocol of choice for many workloads across AI Training/Inference, HPC Networks, Storage and a wide variety of East-West workloads across the datacenter, service provider and enterprise markets. In fact, Microsoft Azure runs 70% of its datacenter traffic (bits and packets) on RDMA RoCEv2 over commodity Ethernet (NSDI '23 - Empowering Azure Storage with RDMA). This shift is accelerating across the landscape. For context, review my previous article: RDMA Network Trends.
An important note (thanks, Shrijeet Mukherjee) - both verbs-based protocols like RoCEv2 and sockets-based protocols like TCP and QUIC can and will continue to co-exist on the same network. The Ethernet evolution ahead needs to account for both paradigms co-existing.
This time, let's delve into key solutions shaping the next generation of Ethernet RDMA. Notably, every single technique discussed here has been implemented and deployed in production, using either custom Ethernet-based protocols or other protocols such as InfiniBand or HPE Slingshot.
None of these techniques are brand new, which is a reason for optimism. Ethernet is great at adapting rapidly :-)
The 8 techniques are listed below as problem-solution pairs.
1. Increasing Entropy
Problem Space:
Networks often suffer from the digital equivalent of a traffic jam: certain paths are heavily used while others remain idle, akin to cars crowding one lane of a freeway while the others are empty. This inefficiency stems from low entropy in the distribution of network traffic. Entropy, in networking, is the measure of randomness with which "flows" are distributed across the available paths. High entropy ensures a more balanced and efficient use of network resources, reducing the likelihood of congestion and bottlenecks and increasing overall network utilization.
Equal-Cost Multi-Path routing (ECMP), which hashes the 5-tuple that uniquely identifies a TCP/UDP flow, is commonly employed to distribute flows across multiple paths. However, ECMP's effectiveness is limited in scenarios featuring dynamic traffic patterns or networks with varying path costs.
Solution Space:
Boosting entropy is vital for increasing network utilization, particularly in complex and high-radix networks, ensuring no single path becomes overly congested.
Many of these entropy-related techniques can be applied in both Switches and Network Adapters.
A good reference: "To spray or not to spray: Validation" (from Dmitry Shokarev, Principal PLM at Juniper Networks) discusses the performance improvements delivered by packet spraying followed by re-ordering at the receiving Network Adapter.
The adoption of techniques to increase entropy, coupled with re-ordering logic at the receiving network adapter, play a key role in optimizing network utilization and performance.
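As a concrete illustration (not any vendor's implementation - the addresses, port ranges and path count below are made up), this sketch contrasts classic 5-tuple ECMP hashing, where a few long-lived flows can collide on a few paths, with per-packet spraying that varies the UDP source port so every packet re-rolls the hash:

```python
import hashlib
from collections import Counter

NUM_PATHS = 8  # equal-cost paths between a pair of leaf switches

def ecmp_path(src_ip, dst_ip, proto, src_port, dst_port):
    """Hash the 5-tuple to pick one of NUM_PATHS equal-cost paths."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % NUM_PATHS

# Classic ECMP: a handful of long-lived flows hash to fixed paths and
# can easily collide, leaving other paths idle (low entropy).
flows = [("10.0.0.1", "10.0.1.1", 17, 4791, 5000 + i) for i in range(4)]
flow_spread = Counter(ecmp_path(*f) for f in flows)

# Packet spraying: vary the UDP source port per packet, so the load
# spreads near-uniformly across all paths (high entropy).
spray_spread = Counter(
    ecmp_path("10.0.0.1", "10.0.1.1", 17, 49152 + pkt, 4791)
    for pkt in range(10_000)
)
```

With spraying, packets of one flow take different paths and can arrive out of order, which is exactly why the receiving Network Adapter needs re-ordering (or flexible-ordering) logic, as discussed next.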
2. Flexible Delivery Order - leveraging application entropy
Problem Space:
In RDMA networks, particularly those utilizing RoCEv2, packet delivery adheres to strict ordering. While this approach ensures data integrity, many applications do not always need such strict ordering.
For example, AI/ML training computations often involve transmitting model weights, which do not require strict packet ordering. When ordering is enforced for applications that do not require it, entropy is reduced, which reduces the network efficiency.
Solution Space:
Introducing flexibility in packet delivery order within RDMA networks can significantly increase "application" entropy, leading to marked improvements in both network tail latencies and overall throughput.
Such flexibility improves network utilization. In the realm of AI workloads, which frequently involve collective operations like All-Reduce and All-to-All, the ability to flexibly order packet delivery accelerates the completion of tasks by enabling "parallel" data transfers without the need to reorder packets prior to application processing.
Supporting modern APIs with flexible ordering semantics is key to reducing tail latencies and enhancing overall fabric efficiency. By allowing packets to be processed in parallel when the application permits, these APIs enable efficient network communication.
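To make the idea concrete, here is a minimal hypothetical sketch (the class and packet sizes are illustrative, not any real API) of a receiver that accepts an RDMA message's packets in any arrival order, writing each payload straight to its offset instead of holding packets in a reorder queue:

```python
import random

class UnorderedReceiver:
    """Accept an N-packet RDMA message in any arrival order.

    Each payload is written directly at its offset in the destination
    buffer (no reorder queue); a set of sequence numbers tracks only
    *message completion*, not ordering.
    """
    def __init__(self, num_packets, packet_size):
        self.buf = bytearray(num_packets * packet_size)
        self.packet_size = packet_size
        self.num_packets = num_packets
        self.received = set()

    def on_packet(self, seq, payload):
        off = seq * self.packet_size
        self.buf[off:off + len(payload)] = payload  # place, don't queue
        self.received.add(seq)
        return len(self.received) == self.num_packets  # whole message landed?

# Packets sprayed across multiple paths arrive in a random order...
rx = UnorderedReceiver(num_packets=4, packet_size=4)
order = list(range(4))
random.shuffle(order)
for seq in order:
    complete = rx.on_packet(seq, bytes([seq]) * 4)
# ...yet the buffer ends up correct without any reordering step.
```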
3. Congestion Management Techniques
Problem Space:
Even in fully provisioned networks, both the switch fabric and the end-points can get congested because of the traffic patterns. This congestion causes packet losses, which require packet retransmissions, resulting in increased latency and reduced fabric efficiency.
Congestion in Ethernet environments typically manifests in three forms:
Outcast Congestion: This is caused by multiple senders attempting to transmit packets over the same outbound end-point link.
Fabric Congestion: This typically happens due to low entropy (i.e., not enough freedom or randomness in the path selection algorithms), causing some links to be fully utilized while others sit idle. This is akin to a freeway with one lane jammed while the others are empty.
In-cast Congestion: This occurs when many senders concurrently transmit to the same receiver, overwhelming the receiver.
Solution Space:
Handling Outcast Congestion: This is the easiest of the three congestion scenarios to handle, since all the information is local to the node, which can schedule and pace its own flows across the outbound link.
Handling Fabric Congestion: Multi-pathing is the key technique that reduces fabric congestion, increasing the utilization of the switch fabric.
Handling In-cast Congestion: This is the hardest scenario, since the contending senders are distributed; it requires the end-to-end congestion control techniques discussed below.
Deeper Look into Congestion Control
There is a range of congestion control (CC) algorithms available to mitigate these congestion scenarios, spanning both sender-driven and receiver-driven schemes. Additionally, it's feasible to employ both sets of CC techniques concurrently, thereby equipping the system to handle a diverse set of congestion scenarios effectively.
Switches and Network Adapters signal congestion notifications (e.g., ECN) in-band using extended packet headers - these signals can be sent in the forward direction to the receiver or in the backward direction to the sender. INT CC extended headers deliver detailed packet time-stamps and other telemetry measurements. Google Falcon uses CSIG (IETF Draft), which attaches fixed-length summaries in a compact "compare-and-replace" manner along the path.
Some switches employ the Back to Sender (BTS) approach, where the congested switch returns a notification directly to the sender, shortening the feedback loop.
(Thanks to Jeff Tantsura for comments on INT CC, CSIG and BTS)
This telemetry plays a vital role in congestion management and enhances fabric observability. Telemetry also aids in troubleshooting performance issues, enabling rapid fault recovery, and facilitating efficient debugging. By providing a holistic view of the network, telemetry offers operators the ability to monitor, analyze, and respond to network conditions proactively.
An important point: we need to get the balance between software and hardware right for the "targeted" use cases.
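As a toy illustration of ECN-driven congestion control, the sketch below follows the general shape of DCQCN-style rate control - multiplicative decrease when ACKs carry ECN marks, gradual recovery otherwise. The constants, starting state, and class name are illustrative, not values from any spec or product:

```python
class EcnRateController:
    """Toy ECN-driven sender pacing, loosely DCQCN-shaped.

    alpha tracks the recent fraction of ECN-marked ACKs; marked ACKs cut
    the rate multiplicatively, unmarked ACKs recover toward a target
    rate that creeps back up additively.
    """
    def __init__(self, line_rate_gbps=100.0):
        self.line_rate = line_rate_gbps
        self.rate = line_rate_gbps
        self.target = line_rate_gbps
        self.alpha = 1.0  # assume "fully congested" until ACKs say otherwise

    def on_ack(self, ecn_marked):
        g = 1 / 16  # EWMA gain for the marked fraction
        self.alpha = (1 - g) * self.alpha + (g if ecn_marked else 0.0)
        if ecn_marked:
            self.target = self.rate
            self.rate *= 1 - self.alpha / 2  # multiplicative decrease
        else:
            self.target = min(self.target + 0.5, self.line_rate)  # additive increase
            self.rate = min((self.rate + self.target) / 2, self.line_rate)

cc = EcnRateController()
for _ in range(8):             # a burst of ECN marks: back off hard
    cc.on_ack(ecn_marked=True)
congested_rate = cc.rate
for _ in range(50):            # marks clear: climb back gradually
    cc.on_ack(ecn_marked=False)
```

The asymmetry - cutting fast on congestion signals and recovering slowly - is what keeps switch queues shallow while still reclaiming bandwidth once the congestion episode passes.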
4. Edge-Queueing for Congestion Management
Problem Space:
Large buffers in network switches can be counter-productive, particularly in high-speed networks. They not only introduce delays but also obscure the detection of congestion within the switch fabric. This lack of visibility hampers timely congestion response, leading to increased packet drops, higher network latencies, and overall reduced network efficiency.
Solution Space:
Implementing shallow buffer strategies in network switches is crucial to mitigate these issues.
See the EQDS paper, which relocates most of the queuing from the core network to the leaf switch or the sending host's Network Adapter. EQDS uses shallow buffers in the switch fabric to help reduce latency and significantly enhance the visibility of congestion patterns across the fabric. This approach also facilitates the co-existence and evolution of multiple protocols, such as TCP and RoCEv2, on the same network.
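A minimal sketch of the edge-queueing idea, assuming a hypothetical sender-side shim (not the EQDS implementation itself): packets wait in a deep queue at the sending edge and enter the shallow-buffered fabric only when the receiver grants credits at its own drain rate:

```python
from collections import deque

class EdgeQueuedSender:
    """Packets wait in a deep queue at the sending edge; the
    shallow-buffered fabric only carries packets the receiver has
    granted credits for."""
    def __init__(self):
        self.edge_queue = deque()  # deep buffering lives here, not in switches
        self.credits = 0

    def send(self, pkt):
        self.edge_queue.append(pkt)  # queue at the edge

    def grant(self, n):
        self.credits += n            # receiver paces grants at its drain rate

    def pull(self):
        """Release up to `credits` packets into the fabric."""
        out = []
        while self.credits > 0 and self.edge_queue:
            out.append(self.edge_queue.popleft())
            self.credits -= 1
        return out

s = EdgeQueuedSender()
for i in range(5):
    s.send(f"pkt{i}")
s.grant(2)            # receiver can absorb two packets right now
in_fabric = s.pull()  # two packets released; three still wait at the edge
```

Because the backlog sits at the edge rather than in switch buffers, fabric queues stay shallow and congestion shows up where it can actually be acted upon - at the end-points.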
The next 3 techniques manage packet loss and retransmission.
5. Selective Retransmission
Problem Space:
Unlike InfiniBand, Ethernet networks are lossy by nature; packets may get dropped frequently during congestion.
The congestion control techniques above minimize packet losses but typically cannot eliminate them, even when the network is under-subscribed. Scheduling across a distributed fabric is extremely difficult to control, and bursts of congestion do occur "temporally" even in under-subscribed networks.
In RoCEv2 RDMA networks, the conventional mechanism for addressing packet loss is the "Go-Back-N" strategy. This method requires the sender to retransmit all packets from the point of the lost packet onwards, even if subsequent packets were received successfully. While simple in concept, this approach is inefficient: it wastes bandwidth by retransmitting packets already received and increases overall network latency.
Solution Space:
Selective retransmission is a significant enhancement over the "Go-Back-N" strategy: only the lost packets are retransmitted. It employs Selective Acknowledgments (SACKs), which allow the receiver to specify exactly which packets have been successfully received. With this precise feedback, the sender needs only to retransmit the packets that were actually lost, dramatically reducing packet retransmissions.
By ensuring only the necessary packets are retransmitted, SACK-based selective retransmission is a critical technique for improving tail latency and network efficiency.
This selective retransmission technique is much more effective with shallow-buffer switches (see the EQDS paper above), which provide a faster feedback loop from transmitter to receiver and back. With deep buffers, the end-points see long delays between lost packets and the retransmits.
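The difference from Go-Back-N is easy to see in a few lines. In this hypothetical sketch, the receiver's SACK reports the exact set of sequence numbers it holds, and the sender retransmits only the holes:

```python
def sack_holes(sacked, highest_sent):
    """Return only the missing sequence numbers, given the set of
    sequences the receiver reports as held (its SACK) and the highest
    sequence the sender has transmitted."""
    return [seq for seq in range(highest_sent + 1) if seq not in sacked]

received = {0, 1, 2, 4, 5, 7}                     # packets 3 and 6 were dropped
selective = sack_holes(received, highest_sent=7)  # retransmit just the holes
go_back_n = list(range(3, 8))                     # Go-Back-N replays 3..7 instead
```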
6. Packet Trimming
Problem Space:
Packet loss detection traditionally relies on timeouts, which are too slow for modern high-speed networks. These timeouts, designed to cover worst-case tail delays, are invariably lengthy, leading to delays and inefficiencies.
Solution Space:
An innovative and more efficient strategy is the implementation of receiver-driven feedback for packet trimming.
In scenarios where a switch's queue overflows, instead of dropping packets entirely, the switch trims the packet payload, forwarding only the packet header to the intended destination. This method is advantageous as the payload is usually much larger than the header, thereby significantly reducing bandwidth overhead.
The receiving Network Adapter, upon receiving the trimmed packet header, promptly issues a "Not Acknowledged" (NACK) signal back to the sender, along with additional congestion data enabling rapid sender feedback. Furthermore, the receiver can allocate "explicit" end-to-end credits for the retransmission of the packet.
In response to a trimmed packet NACK, the sender can reprioritize the retransmission of packets, often using a higher-priority traffic class (TC). This prioritization helps the retransmitted packets avoid being delayed by lower-priority congested queues. If the receiver provided end-to-end credits, then the receiving Network Adapter reserves "buffer space" to receive the retransmitted packet.
In summary, trimming packets and receiver-driven packet credit systems are great at prioritizing retransmitted packets and reserving space in receiver buffers, limiting further loss of retransmitted packets, thus improving tail latencies.
The BTS mechanism discussed earlier can serve as an alternative to packet trimming and receiver-driven credits. BTS has the advantage of signaling congestion earlier, from the switch congestion point itself rather than from the far-end receiver.
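A simplified model of a trimming egress queue (the 64-byte header size, capacities, and function names are illustrative assumptions): when a full packet no longer fits, the switch enqueues just the header, and the receiver NACKs the trimmed sequence numbers so the sender learns about the loss immediately rather than via a timeout:

```python
HEADER_BYTES = 64  # illustrative header size kept when a packet is trimmed

class TrimmingQueue:
    """Switch egress queue that trims payloads instead of dropping packets."""
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.queue = []  # (seq, bytes_enqueued, was_trimmed)

    def enqueue(self, seq, size):
        if self.used + size <= self.capacity:
            self.queue.append((seq, size, False))         # full packet fits
            self.used += size
        elif self.used + HEADER_BYTES <= self.capacity:
            self.queue.append((seq, HEADER_BYTES, True))  # keep header only
            self.used += HEADER_BYTES
        # else: even the header doesn't fit - drop (rare once trimming kicks in)

def nacks_for(q):
    """Receiver side: NACK every trimmed sequence so the sender can
    retransmit it promptly, typically on a higher-priority traffic class."""
    return [seq for seq, _, trimmed in q.queue if trimmed]

q = TrimmingQueue(capacity_bytes=9000)
for seq in range(3):
    q.enqueue(seq, size=4096)  # third packet overflows: trimmed, not dropped
to_retransmit = nacks_for(q)
```

Note how cheap the congestion signal is: the trimmed header consumes 64 bytes of queue space instead of 4 KB, yet still tells the receiver exactly which packet to NACK.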
7. Link Layer Retransmissions
Problem Space:
As network speeds increase, frame loss and errors at the link layer are becoming a bigger issue.
Lost frames trigger packet retransmissions; packet loss is typically addressed through end-to-end retransmissions, a method that tends to have fabric-wide impact and a longer feedback loop.
Solution Space:
Link Layer Retransmission (LLR) is a promising solution borrowed from the HPC space.
LLR operates on the principle of Negative Acknowledgements (NAKs) sent through the reverse channel when bit errors are detected. The transmitting device temporarily stores frames that are in transit (i.e., unacknowledged) in a local buffer. Upon receiving a NAK, it can rapidly retransmit the specific frame(s) in question.
Since retransmission is limited to the two nodes directly connected, it doesn't impact the fabric. This "localized" approach to error correction improves response time, and reduces the need for end-to-end retransmissions.
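A minimal sketch of LLR's local replay buffer (the interfaces are hypothetical): unacknowledged frames are held per link, and a NAK triggers a replay that never leaves the two directly connected devices:

```python
class LlrLink:
    """Per-link replay buffer for link layer retransmission.

    Frames stay buffered on the transmitting device until the directly
    connected peer acknowledges them; a NAK replays locally, so the
    retransmission never crosses the rest of the fabric.
    """
    def __init__(self):
        self.replay = {}  # frame seq -> frame bytes, awaiting link-level ack

    def tx(self, seq, frame):
        self.replay[seq] = frame  # hold a copy until the peer acks
        return frame

    def on_ack(self, seq):
        self.replay.pop(seq, None)  # peer received it; free the slot

    def on_nak(self, seq):
        return self.replay[seq]  # bit error on the wire: replay locally

link = LlrLink()
for seq, frame in enumerate([b"frame-a", b"frame-b", b"frame-c"]):
    link.tx(seq, frame)
link.on_ack(0)
link.on_ack(2)
replayed = link.on_nak(1)  # only the corrupted frame is re-sent
```

The replay buffer only needs to cover one link round-trip worth of frames, which is why LLR stays cheap even at very high link speeds.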
8. Encryption and Multi-Tenant Isolation
Problem Space:
Security is a mandatory requirement across all multi-tenant networks in the cloud datacenter, and also in the enterprise and telecom markets.
Since encryption adds overhead, the performance-sensitive HPC and AI Training backend networks typically run without encryption and instead rely on datacenter physical security. For these applications, encryption can be treated as an optional requirement.
Security is a common requirement across all transport protocols such as RoCEv2, TCP and QUIC. That said, RoCEv2 RDMA in particular has several specific security issues that must be fixed.
Solution Space:
Implementing encryption and ensuring multi-tenant isolation requires a nuanced approach, since security/isolation and high network performance are usually at odds.
TLS and IPsec are two widely used encryption protocols for securing modern networks, while isolation across tenants is achieved using virtualization. These encryption protocols must be accelerated in hardware within Network Adapters for performance reasons in 100G+ networks. Virtualization adds another layer of complexity.
Notably, Google has open-sourced the hardware-friendly PSP protocol, which is part of the open-sourced Falcon protocol.
Security is a particularly specialized topic - the realm of security experts - and I'll look to learn from them.
Final Notes
Summarizing the key techniques in the solution space
We explored a number of problem-solution pairs to improve modern RDMA networks. Many of these techniques have been borrowed from custom Ethernet implementations or from adjacent non-Ethernet standards. The main job ahead of us is to adapt these "known" techniques to the open Ethernet protocol stack.
What major problem-solution pairs have I missed in this write-up?
It's noteworthy that two modern transport protocols - the UEC transport and Google Falcon - solve many of the {problem-solution pairs} described in this article. Both UEC and Falcon are valuable assets for on-going research and development in RDMA-based transport protocols.
As we continue to innovate, let's hope that a single "standard" addresses most of the common use cases. The thinking process should rely on first principles and stay simple; otherwise, the actual implementations get very complex.
The RDMA landscape is rapidly transforming. The rise of AI is driving RDMA Ethernet innovation like a rocket ship!
Links:
See the links section in RDMA Network Trends. Additional new links
Glossary:
See Glossary section in RDMA Network Trends. Additional new glossary
Disclaimer: The views and opinions articulated in this article solely represent the author's perspective and do not necessarily reflect the official stance of my employer or any other affiliated organization.
#RDMA, #Ethernet, #Networking, #TechInnovation