RDMA: Ethernet's Leap Forward

RDMA is increasingly the transport of choice for many workloads across AI training/inference, HPC networks, storage, and a wide variety of East-West workloads in the datacenter, service provider, and enterprise markets. In fact, Microsoft Azure runs 70% of its datacenter traffic (bits and packets) over RoCEv2 on commodity Ethernet (NSDI '23 - Empowering Azure Storage with RDMA). This shift is accelerating across the landscape. For context, review my previous article: RDMA Network Trends.


An important note (thanks, Shrijeet Mukherjee): both verbs-based protocols like RoCEv2 and sockets-based protocols like TCP and QUIC can and will continue to co-exist on the same network. The Ethernet evolution ahead needs to account for these two paradigms co-existing.


This time, let's delve into key solutions shaping the next generation of Ethernet RDMA. Notably, every single technique discussed here has already been implemented and deployed in production, either in custom Ethernet-based protocols or in other protocols such as InfiniBand or HPE Slingshot.

None of these techniques are brand new, which is a reason for optimism. Ethernet is great at adapting rapidly :-)

The 8 techniques are listed below as problem-solution pairs.


1. Increasing Entropy

Problem Space:

Networks often suffer from the digital equivalent of a traffic jam: some paths are heavily used while others sit idle, akin to cars crowding one lane of a freeway while the other lanes are empty. This inefficiency stems from low entropy in the distribution of network traffic. Entropy, in networking, measures the randomness with which flows are distributed across the available paths.

High entropy spreads network "flows" more evenly across the available paths, ensuring balanced and efficient use of network resources, reducing the likelihood of congestion and bottlenecks, and increasing overall network utilization.

Equal-Cost Multi-Path routing (ECMP), which hashes the 5-tuple that uniquely identifies a TCP/UDP flow, is commonly employed to distribute flows across multiple paths. However, ECMP's effectiveness is limited in scenarios featuring dynamic traffic patterns or networks with varying path costs.
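
To make this concrete, below is a minimal Python sketch of 5-tuple based ECMP path selection. It is illustrative only: real switches compute a hardware hash (often CRC-based) over the same header fields rather than calling a software hash function.

```python
# Minimal sketch of 5-tuple ECMP path selection (illustrative only).
import hashlib

def ecmp_path(src_ip, src_port, dst_ip, dst_port, proto, num_paths):
    """Map a flow's 5-tuple to one of num_paths equal-cost paths."""
    key = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    # Every packet of a flow carries the same 5-tuple, so the flow is
    # pinned to one path: ordering is preserved, but entropy is limited.
    return int.from_bytes(digest[:4], "big") % num_paths

# UDP destination port 4791 is the RoCEv2 port.
print(ecmp_path("10.0.0.1", 49152, "10.0.0.2", 4791, "UDP", num_paths=8))
```

Because the mapping is per flow, a few large "elephant" flows can still collide on one path; the techniques below add entropy beyond the basic 5-tuple.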

Solution Space:

Boosting entropy is vital for increasing network utilization, particularly in complex and high-radix networks, ensuring no single path becomes overly congested.

Many of these entropy-related techniques can be applied in both switches and network adapters.

  • Multi-Pathing: Utilizing multiple paths for data transmission increases entropy and prevents dependence on a single path. Incorporating additional L4+ fields into the ECMP entropy computation is common in silicon chips today. The fields used for the computation can vary based on the application ordering requirements.
  • Dynamic Load Balancing: This technique adjusts the distribution of network traffic in real-time based on current network conditions, effectively balancing the load across multiple paths. For example, Nvidia's Spectrum-X switches utilize adaptive routing techniques that dynamically adjust traffic paths based on real-time network conditions.
  • Packet Spraying in Switches: Packet spraying evenly distributes packets across available links within the switch fabric. Unlike ECMP, which operates on a flow basis, packet spraying deals with individual packets, dramatically increasing entropy and evenly distributing load across either all or a configured set of fabric paths.
  • Packet Re-ordering Capability at the Receiving Network Adapter: Packet spraying and dynamic load balancing often lead to packets arriving out of order at the receiving network adapter, which must then put these packets back in the correct order. This re-ordering logic requires hardware assist to keep up with packet rates; a minimal sketch follows this list.
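
To illustrate the re-ordering step named in the last bullet, here is a simplified software model. Production network adapters implement this reorder buffer in hardware; all names below are hypothetical.

```python
# Hedged sketch of receive-side re-ordering after packet spraying.

def deliver_in_order(packets):
    """Re-order sprayed packets by sequence number before delivery."""
    expected = 0
    stash = {}              # out-of-order packets held until the gap fills
    delivered = []
    for seq, payload in packets:    # arrival order, possibly scrambled
        stash[seq] = payload
        while expected in stash:    # drain every contiguous run
            delivered.append(stash.pop(expected))
            expected += 1
    return delivered

# Packets 0..4 sprayed across four paths arrive scrambled:
arrivals = [(2, "c"), (0, "a"), (1, "b"), (4, "e"), (3, "d")]
print(deliver_in_order(arrivals))   # ['a', 'b', 'c', 'd', 'e']
```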

A good reference, To spray or not to spray: Validation (from Dmitry Shokarev, Principal PLM at Juniper Networks), discusses the performance improvements delivered by packet spraying followed by re-ordering at the receiving network adapter.

The adoption of techniques to increase entropy, coupled with re-ordering logic at the receiving network adapter, plays a key role in optimizing network utilization and performance.


2. Flexible Delivery Order - leveraging application entropy

Problem Space:

In RDMA networks, particularly those utilizing RoCEv2, packet delivery adheres to strict ordering. While this approach ensures data integrity, many applications do not always need such strict ordering.

For example, AI/ML training computations often involve transmitting model weights, which do not require strict packet ordering. When ordering is enforced for applications that do not require it, entropy is reduced, which reduces the network efficiency.

Solution Space:

Introducing flexibility in packet delivery order within RDMA networks can significantly increase "application" entropy, leading to marked improvements in both network tail latencies and overall throughput.

Such flexibility improves network utilization. In the realm of AI workloads, which frequently involve collective operations like All-Reduce and All-to-All, the ability to flexibly order packet delivery accelerates the completion of tasks by enabling "parallel" data transfers without the need to reorder packets prior to application processing.

Supporting modern APIs with flexible ordering semantics is key to reducing tail latencies and enhancing overall fabric efficiency. By allowing packets to be processed in parallel when the application permits, these APIs enable efficient network communication.
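
As a concrete illustration, the sketch below models flexible delivery: because each RDMA write carries its own target offset, payloads can land directly in application memory in whatever order they arrive, and only message completion needs tracking. This is a simplified model with invented names, not any specific vendor's API.

```python
# Hedged sketch of flexible delivery order with completion tracking.

class Message:
    def __init__(self, num_packets):
        self.remaining = set(range(num_packets))  # packets not yet seen
        self.buffer = [None] * num_packets        # application memory

    def on_packet(self, seq, payload):
        self.buffer[seq] = payload     # land directly, no reorder buffer
        self.remaining.discard(seq)
        return not self.remaining      # True once the message is complete

msg = Message(4)
for seq, data in [(3, "d"), (0, "a"), (2, "c"), (1, "b")]:
    done = msg.on_packet(seq, data)
print(done, "".join(msg.buffer))       # True abcd
```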


3. Congestion Management Techniques

Problem Space:

Even in fully provisioned networks, both the switch fabric and the end-points can become congested because of traffic patterns. This congestion causes packet losses, which require retransmissions, resulting in increased latency and reduced fabric efficiency.

Congestion in Ethernet environments typically manifests in three forms:

  • in-cast congestion
  • fabric congestion
  • out-cast congestion.

Diagram illustrating the congestion problem space


Outcast Congestion: Outcast congestion is caused by multiple senders attempting to transmit packets to the same outbound end-point link.

Fabric Congestion: Fabric congestion typically happens due to low entropy (i.e., not enough freedom or randomness in the path selection algorithms), causing some links to be fully utilized while others are idle. This is akin to a freeway with one lane jammed while others are empty.


In-cast Congestion: This occurs when many senders concurrently transmit to the same receiver, overwhelming the receiver.


Solution Space:

Handling Outcast Congestion: This is the easiest of the three congestion scenarios to handle since all information is local to the node.

  • End-point nodes must use scheduling, traffic shaping and packet pacing to provide a "fair" service to all senders and at the same time, attempt to maximize the link utilization.
  • When there are multiple links available on a node, traffic shaping and scheduling algorithms must then distribute the network load evenly across the available links.

Handling Fabric Congestion: Multi-pathing is the key technique that reduces fabric congestion, increasing the utilization of the switch fabric.

  • Start by clearing the digital traffic jam by directing some of the traffic to other lanes; in technical terms, increase the entropy by multi-pathing or spraying the packets across some or all of the switch fabric links.
  • The switches can also locally make "congestion" aware multi-pathing decisions.
  • The schedulers at the senders do not know the switch fabric hotspots. If the switches send telemetry signals back to the sender end-point, the senders can also make dynamic path optimizations.
  • Shallow buffer techniques (see the EQDS paper, written by Correct Networks, later acquired by Broadcom) can help run a tight "congestion loop" between the switch and sender, enabling the sender to respond swiftly to emerging congestion situations.


Handling In-cast Congestion:

  • This is the most difficult scenario to handle since senders don't know the congestion state of the receiver link. The solution here is to deliver the receiver's congestion state to the senders in a speedy manner.
  • The receiver can also provide explicit receiver credits, so the transmitter only sends packets when receiver credits (which map to receiver buffers) are available.
  • Again, shallow buffer techniques come to the rescue here, bounding the fabric delay and ensuring a fast receiver-driven control loop.
  • Such receiver-driven feedback informs the transmitter about the current load and congestion state at the receiver, allowing the transmitter to adjust its sending rate dynamically and ensuring that data is sent at a rate the receiver can handle. A minimal credit-loop sketch follows this list.
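
Below is that sketch of a receiver-driven credit loop. It is deliberately simplified to a single sender/receiver pair with hypothetical names; real implementations track credits per queue pair and pipeline the grant messages.

```python
# Hedged sketch of receiver-driven credits: the sender transmits only
# while it holds credits, each of which maps to a receiver buffer.

class Receiver:
    def __init__(self, buffers):
        self.free = buffers                  # available receive buffers

    def grant(self, wanted):
        granted = min(wanted, self.free)     # never over-commit buffers
        self.free -= granted
        return granted

    def consume(self, n):                    # application drains buffers
        self.free += n                       # ...freeing them for new grants

rx = Receiver(buffers=4)
credits = rx.grant(wanted=10)   # sender asks for 10 credits, gets only 4
print(f"sender may transmit {credits} packets; the receiver cannot be overrun")
rx.consume(credits)             # drained buffers become future credits
```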

Deeper Look into Congestion Control

There are a range of congestion control (CC) algorithms available to mitigate these congestion scenarios.

  • Techniques like ECN (Explicit Congestion Notification) and DCQCN (Data Center Quantized Congestion Notification) adjust packet rates based on congestion signals from endpoints and switches.
  • Protocols like TIMELY utilize end-to-end delay measurements and fine-grained time-stamping to dynamically adapt packet rates, offering an alternate "delay-based" method for managing network congestion; a simplified rate-update sketch appears below.

Courtesy: TIMELY paper
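
The sketch below gives a simplified delay-based rate update in the spirit of TIMELY: the sending rate reacts to the gradient of the measured RTT. The constants and the simple two-sample gradient are illustrative placeholders, not the tuned values or full algorithm from the paper.

```python
# Hedged sketch of delay-gradient rate control (TIMELY-inspired).

def delay_based_update(rate, rtt, prev_rtt, min_rtt=20e-6,
                       add_step=0.1e9, beta=0.8):
    gradient = (rtt - prev_rtt) / min_rtt        # normalized RTT slope
    if gradient <= 0:
        return rate + add_step                   # queues draining: probe up
    return rate * max(1 - beta * gradient, 0.5)  # queues building: back off

rate = 5e9                                       # 5 Gb/s starting rate
for rtt, prev_rtt in [(25e-6, 24e-6), (30e-6, 25e-6), (28e-6, 30e-6)]:
    rate = delay_based_update(rate, rtt, prev_rtt)
    print(f"rate -> {rate / 1e9:.2f} Gb/s")
```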


Additionally, it's feasible to employ both sets of CC techniques concurrently, thereby equipping the system to handle diverse congestion scenarios effectively.

Switches and network adapters signal congestion notifications (ECN) in-band using extended packet headers; these signals can be sent in the forward direction to the receiver or in the backward direction to the sender. INT CC extended headers deliver detailed packet time-stamps and other telemetry measurements. Google's Falcon uses CSIG (IETF draft), which attaches fixed-length summaries in a compact "compare-and-replace" manner along the path.
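
To illustrate the compare-and-replace idea, here is a minimal sketch: each hop compares its own congestion level against the value already carried in the fixed-length header field and keeps the worse one, so the receiver ends up seeing the bottleneck hop's signal. The encoding is invented for illustration and is far simpler than the actual CSIG draft.

```python
# Hedged sketch of compact "compare-and-replace" in-band signaling.

def compare_and_replace(header_signal, local_signal):
    # Keep whichever hop reports the most congestion (higher = worse).
    return max(header_signal, local_signal)

signal = 0                          # sender initializes the header field
for hop_congestion in [2, 7, 3]:    # per-switch congestion along the path
    signal = compare_and_replace(signal, hop_congestion)
print(signal)                       # 7: the bottleneck hop's value survives
```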

Some switches employ the Back to Sender (BTS) approach:

  • BTS is a sub-RTT backward congestion signaling from the switch back to the sending Network Adapter.
  • The switch sends the BTS congestion signal prior to enqueuing the packet in the congested queue, so the delay is bounded by base (pre-congested) RTT.
  • BTS can be an alternative to packet trimming (covered later in #6).

(Thanks to Jeff Tantsura for comments on INT CC, CSIG and BTS)

This telemetry plays a vital role in congestion management and enhances fabric observability. Telemetry also aids in troubleshooting performance issues, enables rapid fault recovery, and facilitates efficient debugging. By providing a holistic view of the network, telemetry gives operators the ability to monitor, analyze, and respond to network conditions proactively.

An important point:

  • CC algorithms are implemented in software with hardware assist. The programmability is required to adapt the algorithms to the needs of the different workloads and the dynamic network conditions.
  • On the other hand, detection and signaling methods for CC Algorithms and telemetry at the packet or flow level must be implemented in hardware since software is too slow to respond relative to packet rates.

We need to get the balance between software and hardware right for the "targeted" use cases.


4. Edge-Queueing for Congestion Management

Problem Space:

Large buffers in network switches can be counter-productive, particularly in high-speed networks. They not only introduce delays but also obscure the detection of congestion within the switch fabric. This lack of visibility hampers timely congestion response, leading to increased packet drops, higher network latencies, and overall reduced network efficiency.

Solution Space:

Implementing shallow buffer strategies in network switches is crucial to mitigate these issues.

Courtesy: EQDS paper. EQDS relocates most of the queuing from the switch fabric to the edge of the sending host.


See the EQDS paper, which relocates most of the queuing from the core network to the leaf switch or the sending host's network adapter. EQDS uses shallow buffers in the switch fabric to reduce latency and significantly enhance the visibility of congestion patterns across the fabric. This approach also facilitates the co-existence and evolution of multiple protocols, such as TCP and RoCEv2, on the same network.
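
The sketch below models the core idea of edge queueing: packets wait in a queue at the sending edge and enter the shallow-buffered fabric only when the receiver issues a pull credit. This is a drastic simplification of EQDS (one flow, no pacing or fabric model), with hypothetical names.

```python
# Hedged sketch of edge queueing with receiver-driven pull credits.
from collections import deque

edge_queue = deque(f"pkt{i}" for i in range(6))  # held at the sender edge
fabric_in_flight = []                            # shallow fabric buffers

def on_pull_credit():
    """Receiver pull: release exactly one packet into the fabric."""
    if edge_queue:
        fabric_in_flight.append(edge_queue.popleft())

for _ in range(3):           # receiver paces the sender with 3 credits
    on_pull_credit()
print(fabric_in_flight)      # ['pkt0', 'pkt1', 'pkt2']; rest wait at the edge
```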


The next 3 techniques manage packet loss and retransmission.


5. Selective Retransmission

Problem Space:

Unlike InfiniBand, Ethernet is lossy in nature: packets may be dropped frequently during congestion.

The congestion control techniques above minimize packet losses but typically cannot eliminate them, even when the network is under-subscribed. Scheduling across a distributed fabric is extremely difficult to control, and bursts of congestion do occur "temporally" even in under-subscribed networks.

In RoCEv2 RDMA networks, the conventional mechanism for addressing packet loss is the "Go-Back-N" strategy. This method requires the sender to retransmit all packets from the point of the lost packet onwards, even if subsequent packets were received successfully. While simple in concept, this approach is inefficient: it wastes bandwidth by retransmitting packets already received and increases overall network latency.

Solution Space:

Selective retransmission is a significant enhancement over the "Go-Back-N" retransmission strategy: only the lost packets are retransmitted, guided by Selective ACKs (SACKs).

The figure shows that selective retransmission minimizes the number of packets that need to be retransmitted.


It employs Selective Acknowledgments (SACKs), which allow the receiver to specify exactly which packets have been successfully received. With this precise feedback, the sender needs only to retransmit the packets that were actually lost. This "selective" approach dramatically reduces packet retransmissions.
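
The contrast between the two strategies fits in a few lines of Python. This sketch is illustrative only; real SACK encodings use compact bitmaps or ranges rather than explicit lists of sequence numbers.

```python
# Hedged sketch contrasting SACK-based selective retransmission
# with Go-Back-N.

def sack_retransmits(sent, received):
    return sorted(set(sent) - set(received))     # only the holes

def go_back_n_retransmits(sent, received):
    first_loss = min(set(sent) - set(received))
    return [s for s in sent if s >= first_loss]  # everything from the loss on

sent = list(range(10))
received = [0, 1, 2, 4, 5, 7, 8, 9]              # packets 3 and 6 dropped
print(sack_retransmits(sent, received))          # [3, 6]        -> 2 packets
print(go_back_n_retransmits(sent, received))     # [3, 4, ..., 9] -> 7 packets
```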

By ensuring only the necessary packets are retransmitted, SACK based selective retransmission is a critical technique to improve tail latency and network efficiency.

This selective retransmission technique is much more effective with shallow-buffer switches (see EQDS), which provide a faster feedback loop from transmitter to receiver and back. Otherwise, the end-points see long delays between packet losses and the corresponding retransmits.


6. Packet Trimming

Problem Space:

Packet loss detection traditionally relies on timeouts, which are too slow for modern high-speed networks. These timeouts, designed to cover worst-case tail delays, are invariably lengthy, leading to delays and inefficiencies.

Solution Space:

An innovative and more efficient strategy is the implementation of receiver-driven feedback for packet trimming.

In scenarios where a switch's queue overflows, instead of dropping packets entirely, the switch trims the packet payload, forwarding only the packet header to the intended destination. This method is advantageous as the payload is usually much larger than the header, thereby significantly reducing bandwidth overhead.

The receiving Network Adapter, upon receiving the trimmed packet header, promptly issues a "Not Acknowledged" (NACK) signal back to the sender, along with additional congestion data enabling rapid sender feedback. Furthermore, the receiver can allocate "explicit" end-to-end credits for the retransmission of the packet.

In response to a trimmed-packet NACK, the sender can reprioritize the retransmission of packets, often using a higher-priority traffic class (TC). This prioritization helps the retransmitted packets avoid being delayed by lower-priority congested queues. If the receiver provided end-to-end credits, the receiving network adapter reserves "buffer space" to receive the retransmitted packet.
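
Here is a minimal sketch of trimming at a congested switch queue, with invented sizes and field names: when the queue is full, the payload is discarded but the header is still forwarded, so the receiver can NACK immediately instead of waiting for a timeout.

```python
# Hedged sketch of packet trimming at an overflowing switch queue.

QUEUE_LIMIT = 2
queue, trimmed_headers = [], []

def enqueue(packet):
    if len(queue) < QUEUE_LIMIT:
        queue.append(packet)                     # normal forwarding
    else:
        header = {k: packet[k] for k in ("flow", "seq")}
        trimmed_headers.append(header)           # payload dropped, header
                                                 # still reaches the receiver

for seq in range(4):
    enqueue({"flow": "A", "seq": seq, "payload": "x" * 4096})

# The receiver sees headers for seq 2 and 3 and NACKs them right away.
print([h["seq"] for h in trimmed_headers])       # [2, 3]
```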

In summary, trimming packets and receiver-driven packet credit systems are great at prioritizing retransmitted packets and reserving space in receiver buffers, limiting further loss of retransmitted packets, thus improving tail latencies.

The BTS mechanism discussed earlier can serve as an alternative to packet trimming and receiver-driven credits. BTS has the advantage of signaling congestion earlier, before the switch congestion point.


7. Link Layer Retransmissions

Problem Space:

As network speeds increase, frame loss and errors at the link layer are becoming a bigger issue.

Lost frames need to be retransmitted; packet loss is often addressed through end-to-end retransmissions, a method that tends to have fabric-wide impact and a longer feedback loop.

Solution Space:

Link Layer Retransmission (LLR) is a promising solution borrowed from the HPC space.

LLR operates on the principle of Negative Acknowledgements (NAKs) sent through the reverse channel when bit errors are detected. The transmitting device temporarily stores frames that are in transit (i.e., unacknowledged) in a local buffer. Upon receiving a NAK, it can rapidly retransmit the specific frame(s) in question.

Since retransmission is limited to the two nodes directly connected, it doesn't impact the fabric. This "localized" approach to error correction improves response time, and reduces the need for end-to-end retransmissions.
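
A minimal sketch of the LLR replay buffer follows. It is a simplified single-link model with hypothetical names: the transmitter holds unacknowledged frames locally, and a NAK triggers a hop-local resend that the rest of the fabric never sees.

```python
# Hedged sketch of link layer retransmission with a local replay buffer.

replay_buffer = {}                   # seq -> frame, held until ACKed

def send_frame(seq, frame):
    replay_buffer[seq] = frame       # keep a copy for possible replay
    return frame                     # ...and put it on the wire

def on_ack(seq):
    replay_buffer.pop(seq, None)     # peer received it; free the copy

def on_nak(seq):
    return replay_buffer[seq]        # instant hop-local retransmit

for seq in range(3):
    send_frame(seq, f"frame{seq}")
on_ack(0)
on_ack(2)
print(on_nak(1))                     # frame1 resent from the local buffer
```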


8. Encryption and Multi-Tenant Isolation

Problem Space:

Security is a mandatory requirement across all multi-tenant networks in the cloud datacenter, as well as in the enterprise and telecom markets.

Since encryption adds overheads, the performance-sensitive HPC and AI Training backend networks typically run without security encryption, and instead rely on datacenter physical security. For these applications, encryption can be treated as an optional requirement.

Security is a common requirement across all transport protocols, such as RoCEv2, TCP, and QUIC. That said, RoCEv2 RDMA in particular has several specific security issues that must be fixed.

Solution Space:

Implementing encryption and ensuring multi-tenant isolation requires a nuanced approach, since security/isolation and high network performance are usually at odds.

TLS and IPsec are two widely used encryption methods for securing modern networks. Tenant isolation is achieved through virtualization. These encryption protocols must be accelerated in hardware within network adapters for performance reasons in 100G+ networks. Virtualization adds another layer of complexity.

Notably, Google has open-sourced the hardware-friendly PSP protocol, which the open-sourced Falcon protocol also uses.

Security is a particularly specialized topic, best left to the realm of security experts; I'll look to learn from them.

Final Notes

Summarizing the key techniques in the solution space:

  1. Shallow Buffer Switches
  2. Multi-pathing
  3. Selective Retransmission
  4. Receiver Driven Control Loops

We explored a number of problem-solution pairs to improve modern RDMA networks. Many of these techniques have been borrowed from custom Ethernet implementations or from adjacent non-Ethernet standards. The main job ahead of us is to adapt these "known" techniques to the open Ethernet protocol stack.

What major problem-solution pairs have I missed in this write-up?

It's noteworthy that:

  • UEC (Ultra Ethernet Consortium) is developing a new transport protocol, poised to replace RoCEv2. This initiative represents a significant step in advancing RDMA network technologies.
  • And Google has made a substantial contribution to enriching RDMA technology by open-sourcing its Falcon protocol.

These two modern transport protocols solve many of the problem-solution pairs described in this article. Both UEC and Falcon are valuable assets for ongoing research and development in RDMA-based transport protocols.

As we continue to innovate, let's hope that a single "standard" addresses most of the common use cases. The thinking process should rely on first principles and stay simple; otherwise, the actual implementations get very complex.

The RDMA landscape is rapidly transforming. The rise of AI is driving RDMA Ethernet innovation like a rocket ship!


Links:

See the links section in RDMA Network Trends. Additional new links:

  1. Congestion Control for Large-Scale RDMA Deployments
  2. SFC: Near-Source Congestion Signaling
  3. Datacenter RDMA: Issues at Hyperscale
  4. TIMELY: RTT-based Congestion Control
  5. Multi-Path Transport for RDMA in Datacenters
  6. Falcon at OCP Summit 2023
  7. Broadcom Blog - Ethernet fabric for AI at scale
  8. GPU Fabrics for GenAI Workloads - LinkedIn article by Sharada Yeluri (covers many of these methods with technical depth)

Glossary:

See the glossary section in RDMA Network Trends. Additional new glossary entries:

  1. 5-Tuple: A group of five values (source IP address, source port, destination IP address, destination port, and protocol) used to identify and manage network traffic in ECMP.
  2. CSIG (Congestion Signaling): A protocol for attaching compact, fixed-length summaries of network conditions to packets in-band. URL: CSIG - Congestion Signaling
  3. East-West Traffic: Network traffic that travels within a data center, often between servers.
  4. FIFO (First In, First Out): A queueing method where the first packet to arrive is the first to be processed.
  5. INT CC (In-band Network Telemetry Congestion Control): A method for delivering detailed delay and telemetry measurements for congestion control.
  6. NACK (Negative Acknowledgement): A signal sent to indicate that a packet was not received correctly and needs retransmission.
  7. Network Adapter/NIC (Network Interface Card): A device enabling computers to connect to a network. It manages data transmission and can be integrated into the motherboard or added as a separate card.
  8. PSP (Protected and Scalable Protocol): An open-source protocol developed by Google for enhanced network security.
  9. Adaptive Routing: A technique where the path of network traffic is dynamically adjusted based on real-time conditions.
  10. High-Radix Network: A network with a large number of paths between any two endpoints, often used in high-performance computing.
  11. RTT (Round Trip Time): RTT refers to the time it takes for a signal to be sent from a source to a destination, plus the time it takes for an acknowledgment of that signal to be received back at the source.
  12. Tail Latency: Refers to the delay in processing the slowest requests in a network system.
  13. Virtualization: The creation of virtual versions of network resources, often used for tenant isolation.


Disclaimer: The views and opinions articulated in this article solely represent the author's perspective and do not necessarily reflect the official stance of the author's employer or any other affiliated organization.

#RDMA, #Ethernet, #Networking, #TechInnovation
