Optimizing AI/ML and HPC Workloads: Exploring RDMA (RoCEv2) for High-Performance Data Center Networking
GPU to GPU over RoCEv2

With the rapid growth of Artificial Intelligence (AI) and Machine Learning (ML) applications, the demand for High-Performance Computing (HPC) systems within data centers is escalating rapidly. HPC necessitates swift and seamless communication characterized by high-speed, low-latency connections. This raises a fundamental question: can the traditional TCP/IP stack effectively support HPC network communication?

Remote Direct Memory Access (RDMA), a renowned technology utilized in HPC and storage networking environments, offers high throughput and low-latency information transfer between nodes directly at the memory-to-memory level, without imposing a load on the CPU. By enabling direct access to memory on one computer from another, RDMA facilitates seamless data transfers as if the accessing computer were interacting with its own local memory. This transfer operation is offloaded to the network adapter hardware, bypassing the operating system software network stack.

Traditional Mode vs RDMA Mode

HPC networks have traditionally leveraged the InfiniBand (IB) stack, which offers the full benefits of RDMA: high throughput, low latency, and CPU bypass. InfiniBand also integrates congestion management directly into the protocol. In contrast, the traditional TCP/IP stack consumes more CPU as network bandwidth grows and adds transmission latency, a combination clearly unsuitable for HPC requirements.

RDMA Network Protocols:

  1. InfiniBand (Native RDMA): This native RDMA technology offers a channel-based P2P message queue forwarding model. Through this model, applications gain direct access to messages via virtual channels, bypassing the OS and protocol stacks. This enables RDMA read and write access between nodes while offloading work from the CPU. Furthermore, InfiniBand's link layer builds in its own flow-control and retransmission mechanisms to uphold QoS without relying on deep packet buffering. It requires dedicated InfiniBand switches and NICs, optimizing performance in HPC environments.
  2. iWARP (RDMA over TCP): The Internet Wide Area RDMA Protocol (iWARP) enables RDMA operations over TCP, providing RDMA functionality over standard Ethernet infrastructure. This allows organizations to utilize their existing Ethernet switches for RDMA purposes and leverages TCP's loss-recovery mechanisms. However, the Network Interface Cards (NICs) must support iWARP to deliver the CPU offload benefits.
  3. RoCEv1 (RDMA over L2 Ethernet): RoCEv1 is an RDMA protocol operating on the Ethernet link layer. It facilitates communication between any two hosts within the same Ethernet broadcast domain. For reliable transmission at the physical layer, switches must support technologies like PFC (Priority Flow Control) and other flow control mechanisms.
  4. RoCEv2 (RDMA over UDP): RoCEv2 addresses the limitations of v1, which is confined to a single VLAN. It enables usage across both L2 and L3 networks by adjusting packet encapsulation, including IP and UDP headers.

RDMA Protocol stack

RoCEv2 in Hyperscale Datacenters:

In the fast-paced realm of hyperscale data centers, the pursuit of high-performance networking solutions has sparked increasing interest in RoCEv2. RoCEv2 represents a fusion of InfiniBand's performance benefits with Ethernet's widespread accessibility, enabling seamless RDMA functionality across existing Ethernet infrastructure. Leveraging a Converged Ethernet infrastructure, RoCEv2 facilitates the coexistence of traditional Ethernet traffic with RDMA traffic on the same network, streamlining network management and obviating the need for a separate RDMA fabric. Nonetheless, deploying RoCEv2 in converged Ethernet fabrics poses challenges such as ensuring lossless and low-latency communication through the allocation of necessary network resources, optimization of packet encapsulation over UDP, and implementation of effective congestion control mechanisms like Priority Flow Control (PFC) and Data Center Quantized Congestion Notification (DCQCN). This article delves into the intricacies of RoCEv2, examining its encapsulation over UDP and proposing strategies for resource allocation and congestion control.

RoCEv2

To utilize RoCEv2, specialized RDMA-capable NICs (RNICs) are required on both the source and destination hosts. The physical (PHY) speed of RDMA cards typically starts at 50 Gbps, with cards currently available at speeds up to 400 Gbps.

RoCEv2 Packet Format:

To ensure seamless transport of RDMA traffic over IP and UDP across Layer 3 Ethernet networks, packet encapsulation is crucial. The dedicated UDP destination port 4791 identifies the InfiniBand payload, while using different source ports for different Queue Pairs (QPs) enables ECMP load sharing for optimized forwarding.

Now, let's delve into the specifics:

  • RoCEv2 runs over UDP/IPv4 or UDP/IPv6, replacing the InfiniBand network layer with IP and UDP headers on top of the Ethernet link layer, which makes the traffic routable.
  • It defaults to the well-known UDP destination port number 4791.
  • The UDP source port serves as a flow identifier and can be varied per Queue Pair to optimize packet forwarding with ECMP (see the sketch after this list).
  • RoCEv2 flow and congestion control rely on Priority Flow Control (PFC) and the Explicit Congestion Notification (ECN) bits in the IP header to manage congestion, along with CNP (Congestion Notification Packet) frames that signal congestion back to the sender.
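To make the encapsulation concrete, here is a minimal Python sketch (no RDMA hardware needed) that packs a 12-byte InfiniBand Base Transport Header and derives a UDP source port from the queue pair numbers for ECMP entropy. The opcode value and the port-derivation scheme are illustrative assumptions, not something mandated by the RoCEv2 specification.

```python
import struct
import zlib

ROCEV2_UDP_DPORT = 4791  # IANA-assigned UDP destination port for RoCEv2

def build_bth(opcode: int, dest_qp: int, psn: int,
              pkey: int = 0xFFFF, ack_req: bool = False) -> bytes:
    """Pack a 12-byte InfiniBand Base Transport Header (BTH)."""
    se_m_pad_tver = 0                     # SE=0, MigReq=0, PadCnt=0, TVer=0
    ack_byte = 0x80 if ack_req else 0x00  # AckReq bit + 7 reserved bits
    return struct.pack(
        "!BBHB3sB3s",
        opcode,
        se_m_pad_tver,
        pkey,
        0,                                # reserved
        dest_qp.to_bytes(3, "big"),       # 24-bit destination QP
        ack_byte,
        psn.to_bytes(3, "big"),           # 24-bit packet sequence number
    )

def udp_source_port(src_qp: int, dst_qp: int) -> int:
    """Derive a stable UDP source port from the queue pair so different QPs
    hash onto different ECMP paths (an illustrative scheme, not mandated)."""
    h = zlib.crc32(struct.pack("!II", src_qp, dst_qp))
    return 0xC000 + (h % 0x4000)          # stay in the range 49152-65535

# Example: an RDMA WRITE First packet (assumed RC opcode 0x06) to QP 0x1234
bth = build_bth(opcode=0x06, dest_qp=0x1234, psn=100)
sport = udp_source_port(src_qp=0x0055, dst_qp=0x1234)
print(f"BTH bytes : {bth.hex()}")
print(f"UDP ports : src={sport} dst={ROCEV2_UDP_DPORT}")
```

In practice the RNIC builds these headers in hardware; the point of the sketch is simply that the only RoCEv2-specific fields the Ethernet fabric acts on are the UDP ports and the DSCP/ECN bits.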

RoCEv2 Packet Format
RoCEv2 Wireshark Capture

RoCEv2 provides remarkable versatility at a reduced cost, making it optimal for constructing high-performance RDMA networks within traditional Ethernet environments. However, the readiness of commodity switches is paramount: configuring parameters such as headroom, PFC, and ECN settings on these switches can be complex, and it demands careful attention to ensure the right configuration is in place for peak performance. Factors such as congestion and routing significantly influence bandwidth and latency in high-performance networks.

Readiness for RoCEv2 Implementation in Ethernet Networks:

In order to unleash the true performance of RDMA, it is necessary to build a lossless network. Implementing RoCEv2 on regular Ethernet switches requires careful consideration of several key factors to ensure optimal performance and compatibility. Here are some important considerations:

  1. MTU (Maximum Transmission Unit): RoCEv2 benefits from a larger MTU than traditional Ethernet traffic to absorb the additional RDMA headers and reduce per-packet overhead. A jumbo MTU of 9000 bytes is typically recommended to avoid fragmentation and ensure efficient data transfer.
  2. QoS (Quality of Service): Implementing Quality of Service (QoS) mechanisms is essential to prioritize RoCEv2 traffic and ensure low-latency communication. This involves configuring switch queues and scheduling algorithms to give RDMA traffic higher priority than other network traffic. Differentiated Services Code Point (DSCP) marking allows for more granular QoS control by classifying traffic according to its priority level. Ethernet switches should support DSCP-based QoS to effectively manage RoCEv2 traffic and maintain optimal network performance (a classification sketch follows this list).
  3. PFC (Priority Flow Control): Priority Flow Control (PFC) is crucial for creating lossless Ethernet networks, as it prevents packet loss and ensures reliable transmission of RDMA traffic. Switches must support PFC based on IEEE 802.1Qbb standards to enable lossless operation for RoCEv2.
  4. ECN (Explicit Congestion Notification): ECN, particularly in the form of Data Center Quantized Congestion Notification (DCQCN), plays a vital role in managing congestion and maintaining optimal network performance. Switches must support ECN mechanisms to enable efficient congestion control for RoCEv2 traffic.
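As a concrete illustration of DSCP-based classification, the following Python sketch maps DSCP values to egress traffic classes. The specific values (DSCP 26 for RoCE data and DSCP 48 for CNPs, mapped to traffic classes 3 and 6) are assumptions modeled on defaults seen in several vendor deployment guides; your fabric's policy may differ.

```python
# Hypothetical DSCP-to-traffic-class mapping for a RoCEv2 fabric. The DSCP
# values and queue numbers are illustrative, not a vendor standard.
ROCE_DATA_DSCP = 26      # RoCE data traffic, steered to a lossless queue
CNP_DSCP = 48            # Congestion Notification Packets, high priority

DSCP_TO_TC = {
    ROCE_DATA_DSCP: 3,   # lossless queue with PFC enabled
    CNP_DSCP: 6,         # strict-priority queue so CNPs are not delayed
}
DEFAULT_TC = 0           # best-effort queue for everything else

def classify(tos_byte: int) -> int:
    """Return the egress traffic class for a packet given its IP ToS byte.
    DSCP is carried in the upper six bits of the ToS/Traffic Class octet."""
    dscp = tos_byte >> 2
    return DSCP_TO_TC.get(dscp, DEFAULT_TC)

print(classify(ROCE_DATA_DSCP << 2))   # -> 3 (lossless RoCE queue)
print(classify(CNP_DSCP << 2))         # -> 6 (CNP queue)
print(classify(0))                     # -> 0 (best effort)
```

Keeping CNPs on a separate, higher-priority queue matters because a delayed CNP delays the sender's rate cut.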

Now that we've explored the key considerations for implementing RoCEv2 on Ethernet networks, let's delve deeper into two critical components: Priority Flow Control (PFC) and Data Center Quantized Congestion Notification (DCQCN). Together, these mechanisms provide lossless communication and effective congestion management, enabling the seamless integration of RDMA technology in Ethernet environments.

PFC - Priority-based Flow Control:

Priority Flow Control (PFC) is the IEEE 802.1Qbb link-layer flow-control protocol designed to maintain a drop-free network environment. PFC enables a receiver to assert flow control by directing the transmitter to momentarily halt traffic for a particular priority level. It refines flow control from the whole physical port down to 8 virtual channels, aligned with 8 hardware queues (traffic classes TC0 through TC7), and can use DSCP (or 802.1p priority) markings so that different traffic flows are flow-controlled independently.

PFC Operation

As depicted in the above diagram,

  • When the switch buffer nears overflow (indicated by the XOFF threshold, signaling high buffer utilization in a specific priority queue), the switch dispatches PFC PAUSE frames to tell the upstream port to halt data transmission (modeled in the sketch after this list).
  • Once buffer usage falls below the XON threshold, the switch prompts the upstream port to RESUME traffic, signaling that congestion has eased.
  • Headroom denotes additional buffer space reserved to absorb packets already in flight while the PAUSE takes effect.
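The XOFF/XON hysteresis can be captured in a few lines of Python. This is a toy model of a single ingress priority queue; the threshold values are arbitrary illustrative numbers, not vendor defaults.

```python
# Toy model of per-priority PFC thresholds on an ingress queue: crossing XOFF
# triggers a PAUSE toward the upstream port, draining below XON triggers a
# resume (a PAUSE frame with quanta 0). Numbers are illustrative only.
class PfcQueue:
    def __init__(self, xoff_cells: int = 800, xon_cells: int = 600):
        self.xoff = xoff_cells
        self.xon = xon_cells
        self.depth = 0
        self.paused = False

    def enqueue(self, cells: int) -> None:
        self.depth += cells
        if not self.paused and self.depth >= self.xoff:
            self.paused = True
            print(f"depth={self.depth}: send PFC PAUSE upstream")

    def dequeue(self, cells: int) -> None:
        self.depth = max(0, self.depth - cells)
        if self.paused and self.depth <= self.xon:
            self.paused = False
            print(f"depth={self.depth}: send PFC resume (quanta 0)")

q = PfcQueue()
for _ in range(5):
    q.enqueue(200)    # bursty arrivals push the queue past XOFF
for _ in range(3):
    q.dequeue(150)    # draining below XON lifts the pause
```

The gap between XOFF and XON provides hysteresis so the switch does not flap between pausing and resuming, while the headroom above XOFF absorbs the traffic still in flight before the upstream port reacts.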

PFC operates on a per class-of-service (CoS) basis. During periods of congestion, PFC initiates the transmission of a pause frame specifying the CoS value requiring suspension. As illustrated in the diagram below, each PFC pause frame includes a 2-octet timer value for every CoS, indicating the duration for which traffic should be suspended. The timer is measured in pause quanta, where one quantum is the time required to transmit 512 bits at the port's speed, and the timer value ranges from 0 to 65535 quanta. A pause frame with a quanta value of 0 signals a resumption of traffic, prompting the paused flow to restart. PFC directs the peer to cease sending frames of a specific CoS value by sending a pause frame to a reserved destination MAC address; this pause frame operates within a single hop and does not propagate beyond the receiving peer. Once congestion alleviates, PFC can request the peer to recommence data transmission.

PFC Frame Format
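To put pause quanta into concrete terms, the short Python calculation below converts a PFC timer value into an absolute pause duration at different port speeds; the same timer value pauses traffic for far less wall-clock time on a faster link.

```python
# One pause quantum is the time needed to transmit 512 bits at the port
# speed, so a PFC timer value maps to an absolute duration that shrinks
# as link speeds increase.
def pause_duration_us(quanta: int, port_speed_gbps: float) -> float:
    """Pause duration requested by a PFC frame, in microseconds."""
    return quanta * 512 / (port_speed_gbps * 1e9) * 1e6

for speed in (25, 100, 400):
    print(f"{speed:>3} Gbps: max pause (65535 quanta) = "
          f"{pause_duration_us(65535, speed):.1f} us")
# A timer value of 0 means "resume immediately".
```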

PFC's drawback lies in its coarse granularity: it halts all traffic of a particular traffic class arriving on an ingress port, including flows destined for uncongested egress ports. Common issues associated with PFC include head-of-line (HoL) blocking, unfairness, and deadlock situations, which significantly reduce the throughput, latency, and utilization performance of RoCEv2. Therefore, RoCEv2 requires an end-to-end, per-flow congestion control mechanism to adjust flow rates, eliminate congestion quickly, and minimize how often PFC is triggered.

DCQCN - Congestion control with ECN:

Data Center Quantized Congestion Notification (DCQCN) serves as an end-to-end congestion control mechanism tailored for RoCEv2. This integration of ECN and PFC aims to enable drop-free Ethernet connectivity across the network. The concept behind DCQCN involves leveraging ECN to implement flow control by reducing the transmission rate at the sender upon congestion onset, effectively minimizing the need for PFC intervention.

In DCQCN, the switch acts as the Congestion Point (CP): it detects congestion by monitoring its queue lengths and flags it via the ECN field, using a RED-like mechanism to probabilistically mark data packets based on queue depth. The receiver acts as the Notification Point (NP), generating Congestion Notification Packets (CNPs) and sending them back to the sender. The sender, acting as the Reaction Point (RP), reduces the flow rate whenever a CNP is received within a control period; otherwise, it increases the flow rate, driven by timers and byte counters.

DCQCN Operation
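The sender-side behavior can be sketched in a few lines of Python. This is a simplified model of the Reaction Point rate logic, loosely following the published DCQCN algorithm (multiplicative decrease on CNP arrival, alpha as a moving congestion estimate, fast recovery followed by additive increase when CNPs stop); the parameter values are illustrative, not tuned recommendations.

```python
# Simplified sketch of the DCQCN Reaction Point (sender) rate logic.
# Parameters (g, additive-increase step, update period count) are illustrative.
class DcqcnSender:
    def __init__(self, line_rate_gbps=100.0, g=1 / 16, rai_gbps=0.5):
        self.line_rate = line_rate_gbps
        self.rc = line_rate_gbps       # current sending rate
        self.rt = line_rate_gbps       # target rate
        self.alpha = 1.0               # congestion estimate
        self.g = g
        self.rai = rai_gbps            # additive-increase step
        self.quiet_periods = 0         # update periods without a CNP

    def on_cnp(self):
        """Receiver saw CE-marked packets and sent a CNP: cut the rate."""
        self.rt = self.rc
        self.rc = self.rc * (1 - self.alpha / 2)
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.quiet_periods = 0

    def on_update_timer(self):
        """No CNP during this period: decay alpha and recover the rate."""
        self.alpha = (1 - self.g) * self.alpha
        self.quiet_periods += 1
        if self.quiet_periods > 5:           # past fast recovery
            self.rt = min(self.rt + self.rai, self.line_rate)
        self.rc = min((self.rt + self.rc) / 2, self.line_rate)

s = DcqcnSender()
s.on_cnp()                                   # congestion: rate is halved
print(f"after CNP: {s.rc:.1f} Gbps")
for _ in range(10):                          # congestion clears: rate climbs back
    s.on_update_timer()
print(f"after recovery: {s.rc:.1f} Gbps")
```

In a real deployment these updates run in the RNIC hardware, and the parameters are typically exposed as per-NIC tunables.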

ECN uses the two least significant (right-most) bits of the DiffServ/Traffic Class octet in the IPv4 or IPv6 header to encode four code points (decoded in the snippet after this list):

  • 00 – Non ECN-Capable Transport (Non-ECT)
  • 10 – ECN Capable Transport 0 (ECT-0)
  • 01 – ECN Capable Transport 1 (ECT-1)
  • 11 – Congestion Encountered (CE)
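A tiny Python helper makes the bit positions explicit; the example ToS values assume the DSCP 26 marking used in the earlier QoS sketch.

```python
# The ECN field occupies the two least-significant bits of the IPv4 ToS /
# IPv6 Traffic Class octet; the upper six bits carry the DSCP.
ECN_NAMES = {
    0b00: "Non-ECT (not ECN-capable)",
    0b10: "ECT-0 (ECN-capable transport)",
    0b01: "ECT-1 (ECN-capable transport)",
    0b11: "CE (congestion encountered)",
}

def ecn_of(tos_byte: int) -> str:
    return ECN_NAMES[tos_byte & 0b11]

# A RoCE packet marked DSCP 26 and ECT-0 carries ToS 0x6A; a congested
# switch rewrites only the ECN bits, turning the octet into 0x6B (CE).
print(ecn_of(0x6A))   # ECT-0
print(ecn_of(0x6B))   # CE
```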

In the event of congestion, the network device re-marks the packet as Congestion Encountered (11) but sends nothing back to the sender itself. The re-marked packet arrives at the destination, and the destination then sends a notification to the sender asking it to reduce its transmission rate.

All the switches or routers along the path need to support ECN.

Leveraging the combination of PFC and DCQCN:

In dynamic network environments, the combined use of PFC and DCQCN optimizes RDMA performance. DCQCN mitigates sustained congestion patterns, such as incast, by signaling congestion from anywhere along the data path back to the endpoints. Meanwhile, PFC handles short traffic bursts from applications near the endpoints by pausing the immediate upstream sender hop by hop. In this setup, DCQCN serves as the primary congestion management mechanism, with PFC acting as the fail-safe.

Conclusion:

In conclusion, the evolution of RDMA protocols such as RoCEv2 presents promising opportunities for hyperscale data centers seeking high-performance networking solutions. By leveraging RoCEv2 over converged Ethernet fabrics, data center operators can achieve seamless RDMA functionality without the need for separate RDMA fabrics, streamlining network management and reducing costs. However, successful implementation of RoCEv2 requires addressing challenges related to lossless and low-latency communication, resource allocation, and congestion control. By carefully considering factors such as MTU size, QoS mechanisms, PFC, and ECN settings, data center operators can harness the full potential of RoCEv2 to enhance performance and scalability in hyperscale environments.
