Ethernet’s Journey: From Lossy to Lossless - Part 2: Enhanced Ethernet with RoCEv2
Amit Godbole
Working for Customer Success @ Marvell | AI Infrastructure Enthusiast | Network Switching Systems Solutions
Part 1 of this series discussed why Ethernet is considered a lossy technology and introduced InfiniBand, a network fabric designed for lossless performance. It highlighted InfiniBand's key features, such as subnet management, QoS support through virtual lanes, credit-based flow control, data integrity with dual CRCs, and the use of Queue Pairs with RDMA, which make it well suited to high-performance computing (HPC) applications.
RDMA and Ethernet
Ethernet has a broad vendor ecosystem with many suppliers, making it easier to find compatible hardware and software. To achieve InfiniBand-like high performance and low latency on Ethernet networks, the InfiniBand Trade Association (IBTA) introduced RDMA over Converged Ethernet (RoCE), which adapts InfiniBand's Layer 4 RDMA transport for use over Ethernet.
RoCE
By replacing InfiniBand's fabric-specific Layer 2 with Ethernet, RoCE provides the same RDMA service and semantics over Ethernet networks: InfiniBand transport packets are carried directly inside Ethernet frames.
Applications written against OFA verbs run unchanged over an Ethernet fabric with RoCE. Each RoCE frame carries an Ethernet Layer 2 header followed by the InfiniBand network and transport headers, and the fabric forms a single Ethernet Layer 2 domain.
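As an illustrative sketch (not taken from any particular driver), the RoCE v1 frame layout and its identifying EtherType can be pictured as follows; 0x8915 is the EtherType assigned to RoCE, and everything after the Ethernet header is InfiniBand as-is:

```c
#include <stdio.h>

/* RoCE v1 frame layout (single Layer 2 domain, not IP-routable):
 *
 *  | Ethernet L2 header (EtherType 0x8915) | IB GRH | IB BTH | payload | ICRC | FCS |
 *
 * The InfiniBand Global Route Header (GRH) and Base Transport Header (BTH)
 * take the place of the IP/UDP layers used by ordinary Ethernet traffic.
 */
#define ROCE_V1_ETHERTYPE 0x8915

int main(void)
{
    printf("RoCE v1 EtherType: 0x%04x\n", ROCE_V1_ETHERTYPE);
    return 0;
}
```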
Challenges in RoCE:
RoCE (RoCEv1) only supports RDMA transmission over Layer 2 Ethernet, which limits network scale. For RDMA within a single rack, a Layer 2 protocol like RoCEv1 is fine; to cross racks, however, traffic must pass through a Layer 3 IP router, which RoCEv1 does not support. That is where the need for an IP-routable RoCE arises.
RoCEv2:
Customers demanded RDMA across racks, with a focus on data center networks. RoCEv2 encapsulates InfiniBand transport packets within UDP, IP, and Ethernet headers, using UDP destination port 4791 to identify RoCE packets.
RoCEv2 enables the RoCE protocol to be routed over Layer 3, removing the scale limitation of the original RoCE.
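A minimal sketch of the RoCEv2 encapsulation follows; the UDP destination port 4791 is the value mentioned above, while the small classifier helper is purely illustrative and not taken from any library:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* RoCEv2 packet layout (IP-routable):
 *
 *  | Ethernet L2 | IPv4/IPv6 | UDP (dst port 4791) | IB BTH | payload | ICRC | FCS |
 *
 * The InfiniBand GRH of RoCE v1 is replaced by ordinary IP/UDP headers,
 * which is what allows RoCEv2 traffic to cross Layer 3 routers.
 */
#define ROCE_V2_UDP_DPORT 4791

/* Hypothetical classifier: does this UDP destination port carry RoCEv2? */
static bool is_rocev2_port(uint16_t udp_dst_port)
{
    return udp_dst_port == ROCE_V2_UDP_DPORT;
}

int main(void)
{
    printf("Port 4791 is RoCEv2: %s\n", is_rocev2_port(4791) ? "yes" : "no");
    return 0;
}
```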
Common InfiniBand Transport Protocol:
InfiniBand, RoCE, and RoCEv2 all share the same InfiniBand transport layer. Its defining features are RDMA and Queue Pairs, which deliver low latency and high throughput through kernel bypass and zero-copy operations at the application level, resulting in efficient CPU usage and excellent scalability.
OpenFabrics Alliance (OFA) Verbs & OFA Stacks:
The RDMA OpenFabrics Alliance (OFA) Verbs API and the RDMA OFA Software Stack are important for implementing RDMA in high-performance computing.
The Verbs API provides a low-level interface for RDMA operations. The API includes functions for managing RDMA resources like queue pairs, completion queues, memory regions, and protection domains and supports operations like RDMA read, write, and send/receive.
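As a hedged sketch of what this looks like in practice with libibverbs (error handling and teardown trimmed, and a device context ctx assumed to be already opened; device discovery is sketched further below), the resources named above map directly onto Verbs calls:

```c
#include <stdlib.h>
#include <infiniband/verbs.h>

/* Illustrative sketch only: error handling omitted, `ctx` assumed opened. */
static void create_rdma_resources(struct ibv_context *ctx)
{
    /* Protection domain: groups resources that are allowed to work together. */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Completion queue: where finished work requests are reported. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16 /* depth */, NULL, NULL, 0);

    /* Memory region: registers a buffer so the NIC can DMA to/from it and
     * remote peers can target it with RDMA read/write. */
    size_t len = 4096;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);

    /* Queue pair: a send queue plus a receive queue; a reliable connected
     * (RC) QP is the type used for RDMA read and write operations. */
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,
        .cap     = { .max_send_wr = 16, .max_recv_wr = 16,
                     .max_send_sge = 1, .max_recv_sge = 1 },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);

    (void)mr; (void)qp; /* teardown (ibv_destroy_qp, etc.) omitted for brevity */
}
```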
The OFA software stack offers a single cross-platform framework supporting RDMA and kernel bypass. It includes drivers, libraries, and utilities that support various RDMA-capable network fabrics like InfiniBand, iWARP, and RoCE. Using the stack, applications can run across various RDMA-capable fabrics, directly accessing network hardware for low-latency, high-throughput data transfers.
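A small sketch of that fabric-agnostic behavior: the libibverbs enumeration code below runs unchanged whether the devices it finds are InfiniBand HCAs, RoCE NICs, or iWARP NICs (error handling kept minimal).

```c
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    /* Enumerate all RDMA-capable devices; the API does not care which
     * fabric type sits underneath. */
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs)
        return 1;

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (!ctx)
            continue;
        printf("found RDMA device: %s\n", ibv_get_device_name(devs[i]));
        ibv_close_device(ctx);
    }

    ibv_free_device_list(devs);
    return 0;
}
```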
Data Plane Development Kit (DPDK) with RDMA:
DPDK is a collection of libraries and drivers for fast packet processing in user space, enabling applications to bypass the kernel and interact directly with network hardware. DPDK can be used alongside the Verbs API to enhance RDMA performance: DPDK handles packet processing while the Verbs API handles RDMA operations. The urdma project, a user-space RDMA implementation built on DPDK, is an example of this integration, delivering low-latency communication and high data transfer rates.
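A minimal, hedged sketch of the DPDK side of such an integration: initializing the Environment Abstraction Layer (EAL) and counting the Ethernet ports available for user-space packet I/O. Port configuration, the RX/TX polling loop, and the verbs-based RDMA path are all omitted here.

```c
#include <stdio.h>
#include <rte_eal.h>
#include <rte_ethdev.h>

int main(int argc, char **argv)
{
    /* Initialize the DPDK Environment Abstraction Layer: hugepage memory,
     * lcores, and the poll-mode drivers bound to the NICs. */
    int ret = rte_eal_init(argc, argv);
    if (ret < 0) {
        fprintf(stderr, "EAL initialization failed\n");
        return 1;
    }

    /* Ports DPDK can drive directly from user space, bypassing the kernel. */
    uint16_t nb_ports = rte_eth_dev_count_avail();
    printf("DPDK sees %u usable Ethernet port(s)\n", nb_ports);

    /* Port setup, mempool creation, and the RX/TX loop (plus the
     * verbs-based RDMA path) would follow here. */
    rte_eal_cleanup();
    return 0;
}
```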
Lossless Ethernet
Data Center Bridging
The IEEE 802.1 Data Center Bridging Task Group developed DCB to improve data center network performance and reliability. DCB enhances traditional Ethernet by making it lossless, and its components work together to create an efficient network environment for data centers.
Priority Flow Control (PFC) provides link-level flow control for individual priority levels, preventing packet loss due to congestion. It works in conjunction with Enhanced Transmission Selection (ETS), which allocates bandwidth among different traffic classes to ensure a fair share of the bandwidth for each class. Explicit Congestion Notification (ECN) marks packets instead of dropping them when congestion is detected, allowing the sender to reduce its transmission rate. ECN collaborates with Data Center Quantized Congestion Notification (DCQCN), an end-to-end congestion control scheme for RDMA over Converged Ethernet (RoCEv2). DCQCN combines ECN and PFC to notify senders about congestion and ensure lossless transport, optimizing the network for high throughput and low-latency applications.
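To make the DCQCN idea concrete, here is a hedged sketch of the sender-side reaction point as described in the published DCQCN algorithm: when a congestion notification packet (CNP, generated from ECN marks) arrives, the sender cuts its current rate in proportion to an EWMA estimate alpha of the congestion level, and recovers the rate when the congestion subsides. The structure, names, and parameter value below are illustrative, not taken from any particular NIC firmware.

```c
/* Illustrative DCQCN reaction-point state (per RoCEv2 flow). */
struct dcqcn_flow {
    double rate_current;  /* RC: current sending rate     */
    double rate_target;   /* RT: target rate for recovery */
    double alpha;         /* EWMA estimate of congestion  */
};

#define DCQCN_G 0.0625    /* illustrative EWMA gain */

/* Called when a CNP arrives for this flow (the receiver saw ECN marks). */
static void dcqcn_on_cnp(struct dcqcn_flow *f)
{
    f->rate_target  = f->rate_current;                          /* remember where we were */
    f->rate_current = f->rate_current * (1.0 - f->alpha / 2.0); /* multiplicative cut     */
    f->alpha        = (1.0 - DCQCN_G) * f->alpha + DCQCN_G;     /* congestion increased   */
}

/* Called periodically while no CNPs arrive: the congestion estimate decays
 * and the rate recovers back toward the target (fast recovery). */
static void dcqcn_on_quiet_period(struct dcqcn_flow *f)
{
    f->alpha        = (1.0 - DCQCN_G) * f->alpha;
    f->rate_current = (f->rate_current + f->rate_target) / 2.0;
}
```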
Data Center Bridging Exchange (DCBX) is a protocol that exchanges configuration parameters between devices to ensure consistent Quality of Service (QoS) settings. DCBX ensures that all devices are configured correctly to support PFC, ETS, and other features.
InfiniBand vs Data Center Bridging with Ethernet:
Data Center Bridging (DCB) and InfiniBand are both technologies that aim to improve network performance and reliability. However, they achieve these goals through different mechanisms.
Subnet Management: InfiniBand uses a Subnet Manager to configure and manage the network, ensuring optimal routing and resource allocation. Ethernet, on the other hand, typically relies on protocols such as Spanning Tree Protocol (STP) or Shortest Path Bridging (SPB) to manage network topology and ensure loop-free paths.
Quality of Service (QoS): InfiniBand implements QoS using Virtual Lanes (VLs), separate logical channels within a physical link, each of which can have its own QoS settings. DCB enhances Ethernet with features like Priority-based Flow Control (PFC), Enhanced Transmission Selection (ETS), and Data Center Bridging Exchange (DCBX) to allocate bandwidth among traffic classes and ensure efficient use of the available bandwidth.
Flow Control: InfiniBand employs Credit-Based Flow Control to manage buffer space and prevent packet loss. Receivers grant credits to senders, indicating available buffer space. DCB, on the other hand, uses Priority-based Flow Control (PFC) to provide link-level flow control for individual priority levels, preventing packet loss due to congestion.
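The credit-based side of that contrast can be sketched in a few lines: the sender transmits only while it holds credits previously granted by the receiver, so the receive buffer can never be overrun. The structure and names below are purely illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative credit-based link state: credits represent free receive
 * buffers granted by the receiver; the sender consumes one per packet. */
struct ib_link {
    uint32_t credits;
};

/* Sender side: transmit only if a credit is available, so no packet is
 * ever dropped for lack of buffer space. */
static bool try_send_packet(struct ib_link *link)
{
    if (link->credits == 0)
        return false;          /* stall instead of dropping */
    link->credits--;
    /* ... put the packet on the wire ... */
    return true;
}

/* Receiver side: as buffers are freed, return credits to the sender. */
static void grant_credits(struct ib_link *link, uint32_t freed_buffers)
{
    link->credits += freed_buffers;
}
```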
Congestion Management: Both InfiniBand and DCB utilize Explicit Congestion Notification (ECN) for congestion management. However, DCB also includes Data Center Quantized Congestion Notification (DCQCN) for end-to-end congestion management and lossless transport.
DCB and InfiniBand share the common goal of enhancing network performance, reliability, and efficiency. InfiniBand uses a Subnet Manager, Virtual Lanes, Credit-Based Flow Control, and ECN, while DCB relies on DCBX, ETS, PFC, ECN, and DCQCN.
Conclusion:
In conclusion, the journey of Ethernet from a lossy to a lossless network has been transformative, particularly with the integration of Data Center Bridging (DCB) and RoCEv2. These advancements have enabled Ethernet to achieve InfiniBand-like high performance and low latency, making it a viable option for modern data centers. By leveraging DCB’s features such as Priority-based Flow Control (PFC), Enhanced Transmission Selection (ETS), and Data Center Bridging Exchange (DCBX), along with RoCEv2’s capabilities, Ethernet networks can now support high-throughput, low-latency applications with improved reliability and efficiency. This evolution not only simplifies network infrastructure but also enhances the overall performance and scalability of data center environments, paving the way for future innovations in network technology.