Ethernet’s Journey: From Lossy to Lossless - Part 2: Enhanced Ethernet with RoCEv2
Amit Godbole
Working for Customer Success @ Marvell | AI Infrastructure Enthusiast | Network Switching Systems Solutions
Part 1 of this series discussed why Ethernet is considered a lossy technology and introduced InfiniBand, a network fabric designed for lossless performance. It highlighted InfiniBand's key features, such as subnet management, QoS support through virtual lanes, credit-based flow control, data integrity with dual CRCs, and the use of Queue Pairs with RDMA, which make it well suited to high-performance computing (HPC) applications.
RDMA and Ethernet
Ethernet has a broad vendor ecosystem with many suppliers, making it easier to find compatible hardware and software. To achieve InfiniBand-like high performance and low latency on Ethernet networks, the InfiniBand Trade Association (IBTA) introduced RDMA over Converged Ethernet (RoCE), which adapts InfiniBand's Layer 4 RDMA transport for use over Ethernet.
RoCE
By replacing InfiniBand's fabric-specific Layer 2 with Ethernet, RoCE provides the same RDMA service and semantics over Ethernet networks: InfiniBand transport packets are carried directly inside Ethernet frames.
Applications written against OFA verbs run unchanged over an Ethernet fabric with RoCE. Each RoCE frame carries an Ethernet Layer 2 header followed by the InfiniBand network and transport headers, and the fabric forms a single Ethernet Layer 2 domain.
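As an illustrative sketch (not taken from any particular driver), the RoCE v1 frame layout and its identifying EtherType can be pictured as follows; 0x8915 is the EtherType assigned to RoCE, and everything after the Ethernet header is InfiniBand as-is:

```c
#include <stdio.h>

/* RoCE v1 frame layout (single Layer 2 domain, not IP-routable):
 *
 *  | Ethernet L2 header (EtherType 0x8915) | IB GRH | IB BTH | payload | ICRC | FCS |
 *
 * The InfiniBand Global Route Header (GRH) and Base Transport Header (BTH)
 * take the place of the IP/UDP layers used by ordinary Ethernet traffic.
 */
#define ROCE_V1_ETHERTYPE 0x8915

int main(void)
{
    printf("RoCE v1 EtherType: 0x%04x\n", ROCE_V1_ETHERTYPE);
    return 0;
}
```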
Challenges in RoCE:
RoCE (RoCEv1) only supports RDMA transmission over Layer 2 Ethernet, which limits network scale. For RDMA within a single rack, a Layer 2 protocol like RoCEv1 is fine; to cross racks, however, traffic must pass through a Layer 3 IP router, which RoCEv1 does not support. That is where the need for an IP-routable RoCE arises.
RoCEv2:
Customers demanded RDMA across racks, with a focus on data center networks. RoCEv2 encapsulates InfiniBand transport packets within UDP, IP, and Ethernet headers, using UDP destination port 4791 to identify RoCE packets.
RoCEv2 enables the RoCE protocol to be routed over Layer 3, removing the scale limitation of the original RoCE.
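A minimal sketch of the RoCEv2 encapsulation follows; the UDP destination port 4791 is the value mentioned above, while the small classifier helper is purely illustrative and not taken from any library:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* RoCEv2 packet layout (IP-routable):
 *
 *  | Ethernet L2 | IPv4/IPv6 | UDP (dst port 4791) | IB BTH | payload | ICRC | FCS |
 *
 * The InfiniBand GRH of RoCE v1 is replaced by ordinary IP/UDP headers,
 * which is what allows RoCEv2 traffic to cross Layer 3 routers.
 */
#define ROCE_V2_UDP_DPORT 4791

/* Hypothetical classifier: does this UDP destination port carry RoCEv2? */
static bool is_rocev2_port(uint16_t udp_dst_port)
{
    return udp_dst_port == ROCE_V2_UDP_DPORT;
}

int main(void)
{
    printf("Port 4791 is RoCEv2: %s\n", is_rocev2_port(4791) ? "yes" : "no");
    return 0;
}
```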
Common InfiniBand Transport Protocol:
InfiniBand, RoCE, and RoCEv2 all share the same InfiniBand transport layer. Its defining features are RDMA and Queue Pairs, which deliver low latency and high throughput through kernel bypass and zero-copy operations at the application level, resulting in efficient CPU usage and excellent scalability.
OpenFabrics Alliance (OFA) Verbs & OFA Stacks:
The RDMA OpenFabrics Alliance (OFA) Verbs API and the RDMA OFA Software Stack are important for implementing RDMA in high-performance computing.
The Verbs API provides a low-level interface for RDMA operations. The API includes functions for managing RDMA resources like queue pairs, completion queues, memory regions, and protection domains and supports operations like RDMA read, write, and send/receive.
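As a hedged sketch of what this looks like in practice with libibverbs (error handling and teardown trimmed, and a device context ctx assumed to be already opened; device discovery is sketched further below), the resources named above map directly onto Verbs calls:

```c
#include <stdlib.h>
#include <infiniband/verbs.h>

/* Illustrative sketch only: error handling omitted, `ctx` assumed opened. */
static void create_rdma_resources(struct ibv_context *ctx)
{
    /* Protection domain: groups resources that are allowed to work together. */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Completion queue: where finished work requests are reported. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16 /* depth */, NULL, NULL, 0);

    /* Memory region: registers a buffer so the NIC can DMA to/from it and
     * remote peers can target it with RDMA read/write. */
    size_t len = 4096;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);

    /* Queue pair: a send queue plus a receive queue; a reliable connected
     * (RC) QP is the type used for RDMA read and write operations. */
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,
        .cap     = { .max_send_wr = 16, .max_recv_wr = 16,
                     .max_send_sge = 1, .max_recv_sge = 1 },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);

    (void)mr; (void)qp; /* teardown (ibv_destroy_qp, etc.) omitted for brevity */
}
```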
The OFA software stack offers a single cross-platform framework supporting RDMA and kernel bypass. It includes drivers, libraries, and utilities that support various RDMA-capable network fabrics like InfiniBand, iWARP, and RoCE. Using the stack, applications can run across various RDMA-capable fabrics, directly accessing network hardware for low-latency, high-throughput data transfers.
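A small sketch of that fabric-agnostic behavior: the libibverbs enumeration code below runs unchanged whether the devices it finds are InfiniBand HCAs, RoCE NICs, or iWARP NICs (error handling kept minimal).

```c
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    /* Enumerate all RDMA-capable devices; the API does not care which
     * fabric type sits underneath. */
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs)
        return 1;

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (!ctx)
            continue;
        printf("found RDMA device: %s\n", ibv_get_device_name(devs[i]));
        ibv_close_device(ctx);
    }

    ibv_free_device_list(devs);
    return 0;
}
```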
Data Plane Development Kit (DPDK) with RDMA:
DPDK is a collection of libraries and drivers for fast packet processing in user space, enabling applications to bypass the kernel and interact directly with network hardware. DPDK can be used alongside the Verbs API to enhance RDMA performance: DPDK handles packet processing while the Verbs API handles RDMA operations. The urdma project, a user-space RDMA implementation built on DPDK, is an example of this integration, delivering low-latency communication and high data transfer rates.
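A minimal, hedged sketch of the DPDK side of such an integration: initializing the Environment Abstraction Layer (EAL) and counting the Ethernet ports available for user-space packet I/O. Port configuration, the RX/TX polling loop, and the verbs-based RDMA path are all omitted here.

```c
#include <stdio.h>
#include <rte_eal.h>
#include <rte_ethdev.h>

int main(int argc, char **argv)
{
    /* Initialize the DPDK Environment Abstraction Layer: hugepage memory,
     * lcores, and the poll-mode drivers bound to the NICs. */
    int ret = rte_eal_init(argc, argv);
    if (ret < 0) {
        fprintf(stderr, "EAL initialization failed\n");
        return 1;
    }

    /* Ports DPDK can drive directly from user space, bypassing the kernel. */
    uint16_t nb_ports = rte_eth_dev_count_avail();
    printf("DPDK sees %u usable Ethernet port(s)\n", nb_ports);

    /* Port setup, mempool creation, and the RX/TX loop (plus the
     * verbs-based RDMA path) would follow here. */
    rte_eal_cleanup();
    return 0;
}
```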
Lossless Ethernet
Data Center Bridging
The IEEE 802.1 Data Center Bridging Task Group developed DCB to improve data center network performance and reliability. DCB enhances traditional Ethernet by making it lossless, and its components work together to create an efficient network environment for data centers.
Priority Flow Control (PFC) provides link-level flow control for individual priority levels, preventing packet loss due to congestion. It works in conjunction with Enhanced Transmission Selection (ETS), which allocates bandwidth among different traffic classes to ensure a fair share of the bandwidth for each class. Explicit Congestion Notification (ECN) marks packets instead of dropping them when congestion is detected, allowing the sender to reduce its transmission rate. ECN collaborates with Data Center Quantized Congestion Notification (DCQCN), an end-to-end congestion control scheme for RDMA over Converged Ethernet (RoCEv2). DCQCN combines ECN and PFC to notify senders about congestion and ensure lossless transport, optimizing the network for high throughput and low-latency applications.
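To make the DCQCN idea concrete, here is a hedged sketch of the sender-side reaction point as described in the published DCQCN algorithm: when a congestion notification packet (CNP, generated from ECN marks) arrives, the sender cuts its current rate in proportion to an EWMA estimate alpha of the congestion level, and recovers the rate when the congestion subsides. The structure, names, and parameter value below are illustrative, not taken from any particular NIC firmware.

```c
/* Illustrative DCQCN reaction-point state (per RoCEv2 flow). */
struct dcqcn_flow {
    double rate_current;  /* RC: current sending rate     */
    double rate_target;   /* RT: target rate for recovery */
    double alpha;         /* EWMA estimate of congestion  */
};

#define DCQCN_G 0.0625    /* illustrative EWMA gain */

/* Called when a CNP arrives for this flow (the receiver saw ECN marks). */
static void dcqcn_on_cnp(struct dcqcn_flow *f)
{
    f->rate_target  = f->rate_current;                          /* remember where we were */
    f->rate_current = f->rate_current * (1.0 - f->alpha / 2.0); /* multiplicative cut     */
    f->alpha        = (1.0 - DCQCN_G) * f->alpha + DCQCN_G;     /* congestion increased   */
}

/* Called periodically while no CNPs arrive: the congestion estimate decays
 * and the rate recovers back toward the target (fast recovery). */
static void dcqcn_on_quiet_period(struct dcqcn_flow *f)
{
    f->alpha        = (1.0 - DCQCN_G) * f->alpha;
    f->rate_current = (f->rate_current + f->rate_target) / 2.0;
}
```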
Data Center Bridging Exchange (DCBX) is a protocol that exchanges configuration parameters between devices to ensure consistent Quality of Service (QoS) settings. DCBX ensures that all devices are configured correctly to support PFC, ETS, and other features.
InfiniBand vs Data Center Bridging with Ethernet:
Data Center Bridging (DCB) and InfiniBand are both technologies that aim to improve network performance and reliability. However, they achieve these goals through different mechanisms.
Subnet Management: InfiniBand uses a Subnet Manager to configure and manage the network, ensuring optimal routing and resource allocation. Ethernet, on the other hand, typically relies on protocols such as Spanning Tree Protocol (STP) or Shortest Path Bridging (SPB) to manage network topology and ensure loop-free paths.
Quality of Service (QoS): InfiniBand implements QoS using Virtual Lanes (VLs), separate logical channels within a physical link, each of which can have its own QoS settings. DCB enhances Ethernet with features like Priority-based Flow Control (PFC), Enhanced Transmission Selection (ETS), and Data Center Bridging Exchange (DCBX) to allocate bandwidth among traffic classes and ensure efficient use of the available bandwidth.
Flow Control: InfiniBand employs Credit-Based Flow Control to manage buffer space and prevent packet loss. Receivers grant credits to senders, indicating available buffer space. DCB, on the other hand, uses Priority-based Flow Control (PFC) to provide link-level flow control for individual priority levels, preventing packet loss due to congestion.
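The credit-based side of that contrast can be sketched in a few lines: the sender transmits only while it holds credits previously granted by the receiver, so the receive buffer can never be overrun. The structure and names below are purely illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative credit-based link state: credits represent free receive
 * buffers granted by the receiver; the sender consumes one per packet. */
struct ib_link {
    uint32_t credits;
};

/* Sender side: transmit only if a credit is available, so no packet is
 * ever dropped for lack of buffer space. */
static bool try_send_packet(struct ib_link *link)
{
    if (link->credits == 0)
        return false;          /* stall instead of dropping */
    link->credits--;
    /* ... put the packet on the wire ... */
    return true;
}

/* Receiver side: as buffers are freed, return credits to the sender. */
static void grant_credits(struct ib_link *link, uint32_t freed_buffers)
{
    link->credits += freed_buffers;
}
```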
Congestion Management: Both InfiniBand and DCB utilize Explicit Congestion Notification (ECN) for congestion management. However, DCB also includes Data Center Quantized Congestion Notification (DCQCN) for end-to-end congestion management and lossless transport.
DCB and InfiniBand share the common goal of enhancing network performance, reliability, and efficiency. InfiniBand uses a Subnet Manager, Virtual Lanes, Credit-Based Flow Control, and ECN, while DCB relies on DCBX, ETS, PFC, ECN, and DCQCN.
Conclusion:
In conclusion, the journey of Ethernet from a lossy to a lossless network has been transformative, particularly with the integration of Data Center Bridging (DCB) and RoCEv2. These advancements have enabled Ethernet to achieve InfiniBand-like high performance and low latency, making it a viable option for modern data centers. By leveraging DCB’s features such as Priority-based Flow Control (PFC), Enhanced Transmission Selection (ETS), and Data Center Bridging Exchange (DCBX), along with RoCEv2’s capabilities, Ethernet networks can now support high-throughput, low-latency applications with improved reliability and efficiency. This evolution not only simplifies network infrastructure but also enhances the overall performance and scalability of data center environments, paving the way for future innovations in network technology.