How CXL Technology Addresses Data Center Memory Challenges
Abstract
Compute Express Link (CXL), a revolutionary device interconnect standard, has emerged as the industry's answer to the memory bottleneck. It enables not only memory capacity and bandwidth expansion but also heterogeneous connectivity and the pooling and disaggregation of data center resources. In data centers, CXL interconnects diverse compute and memory resources, addressing memory challenges while improving system performance and efficiency.
The Emergence of CXL Technology
The rapid development of applications such as cloud computing, big data analytics, artificial intelligence, and machine learning has led to explosive growth in data center storage and processing requirements. Traditional DDR memory interfaces scale poorly in total bandwidth and, in particular, in bandwidth and capacity per core. CXL, a new memory interface technology, has emerged to address these limitations, particularly in data centers.
In data centers, CPUs are tightly coupled to their memory, with each CPU generation adopting the latest memory technology to gain capacity and bandwidth. Since 2012, CPU core counts have risen rapidly, but memory bandwidth and capacity per core have not kept pace; they have actually declined. This trend is expected to continue, with memory capacity growing faster than memory bandwidth, and it has a significant impact on system performance.
Additionally, the large gap in latency and cost between direct-attached DRAM and SSDs often leads to low utilization of expensive memory resources. A mismatch between provisioned compute and memory resources easily results in stranded memory: memory that sits idle and cannot be effectively utilized or allocated. For one of the world's most capital-intensive industries, such low utilization is a significant burden. Microsoft has stated that 50% of its total server cost comes from DRAM, and yet 25% of that DRAM is still wasted. The following figure shows a similar pattern in statistics from Meta: the share of memory in overall system cost keeps increasing, indicating that memory has become a major cost driver, even surpassing the CPU. A CXL memory resource pool addresses this issue. By dynamically allocating memory resources within the system, it becomes possible to optimize the compute-to-memory ratio and thereby reduce the Total Cost of Ownership (TCO).
[Figure: memory cost as a growing share of overall system cost at Meta]
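To see why this matters for TCO, consider a rough back-of-the-envelope calculation using the Microsoft figures above. The sketch below is only illustrative; the fraction of stranded DRAM that pooling actually recovers is an assumption, not a reported number.

```python
# Back-of-the-envelope TCO impact of reclaiming stranded DRAM.
# Uses the figures quoted above (DRAM ~50% of server cost, ~25% of DRAM
# stranded); the recovery fraction is an illustrative assumption.
dram_share_of_server_cost = 0.50
stranded_fraction = 0.25
recovered_by_pooling = 0.5   # assume pooling reclaims half of the stranded DRAM

dram_saved = stranded_fraction * recovered_by_pooling        # 12.5% less DRAM needed
server_cost_saved = dram_share_of_server_cost * dram_saved   # ~6% of server cost
print(f"DRAM reduction: {dram_saved:.1%}, server cost reduction: {server_cost_saved:.1%}")
```

Even under these conservative assumptions, the savings land in the same ballpark as the pooling benefits Microsoft reports later in this article.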
In response to the limitations of traditional memory, the industry has been exploring new memory interface technologies and system architectures.
Peripheral Component Interconnect Express (PCIe) is the natural foundation for a new memory interface. PCIe is a serial bus, and from a performance and software perspective it incurs relatively high communication overhead between devices. The good news is that its data rate is now scaling rapidly: PCIe 7.0 is targeted for finalization in 2025 at 128 GT/s, roughly 256 GB/s per direction on a x16 link, only about eight years after PCIe 4.0 introduced 16 GT/s. The main driving force behind this accelerated cadence is demand from cloud computing; in the past, PCIe took three to four years, and in one stretch seven years, to double its data transfer rate.
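As a sanity check on those headline numbers, raw link bandwidth can be estimated from the transfer rate and lane count. The short Python sketch below ignores encoding and protocol overhead, so it is an upper bound rather than achievable throughput.

```python
# Back-of-the-envelope PCIe raw bandwidth estimate.
# Ignores encoding/FLIT and protocol overhead, so real throughput is lower.

def raw_bandwidth_gb_per_s(rate_gt_per_s: float, lanes: int) -> float:
    """Each transfer carries 1 bit per lane; divide by 8 to get bytes."""
    return rate_gt_per_s * lanes / 8

for gen, rate in [("PCIe 4.0", 16), ("PCIe 5.0", 32), ("PCIe 6.0", 64), ("PCIe 7.0", 128)]:
    print(f"{gen}: x16 ~ {raw_bandwidth_gb_per_s(rate, 16):.0f} GB/s per direction")
```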
System architecture has also gone through several generations of evolution. Early attempts pooled resources across servers using Remote Direct Memory Access (RDMA) over general-purpose Ethernet or InfiniBand. However, these transports typically have far higher latency (on the order of microseconds for RDMA, versus roughly a hundred nanoseconds for local memory) and lower bandwidth, and they lack critical features such as cache coherence.
In 2016, Cache Coherent Interconnect for Accelerators (CCIX) emerged as a potential industry standard. The driving factors behind CCIX were the need for faster interconnects than were then available and the requirement for cache coherence to enable faster memory access in heterogeneous multiprocessor systems. The biggest advantage of the CCIX specification is that it builds on the PCI Express specification, but it was never widely adopted due to a lack of critical industry support.
CXL, on the other hand, leverages the existing PCIe 5.0 physical and electrical layers and their ecosystem to provide cache coherence and low latency for memory load/store transactions. By establishing an industry-standard protocol backed by most of the major players, CXL enables the transition toward heterogeneous computing and has gained widespread industry support. CXL 1.1 has been supported in shipping CPUs since AMD's Genoa (late 2022) and Intel's Sapphire Rapids (early 2023). CXL has since become one of the most promising technologies in both industry and academia for addressing these challenges.
CXL is built on the PCIe physical layer and inherits its physical and electrical interface characteristics, providing high bandwidth and scalability. Compared with traditional PCIe interconnects, CXL additionally offers lower latency and a unique set of new features that let CPUs communicate with peripheral devices (such as memory expanders, and accelerators with their attached memory) in a cache-coherent manner using load/store transactions. This keeps the CPU's memory space coherent with memory on add-in devices, allowing resources to be shared for higher performance while reducing software-stack complexity. Memory expansion is one of CXL's main target scenarios.
CXL Technical Principles
CXL actually comprises three protocols, and not all of them are a cure for latency. CXL.io (which runs over the PCIe physical layer) still exhibits latency similar to previous PCIe generations. The other two protocols, CXL.cache and CXL.mem, take faster paths through the stack and reduce latency. Most CXL memory controllers add around 100 to 200 nanoseconds of latency, and each retimer can add a few tens of nanoseconds, depending on the distance between the device and the CPU.
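To put those numbers in perspective, here is a minimal latency-budget sketch for a load from CXL-attached memory. The local DRAM figure (~100 ns) and the controller/retimer costs are assumptions drawn from the ranges above, not measurements.

```python
# Rough load-to-use latency budget for CXL-attached memory
# (all figures are illustrative assumptions, not measurements).

LOCAL_DRAM_NS = 100              # typical direct-attached DRAM access
CXL_CONTROLLER_NS = (100, 200)   # extra latency added by the CXL memory controller
RETIMER_NS = 30                  # rough cost of one retimer hop, if present

def cxl_latency_ns(retimers: int = 0) -> tuple:
    lo = LOCAL_DRAM_NS + CXL_CONTROLLER_NS[0] + retimers * RETIMER_NS
    hi = LOCAL_DRAM_NS + CXL_CONTROLLER_NS[1] + retimers * RETIMER_NS
    return lo, hi

print("local DRAM:      ~%d ns" % LOCAL_DRAM_NS)
print("CXL, no retimer: ~%d-%d ns" % cxl_latency_ns(0))
print("CXL, 1 retimer:  ~%d-%d ns" % cxl_latency_ns(1))
# Roughly 2-3x local DRAM, i.e. in the neighborhood of a cross-socket NUMA hop.
```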
CXL uses protocol multiplexing at the PCIe PHY layer. The CXL 1.0/1.1 specifications define three protocols: CXL.io, CXL.cache, and CXL.mem, and most CXL devices use a combination of them. CXL.io carries the same Transaction Layer Packets (TLPs) and Data Link Layer Packets (DLLPs) as PCIe, overlaid on the payload portion of CXL flits. CXL defines strategies for providing the required Quality of Service (QoS) across the different protocol stacks. Multiplexing at the PHY level ensures that the latency-sensitive protocols, CXL.cache and CXL.mem, achieve latencies comparable to native CPU-to-CPU symmetric coherent links. CXL also bounds the pin-to-pin response time for these latency-sensitive protocols so that platform performance is not harmed by large latency differences among devices that implement coherence and memory semantics.
- CXL.io can be considered a similar but enhanced version of standard PCIe.
  - It is the protocol used for initialization, link-up, device discovery and enumeration, and register access. It provides an interface for I/O devices, similar to PCIe Gen 5, and every CXL device must support CXL.io.
- CXL.cache is a protocol that defines the interaction between a host, typically a CPU, and a device such as a CXL memory module or accelerator.
  - It allows a CXL device to coherently access and cache the host CPU's memory, so the device can safely work on its local copy. This can be visualized as a GPU directly caching data from the CPU's memory.
- CXL.mem is the protocol that gives the host processor (typically a CPU) direct access to device-attached memory using load/store commands.
  - It allows the host CPU to coherently access the device's memory as if it were using a dedicated memory-expansion device or the memory on a GPU/accelerator (see the sketch after this list).
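In practice, on current Linux systems a CXL.mem expander is typically exposed to software as a CPU-less NUMA node, so ordinary load/store code can reach it through standard NUMA placement APIs. The sketch below illustrates that idea with ctypes and libnuma; the node number (1) is an assumption for illustration and will differ per system.

```python
# Minimal sketch: placing a buffer on a CXL memory expander that Linux
# exposes as a CPU-less NUMA node, then accessing it with ordinary
# load/store operations. Assumes libnuma is installed and that the
# expander shows up as NUMA node 1 (check `numactl --hardware`).
import ctypes

libnuma = ctypes.CDLL("libnuma.so.1")
libnuma.numa_alloc_onnode.restype = ctypes.c_void_p
libnuma.numa_alloc_onnode.argtypes = [ctypes.c_size_t, ctypes.c_int]
libnuma.numa_free.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

CXL_NODE = 1                 # hypothetical node id of the CXL expander
SIZE = 64 * 1024 * 1024      # 64 MiB buffer

if libnuma.numa_available() < 0:
    raise RuntimeError("NUMA support not available on this system")

ptr = libnuma.numa_alloc_onnode(SIZE, CXL_NODE)
if not ptr:
    raise MemoryError("allocation on CXL node failed")

# Plain CPU loads/stores: no special API is needed once the memory is mapped.
buf = (ctypes.c_ubyte * SIZE).from_address(ptr)
buf[0] = 42
assert buf[0] == 42

libnuma.numa_free(ptr, SIZE)
```

No driver-specific calls are needed here, which is exactly the point of CXL.mem: once the device memory is mapped, the CPU treats it like any other memory.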
CXL 2.0 introduces support for memory pooling and CXL switching, enabling seamless connectivity and communication among many hosts and devices and greatly increasing the number of devices a CXL network can reach. Multiple hosts connect to a switch, which in turn connects to many devices. Where a CXL device is multi-headed, i.e. connected to the root ports of several hosts, pooling can even be achieved without a switch. A Single Logical Device (SLD) presents its memory to one host at a time, while a Multi-Logical Device (MLD) can be partitioned so that multiple hosts share the same physical memory pool.
On a disaggregated memory network, the Fabric Manager is responsible for memory allocation and device orchestration. It acts as the control plane or coordinator, typically residing on a separate chip or on the switch; it does not need high performance because it does not sit on the data path. The Fabric Manager exposes a standard API for controlling and managing the system, enabling fine-grained resource allocation, hot-plug capability, and dynamic scaling without restarts. Combining these capabilities, Microsoft reports that adopting CXL-based memory pooling can potentially reduce overall memory requirements by 10% and consequently lower total server costs by 5%.
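To make the division of labor concrete, here is a toy sketch of the bookkeeping a fabric manager performs: carving an MLD's capacity into slices and binding them to hosts on demand. All class and method names are invented for this illustration; the real Fabric Manager API defined by the CXL specification is a command interface, not a Python class.

```python
# Toy model of fabric-manager bookkeeping for a pooled (MLD) memory device.
# Purely illustrative: names and slice granularity are assumptions.

class PooledMemoryDevice:
    def __init__(self, device_id: str, capacity_gib: int, granularity_gib: int = 16):
        self.device_id = device_id
        self.granularity = granularity_gib
        self.free_slices = capacity_gib // granularity_gib
        self.bindings = {}  # host_id -> number of slices currently bound

    def allocate(self, host_id: str, size_gib: int) -> bool:
        """Bind size_gib of pooled capacity to a host, if available."""
        slices = -(-size_gib // self.granularity)  # round up to whole slices
        if slices > self.free_slices:
            return False
        self.free_slices -= slices
        self.bindings[host_id] = self.bindings.get(host_id, 0) + slices
        return True

    def release(self, host_id: str) -> None:
        """Return all of a host's slices to the pool (e.g. after VM teardown)."""
        self.free_slices += self.bindings.pop(host_id, 0)

# Example: a 1 TiB pooled device shared by two hosts.
pool = PooledMemoryDevice("mld0", capacity_gib=1024)
assert pool.allocate("hostA", 256)
assert pool.allocate("hostB", 128)
pool.release("hostA")  # hostA's memory becomes available to other hosts again
print(pool.free_slices * pool.granularity, "GiB free")  # 896 GiB free
```

The key property this models is that capacity freed by one host is immediately reusable by another, which is precisely how pooling reduces stranded memory.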
CXL Technology Development Trends
CXL is gaining momentum, with major players such as Samsung, SK hynix, Marvell, Rambus, AMD, and others accelerating their development efforts. Hyperscalers, including public cloud providers, are now exploring CXL-based memory pooling to address stranded memory and to add bandwidth and capacity dynamically. However, multi-tier memory scheduling, management, and monitoring technologies for applications that use hybrid local/remote resource pools are still lacking. Cloud service providers committed to adopting CXL-based resource pooling at scale therefore either need to build their own solutions or find suitable system software and hardware vendors. Major cloud providers such as Microsoft and Meta have taken the lead here.
Microsoft's Pond system uses machine learning to predict whether a virtual machine (VM) is latency-sensitive and how much of its memory will go unused (untouched), and uses those predictions to decide whether to place the VM on local memory or on CXL pool memory. Working with a performance monitoring system, it continuously adjusts placements and migrates VMs to optimize system performance.
In the first stage, the VM scheduler uses ML-based predictions to identify latency-sensitive VMs and determines their placement based on the estimated amount of untouched memory.
In the second stage, if a VM's QoS is not satisfied, the monitoring system triggers the Mitigation Manager, which reconfigures the VM (see the sketch below).
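The placement decision itself can be summarized in a few lines of logic. The sketch below is a simplified, hypothetical rendering of such a policy; the thresholds, field names, and pool-fraction cap are invented for illustration, and Pond's actual models and parameters are described in the paper.

```python
# Simplified, hypothetical sketch of a Pond-style placement policy:
# decide how much of a VM's memory to back with CXL pool memory.
from dataclasses import dataclass

@dataclass
class VmPrediction:
    latency_sensitive: bool     # ML model's prediction for this VM
    untouched_fraction: float   # predicted fraction of memory never touched
    memory_gib: int

def plan_placement(pred: VmPrediction, max_pool_fraction: float = 0.5):
    """Return (local_gib, pool_gib) for the VM's memory allocation."""
    if pred.latency_sensitive:
        # Latency-sensitive VMs get only direct-attached DRAM.
        return pred.memory_gib, 0
    # Otherwise, back up to the predicted-untouched share (capped) with pool memory.
    pool_share = min(pred.untouched_fraction, max_pool_fraction)
    pool_gib = int(pred.memory_gib * pool_share)
    return pred.memory_gib - pool_gib, pool_gib

def mitigate(vm_id: str):
    """Triggered when monitoring detects a QoS violation: fall back to local DRAM."""
    print(f"{vm_id}: migrating pool-backed pages back to local memory")

# Example decision for one VM, then a simulated QoS-triggered mitigation.
local, pool = plan_placement(VmPrediction(False, 0.3, memory_gib=64))
print(f"local={local} GiB, pool={pool} GiB")   # local=45 GiB, pool=19 GiB
mitigate("vm-42")
```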
As a leader in smart computing center networks, Ruijie Networks is committed to providing customers with innovative products and solutions that drive industry growth and innovation, and keeping customers informed about the latest technological trends shaping the future. Ruijie Networks will continue to innovate and lead the trend in the smart computing era.
References:
TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory. https://arxiv.org/abs/2206.02878
Pond: CXL-Based Memory Pooling Systems for Cloud Platforms. https://arxiv.org/abs/2203.00241
Demystifying CXL Memory with Genuine CXL-Ready Systems and Devices. https://arxiv.org/abs/2303.15375
Compute Express Link (CXL) Specification 3.0 whitepaper. https://www.computeexpresslink.org/
Design and Analysis of CXL Performance Models for Tightly-Coupled Heterogeneous Computing. https://dl.acm.org/doi/abs/10.1145/3529336.3530817
Memory Disaggregation: Advances and Open Challenges. https://arxiv.org/abs/2305.03943
Network Requirements for Resource Disaggregation. https://www.usenix.org/system/files/conference/osdi16/osdi16-gao.pdf
A Case for CXL-Centric Server Processors. https://arxiv.org/abs/2305.05033