The Balancing Act: Compute, Memory & Network
The performance of applications is heavily influenced by compute cores, memory, and network interfaces. Striking the right balance between these components is crucial for achieving performance, scalability, and cost goals. This article offers some guidance on key trends and factors in the selection process.
Compute Core Counts
Historically, beefier CPUs with larger memory footprints were the norm due to monolithic application architectures. With the rise of cloud-native, distributed application architectures, higher core counts with smaller memory footprints per core have become more popular. The upcoming 288-core Intel Sierra Forest and the 160-core/320-thread AMD Turin point to trends in scaling microservices and software parallelism to massive levels.
Higher core counts offer greater parallelism and density: more microservice instances, containers, or VMs can be packed per server, improving consolidation for cloud-native workloads.
GPUs are driving the trend towards parallel compute, not just for AI/ML but for all workloads. CPU ISAs have also evolved to support parallel computing (e.g., Intel AVX and the Arm Scalable Vector Extensions used in NVIDIA's Grace CPU).
Parallel computing can provide orders of magnitude performance improvement for several applications, such as databases and 5G networks. Nvidia CEO Jensen Huang calls this Accelerated Computing, a key long-term trend that we must fully embrace. The Grace Hopper Superchip showcases the growing integration of datacenter GPUs and CPUs. The trend towards GPU/CPU Fusion will likely strengthen in the near future.
Key Insight: Parallelizing workloads using the combination of massive-core CPUs and powerful GPUs [GPU-CPU Fusion] is likely the biggest driver of business value in the next decade.
Memory Capacity and Bandwidth
Sufficient memory capacity and bandwidth are crucial for ensuring that applications can handle large datasets and multiple concurrent connections. As the number of CPU cores increases, so do demands on memory size and bandwidth to keep all cores fed with data.
Importance of Memory Capacity:
Working Sets: Sufficient memory capacity lets applications keep large working sets, caches, and many concurrent sessions in memory rather than spilling to slower storage.
Importance of Memory Bandwidth:
Data Feeding: Memory bandwidth determines how quickly data can be fed to the CPU cores. High bandwidth is essential to prevent CPU cores from stalling due to memory access delays. The classic case is AI/ML training workloads: GPUs carry increasing amounts of HBM (high-bandwidth memory), yet memory bandwidth still falls short of fully utilizing the compute (TFLOPS) capacity of the GPU (see the back-of-the-envelope sketch after this list).
Parallel Processing: As the number of CPU/GPU cores increases, the demand for memory bandwidth also increases in order to support efficient parallel processing across all cores.
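As a rough way to reason about whether cores will stall on memory, you can compare a workload's arithmetic intensity against the machine's compute-to-bandwidth ratio. The sketch below is a back-of-the-envelope, roofline-style check; the hardware numbers and the bound_by helper are illustrative assumptions, not measurements of any specific part.

```python
# Back-of-the-envelope roofline check: is a workload likely limited by
# memory bandwidth or by compute? All hardware numbers are illustrative.

def bound_by(peak_gflops: float, mem_bw_gbs: float, arithmetic_intensity: float) -> str:
    """arithmetic_intensity = FLOPs performed per byte moved from memory."""
    machine_balance = peak_gflops / mem_bw_gbs  # FLOPs the cores can do per byte delivered
    return "memory-bandwidth-bound" if arithmetic_intensity < machine_balance else "compute-bound"

# Hypothetical 64-core server: 3,000 GFLOPS peak compute, 300 GB/s memory bandwidth.
# A streaming analytics kernel touching each byte only a few times (~2 FLOPs/byte)
# stalls on memory; a dense matrix multiply (~50 FLOPs/byte) keeps the cores busy.
print(bound_by(3000, 300, 2))    # -> memory-bandwidth-bound
print(bound_by(3000, 300, 50))   # -> compute-bound
```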
Data is transferred directly between GPU memories today with GPUDirect. Without such direct data transfers, the CPU becomes a single choke point. This direct-data-transfer paradigm, typically built on RDMA technology, will continue to extend to other devices [Video Processing Units, Storage Devices]. Related technologies include CXL and DevMemTCP.
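As a concrete (if simplified) illustration of direct GPU-to-GPU data movement, here is a minimal sketch using PyTorch's NCCL backend, which takes GPUDirect P2P/RDMA paths when the hardware, drivers, and fabric support them. The script name, two-GPU launch command, and buffer size are assumptions for illustration.

```python
# Minimal GPU-to-GPU transfer sketch using PyTorch's NCCL backend.
# Assumed launch: torchrun --nproc_per_node=2 gpu_direct_demo.py
import os

import torch
import torch.distributed as dist


def main() -> None:
    dist.init_process_group(backend="nccl")      # reads rank/world size from torchrun's env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # ~4 MB buffer resident in this GPU's memory, filled with the rank number.
    buf = torch.full((1 << 20,), float(dist.get_rank()), device="cuda")

    # all_reduce exchanges data GPU-to-GPU; NCCL uses P2P/GPUDirect RDMA paths
    # when available, otherwise it falls back to copies through host memory.
    dist.all_reduce(buf, op=dist.ReduceOp.SUM)

    print(f"rank {dist.get_rank()}: first element after reduce = {buf[0].item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Whether the transfer actually bypasses the CPU depends on the topology (NVLink, PCIe P2P, or RDMA-capable NICs), which is exactly the hardware balance this article is about.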
Key Insight: Many traditional applications will be refactored to leverage the parallel compute power of GPUs to improve application performance.
Balancing Core Count and Memory Bandwidth
To balance core count and memory bandwidth for high-traffic applications, compare the per-core memory bandwidth (total memory bandwidth divided by core count) and per-core memory capacity of candidate parts against the working-set size and data rates of the workload.
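One quick comparison is the theoretical memory bandwidth per core of candidate parts. The sketch below uses hypothetical channel counts and DDR5 speeds, not vendor specifications.

```python
def per_core_bandwidth(channels: int, mt_per_s: int, cores: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical memory bandwidth per core in GB/s (a DDR channel is 8 bytes wide)."""
    total_gbs = channels * mt_per_s * bytes_per_transfer / 1000  # MT/s x bytes -> MB/s -> GB/s
    return total_gbs / cores

# Hypothetical comparison: a 96-core part with 12 x DDR5-4800 channels vs.
# a 128-core part with 8 x DDR5-5600 channels.
print(round(per_core_bandwidth(12, 4800, 96), 1))   # -> 4.8 GB/s per core
print(round(per_core_bandwidth(8, 5600, 128), 1))   # -> 2.8 GB/s per core
```

The higher core count is only a win if each core still gets enough bandwidth for its working set.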
Network Interfaces
Network interfaces typically use Ethernet technology.
High-speed network interfaces (e.g., 100GbE) are essential for fast data transfer rates and low latency.
Factors to Consider
The network is indeed becoming central to the datacenter, transferring data directly to devices such as GPUs/TPUs/VPUs (also called XPUs) and storage. Direct, zero-copy data transfers among compute/memory/storage elements are the critical network transformation needed to enable "parallel" software across the datacenter and the edge.
This direct data transfer across devices requires a control-plane "switch" sub-system in the network adapter or IPU. The data plane for this switch is currently the PCIe switch. The current solutions are less than ideal and do not scale. It will be interesting to see how this data-switch-NIC combo hardware and software evolves for future high-performance systems.
Key Insight: Beyond 100G, evaluate hardware offloads in the network adapter to ensure high performance for your applications.
Differences Between Cloud VM and On-Premises Enterprise Deployments
While the general principles of resource allocation apply to both cloud and on-premises environments, there are some key differences to consider:
Increasingly, enterprises are moving towards a hybrid cloud, with on-prem infrastructure, a private cloud, and multiple public clouds. This model is maturing and is driving enterprises towards cloud-native applications, with legacy applications hosted on the on-prem infrastructure.
Rules of Thumb
While actual resource requirements will vary widely based on applications and workload characteristics, here are some rough rules of thumb for memory size and network I/O requirements per CPU core (or vCPU in cloud instances):
Memory Size (GB/Core):
Network I/O (Gbps/Core):
The cloud rules of thumb reflect the general trend of cloud VMs running multi-tenant distributed microservices-oriented applications, which typically require higher networking bandwidths per core.
In contrast, enterprise on-premises environments may have a mix of traditional monolithic applications and modern cloud-native applications, with lower current network I/O needs per core, but gradually increasing as enterprise applications become more cloud-native.
Private and hybrid cloud infrastructure falls in between the public cloud and on-prem enterprise in terms of these rules of thumb.
Note that enterprises and private clouds can use different optimization points - for example, an enterprise can design on-prem infrastructure for the median network load, whereas the public cloud is typically optimized for peak network load and defined cloud SLAs.
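To show how such per-core rules of thumb translate into an instance-level estimate, here is a minimal sizing sketch. The 4 GB/core and 0.5 Gbps/core ratios are placeholders for illustration, not the figures above; substitute your own targets.

```python
def size_instance(cores: int, mem_gb_per_core: float, net_gbps_per_core: float) -> dict:
    """Translate per-core rules of thumb into totals for a candidate instance or server."""
    return {
        "cores": cores,
        "memory_gb": cores * mem_gb_per_core,
        "network_gbps": cores * net_gbps_per_core,
    }

# Placeholder ratios, for illustration only: 4 GB and 0.5 Gbps per core on a 64-core instance.
print(size_instance(cores=64, mem_gb_per_core=4, net_gbps_per_core=0.5))
# -> {'cores': 64, 'memory_gb': 256, 'network_gbps': 32.0}
```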
These rules of thumb don’t apply to GPU-based LLM training and inference workloads. AI/ML training is compute-intensive, memory-intensive, and heavily network-bandwidth-intensive, requiring powerful GPUs alongside CPUs. I’ll cover those rules of thumb in a different article.
Workload Scenario Walk Through
Let's review a few scenarios below.
In general, microservices apps need less CPU and memory per instance but more network I/O than monolithic apps.
Web Server
Web servers in the cloud often leverage auto-scaling and load balancing, allowing for more efficient resource utilization across multiple instances. Each web server may use a small number of cores (8-16). On-premises web servers may need more dedicated headroom to handle potential traffic spikes, since they lack the elastic scaling of the cloud.
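As a hypothetical illustration of the auto-scaling math (the per-core request rate and utilization target below are assumptions, not benchmarks of any particular web server):

```python
import math

def instances_needed(peak_rps: float, rps_per_core: float, cores_per_instance: int,
                     target_utilization: float = 0.6) -> int:
    """Instances required to serve peak traffic while keeping headroom for spikes."""
    usable_rps_per_instance = rps_per_core * cores_per_instance * target_utilization
    return math.ceil(peak_rps / usable_rps_per_instance)

# Hypothetical: 120,000 req/s peak, ~1,500 req/s per core, 16-core instances run at 60% target load.
print(instances_needed(120_000, 1_500, 16))  # -> 9 instances
```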
Popular Web Servers: NGINX, Apache HTTP Server, Tomcat
Database Server
Legacy enterprise databases often have monolithic architectures and large memory requirements, and may benefit from fast network interfaces for data replication and high availability. Both cloud and on-prem enterprise databases may use distributed architectures, requiring higher network bandwidth for data replication and synchronization.
In-memory databases typically require much higher memory capacity, and even higher network bandwidth.
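A rough way to size memory for an in-memory database cluster is raw data multiplied by structure overhead, replication, and growth headroom; the factors below are assumptions for illustration, not measurements of any specific product.

```python
def in_memory_footprint_gb(dataset_gb: float, structure_overhead: float = 1.5,
                           replicas: int = 2, headroom: float = 1.3) -> float:
    """Cluster-wide RAM needed: raw data x index/object overhead x data copies x growth headroom."""
    return dataset_gb * structure_overhead * replicas * headroom

# Hypothetical 200 GB working set kept as two copies (primary + one replica).
print(in_memory_footprint_gb(200))  # -> 780.0 GB of RAM across the cluster
```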
Popular Databases:
Streaming Video Delivery
Cloud-based video streaming can leverage content delivery networks (CDNs) and edge caching, reducing the need for high-performance instances; this workload is perhaps best deployed in the cloud. On-premises video streaming may require more powerful instances to handle transcoding and delivery without the benefits of CDNs.
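A simple egress estimate shows why CDN offload matters so much; the viewer count, average bitrate, and offload ratio below are illustrative assumptions.

```python
def origin_egress_gbps(concurrent_viewers: int, avg_bitrate_mbps: float,
                       cdn_offload: float = 0.95) -> float:
    """Bandwidth the origin must still serve after the CDN absorbs its share of traffic."""
    total_gbps = concurrent_viewers * avg_bitrate_mbps / 1000
    return total_gbps * (1 - cdn_offload)

# Hypothetical: 500,000 concurrent viewers at 5 Mbps average, with 95% served from CDN caches.
print(origin_egress_gbps(500_000, 5))  # -> 125.0 Gbps at the origin (2,500 Gbps total to viewers)
```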
Popular Streaming Video Apps:
CDNs
CDN edge caching servers and origin storage can both use NVMe over RDMA for low-latency content delivery to end users.
Popular CDNs: Netflix hosts its own CDN for its content; Akamai and Fastly host CDNs for content providers.
High Performance Storage
Both cloud providers and enterprises deploy storage arrays that support NVMe over RDMA. Cloud providers offer specialized "distributed" instances (also called disaggregated storage) with support for the NVMe-over-RDMA protocol. The network bandwidth needs are on the high side, particularly for SSD-based storage.
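A quick sanity check for a disaggregated storage node is whether the NICs or the SSDs saturate first; the device throughput numbers below are illustrative, not vendor specs.

```python
def storage_node_bottleneck(ssd_count: int, ssd_read_gbs: float, nic_gbps: float) -> str:
    """Compare aggregate SSD read throughput against NIC line rate (both converted to GB/s)."""
    ssd_total_gbs = ssd_count * ssd_read_gbs
    nic_gbs = nic_gbps / 8  # bits -> bytes
    return "network-limited" if nic_gbs < ssd_total_gbs else "SSD-limited"

# Hypothetical node: 10 NVMe SSDs at ~7 GB/s sequential read each, behind 2 x 100GbE.
print(storage_node_bottleneck(ssd_count=10, ssd_read_gbs=7, nic_gbps=200))  # -> network-limited
```

This is why NVMe-over-RDMA storage nodes tend to pull network bandwidth requirements up faster than compute does.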
All major cloud service providers (CSPs) host their own cloud storage with multiple tiers and options - Amazon AWS EBS (Elastic Block Store), Google GCP Persistent Disk, and Microsoft Azure Premium SSD Managed Disks. Enterprise firms like Pure Storage, NetApp, and Dell EMC provide high-performance enterprise storage arrays.
LLM Training
AI/ML training involves massive datasets and requires HBM (high-bandwidth memory) and substantial network bandwidth to efficiently distribute training data across multiple GPUs and servers. Recent data suggests that about 20% of AI/ML systems are sold into the enterprise, with the rest going into the cloud.
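To see why HBM capacity and interconnect bandwidth dominate LLM training, here is a commonly used rough estimate of per-parameter training state for mixed-precision training with an Adam-style optimizer (an approximation that ignores activations and framework overhead).

```python
def training_memory_gb(params_billion: float) -> float:
    """Approximate training state for mixed precision + Adam:
    2 B weights + 2 B gradients + 12 B optimizer/master-weight state ~= 16 bytes per parameter."""
    return params_billion * 1e9 * 16 / 1e9

# A hypothetical 70B-parameter model needs on the order of 1.1 TB of training state,
# far beyond a single GPU's HBM, so the model is sharded across GPUs and gradients
# are exchanged over the network at every step.
print(round(training_memory_gb(70)))  # -> 1120 GB
```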
LLM Inference
AI/ML inference needs lower memory and network bandwidth compared to training but still requires substantial resources for real-time data processing and model serving. Inference for non-LLM applications, for example imaging-based defect detection on the manufacturing floor, can be accomplished on CPUs and does not need GPUs. Inference can happen both in the cloud and at the edge but will likely gravitate to the edge as smaller compound models become more capable. Driving down the inference cost per token is a key driver for growing the adoption of AI/ML technologies.
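Since cost per token is the metric to optimize, here is a minimal cost estimate; the instance price and sustained throughput below are hypothetical.

```python
def cost_per_million_tokens(instance_cost_per_hour: float, tokens_per_second: float) -> float:
    """Serving cost in dollars per one million generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return instance_cost_per_hour / tokens_per_hour * 1_000_000

# Hypothetical: a $4/hour GPU instance sustaining 2,500 tokens/s across all concurrent requests.
print(round(cost_per_million_tokens(4.0, 2_500), 2))  # -> ~$0.44 per million tokens
```

Batching more requests per GPU raises tokens/s and drives this number down, which is why inference efficiency gets so much attention.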
Real-time Data Processing
Real-time data processing in the cloud can leverage distributed architectures and auto-scaling to handle varying workloads. Latency requirements, such as those for controlling factory machines, can require these workloads to run on-prem. These edge deployments typically use Time-Sensitive Networking (TSN).
Real-time data processing is becoming increasingly important, driving the shift from the cloud to the edge. Autonomous cars, AI agents, automated manufacturing machine control, IoT devices, augmented reality, and online retail are some examples.
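A simple latency-budget sketch illustrates why tight control loops gravitate to the edge; the distances, hop counts, and per-hop delays below are illustrative assumptions, using roughly 5 microseconds per km of propagation in fiber.

```python
def network_rtt_ms(distance_km: float, per_hop_ms: float = 0.05, hops: int = 6) -> float:
    """Round-trip time: propagation in fiber (~5 us/km each way) plus switching/queuing per hop."""
    propagation_ms = 2 * distance_km * 0.005
    return propagation_ms + hops * per_hop_ms

# Hypothetical: a cloud region 1,500 km away vs. an on-prem edge node 1 km away.
print(round(network_rtt_ms(1500), 1))       # -> ~15.3 ms before any processing even starts
print(round(network_rtt_ms(1, hops=2), 2))  # -> ~0.11 ms
```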
Key Insight: Unlike human perception, which operates on timescales of roughly 100 ms, AI agents can respond in microseconds. In an agentic world, we will not have the luxury of a round trip to the cloud. We must rapidly transition to an Edge-First AI architecture.
Summary
This is a general framework for selecting the appropriate CPU, memory, and network configurations for various use cases in both cloud and on-premises enterprise environments. Further, deployments should take into consideration the specific workload characteristics, growth projections, and cost constraints of the environment.
By carefully evaluating these factors, IT professionals can ensure that their infrastructure is optimized for performance, scalability, and cost.
Glossary
Accelerated Computing: A computing model that leverages GPUs alongside CPUs to increase performance and efficiency, particularly for parallel processing tasks such as AI/ML workloads. Introduced by Nvidia CEO Jensen Huang.
AVX (Advanced Vector Extensions): A set of instructions for doing SIMD (Single Instruction, Multiple Data) vector operations on Intel processors.
Bandwidth: The maximum rate of data transfer across a given path.
CDN (Content Delivery Network): A network of servers that deliver web content to users based on their geographic location, the origin of the webpage, and the content delivery server. CDNs improve load times and reduce latency.
CPU (Central Processing Unit): The primary component of a computer that performs most of the processing inside a computer. It executes instructions from programs and processes data.
DPU (Data Processing Unit): A specialized processor designed to handle data-centric tasks such as data movement, security, and offloading network tasks from the CPU.
Ethernet: A family of networking technologies commonly used in local area networks (LAN), metropolitan area networks (MAN), and wide area networks (WAN).
GPU (Graphics Processing Unit): A specialized processor designed to accelerate graphics rendering and parallel processing tasks. GPUs are increasingly used for AI/ML and general-purpose computing tasks, particularly those involving large-scale parallel processing.
HBM (High Bandwidth Memory): A type of high-performance memory packaged close to GPUs/CPUs and other accelerators, offering much higher bandwidth than the traditional DRAM attached to CPUs.
InfiniBand: A high-performance, low-latency networking protocol commonly used in high-performance computing (HPC) environments.
ISA (Instruction Set Architecture): The part of the computer architecture related to programming, including the instruction set, data types, registers, addressing modes, and memory architecture.
LAN (Local Area Network): A network that connects computers within a limited area such as a residence, school, laboratory, or office building.
LLM (Large Language Model): A type of AI model trained on vast amounts of text data to understand and generate human language, such as GPT-3.
Monolithic Application: A software application that is designed as a single, indivisible unit. Traditionally, these applications are large, complex, and difficult to modify or scale.
NVMe (Non-Volatile Memory Express): A high-performance, scalable host controller interface for accessing solid-state drives over PCI Express.
PCIe (Peripheral Component Interconnect Express): A high-speed interface standard for connecting components like graphics cards, SSDs, and network cards to a computer's motherboard.
RDMA (Remote Direct Memory Access): A technology that allows computers in a network to exchange data in main memory without involving the CPU, improving throughput and reducing latency.
Scalability: The ability of a system to handle a growing amount of work, or its potential to accommodate growth.
SLA (Service Level Agreement): A commitment between a service provider and a customer on measurable service levels such as availability, latency, and throughput.
SR-IOV (Single Root I/O Virtualization): A specification that allows the isolation of the PCI Express resources for manageability and performance reasons in virtualized environments.
TSN (Time-Sensitive Networking): A set of IEEE 802.1 standards that allow time-sensitive data streams to be transmitted over Ethernet networks with bounded latency and high reliability.
vCPU (Virtual Central Processing Unit): A virtual abstraction of a physical CPU core that is allocated to virtual machines in cloud environments.
vSwitch (Virtual Switch): A software-based network switch that allows virtual machines (VMs) to communicate with each other within the same host or across different hosts in a virtualized environment.
#CloudComputing #EnterpriseIT #NetworkOptimization #ResourceAllocation #PerformanceTuning #VirtualizationTechnologies #RDMA #NVMe