The Balancing Act: Compute, Memory & Network
The performance of applications is heavily influenced by compute cores, memory, and network interfaces. Striking the right balance between these components is crucial for achieving performance, scalability, and cost goals. This article offers some guidance on key trends and factors in the selection process.
Compute Core Counts
Historically, beefier CPUs with larger memory footprints were the norm due to monolithic application architectures. With the rise of cloud-native, distributed application architectures, higher core counts with smaller memory footprints per core have become more popular. The upcoming 288-core Intel Sierra Forest and the 160-core/320-thread AMD Turin point to trends in scaling microservices and software parallelism to massive levels.
Higher core counts offer greater parallelism and density: more microservice instances, containers, or VMs can be packed per server, improving consolidation for cloud-native workloads.
GPUs are driving the trend towards parallel compute, not just for AI/ML but for all workloads. CPU ISAs have also evolved to support parallel computing (e.g., Intel AVX and the Arm Scalable Vector Extensions used in NVIDIA's Grace CPU).
Parallel computing can provide orders of magnitude performance improvement for several applications, such as databases and 5G networks. Nvidia CEO Jensen Huang calls this Accelerated Computing, a key long-term trend that we must fully embrace. The Grace Hopper Superchip showcases the growing integration of datacenter GPUs and CPUs. The trend towards GPU/CPU Fusion will likely strengthen in the near future.
Key Insight: Parallelizing workloads using the combination of massive-core CPUs and powerful GPUs [GPU-CPU Fusion] is likely the biggest driver of business value in the next decade.
Memory Capacity and Bandwidth
Sufficient memory capacity and bandwidth are crucial for ensuring that applications can handle large datasets and multiple concurrent connections. As the number of CPU cores increases, so do demands on memory size and bandwidth to keep all cores fed with data.
Importance of Memory Capacity:
Working Sets: Sufficient memory capacity lets applications keep large working sets, caches, and many concurrent sessions in memory rather than spilling to slower storage.
Importance of Memory Bandwidth:
Data Feeding: Memory bandwidth determines how quickly data can be fed to the CPU cores. High bandwidth is essential to prevent CPU cores from stalling due to memory access delays. The classic case is AI/ML training workloads: GPUs carry increasing amounts of HBM (high-bandwidth memory), yet memory bandwidth still falls short of fully utilizing the compute (TFLOPS) capacity of the GPU (see the back-of-the-envelope sketch after this list).
Parallel Processing: As the number of CPU/GPU cores increases, the demand for memory bandwidth also increases in order to support efficient parallel processing across all cores.
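As a rough way to reason about whether cores will stall on memory, you can compare a workload's arithmetic intensity against the machine's compute-to-bandwidth ratio. The sketch below is a back-of-the-envelope, roofline-style check; the hardware numbers and the bound_by helper are illustrative assumptions, not measurements of any specific part.

```python
# Back-of-the-envelope roofline check: is a workload likely limited by
# memory bandwidth or by compute? All hardware numbers are illustrative.

def bound_by(peak_gflops: float, mem_bw_gbs: float, arithmetic_intensity: float) -> str:
    """arithmetic_intensity = FLOPs performed per byte moved from memory."""
    machine_balance = peak_gflops / mem_bw_gbs  # FLOPs the cores can do per byte delivered
    return "memory-bandwidth-bound" if arithmetic_intensity < machine_balance else "compute-bound"

# Hypothetical 64-core server: 3,000 GFLOPS peak compute, 300 GB/s memory bandwidth.
# A streaming analytics kernel touching each byte only a few times (~2 FLOPs/byte)
# stalls on memory; a dense matrix multiply (~50 FLOPs/byte) keeps the cores busy.
print(bound_by(3000, 300, 2))    # -> memory-bandwidth-bound
print(bound_by(3000, 300, 50))   # -> compute-bound
```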
Data is transferred directly between GPU memories today with GPUDirect. Without such direct data transfers, the CPU becomes a single choke point. This direct-data-transfer paradigm, typically built on RDMA technology, will continue to extend to other devices [Video Processing Units, Storage Devices]. Related technologies include CXL and DevMemTCP.
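As a concrete (if simplified) illustration of direct GPU-to-GPU data movement, here is a minimal sketch using PyTorch's NCCL backend, which takes GPUDirect P2P/RDMA paths when the hardware, drivers, and fabric support them. The script name, two-GPU launch command, and buffer size are assumptions for illustration.

```python
# Minimal GPU-to-GPU transfer sketch using PyTorch's NCCL backend.
# Assumed launch: torchrun --nproc_per_node=2 gpu_direct_demo.py
import os

import torch
import torch.distributed as dist


def main() -> None:
    dist.init_process_group(backend="nccl")      # reads rank/world size from torchrun's env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # ~4 MB buffer resident in this GPU's memory, filled with the rank number.
    buf = torch.full((1 << 20,), float(dist.get_rank()), device="cuda")

    # all_reduce exchanges data GPU-to-GPU; NCCL uses P2P/GPUDirect RDMA paths
    # when available, otherwise it falls back to copies through host memory.
    dist.all_reduce(buf, op=dist.ReduceOp.SUM)

    print(f"rank {dist.get_rank()}: first element after reduce = {buf[0].item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Whether the transfer actually bypasses the CPU depends on the topology (NVLink, PCIe P2P, or RDMA-capable NICs), which is exactly the hardware balance this article is about.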
Key Insight: Many traditional applications will be refactored to leverage the parallel compute power of GPUs to improve application performance.
Balancing Core Count and Memory Bandwidth
To balance core count and memory bandwidth for high-traffic applications, compare the per-core memory bandwidth (total memory bandwidth divided by core count) and per-core memory capacity of candidate parts against the working-set size and data rates of the workload.
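One quick comparison is the theoretical memory bandwidth per core of candidate parts. The sketch below uses hypothetical channel counts and DDR5 speeds, not vendor specifications.

```python
def per_core_bandwidth(channels: int, mt_per_s: int, cores: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical memory bandwidth per core in GB/s (a DDR channel is 8 bytes wide)."""
    total_gbs = channels * mt_per_s * bytes_per_transfer / 1000  # MT/s x bytes -> MB/s -> GB/s
    return total_gbs / cores

# Hypothetical comparison: a 96-core part with 12 x DDR5-4800 channels vs.
# a 128-core part with 8 x DDR5-5600 channels.
print(round(per_core_bandwidth(12, 4800, 96), 1))   # -> 4.8 GB/s per core
print(round(per_core_bandwidth(8, 5600, 128), 1))   # -> 2.8 GB/s per core
```

The higher core count is only a win if each core still gets enough bandwidth for its working set.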
Network Interfaces
Network interfaces typically use Ethernet technology.
High-speed network interfaces (e.g., 100GbE) are essential for fast data transfer rates and low latency.
Factors to Consider
The network is indeed becoming central to the datacenter, transferring data directly to devices such as GPUs/TPUs/VPUs (also called XPUs) and storage. Direct, zero-copy data transfers among compute/memory/storage elements are the critical network transformation needed to enable "parallel" software across the datacenter and the edge.
This direct data transfer across devices requires a control-plane "switch" sub-system in the network adapter or IPU. The data plane for this switch is currently the PCIe switch. The current solutions are less than ideal and do not scale. It will be interesting to see how this data-switch-NIC combo hardware and software evolves for future high-performance systems.
Key Insight: Beyond 100G, evaluate hardware offloads in the network adapter to ensure high performance for your applications.
Differences Between Cloud VM and On-Premises Enterprise Deployments
While the general principles of resource allocation apply to both cloud and on-premises environments, there are some key differences to consider:
Increasingly, enterprises are moving towards a hybrid cloud, with on-prem infrastructure, a private cloud, and multiple public clouds. This model is maturing and is driving enterprises towards cloud-native applications, with legacy applications hosted on the on-prem infrastructure.
Rules of Thumb
While actual resource requirements will vary widely based on applications and workload characteristics, here are some rough rules of thumb for memory size and network I/O requirements per CPU core (or vCPU in cloud instances):
Memory Size (GB/Core):
Network I/O (Gbps/Core):
The cloud rules of thumb reflect the general trend of cloud VMs running multi-tenant distributed microservices-oriented applications, which typically require higher networking bandwidths per core.
In contrast, enterprise on-premises environments may have a mix of traditional monolithic applications and modern cloud-native applications, with lower current network I/O needs per core, but gradually increasing as enterprise applications become more cloud-native.
Private and hybrid cloud infrastructure falls in between the public cloud and on-prem enterprise in terms of these rules of thumb.
Note that enterprises and private clouds can use different optimization points - for example, an enterprise can design on-prem infrastructure for the median network load, whereas the public cloud is typically optimized for peak network load and defined cloud SLAs.
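To show how such per-core rules of thumb translate into an instance-level estimate, here is a minimal sizing sketch. The 4 GB/core and 0.5 Gbps/core ratios are placeholders for illustration, not the figures above; substitute your own targets.

```python
def size_instance(cores: int, mem_gb_per_core: float, net_gbps_per_core: float) -> dict:
    """Translate per-core rules of thumb into totals for a candidate instance or server."""
    return {
        "cores": cores,
        "memory_gb": cores * mem_gb_per_core,
        "network_gbps": cores * net_gbps_per_core,
    }

# Placeholder ratios, for illustration only: 4 GB and 0.5 Gbps per core on a 64-core instance.
print(size_instance(cores=64, mem_gb_per_core=4, net_gbps_per_core=0.5))
# -> {'cores': 64, 'memory_gb': 256, 'network_gbps': 32.0}
```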
These rules of thumb don’t apply to GPU-based LLM training and inference workloads. AI/ML training is compute-intensive, memory-intensive, and heavily network-bandwidth-intensive, requiring powerful GPUs alongside CPUs. I’ll cover those rules of thumb in a different article.
Workload Scenario Walk Through
Let's review a few scenarios below.
In general, microservices apps need less CPU and memory per instance but more network I/O than monolithic apps.
Web Server
Web servers in the cloud often leverage auto-scaling and load balancing, allowing for more efficient resource utilization across multiple instances. Each web server may use a small number of cores (8-16). On-premises web servers may need more dedicated headroom to handle potential traffic spikes, since they lack the elastic scaling of the cloud.
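As a hypothetical illustration of the auto-scaling math (the per-core request rate and utilization target below are assumptions, not benchmarks of any particular web server):

```python
import math

def instances_needed(peak_rps: float, rps_per_core: float, cores_per_instance: int,
                     target_utilization: float = 0.6) -> int:
    """Instances required to serve peak traffic while keeping headroom for spikes."""
    usable_rps_per_instance = rps_per_core * cores_per_instance * target_utilization
    return math.ceil(peak_rps / usable_rps_per_instance)

# Hypothetical: 120,000 req/s peak, ~1,500 req/s per core, 16-core instances run at 60% target load.
print(instances_needed(120_000, 1_500, 16))  # -> 9 instances
```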
Popular Web Servers: NGINX, Apache HTTP Server, Tomcat
Database Server
Legacy enterprise databases often have monolithic architectures and large memory requirements, and may benefit from fast network interfaces for data replication and high availability. Both cloud and on-prem enterprise databases may use distributed architectures, requiring higher network bandwidth for data replication and synchronization.
In-memory databases typically require much higher memory capacity, and even higher network bandwidth.
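A rough way to size memory for an in-memory database cluster is raw data multiplied by structure overhead, replication, and growth headroom; the factors below are assumptions for illustration, not measurements of any specific product.

```python
def in_memory_footprint_gb(dataset_gb: float, structure_overhead: float = 1.5,
                           replicas: int = 2, headroom: float = 1.3) -> float:
    """Cluster-wide RAM needed: raw data x index/object overhead x data copies x growth headroom."""
    return dataset_gb * structure_overhead * replicas * headroom

# Hypothetical 200 GB working set kept as two copies (primary + one replica).
print(in_memory_footprint_gb(200))  # -> 780.0 GB of RAM across the cluster
```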
Popular Databases:
Streaming Video Delivery
Cloud-based video streaming can leverage content delivery networks (CDNs) and edge caching, reducing the need for high-performance instances; this workload is perhaps best deployed in the cloud. On-premises video streaming may require more powerful instances to handle transcoding and delivery without the benefits of CDNs.
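A simple egress estimate shows why CDN offload matters so much; the viewer count, average bitrate, and offload ratio below are illustrative assumptions.

```python
def origin_egress_gbps(concurrent_viewers: int, avg_bitrate_mbps: float,
                       cdn_offload: float = 0.95) -> float:
    """Bandwidth the origin must still serve after the CDN absorbs its share of traffic."""
    total_gbps = concurrent_viewers * avg_bitrate_mbps / 1000
    return total_gbps * (1 - cdn_offload)

# Hypothetical: 500,000 concurrent viewers at 5 Mbps average, with 95% served from CDN caches.
print(origin_egress_gbps(500_000, 5))  # -> 125.0 Gbps at the origin (2,500 Gbps total to viewers)
```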
Popular Streaming Video Apps:
CDNs
CDN edge caching servers and origin storage can both use NVMe over RDMA for low-latency content delivery to end users.
Popular CDNs: Netflix hosts its own CDN for its content; Akamai and Fastly host CDNs for content providers.
High Performance Storage
Both cloud providers and enterprises deploy storage arrays that support NVMe over RDMA. Cloud providers offer specialized "distributed" instances (also called disaggregated storage) with support for the NVMe-over-RDMA protocol. The network bandwidth needs are on the high side, particularly for SSD-based storage.
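A quick sanity check for a disaggregated storage node is whether the NICs or the SSDs saturate first; the device throughput numbers below are illustrative, not vendor specs.

```python
def storage_node_bottleneck(ssd_count: int, ssd_read_gbs: float, nic_gbps: float) -> str:
    """Compare aggregate SSD read throughput against NIC line rate (both converted to GB/s)."""
    ssd_total_gbs = ssd_count * ssd_read_gbs
    nic_gbs = nic_gbps / 8  # bits -> bytes
    return "network-limited" if nic_gbs < ssd_total_gbs else "SSD-limited"

# Hypothetical node: 10 NVMe SSDs at ~7 GB/s sequential read each, behind 2 x 100GbE.
print(storage_node_bottleneck(ssd_count=10, ssd_read_gbs=7, nic_gbps=200))  # -> network-limited
```

This is why NVMe-over-RDMA storage nodes tend to pull network bandwidth requirements up faster than compute does.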
All major cloud service providers (CSPs) host their own cloud storage with multiple tiers and options - Amazon AWS EBS (Elastic Block Store), Google GCP Persistent Disk, and Microsoft Azure Premium SSD Managed Disks. Enterprise firms like Pure Storage, NetApp, and Dell EMC provide high-performance enterprise storage arrays.
LLM Training
AI/ML training involves massive datasets and requires HBM (high-bandwidth memory) and substantial network bandwidth to efficiently distribute training data across multiple GPUs and servers. Recent data suggests that about 20% of AI/ML systems are sold into the enterprise, with the rest going into the cloud.
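To see why HBM capacity and interconnect bandwidth dominate LLM training, here is a commonly used rough estimate of per-parameter training state for mixed-precision training with an Adam-style optimizer (an approximation that ignores activations and framework overhead).

```python
def training_memory_gb(params_billion: float) -> float:
    """Approximate training state for mixed precision + Adam:
    2 B weights + 2 B gradients + 12 B optimizer/master-weight state ~= 16 bytes per parameter."""
    return params_billion * 1e9 * 16 / 1e9

# A hypothetical 70B-parameter model needs on the order of 1.1 TB of training state,
# far beyond a single GPU's HBM, so the model is sharded across GPUs and gradients
# are exchanged over the network at every step.
print(round(training_memory_gb(70)))  # -> 1120 GB
```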
LLM Inference
AI/ML inference needs lower memory and network bandwidth compared to training but still requires substantial resources for real-time data processing and model serving. Inference for non-LLM applications, for example imaging-based defect detection on the manufacturing floor, can be accomplished on CPUs and does not need GPUs. Inference can happen both in the cloud and at the edge but will likely gravitate to the edge as smaller compound models become more capable. Driving down the inference cost per token is a key driver for growing the adoption of AI/ML technologies.
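Since cost per token is the metric to optimize, here is a minimal cost estimate; the instance price and sustained throughput below are hypothetical.

```python
def cost_per_million_tokens(instance_cost_per_hour: float, tokens_per_second: float) -> float:
    """Serving cost in dollars per one million generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return instance_cost_per_hour / tokens_per_hour * 1_000_000

# Hypothetical: a $4/hour GPU instance sustaining 2,500 tokens/s across all concurrent requests.
print(round(cost_per_million_tokens(4.0, 2_500), 2))  # -> ~$0.44 per million tokens
```

Batching more requests per GPU raises tokens/s and drives this number down, which is why inference efficiency gets so much attention.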
Real-time Data Processing
Real-time data processing in the cloud can leverage distributed architectures and auto-scaling to handle varying workloads. Latency requirements, such as those for controlling factory machines, can require these workloads to run on-prem. These edge deployments typically use Time-Sensitive Networking (TSN).
Real-time data processing is becoming increasingly important, driving the shift from the cloud to the edge. Autonomous cars, AI agents, automated manufacturing machine control, IoT devices, augmented reality, and online retail are some examples.
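A simple latency-budget sketch illustrates why tight control loops gravitate to the edge; the distances, hop counts, and per-hop delays below are illustrative assumptions, using roughly 5 microseconds per km of propagation in fiber.

```python
def network_rtt_ms(distance_km: float, per_hop_ms: float = 0.05, hops: int = 6) -> float:
    """Round-trip time: propagation in fiber (~5 us/km each way) plus switching/queuing per hop."""
    propagation_ms = 2 * distance_km * 0.005
    return propagation_ms + hops * per_hop_ms

# Hypothetical: a cloud region 1,500 km away vs. an on-prem edge node 1 km away.
print(round(network_rtt_ms(1500), 1))       # -> ~15.3 ms before any processing even starts
print(round(network_rtt_ms(1, hops=2), 2))  # -> ~0.11 ms
```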
Key Insight: Unlike human perception, which operates on timescales of roughly 100 ms, AI agents can respond in microseconds. In an agentic world, we will not have the luxury of a round trip to the cloud. We must rapidly transition to an Edge-First AI architecture.
Summary
This is a general framework for selecting the appropriate CPU, memory, and network configurations for various use cases in both cloud and on-premises enterprise environments. Further, deployments should take into consideration the specific workload characteristics, growth projections, and cost constraints of the environment.
By carefully evaluating these factors, IT professionals can ensure that their infrastructure is optimized for performance, scalability, and cost.
Glossary
Accelerated Computing: A computing model that leverages GPUs alongside CPUs to increase performance and efficiency, particularly for parallel processing tasks such as AI/ML workloads. Introduced by Nvidia CEO Jensen Huang.
AVX (Advanced Vector Extensions): A set of instructions for doing SIMD (Single Instruction, Multiple Data) vector operations on Intel processors.
Bandwidth: The maximum rate of data transfer across a given path.
CDN (Content Delivery Network): A network of servers that deliver web content to users based on their geographic location, the origin of the webpage, and the content delivery server. CDNs improve load times and reduce latency.
CPU (Central Processing Unit): The primary component of a computer that performs most of the processing inside a computer. It executes instructions from programs and processes data.
DPU (Data Processing Unit): A specialized processor designed to handle data-centric tasks such as data movement, security, and offloading network tasks from the CPU.
Ethernet: A family of networking technologies commonly used in local area networks (LAN), metropolitan area networks (MAN), and wide area networks (WAN).
GPU (Graphics Processing Unit): A specialized processor designed to accelerate graphics rendering and parallel processing tasks. GPUs are increasingly used for AI/ML and general-purpose computing tasks, particularly those involving large-scale parallel processing.
HBM (High Bandwidth Memory): A type of high-performance memory packaged close to GPUs/CPUs and other accelerators, offering much higher bandwidth than the traditional DRAM attached to CPUs.
InfiniBand: A high-performance, low-latency networking protocol commonly used in high-performance computing (HPC) environments.
ISA (Instruction Set Architecture): The part of the computer architecture related to programming, including the instruction set, data types, registers, addressing modes, and memory architecture.
LAN (Local Area Network): A network that connects computers within a limited area such as a residence, school, laboratory, or office building.
LLM (Large Language Model): A type of AI model trained on vast amounts of text data to understand and generate human language, such as GPT-3.
Monolithic Application: A software application that is designed as a single, indivisible unit. Traditionally, these applications are large, complex, and difficult to modify or scale.
NVMe (Non-Volatile Memory Express): A high-performance, scalable host controller interface for accessing solid-state drives over PCI Express.
PCIe (Peripheral Component Interconnect Express): A high-speed interface standard for connecting components like graphics cards, SSDs, and network cards to a computer's motherboard.
RDMA (Remote Direct Memory Access): A technology that allows computers in a network to exchange data in main memory without involving the CPU, improving throughput and reducing latency.
Scalability: The ability of a system to handle a growing amount of work, or its potential to accommodate growth.
SLA (Service Level Agreement): A commitment between a service provider and a customer on measurable service levels such as availability, latency, and throughput.
SR-IOV (Single Root I/O Virtualization): A specification that allows the isolation of the PCI Express resources for manageability and performance reasons in virtualized environments.
TSN (Time-Sensitive Networking): A set of IEEE 802.1 standards that allow time-sensitive data streams to be transmitted over Ethernet networks with bounded latency and high reliability.
vCPU (Virtual Central Processing Unit): A virtual abstraction of a physical CPU core that is allocated to virtual machines in cloud environments.
vSwitch (Virtual Switch): A software-based network switch that allows virtual machines (VMs) to communicate with each other within the same host or across different hosts in a virtualized environment.
#CloudComputing #EnterpriseIT #NetworkOptimization #ResourceAllocation #PerformanceTuning #VirtualizationTechnologies #RDMA #NVMe