Unsung heroes of AI revolution

Intel shipped its first 8086 CPU in 1978 and ushered in the personal computer revolution. The 8086 introduced the x86 architecture, which remains the dominant architecture for PCs today. This innovation cemented the CPU's role as the central component for executing instructions, powering first PCs and later servers in the cloud. While CPUs have historically shouldered the primary processing workload, over time select tasks such as TCP/IP processing, encryption, graphics, and compression have been offloaded to specialized hardware to alleviate CPU load. Even so, traditional system design remains CPU-centric, with offloads and accelerators playing a supporting role. However, with the unique demands of Transformers, LLMs, and the proliferation of LLM inferencing, modern system design is increasingly built around GPUs and Data Processing Units (DPUs) or SmartNICs, with CPUs in a supporting role.

The Nvidia DGX H100 system says it all: it packs 116,736 FP32 CUDA cores and 3,648 Tensor Cores, compared to just 112 CPU cores.

AI system design necessitates principles, concepts, and frameworks rooted in High-Performance Computing (HPC). In response, data centers and enterprise architectures are undergoing a paradigm shift from a CPU-centric approach to GPU-centric computing in enterprise workflows. This secular trend is gaining momentum, and I expect enterprise architecture design to keep evolving towards GPU-centric HPC.

While GPUs undoubtedly shine as the rock stars of AI computing, they can't carry the entire performance solo. They rely on a strong supporting cast of complementary technologies to truly light up the stage.

Let's dive in.

Network

AI training of large models is fundamentally a massive distributed systems and parallel computing problem. It requires parallel processing of data across GPUs within a machine as well as distribution across thousands of machines. To maximize GPU utilization, ultra-low-latency, high-bandwidth data exchange is table stakes.

NVLink

Scale-up begins at the GPU level, leveraging a highly efficient network for inter-GPU communication to unite GPUs and create the illusion of a single, colossal GPU. For Nvidia GPUs this capability is powered by NVLink, which gives applications coherent access to a unified memory space. NVLink mitigates programming complexity and caters to the demands of trillion-parameter LLMs.

NVLink includes features such as a mesh topology and unified memory, enabling GPUs to share a common memory pool while supporting direct memory access. This eliminates CPU involvement, allowing GPUs to directly access each other's memory and drastically reducing network latency.

Further, NVSwitch interconnects multiple GPUs via their NVLink interfaces, enabling high-bandwidth multi-node GPU clusters.

The latest version of NVLink, along with the NVLink Switch, can handle training enormous AI models with up to 7.2 trillion parameters. It can support up to 576 GPUs working together like one giant GPU! It delivers an impressive 1.8 terabytes per second of bidirectional bandwidth per GPU, over 14 times faster than the latest PCIe Gen5, and provides a massive 240 terabytes of fast memory. With 576 of Nvidia's Blackwell GPUs connected, the fabric reaches a staggering 1 petabyte per second of total bandwidth.
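
A quick back-of-envelope check of those numbers (the PCIe Gen5 x16 figure of roughly 128 GB/s bidirectional is my own assumption, not an Nvidia spec):

```python
# Back-of-envelope check of the NVLink 5 headline numbers quoted above.
per_gpu_bw_tbps = 1.8        # TB/s bidirectional per GPU (NVLink 5)
num_gpus = 576               # GPUs in one NVLink domain

aggregate_bw = per_gpu_bw_tbps * num_gpus        # ~1,037 TB/s, i.e. ~1 PB/s
pcie_gen5_x16_tbps = 0.128                       # assumed: ~128 GB/s bidirectional
speedup_vs_pcie = per_gpu_bw_tbps / pcie_gen5_x16_tbps

print(f"Aggregate NVLink bandwidth: {aggregate_bw:,.0f} TB/s (~1 PB/s)")
print(f"Per-GPU speedup vs PCIe Gen5 x16: ~{speedup_vs_pcie:.0f}x")
```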

Infinity Fabric, with its underlying XGMI interconnect, is AMD's alternative to Nvidia's NVLink.

InfiniBand, RoCEv2, iWARP

Scaling out across machines requires inter-node communication, and just as with inter-GPU communication, high-bandwidth, low-latency data exchange requires direct access to a remote node's memory without first copying the data into local memory or involving the CPU. This capability is essential in the highly distributed clusters required for AI training and inference.

In such environments, the traditional handling of East-West traffic (communication between machines in the cluster) falls short, emphasizing the need for Remote Direct Memory Access (RDMA). The ability to minimize jitter and handle burstiness, common traffic patterns during training, is important as well.

InfiniBand has served as the workhorse of HPC for over a decade and is well-suited to meet the demands of AI infrastructure. However, with Ethernet reigning as the dominant Physical Layer (PHY), RDMA over Converged Ethernet (RoCEv2) is a solid alternative. RoCEv2 focuses on minimizing loss and retransmissions, while ensuring compatibility with InfiniBand verbs, seamlessly providing RDMA over Ethernet for applications originally designed for InfiniBand.

iWARP is another technology that provides RDMA on top of TCP/IP and, like RoCEv2, runs over Ethernet.

While 25 or 50 Gbps bandwidth may suffice in traditional data centers, AI demands up to 400 Gbps bandwidth and beyond to keep up with GPU advancements.

Nvidia's latest Quantum-X800 InfiniBand switches support 800 Gbps ports. By comparison, Ethernet switching is also pushing to 800 Gbps port speeds.

The decision between RoCEv2, iWARP, and InfiniBand depends on distinct requirements such as cost, performance, and specific use cases. Abundant resources exist online comparing their pros and cons. What's promising is that all three technologies support a shared user API even though they have different physical and link layers: an application developed using ibverbs can seamlessly transition to any of the available networks.

GPUDirect RDMA, built into Nvidia GPUs, takes the concept one step further: it facilitates direct data exchange between GPUs and third-party devices via PCIe. Whether it's SmartNICs like ConnectX, DPUs, or video adapters, this approach bypasses the need for data to traverse the CPU, eliminating the bottleneck of a CPU bounce buffer.

CPU (for inferencing)

When it comes to training Large Language Models (LLMs), GPUs are typically the preferred hardware due to their computational power. However, for deployment and inferencing, the latest CPUs can be surprisingly effective, especially for models with fewer than 20 billion parameters and a latency tolerance of up to 100 milliseconds. Beyond the 20-billion-parameter mark, or for applications with very strict latency requirements, GPUs remain the top choice. The 20-billion-parameter benchmark and the 100-millisecond latency threshold come from the inference performance of Intel's 5th Gen Xeon processor, Emerald Rapids.
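
To see why a cutoff around 20 billion parameters and ~100 milliseconds is plausible, here is a rough memory-bound estimate of per-token decode latency on a CPU. The numbers below (INT8 weights, ~300 GB/s of memory bandwidth for an 8-channel DDR5 socket) are illustrative assumptions, not Intel benchmark figures:

```python
# Rough memory-bound estimate of per-token latency for CPU inference.
# During autoregressive decode, each generated token must stream (roughly)
# all model weights from memory once.
params = 20e9                 # 20B-parameter model
bytes_per_param = 1           # assumed: INT8-quantized weights
mem_bandwidth = 300e9         # assumed: ~300 GB/s for an 8-channel DDR5 socket

model_bytes = params * bytes_per_param
latency_per_token_ms = model_bytes / mem_bandwidth * 1e3
print(f"~{latency_per_token_ms:.0f} ms per token")   # ~67 ms, within a 100 ms budget
```

The same arithmetic shows why larger models, or FP16 weights, quickly blow past a 100-millisecond-per-token budget on a CPU.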

To support the unique needs of deep learning, Intel started shipping Advanced Matrix Extensions (AMX) with Sapphire Rapids, the 4th generation Xeon processors. AMX supports the INT8 datatype, which has gained popularity for model quantization. Quantization converts a model's parameters, such as weights and biases, from floating-point numbers to integers with reduced bit widths, for instance from 32-bit floating point to 8-bit integers. Integer operations are inherently faster than floating-point operations, leading to accelerated computation, reduced power consumption, and notably smaller model sizes. This optimization comes with some loss of accuracy, which for real-life use cases may be acceptable when weighed against the overall benefits.
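
As a minimal sketch of what quantization does, here is symmetric per-tensor INT8 quantization in NumPy; real toolchains typically quantize per channel using calibration data, and the weight matrix here is just a stand-in:

```python
import numpy as np

def quantize_int8(w_fp32: np.ndarray):
    # Map the largest weight magnitude to 127 and round everything else to int8.
    scale = np.abs(w_fp32).max() / 127.0
    w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)
    return w_int8, scale

def dequantize(w_int8: np.ndarray, scale: float) -> np.ndarray:
    return w_int8.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)   # hypothetical weight matrix
w_q, s = quantize_int8(w)
print(w.nbytes / w_q.nbytes)                          # 4x smaller (32-bit -> 8-bit)
print(np.abs(w - dequantize(w_q, s)).max())           # worst-case quantization error
```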

CPUs offer several advantages over GPUs in terms of Total Cost of Ownership (TCO), power consumption, and supply, as they are not as backlogged as GPUs. The decision between CPUs and GPUs for inferencing also hinges on the specific application's requirements: in applications like code generation versus chatbots, the choice of hardware can significantly impact user experience with respect to latency. CPU inferencing is particularly advantageous for applications that mix general-purpose workloads and AI tasks. In video conferencing, for instance, transcription and translation are AI workloads, while screen sharing and real-time audio/video are general-purpose workloads.

Memory

In traditional computing, DRAM advancements have struggled to keep pace with rapid increases in CPU speed, particularly as core counts grow. AMD took proactive steps to tackle this disparity as early as 2008 by collaborating with SK Hynix to pioneer High Bandwidth Memory (HBM). HBM not only relieves the bottleneck of narrow memory buses (a leap from the 64-bit width of a DDR5 channel to 1024 bits per HBM stack), but it also cuts power consumption by over 40% by running at a lower clock rate.

By 2023, SK Hynix dominated the HBM market with roughly 50% share, followed by Samsung at 40% and Micron at 10%. Although AMD pioneered the development of HBM technology, ironically it's Nvidia that now reaps the greatest rewards, leveraging HBM to fuel the performance of its GPUs.

As the volume of AI training continues to expand exponentially, the demand for accelerators capable of supporting terabyte-scale bandwidth becomes increasingly critical. In response, the latest iteration, HBM3e, delivers roughly 50% more bandwidth than its predecessor, HBM3. Samsung's HBM3e can achieve an impressive 1.225 TB/s per stack, up from HBM3's 819 GB/s per stack. This substantial enhancement underscores the pressure on memory vendors to keep up with compute advances to drive the next generation of AI-driven applications.
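
Those per-stack figures fall out of the 1024-bit interface multiplied by the per-pin data rate; the pin rates below are approximate and assumed for illustration:

```python
# Per-stack HBM bandwidth = bus width (bits) * per-pin data rate / 8 bits per byte.
bus_width_bits = 1024

hbm3_pin_gbps = 6.4            # assumed: ~6.4 Gb/s per pin for HBM3
hbm3e_pin_gbps = 9.6           # assumed: ~9.6 Gb/s per pin for HBM3e

hbm3_gbps = bus_width_bits * hbm3_pin_gbps / 8      # ~819 GB/s per stack
hbm3e_gbps = bus_width_bits * hbm3e_pin_gbps / 8    # ~1229 GB/s (~1.2 TB/s) per stack

print(f"HBM3:  ~{hbm3_gbps:.0f} GB/s per stack")
print(f"HBM3e: ~{hbm3e_gbps:.0f} GB/s per stack ({hbm3e_gbps / hbm3_gbps - 1:.0%} more)")
```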

Large Language Models (LLMs) are memory hungry and prone to becoming memory-bound if the memory connected to the GPUs fails to supply data quickly enough. Enhancing memory bandwidth is instrumental in bringing GPUs closer to their peak performance. One metric that encapsulates this is the ops:byte ratio: the number of floating-point operations the hardware can perform for each byte of memory it can move.

For instance, the H100 at FP16 has a compute throughput of 1,979 TFLOPS and a memory bandwidth of 3.35 TB/s, resulting in an ops:byte ratio of about 590. Its successor, the H200, maintains the same 1,979 TFLOPS at FP16, yet its memory bandwidth jumps to 4.8 TB/s, yielding an ops:byte ratio of about 412. The preference is an ops:byte ratio below roughly 300, and ideally lower than the arithmetic intensity of the algorithm, so that the workload stays compute-bound rather than memory-bound.
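
The ratio is simply peak compute divided by memory bandwidth, and comparing it with a workload's arithmetic intensity tells you whether that workload will be memory-bound or compute-bound. A sketch using the figures above (the batch-1 decode intensity is a hypothetical workload number for illustration):

```python
# Hardware ops:byte ratio = peak FLOPS / memory bandwidth (FLOPs per byte moved).
def ops_per_byte(peak_tflops, mem_bw_tbps):
    return peak_tflops / mem_bw_tbps

h100 = ops_per_byte(1979, 3.35)   # ~590
h200 = ops_per_byte(1979, 4.8)    # ~412

# A workload whose arithmetic intensity is below the hardware ratio is memory-bound.
decode_intensity = 60             # hypothetical: FLOPs per byte for batch-1 LLM decode
for name, hw in [("H100", h100), ("H200", h200)]:
    bound = "memory-bound" if decode_intensity < hw else "compute-bound"
    print(f"{name}: ops:byte ~{hw:.0f} -> intensity {decode_intensity} is {bound}")
```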

Looking ahead, the next generation of memory technologies, such as HBM3e and HBM4 (anticipated to feature a 2048-bit bus width), will continue to elevate memory performance. Hopefully, these advancements will further lower the ops:byte ratio, facilitating even greater GPU compute efficiency and pushing the boundaries of AI and computational capabilities.

Storage

AI model training is a storage-intensive activity, with much of the time spent on data movement: data loading, model loading, and checkpointing. As GenAI shifts towards multimodal models, consuming large amounts of image, video, audio, and text data will only increase the need for copious high-performance storage. During training there are millions of small IOs, with an average read size of 350 KB and writes of up to 100 KB, plus heavy write bursts attributable to checkpointing. Inference, as one would expect, is a read-intensive workload. Since so much time goes into moving data, making the storage path fast and efficient is critical to improving GPU efficiency and reducing idle time.

To give an idea of the storage bandwidth involved: for a single Scalable Unit (SU) housing 32 Nvidia HGX H100 systems (256 GPUs in total), Nvidia recommends up to 125 GB/s of read and 62 GB/s of write bandwidth. On the software front, a userspace filesystem layered on top of a distributed flash storage cluster is a must to achieve the scale and latency GenAI needs.
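
Spreading those recommendations across the unit gives a feel for the per-GPU storage budget; the arithmetic below uses only the figures quoted above:

```python
# Per-GPU storage budget for a Scalable Unit of 32 HGX H100 systems (256 GPUs).
gpus = 256
read_bw_gbps = 125            # GB/s aggregate reads recommended per SU
write_bw_gbps = 62            # GB/s aggregate writes recommended per SU
avg_read_kb = 350             # average read size during training

print(f"Per-GPU reads:  ~{read_bw_gbps / gpus * 1e3:.0f} MB/s")    # ~488 MB/s
print(f"Per-GPU writes: ~{write_bw_gbps / gpus * 1e3:.0f} MB/s")   # ~242 MB/s
print(f"Aggregate read IOPS at 350 KB: ~{read_bw_gbps * 1e9 / (avg_read_kb * 1e3):,.0f}/s")
```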

To address potential bottlenecks and optimize data flow, Nvidia offers GPUDirect Storage, a solution that gives GPUs direct access to local or remote storage, such as NVMe or NVMe over Fabrics (NVMe-oF), without routing the data through the CPU. This reduces CPU utilization and, most importantly, eliminates the extra data copy through system memory.
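
As a rough sketch of what this looks like from Python, assuming NVIDIA's kvikio bindings for cuFile/GPUDirect Storage and CuPy are available (the file path and buffer size are hypothetical):

```python
import cupy as cp
import kvikio  # NVIDIA's Python bindings for cuFile / GPUDirect Storage

# Destination buffer allocated directly in GPU memory (hypothetical 1 GiB read).
buf = cp.empty(1 << 30, dtype=cp.uint8)

# cuFile DMAs the data from NVMe straight into GPU memory,
# skipping the CPU bounce buffer. File path is hypothetical.
f = kvikio.CuFile("checkpoint.bin", "r")
nbytes = f.read(buf)
f.close()
print(f"Read {nbytes} bytes directly into GPU memory")
```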

Power

AI's transformative potential has often been likened to electricity due to its profound economic and societal impact. Some might say that's hyperbole, but the undeniable fact is that AI consumes gobs and gobs of electricity. In contrast to traditional workloads executed on CPUs, GPU-based workloads consume significantly more power and generate much more heat. While traditional CPU racks were designed to operate within 10-20 kW per rack, modern GPU-based systems now demand 150-200 kW per rack and beyond. The power requirements of Nvidia GPUs have surged from 300 watts for the V100 to 1,000 watts for the B200 series, a notable increase in power consumption. By comparison, Intel's latest CPU, Sapphire Rapids, is rated between 120 W and 350 W.

To illustrate the staggering power consumption, a single NVIDIA HGX H100 server demands 12 kW, equivalent to the power consumption of ten average U.S. homes. Power is such a big deal that data center size is measured in power capacity (kilowatts), unlike real estate, which is measured in square footage.
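
The "ten homes" comparison is simple arithmetic; the average U.S. household figure of roughly 10,500 kWh per year is my assumption here:

```python
# How an HGX H100 server compares to household electricity consumption.
server_kw = 12                          # continuous draw of one HGX H100 server
home_kwh_per_year = 10_500              # assumed: average annual U.S. household usage
home_avg_kw = home_kwh_per_year / 8760  # ~1.2 kW average continuous draw

print(f"One server ~= {server_kw / home_avg_kw:.0f} average U.S. homes")   # ~10
print(f"Annual energy: ~{server_kw * 8760 / 1000:.0f} MWh per server")     # ~105 MWh
```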

Per BCG, data center electricity consumption in the U.S. accounted for 2.5% of the total electricity usage in 2022. However, this figure is projected to triple by 2030, reaching 7.5%, equivalent to the electricity consumption of approximately 40 million U.S. households, representing nearly a third of all homes in the country. These statistics underscore the pressing need for energy-efficient AI and data center technologies to mitigate the environmental impact of AI-driven advancements.

The AI gold rush has not only pushed the technological boundaries of compute power, memory, storage, network, and energy efficiency, but it has also sparked a complete reimagination of the traditional data center to meet the demands of AI workloads. This transformation has been dubbed the 'AI Data Center Revolution' by industry players. Innovative solutions addressing challenges such as cooling optimization, power forecasting, airflow management, and real-time monitoring have emerged to cater specifically to the intricate requirements of AI data centers. Data centers are experiencing once-in-a-generation demand growth, so much so that Blackstone recently committed $25 billion towards its data center business.

Inference is the new web app that will light up all kinds of business workflows. The grand GenAI infra buildout must continue, pushing the boundaries of compute, memory, network, storage, and power to keep up with escalating demand.
