Emerging Patterns in AI Workflows, and Their Impact on Scale-Out Networking
Emerging pattern for AI Model Creation through Service Delivery


Disclaimer: The views and opinions articulated in this article solely represent my perspective and do not reflect the official stance of my employer or any other affiliated organization.

Introduction

As artificial intelligence (AI) continues to evolve, distinct patterns have emerged in AI workflows, each with unique networking requirements. These patterns can be categorized into three market segments:

  1. AI Frontier Factories: Facilities dedicated to pre-training large-scale multimodal models such as Large Language Models (LLMs), Diffusion Models (DMs), State-Space Models (SSMs), and foundational robotic models that understand physics. In Sam Altman's 15 words: deep learning worked, got predictably better with scale, and we dedicated increasing resources to it. Due to their capital-intensive nature (often requiring investments of tens of billions of dollars), these factories are operated by a select group of large organizations such as OpenAI, Microsoft, Cohere, and Meta.
  2. Service Clouds: Platforms responsible for deploying AI applications from the cloud to the edge. To achieve economic viability, Service Clouds must integrate AI workflows while supporting existing applications. These include large hyperscalers like AWS, private clouds within enterprises like JPMorgan, on-premises enterprise services, and edge compute environments. Categorizing on-premises and edge infrastructures as clouds signifies the shift toward cloud-native automation in enterprise and edge markets—a trend accelerated by AI.
  3. Data Foundries: Deep learning works even better with the right kind of data, not just larger volumes of raw data. Data foundries specialize in collecting, filtering, labeling, distilling, and curating data, as well as generating synthetic and hybrid data. They serve diverse customers, including Frontier Factories and Service Clouds, and are vital for developing domain-specific multi-agent systems, helping to steer, fine-tune, and ground models with tailored, curated data. Success in this space hinges on curating and generating the right kind of data: integrating multi-modality, physics traces, world models, reasoning chains, and semi- or unsupervised agentic tool workflows. Companies like Scale.ai lead the way, serving Frontier Factories such as OpenAI as well as private and enterprise clouds such as SAP and Brex. We'll cover data foundries in a separate article.

This article examines the differing scale-out networking needs of Frontier Factories and Service Clouds, highlighting the opportunities and challenges in optimizing networking infrastructures for each environment.


Scale-out Networking within the ACS Node

An Accelerated Compute Server (ACS) typically contains 8 to 16 GPUs, interconnected via PCIe switches and point-to-point memory fabrics like NVLink. The current ACS design is an early model and will likely evolve rapidly to become more GPU-centric and dataflow-centric.


The Accelerated Compute Server (ACS) with the backend scale-up network.

Scale-out architectures distribute workloads across multiple ACS nodes, enabling linear performance gains as additional nodes are added. This approach is critical for AI training and inference, where massive datasets and complex models demand scalable compute and memory resources.

Traditional network adapters were designed for CPU-centric workloads and must be re-architected to serve ACS nodes containing multiple GPUs. This may require:

  • Direct Data Transfers: Support for techniques such as peer-to-peer PCIe DMA and dma-buf for direct GPU-to-GPU and GPU-to-storage data transfers, along with the separation of the control and data planes (see the ZeroNIC research paper) for parallel scaling; see the sketch after this list.
  • A Robust Scale-Up Fabric: An internal network within the ACS node to interconnect GPUs efficiently.
  • Multi-GPU Network Adapters: Capable of switching, load balancing, and handling increased data flow to other ACS nodes.
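
To make the direct-transfer bullet above concrete, here is a minimal PyTorch sketch (my own illustration, assuming a node with at least two CUDA GPUs; it is not tied to any particular adapter) that queries peer-to-peer capability and issues a device-to-device copy, which the driver can service over the scale-up fabric instead of bouncing through host memory:

```python
import torch

def direct_gpu_to_gpu_copy():
    """Copy a tensor from GPU 0 to GPU 1, using peer-to-peer DMA when the platform allows it."""
    assert torch.cuda.device_count() >= 2, "this sketch needs at least two GPUs"

    # Ask the driver whether GPU 0 can directly address GPU 1's memory,
    # i.e., whether a P2P PCIe/NVLink path exists between them.
    p2p_ok = torch.cuda.can_device_access_peer(0, 1)
    print(f"peer-to-peer access between GPU 0 and GPU 1: {p2p_ok}")

    src = torch.randn(1024, 1024, device="cuda:0")

    # .to("cuda:1") issues a device-to-device copy; with P2P enabled the data
    # moves directly between GPUs rather than staging through host RAM.
    dst = src.to("cuda:1", non_blocking=True)
    torch.cuda.synchronize()
    return dst

if __name__ == "__main__":
    direct_gpu_to_gpu_copy()
```

The same idea extends to GPU-to-storage paths (dma-buf, GPUDirect-style transfers), where the NIC or NVMe device reads and writes GPU memory without a CPU bounce copy.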



Enfabrica's SuperNIC supporting multiple GPUs, a scale-up memory fabric, and a scale-out network fabric.

As an illustration, Enfabrica's SuperNIC—presented at industry conferences—supports multiple GPUs with a scale-up memory fabric and a scale-out message fabric, showcasing the integration of scale-up and scale-out networking within a single solution.

Transitioning to multi-GPU network adapters is essential for simplifying data flow and maximizing performance.


Scale-Out Networking Requirements

As AI evolves, the scale-out networking requirements for AI Frontier Factories and Service Clouds diverge significantly. Understanding these differences is crucial for designing effective networking solutions tailored to each use case.


AI Frontier Factories: Custom Scale-Out Networks

AI Frontier Factories employ thousands to millions of GPUs working in parallel, necessitating networking solutions capable of handling immense data exchanges with high throughput.

Currently, each GPU typically has its own backend network adapter, but there is a rapid shift toward supporting multiple GPUs per adapter to optimize throughput.

Key Characteristics of Frontier Factory Networking:

  • Dedicated Frontend and Backend Networks: Frontier Factories use separate networks for frontend and backend operations. The backend network is optimized for high-bandwidth, low-latency communication between GPUs during distributed training. The frontend network handles external communication, data ingress and egress, and administrative tasks. This separation minimizes interference from non-AI workloads, reducing training times and costs.


The frontend and backend networks in the Frontier Factory.

  • Customized Backend Networks: Backend fabrics are often custom-designed to exploit specific traffic patterns, such as the all-to-all collectives used in distributed training. Rail-optimized networks suit many AI models (recommendation models being a notable exception) and take advantage of the decrease in bandwidth requirements as data moves up the network tiers, reducing cost without impacting training performance (see the sketch after this list).
  • Network Flexibility: Rapid changes in model architectures, parallelism methods, and network scale (across multiple data centers) alter communication patterns. Networks must adapt through flexible software tuning and, in some cases, custom hardware solutions.
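
As a rough illustration of the rail-optimized point above (a simplified toy model, assuming one NIC per GPU and one rail switch per local GPU index), the sketch below maps each (node, local GPU) pair to a rail and shows that same-rank GPUs across nodes share a leaf switch, so the heaviest collective traffic never has to climb to the spine tier:

```python
from collections import defaultdict

GPUS_PER_NODE = 8   # GPUs (and backend NICs) per ACS node
NUM_NODES = 4       # nodes in this toy cluster

def rail_of(node: int, local_gpu: int) -> int:
    """In a rail-optimized fabric, GPU k of every node attaches to rail switch k."""
    return local_gpu

# Group global ranks by the rail switch they attach to.
rails = defaultdict(list)
for node in range(NUM_NODES):
    for gpu in range(GPUS_PER_NODE):
        global_rank = node * GPUS_PER_NODE + gpu
        rails[rail_of(node, gpu)].append(global_rank)

for rail, members in sorted(rails.items()):
    print(f"rail {rail}: ranks {members}")

# Same-local-rank GPUs land on the same rail, so their collective traffic stays
# one hop away -- the tiered bandwidth saving described above.
```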


Challenges in Scaling Frontier Clusters:

  • Reliability Issues: As clusters grow, the system becomes more prone to failures, necessitating frequent checkpointing and reducing overall efficiency (a rough estimate follows this list).
  • Communication Bottlenecks: Synchronous training can become a bottleneck, further decreasing efficiency.
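
To put rough numbers on the reliability bullet above (a back-of-the-envelope estimate with assumed failure rates, not measured data), the classic Young/Daly approximation shows how the optimal checkpoint interval shrinks, and the checkpointing overhead grows, as the cluster scales:

```python
import math

def cluster_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """With roughly independent failures, cluster MTBF shrinks as 1/N."""
    return node_mtbf_hours / num_nodes

def optimal_checkpoint_interval_hours(checkpoint_write_hours: float,
                                      mtbf_hours: float) -> float:
    """Young/Daly first-order approximation: tau ~= sqrt(2 * delta * MTBF)."""
    return math.sqrt(2 * checkpoint_write_hours * mtbf_hours)

NODE_MTBF = 5 * 365 * 24   # assume ~5 years mean time between failures per ACS node
CKPT_WRITE = 10 / 60       # assume a 10-minute checkpoint write

for nodes in (1_000, 10_000, 100_000):
    mtbf = cluster_mtbf_hours(NODE_MTBF, nodes)
    tau = optimal_checkpoint_interval_hours(CKPT_WRITE, mtbf)
    # Fraction of wall-clock time spent just writing checkpoints (ignoring rework after failures).
    overhead = CKPT_WRITE / (tau + CKPT_WRITE)
    print(f"{nodes:>7} nodes: cluster MTBF {mtbf:6.1f} h, "
          f"checkpoint every {tau:4.1f} h, ~{overhead:.1%} of time checkpointing")
```

Under these assumptions the checkpointing tax alone grows from a few percent at 1,000 nodes to tens of percent at 100,000 nodes, before counting the lost work replayed after each failure.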


Asynchronous and Synchronous Gradient Descent (Credit: O'Reilly's Hands-On Convolutional Neural Networks with TensorFlow)

Solutions:

Frontier customers are addressing these issues with asynchronous and distributed training approaches, as sketched below.
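
The sketch below contrasts the two update styles from the figure above in toy form (a numeric illustration only; real Frontier training loops shard models across thousands of GPUs and use far more sophisticated optimizers):

```python
import random

random.seed(0)
LR = 0.1
WORKERS = 4
STEPS = 10

def worker_gradient(w: float) -> float:
    """Toy gradient of f(w) = w**2, with noise standing in for per-worker data shards."""
    return 2 * w + random.uniform(-0.05, 0.05)

# Synchronous SGD: a global barrier every step -- gradients from all workers are
# averaged (an all-reduce stand-in) before a single update is applied.
w_sync = 1.0
for _ in range(STEPS):
    grads = [worker_gradient(w_sync) for _ in range(WORKERS)]
    w_sync -= LR * sum(grads) / WORKERS

# Asynchronous SGD: each worker applies its gradient as soon as it finishes. In a
# real system the weights it read may already be stale; there is no barrier, so a
# slow or failed node no longer stalls everyone, at the cost of noisier updates.
w_async = 1.0
for _ in range(STEPS):
    for _ in range(WORKERS):
        w_async -= LR * worker_gradient(w_async)

print(f"synchronous result:  {w_sync:.4f}")
print(f"asynchronous result: {w_async:.4f}")
```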

Working closely with Frontier Factory customers is critical to developing the right scale-out networking products, as training methods are changing rapidly in custom ways.


Service Clouds: The Untapped Opportunity


We often anthropomorphize AI, equating larger models with greater intelligence. But AI doesn’t work like humans.

Models can be copied, fine-tuned, and distilled, spreading intelligence across Service Clouds and multi-agent workflows. The real power lies in the network, not in the size of any single model. The key question isn't bigger versus smaller models; it's the mesh of models, each distilled and fine-tuned for domain-specific tasks, interacting locally, and able to call upon any other model across the network. What truly matters are domain-specific abilities, how models are networked, software workflows, latency requirements, and the cost of service.

Dr. Jim Fan showcasing the shift from training to inference with OpenAI's Strawberry (o1-preview) model.


With the advent of models like o1-preview, we now have scaling laws for inference, signaling a shift in focus from Frontier Factories (foundation models) to Service Clouds. We sensed the shift toward inference at Hot Chips 2024; the tide has turned toward the Service Clouds.

Role of Service Clouds:

Service Clouds deliver enterprise and consumer software applications across hyperscale and tier-2 clouds, enterprise, and edge infrastructures. They handle:

  • Training of Pre-Trained Models: Distilling, grounding, and fine-tuning models for domain-specific tasks.
  • AI Inference: Running models to generate predictions or outputs.
  • Integration with Traditional Applications: Combining AI capabilities with existing services.

The primary challenge is integrating ACS infrastructure without disrupting existing workloads. Over time, existing software applications will be optimized to run more efficiently on parallel GPU infrastructure.


Like Frontier Factories, scale-out network adapters in Service Clouds will need to support multiple GPUs within an accelerated node and include an internal fabric.

Infrastructure Considerations:

Given the diversity of workloads, Service Clouds need to support various combinations of compute, memory, and storage on the ACS node:

  • Inference Servers: May require additional CXL memory for the pre-fill phase, paged-attention mechanisms, and RAG workflows (see the sizing sketch after this list).
  • Fine-Tuning: Might need enhanced data storage or generation capabilities.
  • Multi-Agent Systems: Multiple agents may cooperate through traditional software workflows; because many agents must interact before an action is taken, inference needs to be an order of magnitude faster.
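
As a back-of-the-envelope example of the inference bullet above (a rough sizing sketch with assumed model parameters, not measured figures), the KV cache for a modest batch of long-context requests alone can reach tens of gigabytes, which is what motivates CXL memory expansion and paged-attention-style block allocation:

```python
# Rough KV-cache sizing for a hypothetical 70B-class decoder model (all numbers assumed).
LAYERS = 80
KV_HEADS = 8            # grouped-query attention
HEAD_DIM = 128
BYTES_PER_ELEM = 2      # fp16 / bf16
BATCH = 32              # concurrent requests
CONTEXT_LEN = 8192      # tokens per request (prompt pre-fill + generated)

# Each layer stores one K and one V vector per KV head for every token.
bytes_per_token = LAYERS * KV_HEADS * HEAD_DIM * 2 * BYTES_PER_ELEM
kv_cache_bytes = bytes_per_token * BATCH * CONTEXT_LEN

print(f"KV cache per token : {bytes_per_token / 1024:.1f} KiB")
print(f"KV cache for batch : {kv_cache_bytes / 2**30:.1f} GiB")

# Paged attention carves this cache into fixed-size blocks (e.g., 16 tokens each)
# so a partially filled sequence does not pin worst-case contiguous GPU memory.
BLOCK_TOKENS = 16
blocks_needed = (BATCH * CONTEXT_LEN) // BLOCK_TOKENS
print(f"paged-attention blocks of {BLOCK_TOKENS} tokens: {blocks_needed}")
```

With these assumptions the cache works out to roughly 80 GiB for the batch, well beyond a single GPU's HBM once weights are accounted for, hence the appeal of tiered CXL memory for the pre-fill and caching stages.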

Currently, AI services in Service Clouds are fragmented, often running on infrastructure separate from traditional services, with distinct frontend and backend networks. Consolidating these infrastructures is essential for integrating AI capabilities into all software applications, thereby reducing costs and increasing deployment flexibility.

While the shape of future use cases is unknowable, we need all of our tools, existing software technologies as well as new AI and software inventions, to integrate and work together to power this AI-infused software revolution.


Recommendations for Future Service Cloud Infrastructure

  • Network Partitioning: The network needs to be partitionable through Software-Defined Networking (SDN) so that workloads can run against specific compute and storage targets in isolation while sharing the global network that serves end customers. Running AI services in their own isolated compute and storage partitions enables the backend and frontend clusters to be unified into a single network, significantly reducing network costs (a configuration sketch follows this list).
  • Shared Network Infrastructure: Adopt a unified scale-out network that supports all workloads, eliminating the separation into frontend and backend networks. This integration allows all services to run on a common platform, streamlining operations. Enterprises and edge environments are increasingly adopting cloud networking principles. We should drive towards a standardized cloud-like "baseline" design across all Service Clouds.
  • Scaling across Service Clouds: Networking solutions must be scalable across cloud, enterprise, and edge environments, supporting multiple speed grades for different deployment scenarios and accommodating differing protocols—such as high-performance Remote Direct Memory Access (RDMA) for clouds and robust TCP for unreliable edge networks.
  • AI Service Capabilities: AI serving requires strict service guarantees to meet throughput and latency requirements. Key features from the Ultra Ethernet Consortium (UEC) specification, such as multi-pathing, selective retransmission, switch telemetry, and receiver-driven credits, are crucial for supporting Service Level Agreements (SLAs). These AI-specific enhancements should be integrated without deviating significantly from industry standards, to maintain compatibility and ease of deployment.
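
To make the partitioning and SLA recommendations above more tangible, here is a minimal sketch (an illustrative data model of my own, not any vendor's SDN API) of how a single shared fabric could be declared as isolated logical partitions, each with its own compute, storage, and guaranteed bandwidth:

```python
from dataclasses import dataclass, field

@dataclass
class Partition:
    """A logical slice of the shared scale-out fabric (VRF/VLAN-style isolation)."""
    name: str
    vlan: int
    compute_nodes: list = field(default_factory=list)
    storage_targets: list = field(default_factory=list)
    min_bandwidth_gbps: int = 0   # guaranteed share used for SLA enforcement

# One physical network, three isolated tenants -- the pattern recommended above.
partitions = [
    Partition("ai-training", vlan=100,
              compute_nodes=[f"acs-{i}" for i in range(0, 16)],
              storage_targets=["nvmeof-pool-a"], min_bandwidth_gbps=400),
    Partition("ai-inference", vlan=200,
              compute_nodes=[f"acs-{i}" for i in range(16, 24)],
              storage_targets=["nvmeof-pool-b"], min_bandwidth_gbps=200),
    Partition("classic-apps", vlan=300,
              compute_nodes=[f"cpu-{i}" for i in range(0, 64)],
              storage_targets=["ceph-pool"], min_bandwidth_gbps=100),
]

def render_intent(parts: list) -> dict:
    """Flatten the partitions into a declarative intent an SDN controller could consume."""
    return {p.name: {"vlan": p.vlan,
                     "nodes": p.compute_nodes,
                     "storage": p.storage_targets,
                     "min_gbps": p.min_bandwidth_gbps} for p in parts}

print(render_intent(partitions))
```

The point of the sketch is the shape of the intent: one fabric, many isolated partitions, each with its own bandwidth guarantee, rather than physically separate frontend and backend networks.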

Consolidating frontend and backend networks integrates AI workflows with existing software processes like Retrieval-Augmented Generation (RAG), knowledge bases, and graph traversals. As we transition to multi-agent workflows and compound models, a unified yet partitionable network offers the flexibility to deploy diverse applications across different hardware and cluster sizes. This infrastructure combines both AI-specific innovations and traditional software workflows.

In summary, the next-gen service cloud requires an advanced, general-purpose network, not a custom AI backend network. This presents a major opportunity to redefine service cloud networking for the future.


Conclusion

The approach to defining scale-out networking products differs between Frontier Factories and Service Clouds:

Defining product requirements for the Frontier Factory vs. Service Cloud.

While Frontier Factories rely on specialized, custom networking solutions optimized for high performance, Service Clouds require a more generalized and scalable approach.

The Service Clouds are not AI-ready today. The market for AI-ready Service Clouds will grow rapidly, becoming significantly larger than Frontier Factories—potentially 10 to 100 times bigger—due to the widespread adoption of AI services across all industries. By rethinking networking for Service Clouds, and layering AI capabilities onto a single, unified networking infrastructure, we can deliver AI + software services at scale, unlocking the full potential of AI applications to the broader market.

Links


  1. Andrew Ng's Multi-Agent Talk at Sequoia Capital (YouTube video)
  2. Enfabrica Blog
  3. Microsoft Singularity Paper
  4. Multi-Datacenter Training: OpenAI's Ambitious Plan To Beat Google's Infrastructure (SemiAnalysis)
  5. Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws
  6. Efficient Memory Management for Large Language Model Serving with PagedAttention
  7. ZeroNIC Paper


Glossary of Terms

  1. All-to-All Collectives: Communication patterns in distributed computing where every node communicates with every other node, essential for certain parallel processing tasks in AI training.
  2. Asynchronous Training: Training method where updates to model parameters occur without waiting for all nodes to synchronize, improving efficiency in large-scale systems.
  3. Backend Network: In AI infrastructures, a specialized network optimized for high-bandwidth, low-latency communication between GPUs or compute nodes during tasks like training.
  4. Diffusion Models (DMs): A class of generative models in machine learning that learn data distributions by simulating the reverse of a diffusion process.
  5. Fine-Tuning: The process of taking a pre-trained AI model and adjusting it with additional training data to specialize it for specific tasks.
  6. Frontend Network: In AI infrastructures, the network responsible for external communications, including data ingress and egress, user interactions, and administrative tasks.
  7. Graph Databases: Databases that use graph structures with nodes, edges, and properties to represent and store data, facilitating complex queries and relationships.
  8. Large Language Models (LLMs): AI models trained on extensive text datasets to understand and generate human-like language, capable of tasks like translation, summarization, and question-answering.
  9. Load Balancing: Distributing workloads evenly across multiple computing resources to optimize performance, improve resource utilization, and prevent resource bottlenecks.
  10. Multi-Agent Compound Models: AI models that involve multiple interacting agents or components working together to perform complex tasks or simulations.
  11. Multipathing: Establishing multiple network paths between devices to provide redundancy and avoid congested paths.
  12. Network Partitioning: Dividing a network into multiple segments or partitions using technologies like Software-Defined Networking (SDN) to isolate workloads and improve security and performance.
  13. RAG (Retrieval-Augmented Generation): A method in AI where information retrieval is combined with text generation to produce contextually relevant and accurate responses.
  14. RDMA (Remote Direct Memory Access): Allows direct memory access from the memory of one computer into that of another without involving the operating system, reducing latency.
  15. Receiver-Driven Credits: A mechanism where the data receiver controls the flow of data by providing credits to the sender, indicating how much data it can handle.
  16. RoCE (RDMA over Converged Ethernet): A network protocol that enables RDMA over Ethernet networks, combining the efficiency of RDMA with the ubiquity of Ethernet.
  17. Scale-Out Fabric/Network: The external network infrastructure that connects multiple ACS nodes in a cluster, facilitating communication and data exchange across the system.
  18. Scale-Up Fabric/Network: The internal interconnect within an ACS node that links multiple GPUs together for high-speed, low-latency communication.
  19. Selective Retransmission: A network error correction method where only the specific lost or corrupted data packets are retransmitted, enhancing efficiency.
  20. Software-Defined Networking (SDN): A networking approach that uses software-based, logically centralized controllers to manage network resources dynamically and programmatically, improving flexibility and efficiency.
  21. SSMs (State-Space Models): AI models used for sequence modeling and time-series analysis, representing data through states and transitions.
  22. Synchronous Training: Training method where updates to model parameters occur after all nodes have synchronized, ensuring consistency at the expense of speed.
  23. TCP (Transmission Control Protocol): A fundamental internet protocol that provides reliable, ordered, and error-checked delivery of data between applications.

