Enfabrica: Accelerating AI GPU Communication

Massive datasets are the lifeblood of AI models, fueling training and enabling accurate predictions. This insatiable hunger for data has profound implications for the underlying network infrastructure, pushing the boundaries of traditional compute and networking architectures.

The Current State of AI Computer Networking Architecture

Present-day AI networking relies heavily on a hierarchical structure of interconnected components. This hierarchy typically comprises:

  • GPUs, which process vast amounts of data in parallel.
  • PCIe switches, which connect multiple GPUs within a server and facilitate communication and data exchange.
  • RDMA NICs (Remote Direct Memory Access network interface cards), which enable direct memory access between GPUs across different servers, minimizing CPU involvement and improving data-transfer speeds.
  • Network switches, which form the backbone of the leaf-spine network, connecting servers and carrying traffic across the data center.


Traditional AI Computer Network Architecture (Credit: Enfabrica)

This traditional approach, while functional, suffers from several key limitations that hinder the scalability and efficiency of AI workloads:

  • Inter-GPU Communication Bottlenecks: As the number of GPUs in a cluster increases, the hierarchical nature of the network creates bottlenecks. Data often has to traverse multiple switches and NICs, adding latency and reducing overall throughput.
  • Limited Bandwidth and Resilience: Existing architectures struggle to keep up with the exponential growth in bandwidth demands of AI workloads. Moreover, single points of failure, such as cable drops, can disrupt entire training jobs, leading to costly checkpoints and restarts.
  • Lack of Composability: Traditional architectures lack the flexibility to support diverse AI applications that require different combinations of compute and memory resources. This rigidity hinders innovation and adaptability in AI development.
  • Escalating Total Cost of Ownership (TCO): Scaling AI infrastructure with traditional components significantly increases TCO due to the cost of networking hardware, power consumption, and cooling requirements.

Enfabrica's Solution: The Accelerated Compute Fabric

Enfabrica proposes a radical departure from the conventional approach with its Accelerated Compute Fabric (ACF) technology. ACF embraces a "MegaNIC" concept, consolidating the functionality of PCIe switching, RDMA, and first-tier network switching into a single, high-bandwidth, highly resilient device.

The ACF achieves its remarkable performance and efficiency through a unique architectural design. The solution integrates multiple high-speed Ethernet NICs, interconnected by internal crossbar switches. These switches create a high-bandwidth, non-blocking fabric that allows data to flow seamlessly between any connected port. A key innovation is the separation of packet header processing and payload transfer. The NICs within the ACF process packet headers and make forwarding decisions, while the payload data is directly transferred between endpoints via DMA (Direct Memory Access), bypassing the NICs and minimizing latency. This approach allows for extremely efficient data movement, crucial for the demands of AI workloads.
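
The header/payload split described above can be illustrated with a toy model (purely a sketch of the idea, not Enfabrica's implementation): only the small header is inspected to make a forwarding decision, while the large payload is handed to the destination by reference, standing in for a DMA transfer that never copies data through the NIC path.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    header: dict          # small: parsed to make the forwarding decision
    payload: memoryview   # large: moved by DMA, never copied through the NIC

def forward(packet: Packet, crossbar_ports: dict) -> None:
    """Toy model: inspect only the header; hand the payload to the
    destination port by reference (standing in for a DMA transfer)."""
    dest_queue = crossbar_ports[packet.header["dst"]]
    dest_queue.append(packet.payload)  # zero-copy hand-off

ports = {"gpu0": [], "gpu1": []}
data = memoryview(bytearray(1 << 20))  # 1 MiB payload buffer
forward(Packet(header={"dst": "gpu1"}, payload=data), ports)
assert ports["gpu1"][0] is data        # same buffer object: no copy was made
```

The point of the sketch is that the cost of the forwarding decision is independent of payload size, which is what makes the approach attractive for the large transfers typical of AI workloads.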


Enfabrica Accelerated Compute Fabric (Credit: Enfabrica)

The ACF’s architecture provides:

  • Converged PCIe and Ethernet Crossbar: ACF integrates PCIe switching and Ethernet networking capabilities, creating a direct, low-latency path for data transfer between GPUs and across the network. This consolidation eliminates intermediate hops, reducing latency and boosting performance.
  • Massive Bandwidth and Path Diversity: ACF provides a substantial increase in bandwidth, supporting up to 3.2 terabits per second on the network side and 5 terabits per second on the host/accelerator side. This bandwidth capacity, coupled with multipath capabilities, ensures high throughput and mitigates the impact of component failures.
  • Programmable Transport and Congestion Control: ACF incorporates a programmable transport layer that operates on a standard CPU, enabling customers to customize congestion control mechanisms and tailor network behavior to specific workloads. This flexibility enables efficient scaling and adaptation to evolving demands.
  • Composability and Heterogeneity: ACF's architecture supports diverse compute and memory resources, including GPUs, CPUs, storage, and CXL-attached memory. This enables the creation of tailored systems optimized for specific AI applications, fostering innovation and adaptability.
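
The programmable-transport idea can be sketched with a toy pluggable congestion-control hook (illustrative only; the article does not describe Enfabrica's actual transport API). The fabric exposes a policy callback, and a workload can supply its own window-adjustment logic, here classic AIMD:

```python
from typing import Callable

# A policy maps (current_window, saw_congestion) -> new_window.
Policy = Callable[[int, bool], int]

def aimd(window: int, congested: bool) -> int:
    """Classic additive-increase / multiplicative-decrease."""
    return max(1, window // 2) if congested else window + 1

def run(policy: Policy, events: list[bool], start: int = 10) -> int:
    """Feed a sequence of congestion signals through the policy."""
    w = start
    for congested in events:
        w = policy(w, congested)
    return w

# Grow for four RTTs, then one congestion event halves the window.
final = run(aimd, [False, False, False, False, True])
print(final)  # (10 + 4) // 2 = 7
```

Because the policy is just software running on a standard CPU, a different workload could swap in a different function without any hardware change, which is the flexibility the bullet above is pointing at.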

With Enfabrica’s ACF, each GPU is directly connected to all Ethernet interfaces in the chip rather than to a single NIC. This raises the throughput available to each GPU to the full bandwidth of the fabric (3.2 Tbps). At AI Field Day 5, Enfabrica CEO Rochan Sankar said, “The role of a PCI networking card has no relevance in AI going forward.”
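
Some back-of-the-envelope arithmetic makes the difference concrete. The 3.2 Tbps fabric figure is from the talk; the 400 Gbps per-NIC bandwidth and the eight-GPU server are assumptions chosen to reflect a typical current deployment:

```python
# Compare the per-GPU network ceiling of one dedicated RDMA NIC with a
# shared 3.2 Tbps fabric that every GPU can reach over multiple paths.
FABRIC_GBPS = 3200     # ACF network-side bandwidth, 3.2 Tbps (from the talk)
NIC_GBPS = 400         # assumed bandwidth of one conventional RDMA NIC
GPUS_PER_SERVER = 8    # assumed GPU count per server

single_nic_ceiling = NIC_GBPS       # a GPU can never exceed its one NIC
fabric_burst_ceiling = FABRIC_GBPS  # a lone active GPU can use the whole fabric
fabric_fair_share = FABRIC_GBPS // GPUS_PER_SERVER  # share with all GPUs busy

print(single_nic_ceiling, fabric_burst_ceiling, fabric_fair_share)
# prints: 400 3200 400
```

Under these assumptions the fair share matches the single NIC, but any one GPU can burst to 8x that ceiling, and the loss of one link removes only a fraction of its bandwidth instead of all of it.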

In addition to AI training workloads, the ACF's high-bandwidth memory access capabilities can also benefit inference and Retrieval-Augmented Generation (RAG) by providing a large, shared memory pool accessible by multiple GPUs with low latency. “We think this is huge for RAG because RAG is effectively going to be about 75% the retrieval part and what this can do is effectively reduce and make the fleet more efficient,” Mr. Sankar said.

Potential Disadvantages of Enfabrica's Solution

While Enfabrica's solution offers compelling advantages, some potential disadvantages merit consideration:

  • Hardware Dependency: ACF requires modifications to existing server designs, making it incompatible with current off-the-shelf systems. This may hinder adoption, particularly for organizations with existing infrastructure investments.
  • Single Point of Failure: While ACF mitigates numerous points of failure through its multipath architecture, the device itself represents a single point of failure. A failure at the ACF level could disrupt all connected GPUs. Though the failure rate is estimated to be low due to the device's design, it's still a factor to consider.
  • Limited Compatibility: Enfabrica's decision to focus on existing InfiniBand verbs and RoCE, rather than incorporating Ultra Ethernet immediately, reflects a pragmatic approach driven by customer requirements – there is an urgent need for solutions that address the scalability challenges faced by current AI deployments. By prioritizing compatibility with established technologies, Enfabrica aims to provide a readily deployable solution for immediate needs, while keeping an eye on future advancements like Ultra Ethernet.

Why This Matters

AI workloads, particularly large language models, demand enormous amounts of data to be moved, processed, and stored. This data deluge necessitates high-bandwidth, low-latency architectures to avoid bottlenecks that can cripple AI performance.

Enfabrica, a startup focused on revolutionizing network infrastructure for AI, recognizes this challenge and proposes a radical shift in approach. Instead of treating networking as a peripheral concern, Enfabrica places it at the heart of AI computing, arguing that the network's role in AI is evolving beyond mere connectivity to become a critical performance and scalability determinant.

Enfabrica's core value proposition lies in its ability to address the key challenges of AI networking:

  • Reduced TCO: By collapsing multiple components into a single device and optimizing data flow, ACF significantly reduces the cost of AI infrastructure, freeing up resources for compute power.
  • Enhanced Performance: The high bandwidth, low latency, and multipath capabilities of ACF unlock the full potential of GPUs, accelerating training and inference tasks.
  • Improved Resilience: The robust architecture and failure recovery mechanisms of ACF minimize downtime and ensure consistent operation, vital for large-scale AI deployments.
  • Future-Proofing AI Infrastructure: The programmable transport layer and support for diverse compute and memory resources empower organizations to adapt to evolving AI workloads and future technologies.

Enfabrica's ACF represents a significant leap forward in AI networking, enabling the realization of increasingly complex and demanding AI applications. As AI continues to advance, solutions like Enfabrica's will play a crucial role in unlocking AI’s full potential and shaping the future of computing.

#AI #Networking #Innovation #DataInfrastructure #Enfabrica #AcceleratedComputeFabric #TechTrends #AIFieldDay5 #GPUComputing #FutureOfAI
