Practical Approach to Design Non-blocking, High-Performance Computing (HPC) Infrastructure for AI Workloads Clusters

Solution:

Designing a non-blocking, high-performance computing (HPC) infrastructure for AI workloads involves careful planning and selection of components to ensure that data flows seamlessly between compute resources, storage, and network without any bottlenecks.

We need to design a 16-GPU AI training cluster that provides high-performance computing (HPC) and low-latency communication between nodes.

Suppose each compute node is equipped with 2 GPUs and 8x 25G ports for data transfer. A total of 8 nodes is therefore required to meet the 16-GPU requirement.

To fulfil the technical requirements, 8 leaf switches (each with 32x 25G downlink ports and 2x 400G uplink ports) are needed. On each leaf switch, 8x 25G ports are allocated for server connectivity, while the 2x 400G uplink ports are used for spine connectivity.
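As a quick sanity check, this sizing arithmetic can be written out as a few lines of Python. The sketch below is illustrative only; the constants mirror the numbers in the example above, including the assumption that 8 of each leaf's 25G ports are reserved for server connectivity.

    import math

    REQUIRED_GPUS = 16
    GPUS_PER_NODE = 2
    NIC_PORTS_PER_NODE = 8      # 25G data ports on each compute node
    SERVER_PORTS_PER_LEAF = 8   # 25G leaf ports allocated to servers, as above

    nodes_needed = math.ceil(REQUIRED_GPUS / GPUS_PER_NODE)                 # 8
    server_ports_needed = nodes_needed * NIC_PORTS_PER_NODE                 # 64
    leaves_needed = math.ceil(server_ports_needed / SERVER_PORTS_PER_LEAF)  # 8

    print(f"Compute nodes required:  {nodes_needed}")
    print(f"Server-facing 25G ports: {server_ports_needed}")
    print(f"Leaf switches required:  {leaves_needed}")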

Additionally, 2 spine switches will be needed to connect the leaf switches. This configuration will use a VXLAN-EVPN fabric in a spine-leaf architecture, as shown in the figure below:


Figure: Non-Blocking Fabric Providing 16x GPU-to-GPU Connectivity
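The leaf-to-spine wiring in the figure can also be sketched programmatically. This is a minimal illustration, assuming the usual spine-leaf convention that each leaf's two 400G uplinks terminate on different spine switches; the switch names are placeholders.

    # Enumerate the fabric links implied by the figure: 8 leaves, 2 spines,
    # one 400G link from every leaf to every spine (8 x 2 = 16 fabric links).
    LEAVES = [f"leaf{i}" for i in range(1, 9)]
    SPINES = ["spine1", "spine2"]

    fabric_links = [(leaf, spine, "400G") for leaf in LEAVES for spine in SPINES]

    for leaf, spine, speed in fabric_links:
        print(f"{leaf} <-> {spine} ({speed})")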


In the figure above, each GPU node (server) connects to the leaf switches over its 25G ports. Let's evaluate the capacity of this fabric to ensure it meets the AI cluster's requirements for GPU-to-GPU connectivity:

  • Total Leaf Switches: 8, each equipped with 32x 25G ports. Assuming each leaf switch is fully populated, the downlink bandwidth per leaf would be 32x 25G = 800G.
  • Total Spine Switches: 2, with each leaf switch having 2x 400G uplinks, giving an uplink bandwidth per leaf of 2x 400G = 800G.
  • Oversubscription Ratio: downlink bandwidth to uplink bandwidth is 800G/800G, which gives a 1:1 oversubscription ratio (see the sketch after this list).
  • Total GPU Nodes: 8, each with 8x 25G ports and 2 GPUs, for a total of 16 GPUs.
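The per-leaf bandwidth figures in the list above reduce to simple arithmetic; the sketch below is a minimal check using only the port counts and speeds already stated, with no vendor-specific behaviour assumed.

    # Per-leaf bandwidth and oversubscription check for the fabric above.
    DOWNLINK_PORTS, DOWNLINK_SPEED_G = 32, 25   # 25G server-facing ports per leaf
    UPLINK_PORTS, UPLINK_SPEED_G = 2, 400       # 400G spine-facing ports per leaf

    downlink_gbps = DOWNLINK_PORTS * DOWNLINK_SPEED_G   # 800
    uplink_gbps = UPLINK_PORTS * UPLINK_SPEED_G         # 800
    ratio = downlink_gbps / uplink_gbps                 # 1.0 -> 1:1, non-blocking

    print(f"Downlink per leaf: {downlink_gbps}G")
    print(f"Uplink per leaf:   {uplink_gbps}G")
    print(f"Oversubscription:  {ratio:.0f}:1")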

Assuming the fabric is fully populated, with every 25G downlink port connected to a node equipped with 2 GPUs, the 8 leaf switches provide 8 x 32 = 256 downlink ports, so the fabric can support up to 512 GPUs while retaining the 1:1 oversubscription ratio, ensuring a non-blocking fabric that meets the required performance for GPU-to-GPU connectivity.
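The 512-GPU figure follows from the same numbers; the sketch below assumes, as the scaling statement above does, that every 25G downlink port is populated and each connected node carries 2 GPUs.

    # Maximum scale of the fabric at a 1:1 ratio when every downlink is populated,
    # assuming one 25G downlink per 2-GPU node.
    LEAF_SWITCHES = 8
    DOWNLINK_PORTS_PER_LEAF = 32
    GPUS_PER_NODE = 2

    total_downlink_ports = LEAF_SWITCHES * DOWNLINK_PORTS_PER_LEAF   # 256
    max_gpus = total_downlink_ports * GPUS_PER_NODE                  # 512

    print(f"Fully populated fabric supports up to {max_gpus} GPUs at 1:1")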

Disclaimer: The above example is vendor-agnostic.

Author: Altaf Ahmad
