Announcing IBM Cloud cluster network for NVIDIA accelerated computing
Written by Drew Thorstensen, Eran Gampel, Weiming Gu, Kiran Pillai and Seetharami Seelam
We recently announced the availability of NVIDIA H100 instance profiles on IBM Cloud. These instances are tuned and primed to support a range of AI workloads, such as large model inferencing, fine-tuning and training. However, as artificial intelligence (AI) deployments grow, so does the need to scale out the infrastructure across multiple nodes.
We are now making available a new top-level service to support multi-node scaling: our Cluster Network service. This service builds upon the foundation of IBM’s Vela infrastructure. After supercharging our network for internal workloads, we took steps to externalize this capability and make it available to users looking to scale out their AI workloads.
In a properly tuned AI system, there are at least 3 distinct networks.
The first network—referred to as the “Cloud Network” (item 1 in the diagram)—is well known. Cloud Network is a standard full-fledged network that provides a full feature set for IBM Cloud, such as Security Groups, Network Access Control Lists, Transit Gateways and more. It also provides access to Cloud Object Storage, File Storage, Block Storage and all the infrastructure that is needed to support your workloads. The Cloud Network effectively connects you to the IBM Cloud Infrastructure.
The GPUs in the NVIDIA H100-based server are also connected to each other over an ultra-fast link. The NVIDIA NVLink fabric extends directly into the virtual machine and provides a high-speed, point-to-point connection between the GPUs. IBM refers to this fabric as the “native accelerator fabric” (item 2 in the diagram).
Within the server, there is affinity on each PCIe bus: every NVIDIA H100 Tensor Core GPU is paired with an NVMe drive and a high-speed NIC.
This brings us to the final network, the dedicated cluster network (item 3), which allows the GPUs to communicate with each other over a high-speed channel. For AI training or fine-tuning workloads, scaling the network out is critical. This dedicated set of lanes allows the nodes to talk directly to each other.
One of IBM’s core design points for Cluster Networking is the ability to integrate different backend cluster networks for different solutions. What fits an NVIDIA H100 solution may be fundamentally different from what other cluster networks need. By separating out the cluster network abstraction, IBM Cloud retains the flexibility to apply fit-for-purpose, high-speed networks to workloads, while maintaining a full feature set via the cloud NIC.
AI traffic patterns are fundamentally different from those of a Cloud Network: they tend to be low-entropy, high-bandwidth, point-to-point flows that arrive as micro-stampedes. Our first goal is to make sure that performance is achieved.
While performance is critical, we also determined that resiliency and redundancy in the AI network is important. We wanted to help ensure that if a link failure occurred on a cluster NIC, the workload could slow down rather than fall back to its last checkpoint and lose the work.
Our cluster-enabled NVIDIA H100 servers are outfitted with eight dedicated 400 Gbps NVIDIA ConnectX-7 NICs, for an aggregate network throughput of 3.2 Tbps. That network is highly tuned for NVIDIA MLNX_OFED drivers and RoCE v2, and supports NVIDIA NCCL-based workloads. This is 8 times the throughput of our original, internal Vela (NVIDIA A100 Tensor Core GPU-based) supercomputer. The aggregate throughput of 8 NICs, with RoCE GDR and 4 or more queue pairs per NIC, can consistently reach 3.1 Tbps between two servers. This was measured with NVIDIA’s perftest bandwidth test and is very close (97%) to the theoretical maximum of 3.2 Tbps.
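To make the arithmetic concrete, here is a minimal sketch using the figures quoted above:

```python
# Aggregate cluster bandwidth for a cluster-enabled NVIDIA H100 server,
# using the figures quoted above.
NUM_CLUSTER_NICS = 8
NIC_SPEED_GBPS = 400            # per NVIDIA ConnectX-7 NIC

aggregate_gbps = NUM_CLUSTER_NICS * NIC_SPEED_GBPS
print(f"Theoretical aggregate: {aggregate_gbps / 1000:.1f} Tbps")   # 3.2 Tbps

measured_gbps = 3100            # perftest result between two servers
efficiency = measured_gbps / aggregate_gbps
print(f"Measured efficiency: {efficiency:.0%}")                     # ~97%
```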
By pairing the accelerator, NVMe and NIC all on a single PCIe bus, users can take advantage of technologies like NVIDIA GPUDirect. This technology allows the GPU to directly communicate with the NVMe or cluster NIC without sending traffic through the CPU. There is full bi-directional bandwidth between the GPU and the NIC.
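One way to confirm this pairing from inside an instance is to inspect the topology matrix that NVIDIA’s tooling reports (a minimal sketch, assuming nvidia-smi is available in the guest):

```python
import subprocess

# Print the GPU <-> NIC topology matrix. Entries such as PIX or PXB indicate
# devices sharing a PCIe switch or bus, which is what enables GPUDirect
# paths that bypass the CPU.
result = subprocess.run(["nvidia-smi", "topo", "-m"],
                        capture_output=True, text=True, check=True)
print(result.stdout)
```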
On the physical backend, IBM has been able to drive near line rate on many of our workloads. IBM Cloud, IBM Research and our partners worked closely to tune and optimize the switch buffers, isolation models and more.
Additionally, we have deployed a mechanism that logically isolates traffic flows, significantly reducing ECMP hash collisions. Each lane is optimized to deliver traffic to a specific peer at its destination. This creates a “rail-like” architecture atop our cloud backend.
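To illustrate why low-entropy AI flows benefit from this, consider a toy model (an illustration of the ECMP collision problem, not IBM’s actual algorithm): a few elephant flows hashed onto uplinks frequently land on the same link, while a rail-style assignment keeps them apart.

```python
import random
from collections import Counter

LINKS = 8
FLOWS = 8   # one elephant flow per NIC, i.e., low entropy

# ECMP: each flow's 5-tuple hashes to a pseudo-random link; two elephant
# flows sharing a link halve each other's bandwidth.
random.seed(7)
ecmp = Counter(random.randrange(LINKS) for _ in range(FLOWS))
print("ECMP links carrying >1 flow:", sum(1 for c in ecmp.values() if c > 1))

# Rail-style: flow i is pinned to link i, so every flow has a dedicated path.
rail = Counter(range(FLOWS))
print("Rail links carrying >1 flow:", sum(1 for c in rail.values() if c > 1))
```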
AI workloads are highly susceptible to failures. If a part fails in your cluster, workloads tend to roll back to checkpoints. While checkpoint restarts are being automated and reduced, they still present a loss of work. Users define their checkpoint intervals, and depending on how aggressive they are, they could stand to lose a substantial amount of work.
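To see why the checkpoint interval matters, note that a failure lands, on average, midway through an interval, so the expected loss is roughly half the interval multiplied by the cluster size. A back-of-the-envelope sketch (the numbers are illustrative, not measurements):

```python
# Expected GPU-hours lost to a single failure, assuming failures land
# uniformly within a checkpoint interval (illustrative numbers only).
def expected_lost_gpu_hours(checkpoint_interval_min: float, num_gpus: int) -> float:
    return (checkpoint_interval_min / 2) / 60 * num_gpus

# e.g., 64 GPUs checkpointing every 60 minutes lose ~32 GPU-hours per rollback.
print(expected_lost_gpu_hours(60, 64))
```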
It is our goal to reduce the friction of operational issues, like link failures. Our Software-Defined Network (SDN) solution allows us to deliver a more resilient cluster network. Each server is outfitted with dual-port NVIDIA ConnectX-7 NICs, with each port running at 200 Gbps (2x200 Gbps per NIC).
The SDN layer composes those two ports into a single 400 Gbps VF in the NVIDIA H100 instance. Within an NVIDIA H100 cluster network, users can configure each cluster NIC as 1x400, 2x200 or 4x100 Gbps. Irrespective of the configuration, the underlying traffic is spread across the dual physical links. If or when a link issue occurs, the traffic slows within the cluster instead of failing.
We extended this principle to our backend network. If a link between two switches fails, the logical rail design redirects the traffic accordingly. Instead of failing, the traffic may slow down due to reduced bandwidth capacity; if the NIC is carrying less than 200 Gbps, no slow-down is expected.
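This failover behavior reduces to simple arithmetic: each cluster NIC is backed by two 200 Gbps physical links, so losing one caps that NIC at 200 Gbps. A minimal sketch of that reasoning (illustrative only):

```python
LINK_GBPS = 200          # each cluster NIC is backed by two 200 Gbps links

def effective_gbps(offered_gbps: float, links_up: int = 2) -> float:
    """Traffic a cluster NIC can carry after link failures: capped, not failed."""
    return min(offered_gbps, links_up * LINK_GBPS)

print(effective_gbps(150, links_up=1))   # 150: under 200 Gbps, no slow-down
print(effective_gbps(400, links_up=1))   # 200: slows to the surviving link
```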
The NVIDIA H100 network utilizes a spine-leaf topology, with resiliency that is built to protect both layers. Whether there is an issue at the spine or leaf, network path failover occurs.
To achieve this redundancy while retaining the performance levels, IBM had to implement a technique in its aggregation layer.
Each dual-port NVIDIA ConnectX-7 NIC connects to a pair of leaf switches (8x per server). Each leaf switch connects to a set of aggregation switches. Within each aggregation switch, a Virtual Rail is created. This helps ensure that the queue pairs are balanced on the send and receive side. In our testing, this dramatically improves the performance compared to a traditional ECMP model.
Furthermore, IBM implemented a Virtual Rail Redundancy technique. Each rail is configured so that if a link failure occurs, it has an optimized failover path to another rail.
The leaf switches also use specialized algorithms to balance the traffic up to the aggregation switches, which improves the distribution of flows across the aggregation switch paths. When a given leaf-to-aggregation link is identified as congested, traffic is dynamically redistributed and the affected flow rebalances onto an open link.
These techniques have been critical in ensuring that these workloads deliver optimized performance, while retaining key resilience needs.
It was important to our team to keep the experience simple. We created a new Cluster Network Service to support this dedicated network. This service is intentionally light on features: it creates cluster network subnets, assigns IPs and attaches them to the instances it supports.
The focus is three-fold: isolation, performance and resilience. The broader capabilities are exposed through the cloud NIC, while this new service delivers cluster NICs built for performance.
If users want to take advantage of cluster networking on their NVIDIA H100 instances, they are required to provision at least 8 cluster NICs. This helps ensure proper distribution across the underlying physical infrastructure. Users can deploy 8, 16 or 32 cluster NICs if they want to increase the entropy on the backend.
Creating 8 cluster NICs by hand is a chore, and for power users who want to create more, it would be daunting. IBM built its UI to simplify this experience.
The first step is to create the user’s Cluster Network. Within the Cluster Network, the user must also create a set of Cluster Network Subnets. Because these subnets tend to be nearly identical, the UI creates them on behalf of the user. Those wanting deeper control can configure the subnets manually as well.
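Because the subnets differ only in their CIDR blocks, generating a plan is mechanical. Here is a hypothetical sketch of the kind of plan the UI builds on the user’s behalf; the base prefix, /24 sizing and names are assumptions, not IBM’s actual defaults:

```python
import ipaddress

def plan_cluster_subnets(base_cidr: str, count: int = 8) -> dict[str, str]:
    """Carve one /24 (assumed size) per cluster NIC out of a base prefix."""
    base = ipaddress.ip_network(base_cidr)
    subnets = list(base.subnets(new_prefix=24))[:count]
    return {f"cluster-subnet-{i}": str(net) for i, net in enumerate(subnets)}

# 8 subnets for the minimum of 8 cluster NICs on an H100 instance.
for name, cidr in plan_cluster_subnets("10.1.0.0/16", 8).items():
    print(name, cidr)
```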
After the subnets are created, NVIDIA H100 server instances that attach to the cluster network need to be provisioned. From the provisioning screen, the NVIDIA H100 instance profiles can be added to a cluster network. Doing so creates the correct set of Cluster Network Attachments for your VSI.
Cluster Networks on the NVIDIA H100 instance use SR-IOV to deliver the performance. Therefore, Cluster Networks can only be added to an instance while it is being provisioned or while it is stopped.
In the example above, my cluster network has 8 subnets. Therefore, each GPU will have a corresponding 400 Gbps VF connected to the instance once provisioned.
From within the VM, the cluster network presents itself as Virtual Functions (VFs) on the PCI bus. The underlying network cards are NVIDIA ConnectX-7 NICs.
The PCIe topology in an H100-based instance has the following structure:
You can see a few distinct blocks: a primary section where the boot disk, cloud NICs and data volumes are attached, and a section for each GPU where the corresponding Instance Storage NVMe disk, NVIDIA H100 GPU and NVIDIA ConnectX-7 VF are attached. If the user chooses additional VFs (for example, 16 or 32), each block will have 2 or 4 VFs.
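From inside the instance, one way to enumerate those VFs is through the standard Linux sysfs tree for RDMA devices (a sketch assuming the RDMA drivers discussed below are loaded):

```python
import os

# Each cluster NIC VF appears as an RDMA device under /sys/class/infiniband,
# with a "device" symlink back to its PCI address in the bus blocks above.
IB_ROOT = "/sys/class/infiniband"
for dev in sorted(os.listdir(IB_ROOT)):
    pci_addr = os.path.basename(os.path.realpath(os.path.join(IB_ROOT, dev, "device")))
    print(f"{dev}: PCI {pci_addr}")
```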
To use these VFs with RDMA, users need to ensure they have installed the NVIDIA MLNX_OFED drivers. NVIDIA recommends these drivers for its RDMA network, and they integrate tightly with the NCCL backend.
The MLNX_OFED drivers and instructions can be downloaded from the NVIDIA InfiniBand software page. For example, for a Linux-based installation, select Linux SW/Drivers in the left navigation panel. Then select the preferred MLNX_OFED version, OS distribution, OS version and architecture.
In this example, the following options were selected:
The instructions for installing this version of MLNX_OFED are in the User Manual.
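Once installed, a quick sanity check is to confirm the OFED version and that the VFs appear as RDMA devices. The NCCL variables shown are standard knobs for steering NCCL onto RoCE v2; the values here are common defaults, not IBM-specific guidance:

```python
import os
import subprocess

# Confirm the installed MLNX_OFED version and list the RDMA devices.
print(subprocess.run(["ofed_info", "-s"], capture_output=True, text=True).stdout)
print(subprocess.run(["ibv_devinfo", "-l"], capture_output=True, text=True).stdout)

# Standard NCCL settings for RoCE v2 fabrics. Verify the right values
# for your environment; these are typical, not IBM-specific, defaults.
os.environ["NCCL_IB_HCA"] = "mlx5"        # use the ConnectX-7 RDMA devices
os.environ["NCCL_IB_GID_INDEX"] = "3"     # GID index commonly mapped to RoCE v2
```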
AI workloads are scaling rapidly. Models are exploding in size, and the infrastructure has needed to scale to meet the challenge. In many cases, the progression has been: small GPU to large GPU to multiple large GPUs in a system (NVLink) to multiple large GPUs in many systems (clustering).
IBM’s latest cluster network infrastructure service sets the foundation to meet that scale. This solution is for anyone who needs multiple AI nodes connected on a robust, performant backend. Stay tuned as IBM adds support for new cluster-enabled solutions going forward!
Get access to NVIDIA H100 instances on IBM Cloud.