The Gaudi 3 (Intel AI) Cluster is Pretty Neat
Tony Grayson
Defense, Business, and Technology Executive | VADM Stockdale Leadership Award Recipient | Ex-Submarine Captain | LinkedIn Top Voice | Author | Top 10 Datacenter Influencer | Veteran Advocate |
AI accelerators such as Intel's Gaudi 3 are crucial for advancing AI training and inference, but their effectiveness hinges on the architecture of the clusters they are part of. The Gaudi team's decision to build on Ethernet, enhanced with RDMA via the RoCE protocol extensions, marks a strategic divergence from the industry's traditional reliance on InfiniBand.
Traditionally, InfiniBand has been the go-to choice for high-performance computing environments because of its low latency and high throughput. Intel's choice of RoCE-enabled Ethernet for Gaudi 3 instead rests on several strategic factors, which become clear once you look at how these clusters are actually built.
Each Gaudi 3 node is an eight-way configuration capable of delivering up to 14.7 petaflops at FP8 precision. These nodes use OSFP links for high-speed data transmission, with retimers needed to handle the doubled signaling rates reliably. Each Gaudi 3 accelerator exposes 24 Ethernet ports, 21 of which are dedicated to a dense all-to-all mesh that provides the high-bandwidth communication between the accelerators within a node.
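To make that port budget concrete, here is a back-of-the-envelope sketch in Python using only the figures above. One caveat: the 200 GbE per-port speed is my assumption (it is the commonly cited figure for Gaudi 3's Ethernet ports), not something stated in this article.

```python
# Back-of-the-envelope check of the Gaudi 3 intra-node port budget,
# using the figures quoted above: 8 accelerators per node, 24 ports
# per accelerator, 21 of them reserved for the in-node all-to-all mesh.
# ASSUMPTION: each port runs at 200 GbE (not stated in the article).

ACCELERATORS_PER_NODE = 8
PORTS_PER_ACCELERATOR = 24
SCALE_UP_PORTS = 21          # ports feeding the all-to-all mesh
PORT_SPEED_GBPS = 200        # assumed per-port speed

peers = ACCELERATORS_PER_NODE - 1          # 7 peers in a full mesh
links_per_peer = SCALE_UP_PORTS // peers   # 21 / 7 = 3 parallel links per pair

# Bandwidth between any two accelerators inside the node
pair_bw_gbps = links_per_peer * PORT_SPEED_GBPS

# Ports left over per accelerator for the scale-out network
scale_out_ports = PORTS_PER_ACCELERATOR - SCALE_UP_PORTS

print(f"{links_per_peer} links per accelerator pair "
      f"-> {pair_bw_gbps} Gb/s between any two accelerators")
print(f"{scale_out_ports} ports per accelerator left for scale-out")
```

The arithmetic works out cleanly: 21 ports across 7 peers gives three parallel links per accelerator pair, leaving three ports per accelerator to reach the leaf switches described next.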
When scaling up, these nodes are grouped into sub-clusters, with a typical sub-cluster consisting of sixteen Gaudi 3 nodes. The networking within a sub-cluster relies on high-performance switches such as Broadcom's StrataXGS Tomahawk 5, which supports up to 51.2 Tb/sec of aggregate switching capacity. Each switch's ports are split into two halves: one facing the servers at 800 Gb/sec per port, the other connecting upward to the spine network, providing scalability and redundancy.
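A quick sketch of what that split implies, using only the numbers quoted above (51.2 Tb/sec of capacity carved into 800 Gb/sec ports, divided evenly between downlinks and uplinks):

```python
# Port split for a Tomahawk 5-class leaf switch, per the numbers above:
# 51.2 Tb/s of switching capacity, 800 Gb/s per port, half of the ports
# facing the Gaudi 3 servers and half facing the spine layer.

SWITCH_CAPACITY_TBPS = 51.2
PORT_SPEED_GBPS = 800

total_ports = int(SWITCH_CAPACITY_TBPS * 1000 / PORT_SPEED_GBPS)  # 64 ports
downlinks = total_ports // 2          # 32 ports toward the servers
uplinks = total_ports - downlinks     # 32 ports toward the spine

print(f"{total_ports} x {PORT_SPEED_GBPS} Gb/s ports: "
      f"{downlinks} downlinks, {uplinks} uplinks")
```

An even 32/32 split yields a 1:1 ratio between server-facing and spine-facing bandwidth, a sensible choice for the all-to-all-heavy traffic patterns of distributed AI training.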
For larger deployments, the architecture expands to multiple sub-clusters. To scale to 4,096 Gaudi 3 accelerators across 512 server nodes, the design links 32 sub-clusters by interconnecting 96 leaf switches with three banks of sixteen spine switches. This arrangement provides multiple independent paths for inter-node communication, which is critical for maintaining data integrity and system availability across large-scale computing tasks.
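Here is the same fabric reduced to arithmetic as a sanity check on the totals. A minimal sketch: the leaves-per-sub-cluster figure is simply 96 divided by 32, a derivation on my part rather than a number from the article.

```python
# Rough totals for the 4,096-accelerator fabric described above:
# 32 sub-clusters of 16 nodes (8 accelerators each), 96 leaf switches,
# and three banks of 16 spine switches.

SUB_CLUSTERS = 32
NODES_PER_SUB_CLUSTER = 16
ACCELERATORS_PER_NODE = 8
LEAF_SWITCHES = 96
SPINE_BANKS = 3
SPINES_PER_BANK = 16

nodes = SUB_CLUSTERS * NODES_PER_SUB_CLUSTER            # 512 server nodes
accelerators = nodes * ACCELERATORS_PER_NODE            # 4,096 accelerators
spines = SPINE_BANKS * SPINES_PER_BANK                  # 48 spine switches
leaves_per_sub_cluster = LEAF_SWITCHES // SUB_CLUSTERS  # 3 (derived, not quoted)

print(f"{nodes} nodes / {accelerators:,} accelerators")
print(f"{LEAF_SWITCHES} leaf switches ({leaves_per_sub_cluster} per sub-cluster), "
      f"{spines} spine switches in {SPINE_BANKS} banks")
```

The three spine banks are what give each leaf multiple upward paths, so a failed spine or link reroutes traffic rather than partitioning the cluster.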
In the context of inference, where rapid response times are crucial, RoCE-enabled Ethernet in Gaudi 3 clusters improves data throughput and reduces latency between nodes, directly benefiting real-time AI applications. This network setup allows efficient data exchange across nodes, which matters for deploying models that require real-time inference, such as those used in video analysis and online transaction systems.
Furthermore, Intel has published performance comparisons favoring Gaudi 3 over Nvidia's H100: in training large models such as Llama 2 and GPT-3, Intel projects speedups ranging from 1.4X to 1.7X. Those gains, if borne out, would underscore how effectively Ethernet can move data between nodes for tasks that require extensive data sharing, such as training large AI models.
By integrating advanced Ethernet capabilities instead of relying on InfiniBand, Intel's Gaudi 3 AI accelerators reflect a strategic adaptation to modern data centers' evolving demands and infrastructures. This approach ensures compatibility with broader network environments and enhances the cost-effectiveness and scalability of AI operations, paving the way for more widespread adoption and deployment of AI technologies.
From an article elsewhere: "Intel's Gaudi 3 may be a potentially attractive alternative to the H100 if Intel can hit an ideal price (which Intel has not provided, but an H100 reportedly costs around $30,000–$40,000) and maintain adequate production. AMD also manufactures a competitive range of AI chips, such as the AMD Instinct MI300 Series, that sell for around $10,000–$15,000." https://arstechnica.com/information-technology/2024/04/intels-gaudi-3-ai-accelerator-chip-may-give-nvidias-h100-a-run-for-the-money/ Those prices need to be cut by AT LEAST an order of magnitude if there is to be any hope of widespread involvement from rank-and-file academic researchers, not just the PIs [Principal Investigators] at the top of the grant-funding pile. Absent that, the societal impact of this tech will be decided exclusively inside well-funded tech firms. Can anyone think of any adverse consequences of tech giants deciding widespread societal impact?