Inside DeepSeek's 10,000 GPU Cluster: How to Balance Efficiency and Performance in Network Architecture

At the start of 2025, DeepSeek, an open-source AI model from China, made a groundbreaking entry into the global AI landscape. Before DeepSeek came out, a conventional technical consensus in the AI field held that model performance was strictly proportional to computing power investment: the greater the computing power, the better the model's capabilities, particularly in large-scale model training and inference. For instance, the training of xAI's Grok-3 reportedly consumed 200,000 NVIDIA GPUs, with estimated costs reaching hundreds of millions of dollars. This paradigm created a significant dilemma for many companies, as they struggled to balance model performance, training costs, and hardware scalability. DeepSeek's arrival challenged this conventional wisdom, offering a new perspective on optimizing performance while managing resource constraints.

Through open-source models, algorithmic innovation, and cost optimization, DeepSeek has achieved high-performance, low-cost AI model development. It is reported that training the DeepSeek-V3 model cost only $5,576,000, using just 2,048 H800 GPUs.

Numerous articles have delved into DeepSeek's model-level optimizations; this article focuses on how DeepSeek maximizes cost-effectiveness in its network architecture design.

1. Hardware Infrastructure

In September 2024, DeepSeek first described its first-generation cluster network architecture in the paper "Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning". Although this cluster differs in GPU model and network size from the 2,048 H800 GPUs described in the DeepSeek-V3 report, meaning they are likely different clusters, it still shows how DeepSeek strives for cost-effectiveness in hardware infrastructure and network architecture.

GPU Server — PCIe A100

Around 2021, the dominant GPU server on the market was the NVIDIA A100. There were two options: the PCIe A100 GPU version and the DGX-A100 version.

  • PCIe A100 GPU: Adopts a standard PCIe 4.0 x16 interface, is compatible with mainstream servers and workstations, supports plug-and-play, and provides high deployment flexibility.
  • DGX-A100: Adopts the dedicated SXM4 interface, is usually used in high-performance computing clusters (e.g., DGX A100, HGX A100), and must be paired with NVIDIA-certified server systems or OEM-customized mainboards.

A single PCIe A100 server provides 8 A100 GPUs. Unlike the DGX-A100 server, which exposes 8 compute NICs plus a 1GbE management NIC externally, the PCIe A100 server exposes only 1 CX6 NIC. In addition, PCIe GPU servers offer somewhat lower cost and power consumption.

Switch — NVIDIA QM8700

The NVIDIA Quantum QM8700 series switch is a high-performance InfiniBand switch that excels in performance, power, and density. In a 1U form factor, it provides 40 × 200Gb/s ports and 16Tb/s of non-blocking bandwidth with very low latency.


Adapter — NVIDIA (Mellanox) CX6

The ConnectX-6 offers up to 200Gb/s per port with sub-600ns latency, supporting both InfiniBand and Ethernet.


Optical Interconnects — 200G HDR

Based on the GPU servers, switches, and adapters selected for this cluster, the following 200G HDR optical interconnect products are required:

  • Optical Module: 200G QSFP56 SR4/FR4 modules (for switch-to-switch interconnect);
  • DAC: 200G QSFP56 DAC cables (short-distance switch-to-NIC connections within a rack, ≤3m);
  • AOC: 200G QSFP56 AOC cables (cross-rack or flexible cabling scenarios, ≤30m);
  • Fiber Cable: OM4 multi-mode (short-distance) or OS2 single-mode (long-distance) fiber, matched to the transmission requirements of the optical modules.

Note that it is important to ensure the entire link is compatible with original NVIDIA (Mellanox) products to achieve 200Gb/s lossless network performance. In AI clusters, particularly in large-scale distributed training scenarios, optical modules must meet two core performance metrics: low Bit Error Rate (BER) and low latency. A high BER can cause link jitter, degrading cluster performance and large-model training, and may directly disrupt a company's services. Low latency ensures efficient model training and fast inference response times, enhancing both network reliability and stability.

NADDOD's InfiniBand optics feature Broadcom VCSELs and Broadcom DSPs, ensuring reliable performance under demanding conditions. Powered by advanced algorithm optimization, NADDOD InfiniBand NDR/HDR transceivers achieve a pre-FEC BER of 1E-8 to 1E-10 and error-free transmission post-FEC, matching the performance of original NVIDIA products.
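
To put these BER figures in perspective, here is a minimal back-of-the-envelope sketch in Python of what a given pre-FEC BER means at the 200Gb/s line rate discussed above. The line rate and BER values are the ones quoted in this section; the script itself is purely illustrative, not a vendor tool:

    # What does a pre-FEC BER mean at a 200 Gb/s line rate?
    # Line rate and BER figures are taken from the text above.
    LINE_RATE_BPS = 200e9  # 200 Gb/s HDR link

    def errored_bits_per_second(ber: float) -> float:
        """Expected raw bit errors per second before FEC correction."""
        return ber * LINE_RATE_BPS

    for ber in (1e-8, 1e-10):
        print(f"pre-FEC BER {ber:.0e}: ~{errored_bits_per_second(ber):,.0f} errored bits/s")

    # pre-FEC BER 1e-08: ~2,000 errored bits/s
    # pre-FEC BER 1e-10: ~20 errored bits/s
    # Both rates are well within what link-level FEC can correct, which is
    # why post-FEC transmission can be effectively error-free.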

In addition, all the InfiniBand products undergo thorough testing to ensure seamless compatibility with NVIDIA hardware, firmware and software configurations. For hardware, NADDOD supports NVIDIA CX6/CX7 series NICs, Quantum/Quantum-2 series switches, DGX systems, and more. For firmware and software, NADDOD products are fully integrated with NVIDIA's InfiniBand ecosystem, including UFM.


2. Network Architecture

DeepSeek used the classic Fat-Tree topology and InfiniBand technology to build its primary network architecture. The cluster consists of 10,000 A100 GPUs, spanning approximately 1,250 GPU compute nodes, nearly 200 storage servers, 122 200G InfiniBand switches, and the accompanying optical interconnect products.

In this architecture, there are two zones. The leaf switches of the two zones are directly interconnected by two 40-port switches (referred to here as zone switches), without going through the spine switches within each zone. In other words, the two 40-port switches connect to all 80 leaf switches.

Each zone contains:

  • 20 spine switches and 40 leaf switches, with full-mesh connectivity between spines and leaves.
  • 800 nodes (GPU nodes and storage nodes, plus some management nodes).
  • 40 ports per leaf switch: 20 ports to the spine switches, 1 port to a zone switch in the middle, and 15 or 16 ports to GPU nodes, i.e., 40 × 15 = 600 to 40 × 16 = 640 GPU nodes per zone (the arithmetic is checked in the sketch below).

(Based on the total number of storage nodes mentioned in the paper, it can be assumed that on average 2 to 3 storage nodes connect to each leaf switch, with each storage node containing 2 × 200Gbps NICs. It is unclear whether a storage node's two NICs connect to different leaf switches.)
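
The port and switch bookkeeping above is easy to verify. Below is a minimal Python sketch of the arithmetic; every constant comes from the description in this section, and the storage/management split is the assumption stated in the note above:

    # Two-zone architecture bookkeeping (all constants from the text above).
    LEAF_PORTS = 40        # QM8700: 40 x 200 Gb/s ports
    ZONES = 2
    LEAVES_PER_ZONE = 40
    SPINES_PER_ZONE = 20
    ZONE_SWITCHES = 2      # the two 40-port switches bridging the zones

    uplinks = 20           # full mesh toward the zone's 20 spines
    zone_link = 1          # one port toward a zone switch
    downlinks = LEAF_PORTS - uplinks - zone_link  # 19 ports left for nodes

    # Per the text, 15-16 of those downlinks go to GPU nodes; the remainder
    # (an assumption) serves storage and management nodes.
    gpu_nodes_per_zone = (LEAVES_PER_ZONE * 15, LEAVES_PER_ZONE * 16)  # (600, 640)

    total_switches = ZONES * (LEAVES_PER_ZONE + SPINES_PER_ZONE) + ZONE_SWITCHES
    print(f"GPU nodes per zone: {gpu_nodes_per_zone[0]}-{gpu_nodes_per_zone[1]}")
    print(f"Total switches: {total_switches}")  # 122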

3. Cost and Performance

DeepSeek's PCIe A100 architecture demonstrates significant cost control and performance advantages over the NVIDIA DGX-A100 architecture.

First, compared to the NVIDIA DGX-A100 architecture (see Table II of the paper), the PCIe A100 architecture achieves approximately 83% of the performance in the TF32 and FP16 GEMM benchmarks at approximately 60% of the GPU cost and energy consumption; in other words, roughly 0.83 / 0.60 ≈ 1.4× the performance per dollar. The roughly 40% lower energy consumption also reduces CO2 emissions.

Second, the DGX-A100 cluster described in the paper serves a network of 10,000 access points using a three-layer Fat-Tree topology. It requires 320 core switches, 500 spine switches, and 500 leaf switches, for a total of 1,320 switches. DeepSeek's two-zone integrated architecture, by contrast, requires only 122 switches to meet its cluster networking requirements (as shown in Table III), a configuration that is significantly more cost-effective. Even when compared to a similarly sized three-layer Fat-Tree network with 1,600 access points, comprising 40 core switches and 160 spine and leaf switches (200 switches in total), the two-zone integrated architecture saves 40% of the network cost.

Solution comparison (see the sketch after this list):

  • DeepSeek two-zone: (40 + 20) × 2 + 2 = 122 switches.
  • PCIe architecture + 3-layer Fat-Tree: 1 NIC per node, so 1,600 / 20 = 80 leaf switches, 80 spine switches, and 40 core switches. Total: 200 switches.
  • DGX-A100 + 3-layer Fat-Tree: Each node contains 8 GPUs with 8 backend network NICs, so 10,000 GPUs (NICs) require at least 10,000 / (40/2) = 500 40-port leaf switches, 500 40-port spine switches, and 320 core switches (considering the full mesh, not 250). Total: 1,320 switches.
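
To make the comparison reproducible, here is a small Python sketch that recomputes the three switch counts under the stated assumptions: 40-port switches throughout, half of each leaf's ports facing endpoints, and the core-switch counts as given above.

    # Recompute the switch counts for the three designs above.
    def two_zone(leaves=40, spines=20, zones=2, zone_switches=2):
        return zones * (leaves + spines) + zone_switches

    def three_layer_fat_tree(endpoints, ports=40, core_switches=0):
        leaves = endpoints // (ports // 2)  # half of each leaf's ports face endpoints
        spines = leaves                     # matching spine layer, per the text
        return leaves + spines + core_switches

    print("DeepSeek two-zone:          ", two_zone())                                    # 122
    print("PCIe + 3-layer Fat-Tree:    ", three_layer_fat_tree(1_600, core_switches=40)) # 200
    print("DGX-A100 + 3-layer Fat-Tree:", three_layer_fat_tree(10_000, core_switches=320))  # 1,320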

Conclusion

DeepSeek's success proves that high-performance AI can be achieved by optimizing algorithms and architectures rather than simply stacking more hardware. Its open-source strategy further promotes openness and community-driven innovation in AI technology. This could become the new paradigm for future AI development.

Focusing on high-performance AI networking, NADDOD specializes in complete network solutions for large-scale AI training and inference. With years of experience in InfiniBand architecture design, protocol optimization, and cluster deployment, NADDOD's experts provide full-stack InfiniBand network solutions that help customers significantly improve training efficiency and reduce operations and maintenance costs. To learn more about InfiniBand solutions and products, please contact us at [email protected].
