Mixture of Expert Models and Scale-Up
Sharada Yeluri
Sr. Director of Engineering, Silicon and Systems Technology, @ Juniper Networks
Mixture of experts (MoE) models are fast replacing dense models in LLMs, and for obvious reasons. Traditional dense LLMs—like the early versions of ChatGPT—rely on a single, massive neural network where every parameter gets activated for every token or input sequence. MoE flips this script by breaking the model (in some layers) into a team of smaller, specialized sub-networks (experts) and using a gating mechanism to decide which experts handle each part of the input.
Imagine you’re feeding an LLM a sentence: “The quantum entanglement experiment succeeded.” In a dense model, every neuron (parameter) is involved in processing it. In an MoE LLM, the gating network might send “quantum entanglement” to an expert trained in scientific jargon and “experiment succeeded” to an expert good at general reasoning, and so on. Only those chosen experts fire up, while the rest chill, saving compute resources.
In this short article, I discuss the basics of MoE, what is involved in training, and the pros and cons of using large scale-up systems to train and run inference on these models.
MoE Basics
Most MoE LLMs ride on the Transformer framework, the bedrock of modern LLMs. They typically consist of standard Transformer blocks in which some (or all) of the feed-forward (FFN) layers are replaced by MoE layers. Each MoE layer holds a set of expert FFNs plus a small gating (router) network that scores every token and dispatches it to its top-k experts; some designs also include a shared expert that processes every token.
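To make the routing concrete, here is a minimal, self-contained sketch of an MoE feed-forward layer with top-k gating, written in plain NumPy. The layer sizes, expert count, and top_k value are arbitrary illustrative choices, not taken from any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoELayer:
    """Toy MoE feed-forward layer: a gating network picks top_k experts per token."""
    def __init__(self, d_model=16, d_ff=32, n_experts=8, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        self.gate = rng.standard_normal((d_model, n_experts)) * 0.02   # router weights
        self.w1 = rng.standard_normal((n_experts, d_model, d_ff)) * 0.02
        self.w2 = rng.standard_normal((n_experts, d_ff, d_model)) * 0.02

    def forward(self, tokens):                      # tokens: [n_tokens, d_model]
        scores = softmax(tokens @ self.gate)        # router scores: [n_tokens, n_experts]
        topk = np.argsort(-scores, axis=-1)[:, :self.top_k]
        out = np.zeros_like(tokens)
        for t, token in enumerate(tokens):
            for e in topk[t]:                       # only the chosen experts "fire up"
                h = np.maximum(token @ self.w1[e], 0.0)      # expert FFN (ReLU)
                out[t] += scores[t, e] * (h @ self.w2[e])    # weight output by router score
        return out, topk

layer = MoELayer()
y, routing = layer.forward(np.random.default_rng(1).standard_normal((4, 16)))
print(routing)   # which experts each of the 4 tokens was routed to
```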
MoE Advantages
The MoE models offer massive scale without the massive cost. With only a fraction of parameters activated per token, we get the benefits of a giant model without the full computational hit. Language is messy—technical papers, poetry, code, slang, and multilingual chats. Dense models average out their knowledge, which can make them jacks-of-all-trades but masters of none. MoE’s experts can specialize: one expert might ace Python syntax, another French idioms, and another medical terminology. The gating network routes tokens to the right expert, boosting accuracy across diverse domains. Further, just as in training, MoE keeps inference less computationally intensive, because the active parameter count for any input token stays small even as the total model size grows. Fewer computations also mean energy savings.
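As a rough illustration of the "fraction of parameters per token" point, the arithmetic below plugs in DeepSeek-V3's published figures (671B total parameters, roughly 37B activated per token); the exact ratio is model-specific, so treat this as an example rather than a general rule.

```python
total_params = 671e9      # DeepSeek-V3 total parameters (per its technical report)
active_params = 37e9      # parameters activated per token (shared + top-8 routed experts)

print(f"active fraction: {active_params / total_params:.1%}")   # ~5.5%
# Per-token compute scales roughly with the active parameters, so each token costs
# closer to a ~37B dense model than to a 671B one, even though the full model is 671B.
```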
MoE Training
To understand what is involved in MoE training, let's take a closer look at DeepSeek R1, the model that took the world by storm.
DeepSeek V3, the 671B-parameter base model used in R1, was pre-trained on 14.8 trillion tokens using 2048 H800 GPUs. This pre-training took about 55 days. After that, R1 went through supervised fine-tuning (SFT), RL-based reasoning training, etc., which are computationally intensive but do not need trillions of tokens. Hence, they require only a small fraction of the GPU hours compared to pre-training on 14.8T tokens.
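A quick sanity check on the ~55-day figure, assuming the roughly 2.66M H800 GPU-hours that the DeepSeek-V3 technical report quotes for pre-training (that GPU-hour figure is theirs; the division is the only thing done here):

```python
pretrain_gpu_hours = 2.664e6   # H800 GPU-hours reported for DeepSeek-V3 pre-training
n_gpus = 2048

wall_clock_hours = pretrain_gpu_hours / n_gpus
print(f"{wall_clock_hours:.0f} hours ~= {wall_clock_hours / 24:.0f} days")   # ~54 days
```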
The large training data set necessitates multiple model copies (data parallelism) that are trained in parallel, with gradient aggregation between the copies. Within each data parallel group (model copy), the model is split into 16 pipeline stages (pipeline parallelism), and the 256 experts per MoE layer are distributed across 64 GPUs (expert parallelism), with four experts per GPU.
For each token, a gating network learns to select up to 8 experts to process it. When partitioning the model, they kept each expert entirely within a GPU, eliminating tensor parallelism for the matrix multiplications within each expert. This is a significant step! If you recall, tensor parallelism involves sharding large matrix multiplications across the GPUs of a tensor parallel group. Each GPU computes a partial result, and the partial results are summed across the group (an all-reduce) to form the final output. This requires high-bandwidth communication, which is quite challenging to hide efficiently behind other computations. Frameworks and compilers therefore usually keep the GPUs of a tensor parallel group within a single server, so that this traffic can use NVLink and get almost nine times the bandwidth of the standard Ethernet/InfiniBand fabric links connecting GPU servers to the scale-out network.
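For readers who want to see why tensor parallelism forces this extra communication, here is a tiny NumPy sketch of a sharded matrix multiplication: each "GPU" holds a slice of the weight matrix, computes a partial product, and the partial products must be summed across the group (the all-reduce step). The shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))        # activations: [tokens, d_in]
w = rng.standard_normal((8, 6))        # full weight matrix: [d_in, d_out]

# Shard the weights row-wise across two "GPUs" (and split x column-wise to match).
w0, w1 = w[:4, :], w[4:, :]
x0, x1 = x[:, :4], x[:, 4:]

partial0 = x0 @ w0                     # computed on GPU 0
partial1 = x1 @ w1                     # computed on GPU 1

# The all-reduce step: partial results must be exchanged and summed across GPUs.
result = partial0 + partial1
assert np.allclose(result, x @ w)      # matches the unsharded computation
```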
Compared to tensor parallelism, sending tokens to small subsets of experts and collecting the activations back is less bandwidth-intensive. Additionally, several techniques exist to overlap the expert-layer communication with parallel computation of the shared expert or other layers.
While there are no published results on how many Blackwell (B200) GPUs this training would require, using a conservative rule of thumb (a B200 has roughly 2x the memory and 3x the compute of an H800), I believe fewer than 1K B200 GPUs are needed to train the DeepSeek V3 model.
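For what it's worth, the back-of-the-envelope math behind that estimate, using the same rule-of-thumb factors (which are assumptions, not measured numbers):

```python
h800_gpus = 2048
compute_ratio = 3.0    # assumed: one B200 ~ 3x the compute of an H800
b200_gpus = h800_gpus / compute_ratio
print(f"~{b200_gpus:.0f} B200 GPUs")   # ~683, i.e., well under 1K
# Memory (~2x per GPU) is assumed not to be the binding constraint here.
```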
MoE Training Challenges / The Case for Rack-Scale
Now, the next question is: do we really need a 72-GPU scale-up domain (as in the NVL72) for training MoE models of this size and complexity? It made a lot of sense to build these large scale-up domains when the tensor parallel group extended beyond a single 8-GPU server. However, by eliminating tensor parallelism in favor of expert parallelism, this high-bandwidth communication bottleneck is reduced somewhat.
What would be the gains (in cost and power per training run) if DeepSeek R1 had been trained on a scale-up system that allowed all experts within a layer (64 GPUs in DeepSeek R1) to exchange results over a scale-up network? It is hard for me to do a theoretical analysis without expertise in model training. Once NVL72 systems are widely available to model developers, I hope to see some results soon.
However, I think a large rack-scale system does have advantages, for several reasons. In standard dense model training (without MoE), GPUs typically communicate via structured collective patterns such as all-reduce or all-gather/reduce-scatter (to exchange partial matrix multiplication results and gradients) or point-to-point pipelining (to pass activations between sequential model partitions). The traffic pattern is identical for each token/batch and repeats for every iteration.
In MoE models, by contrast, each forward pass involves dynamic expert routing of the tokens, as discussed above. At certain layers, a gating mechanism assigns each token to one or more “expert” networks, which may reside on different GPUs. Essentially, each token is distributed to a different subset of GPUs, and the outputs from these GPUs are gathered back to continue through the model.
This constitutes a fundamentally different traffic pattern from dense training. The communication pattern and volumes can vary from batch to batch, depending on which experts tokens are routed to. As a result, MoE training introduces irregular, data-dependent communication. Since expert assignment is input-token-dependent, the network traffic in MoE training changes with each iteration. In a dense model, the same communication pattern (e.g., an all-reduce of a fixed-size tensor) occurs at every step. In an MoE model, one batch might send many tokens to, say, Expert #3 on GPU 5 (creating heavy traffic to that GPU), while the next batch sends far fewer to that expert. Over numerous batches, the load per GPU in the expert parallel group may average out, but at shorter timescales, the communication is less regular.
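The batch-to-batch variability is easy to see in a toy simulation: route random batches of tokens to eight experts with a skewed router and count how many tokens land on each expert (i.e., on each hosting GPU). The router model and all the numbers below are made up purely to illustrate the effect.

```python
import numpy as np

rng = np.random.default_rng(42)
n_experts, top_k, tokens_per_batch = 8, 2, 4096

for batch in range(3):
    # Fake router logits: a random per-batch bias makes some experts "hot" in that batch.
    logits = rng.standard_normal((tokens_per_batch, n_experts)) + rng.standard_normal(n_experts)
    chosen = np.argsort(-logits, axis=-1)[:, :top_k]          # top-k experts per token
    load = np.bincount(chosen.ravel(), minlength=n_experts)   # tokens sent to each expert
    print(f"batch {batch}: tokens per expert = {load.tolist()}")
```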
This dynamic, irregular traffic is a key difference: standard models leverage predictable communication topologies, while MoE requires the system to manage on-the-fly data routing to the GPUs hosting the selected experts. Thus, MoE training introduces extra, variable inter-GPU communication stages (token scatter/gather) that do not exist in dense models. These stages involve all GPUs in an expert parallel group exchanging data with the gating function simultaneously, creating a pattern that is more communication-intensive than the localized exchanges seen in pipeline parallel GPU groups. While this communication is not as bandwidth-intensive as the exchange of partial matrix multiplication results in tensor parallelism, it can still benefit from tight coupling between the GPUs. Although there are several ways to hide some of this communication behind computation, tightly coupling the experts through high-bandwidth links improves efficiency and reduces overall training time. Thus, if DeepSeek R1 were trained on NVL72 systems, the entire model copy could be hosted inside a single NVL72 system, and the experts could communicate with the gating network at nine times the bandwidth of traditional Ethernet links in scale-out domains, significantly reducing training time.
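To put a rough number on that scatter/gather traffic, the sketch below estimates the bytes a single token contributes per MoE layer, assuming DeepSeek-V3-like dimensions (hidden size 7168, top-8 routing) and BF16 activations; these are ballpark assumptions for illustration, not measured values.

```python
hidden_size = 7168      # assumed model dimension (DeepSeek-V3-like)
top_k = 8               # routed experts per token
bytes_per_value = 2     # BF16 activations

dispatch_bytes = hidden_size * bytes_per_value * top_k       # token copies sent to experts
combine_bytes = dispatch_bytes                               # expert outputs gathered back
print(f"~{(dispatch_bytes + combine_bytes) / 1024:.0f} KiB per token per MoE layer")  # ~224 KiB
```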
This debate between large scale-up systems and scale-out designs built from 8-GPU servers sounds similar to the ongoing debate on the networking side (modular chassis vs. fixed-form-factor devices). One could argue that the larger the scale of each GPU scale-up system, the fewer of these systems are necessary to build the training network, simplifying data center design—with fewer cables, less rack space occupied, and so on.
However, this does require a substantial initial investment in terms of cost and infrastructure for power delivery and cooling. NVL72 systems, with a price tag of $3-4 million, are 10-12 times more expensive than the 8-GPU servers using the same B200 GPUs.
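Taking the article's price points at face value (both the NVL72 price and the 10-12x multiple are rough figures), the per-GPU cost comparison looks like this:

```python
nvl72_price = 3.5e6          # assumed mid-point of the $3-4M NVL72 price tag
server_multiple = 11         # NVL72 assumed ~10-12x the price of an 8-GPU B200 server
server_price = nvl72_price / server_multiple

per_gpu_nvl72 = nvl72_price / 72
per_gpu_server = server_price / 8
print(f"NVL72: ${per_gpu_nvl72/1e3:.0f}K/GPU  vs  8-GPU server: ${per_gpu_server/1e3:.0f}K/GPU")
# Roughly a 20-25% per-GPU premium for the rack-scale system, before power and cooling costs.
```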
How about Inference?
Nvidia, interestingly, claims it was able to perform inference for DeepSeek R1 using a single NVIDIA HGX H200 system (8 x H200 GPUs), delivering 3,872 tokens per second. The white paper includes a small "vague" note that says the next-generation Blackwell architecture will greatly enhance test-time scaling on reasoning models like DeepSeek-R1.
While a single server can host a DeepSeek R1 model (or models of a similar size, like LLaMA) and deliver thousands of tokens per second, there is more involved in inference these days than iteratively running the model over the input tokens.
LLM inference techniques have shifted from focusing solely on scaling model sizes to enhancing reasoning abilities during inference with inference-time compute, test-time scaling, and various other techniques. All of this requires many GPUs working collaboratively to produce the final results without making users wait too long while the model churns. It is painful to wait and watch while some of these models talk to themselves (called "reasoning"). In that respect, having more GPUs tightly coupled, which lets multiple queries to the model run in parallel, allows the inference server to yield faster results and a better user experience.
However, as discussed before, these rack-scale systems are quite expensive. And if an enterprise requires more of them as part of its inference network, they need to be connected through a backend fabric anyway, so the switch fabric is not completely eliminated. In that respect, starting with fewer servers (8-GPU nodes) and adding more to the fabric as needed is cost-efficient and simpler to manage and upgrade. While hyperscalers might repurpose training clusters for inference or face no constraints on hosting rack-scale systems, for enterprises there is no one-size-fits-all.
Ultimately, the choice depends heavily on the enterprise's needs—budget constraints, expected scalability, and the complexities of rack management. However, having the option to perform efficient inference of the state-of-the-art reasoning models within a server is certainly a game changer for everyone!
Thanks Sharada for a great article. Just as in the real world, from what I understand, MoE is a more practical approach: it makes sense to fetch the needed info from the respective expert teams rather than going to all of them, which is a waste of resources and time. In fact, the real world also has departments with specific expertise, which is what we are getting to... For your statement - "In an MoE model, one batch might send many tokens to, say, Expert #3 on GPU 5 (creating heavy traffic to that GPU), while the next batch sends far fewer to that expert." ==> just wondering: say Expert #N is available across all GPUs in a way that still ensures fairer traffic distribution, so instead of sending many tokens to Expert #3 on GPU 5, one token is sent to Expert #3 on every GPU possible; however, I believe in this case more collation of data is needed afterwards vis-a-vis the queuing or wait-time loss from a single expert.
Great article as always Sharada Yeluri. Thank you! MoE - by breaking the model into a team of smaller experts - should usher back in horizontal scaling, the hallmark of efficiency in data center infrastructure designs. Instead of NVL72, NVL8 or even NVL2 can be better. Coupling that with an ability to better manage communication and computation overlap opens opportunities to unify scale-up and scale-out into a single domain (mentioned as an aspiration in the DeepSeek R1 technical report). A unified domain is more efficient, IMO, than tight coupling across a larger set of GPUs, especially in the context of MoE. I also think that the argument in favor of large systems like NVL72 racks - fewer cables, less rack space occupied - sounds like an argument for mainframes, and is contrary to the fundamental shift in how [quoting you] "only a fraction of parameters (are) activated per token, we get the benefits of a giant model without the full computational hit."
Excellent article! Thank you for sharing your thoughts.
My thought to date is: no, we don't need AI GPU cluster scale-out. This might put a dent in AI data center capex plans, but then again, they are in an AI arms race. The more their competitors spend on AI, the more they spend to keep up.
Informative post in simple terms, as usual, Sharada Yeluri