Mixture of Expert Models and Scale-Up
Courtesy: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mixture-of-experts


Mixture of expert (MoE) models are fast replacing dense models in LLMs, and for good reason. Traditional dense LLMs—like the early versions of ChatGPT—rely on a single, massive neural network in which every parameter is activated for every token or input sequence. MoE flips this script by breaking the model (in some layers) into a team of smaller, specialized sub-networks (experts) and using a gating mechanism to decide which experts handle each part of the input.

Imagine you’re feeding an LLM the sentence: “The quantum entanglement experiment succeeded.” In a dense model, every neuron (parameter) is involved in processing it. In an MoE LLM, the gating network might send “quantum entanglement” to an expert trained on scientific jargon and “experiment succeeded” to an expert good at general reasoning, and so on. Only the chosen experts fire up while the rest chill, saving compute resources.

In this short article, I discuss the basics of MoE, what is involved in training these models, and the pros and cons of using large scale-up systems to train them and run inference on them.


MoE Basics

Most MoE LLMs ride on the Transformer framework, the bedrock of modern LLMs. They typically consist of:

  • Experts: These are smaller feed-forward networks (FFNs) swapped into the Transformer layers. Instead of one giant FFN per layer, there is a pool of experts (say, 8, 32, or even 128) per layer.


  • Gating Network: A lightweight system that scores each expert based on the input token or context and picks the top-k experts (often k=1 to 8) to "route" the input token to. Thus, only a fraction of the model’s parameters (e.g., 10-20%) activate per token, even if the total parameter count is in the trillions.


  • Shared Experts: Some MoE models also have one or more shared experts. A shared expert is an additional FFN, distinct from the "routed" experts. Unlike the routed experts, of which only a few are activated for each token, the shared expert is computed for every token passing through the layer. Its output is combined with the routed experts’ outputs (typically via addition or a weighted sum) to form the layer's final FFN result. The shared expert aims to capture common patterns across all tokens, complementing the specialized, sparse computation of the routed experts. A minimal code sketch of this routing follows below.
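To make the routing concrete, here is a minimal PyTorch sketch of such a layer: a gating network picks the top-k routed experts per token, and a shared expert processes every token. The dimensions, class names, and the simple loop-based dispatch are illustrative assumptions, not the implementation of any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A standard Transformer feed-forward block used as one expert."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    """Top-k routed experts plus one always-on shared expert."""
    def __init__(self, d_model=1024, d_ff=4096, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(Expert(d_model, d_ff) for _ in range(n_experts))
        self.shared_expert = Expert(d_model, d_ff)   # computed for every token
        self.gate = nn.Linear(d_model, n_experts)    # lightweight gating network
        self.top_k = top_k

    def forward(self, x):                            # x: [num_tokens, d_model]
        scores = F.softmax(self.gate(x), dim=-1)     # per-token score for each expert
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts

        out = self.shared_expert(x)                  # shared expert sees every token
        for slot in range(self.top_k):               # add the routed experts' contributions
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e         # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] = out[mask] + weights[mask, slot:slot + 1] * expert(x[mask])
        return out

# Usage: route 16 tokens through the layer; only 2 of the 8 routed experts fire per token.
tokens = torch.randn(16, 1024)
layer = MoELayer()
print(layer(tokens).shape)   # torch.Size([16, 1024])
```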



MoE Advantages

MoE models offer massive scale without the massive cost. With only a fraction of parameters activated per token, we get the benefits of a giant model without the full computational hit. Language is usually messy—technical papers, poetry, code, slang, and multilingual chats. Dense models average out their knowledge, which can make them jacks-of-all-trades but masters of none. MoE’s experts can specialize: one expert might ace Python syntax, another French idioms, and another medical terminology. The gating network routes tokens to the right expert, boosting accuracy across diverse domains. Further, just as in training, MoE keeps inference less computationally intensive because the active parameter count for any input token stays small even as the total model size grows. Fewer computations also mean lower energy consumption.

MoE Training

To understand what is involved in MoE training, let's take a closer look at DeepSeek R1, the model that took the world by storm.

DeepSeek V3, the 671B-parameter base model used in R1, was pre-trained on 14.8 trillion tokens using 2048 H800 GPUs, which took about 55 days. After that, R1 went through supervised fine-tuning, RL-based reasoning training, and other stages that are computationally intensive but do not need trillions of tokens; hence, they require only a small fraction of the GPU-hours of the 14.8T-token pre-training.
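As a quick sanity check using only the numbers quoted above (a rough estimate, not an official figure), the pre-training budget works out to roughly 2.7 million H800 GPU-hours:

```python
# Rough pre-training budget implied by the figures above (an estimate).
gpus, days = 2048, 55
gpu_hours = gpus * days * 24
print(f"{gpu_hours:,} GPU-hours")   # 2,703,360 -> ~2.7M H800 GPU-hours
```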

The large training dataset necessitates multiple model copies (data parallelism) trained in parallel, with gradient aggregation between the copies. Within each data-parallel group (model copy), the model is split into 16 pipeline stages (pipeline parallelism), and the 256 experts per layer are spread across 64 GPUs (expert parallelism), with four experts per GPU.
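The placement arithmetic is easy to verify from these figures (a small sketch using only the numbers above):

```python
# Expert placement implied by the quoted figures (illustrative check only).
total_experts_per_layer = 256
expert_parallel_gpus    = 64
print(total_experts_per_layer // expert_parallel_gpus)   # 4 experts hosted per GPU

top_k = 8                                                 # experts selected per token
print(top_k / total_experts_per_layer)                    # ~3% of routed experts touched per token
```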

For each token, a gating network learns to select up to 8 experts to process it. When partitioning the model, DeepSeek kept each expert entirely within a GPU, eliminating tensor parallelism for the matrix multiplications inside each expert. This is a significant step! If you recall, tensor parallelism involves sharding large matrix multiplications across the GPUs of a tensor-parallel group; each GPU computes partial results that must be summed across the group (an all-reduce) to obtain the final result. This process requires high-bandwidth communication, which is quite challenging to hide efficiently behind other computation. Frameworks often keep the GPUs of a tensor-parallel group within a single server so that this traffic can use NVLinks, which offer almost nine times the bandwidth of the standard Ethernet/InfiniBand fabric links connecting GPU servers to the scale-out network.
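For readers who want to see what that partial-result summation looks like, here is a minimal sketch of a Megatron-style row-parallel linear layer. The function name and shapes are illustrative, and it assumes a torch.distributed process group (the tensor-parallel group) has already been initialized.

```python
# Minimal sketch of tensor parallelism for one linear layer: each rank holds a
# shard of the weight along the inner dimension, and the partial outputs are
# summed with an all-reduce across the tensor-parallel group.
import torch
import torch.distributed as dist

def row_parallel_linear(x_shard: torch.Tensor, w_shard: torch.Tensor) -> torch.Tensor:
    """x_shard: [tokens, d_in / tp], w_shard: [d_in / tp, d_out] on this rank."""
    partial = x_shard @ w_shard                        # local partial result
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)     # sum partials across the TP group
    return partial                                     # full [tokens, d_out] on every rank
```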

Compared to tensor parallelism, sending tokens to small subsets of experts and collecting the activations back is relatively less bandwidth-intensive. Additionally, several techniques exist to hide the expert-layer communication behind the parallel computation of the shared expert or other layers.
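One common overlap trick (sketched below, assuming known send/receive split sizes and an initialized process group; the function and variable names are hypothetical) is to launch the token dispatch as an asynchronous all-to-all and compute the shared expert while it is in flight:

```python
# Sketch: overlap the expert-parallel token dispatch with the shared-expert compute.
import torch
import torch.distributed as dist

def dispatch_with_overlap(tokens, send_splits, recv_splits, shared_expert):
    recv_buf = tokens.new_empty(sum(recv_splits), tokens.shape[-1])
    work = dist.all_to_all_single(          # scatter tokens to expert-parallel peers
        recv_buf, tokens,
        output_split_sizes=recv_splits,
        input_split_sizes=send_splits,
        async_op=True,
    )
    shared_out = shared_expert(tokens)      # overlap: shared expert runs meanwhile
    work.wait()                             # routed tokens are now on their expert GPUs
    return recv_buf, shared_out
```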

While there are no published results on how many Blackwell (B200) GPUs this training would require, using a conservative rule of thumb (B200 has ~2x the memory and ~3x the compute of an H800), I believe fewer than 1K B200 GPUs would be needed to train the DeepSeek V3 model.
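Spelling out that rule of thumb (a rough estimate, not a benchmark):

```python
# Back-of-the-envelope B200 count under the assumed ~3x compute ratio above.
h800_gpus = 2048
compute_ratio = 3                     # assumed B200 : H800 compute ratio
print(round(h800_gpus / compute_ratio))   # ~683 B200 GPUs, i.e. well under 1K
```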


MoE Training Challenges / The Case for Rack-Scale

Now, the next question is: do we really need a 72-GPU scale-up domain (as in NVL72) for training MoE models of this size and complexity? It made a lot of sense to build these large scale-up systems when the tensor-parallel GPUs extended beyond a single 8-GPU server. However, by eliminating tensor parallelism in favor of expert parallelism, this high-bandwidth communication bottleneck is reduced somewhat.

What would be the gains (in cost and power per training run) if DeepSeek R1 had been trained on a scale-up system that allowed all experts within a layer (64 GPUs in DeepSeek R1) to exchange results over a scale-up network? It is hard for me to do a theoretical analysis without expertise in model training. Once NVL72 systems are widely available to model developers, I hope to see some results soon.


Illustration of NVL72 with 72 x B200 GPUs connected through NVLinks to the switch fabric.

However, I think having a large rack-scale system does have advantages, for several reasons. In standard dense-model training (without MoE), GPUs typically communicate via structured patterns such as all-reduce or scatter-gather (to combine partial matrix-multiplication results) or point-to-point pipelining (to pass activations between sequential model partitions). The traffic pattern is identical for each token/batch and repeats for every iteration.

In MoE models, by contrast, each forward pass involves dynamic expert routing of the tokens, as discussed above. At certain layers, a gating mechanism assigns each token to one or more “expert” networks, which may reside on different GPUs. Essentially, each token is distributed to a different subset of GPUs, and the outputs from these GPUs are gathered back to continue through the model.

This constitutes a fundamentally different traffic pattern from dense training. The communication pattern and volume can vary from batch to batch, depending on which experts the tokens are routed to. As a result, MoE training introduces irregular, data-dependent communication. Since expert assignment is input-token-dependent, the network traffic in MoE training changes with each iteration. In a dense model, the same communication pattern (e.g., an all-reduce of a fixed-size tensor) occurs at every step. In an MoE model, one batch might send many tokens to, say, Expert #3 on GPU 5 (creating heavy traffic to that GPU), while the next batch sends far fewer to that expert. Over numerous batches, the load per GPU in the expert-parallel group may average out, but at shorter timescales, the communication is less regular.
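The toy simulation below (synthetic routing scores, not measured traffic) illustrates this batch-to-batch imbalance: each batch's content happens to favor a different subset of experts, so per-expert token counts swing noticeably between batches.

```python
# Toy illustration of batch-to-batch expert load imbalance (synthetic data).
import torch

n_experts, tokens_per_batch, top_k = 64, 4096, 8
for batch in range(3):
    bias = 2.0 * torch.randn(n_experts)                    # this batch favors some experts
    logits = torch.randn(tokens_per_batch, n_experts) + bias
    chosen = logits.topk(top_k, dim=-1).indices            # top-k expert picks per token
    load = torch.bincount(chosen.flatten(), minlength=n_experts)
    print(f"batch {batch}: expert 3 got {load[3].item()} tokens "
          f"(min {load.min().item()}, max {load.max().item()})")
```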

This dynamic, irregular traffic is a key difference: standard models leverage predictable communication topologies, while MoE requires the system to manage on-the-fly data routing to the GPUs hosting the selected experts. Thus, MoE training introduces extra, variable inter-GPU communication stages (token scatter/gather) that do not exist in dense models. These stages involve all GPUs in an expert-parallel group exchanging data simultaneously after the gating decision, creating a pattern that is more communication-intensive than the localized exchange seen in pipeline-parallel GPU groups. While this communication is not as bandwidth-intensive as the exchange of partial matrix-multiplication results in tensor parallelism, it can still benefit from tight coupling between the GPUs. Although there are several ways to hide some of this communication behind computation, coupling the experts through high-bandwidth links improves efficiency and reduces overall training time. Thus, if DeepSeek R1 were trained on NVL72 systems, an entire model copy could be hosted inside a single NVL72 system, and the experts could exchange tokens and activations at almost nine times the bandwidth of traditional Ethernet links in scale-out domains, significantly reducing training time.

This debate between large scale-up systems and scale-out designs built from 8-GPU servers sounds similar to the ongoing debate on the networking side (modular chassis vs. fixed-form-factor devices). One could argue that the larger each GPU scale-up system is, the fewer of these systems are necessary to build the training network, simplifying data center design—with fewer cables, less rack space occupied, etc.

However, this does require a substantial initial investment in terms of cost and infrastructure for power delivery and cooling. NVL72 systems, with a price tag of $3-4 million, are 10-12 times more expensive than the 8-GPU servers using the same B200 GPUs.

How about Inference?

Nvidia, interestingly, claims it was able to run inference for DeepSeek R1 on a single NVIDIA HGX H200 system (8 x H200 GPUs), delivering 3,872 tokens per second. The white paper includes a small, vague note saying that the next-generation Blackwell architecture will greatly enhance test-time scaling on reasoning models like DeepSeek-R1.

While a single server could host a DeepSeek R1 model (or models of a similar size, like LLaMA) and deliver thousands of tokens per second, there is more involved in inference these days than iteratively running the model over the input tokens.

LLM inference has shifted from focusing solely on scaling model sizes to enhancing reasoning abilities during inference with inference-time compute, test-time scaling, and various other techniques. All of this necessitates many GPUs working collaboratively to produce the final results without making users wait for long while the model is churning. It is painful to wait and watch while some of these models talk to themselves (called "reasoning"). In that respect, having more GPUs tightly coupled, which lets multiple queries to the model run in parallel, allows the inference server to yield faster results and a better user experience.

However, as discussed before, these rack-scale systems are quite expensive. And if an enterprise requires more of them as part of its inference network, they still need to be connected through a backend fabric anyway, so the switch fabric is not completely eliminated. In that respect, starting with fewer servers (8-GPU nodes) and adding more to the fabric as needed is cost-efficient and simpler to manage and upgrade. While hyperscalers might repurpose training clusters for inference or have no constraints on hosting rack-scale systems, for enterprises there is no one-size-fits-all.

Ultimately, the choice depends heavily on the enterprise's needs—budget constraints, expected scalability, and the complexities of rack management. However, having the option to perform efficient inference of state-of-the-art reasoning models within a single server is certainly a game changer for everyone!


Jyothsna Nadig

Consultant, Strategist, and Career Counsellor | Product Roadmap, Customer Solutions, Business Strategy

5 days ago

Thanks Sharada for a great article. Just as in real world, from what I understand the MOE is more a practical approach where it would make sense to break the info needed to be fetched from respective expert teams than to go for all which is a waste of resource and time. Infact, the real world also has Depts. with specific expertise which is what we are getting to... For your statement - "In an MoE model, one batch might send many tokens to, say, Expert #3 on GPU 5 (creating heavy traffic to that GPU), while the next batch sends far fewer to that expert." ==> just wondering, say if the Expert #N are available across all GPUs in a way that still ensures fairer traffic distribution, so in this case instead of sending many tokens to Expert #3 on GPU 5, one token is sent to Expert #3 is on every GPU possible; however I believe in this case, there is more collation of data needed post analysis vis-a-vis the queuing or wait time loss from a single expert.


Great article as always Sharada Yeluri. Thank you! MoE - by breaking the model into a team of smaller experts - should usher back in horizontal scaling, the hallmark of efficiency in data center infrastructure designs. Instead of NVL72, NVL8 or even NVL2 can be better. Couple that with an ability to better manage communication and computation overlap opens opportunities to unify scale-up and scale-out into a unified domain (mentioned as aspirations in the Deepseek R1 technical report). The argument for a unified domain is more efficient IMO than tight coupling across a larger set of GPUs, especially in the context of MoE. I also think that the argument in favor of large systems like NVL72 racks - fewer cables, less rack space occupied - sounds like an argument for mainframes, and is contrary to the fundamental shift in how [quoting you] "only a fraction of parameters (are) activated per token, we get the benefits of a giant model without the full computational hit."

Excellent article! Thank you for sharing your thoughts.

Mark Stansberry

Today's Information for Tomorrow's Technology

2 weeks ago

My thought to date is No we don't need AI GPU cluster scale out. This might put a dent in AI data center capex plans, but then again they are in an AI arms race. The more their competitors spend on AI, the more they spend to keep up.

Nitin Kumar

Innovative Leader Driving Revenue Across Key Market Segments | Expert in Routing Solutions & Customer Engagement | Quality Champion & Diversity Advocate | Mentor & Coach

2 weeks ago

Informative post in simple terms, as usual, Sharada Yeluri
