
Chapter 5: Mixture of Experts (MoE)

In the previous chapter, Mamba (State Space Model), we learned how to process incredibly long books by changing how the model remembers history.

But there is another problem with Large Language Models: Size. To make a model smarter, we usually add more neurons (parameters). However, making a model bigger makes it slower and more expensive to run.

What if you could build a massive brain, but only use a tiny fraction of it for any given task? This is the concept behind Mixture of Experts (MoE).

Motivation: The Hospital of Specialists

Imagine you are building a medical AI. A hospital doesn't send every patient to every doctor; a triage desk routes each patient to the one or two specialists best suited to their case. MoE works the same way: a router sends each token to a small subset of specialist sub-networks, and the rest of the "hospital" stays idle.

Use Case: Efficiently scaling up intelligence.

Models like Mixtral and DeepSeek use this to be incredibly smart (having many experts) while remaining incredibly fast (using only a few at a time).

What is Mixture of Experts?

MoE is a feature that lives inside the Transformer block. It replaces the standard "Feed Forward" (MLP) layer.

Instead of one single Feed Forward network, an MoE layer contains:

  1. Experts: A collection of many smaller Feed Forward networks.
  2. The Router (Gating Network): A smart switch that decides which Expert should handle the current word.

```mermaid
graph TD
    Input[Input Token: 'Math'] --> Router{Router}
    Router -- Probability High --> E1[Expert 1: Math]
    Router -- Probability Low --> E2[Expert 2: History]
    Router -- Probability Low --> E3[Expert 3: Coding]
    E1 --> Output[Output]
    style E1 fill:#f9f,stroke:#333
    style E2 fill:#eee,stroke:#333
    style E3 fill:#eee,stroke:#333
```

In this diagram, only Expert 1 works. Experts 2 and 3 do nothing, saving computation.
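The router's decision can be sketched in a few lines of NumPy. This is an illustrative toy, not Megatron's actual router code; `top_k_route` and its arguments are made-up names:

```python
import numpy as np

def top_k_route(token_vec, router_weights, k=2):
    """Score every expert for one token and keep only the top-k.

    token_vec:      (hidden,)          the token's hidden state
    router_weights: (hidden, n_expert) one learned column per expert
    """
    logits = token_vec @ router_weights      # one score per expert
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()                  # softmax over experts
    winners = np.argsort(probs)[::-1][:k]    # indices of the top-k experts
    # Renormalize so the surviving gate weights sum to 1
    weights = probs[winners] / probs[winners].sum()
    return winners, weights

rng = np.random.default_rng(0)
winners, weights = top_k_route(rng.normal(size=16), rng.normal(size=(16, 8)), k=2)
print(winners, weights)
```

Only the experts listed in `winners` run; every other expert is skipped entirely for this token.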

Using Megatron-LM: MoE Configuration

MoE is usually trained using the standard GPT (Decoder-only) script, but with specific flags that transform the layers into MoE layers.

A Simple Training Command

Here is how you turn a standard GPT training run into a small MoE run. The same recipe, scaled up, yields models like Mixtral 8x7B.

```shell
python pretrain_gpt.py \
    --num-layers 12 \
    --hidden-size 768 \
    --num-experts 8 \
    --moe-router-topk 2 \
    --expert-model-parallel-size 1 \
    --data-path my_data
```

Key MoE Arguments:

  1. --num-experts 8: We are creating 8 distinct specialists per layer.
  2. --moe-router-topk 2: For every word, the router picks the top 2 best experts to handle it.
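The savings these two flags buy are easy to compute. A rough sketch, assuming the Feed Forward layers dominate the parameter count (the units below are arbitrary):

```python
num_experts = 8
top_k = 2
ffn_params_per_expert = 100  # arbitrary units; any per-expert FFN size works

total_ffn = num_experts * ffn_params_per_expert   # parameters stored
active_ffn = top_k * ffn_params_per_expert        # parameters used per token

print(f"stored: {total_ffn}, active per token: {active_ffn}")
print(f"active fraction: {active_ffn / total_ffn:.0%}")  # 25%
```

With 8 experts and top-2 routing, every token pays the compute cost of only a quarter of the stored FFN parameters.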

Understanding the Input and Output

  1. Input: A sequence of tokens, e.g., ["The", "derivative", "of"].
  2. Processing: For each token, the router scores all 8 experts, sends the token to the top 2, and each chosen expert runs its small Feed Forward network.
  3. Output: The recombined results from the chosen experts, weighted by the router's confidence scores.

Under the Hood: The Internal Flow

How does the model decide which expert to use? It uses a Router.

The Router looks at the incoming token and assigns a "score" to each expert. It then picks the winners (Top-K).

```mermaid
sequenceDiagram
    participant T as Token ("Calculate")
    participant R as Router
    participant E_Math as Expert A (Math)
    participant E_Lit as Expert B (Lit)
    participant O as Output
    T->>R: "Who should handle me?"
    Note over R: Calculates Scores
    R->>R: Score A: 95%, Score B: 5%
    R->>E_Math: Send Data (Active)
    R--x E_Lit: (Skipped / Inactive)
    E_Math->>E_Math: Process Data
    E_Math->>O: Return Result
```
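The score-and-blend flow above can be sketched end to end. A hypothetical toy layer, where each "expert" is a single weight matrix standing in for a small MLP:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden, n_expert, k = 4, 3, 2

# Each "expert" is one weight matrix here (a stand-in for a small MLP)
experts = [rng.normal(size=(hidden, hidden)) for _ in range(n_expert)]
router_w = rng.normal(size=(hidden, n_expert))

def moe_forward(x):
    logits = x @ router_w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                    # softmax over experts
    top = np.argsort(probs)[::-1][:k]       # winners (top-k)
    gate = probs[top] / probs[top].sum()    # renormalized gate weights
    # Only the chosen experts run; their outputs are blended by the gate
    return sum(g * (x @ experts[i]) for g, i in zip(gate, top))

y = moe_forward(rng.normal(size=hidden))
print(y.shape)  # (4,)
```

The skipped expert's matrix is never multiplied, which is exactly where the compute savings come from.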

Diving into the Code: moe_layer.py

The core logic lives in megatron/core/transformer/moe/moe_layer.py.

1. The Wrapper (MoELayer)

This class replaces the standard MLP. It initializes the list of experts and the router.

```python
# megatron/core/transformer/moe/moe_layer.py

class MoELayer(MegatronModule):
    def __init__(self, config, ...):
        super().__init__()

        # 1. The Router: Decides where tokens go
        self.router = TopKRouter(config=config, ...)

        # 2. The Experts: A list of small MLPs
        # If num_experts=8, this manages 8 sub-networks
        self.experts = GroupedMLP(config=config, ...)
```

2. The Forward Pass (Routing)

When data comes in, the Router assigns inputs to experts.

```python
    def forward(self, hidden_states):
        # Step 1: Router decides scores and indices (winners)
        # scores: How confident the router is
        # indices: Which expert ID (0 to 7) to use
        scores, indices = self.router(hidden_states)

        # Step 2: Route the tokens to the correct experts
        # This acts like a traffic controller
        dispatched_input = self.token_dispatcher.token_permutation(
            hidden_states, scores, indices
        )
        # (the forward pass continues with Steps 3 and 4)
```

3. Execution and Permutation

In a standard model, data flows linearly. In MoE, data is physically moved (permuted) so that all "Math" tokens end up grouped together for the "Math Expert."

```python
        # Step 3: Run the experts
        # Each expert processes its assigned batch of tokens
        expert_output, _ = self.experts(dispatched_input)

        # Step 4: Un-route (Permute back)
        # Put the tokens back in their original sentence order
        output = self.token_dispatcher.token_unpermutation(
            expert_output, ...
        )

        return output
```

Beginner Explanation: Think of Step 4 as re-sorting a deck of cards. We dealt the cards out to different players (the Experts), each player marked their cards, and now we must collect them back into the original order so the sentence still makes sense.
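The deal-and-restore trick maps directly onto `argsort`: a stable sort groups tokens by expert, and sorting the permutation itself undoes the grouping. A toy sketch (the token values and expert assignments here are made up):

```python
import numpy as np

# Hypothetical per-token expert assignments for 6 tokens (expert ids 0-2)
indices = np.array([2, 0, 1, 0, 2, 1])
tokens = np.arange(6) * 10.0                # stand-in hidden states

perm = np.argsort(indices, kind="stable")   # group tokens by expert
dispatched = tokens[perm]                   # all expert-0 tokens first, etc.

processed = dispatched + 1                  # pretend the experts ran here

inverse = np.argsort(perm)                  # invert the permutation
restored = processed[inverse]               # back to sentence order

print(restored)  # [ 1. 11. 21. 31. 41. 51.]
```

`restored` lines up with the original token order, so the sequence can flow into the next Transformer block as if nothing had been shuffled.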

Summary

In this chapter, you learned:

  1. Mixture of Experts (MoE) allows models to be huge (many parameters) but fast (few active parameters).
  2. It replaces the standard MLP layer with a Router and several Experts.
  3. The Router selects the "Top-K" experts for every token.
  4. In Megatron-LM, this is handled by moe_layer.py, which shuffles data to the right experts and back again.

We have now covered Decoder-only, Encoder-only, Encoder-Decoder, State Space Models, and Sparse MoEs. These are the engines of modern AI.

But sometimes, the "Input" isn't just text. Sometimes, we want our model to "cheat" by looking up information in a database instead of memorizing it all.

Next Chapter: RETRO


Generated by Code IQ