In the previous chapter, Mamba (State Space Model), we learned how to process incredibly long books by changing how the model remembers history.
But there is another problem with Large Language Models: Size. To make a model smarter, we usually add more neurons (parameters). However, making a model bigger makes it slower and more expensive to run.
What if you could build a massive brain, but only use a tiny fraction of it for any given task? This is the concept behind Mixture of Experts (MoE).
Imagine you are building a medical AI. Instead of one generalist doctor who must know everything, you hire a team of specialists and a receptionist who routes each question to the right one or two of them. That is the MoE idea: many specialists exist, but only a few are consulted per question.
Use Case: Efficiently scaling up intelligence.
Models like Mixtral and DeepSeek use this to be incredibly smart (having many experts) while remaining incredibly fast (using only a few at a time).
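A quick back-of-the-envelope calculation shows why this works. The numbers below are illustrative (not the real Mixtral configuration), but they capture the ratio between parameters *stored* and parameters *active* per token:

```python
# Toy parameter arithmetic for a sparse MoE layer.
# hidden/ffn sizes are illustrative assumptions, not a real model config.
hidden = 4096
ffn = 14336          # feed-forward inner size (assumed)
num_experts = 8
top_k = 2

# A standard (dense) FFN: two weight matrices.
dense_params = 2 * hidden * ffn

# MoE: 8 copies of the FFN are *stored*, but only 2 run per token.
stored = num_experts * dense_params
active = top_k * dense_params

print(f"stored/active ratio: {stored / active:.0f}x")  # 4x
```

With 8 experts and top-2 routing, the model holds 4x more feed-forward parameters than it ever computes with for a single token.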
MoE is a feature that lives inside the Transformer block. It replaces the standard "Feed Forward" (MLP) layer.
Instead of one single Feed Forward network, an MoE layer contains:

- A Router: a small network that decides which experts should handle each token.
- Multiple Experts: several small, independent Feed Forward networks.

If the router picks Expert 1 for a token, Experts 2 and 3 do nothing for that token, saving computation.
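This conditional computation can be sketched in a few lines of numpy. The setup below (3 tiny experts, a hard-coded router choice) is purely illustrative:

```python
import numpy as np

# Minimal sketch of conditional computation: 3 experts exist, but for this
# token the router chose only expert 1, so the other two never run.
rng = np.random.default_rng(0)
experts = [rng.normal(size=(4, 4)) for _ in range(3)]  # 3 tiny expert weights
token = rng.normal(size=(1, 4))

chosen = 1                        # router's pick for this token
output = token @ experts[chosen]  # one matmul instead of three
print(output.shape)               # (1, 4)
```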
MoE is usually trained using the standard GPT (Decoder-only) script, but with specific flags that transform the layers into MoE layers.
Here is how you turn a standard GPT training run into an MoE run (like Mixtral 8x7B).
```shell
python pretrain_gpt.py \
    --num-layers 12 \
    --hidden-size 768 \
    --num-experts 8 \
    --moe-router-topk 2 \
    --expert-model-parallel-size 1 \
    --data-path my_data
```
Key MoE Arguments:
- `--num-experts 8`: We are creating 8 distinct specialists per layer.
- `--moe-router-topk 2`: For every token, the router picks the 2 best experts to handle it. Each token in a sequence like `["The", "derivative", "of"]` is routed independently.

How does the model decide which expert to use? It uses a Router.
The Router looks at the incoming token and assigns a "score" to each expert. It then picks the winners (Top-K).
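The idea can be sketched in plain numpy: project each token onto per-expert logits, softmax them into scores, and keep the k highest. This is a conceptual sketch, not Megatron's actual `TopKRouter` implementation:

```python
import numpy as np

def topk_router(tokens, w_router, k=2):
    """Score each expert per token and pick the top-k winners."""
    logits = tokens @ w_router                  # [n_tokens, n_experts]
    # softmax over the expert dimension
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    # indices of the k highest-scoring experts per token
    indices = np.argsort(-probs, axis=-1)[:, :k]
    scores = np.take_along_axis(probs, indices, axis=-1)
    return scores, indices

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 16))   # 4 tokens, hidden size 16
w = rng.normal(size=(16, 8))        # 8 experts
scores, indices = topk_router(tokens, w, k=2)
print(indices.shape)                # (4, 2): two expert IDs per token
```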
moe_layer.py
The core logic lives in megatron/core/transformer/moe/moe_layer.py.
The `MoELayer` class replaces the standard MLP. It initializes the list of experts and the router.
```python
# megatron/core/transformer/moe/moe_layer.py

class MoELayer(MegatronModule):
    def __init__(self, config, ...):
        super().__init__()
        # 1. The Router: Decides where tokens go
        self.router = TopKRouter(config=config, ...)

        # 2. The Experts: A list of small MLPs
        # If num_experts=8, this manages 8 sub-networks
        self.experts = GroupedMLP(config=config, ...)
```
When data comes in, the Router assigns inputs to experts.
```python
def forward(self, hidden_states):
    # Step 1: Router decides scores and indices (winners)
    # scores: How confident the router is
    # indices: Which expert ID (0 to 7) to use
    scores, indices = self.router(hidden_states)

    # Step 2: Route the tokens to the correct experts
    # This acts like a traffic controller
    dispatched_input = self.token_dispatcher.token_permutation(
        hidden_states, scores, indices
    )
    # ... steps 3 and 4 continue below
```
In a standard model, data flows linearly. In MoE, data is physically moved (permuted) so that all "Math" tokens end up grouped together for the "Math Expert."
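The grouping trick is, at its core, a stable sort by expert ID. A self-contained numpy sketch (mimicking the idea behind `token_permutation`, not Megatron's real code, and using top-1 routing for simplicity):

```python
import numpy as np

# Toy "token dispatch": group tokens by their assigned expert.
tokens = np.arange(6, dtype=float).reshape(6, 1)  # 6 tokens, hidden size 1
expert_ids = np.array([2, 0, 1, 0, 2, 1])         # router's choice per token

order = np.argsort(expert_ids, kind="stable")     # sort tokens by expert
dispatched = tokens[order]                        # expert 0's tokens first, then 1, 2
print(expert_ids[order])                          # [0 0 1 1 2 2]
```

Each expert can now process its tokens as one contiguous batch, which is far more GPU-friendly than scattered lookups.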
```python
    # Step 3: Run the experts
    # Each expert processes its assigned batch of tokens
    expert_output, _ = self.experts(dispatched_input)

    # Step 4: Un-route (Permute back)
    # Put the tokens back in their original sentence order
    output = self.token_dispatcher.token_unpermutation(
        expert_output, ...
    )
    return output
```
Beginner Explanation: Think of Step 4 as re-sorting a deck of cards. We dealt the cards to different players (Experts) to sign them, and now we must collect them back into the original order so the sentence makes sense.
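The shuffle-back in Step 4 is just the inverse of the dispatch permutation. A self-contained numpy sketch of the full round trip (toy data, top-1 routing, doubling as a stand-in for the experts' work):

```python
import numpy as np

tokens = np.array([[10.], [11.], [12.], [13.], [14.], [15.]])
expert_ids = np.array([2, 0, 1, 0, 2, 1])   # router's choice per token

order = np.argsort(expert_ids, kind="stable")
dispatched = tokens[order]                  # grouped by expert

expert_output = dispatched * 2.0            # stand-in for the experts' work

inverse = np.argsort(order)                 # undo the permutation
output = expert_output[inverse]             # back in sentence order
print(np.array_equal(output, tokens * 2.0))  # True
```

Because `inverse` maps each dispatched position back to its original slot, every token ends up exactly where it started, just transformed by its expert.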
In this chapter, you learned:

- MoE replaces the Transformer's Feed Forward (MLP) layer with many small expert networks, only a few of which run per token.
- A Router scores the experts and picks the Top-K winners for each token.
- The token dispatcher in moe_layer.py, which shuffles data to the right experts and back again.

We have now covered Decoder-only, Encoder-only, Encoder-Decoder, State Space Models, and Sparse MoEs. These are the engines of modern AI.
But sometimes, the "Input" isn't just text. Sometimes, we want our model to "cheat" by looking up information in a database instead of memorizing it all.