In the previous chapters, we explored the Transformer family: Encoder models (like BERT), Decoder models (like GPT), and Encoder-Decoder models (like T5).
All three share the same backbone: Attention. While Attention is powerful, it has a major weakness: its cost grows quadratically with sequence length. As the text gets longer, Attention gets much slower and uses far more memory.
Enter Mamba.
Imagine you are reading a 1,000-page novel. You do not hold every word in your head at once; you keep a short running summary and update it as you go.
Use Case: Processing extremely long sequences (like whole DNA strands or entire code repositories).
Mamba is not a Transformer; it is a State Space Model.
To understand it, think of a Conveyor Belt.
Key Benefit: Mamba has Linear Complexity. If you double the text length, it takes exactly twice as long (not 4x as long like Transformers).
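A rough back-of-envelope comparison makes this concrete (a sketch counting token interactions, not exact FLOPs):

```python
def attention_cost(seq_len: int) -> int:
    # Self-attention compares every token with every other token: O(n^2)
    return seq_len * seq_len

def ssm_cost(seq_len: int) -> int:
    # A state-space scan touches each token once: O(n)
    return seq_len

for n in (1024, 2048, 4096):
    print(f"seq_len={n}: attention={attention_cost(n)}, ssm={ssm_cost(n)}")
```

Doubling the sequence length doubles the SSM cost but quadruples the Attention cost, which is exactly the 2x-versus-4x difference described above.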
Megatron-LM includes a specialized script for this architecture: pretrain_mamba.py.
Training Mamba looks similar to GPT, but under the hood, the math is completely different.
python pretrain_mamba.py \
--num-layers 24 \
--hidden-size 1024 \
--ssm-state-size 16 \
--seq-length 4096 \
--data-path my_long_text_data
New Arguments:
--ssm-state-size 16: The size of the "mental summary" (the state) the model keeps for each feature. A larger state captures more detail but is slower to compute.
--seq-length: Thanks to its linear complexity, Mamba can handle much longer sequences than standard Transformers.
How does Mamba process data? It uses a mechanism called Selective Scan: the model decides what information to keep in its "State" and what to forget, like a filter.
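To make the "state as running summary" idea concrete, here is a toy recurrence with fixed gates. The function name and the constants are purely illustrative; this is a sketch of the general state-space idea, not Mamba's real kernels:

```python
def scan(xs, a=0.9, b=0.1):
    """Toy state-space recurrence: h_t = a * h_{t-1} + b * x_t.

    The scalar h is the 'running summary': old information decays by
    factor a, and each new input is mixed in with weight b.
    """
    h = 0.0
    history = []
    for x in xs:
        h = a * h + b * x
        history.append(h)
    return history

# Feed in a single impulse and watch its trace fade geometrically:
states = scan([1.0, 0.0, 0.0, 0.0])
print(states)  # 0.1, then 0.9 * 0.1, then 0.9^2 * 0.1, ...
```

The cost of this loop is one update per token, which is where the linear complexity comes from.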
mamba_model.py
The implementation lives in megatron/core/models/mamba/mamba_model.py.
The MambaModel replaces the Transformer Decoder with a stack of Mamba Layers (often called "Mixer Layers").
# megatron/core/models/mamba/mamba_model.py
class MambaModel(MegatronModule):
    def __init__(self, config, ...):
        super().__init__()
        # 1. Embeddings: same as GPT (words -> vectors)
        self.embedding = LanguageModelEmbedding(config, ...)
        # 2. The backbone: a stack of Mamba Layers
        #    Instead of 'TransformerBlock', we use 'MambaStack'
        self.decoder = MambaStack(config, ...)
Inside MambaStack, we don't have Attention Heads. We have the SSM Mixer. This is where the "State" magic happens.
class MambaLayer(MegatronModule):
    def forward(self, hidden_states, ...):
        # 1. Project inputs to higher dimensions
        #    This prepares the data for the state machine
        xz = self.in_proj(hidden_states)
        # 2. Run the SSM (Selective Scan)
        #    This updates the 'State' and computes the output in linear time
        out = mamba_inner_fn(xz, self.conv1d_weight, ...)
        # 3. Project back to normal size
        return self.out_proj(out)
Beginner Note: mamba_inner_fn is a highly optimized function (written in CUDA) that performs the "running summary" math incredibly fast.
Standard State Space Models apply the same fixed update rule to every token, regardless of content. Mamba is Selective.
In the code, the model generates specific parameters (often called B, C, and Delta) that act like gates.
This allows Mamba to perform reasoning tasks (like "Copy the first word of the sentence to the end") that older SSMs struggled with.
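The key point is that the gates are computed from the input itself. The following toy scan illustrates that idea with a single scalar state; the gating rule, the sigmoid, and all names here are illustrative assumptions, not Mamba's real parameterization:

```python
import math

def selective_scan(xs, gate_weight=2.0):
    """Toy selective scan: the step size Delta depends on the input.

    A large |x| produces a Delta near 1, so the state 'resets' and
    stores the new token; x = 0 gives a smaller Delta, so the old
    state is mostly retained. (Illustrative only, not Mamba's math.)
    """
    h = 0.0
    outputs = []
    for x in xs:
        delta = 1.0 / (1.0 + math.exp(-gate_weight * abs(x)))  # input-dependent gate
        retain = math.exp(-delta)   # decay factor, like exp(Delta * A) with A = -1
        h = retain * h + delta * x  # input term scaled by the same gate (B-like)
        outputs.append(h)           # readout (a C-like projection, identity here)
    return outputs

# A strong token overwrites the state; subsequent zeros let it decay slowly:
out = selective_scan([5.0, 0.0, 0.0])
print(out)
```

Because `delta` is a function of `x`, the model can choose, token by token, whether to store or ignore an input, which is what lets Mamba copy a specific word across a long gap while a fixed-gate SSM cannot.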
In this chapter, you learned:

Mamba is a State Space Model, not a Transformer, and its cost scales linearly with sequence length.
It uses a Selective Scan to decide what to keep in its state and what to forget.
Megatron-LM provides pretrain_mamba.py to train it.

We have now covered the major architectural "shapes" (Encoder, Decoder, Encoder-Decoder, and SSM).
But what if you want to make your model massive—smart enough to know everything—but you don't want to pay the cost of computing every neuron for every word? You need a way to only use the parts of the brain relevant to the current topic.
Next Chapter: Mixture of Experts (MoE)
Generated by Code IQ