Welcome to the world of Large Language Models (LLMs)! If you've ever used ChatGPT, Claude, or LLaMA, you have interacted with a decoder-only transformer, the architecture popularized by GPT (Generative Pre-trained Transformer).
In this first chapter, we will explore the most popular architecture in modern AI: the Decoder-only model.
Imagine you are building a smart writing assistant. You want to type a phrase, and have the computer guess what comes next.
Use Case:
This is the core problem GPT solves. It looks at the history of words (context) and predicts the next token. It does this repeatedly to generate paragraphs, code, or stories. In Megatron-LM, this architecture is the foundation for models like LLaMA, Qwen, DeepSeek, and Mixtral.
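This "predict one token, append it, repeat" loop can be sketched in a few lines of plain Python. Note that the `model` callable below is a hypothetical stand-in for illustration, not Megatron-LM's actual API:

```python
def generate(model, prompt_ids, n_new):
    """Autoregressive decoding: repeatedly append the model's best guess."""
    ids = list(prompt_ids)
    for _ in range(n_new):
        # The model sees the full history so far and returns one next token id.
        ids.append(model(ids))
    return ids

# Toy "model" that always predicts the previous id + 1 (purely illustrative).
print(generate(lambda ids: ids[-1] + 1, [5], 3))  # [5, 6, 7, 8]
```

A real model replaces the lambda with a full forward pass, but the outer loop is exactly this simple.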
To understand GPT, imagine a writer who can only look backward.
When a Decoder-only model writes a sentence, it reads everything written so far, but it is strictly forbidden from "peeking" at future words during training. This is called Causal Masking.
Because it is optimized purely for generating the next piece of data, it is excellent at creative writing and chat.
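Causal masking is usually implemented as a lower-triangular matrix of allowed attention positions. Here is a minimal numpy sketch (real implementations build this on the GPU, often fused into the attention kernel):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Entry [i, j] is True when token i is allowed to attend to token j.

    Lower-triangular: each token sees itself and everything before it,
    but never anything after it.
    """
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask(4).astype(int))
# Row 0 sees only token 0; row 3 sees tokens 0 through 3.
```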
pretrain_gpt.py
Megatron-LM is a library designed to train these models at a massive scale. The main entry point for training a standard GPT model is a script called pretrain_gpt.py.
To start training a beginner-sized GPT model, you would run a command like this in your terminal. This setup mimics a small version of GPT-3.
python pretrain_gpt.py \
--tensor-model-parallel-size 1 \
--num-layers 12 \
--hidden-size 768 \
--num-attention-heads 12 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--data-path my_data_text_document
Explanation:
- --num-layers 12: The "depth" of the brain. More layers = more reasoning capability.
- --hidden-size 768: How much information represents a single word.
- --seq-length 1024: The model can look back at the previous 1024 tokens to make a guess.

When this script runs, what is actually happening to your data? Your text is first tokenized into integer IDs, e.g. [101, 209, 3345] (representing "The", "quick", "brown"). How does Megatron-LM process these numbers internally? Let's visualize the steps when the model performs a "forward pass" (calculating predictions).
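The very first step is an embedding lookup: each token ID selects one row of a learned table. A toy numpy sketch (the vocabulary size and IDs here are illustrative; Megatron's `LanguageModelEmbedding` also adds position information):

```python
import numpy as np

vocab_size, hidden_size = 50257, 768          # GPT-2-style toy dimensions
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, hidden_size))

token_ids = np.array([101, 209, 3345])        # "The", "quick", "brown"
vectors = embedding_table[token_ids]          # one 768-dim vector per token
print(vectors.shape)                          # (3, 768)
```

From here on, the model works entirely with these vectors, never the raw words.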
gpt_model.py
The core logic lives in megatron/core/models/gpt/gpt_model.py. Let's look at a simplified view of how the GPTModel class is structured.
First, the model initializes the components. It needs a way to turn words into math (Embeddings) and a stack of Transformer layers (Decoder).
# megatron/core/models/gpt/gpt_model.py
class GPTModel(MegatronModule):
    def __init__(self, config, ...):
        super().__init__()
        # 1. Embeddings: Turns token IDs into vectors
        self.embedding = LanguageModelEmbedding(config, ...)
        # 2. Decoder: The stack of Transformer layers
        self.decoder = TransformerBlock(config, ...)
When data enters the model, it flows through the embedding, then the decoder, and finally produces the output tensor.
def forward(self, input_ids, position_ids, attention_mask):
    # Step 1: Get embeddings for input tokens
    embedding_output = self.embedding(input_ids, position_ids)
    # Step 2: Pass through the Decoder (the heavy lifting)
    # This is where the model "thinks"
    hidden_states = self.decoder(embedding_output, attention_mask)
    # Step 3: Result contains the features to predict next words
    return hidden_states
Beginner Note: The attention_mask here is crucial. It ensures position 5 can only see itself and positions 1 through 4, never anything later. It masks out the future.
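What happens with the returned hidden states? To actually pick the next word, they are projected onto the vocabulary and the highest-scoring ID wins (greedy decoding). A toy numpy sketch with made-up sizes, not Megatron's real output head:

```python
import numpy as np

def next_token_id(hidden_states, output_weights):
    """Project the last position's features onto the vocabulary, pick the best id."""
    logits = hidden_states[-1] @ output_weights.T   # shape (vocab_size,)
    return int(np.argmax(logits))

rng = np.random.default_rng(0)
hidden = rng.normal(size=(3, 64))      # toy features for 3 tokens
out_w = rng.normal(size=(1000, 64))    # toy vocabulary projection (1000 words)
print(next_token_id(hidden, out_w))    # id of the predicted next token
```

Only the last position's vector matters for predicting the next token; in practice samplers add temperature and randomness instead of always taking the argmax.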
While the code above describes a standard GPT, modern variants supported by Megatron-LM (like LLaMA 2/3, Qwen, or DeepSeek) mostly change the configuration of the layers (e.g., using RMSNorm instead of LayerNorm, or RoPE instead of standard positional embeddings).
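The RMSNorm swap mentioned above is a small change: it drops LayerNorm's mean subtraction and normalizes by the root-mean-square alone. A minimal sketch of both (learned scale parameters omitted for clarity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standard LayerNorm: subtract the mean, divide by the standard deviation.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def rms_norm(x, eps=1e-5):
    # RMSNorm (used by LLaMA-family models): no mean subtraction,
    # just divide by the root-mean-square of the vector.
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)

x = np.array([1.0, 2.0, 3.0])
print(layer_norm(x))
print(rms_norm(x))
```

RMSNorm is cheaper (one statistic instead of two) and works just as well in practice, which is why many modern configs prefer it.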
If you are using a model like Mixtral, it uses a specialized architecture called "Mixture of Experts," which we will cover in Mixture of Experts (MoE).
In this chapter, you learned:
- What a Decoder-only (GPT) model is, and how Causal Masking forces it to predict only the next token.
- How to launch pretrain_gpt.py to train them in Megatron-LM.
- How the forward pass flows through the GPTModel class in gpt_model.py.

But what if you don't want to generate text? What if you want to understand a whole sentence at once, looking at both the beginning and the end simultaneously (like for classification or search)?
For that, we need a different architecture.
Next Chapter: BERT (Encoder-only)