Welcome to the world of Large Language Models (LLMs)! If you've ever used ChatGPT, Claude, or LLaMA, you have interacted with a decoder-only transformer, the architecture popularized by GPT (Generative Pre-trained Transformer).
In this first chapter, we will explore the most popular architecture in modern AI: the Decoder-only model.
Imagine you are building a smart writing assistant. You want to type a phrase, and have the computer guess what comes next.
Use Case:
This is the core problem GPT solves. It looks at the history of words (context) and predicts the next token. It does this repeatedly to generate paragraphs, code, or stories. In Megatron-LM, this architecture is the foundation for models like LLaMA, Qwen, DeepSeek, and Mixtral.
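This "predict one token, append it, repeat" loop can be sketched in a few lines of plain Python. Note that the `model` callable below is a hypothetical stand-in for illustration, not Megatron-LM's actual API:

```python
def generate(model, prompt_ids, n_new):
    """Autoregressive decoding: repeatedly append the model's best guess."""
    ids = list(prompt_ids)
    for _ in range(n_new):
        # The model sees the full history so far and returns one next token id.
        ids.append(model(ids))
    return ids

# Toy "model" that always predicts the previous id + 1 (purely illustrative).
print(generate(lambda ids: ids[-1] + 1, [5], 3))  # [5, 6, 7, 8]
```

A real model replaces the lambda with a full forward pass, but the outer loop is exactly this simple.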
To understand GPT, imagine a writer who can only look backward.
When a Decoder-only model writes a sentence, it reads everything written so far, but it is strictly forbidden from "peeking" at future words during training. This is called Causal Masking.
Because it is optimized purely for generating the next piece of data, it is excellent at creative writing and chat.
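Causal masking is usually implemented as a lower-triangular matrix of allowed attention positions. Here is a minimal numpy sketch (real implementations build this on the GPU, often fused into the attention kernel):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Entry [i, j] is True when token i is allowed to attend to token j.

    Lower-triangular: each token sees itself and everything before it,
    but never anything after it.
    """
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask(4).astype(int))
# Row 0 sees only token 0; row 3 sees tokens 0 through 3.
```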
pretrain_gpt.py
Megatron-LM is a library designed to train these models at a massive scale. The main entry point for training a standard GPT model is a script called pretrain_gpt.py.
To start training a beginner-sized GPT model, you would run a command like this in your terminal. This setup mimics a small version of GPT-3.
python pretrain_gpt.py \
--tensor-model-parallel-size 1 \
--num-layers 12 \
--hidden-size 768 \
--num-attention-heads 12 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--data-path my_data_text_document
Explanation:
- --num-layers 12: The "depth" of the brain. More layers = more reasoning capability.
- --hidden-size 768: How much information represents a single word.
- --seq-length 1024: The model can look back at the previous 1024 tokens to make a guess.

When this script runs, what is actually happening to your data? Your text is first tokenized into integer IDs, e.g. [101, 209, 3345] (representing "The", "quick", "brown"). How does Megatron-LM process these numbers internally? Let's visualize the steps when the model performs a "forward pass" (calculating predictions).
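The very first step is an embedding lookup: each token ID selects one row of a learned table. A toy numpy sketch (the vocabulary size and IDs here are illustrative; Megatron's `LanguageModelEmbedding` also adds position information):

```python
import numpy as np

vocab_size, hidden_size = 50257, 768          # GPT-2-style toy dimensions
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, hidden_size))

token_ids = np.array([101, 209, 3345])        # "The", "quick", "brown"
vectors = embedding_table[token_ids]          # one 768-dim vector per token
print(vectors.shape)                          # (3, 768)
```

From here on, the model works entirely with these vectors, never the raw words.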
gpt_model.py
The core logic lives in megatron/core/models/gpt/gpt_model.py. Let's look at a simplified view of how the GPTModel class is structured.
First, the model initializes the components. It needs a way to turn words into math (Embeddings) and a stack of Transformer layers (Decoder).
# megatron/core/models/gpt/gpt_model.py
class GPTModel(MegatronModule):
    def __init__(self, config, ...):
        super().__init__()
        # 1. Embeddings: Turns token IDs into vectors
        self.embedding = LanguageModelEmbedding(config, ...)
        # 2. Decoder: The stack of Transformer layers
        self.decoder = TransformerBlock(config, ...)
When data enters the model, it flows through the embedding, then the decoder, and finally produces the output tensor.
def forward(self, input_ids, position_ids, attention_mask):
    # Step 1: Get embeddings for input tokens
    embedding_output = self.embedding(input_ids, position_ids)
    # Step 2: Pass through the Decoder (the heavy lifting)
    # This is where the model "thinks"
    hidden_states = self.decoder(embedding_output, attention_mask)
    # Step 3: Result contains the features to predict next words
    return hidden_states
Beginner Note: The attention_mask here is crucial. It ensures position 5 can only see itself and positions 1 through 4, never anything later. It masks out the future.
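What happens with the returned hidden states? To actually pick the next word, they are projected onto the vocabulary and the highest-scoring ID wins (greedy decoding). A toy numpy sketch with made-up sizes, not Megatron's real output head:

```python
import numpy as np

def next_token_id(hidden_states, output_weights):
    """Project the last position's features onto the vocabulary, pick the best id."""
    logits = hidden_states[-1] @ output_weights.T   # shape (vocab_size,)
    return int(np.argmax(logits))

rng = np.random.default_rng(0)
hidden = rng.normal(size=(3, 64))      # toy features for 3 tokens
out_w = rng.normal(size=(1000, 64))    # toy vocabulary projection (1000 words)
print(next_token_id(hidden, out_w))    # id of the predicted next token
```

Only the last position's vector matters for predicting the next token; in practice samplers add temperature and randomness instead of always taking the argmax.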
While the code above describes a standard GPT, modern variants supported by Megatron-LM (like LLaMA 2/3, Qwen, or DeepSeek) mostly change the configuration of the layers (e.g., using RMSNorm instead of LayerNorm, or RoPE instead of standard positional embeddings).
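The RMSNorm swap mentioned above is a small change: it drops LayerNorm's mean subtraction and normalizes by the root-mean-square alone. A minimal sketch of both (learned scale parameters omitted for clarity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standard LayerNorm: subtract the mean, divide by the standard deviation.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def rms_norm(x, eps=1e-5):
    # RMSNorm (used by LLaMA-family models): no mean subtraction,
    # just divide by the root-mean-square of the vector.
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)

x = np.array([1.0, 2.0, 3.0])
print(layer_norm(x))
print(rms_norm(x))
```

RMSNorm is cheaper (one statistic instead of two) and works just as well in practice, which is why many modern configs prefer it.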
If you are using a model like Mixtral, it uses a specialized architecture called "Mixture of Experts," which we will cover in Mixture of Experts (MoE).
In this chapter, you learned:
- What a Decoder-only (GPT) model is, and how Causal Masking forces it to predict only the next token.
- How to launch pretrain_gpt.py to train them in Megatron-LM.
- How the forward pass flows through the GPTModel class in gpt_model.py.

But what if you don't want to generate text? What if you want to understand a whole sentence at once, looking at both the beginning and the end simultaneously (like for classification or search)?
For that, we need a different architecture.
Next Chapter: BERT (Encoder-only)