In the previous chapter, Chapter 5: Training and Finetuning Loops, we taught our model how to learn. We used a classic architecture based on GPT-2.
However, if you look at modern state-of-the-art models like Llama 3.2 (Meta) or Qwen 3 (Alibaba), they don't look exactly like the classic GPT-2. They have been "tuned up."
Think of the GPT architecture as a classic internal combustion engine: it still runs perfectly well, but a decade of engineering has produced leaner, faster variants of every major part.
In this chapter, we will explore these specific upgrades: RMSNorm, SwiGLU, and RoPE.
Both Llama and Qwen share a very similar structure. They modify the standard Transformer Block in three key ways to make training more stable and inference faster.
1. LayerNorm to RMSNorm
2. GELU to SwiGLU
3. Learned Embeddings to RoPE (Rotary Positional Embeddings)

Let's break these down one by one.
In Chapter 4, we used Layer Normalization. It centers the numbers (subtracts the mean) and scales them (divides by the standard deviation).
Root Mean Square Normalization (RMSNorm) simplifies this. Researchers found that "centering" the data isn't actually necessary. We only need to "scale" it to keep the numbers from exploding.
It requires less math (no subtraction of the mean), making it slightly faster and cheaper to compute.
Here is how RMSNorm looks in our Llama/Qwen implementation.
```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, emb_dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        # We learn a scaling parameter, but NO shift parameter
        self.scale = nn.Parameter(torch.ones(emb_dim))

    def forward(self, x):
        # Mean of the squared values: the "magnitude" of the data
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        # Normalize inputs and apply learned scale
        norm_x = x * torch.rsqrt(variance + self.eps)
        return norm_x * self.scale
```
Key Difference: Notice we don't calculate x - mean. We just divide by the root mean square of the values.
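A quick way to see what this scaling guarantees: whatever magnitude the input has, every output vector ends up with a root mean square of roughly 1. A minimal check (with the learned scale at its initial value of 1, so it has no effect):

```python
import torch

torch.manual_seed(0)
x = torch.randn(2, 4, 8) * 50  # deliberately large inputs

eps = 1e-6
variance = x.pow(2).mean(dim=-1, keepdim=True)
norm_x = x * torch.rsqrt(variance + eps)

# Regardless of the input scale, each output vector has an RMS of ~1
rms = norm_x.pow(2).mean(dim=-1).sqrt()
print(rms)  # all values close to 1.0
```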
In the classic Feed Forward Network (the "Brain" of the block), we used an expansion pattern:
Input -> Linear -> Activation (GELU) -> Linear -> Output
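For reference, here is a minimal sketch of that classic two-layer pattern (the class name is ours, for illustration only):

```python
import torch
import torch.nn as nn

class ClassicFeedForward(nn.Module):
    """GPT-2 style feed forward: expand, activate (GELU), contract."""
    def __init__(self, emb_dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(emb_dim, hidden_dim)   # expand
        self.fc2 = nn.Linear(hidden_dim, emb_dim)   # contract

    def forward(self, x):
        return self.fc2(nn.functional.gelu(self.fc1(x)))

ffn = ClassicFeedForward(emb_dim=8, hidden_dim=32)
out = ffn(torch.randn(2, 4, 8))
print(out.shape)  # torch.Size([2, 4, 8])
```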
Modern models use SwiGLU. This adds a third "Gate" layer.
Imagine a security checkpoint. One path carries the data (the Value). The other path acts as a gate (the Gate): it passes through a Swish activation, and the result decides how much of the Value gets through by multiplying the two together. This creates a network with 3 linear layers instead of 2.
```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # 3 layers instead of 2
        self.fc1 = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], bias=False)  # Gate
        self.fc2 = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], bias=False)  # Value
        self.fc3 = nn.Linear(cfg["hidden_dim"], cfg["emb_dim"], bias=False)  # Output

    def forward(self, x):
        # The "Gate" (fc1) controls the "Value" (fc2)
        # silu is the "Swish" activation function
        x_gate = nn.functional.silu(self.fc1(x))
        x_value = self.fc2(x)
        # Element-wise multiplication
        return self.fc3(x_gate * x_value)
```
Result: This structure allows the model to learn more complex patterns because it can selectively "silence" or "amplify" specific neurons using the gating mechanism.
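To make the gating concrete, here is the mechanism in isolation with hand-picked numbers: a strongly negative gate crushes its value toward zero, while a strongly positive gate passes it through (scaled up, since silu(x) ≈ x for large x):

```python
import torch

gate_logits = torch.tensor([-10.0, 0.0, 10.0])
value = torch.tensor([5.0, 5.0, 5.0])

# silu(x) = x * sigmoid(x): very negative gates silence the value,
# positive gates pass (and scale) it through
gated = torch.nn.functional.silu(gate_logits) * value
print(gated)  # first entry ≈ 0 (silenced), last entry ≈ 50 (passed through)
```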
This is the most significant change.
In standard GPT, we assigned a unique vector to "Position 1" and added it to the word. In RoPE, we don't add a vector. We rotate the word vector itself.
Imagine the token vector is an arrow on a compass. RoPE rotates that arrow by an angle that depends on the token's position in the sequence: position 1 gets a small rotation, position 100 a much larger one.
When the Attention mechanism compares two tokens (Query and Key), it compares the angle between the arrows. This allows the model to understand Relative Position ("These words are 5 steps apart") much better than Absolute Position ("This is the 500th word").
Implementing RoPE involves some complex trigonometry, but using it is straightforward. We pre-calculate cos and sin tables and apply them.
```python
import torch

def apply_rope(x, cos, sin):
    # Split the vector into two halves
    head_dim = x.shape[-1]
    x1 = x[..., : head_dim // 2]
    x2 = x[..., head_dim // 2 :]
    # Rotate the vector using the rotation matrix formula
    rotated = torch.cat((-x2, x1), dim=-1)
    x_rotated = (x * cos) + (rotated * sin)
    return x_rotated
```
Note: In the code files llama3.py and standalone-qwen3.ipynb, you will see helper functions compute_rope_params that generate the cos and sin grids.
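We won't reproduce those helpers here, but the following is a simplified sketch of how such a table generator might look, using the standard inverse-frequency scheme and the "rotate-half" layout that apply_rope expects (parameter names are ours, not necessarily those in the repo):

```python
import torch

def compute_rope_params(head_dim, context_length, theta_base=10_000.0):
    # One rotation speed per pair of dimensions: early pairs spin fast,
    # later pairs spin slowly
    inv_freq = 1.0 / theta_base ** (torch.arange(0, head_dim, 2).float() / head_dim)
    positions = torch.arange(context_length).float()
    angles = positions[:, None] * inv_freq[None, :]  # (context_length, head_dim/2)
    angles = torch.cat([angles, angles], dim=-1)     # (context_length, head_dim)
    return angles.cos(), angles.sin()

cos, sin = compute_rope_params(head_dim=8, context_length=16)
print(cos.shape, sin.shape)  # torch.Size([16, 8]) torch.Size([16, 8])
```

Note that position 0 produces angle 0 everywhere (cos = 1, sin = 0), so the first token is left unrotated.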
While Llama and Qwen are 90% identical, Qwen 3 introduces a small but impactful tweak called QK Norm.
In Chapter 3: Attention Mechanisms, we learned about Queries (Q) and Keys (K). If these vectors get too large, the attention scores explode, causing training instability.
Qwen 3 applies RMSNorm to the Queries and Keys before calculating the attention score.
```python
# Inside Qwen's attention block
if self.qk_norm:
    queries = self.q_norm(queries)
    keys = self.k_norm(keys)

# Then calculate attention scores...
attn_scores = queries @ keys.transpose(2, 3)
```
This ensures that the "search" mechanism in attention remains stable, even for massive models with billions of parameters.
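To see the effect, compare raw attention scores against scores computed from RMS-normalized queries and keys. This is a toy demo (the learned scale is omitted for brevity): once each vector has an RMS of ~1, the dot products are mathematically bounded and can no longer explode.

```python
import torch

torch.manual_seed(0)
head_dim = 64
# Simulate queries/keys whose magnitudes have drifted during training
queries = torch.randn(1, 8, head_dim) * 100
keys = torch.randn(1, 8, head_dim) * 100

scores = queries @ keys.transpose(1, 2) / head_dim ** 0.5
print(scores.abs().max())  # huge — the softmax would saturate

# RMS-normalize each vector before the dot product
def rms_norm(x, eps=1e-6):
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

normed_scores = rms_norm(queries) @ rms_norm(keys).transpose(1, 2) / head_dim ** 0.5
print(normed_scores.abs().max())  # bounded: at most sqrt(head_dim)
```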
Let's visualize the path of data through a modern Llama/Qwen block. It uses Grouped Query Attention (from Chapter 3) alongside our new upgrades.
Finally, let's look at how we initialize the Llama3Model class. It looks very similar to our GPT model, but uses the upgraded components.
```python
import torch
import torch.nn as nn

class Llama3Model(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"], dtype=cfg["dtype"])
        # Stack of modern Transformer blocks
        self.trf_blocks = nn.ModuleList(
            [TransformerBlock(cfg) for _ in range(cfg["n_layers"])]
        )
        self.final_norm = nn.RMSNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)
        # Pre-compute the RoPE rotation tables (cos and sin)
        cos, sin = compute_rope_params(..., cfg["context_length"])
        self.register_buffer("cos", cos)
        self.register_buffer("sin", sin)

    def forward(self, in_idx):
        x = self.tok_emb(in_idx)
        # Pass the RoPE tables to every block
        for block in self.trf_blocks:
            x = block(x, self.cos, self.sin)
        x = self.final_norm(x)
        return self.out_head(x)
```
Key Takeaway: Notice that we calculate cos and sin once at the beginning and pass them into every block. This saves time so each block doesn't have to recalculate the rotation angles.
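The tables are stored with register_buffer, which is worth a brief aside: buffers belong to the module (they move with .to(device) and are saved in the state_dict) but are not trainable parameters. A tiny illustration, with a hypothetical Demo module:

```python
import torch
import torch.nn as nn

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        # A buffer: part of the module's state, but never updated by the optimizer
        self.register_buffer("cos", torch.ones(4))

m = Demo()
print(list(m.parameters()))     # [] — buffers are not trained
print("cos" in m.state_dict())  # True — but they are saved with the model
```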
In this chapter, we upgraded our "Classic Car" to a "Modern Supercar."
These changes are the industry standard for 2024/2025.
However, as models get bigger (like Llama 3 70B or Qwen 72B), running them becomes incredibly expensive. What if we could have a huge model, but only use a tiny part of its brain for each token?
This leads us to the next major efficiency breakthrough: Mixture of Experts.
Next Chapter: Mixture of Experts (MoE)