In the previous chapter, Chapter 3: Attention Mechanisms (Self & Grouped Query), we built the "search engine" of our model. We learned how tokens look at each other to gather context.
However, Attention is only half the story.
Imagine a team meeting.
In this chapter, we will build the Processing part (The Feed Forward Network) and combine it with Attention to create the fundamental building block of LLMs: the Transformer Block.
Finally, we will stack these blocks on top of each other to create the full GPT Model.
Deep learning models are often called "Deep" because they consist of many layers stacked on top of each other.
The GPT Architecture is essentially a giant sandwich made of identical layers.
The data enters at the bottom, flows up through 12, 24, or even 96 identical blocks, and emerges at the top as a probability distribution for the next word.
If Attention is "looking," the Feed Forward Network (FFN) is "thinking."
After the tokens have gathered information from their neighbors via Attention, each token passes through a small neural network independently. This is where the model stores much of its factual knowledge.
It involves two linear transformations with a non-linear "activation" function in the middle. It expands the data dimension (making it wider) and then compresses it back down.
class FeedForward(nn.Module):
def __init__(self, cfg):
super().__init__()
# 1. Expand inputs (e.g., 768 -> 3072)
self.layers = nn.Sequential(
nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
GELU(), # The "activation" spark
nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
)
def forward(self, x):
return self.layers(x)
Explanation:
4 * cfg["emb_dim"]: We temporarily make the network 4 times wider. This gives the model more "space" to process complex relationships.GELU: This is an activation function. Without it, the network would just be linear algebra. This adds the complexity needed to learn language.As data flows through many layers, the numbers can get very large or very small, making training unstable.
Layer Normalization is like a volume knob that automatically adjusts the signal. It centers the data (mean = 0) and scales it (variance = 1). This ensures that every layer receives clean, standardized inputs.
class LayerNorm(nn.Module):
def __init__(self, emb_dim):
super().__init__()
# Parameters the model learns to tweak the normalization
self.scale = nn.Parameter(torch.ones(emb_dim))
self.shift = nn.Parameter(torch.zeros(emb_dim))
def forward(self, x):
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
# Normalize: (x - mean) / standard_deviation
norm_x = (x - mean) / torch.sqrt(var + 1e-5)
return self.scale * norm_x + self.shift
Deep networks suffer from a problem called "vanishing gradients." As the signal travels through 12+ layers, the original input can get lost or distorted.
To fix this, we use Shortcut Connections (also called Residual Connections).
Instead of:
Output = Layer(Input)
We do:
Output = Layer(Input) + Input
We simply add the original input to the result of the layer. This creates a "highway" for information to flow unchanged from the bottom to the top if needed.
Now we combine everything into a single unit: the Transformer Block.
Each block performs two main steps:
class TransformerBlock(nn.Module):
def __init__(self, cfg):
super().__init__()
# Define the components
self.att = MultiHeadAttention(...) # From Chapter 3
self.ff = FeedForward(cfg) # Defined above
self.norm1 = LayerNorm(cfg["emb_dim"])
self.norm2 = LayerNorm(cfg["emb_dim"])
def forward(self, x):
# Step 1: Attention with Shortcut
# Notice we Normalize BEFORE Attention (Pre-Norm)
shortcut = x
x = self.norm1(x)
x = self.att(x)
x = x + shortcut # Add original input back
# Step 2: FeedForward with Shortcut
shortcut = x
x = self.norm2(x)
x = self.ff(x)
x = x + shortcut # Add input back again
return x
Key Takeaway: The variable x represents our batch of tokens. It flows through the block, getting enriched with context (Attention) and processed (FeedForward), while the shortcut ensures we don't lose the original meaning.
Finally, we build the skyscraper. The GPTModel class holds everything together.
502) into vectors.
Here is how we implement the high-level structure in GPTModel.
__init__)
We set up the layers based on a configuration dictionary (e.g., GPT_CONFIG_124M).
class GPTModel(nn.Module):
def __init__(self, cfg):
super().__init__()
# 1. Embeddings
self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
# 2. Stack of Transformer Blocks
self.trf_blocks = nn.Sequential(
*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])]
)
# 3. Output Head
self.final_norm = LayerNorm(cfg["emb_dim"])
self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)
This is the path the data takes when we run model(input_data).
def forward(self, in_idx):
batch_size, seq_len = in_idx.shape
# 1. Create Embeddings
tok_embeds = self.tok_emb(in_idx)
# Create position IDs [0, 1, 2, ..., seq_len-1]
pos_ids = torch.arange(seq_len, device=in_idx.device)
pos_embeds = self.pos_emb(pos_ids)
# Combine them
x = tok_embeds + pos_embeds
# 2. Run through all Transformer Blocks
x = self.trf_blocks(x)
# 3. Final Norm and Output Projection
x = self.final_norm(x)
logits = self.out_head(x)
return logits
In this chapter, we completed the architecture of our Large Language Model.
We now have a "brain" structure, but it is currently empty. The weights are random. If you run it now, it will output gibberish.
To make it smart, we need to teach it. We need to Train it.
Next Chapter: Training and Finetuning Loops
Generated by Code IQ