In the previous chapters, we established our Core Utilities and built the Layer Normalization component to keep our numbers stable. We even verified it in Layer Normalization Tests.
Now, we are ready to build the part of the brain that "thinks."
If the Attention mechanism (which we will build later) is like looking around a room to see who is talking to whom, the Multi-Layer Perceptron (MLP) is the brain processing that information.
Imagine you are translating a sentence. Attention tells you which words relate to each other; the MLP is where the model actually digests that gathered context, one word at a time.
The MLP is a feed-forward network. It takes the information gathered by the model, processes it individually for each word, and extracts higher-level meaning.
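"Individually for each word" is worth seeing concretely: the same weights are applied at every position, and no information flows between positions. The sketch below uses a plain `nn.Linear` as a stand-in for the full MLP (which we have not built yet) and shows that changing one word's vector leaves the other positions' outputs untouched.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(8, 8)        # stand-in for the MLP: acts on one vector at a time

x = torch.randn(1, 3, 8)       # 1 batch, 3 words, 8 channels
y = layer(x)

x2 = x.clone()
x2[0, 2] = torch.randn(8)      # change only the third word
y2 = layer(x2)

# The outputs for the first two words are identical: no mixing across positions.
assert torch.allclose(y[0, :2], y2[0, :2])
print("positions are processed independently")
```

Mixing information *between* words is the Attention mechanism's job; the MLP only processes what each position has already gathered.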
To process this information effectively, the MLP uses a specific strategy: expand the data into a larger space, apply a non-linearity, then shrink it back down.
Before we write the code, let's understand the three ingredients of our MLP sandwich.
A Linear layer is a simple matrix multiplication. It connects every input number to every output number with a weight.
Our MLP uses two of them: the first takes the input (size n_embd) and makes it bigger (size 4 * n_embd); the second shrinks it back down (to n_embd). But if we only used Linear layers, our AI would just be one giant multiplication problem. It couldn't learn complex things like sarcasm or grammar.
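To make "a Linear layer is just a matrix multiplication" tangible, here is a tiny sketch (toy sizes, no bias) verifying that `nn.Linear` computes exactly `x @ W.T`:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
linear = nn.Linear(4, 8, bias=False)   # a toy "up" projection: 4 -> 8
x = torch.randn(4)

# nn.Linear stores a weight matrix W of shape (8, 4) and computes x @ W.T
manual = x @ linear.weight.T
assert torch.allclose(linear(x), manual)
print(linear(x).shape)  # torch.Size([8])
```

Stacking two such layers with nothing in between collapses into a single matrix multiply (`W2 @ W1` is just another matrix), which is why the activation function in the next section is essential.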
We need an Activation Function to introduce "curves" into our math. We use GELU (Gaussian Error Linear Unit).
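A quick way to build intuition for GELU's "curve": large positive inputs pass through almost unchanged, large negative inputs are squashed toward zero, and values near zero are smoothly attenuated rather than hard-clipped like ReLU.

```python
import torch
import torch.nn as nn

gelu = nn.GELU()
xs = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])

# Large positives are nearly identity; large negatives are squashed toward 0;
# small values are smoothly scaled rather than cut off at 0 like ReLU.
print(gelu(xs))
```

This smoothness means gradients still flow for slightly negative inputs, which helps training compared to ReLU's hard cutoff.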
Imagine a student who only studies by memorizing the textbook. They fail when the test questions are slightly different. Dropout randomly "turns off" some neurons during training. This forces the model to learn robust patterns rather than memorizing specific paths.
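Dropout's "turning off" is easy to see directly. During training it zeroes each element with probability `p` and scales the survivors by `1/(1-p)` so the expected value stays the same; in evaluation mode it does nothing:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()       # training mode: roughly half the values are zeroed
print(drop(x))     # survivors are scaled by 1/(1-p) = 2.0

drop.eval()        # evaluation mode: dropout is a no-op
print(drop(x))     # all ones
```

The `train()`/`eval()` switch is why we must remember to call `model.eval()` before generating text from a trained model.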
Let's visualize how a single word (represented as a vector of numbers) travels through the MLP.
We will build this using PyTorch's nn.Module. We need our configuration from Core Utilities to know how big the layers should be.
We define the layers. Notice how we calculate 4 * config.n_embd.
```python
import torch.nn as nn

from tinytorch import GPTConfig


class MLP(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        # The "Up" projection: 768 -> 3072
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        # The Activation Function
        self.act = nn.GELU()
        # The "Down" projection: 3072 -> 768
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)
        # Regularization
        self.dropout = nn.Dropout(config.dropout)
```
Explanation:
- c_fc: the fully connected expansion layer.
- c_proj: the projection layer. This shrinks the data back down.

Now we connect the components in the forward method. This is the path the data takes.
```python
    def forward(self, x):
        # 1. Expand and Activate
        x = self.c_fc(x)
        x = self.act(x)
        # 2. Project back down
        x = self.c_proj(x)
        # 3. Apply Dropout
        x = self.dropout(x)
        return x
```
Explanation:
- x comes in with shape (Batch, Time, Channels).
- Inside the MLP, the channel dimension expands to 4 * Channels and then returns to (Batch, Time, Channels) at the end.

Let's look at how we would use this component in our main program. We treat it like a black box that processes vectors.
```python
import torch

# 1. Setup the configuration
config = GPTConfig(n_embd=768)  # Our vector size is 768

# 2. Create the MLP
mlp = MLP(config)

# 3. Create dummy data (1 batch, 10 words, 768 dim)
dummy_input = torch.randn(1, 10, 768)

# 4. Pass it through
output = mlp(dummy_input)
print(f"Input shape: {dummy_input.shape}")
print(f"Output shape: {output.shape}")
```
Expected Output:

```
Input shape: torch.Size([1, 10, 768])
Output shape: torch.Size([1, 10, 768])
```
Even though the data expanded to size 3072 inside the MLP, the outside world only sees it return as 768. This allows us to stack these blocks on top of each other easily!
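Because the output shape matches the input shape, blocks chain without any glue code. Here is a minimal, self-contained sketch of that stacking, using a condensed copy of the chapter's MLP class and a tiny stand-in for GPTConfig (real code would import both from our project modules):

```python
import torch
import torch.nn as nn
from dataclasses import dataclass


@dataclass
class Cfg:                     # stand-in for GPTConfig from Core Utilities
    n_embd: int = 64
    dropout: float = 0.0


class MLP(nn.Module):          # condensed copy of the chapter's class
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.act = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        return self.dropout(self.c_proj(self.act(self.c_fc(x))))


cfg = Cfg()
stack = nn.Sequential(*[MLP(cfg) for _ in range(4)])  # four blocks in a row
x = torch.randn(2, 5, cfg.n_embd)
print(stack(x).shape)  # torch.Size([2, 5, 64])
```

In the real model the MLP won't be stacked directly like this; it will sit inside a transformer block alongside attention and layer normalization, but the shape-preserving property is what makes that composition work.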
We have built the Multi-Layer Perceptron (MLP).
However, just like with our normalization layer, we can't just assume this works. We need to verify that the shapes are correct and that the gradients flow properly.
In the next chapter, we will write tests for this component.
Next Step: MLP Tests