In the previous chapter, Layer Normalization, we built a crucial component to stabilize the numbers flowing through our network. We learned how to force data into a standard range (mean of 0, variance of 1) so our AI can learn effectively.
But here is the scary part about AI programming: Code can run without crashing, but still be mathematically wrong.
Imagine you are building a calculator. If you program 2 + 2 and it outputs Error, you know something is wrong. But if it outputs 5, the calculator "works" (it didn't crash), but the answer is wrong.
In Neural Networks, we call these Silent Bugs. If our Normalization layer calculates the wrong average, the model will train for days and learn nothing. We won't know why until it's too late.
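To see how sneaky a Silent Bug can be, here is a small standalone sketch (hypothetical, not part of our library): normalizing over the wrong dimension runs without any error, yet produces the wrong statistics.

```python
import torch

torch.manual_seed(0)
x = torch.randn(2, 10, 32)  # batch, sequence, features

# Correct: normalize each token's feature vector (the last dimension)
good = (x - x.mean(dim=-1, keepdim=True)) / x.std(dim=-1, keepdim=True, unbiased=False)

# Silent bug: normalizing across the sequence dimension also runs fine!
bad = (x - x.mean(dim=1, keepdim=True)) / x.std(dim=1, keepdim=True, unbiased=False)

# Only the correct version has a per-token mean near 0
print(good.mean(dim=-1).abs().max())  # near 0
print(bad.mean(dim=-1).abs().max())   # clearly not 0
```

Both lines of code "work." Only a test that checks the actual numbers can tell them apart.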
The Solution: Unit Tests

We need to write a small script that acts like a "Quality Control Inspector." It feeds known numbers into our layer and strictly checks whether the output matches the math we expect.
We need to verify three specific things to trust our LayerNorm:

1. Shape: the output must have exactly the same shape as the input.
2. Statistics: the output's mean must be near 0 and its standard deviation near 1.
3. Parameters: the learnable weight (gamma) must start as all 1s, and the bias (beta) as all 0s.
The most basic rule of Layer Normalization is that it should not change the shape of the data. If we feed in a sentence of 10 words, we should get out a sentence of 10 words.
We use torch.randn to create "dummy" data. These are just random numbers that simulate the vectors inside a GPT model.
```python
import torch
from tinytorch import LayerNorm  # The class we built in Chapter 3

def test_shape():
    # 1. Setup: Batch=2, Sequence Length=10, Dimensions=32
    x = torch.randn(2, 10, 32)
    ln = LayerNorm(ndim=32, bias=True)

    # 2. Run the layer
    out = ln(x)

    # 3. Verify: Input shape must equal Output shape
    assert x.shape == out.shape
    print("✅ Shape Test Passed")

test_shape()
```
Explanation:
- `torch.randn(2, 10, 32)`: Creates a fake batch of data.
- `assert`: This is Python's way of saying "If this is False, stop everything and scream."

Next comes the most important test. We need to prove that our math forces the average to 0 and the spread to 1.
Computers aren't perfect at decimals. Sometimes $0$ is stored as $0.00000001$.
Because of this, we cannot check if mean == 0. We must check if mean is close to 0 (e.g., less than $1e-5$).
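A quick standalone illustration of why exact equality fails with floating-point numbers, and how tolerance-based checks solve it:

```python
import torch

total = 0.1 + 0.2
print(total == 0.3)  # False! total is actually 0.30000000000000004

# Instead, check "close enough" with an absolute tolerance
print(abs(total - 0.3) < 1e-5)  # True

# torch.allclose does the same comparison for whole tensors at once
a = torch.tensor([total], dtype=torch.float64)
b = torch.tensor([0.3], dtype=torch.float64)
print(torch.allclose(a, b, atol=1e-5))  # True
```

This is exactly why our tests below use `torch.allclose` with `atol` instead of `==`.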
```python
def test_statistics():
    # 1. Create data and layer
    x = torch.randn(2, 100, 32)  # 100 "words" per sequence
    ln = LayerNorm(ndim=32, bias=True)
    out = ln(x)

    # 2. Calculate actual mean and std of the output
    # We check the last dimension (-1) because that's where we normalized
    mean = out.mean(dim=-1)
    # unbiased=False matches the biased variance LayerNorm uses internally
    std = out.std(dim=-1, unbiased=False)

    # 3. Verify: Mean should be near 0, Std near 1
    # atol means "Absolute Tolerance"
    assert torch.allclose(mean, torch.zeros_like(mean), atol=1e-5)
    assert torch.allclose(std, torch.ones_like(std), atol=1e-5)
    print("✅ Math Test Passed")

test_statistics()
```
Explanation:
- `dim=-1`: We calculate the statistics for each word's vector individually.
- `torch.allclose`: A helper function that asks, "Are these numbers close enough?"
- If both checks pass, the formula `(x - mean) / sqrt(var)` from the previous chapter is correct.

In Layer Normalization, we decided that:

- Gamma (the weight) starts as all 1s, so at first the layer does not rescale anything.
- Beta (the bias) starts as all 0s, so at first the layer does not shift anything.

If these start at random numbers, our training might explode at the very beginning.
```python
def test_parameters():
    # 1. Initialize the layer
    ln = LayerNorm(ndim=32, bias=True)

    # 2. Verify Gamma (Weight) is all 1s
    assert torch.allclose(ln.weight, torch.ones(32))

    # 3. Verify Beta (Bias) is all 0s
    assert torch.allclose(ln.bias, torch.zeros(32))
    print("✅ Parameter Test Passed")

test_parameters()
```
Explanation:
We access the learnable parameters `ln.weight` and `ln.bias` directly and compare them against `torch.ones` and `torch.zeros` respectively.
In a real engineering project, we group these into a single file (often using a library like pytest). Here is how you can run all your checks at once to "green light" the component.
```python
if __name__ == "__main__":
    print("Testing Layer Normalization...")
    test_shape()
    test_statistics()
    test_parameters()
    print("🎉 All LayerNorm tests passed! The component is safe to use.")
```
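As a bonus check, you can cross-validate against PyTorch's built-in `torch.nn.LayerNorm` as a reference "oracle." The sketch below is an assumption about how you might wire this up; it uses the manual formula from the previous chapter in place of our `tinytorch` class so that it runs standalone:

```python
import torch
import torch.nn as nn

def test_against_reference():
    torch.manual_seed(0)
    x = torch.randn(2, 10, 32)

    # PyTorch's own LayerNorm: weight starts at 1, bias at 0 by default
    ref = nn.LayerNorm(32)

    # The formula from the previous chapter, using biased variance
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    manual = (x - mean) / torch.sqrt(var + ref.eps)

    # Our math should match PyTorch's output almost exactly
    assert torch.allclose(ref(x), manual, atol=1e-5)
    print("✅ Reference Test Passed")

test_against_reference()
```

If your `tinytorch.LayerNorm` disagrees with `nn.LayerNorm` on the same input, one of them is wrong, and it is almost never PyTorch.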
As we move forward, the components will get more complex. If we skipped testing Layer Normalization now and our GPT model later produced garbage, we wouldn't know whether the problem was in the complex brain (Attention) or the simple stabilizer (Normalization).
By verifying this now, we can rule it out as a cause of future errors.
We have successfully verified our stabilizer.
With a stable foundation, we are ready to build the "thinking" part of the Transformer block. This is where the model actually processes information.
Next, we will build the neural network layers that allow the model to "think."
Let's move to Multi-Layer Perceptron.