In the previous chapter, Chapter 2: Data Loading and Formatting, we turned text into batches of numbers (tokens).
However, right now, those numbers are lonely. If the model sees the number for the word "Bank", it doesn't know if it refers to a river bank or a financial bank. To solve this, the model needs to look at the surrounding words (context).
This is where Attention comes in. It is the heart of the Transformer architecture.
Attention is simply a communication mechanism. Imagine each token in a sentence is a person trying to talk to the others.
To understand how they communicate, we use a database retrieval analogy. Every token produces three vectors (lists of numbers):

- **Query (Q)**: "What am I looking for?"
- **Key (K)**: "What do I contain?"
- **Value (V)**: "What information do I pass along if someone attends to me?"
In the GPT architecture (used in gpt.py), we use Multi-Head Attention. This just means we run the attention process several times in parallel.
Why? Because "Bank" might want to look for adjectives (Head 1) AND prepositions (Head 2) at the same time.
First, we use Linear layers to project our input x into these three forms.
```python
# Simplified from gpt.py
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, num_heads):
        super().__init__()
        # These layers transform the input into Q, K, V
        self.W_query = nn.Linear(d_in, d_out)
        self.W_key = nn.Linear(d_in, d_out)
        self.W_value = nn.Linear(d_in, d_out)
        # Upper-triangular matrix of 1s marking the "future" positions
        # (used by the causal mask in forward)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )
```
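In practice, multi-head attention does not need a separate Linear layer per head: the single `d_out`-wide projection is reshaped so each head gets its own slice. A minimal sketch with made-up sizes (2 heads, 8-dimensional output) shows the reshape:

```python
import torch

# Hypothetical sizes for illustration: batch 1, 4 tokens, 2 heads
b, num_tokens, d_out, num_heads = 1, 4, 8, 2
head_dim = d_out // num_heads

queries = torch.randn(b, num_tokens, d_out)

# Split the last dimension into (num_heads, head_dim), then move the
# head dimension forward so each head can attend independently
q_heads = queries.view(b, num_tokens, num_heads, head_dim).transpose(1, 2)
print(q_heads.shape)  # torch.Size([1, 2, 4, 4])
```

Each of the 2 heads now works with its own 4-dimensional slice of every token.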
We check similarity by multiplying Queries and Keys (Dot Product).
```python
    def forward(self, x):
        b, num_tokens, d_in = x.shape

        # 1. Create Q, K, V
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        # 2. Compute Match Scores (Dot Product)
        # (This determines how much focus to put on each word)
        attn_scores = queries @ keys.transpose(1, 2)
```
Note: In the code, the @ symbol performs matrix multiplication.
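To make that concrete, here is a tiny standalone example showing that `@` is shorthand for `torch.matmul`:

```python
import torch

a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[5., 6.], [7., 8.]])

# '@' and torch.matmul produce identical results
print(a @ b)               # tensor([[19., 22.], [43., 50.]])
print(torch.matmul(a, b))  # same result
```

In the attention code, the same operation runs over batches of token vectors rather than 2x2 matrices, but the rule is identical: row-by-column dot products.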
When training a text generator, the model reads left-to-right. It shouldn't be allowed to see the future words. We "mask" the future by setting those attention scores to negative infinity.
```python
        # 3. Mask out the future
        # (self.mask is an upper-triangular matrix of 1s above the diagonal)
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
        # Set future positions to -infinity so softmax gives them 0 probability
        attn_scores.masked_fill_(mask_bool, -torch.inf)
```
Finally, we scale the scores (dividing by the square root of the key dimension keeps softmax from saturating), apply softmax to turn them into weights that sum to 1, and multiply by the Values.
```python
        # 4. Scale, then normalize scores to probabilities (each row sums to 1.0)
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )

        # 5. Aggregate the values based on the weights
        context_vec = attn_weights @ values
        return context_vec
```
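Putting steps 2 through 5 together, here is a self-contained toy run (random vectors, 3 tokens) you can execute to see the causal pattern emerge:

```python
import torch

torch.manual_seed(0)
num_tokens, d = 3, 4
queries = torch.randn(1, num_tokens, d)
keys = torch.randn(1, num_tokens, d)
values = torch.randn(1, num_tokens, d)

scores = queries @ keys.transpose(1, 2)  # (1, 3, 3)
mask = torch.triu(torch.ones(num_tokens, num_tokens), diagonal=1).bool()
scores.masked_fill_(mask, -torch.inf)

weights = torch.softmax(scores / d**0.5, dim=-1)
context = weights @ values
print(weights[0])
# Row 1 attends only to token 1; later rows spread over earlier tokens.
```

Notice that every row of `weights` sums to 1, and the first row is exactly `[1, 0, 0]`: the first token has no past to look at, so all of its attention stays on itself.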
Result: The output context_vec is no longer just the word "Bank". It is a mathematical blend of "Bank" + "Money" + "Account".
The standard Multi-Head Attention (used in GPT-2) gives every "Head" its own unique Key and Value.
This is great for accuracy but heavy on memory. Modern models like Llama 3 (covered in standalone-llama32.ipynb) use Grouped Query Attention (GQA).
In GQA, multiple Query heads share the same Key and Value heads.
Imagine a classroom: instead of giving each of 8 students (Query heads) their own textbook, 4 students share each of 2 textbooks (Key/Value heads). Everyone can still read; we just store far fewer books.
This drastically reduces the memory needed to store the Keys and Values (this storage is often called the KV Cache, which we will discuss in Chapter 8: Inference Optimization (KV Cache)).
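The savings are easy to estimate. A back-of-envelope sketch with hypothetical but Llama-like numbers (32 layers, 8192-token context, 32 query heads vs. 8 KV heads, fp16 storage):

```python
# Back-of-envelope KV-cache sizing (hypothetical, Llama-like config)
layers, seq_len, head_dim, bytes_fp16 = 32, 8192, 128, 2
num_q_heads, num_kv_heads = 32, 8

def kv_cache_bytes(n_heads):
    # 2 tensors (K and V) per layer, each of shape (seq_len, n_heads, head_dim)
    return 2 * layers * seq_len * n_heads * head_dim * bytes_fp16

mha = kv_cache_bytes(num_q_heads)   # every head stores its own K/V
gqa = kv_cache_bytes(num_kv_heads)  # heads share K/V within groups
print(f"MHA: {mha / 1e9:.1f} GB, GQA: {gqa / 1e9:.1f} GB, ratio: {mha // gqa}x")
```

With these assumed numbers, GQA shrinks the cache by exactly the ratio of query heads to KV heads (here 4x), which is why it has become the default in recent open-weight models.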
In the project file standalone-llama32.ipynb, you can see this logic in GroupedQueryAttention.
```python
class GroupedQueryAttention(nn.Module):
    def __init__(self, d_in, d_out, num_heads, num_kv_groups):
        super().__init__()
        # ... logic to define dimensions ...
        self.head_dim = d_out // num_heads
        self.group_size = num_heads // num_kv_groups
        # Notice K and V project to fewer dimensions than Q
        self.W_query = nn.Linear(d_in, d_out)
        self.W_key = nn.Linear(d_in, num_kv_groups * self.head_dim)
        self.W_value = nn.Linear(d_in, num_kv_groups * self.head_dim)
```
```python
    def forward(self, x):
        # ... generate q, k, v and reshape them into heads ...

        # We repeat the keys/values so the shapes match for the calculation:
        # each "textbook" is virtually copied for the students sharing it
        keys = keys.repeat_interleave(self.group_size, dim=1)
        values = values.repeat_interleave(self.group_size, dim=1)
        # ... proceed with standard attention ...
```
By using repeat_interleave, we make the math work as if everyone had a textbook, but we only store the unique ones in memory.
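Here is `repeat_interleave` in isolation, on a tiny made-up tensor standing in for 2 KV heads shared by 4 query heads (group size 2):

```python
import torch

# Two KV "heads", each a row; group_size = 2 query heads share each one
k = torch.tensor([[10, 20],   # head A
                  [30, 40]])  # head B
expanded = k.repeat_interleave(2, dim=0)
print(expanded)
# tensor([[10, 20],
#         [10, 20],
#         [30, 40],
#         [30, 40]])
```

Each row is duplicated consecutively, so the expanded tensor lines up head-for-head with the queries while the original, smaller tensor is what actually lives in memory.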
In this chapter, we learned how the model understands context:

- Each token is projected into Query, Key, and Value vectors.
- Dot products between Queries and Keys score how relevant every token is to every other token.
- A causal mask hides future tokens, and softmax turns the remaining scores into weights that sum to 1.
- Multiple heads run this process in parallel, and Grouped Query Attention lets several Query heads share Keys and Values to save memory.
Now that our tokens are communicating and understanding context, we need to package this mechanism into the actual building block of the neural network.
Next Chapter: The GPT Architecture (Transformer Block)
Generated by Code IQ