In the previous chapters, we looked at two distinct specialists: GPT, a Decoder that writes text, and BERT, an Encoder that reads it.
In this chapter, we combine them to create the ultimate translator: T5 (Text-to-Text Transfer Transformer).
Imagine you need to translate a complex legal document from English to French. This is the classic encoder-decoder use case: translation (or, similarly, summarization). To do it, you need an architecture that first reads the entire input (the Encoder) and then uses that understanding to write the output (the Decoder).
This architecture connects the two worlds via a bridge.
Megatron-LM supports T5 via a specific training script, `pretrain_t5.py`. T5 is unique because it treats every problem as a text-generation problem.
Training a T5 model requires defining both the encoder and the decoder size.
```shell
python pretrain_t5.py \
    --encoder-num-layers 12 \
    --decoder-num-layers 12 \
    --hidden-size 768 \
    --num-attention-heads 12 \
    --kv-channels 64 \
    --seq-length 512 \
    --max-position-embeddings 512 \
    --data-path my_data_collection
```
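As a quick sanity check on these sizes (assuming the common Transformer convention, not something the script enforces, that the per-head dimension is the hidden size split evenly across heads):

```python
# Assumption: kv-channels follows the usual convention of
# hidden-size divided evenly across the attention heads.
hidden_size = 768
num_attention_heads = 12
kv_channels = hidden_size // num_attention_heads
print(kv_channels)  # -> 64, matching --kv-channels 64
```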
Key Arguments:
- `--encoder-num-layers`: How deep the "Reader" (Encoder) part is.
- `--decoder-num-layers`: How deep the "Writer" (Decoder) part is.
- `--kv-channels`: The projection size each attention head uses to process information (often set explicitly for T5).

T5 is famous for its "Text-to-Text" philosophy: you don't change the model structure for different tasks; you just change the text command.
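To make the text-to-text idea concrete, here is an illustrative sketch (the exact prefix strings are examples in the T5 style, not commands defined by this script). The model is the same in every case; only the leading text changes per task:

```python
# Illustrative T5-style task prefixes: one model, many tasks,
# each selected purely by the text command at the front.
prompts = [
    "translate English to French: The contract is legally binding.",
    "summarize: The quarterly report describes revenue growth ...",
]
for prompt in prompts:
    task, source_text = prompt.split(": ", 1)
    print(task, "->", source_text)
```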
How does data move through this two-part brain?
t5_model.py
The logic resides in megatron/core/models/T5/t5_model.py. This class orchestrates the two halves of the model.
The T5Model initializes two separate Transformer stacks.
```python
# megatron/core/models/T5/t5_model.py
class T5Model(MegatronModule):
    def __init__(self, config, ...):
        super().__init__()

        # 1. The Reader (Encoder)
        # Looks at the full input sequence
        self.encoder = T5Encoder(config=config, ...)

        # 2. The Writer (Decoder)
        # Generates output, looking at the Encoder's states
        self.decoder = T5Decoder(config=config, ...)
```
This is where the handshake happens. The encoder_output is passed explicitly into the decoder.
```python
def forward(self, encoder_input_ids, decoder_input_ids, ...):
    # Step 1: The Encoder reads the source text.
    # 'encoder_output' is the "Memory" of the input.
    encoder_output = self.encoder(encoder_input_ids, ...)

    # Step 2: The Decoder generates the next step.
    # It takes its own input AND the encoder's output.
    decoder_output = self.decoder(
        decoder_input_ids,
        encoder_output=encoder_output,  # <--- The Bridge
        ...
    )
    return decoder_output
```
Beginner Note: Notice the `encoder_output` argument inside the decoder call? That is Cross-Attention: the specific mechanism that lets the writer peek at what the reader understood.
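A minimal NumPy sketch of what cross-attention computes (illustrative only, not Megatron's actual implementation): queries come from the decoder, while keys and values come from the encoder output.

```python
import numpy as np

# Toy cross-attention: the decoder's queries attend over the encoder's output.
rng = np.random.default_rng(0)
d = 8                                          # toy hidden size
encoder_output = rng.standard_normal((5, d))   # 5 source tokens (the "memory")
decoder_hidden = rng.standard_normal((3, d))   # 3 target tokens generated so far

Q = decoder_hidden        # what the Writer is asking about
K = V = encoder_output    # what the Reader understood

scores = Q @ K.T / np.sqrt(d)                   # (3, 5): each target vs. every source token
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over source positions
context = weights @ V                           # (3, d): encoder info blended per target token
print(context.shape)                            # -> (3, 8)
```

Each row of `weights` sums to 1, so every target token receives a weighted blend of the source tokens it found most relevant.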
In this chapter, you learned:

- T5 pairs a Reader (Encoder) with a Writer (Decoder), steering one model across many tasks with a text command.
- T5Model manages these two distinct sub-models.

So far, all the models we have discussed (GPT, BERT, T5) rely on the Transformer architecture and its core mechanism: Attention. Attention is powerful, but it gets slower as the text gets longer.
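To see why attention slows down on long text, note that it scores every token against every other token, so the score matrix grows with the square of the sequence length:

```python
# Attention builds a score for every pair of tokens, so the work grows
# as seq_len ** 2: doubling the text quadruples the pairwise scores.
for seq_len in (512, 1024, 2048):
    print(seq_len, "tokens ->", seq_len ** 2, "pairwise scores")
```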
Is there a way to build a model that remembers history without the heavy cost of Attention?
Next Chapter: Mamba (State Space Model)
Generated by Code IQ