
Chapter 3: T5 (Encoder-Decoder)

In the previous chapters, we looked at two distinct specialists:

  1. The Writer: GPT (Decoder-only), which generates text but cannot look ahead.
  2. The Reader: BERT (Encoder-only), which understands full sentences but cannot generate long text easily.

In this chapter, we combine them to create the ultimate translator: T5 (Text-to-Text Transfer Transformer).

Motivation: The Professional Translator

Imagine you need to translate a complex legal document from English to French.

Use Case: Translation or Summarization.

To do this, you need an architecture that first Reads the entire input (Encoder), and then uses that understanding to Write the output (Decoder).

What is "Encoder-Decoder"?

This architecture connects the two worlds via a bridge.

  1. Encoder (The Reader): Takes the input text. It can look at everything (past and future) to build a deep understanding or "memory" of the sentence.
  2. The Bridge (Cross-Attention): The link between the two. Every decoder layer can attend to the encoder's output, so the writer consults the reader's understanding at each step.
  3. Decoder (The Writer): Generates the output text one word at a time. However, unlike GPT, it looks at the Encoder's "memory" while it writes.
graph LR
    subgraph "Encoder (Reader)"
        A[Input: 'Hello'] --> E[Builds Meaning]
    end
    subgraph "Decoder (Writer)"
        E --> D[Decoder]
        D --> O[Output: 'Bonjour']
    end
    style E fill:#bbf,stroke:#333
    style D fill:#f9f,stroke:#333
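The bridge can be sketched as a toy cross-attention computation. This is plain NumPy for illustration only, not Megatron-LM code; the dimensions are made up:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy sizes (not Megatron defaults): 3 source tokens, 2 target tokens, width 4
rng = np.random.default_rng(0)
encoder_memory = rng.normal(size=(3, 4))   # the Encoder's "memory"
decoder_states = rng.normal(size=(2, 4))   # the Decoder's states so far

# Cross-attention: queries come from the decoder,
# keys and values come from the encoder's memory.
Q, K, V = decoder_states, encoder_memory, encoder_memory
scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (2, 3): each target token scores each source token
weights = softmax(scores, axis=-1)         # each row sums to 1
context = weights @ V                      # (2, 4): what the "writer" reads from the "reader"
print(context.shape)  # (2, 4)
```

In self-attention the decoder would attend to its own states; here the keys and values deliberately come from the encoder, which is the whole point of the bridge.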

Using Megatron-LM: pretrain_t5.py

Megatron-LM supports T5 via a specific training script. T5 is unique because it treats every problem as a text-generation problem.

A Simple Training Command

Training a T5 model requires defining both the encoder and the decoder size.

python pretrain_t5.py \
    --encoder-num-layers 12 \
    --decoder-num-layers 12 \
    --hidden-size 768 \
    --num-attention-heads 12 \
    --kv-channels 64 \
    --seq-length 512 \
    --max-position-embeddings 512 \
    --data-path my_data_collection

Key Arguments:

  1. --encoder-num-layers: How deep the "Reader" part is.
  2. --decoder-num-layers: How deep the "Writer" part is.
  3. --kv-channels: The projection size of each attention head's key/value vectors; it usually equals hidden-size divided by num-attention-heads (768 / 12 = 64 here).
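The relationship between the arguments above can be checked with one line of arithmetic:

```python
# The values from the training command above.
hidden_size = 768
num_attention_heads = 12

# Per-head projection width: this is what --kv-channels sets explicitly.
kv_channels = hidden_size // num_attention_heads
print(kv_channels)  # 64
```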

Understanding the Input and Output

T5 is famous for its "Text-to-Text" philosophy. You don't change the model structure for different tasks; you just change the text command.

  1. Translation Input: "translate English to German: That is good." -> Target: "Das ist gut."
  2. Summary Input: "summarize: The quick brown fox..." -> Target: "A fox jumped."
  3. Classification Input: "cola sentence: The hay is eating the cow." -> Target: "not acceptable."
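The convention behind all three examples is simply a task prefix prepended to the raw text. A hypothetical helper (not part of Megatron-LM) makes the pattern explicit:

```python
# Hypothetical helper illustrating T5's text-to-text convention;
# not a Megatron-LM API.
def make_t5_input(task_prefix: str, text: str) -> str:
    """Prepend a task prefix so one model can serve many tasks."""
    return f"{task_prefix}: {text}"

print(make_t5_input("translate English to German", "That is good."))
# translate English to German: That is good.
print(make_t5_input("summarize", "The quick brown fox..."))
# summarize: The quick brown fox...
```

Because the task lives in the text itself, switching tasks requires no change to the model architecture or loss function.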

Under the Hood: The Internal Flow

How does data move through this two-part brain?

  1. The Encoder processes the source text and creates a "Hidden State" (a mathematical summary of the input).
  2. The Decoder receives the target text (what has been written so far) AND the Encoder's Hidden State.
  3. The Decoder predicts the next word.
sequenceDiagram
    participant Src as Source Text
    participant Enc as Encoder
    participant Dec as Decoder
    participant Out as Output Logits
    Src->>Enc: "Translate: Hello"
    Note over Enc: Processes bidirectionally
    Enc->>Dec: Sends "Memory" (Hidden States)
    Note over Dec: Starts generating...
    Dec->>Dec: Looks at "Memory" + "Start Token"
    Dec->>Out: Predicts "Bonjour"

Diving into the Code: t5_model.py

The logic resides in megatron/core/models/T5/t5_model.py. This class orchestrates the two halves of the model.

1. The Wrapper

The T5Model initializes two separate Transformer stacks.

# megatron/core/models/T5/t5_model.py

class T5Model(MegatronModule):
    def __init__(self, config, ...):
        super().__init__()
        
        # 1. The Reader (Encoder)
        # Looks at the full input sequence
        self.encoder = T5Encoder(config=config, ...)

        # 2. The Writer (Decoder)
        # Generates output, looking at Encoder states
        self.decoder = T5Decoder(config=config, ...)

2. The Forward Pass

This is where the handshake happens. The encoder_output is passed explicitly into the decoder.

    def forward(self, encoder_input_ids, decoder_input_ids, ...):
        # Step 1: The Encoder reads the source text
        # 'encoder_output' is the "Memory" of the input
        encoder_output = self.encoder(encoder_input_ids, ...)

        # Step 2: The Decoder generates the next step
        # It takes its own input AND the encoder's output
        decoder_output = self.decoder(
            decoder_input_ids,
            encoder_output=encoder_output, # <--- The Bridge
            ...
        )
        
        return decoder_output

Beginner Note: Notice the encoder_output argument inside the decoder call? That is the entry point for Cross-Attention, the specific mechanism that lets the writer peek at what the reader understood.
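At inference time this split pays off: the encoder reads the source once, while the decoder loops, consulting the same memory at every step. A minimal greedy-decoding sketch, where encoder, decoder, and the token IDs are hypothetical stand-ins rather than Megatron-LM APIs:

```python
# Minimal greedy decoding sketch (illustrative; not Megatron-LM code).
def greedy_generate(encoder, decoder, source_ids, bos_id, eos_id, max_len=20):
    memory = encoder(source_ids)              # Step 1: read the source ONCE
    output_ids = [bos_id]
    for _ in range(max_len):
        # Step 2: the decoder sees its own output so far AND the memory
        logits = decoder(output_ids, encoder_output=memory)
        last = logits[-1]                     # scores for the next token
        next_id = max(range(len(last)), key=last.__getitem__)  # argmax
        output_ids.append(next_id)            # Step 3: extend the target
        if next_id == eos_id:
            break
    return output_ids

# Toy stand-ins: the "decoder" always votes for token 2 (our EOS).
def toy_encoder(source_ids):
    return [[0.0] * 4 for _ in source_ids]

def toy_decoder(output_ids, encoder_output):
    return [[0.0, 0.1, 0.9] for _ in output_ids]  # logits over a 3-token vocab

print(greedy_generate(toy_encoder, toy_decoder, [5, 6, 7], bos_id=0, eos_id=2))
# [0, 2]
```

Note the asymmetry: `encoder(...)` is called exactly once, while the decoder is called inside the loop with a growing output_ids list. This is why encoder-decoder models are efficient when the input is long but the output is short (e.g. summarization).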

Summary

In this chapter, you learned:

  1. T5 (Encoder-Decoder) combines the strengths of BERT (reading) and GPT (writing).
  2. It is the go-to architecture for translation, summarization, and complex reasoning tasks.
  3. It works by passing the Encoder's "memory" to the Decoder via Cross-Attention.
  4. In Megatron-LM, T5Model manages these two distinct sub-models.

So far, all the models we have discussed (GPT, BERT, T5) rely on the Transformer architecture and its core mechanism: Attention. Attention is powerful, but its cost grows quadratically with sequence length, so it gets slower as the text gets longer.

Is there a way to build a model that remembers history without the heavy cost of Attention?

Next Chapter: Mamba (State Space Model)

