Chapter 2 · CORE


Chapter 2: BERT (Encoder-only)

In the previous chapter, GPT (Decoder-only), we learned about models that act like writers, generating text one word at a time by looking backward.

Now, let's flip the script. Instead of a writer, imagine a Reader or an Editor. This is the BERT (Bidirectional Encoder Representations from Transformers) architecture.

Motivation: The "Fill-in-the-Blank" Expert

GPT is great at guessing the next word, but it cannot use the words that come after a position to understand it, because its attention only looks backward.

Use Case: Sentiment Analysis or Missing Word Prediction. Consider a sentence like:

    The food was [MASK], but the service was terrible.

To fill in this blank correctly, the model needs to read "The food was..." AND "...but the service was terrible." It needs to see the whole sentence at once.

This is the core problem Encoder-only models solve. They are used for classification, search engines, and question answering: tasks where "understanding" is more important than "creative writing."

What is "Encoder-only"?

To understand BERT, imagine a detective looking at a crime scene board. The detective sees all the evidence (words) simultaneously and draws lines connecting everything to everything else.

This is called Bidirectional Attention.

graph TD
    subgraph "BERT (Encoder)"
        A[The] --- B[quick]
        B --- C[brown]
        A --- C
    end
    style A fill:#fff,stroke:#333
    style B fill:#bbf,stroke:#333
    style C fill:#fff,stroke:#333

In BERT, "quick" learns from "The" (past) AND "brown" (future) at the exact same time.
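The contrast with GPT can be made concrete with a tiny sketch (plain Python, not Megatron code): a causal mask hides the future, while a bidirectional mask lets every token attend to every position.

```python
def causal_mask(n):
    # GPT-style: token i may only attend to positions j <= i
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    # BERT-style: every token may attend to every position
    return [[1] * n for _ in range(n)]

tokens = ["The", "quick", "brown"]
n = len(tokens)

# What "quick" (index 1) is allowed to see under each scheme:
print(causal_mask(n)[1])         # [1, 1, 0] -> cannot see "brown"
print(bidirectional_mask(n)[1])  # [1, 1, 1] -> sees past AND future
```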

Using Megatron-LM: pretrain_bert.py

Just like GPT, Megatron-LM provides a specific script to train these models.

A Simple Training Command

Here is how you launch a standard BERT training job. Notice the similarities to GPT, but with a different script name.

python pretrain_bert.py \
    --num-layers 12 \
    --hidden-size 768 \
    --num-attention-heads 12 \
    --seq-length 512 \
    --max-position-embeddings 512 \
    --tokenizer-type BertWordPieceLowerCase \
    --data-path my_data_text_document

Key Differences from GPT:

  1. --seq-length 512: BERT usually processes shorter chunks of text (sentences or paragraphs) compared to GPT.
  2. --tokenizer-type: BERT typically uses a "WordPiece" tokenizer (though this is flexible).
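To give a feel for what a WordPiece tokenizer does, here is a toy greedy longest-match sketch with a hypothetical mini-vocabulary (this is an illustration of the splitting behavior, not BERT's actual vocabulary or tokenizer code). Unknown words are broken into known subword pieces, with continuation pieces prefixed by `##`:

```python
VOCAB = {"play", "##ing", "##ed", "un", "##happy", "the"}  # toy vocabulary

def wordpiece(word, vocab=VOCAB):
    # Greedy longest-match-first splitting, as WordPiece does
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces get the ## prefix
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no piece matched at all
        start = end
    return pieces

print(wordpiece("playing"))  # ['play', '##ing']
print(wordpiece("unhappy"))  # ['un', '##happy']
```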

Understanding the Input and Output

What happens to your text when you run this?

  1. Input (Masking): The script takes your text and randomly corrupts it, replacing a fraction of the tokens (15% in standard BERT) with a special [MASK] token.
  2. Output: The model tries to reconstruct the original word at each [MASK] position.
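The masking step above can be sketched as follows. This is a simplified illustration, not Megatron's actual data pipeline (real BERT also sometimes substitutes a random token or keeps the original instead of always writing [MASK]):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    # Replace a random subset of tokens with [MASK];
    # the originals become the prediction targets.
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append("[MASK]")
            targets[i] = tok  # the model must recover this word
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = "Paris is the capital of France".split()
corrupted, targets = mask_tokens(tokens, mask_prob=0.3)
print(corrupted)  # some tokens replaced by [MASK]
print(targets)    # position -> original word to predict
```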

Under the Hood: The Internal Flow

Let's visualize the "forward pass" of a BERT model. Unlike GPT, there is no loop generating one word at a time. The entire sequence is processed in parallel.

sequenceDiagram
    participant D as Data Loader
    participant M as Masking Logic
    participant E as BERT Encoder
    participant L as Loss Calc
    D->>M: "Paris is the capital of France"
    M->>E: "Paris is the [MASK] of France"
    Note over E: Looks at ALL words simultaneously
    E->>L: Predicts "[MASK] = capital"
    L->>L: Compare with real word

Diving into the Code: bert_model.py

The logic for this architecture lives in megatron/core/models/bert/bert_model.py.

1. The Wrapper

The BertModel class initializes the components. It looks very similar to GPT, but the way it's used is different.

# megatron/core/models/bert/bert_model.py

class BertModel(MegatronModule):
    def __init__(self, config, ...):
        super().__init__()

        # 1. Language Model: The stack of Transformer layers
        # This holds the Embeddings + Encoder
        self.language_model = get_language_model(config, ...)

        # 2. Binary Head: An extra output to guess:
        # "Did Sentence B come after Sentence A?"
        self.binary_head = get_binary_head(config, ...)

2. The Forward Pass

Here is the critical difference. In GPT, we used a mask to hide the future. In BERT, the attention_mask allows everyone to see everyone (usually just padding is masked).

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        # Pass data through the bidirectional encoder
        # The attention_mask here allows full visibility
        lm_output = self.language_model(
            input_ids,
            None, # Position IDs are auto-generated
            attention_mask,
            token_type_ids=token_type_ids
        )
        
        # Return the features (context) for every word
        return lm_output

Beginner Note: token_type_ids (also called segment IDs) are a BERT-style input that GPT doesn't have. They tell the model "This part is Sentence A" and "This part is Sentence B".
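Here is a sketch of what those segment IDs look like for a sentence pair, using BERT's standard [CLS]/[SEP] layout (the `build_segments` helper is hypothetical, written just for this illustration):

```python
def build_segments(sent_a, sent_b):
    # Layout: [CLS] A... [SEP] B... [SEP]
    # Segment 0 covers [CLS] + Sentence A + first [SEP];
    # segment 1 covers Sentence B + final [SEP].
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    token_type_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    return tokens, token_type_ids

tokens, types = build_segments(["the", "dog", "barked"], ["it", "was", "loud"])
print(list(zip(tokens, types)))
```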

3. The Output Heads

In pretrain_bert.py, the model doesn't just return raw numbers; it usually returns two specific things needed for training:

def forward_step(data_iterator, model):
    # ... (setup code) ...
    
    # Run the model
    lm_logits, sop_logits = model(tokens, attention_mask, types)

    # 1. lm_logits: Predictions for the [MASK] words
    # 2. sop_logits: Prediction for "Sentence Order" (Are A and B in the right order?)
    return lm_logits, sop_logits
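During training, these two outputs feed two cross-entropy losses that are simply added together. The sketch below shows the arithmetic in plain Python with made-up, already-softmaxed probabilities (the real code works on raw logits tensors with fused kernels):

```python
import math

def cross_entropy(probs, target):
    # Negative log-likelihood of the correct class
    return -math.log(probs[target])

# Hypothetical model outputs for one training example:
mlm_probs = [0.7, 0.2, 0.1]  # P("capital"), P("city"), P("home") at a [MASK]
sop_probs = [0.9, 0.1]       # P(correct order), P(swapped)

mlm_loss = cross_entropy(mlm_probs, 0)  # true word was "capital"
sop_loss = cross_entropy(sop_probs, 0)  # sentences were in the right order

total_loss = mlm_loss + sop_loss
print(round(total_loss, 3))
```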

Summary

In this chapter, you learned:

  1. BERT (Encoder-only) models are designed to understand context by looking at the whole sentence at once (Bidirectional).
  2. They are trained by filling in blanks ([MASK]), not by writing stories.
  3. You use pretrain_bert.py to train them.
  4. The bert_model.py code allows the attention mechanism to see "future" words.

We have now covered the Writer (GPT) and the Reader (BERT).

But what if you want a model that can read a complex input (like a French sentence) and write a complex output (like an English translation)? You need to combine them.

Next Chapter: T5 (Encoder-Decoder)


Generated by Code IQ