In the previous chapter, GPT (Decoder-only), we learned about models that act like writers, generating text one word at a time by looking backward.
Now, let's flip the script. Instead of a writer, imagine a Reader or an Editor. This is the BERT (Bidirectional Encoder Representations from Transformers) architecture.
GPT is great at guessing the next word, but it struggles to understand a word based on what comes after it.
Use Case: Sentiment Analysis or Missing Word Prediction.
Consider the sentence: "The food was ____, but the service was terrible."
To fill in this blank correctly, the model needs to read "The food was..." AND "...but the service was terrible." It needs to see the whole sentence at once.
This is the core problem Encoder-only models solve. They are used for classification, search engines, and question answering: tasks where "understanding" is more important than "creative writing."
To understand BERT, imagine a detective looking at a crime scene board. The detective sees all the evidence (words) simultaneously and draws lines connecting everything to everything else.
This is called Bidirectional Attention.
In BERT, "quick" learns from "The" (past) AND "brown" (future) at the exact same time.
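The difference between GPT's one-directional view and BERT's bidirectional view comes down to the shape of the attention mask. Here is a minimal NumPy sketch (the sentence and indices are illustrative, not Megatron-LM code):

```python
import numpy as np

# Hypothetical 5-token sentence: "The quick brown fox jumps"
seq_len = 5

# GPT-style causal mask: token i may only attend to positions <= i
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# BERT-style bidirectional mask: every token attends to every token
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)

# "quick" is at index 1:
# under the causal mask it sees only "The" and itself;
# under the bidirectional mask it also sees "brown", "fox", "jumps".
print(causal_mask[1])         # [ True  True False False False]
print(bidirectional_mask[1])  # [ True  True  True  True  True]
```

Row 1 of each matrix is exactly what "quick" is allowed to look at, and the only difference between the two architectures at this level is whether the upper triangle is zeroed out.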
pretrain_bert.py
Just like GPT, Megatron-LM provides a specific script to train these models.
Here is how you launch a standard BERT training job. Notice the similarities to GPT, but with a different script name.
python pretrain_bert.py \
--num-layers 12 \
--hidden-size 768 \
--num-attention-heads 12 \
--seq-length 512 \
--max-position-embeddings 512 \
--tokenizer-type BertWordPieceLowerCase \
--data-path my_data_text_document
Key Differences from GPT:
- --seq-length 512: BERT usually processes shorter chunks of text (sentences or paragraphs) compared to GPT.
- --tokenizer-type: BERT typically uses a "WordPiece" tokenizer (though this is flexible).

What happens to your text when you run this?
A fraction of the input tokens (in the original BERT recipe, about 15%) is hidden behind a special [MASK] token, and the model learns to predict the original word at each [MASK] position.
Let's visualize the "forward pass" of a BERT model. Unlike GPT, there is no loop generating one word at a time. The entire sequence is processed in parallel.
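Masked-language-model input preparation can be sketched in a few lines. This is a simplified illustration, not Megatron-LM's actual data pipeline (which also sometimes keeps or randomizes masked tokens instead of always using [MASK]):

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK], remembering the originals."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)  # None = not a prediction target
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            labels[i] = tokens[i]   # remember the original word (training target)
            masked[i] = mask_token  # hide it from the model
    return masked, labels

tokens = "the food was great but the service was terrible".split()
masked, labels = mask_tokens(tokens)
```

The model only receives `masked`; the loss is computed only at the positions where `labels` is set.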
bert_model.py
The logic for this architecture lives in megatron/core/models/bert/bert_model.py.
The BertModel class initializes the components. It looks very similar to GPT, but the way it's used is different.
# megatron/core/models/bert/bert_model.py
class BertModel(MegatronModule):
    def __init__(self, config, ...):
        super().__init__()

        # 1. Language Model: the stack of Transformer layers.
        #    This holds the Embeddings + Encoder.
        self.language_model = get_language_model(config, ...)

        # 2. Binary Head: an extra output to guess:
        #    "Did Sentence B come after Sentence A?"
        self.binary_head = get_binary_head(config, ...)
Here is the critical difference. In GPT, we used a mask to hide the future. In BERT, the attention_mask allows everyone to see everyone (usually just padding is masked).
    def forward(self, input_ids, attention_mask, token_type_ids=None):
        # Pass data through the bidirectional encoder.
        # The attention_mask here allows full visibility.
        lm_output = self.language_model(
            input_ids,
            None,  # position IDs are auto-generated
            attention_mask,
            token_type_ids=token_type_ids
        )

        # Return the features (context) for every word
        return lm_output
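So what does a BERT attention mask actually mask? Typically just the padding. Here is a minimal sketch (the token IDs and pad ID are illustrative, not from the Megatron-LM tokenizer):

```python
# A batch item padded out to length 8. ID 0 stands in for the pad token.
PAD_ID = 0
input_ids = [101, 2057, 2293, 14324, 102, 0, 0, 0]  # real tokens + padding

# 1 = attend, 0 = ignore. Every real token is visible to every other
# real token, past or future; only the trailing padding is hidden.
attention_mask = [1 if tok != PAD_ID else 0 for tok in input_ids]
print(attention_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Contrast this with GPT, where the mask additionally hides every position to the right of the current token.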
Beginner Note: token_type_ids (also called segment IDs) are a BERT-style input that GPT does not use. They tell the model "This part is Sentence A" and "This part is Sentence B".
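For a sentence pair, the segment IDs simply switch from 0 to 1 at the boundary. A small sketch with illustrative tokens:

```python
# Sentence A and Sentence B, packed into one sequence the way
# BERT-style models expect. Tokens here are illustrative.
sentence_a = ["[CLS]", "the", "food", "was", "great", "[SEP]"]
sentence_b = ["the", "service", "was", "terrible", "[SEP]"]

tokens = sentence_a + sentence_b
token_type_ids = [0] * len(sentence_a) + [1] * len(sentence_b)

print(token_type_ids)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

This is what lets the binary head ask "Did Sentence B really come after Sentence A?": the model can tell which tokens belong to which sentence.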
In pretrain_bert.py, the model doesn't just return raw numbers; it usually returns two specific things needed for training:
def forward_step(data_iterator, model):
    # ... (setup code) ...

    # Run the model
    lm_logits, sop_logits = model(tokens, attention_mask, types)

    # 1. lm_logits: predictions for the [MASK] words
    # 2. sop_logits: prediction for "Sentence Order" (is this the next sentence?)
    return lm_logits, sop_logits
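These two outputs feed two losses that are added together during training. Here is a minimal NumPy sketch of that idea; the shapes, values, and helper are illustrative, not Megatron-LM's actual loss code:

```python
import numpy as np

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target]

vocab_size = 10
lm_logits = np.zeros(vocab_size)   # model output at one [MASK] position
lm_logits[3] = 5.0                 # model strongly predicts token id 3
sop_logits = np.array([0.1, 2.0])  # binary head: [wrong order, right order]

mlm_loss = cross_entropy(lm_logits, target=3)  # small: prediction is correct
sop_loss = cross_entropy(sop_logits, target=1)
total_loss = mlm_loss + sop_loss
```

In real training the masked-LM term is averaged over every [MASK] position in the batch, and the sentence-order term over every sequence pair.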
In this chapter, you learned:
- How BERT learns by filling in blanks ([MASK]), not by writing stories.
- How to use pretrain_bert.py to train these models.
- How the bert_model.py code allows the attention mechanism to see "future" words.

We have now covered the Writer (GPT) and the Reader (BERT).
But what if you want a model that can read a complex input (like a French sentence) and write a complex output (like an English translation)? You need to combine them.
Next Chapter: T5 (Encoder-Decoder)