In the previous chapter, RETRO, we gave our model an "Open Book" so it could look up external text information. We expanded its memory.
Now, we are going to expand its senses.
Up until now, every model we have discussed (GPT, BERT, T5) has been blind: it only understands text. In this chapter, we will build the "Eyes" of the AI using Vision Encoders. In Megatron-LM, the most common architectures for this are CLIP, SigLIP, and InternViT.
Imagine you have a brilliant writer (GPT). You hand them a photograph of a sunset and ask, "What is this?"
Because the writer only knows text, they see a grid of meaningless numbers (pixel values like [255, 0, 0]). They cannot see the "sunset."
Use Case: Image Classification or Visual Description.
This is exactly what CLIP (Contrastive Language-Image Pre-training) and its cousins (SigLIP, InternViT) do. They bridge the gap between what eyes see and what language models read.
Almost all modern vision models (including CLIP) use an architecture called ViT (Vision Transformer).
The core idea is simple: Treat an image like a sentence.
If you cut a photo into a 16x16 grid of small squares, you can feed those squares into a Transformer just like you feed words into BERT.
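As a toy illustration of that idea (plain Python, not Megatron code), here is how a tiny 4x4 grayscale "image" could be cut into 2x2 patches, turning a pixel grid into a sequence of "words":

```python
# Toy sketch: cut a 4x4 grayscale "image" into 2x2 patches.
# Each flattened patch becomes one "word" in the sequence.
image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
patch = 2
patches = []
for row in range(0, 4, patch):
    for col in range(0, 4, patch):
        # Flatten each 2x2 square into one vector
        patches.append([image[r][c]
                        for r in range(row, row + patch)
                        for c in range(col, col + patch)])

# 4 patches of length 4 each -- a "sentence" of 4 "words"
```

A real ViT does the same thing at scale, then projects each flattened patch into the model's hidden dimension.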
In Megatron-LM, these models are often used as the "Vision Tower" (the eyes) for larger multimodal robots. However, you can also train them on their own.
To set up a Vision Transformer (like the vision part of CLIP), you define the input size (image resolution) and the patch size.
# Example configuration logic (conceptual)
python pretrain_vision_model.py \
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --patch-dim 14 \
    --img-h 336 \
    --img-w 336 \
    --vision-backbone-type vit
Key Arguments:
- --patch-dim 14: We cut the image into squares of 14x14 pixels.
- --img-h 336 / --img-w 336: The height and width of the input image (3 channels x 336 height x 336 width).
- --vision-backbone-type vit: This tells Megatron to use the standard ViT structure shared by CLIP, SigLIP, etc.

How do we turn pixels into something a Transformer can read? The secret is Projection.
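With these numbers you can work out the sequence length the Transformer actually sees: a 336x336 image cut into 14x14 patches gives a 24x24 grid, i.e. 576 image "tokens" (before any special tokens, such as a class token, are added):

```python
# Sequence length implied by the config above
img_h, img_w, patch_dim = 336, 336, 14

patches_per_col = img_h // patch_dim   # 24
patches_per_row = img_w // patch_dim   # 24
seq_len = patches_per_row * patches_per_col

print(seq_len)  # 576
```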
clip_vit_model.py
The logic for these vision towers lives in megatron/core/models/vision/clip_vit_model.py.
The CLIPViTModel class acts remarkably like BERT. It's an Encoder-only Transformer, but with a special start.
# megatron/core/models/vision/clip_vit_model.py

class CLIPViTModel(MegatronModule):
    def __init__(self, config, ...):
        super().__init__()

        # 1. The Patch Embedder (The Scissors)
        #    Usually a Conv2d layer that turns pixels into vectors
        self.conv1 = torch.nn.Conv2d(...)

        # 2. Position Embeddings (The Map)
        #    Teaches the model where each square belongs
        self.position_embeddings = torch.nn.Parameter(...)

        # 3. The Brain (Transformer Encoder)
        #    Standard transformer blocks (Attention + MLP)
        self.decoder = TransformerBlock(config, ...)
Beginner Note: Even though the variable is named self.decoder in some codebases, structurally it acts like a BERT Encoder (it sees the whole image at once).
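Why a Conv2d for the "scissors"? A convolution whose kernel size and stride both equal the patch size is mathematically just a linear projection of each flattened patch. Here is a minimal NumPy sketch of that equivalence (illustrative only, not the Megatron implementation; all shapes and names here are assumptions):

```python
import numpy as np

batch, channels, img, patch, hidden = 2, 3, 8, 4, 16

x = np.random.rand(batch, channels, img, img)              # [B, C, H, W]
weight = np.random.rand(channels * patch * patch, hidden)  # projection matrix

grid = img // patch  # 2 patches per side -> 4 patches total

# Cut into patches: [B, Seq_Len, C * patch * patch]
patches = (x.reshape(batch, channels, grid, patch, grid, patch)
             .transpose(0, 2, 4, 1, 3, 5)
             .reshape(batch, grid * grid, channels * patch * patch))

# Linear projection of each flattened patch
# == Conv2d(kernel_size=patch, stride=patch)
embeddings = patches @ weight   # [B, Seq_Len, Hidden_Size]
```

This is why ViT papers describe the patch embedder interchangeably as a convolution or as a linear layer over flattened patches.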
Here is how the image travels through the model.
def forward(self, input_image):
    # Step 1: Turn image into patch features
    # Input:  [Batch, Colors, Height, Width]
    # Output: [Batch, Hidden_Size, Grid_H, Grid_W]
    x = self.conv1(input_image)

    # Step 2: Flatten the grid into a sequence and add position info
    # Shape becomes [Batch, Seq_Len, Hidden_Size]
    x = x.flatten(2).transpose(1, 2)
    x = x + self.position_embeddings

    # Step 3: Run the Transformer
    # The model analyzes the relationships between patches
    hidden_states = self.decoder(x, ...)

    # Step 4: Return the features
    return hidden_states
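You can trace those shape changes with plain NumPy. Using the config from earlier (336x336 image, 14x14 patches, hidden size 1024), the tensor goes from a conv-style grid to a [Batch, 576, 1024] sequence (a sketch of the shapes only; zero-filled arrays stand in for real layers):

```python
import numpy as np

batch, hidden, grid = 2, 1024, 24  # 336 // 14 == 24 patches per side

# Step 1: pretend conv1 already ran -> [Batch, Hidden, Grid, Grid]
x = np.zeros((batch, hidden, grid, grid))

# Step 2: flatten the spatial grid and move it to the sequence axis
x = x.reshape(batch, hidden, grid * grid)   # like flatten(2)
x = x.transpose(0, 2, 1)                    # -> [Batch, Seq_Len, Hidden]

# Add one position vector per patch (broadcasts over the batch)
position_embeddings = np.zeros((grid * grid, hidden))
x = x + position_embeddings
```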
If the code structure is mostly the same, why the different names?
They differ in how they were trained (the "Loss Function") and the data they saw.
In Megatron-LM, clip_vit_model.py is the generic "container" that can hold the weights for any of these architectures because they all rely on the Vision Transformer structure.
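To make the training difference concrete: CLIP scores every image against every caption in the batch and applies a softmax (cross-entropy) contrastive loss, while SigLIP scores each image-caption pair independently with a sigmoid loss. A simplified NumPy sketch of that contrast (real implementations add learnable temperature and bias terms, and CLIP also symmetrizes the loss over rows and columns):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, dim = 4, 8

# L2-normalized image and text features; matching pairs share an index
img = rng.normal(size=(batch, dim))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(batch, dim))
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

logits = img @ txt.T  # similarity of every image with every caption

# CLIP-style: softmax over each row -- images "compete" for their caption
row_probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
clip_loss = -np.log(np.diag(row_probs)).mean()

# SigLIP-style: independent sigmoid per pair; the diagonal holds positives
labels = np.eye(batch)
pair_probs = 1.0 / (1.0 + np.exp(-logits))
siglip_loss = -(labels * np.log(pair_probs)
                + (1 - labels) * np.log(1 - pair_probs)).mean()
```

Because SigLIP treats each pair independently, its loss does not require gathering the full similarity matrix across devices, which is one reason it scales well to large batches.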
In this chapter, you learned how CLIPViTModel handles the conversion of pixels into feature vectors.

Now we have a Brain (GPT) and we have Eyes (CLIP).
What happens if we wire them together? What if we feed the output of the Eyes directly into the Brain? We get a model that can chat about images.
Next Chapter: LLaVA (Vision-Language)