Chapter 7 · CORE

CLIP / SigLIP / InternViT (Vision)


In the previous chapter, RETRO, we gave our model an "Open Book" so it could look up external text information. We expanded its memory.

Now, we are going to expand its senses.

Up until now, every model we discussed (GPT, BERT, T5) has been blind. They only understand text. In this chapter, we will build the "Eyes" of the AI using Vision Encoders. In Megatron-LM, the most common architectures for this are CLIP, SigLIP, and InternViT.

Motivation: The Blind Writer

Imagine you have a brilliant writer (GPT). You hand them a photograph of a sunset and ask, "What is this?"

Because the writer only knows text, they see a grid of meaningless numbers (pixel values like [255, 0, 0]). They cannot see the "sunset."

Use Case: Image Classification or Visual Description.

This is exactly what CLIP (Contrastive Language-Image Pre-training) and its cousins (SigLIP, InternViT) do. They bridge the gap between what eyes see and what language models read.

What is a Vision Transformer (ViT)?

Almost all modern vision models (including CLIP) use an architecture called ViT (Vision Transformer).

The core idea is simple: Treat an image like a sentence.

  1. Text: A sentence is a sequence of words.
  2. Image: An image is a sequence of Patches (small squares).

If you cut a photo into a grid of small squares (say, 16x16 pixels each), you can feed those squares into a Transformer just like you feed words into BERT.
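The "cutting" itself is just a reshape. Here is a minimal sketch using NumPy on a tiny 4x4 "image" with 2x2 patches; the numbers are toy values chosen so the result is easy to check by hand.

```python
import numpy as np

# Toy patchify: a 4x4 grayscale "image" cut into 2x2 squares.
H = W = 4   # image height and width
P = 2       # patch side length

img = np.arange(H * W).reshape(H, W)

# Split rows and columns into P-sized blocks, then flatten each block
# into one row -- one "word" per patch:
patches = (img.reshape(H // P, P, W // P, P)
              .transpose(0, 2, 1, 3)
              .reshape(-1, P * P))

print(patches.shape)      # (4, 4): 4 patches, each holding 4 pixel values
print(patches[0])         # [0 1 4 5]: the top-left square
```

The resulting `(num_patches, pixels_per_patch)` array is exactly the "sentence of squares" the Transformer consumes.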

```mermaid
graph LR
    Img[Whole Image] --> Cut[Scissors]
    Cut --> P1[Square 1]
    Cut --> P2[Square 2]
    Cut --> P3[Square 3]
    P1 --> T[Transformer Encoder]
    P2 --> T
    P3 --> T
    T --> V[Image Vector]
    style Img fill:#bbf,stroke:#333
    style T fill:#f9f,stroke:#333
```

Using Megatron-LM: The Vision Model

In Megatron-LM, these models are often used as the "Vision Tower" (the eyes) of larger multimodal models. However, you can also train them on their own.

A Simple Configuration

To set up a Vision Transformer (like the vision part of CLIP), you define the input size (image resolution) and the patch size.

```bash
# Example configuration logic (conceptual)
python pretrain_vision_model.py \
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --patch-dim 14 \
    --img-h 336 \
    --img-w 336 \
    --vision-backbone-type vit
```

Key Arguments:

  1. --patch-dim 14: We cut the image into squares of 14x14 pixels.
  2. --img-h 336: The total height of the input image.
  3. --vision-backbone-type: This tells Megatron to use the standard ViT structure used by CLIP, SigLIP, etc.
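These three flags determine how long the image "sentence" is. A quick sanity check of the arithmetic with the values above:

```python
# Deriving the patch sequence length from the flags above.
img_h = img_w = 336
patch_dim = 14

grid = img_h // patch_dim                 # 24 squares per side
seq_len = grid * (img_w // patch_dim)     # 24 x 24 grid of squares

# Note: some ViT variants also prepend one class token, giving 577.
print(grid, seq_len)  # 24 576
```

So a 336x336 image becomes a sequence of 576 patch "words", comparable to a medium-length paragraph of text.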

Understanding the Input and Output

  1. Input: A raw image (e.g., a JPG file converted to a tensor of shape 3 channels x 336 height x 336 width).
  2. Processing: The image is cut into patches, and the Transformer layers relate the patches to one another via attention.
  3. Output: A single vector (a list of numbers) that represents the "meaning" of the image (e.g., "This is a cat").

Under the Hood: The Internal Flow

How do we turn pixels into something a Transformer can read? The secret is Projection.

  1. Patchify: A Convolutional layer slides over the image. It turns every 14x14 pixel square into a single vector of numbers (an embedding).
  2. Position Embedding: Since the Transformer has no built-in sense of position, we must tag every square: "You are the top-left corner" or "You are the bottom-right corner."
  3. Transformer Layers: The squares "talk" to each other via Attention to figure out the shapes and objects.
```mermaid
sequenceDiagram
    participant Img as Raw Pixels
    participant Conv as Patch Embedder
    participant Pos as Position Tags
    participant Enc as Transformer
    participant Out as Vision Features
    Img->>Conv: 336x336 Image
    Note over Conv: Cuts into 14x14 squares
    Conv->>Pos: List of Square Vectors
    Pos->>Enc: Add "Location" info
    Note over Enc: "Top-Left" talks to "Center"
    Enc->>Out: Final Vector (The "Meaning")
```
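Steps 1 and 2 can be traced with a stock `torch.nn.Conv2d`. This is an illustrative shape walk-through using the chapter's numbers (336px image, 14px patches, hidden size 1024), not Megatron-LM's actual embedder:

```python
import torch

# The "Patchify" trick: a Conv2d whose kernel and stride both equal the
# patch size slides over the image without overlap, emitting one vector
# per 14x14 square.
conv = torch.nn.Conv2d(3, 1024, kernel_size=14, stride=14)

img = torch.randn(1, 3, 336, 336)    # [Batch, Colors, Height, Width]
x = conv(img)                        # [1, 1024, 24, 24]: one vector per square
x = x.flatten(2).transpose(1, 2)     # [1, 576, 1024]: a "sentence" of 576 patches

print(x.shape)  # torch.Size([1, 576, 1024])
```

After this step the image is indistinguishable, shape-wise, from a batch of 576-token text embeddings, which is why the rest of the model can be an ordinary Transformer.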

Diving into the Code: clip_vit_model.py

The logic for these vision towers lives in megatron/core/models/vision/clip_vit_model.py.

1. The Wrapper (CLIPViTModel)

This class acts remarkably like BERT. It's an Encoder-only transformer, but with a special start.

```python
# megatron/core/models/vision/clip_vit_model.py

class CLIPViTModel(MegatronModule):
    def __init__(self, config, ...):
        super().__init__()

        # 1. The Patch Embedder (The Scissors)
        # Usually a Conv2d layer that turns pixels into vectors
        self.conv1 = torch.nn.Conv2d(...)

        # 2. Position Embeddings (The Map)
        # Teaches the model where each square belongs
        self.position_embeddings = torch.nn.Parameter(...)

        # 3. The Brain (Transformer Encoder)
        # Standard transformer blocks (Attention + MLP)
        self.decoder = TransformerBlock(config, ...)
```

Beginner Note: Even though the variable is named self.decoder in some codebases, structurally it acts like a BERT Encoder (it sees the whole image at once).

2. The Forward Pass

Here is how the image travels through the model.

```python
    def forward(self, input_image):
        # Step 1: Turn image into patches
        # Input: [Batch, Colors, Height, Width]
        # Output: [Batch, Seq_Len, Hidden_Size]
        x = self.conv1(input_image)

        # Step 2: Flatten and add position info
        x = x.flatten(2).transpose(1, 2)
        x = x + self.position_embeddings

        # Step 3: Run the Transformer
        # The model analyzes the relationship between patches
        hidden_states = self.decoder(x, ...)

        # Step 4: Return the features
        return hidden_states
```
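The same flow can be exercised end-to-end with a tiny, self-contained stand-in. This is a sketch only: PyTorch's stock `nn.TransformerEncoder` substitutes for Megatron's `TransformerBlock`, and all sizes are toy values, not Megatron-LM defaults.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """A minimal ViT-style encoder mirroring the forward pass above."""

    def __init__(self, img=32, patch=8, hidden=64, heads=4, layers=2):
        super().__init__()
        n_patches = (img // patch) ** 2
        # "The Scissors": kernel = stride = patch size
        self.conv1 = nn.Conv2d(3, hidden, kernel_size=patch, stride=patch)
        # "The Map": one learned vector per patch position
        self.position_embeddings = nn.Parameter(torch.zeros(1, n_patches, hidden))
        # "The Brain": a stock encoder standing in for TransformerBlock
        layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, x):
        x = self.conv1(x)                 # [B, hidden, 4, 4]
        x = x.flatten(2).transpose(1, 2)  # [B, 16, hidden]
        x = x + self.position_embeddings
        return self.encoder(x)            # [B, 16, hidden]

model = TinyViT()
feats = model(torch.randn(2, 3, 32, 32))
print(feats.shape)  # torch.Size([2, 16, 64])
```

A 32x32 image with 8x8 patches yields a 16-token sequence; swap in 336 and 14 and you get the 576-token sequence the real model produces.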

3. What is the difference between CLIP, SigLIP, and InternViT?

If the code structure is mostly the same, why the different names?

They differ in how they were trained (the "Loss Function") and the data they saw. Roughly: CLIP uses a softmax-based contrastive loss, where each image must pick its matching caption out of the whole batch; SigLIP replaces the softmax with a simpler sigmoid loss that scores every image-text pair independently; and InternViT is a much larger ViT trained as part of the InternVL family on web-scale image-text data.

In Megatron-LM, clip_vit_model.py is the generic "container" that can hold the weights for any of these architectures because they all rely on the Vision Transformer structure.
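The CLIP-versus-SigLIP difference is easiest to see in code. This is an illustrative sketch of the two training objectives on toy embeddings (the learnable temperature and bias terms both methods use are omitted), not Megatron-LM code:

```python
import torch
import torch.nn.functional as F

# A batch of 4 matched image/text embedding pairs (random stand-ins).
B, D = 4, 8
img = F.normalize(torch.randn(B, D), dim=-1)
txt = F.normalize(torch.randn(B, D), dim=-1)
logits = img @ txt.t()  # [B, B] similarity matrix; diagonal = matching pairs

# CLIP: softmax cross-entropy -- each image must pick its own caption
# out of the batch (and vice versa), so scores compete with each other.
targets = torch.arange(B)
clip_loss = (F.cross_entropy(logits, targets)
             + F.cross_entropy(logits.t(), targets)) / 2

# SigLIP: an independent sigmoid on every pair -- diagonal pairs are
# positives (+1), all other pairs negatives (-1); no softmax over the batch.
signs = 2 * torch.eye(B) - 1
siglip_loss = F.softplus(-signs * logits).mean()

print(clip_loss.item(), siglip_loss.item())
```

Because SigLIP's loss treats each pair independently, it avoids the batch-wide normalization that makes CLIP's loss expensive at very large batch sizes, which is SigLIP's main practical selling point.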

Summary

In this chapter, you learned:

  1. Vision Transformers (ViT) process images by cutting them into small squares (patches) and treating them like words.
  2. CLIP / SigLIP / InternViT are specific types of ViTs trained to understand visual concepts.
  3. They serve as the "Eyes" for AI.
  4. In Megatron-LM, CLIPViTModel handles the conversion of pixels into feature vectors.

Now we have a Brain (GPT) and we have Eyes (CLIP).

What happens if we wire them together? What if we feed the output of the Eyes directly into the Brain? We get a model that can chat about images.

Next Chapter: LLaVA (Vision-Language)

