Chapter 8: LLaVA (Vision-Language)

In the previous chapter, CLIP / SigLIP / InternViT (Vision), we built the Eyes of our AI. We learned how to turn pixels into mathematical vectors.

Before that, in GPT (Decoder-only), we built the Brain (Language Model).

Now, it is time to perform surgery. We are going to wire the Eyes directly into the Brain. This creates a Vision-Language Model (VLM), and the most popular architecture for this in Megatron-LM is LLaVA (Large Language and Vision Assistant).

Motivation: The AI that can "See"

Imagine you are building a chatbot for a mechanic. A customer uploads a photo of an engine part and asks, "What is this part, and does it look worn?" A text-only LLM cannot answer: it has no way to perceive the image. To help, the model needs to see.

Use Case: Visual Question Answering (VQA) — answering natural-language questions about the content of an image.

What is LLaVA?

LLaVA is not a brand-new model built from scratch. It is a Connector.

It takes a pre-trained Vision Tower (like CLIP) and a pre-trained LLM (like LLaMA/GPT) and glues them together with a small component called a Projector.

  1. Vision Encoder: Converts image to features.
  2. Projector (The Adapter): Translates "Image Features" into "Word Embeddings."
  3. LLM: Reads the translated image tokens as if they were just words and generates the answer.
graph LR
    Img[Image: Cat] --> Vision[Vision Encoder]
    Vision --> Feat[Image Features]
    Feat --> Proj[Projector]
    Proj --> Tokens[Visual Tokens]
    Text[Question: 'What is this?'] --> LLM[LLM Backbone]
    Tokens --> LLM
    LLM --> Ans[Answer: 'A cat']
    style Vision fill:#bbf,stroke:#333
    style LLM fill:#f9f,stroke:#333
    style Proj fill:#ff9,stroke:#333

The Core Idea: To the LLM, an image is just a really long sentence in a foreign language. The Projector translates that foreign language into something the LLM understands.

Using Megatron-LM: pretrain_vlm.py

Megatron-LM provides a script specifically for training these multimodal models.

A Simple Training Command

To train LLaVA, you need to specify the configuration for both the vision part and the language part.

python pretrain_vlm.py \
    --num-layers 32 \
    --hidden-size 4096 \
    --vision-backbone-type clip \
    --vision-model-type clip \
    --img-h 336 \
    --img-w 336 \
    --patch-dim 14 \
    --data-path my_image_text_pairs

Key Arguments:

  1. --vision-backbone-type clip: We are using the CLIP architecture from Chapter 7 as the vision tower.
  2. --num-layers 32: The number of transformer layers in the language model (the GPT "brain").
  3. --img-h 336 / --img-w 336: The input image resolution in pixels; together with --patch-dim, this determines how many visual tokens each image produces.
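For intuition, here is how the image resolution and patch size determine the number of visual tokens the LLM will see, a quick sketch using the values from the command above:

```python
# Token-count arithmetic for the values in the command above.
img_h, img_w, patch_dim = 336, 336, 14

# The vision encoder cuts the image into non-overlapping 14x14 patches;
# each patch becomes one visual token.
num_visual_tokens = (img_h // patch_dim) * (img_w // patch_dim)

print(num_visual_tokens)  # 576 visual tokens per image
```

So a single 336x336 image occupies 576 positions in the LLM's sequence — already far longer than most text prompts.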

Understanding the Input and Output

  1. Input: An image (for example, a photo of a dog) plus a prompt containing an <image> placeholder, such as "<image> What breed is this dog?"
  2. Process: The <image> tag is replaced by hundreds of "visual tokens" representing the dog.
  3. Output: "This looks like a Golden Retriever."
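The replacement step can be sketched with plain Python lists. The token strings below are illustrative placeholders, not Megatron-LM's real vocabulary:

```python
# Illustrative only: replace an <image> placeholder with per-patch visual tokens.
text_tokens = ["What", "breed", "is", "this", "<image>", "?"]
visual_tokens = [f"<img_{i}>" for i in range(576)]  # one token per image patch

# Splice the visual tokens in at the placeholder's position.
pos = text_tokens.index("<image>")
combined = text_tokens[:pos] + visual_tokens + text_tokens[pos + 1:]

print(len(combined))  # 581: 5 text tokens + 576 visual tokens
```

In the real model this splicing happens on embeddings, not strings, but the bookkeeping is the same.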

Under the Hood: The Internal Flow

How does the code handle two different types of data (pixels and text) at once?

  1. The Vision Tower processes the image independently.
  2. The Projector (usually a simple 2-layer neural network) projects the vision output into the same dimensionality as the text embeddings.
  3. The system Concatenates (joins) the visual tokens and text tokens into one long sequence.
  4. The LLM processes the sequence.
sequenceDiagram
    participant U as User (Image + Text)
    participant V as Vision Tower
    participant P as Projector
    participant L as LLM Brain
    U->>V: Send Image Pixels
    V->>P: Send Raw Image Embeddings
    Note over P: Translates "Pixel-speak" to "Word-speak"
    P->>L: Send "Visual Tokens"
    U->>L: Send Text Tokens ("Describe this")
    Note over L: Combines: [Visual Tokens, Text Tokens]
    L->>L: Process as ONE sequence
    L->>U: Generate Answer
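The steps above can be traced purely in terms of tensor shapes. Here is a shape-only sketch in NumPy; the dimensions are illustrative assumptions (a 1024-dim vision tower, a 4096-dim LLM, 576 patches, a 32-token prompt), not values read from the code:

```python
import numpy as np

batch, n_patches, vision_dim = 2, 576, 1024
text_dim, n_text = 4096, 32

vision_out = np.zeros((batch, n_patches, vision_dim))   # 1. Vision Tower output
W_proj = np.zeros((vision_dim, text_dim))               # 2. Projector weights
visual_tokens = vision_out @ W_proj                     #    -> LLM-sized vectors
text_embeds = np.zeros((batch, n_text, text_dim))
sequence = np.concatenate([visual_tokens, text_embeds], axis=1)  # 3. Concatenate

print(sequence.shape)  # (2, 608, 4096): one long sequence for the LLM
```

From the LLM's point of view, step 4 is ordinary: it just attends over a 608-token sequence, unaware that the first 576 positions started life as pixels.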

Diving into the Code: llava_model.py

The logic resides in megatron/core/models/multimodal/llava_model.py. This class is a wrapper that holds the three components.

1. The Wrapper (LLaVAModel)

The initialization shows exactly how the model is built: Eyes, Brain, and the Adapter.

# megatron/core/models/multimodal/llava_model.py

class LLaVAModel(MegatronModule):
    def __init__(self, config, ...):
        super().__init__()
        
        # 1. The Eyes (from Chapter 7)
        self.vision_model = CLIPViTModel(config.vision_config, ...)

        # 2. The Brain (from Chapter 1)
        self.language_model = GPTModel(config.language_config, ...)

        # 3. The Bridge (Projector)
        # Usually a simple MLP (Multilayer Perceptron)
        self.vision_projection = MultimodalProjector(config, ...)

2. The Forward Pass

This is where the magic happens. We process the image, translate it, and splice it into the text sequence.

    def forward(self, images, input_ids, ...):
        # Step 1: See the image
        # resulting shape: [Batch, Num_Patches, Vision_Hidden_Size]
        vision_embeddings = self.vision_model(images)

        # Step 2: Translate to LLM language
        # resulting shape: [Batch, Num_Patches, Text_Hidden_Size]
        image_embeddings = self.vision_projection(vision_embeddings)

        # Step 3: Combine Image and Text
        # The helper function inserts image_embeddings into input_ids
        combined_embeddings = self._combine_embeddings(
            image_embeddings, input_ids
        )

        # Step 4: Ask the Brain
        return self.language_model(combined_embeddings)

3. The "Projector"

Why do we need the projector? Because the two models speak different "dialects" of vector: a CLIP-style vision tower outputs features of one size (e.g. 1024 dimensions), while the LLM expects embeddings of another (e.g. 4096 dimensions, matching the --hidden-size in our command). The projector, typically a small 2-layer MLP, maps vision features into the LLM's embedding space so they can sit in the sequence alongside ordinary word embeddings.
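Here is a minimal sketch of such a 2-layer MLP projector in plain NumPy. The 1024-to-4096 mapping and the GELU activation are assumptions chosen for illustration, not Megatron-LM's exact implementation:

```python
import numpy as np

# Illustrative dimensions: CLIP-style 1024-dim features -> 4096-dim LLM space.
rng = np.random.default_rng(0)
vision_dim, text_dim = 1024, 4096
W1, b1 = rng.normal(size=(vision_dim, text_dim)) * 0.01, np.zeros(text_dim)
W2, b2 = rng.normal(size=(text_dim, text_dim)) * 0.01, np.zeros(text_dim)

def gelu(x):
    # tanh approximation of GELU, a common choice of activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def project(features):
    # features: [num_patches, vision_dim] -> [num_patches, text_dim]
    return gelu(features @ W1 + b1) @ W2 + b2

out = project(rng.normal(size=(576, vision_dim)))
print(out.shape)  # (576, 4096): 576 visual tokens the LLM can read
```

Despite its simplicity, this tiny adapter is often the only component trained from scratch; the vision tower and LLM start from their pre-trained weights.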

Summary

In this chapter, you learned:

  1. LLaVA is a Vision-Language Model (VLM) that connects a Vision Encoder to an LLM.
  2. It allows an AI to answer questions about images.
  3. It uses a Projector to translate visual features into "visual tokens" that the LLM can understand.
  4. In Megatron-LM, LLaVAModel manages this pipeline, merging visual data into the text stream.

We have successfully combined Vision and Language. But what if we want to process Audio too? Or what if we want to output Images instead of just text?

For that, we need a model that can handle Many Inputs and Many Outputs.

Next Chapter: MIMO (Multimodal Input Multimodal Output)
