In the previous chapter, CLIP / SigLIP / InternViT (Vision), we built the Eyes of our AI. We learned how to turn pixels into mathematical vectors.
Before that, in GPT (Decoder-only), we built the Brain (Language Model).
Now, it is time to perform surgery. We are going to wire the Eyes directly into the Brain. This creates a Vision-Language Model (VLM), and the most popular architecture for this in Megatron-LM is LLaVA (Large Language-and-Vision Assistant).
Imagine you are building a chatbot for a mechanic.
Use Case: Visual Question Answering (VQA).
LLaVA is not a brand-new model trained from scratch. It is a Connector.
It takes a pre-trained Vision Tower (like CLIP) and a pre-trained LLM (like LLaMA/GPT) and glues them together with a small component called a Projector.
The Core Idea: To the LLM, an image is just a really long sentence in a foreign language. The Projector translates that foreign language into something the LLM understands.
pretrain_vlm.py
Megatron-LM provides a script specifically for training these multimodal models.
To train LLaVA, you need to specify the configuration for both the vision part and the language part.
python pretrain_vlm.py \
--num-layers 32 \
--hidden-size 4096 \
--vision-backbone-type clip \
--vision-model-type clip \
--img-h 336 \
--img-w 336 \
--patch-dim 14 \
--data-path my_image_text_pairs
Key Arguments:
--vision-backbone-type clip: We are using the architecture from Chapter 7.
--num-layers 32: This defines the size of the GPT brain.
--img-h 336: The resolution of the input images.
When the user sends a prompt like <image> What breed is this?, the <image> tag is replaced by hundreds of "visual tokens" representing the dog.
How does the code handle two different types of data (pixels and text) at once?
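The number of visual tokens that replace the <image> tag follows directly from the resolution and patch size flags. A quick back-of-the-envelope check (plain Python, not Megatron code):

```python
# How many "visual tokens" does one image produce?
# Each non-overlapping patch becomes one token (ViT-style patchify).
img_h, img_w, patch_dim = 336, 336, 14  # values from the command above

patches_per_side = img_h // patch_dim                          # 336 / 14 = 24
num_visual_tokens = (img_h // patch_dim) * (img_w // patch_dim)
print(num_visual_tokens)  # 576
```

So a single 336x336 image injects 576 embedding vectors into the text sequence, which is why the "foreign sentence" the LLM reads is so long.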
llava_model.py
The logic resides in megatron/core/models/multimodal/llava_model.py. This class is a wrapper that holds the three components.
The Constructor (LlavaModel)
The initialization shows exactly how the model is built: Eyes, Brain, and the Adapter.
# megatron/core/models/multimodal/llava_model.py
class LlavaModel(MegatronModule):
    def __init__(self, config, ...):
        super().__init__()

        # 1. The Eyes (from Chapter 7)
        self.vision_model = CLIPViTModel(config.vision_config, ...)

        # 2. The Brain (from Chapter 1)
        self.language_model = GPTModel(config.language_config, ...)

        # 3. The Bridge (Projector)
        # Usually a simple MLP (Multilayer Perceptron)
        self.vision_projection = MultimodalProjector(config, ...)
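The exact MultimodalProjector implementation varies, but conceptually it is just a small MLP mapping vision-tower features into the LLM's embedding space. Here is a minimal NumPy sketch of that idea; the weight shapes and hidden sizes are illustrative (e.g. a 1024-dim CLIP ViT-L feeding a 4096-dim LLM), not Megatron's actual API:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

vision_hidden, text_hidden = 1024, 4096  # assumed sizes for illustration

rng = np.random.default_rng(0)
w1 = rng.normal(0, 0.02, (vision_hidden, text_hidden))
w2 = rng.normal(0, 0.02, (text_hidden, text_hidden))

def project(vision_embeddings):
    # [batch, num_patches, vision_hidden] -> [batch, num_patches, text_hidden]
    return gelu(vision_embeddings @ w1) @ w2

vision_embeddings = rng.normal(size=(2, 576, vision_hidden))
image_embeddings = project(vision_embeddings)
print(image_embeddings.shape)  # (2, 576, 4096)
```

The only thing that matters architecturally is the output dimension: after projection, each patch embedding is the same width as a text token embedding, so the LLM cannot tell them apart.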
The forward pass is where the magic happens. We process the image, translate it, and stick it into the text sequence.
def forward(self, images, input_ids, ...):
    # Step 1: See the image
    # resulting shape: [Batch, Num_Patches, Vision_Hidden_Size]
    vision_embeddings = self.vision_model(images)

    # Step 2: Translate to LLM language
    # resulting shape: [Batch, Num_Patches, Text_Hidden_Size]
    image_embeddings = self.vision_projection(vision_embeddings)

    # Step 3: Combine Image and Text
    # The helper function inserts image_embeddings into input_ids
    combined_embeddings = self._combine_embeddings(
        image_embeddings, input_ids
    )

    # Step 4: Ask the Brain
    return self.language_model(combined_embeddings)
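Step 3 is conceptually simple: embed the text tokens, find the position of the <image> placeholder, and splice the projected image embeddings in at that spot. A toy NumPy sketch of that splicing step (the function name, placeholder id, and sizes are hypothetical, not the actual _combine_embeddings):

```python
import numpy as np

IMAGE_TOKEN_ID = -200  # placeholder id for the <image> tag (illustrative)

def combine_embeddings(image_embeddings, input_ids, embed_table):
    # image_embeddings: [num_patches, hidden] (already projected)
    # input_ids: 1-D token ids containing exactly one IMAGE_TOKEN_ID
    pos = int(np.where(input_ids == IMAGE_TOKEN_ID)[0][0])
    before = embed_table[input_ids[:pos]]      # text embeddings before <image>
    after = embed_table[input_ids[pos + 1:]]   # text embeddings after <image>
    return np.concatenate([before, image_embeddings, after], axis=0)

hidden = 8
embed_table = np.random.default_rng(1).normal(size=(100, hidden))
input_ids = np.array([5, 7, IMAGE_TOKEN_ID, 9])   # "<text> <text> <image> <text>"
image_embeddings = np.zeros((576, hidden))

combined = combine_embeddings(image_embeddings, input_ids, embed_table)
print(combined.shape)  # (579, 8): 3 text tokens + 576 visual tokens
```

From this point on, the language model just sees one long sequence of embeddings; it runs its usual attention over text and visual tokens alike.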
Why do we need the projector? Because the two halves speak different dimensions: the Vision Tower's hidden size generally does not match the LLM's hidden size, so its output vectors cannot be fed into the language model directly. The projector maps vision features into the LLM's embedding space.
In this chapter, you learned:
LlavaModel manages this pipeline, merging visual data into the text stream.
We have successfully combined Vision and Language. But what if we want to process Audio too? Or what if we want to output Images instead of just text?
For that, we need a model that can handle Many Inputs and Many Outputs.
Next Chapter: MIMO (Multimodal Input Multimodal Output)