In Chapter 5: Model Context Protocol (MCP) Server, we learned how to connect our AI agents to the outside world, allowing them to communicate with desktop applications.
However, up until now, our agents have had a major limitation: They are blind.
They can read text, but if you give them a financial report with a graph showing a market crash, or a user manual with a diagram of a machine, they miss the most important information.
This chapter introduces Multimodal Processing: giving our AI "eyes" to see, analyze, and understand images alongside text.
Think of standard LLMs (like the ones we used in Chapter 1) as Listening to the Radio.
Multimodal AI is like Watching TV.
Imagine you have a 50-page PDF of a company's earnings report. Page 12 has a complex bar chart comparing revenue across 5 years. A text-only model can read the paragraphs around that chart, but the chart itself is invisible to it.
To build an AI that can actually answer questions about that chart, we combine three specific technologies:
1. PyMuPDF, to render each PDF page as an image.
2. Gemini 2.5 Flash, to look at an image and answer questions about it (Visual Question Answering).
3. Cohere embeddings, to search across images and find the right page for a question.
Let's look at how we implement this in our project's vision_rag/utils.py. We are building a pipeline that takes a PDF and lets you chat with it.
First, we need to turn the PDF pages into images so the AI can look at them. We use a library called PyMuPDF (imported as fitz).
```python
import fitz  # PyMuPDF
from PIL import Image
import io

def pdf_to_images(pdf_bytes):
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    images = []
    # Loop through every page
    for page in doc:
        pix = page.get_pixmap()  # Render page as image
        img_data = pix.tobytes("png")
        images.append(Image.open(io.BytesIO(img_data)))
    return images
```
page.get_pixmap(): This renders the digital page into a bitmap, effectively taking a "screenshot" of it.

Now that we have an image, how do we ask questions about it? We use Gemini 2.5 Flash, a model designed to be multimodal.
```python
import requests
import base64

def gemini_vqa(api_key, image_bytes, question):
    # Encode image so it can be sent over the internet
    image_b64 = base64.b64encode(image_bytes).decode()

    # Structure the request: Image + Text
    data = {
        "contents": [{
            "parts": [
                {"inline_data": {"mime_type": "image/png", "data": image_b64}},
                {"text": question}  # The user's question
            ]
        }]
    }

    # Send the request to the Gemini API
    url = ("https://generativelanguage.googleapis.com/v1beta/"
           "models/gemini-2.5-flash:generateContent")
    response = requests.post(url, json=data,
                             headers={"x-goog-api-key": api_key})
    response.raise_for_status()

    # Pull the model's text answer out of the JSON response
    return response.json()["candidates"][0]["content"]["parts"][0]["text"]
```
How to use it:

```python
# Imagine we have an image of a chart
answer = gemini_vqa(key, chart_image, "Is the trend going up or down?")
print(answer)
# Output: "The trend is going up, peaking in Q4."
```
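The base64 step is worth seeing in isolation: it turns raw binary image bytes into plain text that can travel safely inside a JSON request body, and decoding on the other end restores the exact bytes. A self-contained sketch (the tiny generated image is just a stand-in for a real chart screenshot):

```python
import base64
import io

from PIL import Image

# A tiny in-memory image stands in for a rendered PDF page
img = Image.new("RGB", (4, 4), color="red")
buf = io.BytesIO()
img.save(buf, format="PNG")
image_bytes = buf.getvalue()

# Encode binary -> text so it can sit inside a JSON request body
image_b64 = base64.b64encode(image_bytes).decode()
print(type(image_b64))  # <class 'str'>

# Decoding restores the exact original bytes
print(base64.b64decode(image_b64) == image_bytes)  # True
```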
In Chapter 3: Retrieval-Augmented Generation (RAG), we learned how to find relevant text using embeddings.
Vision RAG works the same way, but with pictures.
Let's look at the code that makes the "Search" possible in vision_rag/utils.py.
We use Cohere (a model provider) to capture what is inside the image as a vector, without ever describing it in text.
```python
# From vision_rag/utils.py
import cohere
import numpy as np

def get_cohere_embedding(api_key, input_data, input_type='image'):
    co = cohere.Client(api_key)
    if input_type == 'image':
        # Cohere 'looks' at the image and returns numbers
        response = co.embed(images=[input_data], model="embed-v4.0",
                            input_type="image")
    else:
        # Text queries are embedded into the same vector space
        response = co.embed(texts=[input_data], model="embed-v4.0",
                            input_type="search_query")
    return np.array(response.embeddings[0])
```
The next function compares the "Question Vector" with the "Image Vectors" to find the best match.
```python
# From vision_rag/utils.py
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def find_most_similar(query_emb, emb_list):
    # Calculate similarity between query and ALL images
    similarities = cosine_similarity([query_emb], emb_list)[0]
    # Find the winner (highest score)
    best_idx = int(np.argmax(similarities))
    return best_idx
```
If the user asks for "cats", and Image A contains a dog while Image B contains a cat, the embedding for Image B will be mathematically closer to the word "cats".
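You can see this "mathematically closer" idea with nothing but NumPy and scikit-learn. The three-dimensional vectors below are made up for illustration (real embeddings have hundreds of dimensions), but the mechanics are identical: the query vector for "cats" scores a higher cosine similarity against the cat image's vector than against the dog image's.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up 3-dimensional embeddings, purely for illustration
query_cats = np.array([0.9, 0.1, 0.0])  # embedding of the word "cats"
image_a = np.array([0.1, 0.9, 0.2])     # embedding of a dog photo
image_b = np.array([0.8, 0.2, 0.1])     # embedding of a cat photo

# Compare the query against BOTH images at once
scores = cosine_similarity([query_cats], [image_a, image_b])[0]
best = int(np.argmax(scores))
print(best)  # 1 -> Image B (the cat) wins
```

Here the dog photo scores about 0.21 while the cat photo scores about 0.98, so argmax picks index 1. This is exactly what find_most_similar does across every page of the PDF.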
In this chapter, we learned:
1. How to convert PDF pages into images with PyMuPDF so the AI can "see" them.
2. How to ask Gemini 2.5 Flash questions about an image (Visual Question Answering).
3. How Vision RAG uses Cohere embeddings and cosine similarity to find the right image for a question.
We have now covered Agents, Tools, RAG, Orchestration, MCP, and Vision. Our AI is smart, connected, organized, and can see.
However, throughout these chapters, we have been talking about "Embeddings" and "Vectors" as if they were magic. In the final chapter, we will demystify exactly how these vectors are stored and managed at scale.
Next Step: Vector Embeddings & Storage