
Chapter 6: AI & Intelligence Integration

In the previous chapter, Web & Remote Content Handlers, we learned how to fetch and scrape text from the web.

But we still have a major blind spot.

What happens when the information isn't text at all? What if you have a scanned PDF (which is just a picture of text) or a PowerPoint slide containing a screenshot of a graph? Standard algorithmic parsers hit a wall here. They see "Image.png", but they have no idea what is inside it.

This is where AI & Intelligence Integration comes in.

The Problem: The "Black Box" of Pixels

Imagine you are converting a presentation about Q4 Sales.

  1. Slide 1: Bullet points (Text). The PptxConverter handles this easily.
  2. Slide 2: A JPEG screenshot of a spreadsheet.

To a standard code converter, Slide 2 is just a blob of bytes. It can't read the numbers in the screenshot. In the final Markdown, you would just get ![Image](image.jpg), and all that data is lost to your search index or LLM.

The Solution: The Consultant Expert

Think of this integration as hiring an Expert Consultant who sits next to your standard workers.

When the ImageConverter picks up a file and realizes, "I can't read pixels," it hands the image to the Consultant (an AI model).

MarkItDown supports two main types of "Consultants":

  1. LLMs (Large Language Models): Like GPT-4o, used for describing photos and complex visuals.
  2. Azure Document Intelligence: Used for reading text from scanned documents and images (OCR).

Tutorial: Describing Images with GPT-4o

Let's see how to set up MarkItDown so it automatically describes images using OpenAI's GPT-4o.

Step 1: The Setup

You need an OpenAI API client. We pass this client into the MarkItDown constructor.

from markitdown import MarkItDown
from openai import OpenAI

# 1. Setup the AI Client (The Consultant)
client = OpenAI() # Requires OPENAI_API_KEY environment variable

# 2. Hire the Orchestrator with the Consultant attached
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

Here, we tell MarkItDown: "If you get stuck on a visual, ask the client object using the model gpt-4o."
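The OpenAI client reads its key from the OPENAI_API_KEY environment variable, as noted in the comment above. A typical setup before running the script looks like this (the key value is a placeholder, not a real credential):

```shell
# Set the API key for the current shell session (placeholder value)
export OPENAI_API_KEY="sk-your-key-here"
```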

Step 2: Converting an Image

Now, we convert a standard image file.

# 3. Convert an image file
result = md.convert("funny_cat.jpg")

# 4. Print the AI-generated description
print(result.text_content)

The Output

Instead of empty text, you receive a detailed description:

# Description:
A close-up photograph of a tabby cat looking surprised, 
with wide eyes and ears perked up.

Tutorial: Scanned Documents with Azure

Sometimes you don't need a description; you need the actual text inside a scanned document. For this, we use the Document Intelligence integration.

This replaces standard parsers with a powerful AI-driven optical character recognition (OCR) engine.

from markitdown import MarkItDown, DocumentIntelligenceConverter

md = MarkItDown()

# Register the specific AI converter
# This acts as a "Super Specialist" for documents
md.register_converter(DocumentIntelligenceConverter(
    endpoint="<your-azure-endpoint>",
    credential="<your-azure-credential>"
))

# Now, even a scanned PDF will yield selectable text
result = md.convert("scanned_contract.pdf")

How It Works Under the Hood

Let's visualize the flow when you convert an image using an LLM.

sequenceDiagram
    participant User
    participant Orchestrator
    participant ImgConverter as Image Converter
    participant AI as OpenAI (GPT-4o)
    User->>Orchestrator: .convert("chart.png")
    Orchestrator->>ImgConverter: convert(stream)
    ImgConverter->>ImgConverter: Read file bytes
    Note over ImgConverter: Checks if LLM client exists
    ImgConverter->>AI: Send Image + Prompt ("Describe this")
    AI-->>ImgConverter: "It is a chart of sales data..."
    ImgConverter-->>Orchestrator: Return Markdown with Description
    Orchestrator-->>User: Final Result

Deep Dive: The ImageConverter

The logic lives in markitdown/converters/_image_converter.py.

When convert() is called, it checks if llm_client was provided during configuration. If yes, it calls a helper method _get_llm_description.

# Simplified from _image_converter.py

def convert(self, file_stream, stream_info, **kwargs):
    md_content = ""
    
    # Check if the user provided an AI client
    llm_client = kwargs.get("llm_client")
    llm_model = kwargs.get("llm_model")

    if llm_client and llm_model:
        # Ask the AI for help!
        description = self._get_llm_description(
            file_stream, stream_info, client=llm_client, model=llm_model
        )
        md_content += f"\n# Description:\n{description}\n"
        
    return DocumentConverterResult(markdown=md_content)
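The keyword-argument check above is easy to exercise without a real API. In this sketch, the convert_image function and FakeClient are stand-ins of our own (not the library's code) that reproduce the dispatch pattern: a description is requested only when both llm_client and llm_model are supplied.

```python
def convert_image(file_bytes, **kwargs):
    """Stand-in mimicking the dispatch logic in ImageConverter.convert."""
    md_content = ""
    llm_client = kwargs.get("llm_client")
    llm_model = kwargs.get("llm_model")
    if llm_client and llm_model:
        # In the real converter this is where _get_llm_description is called
        md_content += f"\n# Description:\n{llm_client.describe(file_bytes)}\n"
    return md_content

class FakeClient:
    """Pretend AI client so we can test the flow offline."""
    def describe(self, data):
        return "A placeholder caption."

# With a client: the description is appended
print(convert_image(b"...", llm_client=FakeClient(), llm_model="gpt-4o"))
# Without a client: the converter stays silent and returns empty Markdown
print(repr(convert_image(b"...")))
```

This mirrors why the tutorial passed llm_client and llm_model to the constructor: MarkItDown forwards them to each converter as keyword arguments.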

Preparing the Image for AI

LLMs can't read a file stream directly from your disk. The raw image bytes must be encoded as a Base64 string so they can be embedded in the JSON request sent over the API.

# Simplified from _image_converter.py / _llm_caption.py
import base64

def _get_llm_description(self, file_stream, ...):
    # 1. Convert image bytes to Base64 string
    base64_image = base64.b64encode(file_stream.read()).decode("utf-8")
    
    # 2. Construct the message for the API
    messages = [{
        "role": "user", 
        "content": [
            {"type": "text", "text": "Write a caption."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
        ]
    }]

    # 3. Send to GPT
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content
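Note that the simplified snippet hardcodes image/jpeg in the data URL. A more robust version would derive the MIME type from the filename; here is a small sketch using only the standard library (the to_data_url helper is our own, not part of MarkItDown):

```python
import base64
import mimetypes

def to_data_url(filename: str, data: bytes) -> str:
    """Build a data URL, guessing the MIME type from the file extension."""
    mime, _ = mimetypes.guess_type(filename)
    mime = mime or "application/octet-stream"  # fallback for unknown extensions
    b64 = base64.b64encode(data).decode("utf-8")
    return f"data:{mime};base64,{b64}"

# A PNG gets the right prefix; the payload here is just the PNG magic bytes
print(to_data_url("funny_cat.png", b"\x89PNG"))
```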

Deep Dive: DocumentIntelligenceConverter

The DocumentIntelligenceConverter (in _doc_intel_converter.py) works differently. It doesn't describe the image; it reconstructs the document structure.

It sends the file to Azure, which returns a result object containing lines, paragraphs, and tables. Interestingly, the Azure SDK often returns content already in Markdown format, which this converter cleans up.

# Simplified from _doc_intel_converter.py

def convert(self, file_stream, stream_info, **kwargs):
    # Send file to Azure
    poller = self.doc_intel_client.begin_analyze_document(
        model_id="prebuilt-layout",
        body=file_stream.read()
    )
    
    # Get the result
    result = poller.result()
    
    # Azure usually returns Markdown directly!
    return DocumentConverterResult(markdown=result.content)
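The "cleans up" step mentioned above mostly means stripping Azure's structural annotations: the prebuilt-layout model's Markdown output embeds markers such as <!-- PageBreak --> and <!-- PageHeader ... --> as HTML comments. A minimal sketch of removing them (this helper is our own illustration, not the converter's actual code):

```python
import re

def strip_azure_comments(markdown: str) -> str:
    """Remove HTML-comment markers (PageBreak, PageHeader, ...) from Azure output."""
    cleaned = re.sub(r"<!--.*?-->", "", markdown, flags=re.DOTALL)
    # Collapse the blank runs left behind where markers were removed
    return re.sub(r"\n{3,}", "\n\n", cleaned).strip()

sample = "# Contract\n\n<!-- PageBreak -->\n\nSection 1 ..."
print(strip_azure_comments(sample))
```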

Why This Matters

By integrating AI:

  1. No Data Left Behind: Visual charts in finance reports are converted to text descriptions.
  2. Accessibility: Images get automatic "Alt Text," making documents accessible to blind users.
  3. Legacy Support: Scanned paper archives become searchable digital text.

Summary

In this chapter, we learned:

  1. The Gap: Standard converters fail on pixels (images/scans).
  2. The Fix: We can inject an AI Client (LLM or Azure) into MarkItDown.
  3. The Flow: The converter prepares the image, sends it to the AI, and appends the AI's textual observation to the Markdown output.

We have now covered the core library, from basic text to complex AI integration. But how do we make this tool available to other AI agents?

In the final chapter, we will learn how to expose MarkItDown as a server so that other AI assistants (like Claude Desktop or IDEs) can use it as a tool.

Next Chapter: Model Context Protocol (MCP) Server


Generated by Code IQ