In the previous chapter, Web & Remote Content Handlers, we learned how to fetch and scrape text from the web.
But we still have a major blind spot.
What happens when the information isn't text at all? What if you have a scanned PDF (which is just a picture of text) or a PowerPoint slide containing a screenshot of a graph? Standard algorithmic parsers hit a wall here. They see "Image.png", but they have no idea what is inside it.
This is where AI & Intelligence Integration comes in.
Imagine you are converting a presentation about Q4 Sales. Slide 1 is ordinary bullet-point text, which PptxConverter handles easily. But suppose Slide 2 is a screenshot of a sales graph. To a standard code converter, that slide is just a blob of bytes; it can't read the numbers in the screenshot. In the final Markdown, you would get nothing more than a bare image placeholder, and all that data is lost to your search index or LLM.
Think of this integration as hiring an Expert Consultant who sits next to your standard workers.
When the ImageConverter picks up a file and realizes, "I can't read pixels," it hands the image to the Consultant (an AI model).
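The handoff can be pictured as a simple delegation pattern. The classes below are an illustrative sketch, not MarkItDown's actual implementation:

```python
class Consultant:
    """Stands in for an AI model that can actually "see" pixels."""
    def describe(self, image_bytes: bytes) -> str:
        return "A chart showing Q4 sales figures."

class SimpleImageConverter:
    """A worker that cannot read pixels itself, so it delegates."""
    def __init__(self, consultant=None):
        self.consultant = consultant

    def convert(self, image_bytes: bytes) -> str:
        if self.consultant is None:
            # No consultant hired: the image stays a blind spot
            return ""
        return self.consultant.describe(image_bytes)

print(SimpleImageConverter(Consultant()).convert(b"raw image bytes"))
```

Without a consultant, the converter returns nothing useful; with one, the pixels turn into text.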
MarkItDown supports two main types of "Consultants":
- LLM clients (such as OpenAI's GPT-4o), which look at an image and write a description of it.
- Azure Document Intelligence, an AI-driven OCR service that extracts the actual text and structure from scanned documents.
Let's see how to set up MarkItDown so it automatically describes images using OpenAI's GPT-4o.
You need an OpenAI API client. We pass this client into the MarkItDown constructor.
```python
from markitdown import MarkItDown
from openai import OpenAI

# 1. Setup the AI Client (The Consultant)
client = OpenAI()  # Requires OPENAI_API_KEY environment variable

# 2. Hire the Orchestrator with the Consultant attached
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
```
Here, we tell MarkItDown: "If you get stuck on a visual, ask the client, using the model gpt-4o."
Now, we convert a standard image file.
```python
# 3. Convert an image file
result = md.convert("funny_cat.jpg")

# 4. Print the AI-generated description
print(result.text_content)
```
Instead of empty text, you receive a detailed description:
```
# Description:
A close-up photograph of a tabby cat looking surprised,
with wide eyes and ears perked up.
```
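Because real API calls cost money and need a key, it can help to verify your pipeline wiring with a test double. The stub below mimics the one method surface the captioning path uses, chat.completions.create; it is an illustrative sketch, not part of MarkItDown or the OpenAI SDK:

```python
from types import SimpleNamespace

class StubLLMClient:
    """Mimics the OpenAI client surface used for captioning:
    client.chat.completions.create(model=..., messages=...)."""
    def __init__(self, canned_caption: str):
        self._caption = canned_caption
        # Reproduce the nested client.chat.completions.create shape
        self.chat = SimpleNamespace(
            completions=SimpleNamespace(create=self._create)
        )

    def _create(self, model, messages, **kwargs):
        # Return an object shaped like an OpenAI ChatCompletion response
        message = SimpleNamespace(content=self._caption)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])

# Usage sketch: md = MarkItDown(llm_client=StubLLMClient("A tabby cat."), llm_model="gpt-4o")
client = StubLLMClient("A tabby cat looking surprised.")
resp = client.chat.completions.create(model="gpt-4o", messages=[])
print(resp.choices[0].message.content)  # → A tabby cat looking surprised.
```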
Sometimes you don't need a description; you need the actual text inside a scanned document. For this, we use the Document Intelligence integration.
This replaces standard parsers with a powerful AI-driven optical character recognition (OCR) engine.
```python
from markitdown import MarkItDown, DocumentIntelligenceConverter

md = MarkItDown()

# Register the specific AI converter
# This acts as a "Super Specialist" for documents
md.register_converter(DocumentIntelligenceConverter(
    endpoint="<your-azure-endpoint>",
    credential="<your-azure-credential>"
))

# Now, even a scanned PDF will yield selectable text
result = md.convert("scanned_contract.pdf")
```
Let's trace what happens internally when you convert an image using an LLM.
ImageConverter
The logic lives in markitdown/converters/_image_converter.py.
When convert() is called, it checks if llm_client was provided during configuration. If yes, it calls a helper method _get_llm_description.
```python
# Simplified from _image_converter.py
def convert(self, file_stream, stream_info, **kwargs):
    md_content = ""

    # Check if the user provided an AI client
    llm_client = kwargs.get("llm_client")
    llm_model = kwargs.get("llm_model")

    if llm_client and llm_model:
        # Ask the AI for help!
        description = self._get_llm_description(
            file_stream, stream_info, client=llm_client, model=llm_model
        )
        md_content += f"\n# Description:\n{description}\n"

    return DocumentConverterResult(markdown=md_content)
```
LLMs can't read a file stream directly from your disk. The image must be encoded into text (Base64) to be sent over the API.
```python
# Simplified from _image_converter.py / _llm_caption.py
def _get_llm_description(self, file_stream, ...):
    # 1. Convert image bytes to Base64 string
    base64_image = base64.b64encode(file_stream.read()).decode("utf-8")

    # 2. Construct the message for the API
    messages = [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Write a caption."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
        ]
    }]

    # 3. Send to GPT
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content
```
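To see the encoding step in isolation, here is a self-contained sketch (the helper name `to_data_url` is ours, not MarkItDown's) that turns raw bytes into the `data:` URL format vision-capable chat APIs accept:

```python
import base64

def to_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    # Base64-encode the raw bytes and wrap them in a data: URL,
    # so the image can travel inline as plain text over the API
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime};base64,{b64}"

print(to_data_url(b"abc"))  # → data:image/jpeg;base64,YWJj
```

Note that Base64 inflates the payload by roughly a third, which is why very large images are often resized before being sent.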
DocumentIntelligenceConverter
The DocumentIntelligenceConverter (in _doc_intel_converter.py) works differently. It doesn't describe the image; it reconstructs the document structure.
It sends the file to Azure, which returns a result object containing lines, paragraphs, and tables. Interestingly, the Azure SDK often returns content already in Markdown format, which this converter cleans up.
```python
# Simplified from _doc_intel_converter.py
def convert(self, file_stream, stream_info, **kwargs):
    # Send file to Azure
    poller = self.doc_intel_client.begin_analyze_document(
        model_id="prebuilt-layout",
        body=file_stream.read()
    )

    # Get the result
    result = poller.result()

    # Azure usually returns Markdown directly!
    return DocumentConverterResult(markdown=result.content)
```
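The begin_analyze_document call returns a poller because document analysis is a long-running operation on Azure's side; calling .result() blocks until the analysis finishes. A minimal stand-in for that shape (an illustrative sketch, not the Azure SDK) looks like:

```python
from types import SimpleNamespace

class FakeAnalyzePoller:
    """Mimics the poller returned by begin_analyze_document:
    you call .result() to wait for and fetch the analysis."""
    def __init__(self, markdown_content: str):
        self._content = markdown_content

    def result(self):
        # A real poller blocks here until Azure finishes; ours is instant
        return SimpleNamespace(content=self._content)

poller = FakeAnalyzePoller("# Contract\n\nParty A agrees to the terms below.")
result = poller.result()
print(result.content.splitlines()[0])  # → # Contract
```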
By integrating AI, images stop being dead weight: screenshots get searchable descriptions, and scanned documents yield real, selectable text.

In this chapter, we learned:
- How to attach an LLM client (like GPT-4o) to MarkItDown so that images are automatically described.
- How to register DocumentIntelligenceConverter for AI-powered OCR on scanned documents.
- How ImageConverter Base64-encodes an image and asks the LLM for a caption, and how DocumentIntelligenceConverter sends the file to Azure and receives Markdown back.
We have now covered the core library, from basic text to complex AI integration. But how do we make this tool available to other AI agents?
In the final chapter, we will learn how to expose MarkItDown as a server so that other AI assistants (like Claude Desktop or IDEs) can use it as a tool.
Next Chapter: Model Context Protocol (MCP) Server
Generated by Code IQ