In the previous chapter, The DocumentConverter Interface, we learned how to build "Specialists" (Converters) that can handle specific jobs.
But if you dump a pile of files on the desk—some labeled correctly, some missing extensions, and some corrupted—how does the system know which Specialist to call? You wouldn't send a PDF to an Image Converter just because someone renamed it image.png.
This is where Stream Identification & Routing comes in.
Imagine you are processing email attachments. You receive a file named data.tmp.
If MarkItDown relied solely on file extensions, it would fail immediately. It needs to look inside the file to understand what it actually is.
Think of the internal logic of MarkItDown as a Triage Nurse in a hospital emergency room.
To make these decisions, MarkItDown creates a standardized "ID Card" for every file called StreamInfo.
Before any conversion happens, the system gathers as much evidence as possible and fills out this card.
# Simplified view of the StreamInfo object
from dataclasses import dataclass
from typing import Optional

@dataclass
class StreamInfo:
    mimetype: Optional[str] = None   # e.g., "application/pdf"
    extension: Optional[str] = None  # e.g., ".pdf"
    charset: Optional[str] = None    # e.g., "utf-8"
    filename: Optional[str] = None   # e.g., "report.pdf"
The goal of the Identification phase is to fill this card with accurate data so the Converters can make an informed decision.
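As a rough illustration of how the first pieces of evidence might land on the card, the sketch below fills a StreamInfo from nothing but a file name, using Python's standard mimetypes module. The helper name base_guess_from_path is hypothetical, and this is not MarkItDown's actual implementation; the dataclass is repeated so the snippet runs on its own.

```python
import mimetypes
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamInfo:
    mimetype: Optional[str] = None
    extension: Optional[str] = None
    charset: Optional[str] = None
    filename: Optional[str] = None


def base_guess_from_path(path: str) -> StreamInfo:
    """Hypothetical helper: fill the card using only the file name."""
    filename = os.path.basename(path)
    _, ext = os.path.splitext(filename)
    # guess_type returns (mimetype, encoding); either may be None
    mimetype, _ = mimetypes.guess_type(filename)
    return StreamInfo(
        mimetype=mimetype,
        extension=ext.lower() or None,
        filename=filename,
    )


card = base_guess_from_path("inbox/report.pdf")
print(card)

# A name like data.tmp yields an extension but little else to go on
weak = base_guess_from_path("inbox/data.tmp")
print(weak)
```

A card built this way is only as trustworthy as the file name, which is why the next step looks at the bytes themselves.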
How does MarkItDown know a file is a PDF if it doesn't have a .pdf extension?
It uses a tool called Magika (developed by Google). Magika uses a deep learning model to classify a file's content type from its raw bytes, so the result does not depend on the file's name.
Here is what happens inside MarkItDown._get_stream_info_guesses:
If the file's extension says .txt but Magika says "Excel," MarkItDown creates two ID cards (guesses). It usually prioritizes the "Deep Inspection" card (Guess 1), so even misnamed files get converted correctly.
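The two-card idea can be sketched without Magika at all. The example below swaps in a tiny magic-number lookup as a stand-in for content detection; Magika itself uses a trained model, not a table like this. The names sniff, MAGIC_NUMBERS, and guess_stream_info are hypothetical, and the ordering of the two cards simply mirrors the description above rather than being taken from the library's source.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class StreamInfo:
    mimetype: Optional[str] = None
    extension: Optional[str] = None


# Stand-in for Magika: a few well-known file signatures (assumption:
# enough for illustration, far less capable than a real classifier).
MAGIC_NUMBERS = [
    (b"%PDF-", StreamInfo("application/pdf", ".pdf")),
    (b"PK\x03\x04", StreamInfo("application/zip", ".zip")),
    (b"\x89PNG\r\n\x1a\n", StreamInfo("image/png", ".png")),
]


def sniff(head: bytes) -> Optional[StreamInfo]:
    """Return a content-based card, or None if no signature matches."""
    for magic, info in MAGIC_NUMBERS:
        if head.startswith(magic):
            return info
    return None


def guess_stream_info(head: bytes, base_guess: StreamInfo) -> List[StreamInfo]:
    detected = sniff(head)
    if detected is None:
        # Nothing recognized in the bytes: fall back to the label
        return [base_guess]
    if base_guess.extension in (None, detected.extension):
        # Label is missing or agrees with the content: one card suffices
        return [detected]
    # Label and content disagree: keep both, content-based card first
    return [detected, base_guess]


# A PDF that someone renamed to image.png
guesses = guess_stream_info(b"%PDF-1.7\n...", StreamInfo("image/png", ".png"))
for g in guesses:
    print(g)
```

Keeping the label-based card as a fallback means a wrong detection does not end the process; the routing step still gets a second option to try.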
Once the system has its "Guesses" (the ID Cards), it walks down the hallway of Specialists (Converters) and asks: "Can you help?"
This is the Routing phase.
Let's look at the actual code that powers this logic.
The method _get_stream_info_guesses (in _markitdown.py) is the Triage Nurse. It combines the file extension (if known) with Magika's prediction.
def _get_stream_info_guesses(self, file_stream, base_guess):
    guesses = []

    # Ask Magika what the stream contains
    result = self._magika.identify_stream(file_stream)

    # If identification succeeded, create a StreamInfo based on its finding
    # (combining it with base_guess is omitted in this simplified view)
    if result.status == "ok":
        guesses.append(StreamInfo(
            mimetype=result.prediction.output.mime_type,
            # ... other details ...
        ))

    return guesses
The method _convert handles the routing. Once we have the guesses, we iterate through the converters.
# Simplified from _markitdown.py
def _convert(self, file_stream, stream_info_guesses, **kwargs):
    # Try every guess (e.g., first try treating it as Excel, then as Text)
    for stream_info in stream_info_guesses:
        # Ask every specialist
        for registration in self._converters:
            converter = registration.converter

            # The Routing Check
            if converter.accepts(file_stream, stream_info, **kwargs):
                return converter.convert(file_stream, stream_info, **kwargs)

    raise UnsupportedFormatException("No converter accepted this file.")
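To see the routing loop run end to end, here is a self-contained sketch with toy converters. Registration, route, and the two lambda checks are hypothetical stand-ins, and LookupError replaces UnsupportedFormatException so the snippet needs no imports from MarkItDown. It also assumes that registrations are sorted so lower priority values are asked first, which is what lets a specific converter answer before a generic one.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StreamInfo:
    mimetype: Optional[str] = None
    extension: Optional[str] = None


@dataclass
class Registration:
    name: str
    priority: float  # assumption in this sketch: lower value = asked earlier
    accepts: Callable[[StreamInfo], bool]


def route(guesses: List[StreamInfo], registrations: List[Registration]) -> str:
    """Return the name of the first converter that accepts any guess."""
    ordered = sorted(registrations, key=lambda r: r.priority)
    for info in guesses:          # try each ID card in turn
        for reg in ordered:       # ask each specialist in priority order
            if reg.accepts(info):
                return reg.name
    raise LookupError("No converter accepted this file.")


registrations = [
    # Generic: accepts anything that looks like text
    Registration("PlainTextConverter", 10.0,
                 lambda i: (i.mimetype or "").startswith("text/")),
    # Specific: accepts only CSV
    Registration("CsvConverter", 0.0,
                 lambda i: i.mimetype == "text/csv"),
]

chosen = route([StreamInfo("text/csv", ".csv")], registrations)
print(chosen)  # CsvConverter
```

Both toy converters would accept a CSV card, so the ordering alone decides which one handles the file.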
As a user, you don't usually call these methods directly, but understanding them helps you debug:
You can rename a .zip file to .txt, and MarkItDown will likely still unzip it because Magika detects the zip structure.
Priority matters when you add your own converter with register_converter. Identifying the stream is only half the battle; having a converter with the right Priority ensures the specific tool (like CsvConverter) grabs the file before the generic tool (PlainTextConverter).
In this chapter, we explored the brain of the operation: stream identification and routing.
Now that we know how the system identifies a file and routes it, let's look at the workers themselves. What logic lives inside the PDF Converter or the Excel Converter?
Next Chapter: Format-Specific Converters