Chapter 3 · CORE

Stream Identification & Routing

Chapter 3: Stream Identification & Routing

In the previous chapter, The DocumentConverter Interface, we learned how to build "Specialists" (Converters) that can handle specific jobs.

But if you dump a pile of files on the desk (some labeled correctly, some missing extensions, some corrupted), how does the system know which Specialist to call? You wouldn't send a PDF to an Image Converter just because someone renamed it image.png.

This is where Stream Identification & Routing comes in.

The Problem: The Mystery File

Imagine you are processing email attachments. You receive a file named data.tmp.

  1. Is it a spreadsheet?
  2. Is it a Word doc?
  3. Is it just plain text?

If MarkItDown relied solely on file extensions, it would have nothing to go on here: .tmp says nothing about the format. It needs to look inside the file to understand what it actually is.
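To make the limitation concrete, here is a minimal sketch of extension-only routing. The table EXTENSION_TO_CONVERTER, the function route_by_extension, and the converter names in the table are hypothetical stand-ins for illustration, not MarkItDown's actual registry.

```python
import os
from typing import Optional

# Hypothetical lookup table: file extension -> name of the converter to call.
EXTENSION_TO_CONVERTER = {
    ".xlsx": "XlsxConverter",
    ".docx": "DocxConverter",
    ".txt": "PlainTextConverter",
}


def route_by_extension(filename: str) -> Optional[str]:
    """Pick a converter using nothing but the file name."""
    _, ext = os.path.splitext(filename)
    return EXTENSION_TO_CONVERTER.get(ext.lower())


print(route_by_extension("budget.xlsx"))  # XlsxConverter
print(route_by_extension("data.tmp"))     # None: the name gives no usable hint
```

A router like this has no answer for data.tmp, and it would confidently pick the wrong converter for a spreadsheet renamed to notes.txt.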

The Solution: The Triage Nurse

Think of the internal logic of MarkItDown as a Triage Nurse in a hospital emergency room.

  1. Patient Arrival: A stream of bytes arrives (the file).
  2. Initial Look: The nurse looks at the name (File Extension).
  3. Vitals Check: The nurse takes a blood sample (Deep Content Inspection).
  4. Routing: The nurse assigns the patient to the Heart Specialist, Orthopedist, or General Practitioner based on the data, not just what the patient says is wrong.

The "ID Card": StreamInfo

To make these decisions, MarkItDown creates a standardized "ID Card" for every file called StreamInfo.

Before any conversion happens, the system gathers as much evidence as possible and fills out this card.

# Simplified view of the StreamInfo object
from dataclasses import dataclass
from typing import Optional

@dataclass
class StreamInfo:
    mimetype: Optional[str] = None   # e.g., "application/pdf"
    extension: Optional[str] = None  # e.g., ".pdf"
    charset: Optional[str] = None    # e.g., "utf-8"
    filename: Optional[str] = None   # e.g., "report.pdf"

The goal of the Identification phase is to fill this card with accurate data so the Converters can make an informed decision.
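As a rough illustration, the first entries on the card can come from the file name alone, before any bytes are inspected. The sketch below uses a local stand-in for StreamInfo with the same four fields, and base_guess_from_path is a hypothetical helper rather than a MarkItDown function.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamInfo:
    """Local stand-in mirroring the simplified StreamInfo fields."""
    mimetype: Optional[str] = None
    extension: Optional[str] = None
    charset: Optional[str] = None
    filename: Optional[str] = None


def base_guess_from_path(path: str) -> StreamInfo:
    """Hypothetical helper: record only what the file name tells us."""
    filename = os.path.basename(path)
    _, ext = os.path.splitext(filename)
    # Normalize the extension; leave it as None when the name has none.
    return StreamInfo(filename=filename, extension=ext.lower() or None)


card = base_guess_from_path("/tmp/inbox/report.PDF")
print(card.filename, card.extension, card.mimetype)
```

The mimetype and charset fields stay empty at this point. Filling them in reliably is the job of the content inspection described next.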

Step 1: Deep Inspection with Magika

How does MarkItDown know a file is a PDF if it doesn't have a .pdf extension?

It uses a tool called Magika (developed by Google). Magika runs a deep-learning model over the raw bytes of a file to predict its content type, so the result does not depend on the file name.

The Identification Process

Here is what happens inside MarkItDown._get_stream_info_guesses:

  1. Magika reads the bytes: It scans the file content.
  2. Magika predicts: "This looks 99% like an Excel file."
  3. Conflict Resolution: If the filename says .txt but Magika says "Excel," MarkItDown keeps both as separate ID cards (guesses) instead of discarding one.

The content-based guess from Deep Inspection usually takes priority over the name-based one, so even misnamed files get converted correctly.
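The idea of producing more than one guess can be sketched without Magika. The example below swaps in a much cruder technique, matching a few well-known file signatures ("magic numbers"), purely as a stand-in for Magika's model. SIGNATURES, EXTENSION_MIMETYPES, sniff_mimetype, and guess_mimetypes are all hypothetical names.

```python
import io
from typing import BinaryIO, List, Optional, Tuple

# A few well-known file signatures. Magika uses a trained model instead;
# this table is only a stand-in for illustration.
SIGNATURES: List[Tuple[bytes, str]] = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"PK\x03\x04", "application/zip"),
]

# Hypothetical extension table, used to detect a conflict with the file name.
EXTENSION_MIMETYPES = {
    ".pdf": "application/pdf",
    ".png": "image/png",
    ".txt": "text/plain",
}


def sniff_mimetype(stream: BinaryIO) -> Optional[str]:
    """Guess a MIME type from the first bytes, leaving the position untouched."""
    pos = stream.tell()
    head = stream.read(16)
    stream.seek(pos)
    for magic, mimetype in SIGNATURES:
        if head.startswith(magic):
            return mimetype
    return None


def guess_mimetypes(stream: BinaryIO, extension: str) -> List[str]:
    """Return candidate MIME types, content-based guess first."""
    guesses: List[str] = []
    from_content = sniff_mimetype(stream)
    from_name = EXTENSION_MIMETYPES.get(extension.lower())
    if from_content:
        guesses.append(from_content)
    # Keep the name-based guess only when it adds a different candidate.
    if from_name and from_name != from_content:
        guesses.append(from_name)
    return guesses


renamed_pdf = io.BytesIO(b"%PDF-1.7\n...")
print(guess_mimetypes(renamed_pdf, ".png"))  # ['application/pdf', 'image/png']
```

Signature matching breaks down quickly: an .xlsx file is a ZIP container, so this sketch would only ever report application/zip for it. Telling such formats apart is where a content-aware tool like Magika earns its place.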

Step 2: The Routing Loop

Once the system has its "Guesses" (the ID Cards), it walks down the hallway of Specialists (Converters) and asks: "Can you help?"

This is the Routing phase.

sequenceDiagram
    participant Orchestrator
    participant Magika
    participant XlsxConverter
    participant TextConverter
    Note over Orchestrator: Input: "data.tmp"
    Orchestrator->>Magika: Identify this stream!
    Magika-->>Orchestrator: It's an Excel file (MIME: application/vnd.openxml...)
    Note over Orchestrator: Creates StreamInfo(mimetype="application/vnd.openxml...")
    Orchestrator->>XlsxConverter: accepts(stream, info)?
    XlsxConverter-->>Orchestrator: Yes!
    Orchestrator->>XlsxConverter: convert()
    XlsxConverter-->>Orchestrator: Returns Result

Internal Implementation

Let's look at the actual code that powers this logic.

1. Generating Guesses (_markitdown.py)

The method _get_stream_info_guesses is the Triage Nurse. It combines the file extension (if known) with Magika's prediction.

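# Simplified from _markitdown.py
def _get_stream_info_guesses(self, file_stream, base_guess):
    guesses = []

    # Remember the position so identification does not consume the stream
    cur_pos = file_stream.tell()
    try:
        # Ask Magika what the stream contains
        result = self._magika.identify_stream(file_stream)

        # If identification succeeded, create a StreamInfo based on its finding
        if result.status == "ok":
            guesses.append(StreamInfo(
                mimetype=result.prediction.output.mime_type,
                # ... other details ...
            ))
    finally:
        file_stream.seek(cur_pos)

    # Keep the name-based guess (extension, filename) as a fallback (simplified)
    guesses.append(base_guess)
    return guesses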
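Inspecting a stream means reading from it, so whoever inspects it should put the read position back before a converter runs. A minimal sketch of that pattern follows; peek is a hypothetical helper name, not part of MarkItDown.

```python
import io
from typing import BinaryIO


def peek(stream: BinaryIO, size: int = 8) -> bytes:
    """Read up to `size` bytes without moving the stream's position."""
    pos = stream.tell()
    try:
        return stream.read(size)
    finally:
        # Restore the position even if read() raises.
        stream.seek(pos)


stream = io.BytesIO(b"PK\x03\x04rest-of-archive")
print(peek(stream, 4))  # b'PK\x03\x04'
print(stream.tell())    # 0
```

This only works for seekable streams. A non-seekable source such as a network response would first need to be buffered into something like io.BytesIO.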

2. The Great Loop (_convert)

Once we have the guesses, we iterate through the converters. This logic handles the routing.

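# Simplified from _markitdown.py
def _convert(self, file_stream, stream_info_guesses, **kwargs):
    # Lower priority values are tried first, so specific converters come before generic ones
    registrations = sorted(self._converters, key=lambda r: r.priority)
    start = file_stream.tell()

    # Try every guess (e.g., first try treating it as Excel, then as Text)
    for stream_info in stream_info_guesses:

        # Ask every specialist
        for registration in registrations:
            converter = registration.converter

            # The Routing Check (rewind first: accepts() may read from the stream)
            file_stream.seek(start)
            if converter.accepts(file_stream, stream_info, **kwargs):
                file_stream.seek(start)
                return converter.convert(file_stream, stream_info, **kwargs)

    raise UnsupportedFormatException("No converter accepted this file.")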
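The same two-level loop can be run end to end with toy converters. Everything in the sketch below (the StreamInfo stand-in, PdfStub, TextStub, route, and UnsupportedFormat) is hypothetical and exists only to show how a second guess rescues a stream when no converter accepts the first one.

```python
import io
from dataclasses import dataclass
from typing import BinaryIO, List, Optional


@dataclass
class StreamInfo:
    """Local stand-in with just the fields this sketch needs."""
    mimetype: Optional[str] = None
    extension: Optional[str] = None


class UnsupportedFormat(Exception):
    """Stand-in for the library's UnsupportedFormatException."""


class PdfStub:
    """Toy converter: accepts PDFs only."""

    def accepts(self, stream: BinaryIO, info: StreamInfo) -> bool:
        return info.mimetype == "application/pdf"

    def convert(self, stream: BinaryIO, info: StreamInfo) -> str:
        return "pdf: %d bytes" % len(stream.read())


class TextStub:
    """Toy converter: accepts plain text by MIME type or extension."""

    def accepts(self, stream: BinaryIO, info: StreamInfo) -> bool:
        return info.mimetype == "text/plain" or info.extension == ".txt"

    def convert(self, stream: BinaryIO, info: StreamInfo) -> str:
        return stream.read().decode("utf-8")


def route(stream: BinaryIO, guesses: List[StreamInfo], converters: list) -> str:
    # Outer loop: guesses in order. Inner loop: converters in order.
    for info in guesses:
        for converter in converters:
            if converter.accepts(stream, info):
                return converter.convert(stream, info)
    raise UnsupportedFormat("No converter accepted this stream.")


guesses = [StreamInfo(mimetype="image/png"), StreamInfo(extension=".txt")]
print(route(io.BytesIO(b"hello"), guesses, [PdfStub(), TextStub()]))  # hello
```

Neither stub accepts the first guess (image/png), so the loop moves on to the second guess, which TextStub accepts. With no matching guess at all, route raises instead of returning an empty result.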

Why This Matters for You

As a user, you don't usually call these methods directly, but understanding them helps you debug:

  1. Misleading Extensions: You can rename a .zip file to .txt, and MarkItDown will likely still unzip it because Magika detects the zip structure.
  2. Ambiguous Files: If a file could be two things (like a generic XML file), the system tries the most specific converter first (Priority), then falls back to generic text.
  3. Priority Control: In Chapter 1, we saw register_converter. Identifying the stream is only half the battle; having a converter with the right Priority ensures the specific tool (like CsvConverter) grabs the file before the generic tool (PlainTextConverter).
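The effect of priority in the last point can be shown with a small sketch. It assumes that lower priority numbers are tried first. The Registration class, first_match, and the accept rules are hypothetical, and only the converter names are borrowed from the text above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Registration:
    """Hypothetical pairing of a converter name, its match rule, and a priority."""
    name: str
    accepts: Callable[[str], bool]
    priority: float


def first_match(registrations: List[Registration], mimetype: str) -> str:
    # Assumption: lower priority values are tried first.
    # sorted() is stable, so equal priorities keep registration order.
    for reg in sorted(registrations, key=lambda r: r.priority):
        if reg.accepts(mimetype):
            return reg.name
    raise LookupError(mimetype)


registry = [
    # Registered first, but generic: matches any text/* type.
    Registration("PlainTextConverter", lambda m: m.startswith("text/"), priority=10.0),
    # Registered second, but specific: wins for text/csv because of its priority.
    Registration("CsvConverter", lambda m: m == "text/csv", priority=0.0),
]

print(first_match(registry, "text/csv"))    # CsvConverter
print(first_match(registry, "text/plain"))  # PlainTextConverter
```

If both entries had the same priority, the generic converter would claim text/csv simply because it was registered first. That is the failure mode priorities are meant to prevent.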

Summary

In this chapter, we explored the brain of the operation:

  1. StreamInfo: The "ID Card" that describes a file's identity.
  2. Magika: The "Forensic Analyst" that inspects file bytes to detect the true format.
  3. Routing: The process of iterating through guesses and converters until a match is found.

Now that we know how the system identifies a file and routes it, let's look at the workers themselves. What logic lives inside the PDF Converter or the Excel Converter?

Next Chapter: Format-Specific Converters

