Chapter 2: The DocumentConverter Interface

In the previous chapter, The MarkItDown Orchestrator, we introduced the MarkItDown class as a "General Contractor" that manages file conversions.

But a contractor is nothing without skilled workers. In this ecosystem, the workers are called Converters.

The Motivation: A Universal Plug

Imagine if every time you bought a new toaster or lamp, you had to rewire your house to fit it. That would be a nightmare. Instead, we have standard electrical sockets. If a device has the right plug, it works.

The DocumentConverter Interface is that standard socket for MarkItDown.

It defines a strict "Job Description" that allows any tool—whether it reads PDFs, YouTube captions, or Excel sheets—to plug into the main system. As long as the tool follows the rules of the interface, the Orchestrator can use it.

The Blueprint

The interface is defined by a class called DocumentConverter. To create a new file handler, you subclass it and implement two methods:

accepts(): A method that answers "Can I handle this file?"
convert(): A method that actually performs the work and returns Markdown.
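As a rough mental model, the two rules can be pictured as a base class with two methods waiting to be overridden. The sketch below is a stand-in written for this chapter, not the library's source: `ConverterSketch` and its error messages are hypothetical, and the real DocumentConverter may differ in detail.

```python
from typing import Any, BinaryIO


class ConverterSketch:
    """Hypothetical stand-in that mirrors the shape of the interface."""

    def accepts(self, file_stream: BinaryIO, stream_info: Any, **kwargs: Any) -> bool:
        # Rule 1: subclasses answer "Can I handle this file?"
        raise NotImplementedError("Subclasses must implement accepts()")

    def convert(self, file_stream: BinaryIO, stream_info: Any, **kwargs: Any) -> Any:
        # Rule 2: subclasses do the work and return a result object
        raise NotImplementedError("Subclasses must implement convert()")
```

A subclass that forgets one of the two methods fails loudly the first time the missing method is called, which is the point of having a contract.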

Let's break these down.

Rule 1: The `accepts()` Method

Before the Orchestrator hands over a file, it asks the converter: "Is this for you?"

The converter looks at the file's "ID Card" (called StreamInfo), which contains the file extension and mimetype.

def accepts(self, file_stream, stream_info, **kwargs):
    # Check if the file extension is ".txt"
    if stream_info.extension == ".txt":
        return True
    
    # If not, say "No, I can't do this job"
    return False
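Because the "ID Card" carries both an extension and a mimetype, a slightly more forgiving check can consult either one. The following is a minimal sketch under a few assumptions: `accepts_plain_text` and the two constants are illustrative names, and `SimpleNamespace` stands in for StreamInfo so the example runs on its own.

```python
from types import SimpleNamespace

ACCEPTED_EXTENSIONS = {".txt", ".text"}
ACCEPTED_MIME_PREFIXES = ("text/plain",)


def accepts_plain_text(stream_info) -> bool:
    """Return True when the extension or the MIME type looks like plain text."""
    # Either field may be missing, so fall back to an empty string
    extension = (stream_info.extension or "").lower()
    mimetype = (stream_info.mimetype or "").lower()

    if extension in ACCEPTED_EXTENSIONS:
        return True
    # startswith also matches values such as "text/plain; charset=utf-8"
    return mimetype.startswith(ACCEPTED_MIME_PREFIXES)


# SimpleNamespace is a stand-in for StreamInfo in this sketch
print(accepts_plain_text(SimpleNamespace(extension=".TXT", mimetype=None)))
print(accepts_plain_text(SimpleNamespace(extension=None, mimetype="text/plain; charset=utf-8")))
print(accepts_plain_text(SimpleNamespace(extension=".pdf", mimetype="application/pdf")))
```

Lowercasing both fields means a file named NOTES.TXT is treated the same as notes.txt.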

Rule 2: The `convert()` Method

If accepts() returns True, the Orchestrator calls convert(). This method receives the raw data stream, reads it, and transforms it.

def convert(self, file_stream, stream_info, **kwargs):
    # Read the data from the stream
    text_data = file_stream.read().decode("utf-8")
    
    # Return the result object
    return DocumentConverterResult(markdown=text_data)
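To try the same logic without touching the filesystem, an in-memory stream can play the role of the file. This is a sketch only: `ResultSketch` is a hypothetical stand-in for DocumentConverterResult, and passing `errors="replace"` is one possible way to cope with bytes that are not valid UTF-8, not necessarily what the library's own converters do.

```python
import io
from types import SimpleNamespace


class ResultSketch:
    """Hypothetical stand-in for DocumentConverterResult."""

    def __init__(self, markdown, title=None):
        self.markdown = markdown
        self.title = title


def convert_plain_text(file_stream, stream_info, **kwargs):
    # Read every remaining byte from the stream
    raw = file_stream.read()
    # Invalid byte sequences become U+FFFD instead of raising UnicodeDecodeError
    text = raw.decode("utf-8", errors="replace")
    return ResultSketch(markdown=text)


stream = io.BytesIO("# Notes\n\nCafé menu".encode("utf-8"))
result = convert_plain_text(stream, SimpleNamespace(extension=".txt"))
print(result.markdown)
```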

Tutorial: Building a "Reverse" Converter

To understand this better, let's build a useless but educational converter. This converter will accept files ending in .rev and return their text reversed.

Step 1: The Setup

We need to import the base classes. These act as the template we must fill in.

from markitdown import DocumentConverter, DocumentConverterResult

# Create our class inheriting from the blueprint
class ReverseConverter(DocumentConverter):
    pass

Step 2: Implementing `accepts`

We tell our converter to only raise its hand if the file extension is .rev.

    def accepts(self, file_stream, stream_info, **kwargs):
        # We look at stream_info to see the file details
        extension = (stream_info.extension or "").lower()
        
        # We only want .rev files
        return extension == ".rev"

Step 3: Implementing `convert`

Now we do the actual logic. We read the stream, reverse the string, and wrap it in a DocumentConverterResult.

    def convert(self, file_stream, stream_info, **kwargs):
        # 1. Read the bytes and decode to string
        content = file_stream.read().decode("utf-8")
        
        # 2. Reverse the string (Python magic)
        reversed_markdown = content[::-1]
        
        # 3. Return the standard result object
        return DocumentConverterResult(markdown=reversed_markdown)

Step 4: Plugging it In

Now we introduce our new worker to the General Contractor.

from markitdown import MarkItDown

md = MarkItDown()

# Register our new specialist!
md.register_converter(ReverseConverter())

# Now MarkItDown knows how to handle .rev files
# result = md.convert("secret_message.rev")
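Before involving the Orchestrator, it can help to exercise the converter on its own. The sketch below assembles Steps 1 to 3 into one runnable piece. The two small classes at the top are stand-ins defined here so the example runs without the library installed, and `SimpleNamespace` is a stand-in for StreamInfo; with MarkItDown available, the import from Step 1 would replace the stand-ins.

```python
import io
from types import SimpleNamespace


# Stand-ins (assumptions for this sketch). With MarkItDown installed, delete
# these two classes and use the import from Step 1 instead.
class DocumentConverter:
    pass


class DocumentConverterResult:
    def __init__(self, markdown, title=None):
        self.markdown = markdown
        self.title = title


class ReverseConverter(DocumentConverter):
    def accepts(self, file_stream, stream_info, **kwargs):
        # Only .rev files, compared case-insensitively
        return (stream_info.extension or "").lower() == ".rev"

    def convert(self, file_stream, stream_info, **kwargs):
        content = file_stream.read().decode("utf-8")
        return DocumentConverterResult(markdown=content[::-1])


converter = ReverseConverter()
info = SimpleNamespace(extension=".REV")  # stand-in for StreamInfo
stream = io.BytesIO(b"Hello World")

if converter.accepts(stream, info):
    result = converter.convert(stream, info)
    print(result.markdown)  # dlroW olleH
```

Calling the two methods by hand like this is also a convenient shape for a unit test, since no real file is needed.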

How It Works Under the Hood

When you define a converter, you are subclassing the base class defined in _base_converter.py. Let's visualize the conversation between the Orchestrator and your converter.

sequenceDiagram
    participant Orchestrator
    participant RevConverter as ReverseConverter
    participant Stream as File Data
    Note over Orchestrator: User calls .convert("file.rev")
    Orchestrator->>RevConverter: call accepts(stream, info)
    RevConverter->>RevConverter: Check info.extension
    RevConverter-->>Orchestrator: Returns True
    Orchestrator->>RevConverter: call convert(stream, info)
    RevConverter->>Stream: read()
    Stream-->>RevConverter: "Hello World"
    RevConverter->>RevConverter: Reverse text -> "dlroW olleH"
    RevConverter-->>Orchestrator: Returns DocumentConverterResult
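The conversation in the diagram boils down to a loop: ask each converter whether it accepts the file, then hand the stream to the first one that says yes. The sketch below illustrates that idea only. `dispatch` and `UpperConverter` are hypothetical names, `SimpleNamespace` stands in for both StreamInfo and the result object, and the real Orchestrator likely does more (for example, ordering converters and handling failures).

```python
import io
from types import SimpleNamespace


class UpperConverter:
    """Toy converter: handles .up files by upper-casing their text."""

    def accepts(self, file_stream, stream_info, **kwargs):
        return stream_info.extension == ".up"

    def convert(self, file_stream, stream_info, **kwargs):
        text = file_stream.read().decode("utf-8")
        return SimpleNamespace(markdown=text.upper())


class ReverseConverter:
    """Toy converter: handles .rev files by reversing their text."""

    def accepts(self, file_stream, stream_info, **kwargs):
        return stream_info.extension == ".rev"

    def convert(self, file_stream, stream_info, **kwargs):
        text = file_stream.read().decode("utf-8")
        return SimpleNamespace(markdown=text[::-1])


def dispatch(converters, file_stream, stream_info):
    """Ask each converter in turn; the first one that accepts does the work."""
    for converter in converters:
        if converter.accepts(file_stream, stream_info):
            return converter.convert(file_stream, stream_info)
    raise ValueError(f"No converter accepts {stream_info.extension!r}")


workers = [UpperConverter(), ReverseConverter()]
result = dispatch(workers, io.BytesIO(b"Hello World"), SimpleNamespace(extension=".rev"))
print(result.markdown)  # dlroW olleH
```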

The Result Object

You might have noticed DocumentConverterResult. Why not just return a string?

The DocumentConverterResult (defined in _base_converter.py) is a container that holds the Markdown and leaves room for extra data, such as the title of the document. In simplified form:

class DocumentConverterResult:
    # The bare * makes title a keyword-only argument
    def __init__(self, markdown, *, title=None):
        self.markdown = markdown
        self.title = title

This means if your converter is smart (like a PDF parser), it can extract the document title and pass it along separately from the body text.
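As an illustration of filling in the title, a converter could treat a leading Markdown heading as the document title. This is a sketch under stated assumptions: `extract_title` is a hypothetical helper, the "first line starts with `# `" rule is just one possible heuristic, and `ResultSketch` is a stand-in for DocumentConverterResult.

```python
class ResultSketch:
    """Hypothetical stand-in for DocumentConverterResult."""

    def __init__(self, markdown, *, title=None):
        self.markdown = markdown
        self.title = title


def extract_title(markdown_text):
    """Return the text of a leading '# ' heading, or None if there is none."""
    lines = markdown_text.splitlines()
    if lines and lines[0].startswith("# "):
        return lines[0][2:].strip()
    return None


body = "# Quarterly Report\n\nRevenue grew."
result = ResultSketch(markdown=body, title=extract_title(body))
print(result.title)  # Quarterly Report
```

The body text is passed through untouched; the title travels next to it rather than being cut out of it.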

The Input: `BinaryIO` and `StreamInfo`

The interface doesn't just give you a filename string. It gives you:

file_stream (BinaryIO): An open data pipe. You can .read() it.
stream_info (StreamInfo): A neat package containing metadata.

We will learn exactly how StreamInfo is generated in the next chapter.
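One practical consequence of receiving a stream rather than a filename is that reading moves a cursor forward. If a check ever needs to peek at the first few bytes, a reasonable habit is to put the cursor back afterwards so the later full read still starts from the beginning. The sketch below shows the idea with an in-memory stream; the PDF header check is only an illustration, not a description of how any built-in converter works.

```python
import io

stream = io.BytesIO(b"%PDF-1.7 rest of the file")

# Remember where the cursor is, peek at the first bytes, then rewind
start = stream.tell()
header = stream.read(5)
stream.seek(start)

looks_like_pdf = header == b"%PDF-"

# A later full read still sees everything, including the bytes we peeked at
body = stream.read()
print(looks_like_pdf, len(body))
```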

Summary

In this chapter, you learned:

The Interface: DocumentConverter is the abstract blueprint that ensures all tools fit into the system.
The Contract: Every converter must answer accepts() and perform convert().
The Result: Converters return a DocumentConverterResult object, not just a plain string.

Now that we have our specialists ready, how does the Orchestrator generate that "ID Card" (StreamInfo) to decide which specialist to call?

Next Chapter: Stream Identification & Routing
