In the previous chapter, The MarkItDown Orchestrator, we introduced the MarkItDown class as a "General Contractor" that manages file conversions.
But a contractor is nothing without skilled workers. In this ecosystem, the workers are called Converters.
Imagine if every time you bought a new toaster or lamp, you had to rewire your house to fit it. That would be a nightmare. Instead, we have standard electrical sockets. If a device has the right plug, it works.
The DocumentConverter Interface is that standard socket for MarkItDown.
It defines a strict "Job Description" that allows any tool—whether it reads PDFs, YouTube captions, or Excel sheets—to plug into the main system. As long as the tool follows the rules of the interface, the Orchestrator can use it.
The interface is defined by a class called DocumentConverter. To create a new file handler, you must follow two main rules (methods):
- accepts(): A method that answers "Can I handle this file?"
- convert(): A method that actually performs the work and returns Markdown.

Let's break these down.

accepts() Method
Before the Orchestrator hands over a file, it asks the converter: "Is this for you?"
The converter looks at the file's "ID Card" (called StreamInfo), which contains the file extension and mimetype.
def accepts(self, file_stream, stream_info, **kwargs):
    # Check if the file extension is ".txt"
    if stream_info.extension == ".txt":
        return True
    # If not, say "No, I can't do this job"
    return False
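The check above looks only at the extension. Since StreamInfo also carries a mimetype, a converter could reasonably accept on either signal. Here is a minimal sketch of that idea. `FakeStreamInfo` and `PlainTextAcceptor` are hypothetical stand-ins so the snippet runs without the library installed, and matching on the `text/plain` prefix is an assumption about what a text converter might want, not a rule of the interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeStreamInfo:
    """Hypothetical stand-in for StreamInfo (only the two fields named above)."""
    extension: Optional[str] = None
    mimetype: Optional[str] = None


class PlainTextAcceptor:
    """Illustrative only: shows the accepts() decision, not a full converter."""

    def accepts(self, file_stream, stream_info, **kwargs):
        # Either field may be missing, so fall back to an empty string.
        extension = (stream_info.extension or "").lower()
        mimetype = (stream_info.mimetype or "").lower()
        # Accept on a matching extension OR a matching MIME type.
        return extension == ".txt" or mimetype.startswith("text/plain")


acceptor = PlainTextAcceptor()
by_extension = acceptor.accepts(None, FakeStreamInfo(extension=".TXT"))
by_mimetype = acceptor.accepts(None, FakeStreamInfo(mimetype="text/plain; charset=utf-8"))
rejected = acceptor.accepts(None, FakeStreamInfo(extension=".pdf", mimetype="application/pdf"))
print(by_extension, by_mimetype, rejected)  # True True False
```

Normalizing with `or ""` and `.lower()` keeps the check from crashing when a field is missing and from rejecting files named `NOTES.TXT`.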
convert() Method
If accepts() returns True, the Orchestrator calls convert(). This method receives the raw data stream, reads it, and transforms it.
def convert(self, file_stream, stream_info, **kwargs):
    # Read the data from the stream
    text_data = file_stream.read().decode("utf-8")
    # Return the result object
    return DocumentConverterResult(markdown=text_data)
To understand this better, let's build a useless but educational converter. This converter will accept files ending in .rev and write the text backwards.
We need to import the base classes. These act as the template we must fill in.
from markitdown import DocumentConverter, DocumentConverterResult
# Create our class inheriting from the blueprint
class ReverseConverter(DocumentConverter):
    pass
accepts
We tell our converter to only raise its hand if the file extension is .rev.
def accepts(self, file_stream, stream_info, **kwargs):
    # We look at stream_info to see the file details
    extension = (stream_info.extension or "").lower()
    # We only want .rev files
    return extension == ".rev"
convert
Now we do the actual logic. We read the stream, reverse the string, and wrap it in a DocumentConverterResult.
def convert(self, file_stream, stream_info, **kwargs):
    # 1. Read the bytes and decode to string
    content = file_stream.read().decode("utf-8")
    # 2. Reverse the string (Python magic)
    reversed_markdown = content[::-1]
    # 3. Return the standard result object
    return DocumentConverterResult(markdown=reversed_markdown)
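Putting the two methods together, the converter can be tried out against an in-memory stream before it is registered anywhere. The sketch below is self-contained: `FakeStreamInfo` and `FakeResult` are hypothetical stand-ins for StreamInfo and DocumentConverterResult, and the class does not inherit from DocumentConverter, so it runs without the library. In real use you would keep the imports and base class shown earlier.

```python
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeStreamInfo:
    """Hypothetical stand-in for StreamInfo."""
    extension: Optional[str] = None


class FakeResult:
    """Hypothetical stand-in for DocumentConverterResult."""

    def __init__(self, markdown, title=None):
        self.markdown = markdown
        self.title = title


class ReverseConverter:
    """Same logic as the chapter's converter, minus the real base class."""

    def accepts(self, file_stream, stream_info, **kwargs):
        # Only raise our hand for .rev files (case-insensitive).
        return (stream_info.extension or "").lower() == ".rev"

    def convert(self, file_stream, stream_info, **kwargs):
        # Read bytes, decode to text, reverse, and wrap in a result object.
        content = file_stream.read().decode("utf-8")
        return FakeResult(markdown=content[::-1])


converter = ReverseConverter()
stream = io.BytesIO(b"dlrow olleh")
info = FakeStreamInfo(extension=".rev")

# Mirror the two-step protocol: ask first, then convert.
result = converter.convert(stream, info) if converter.accepts(stream, info) else None
print(result.markdown)  # hello world
```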
Now we introduce our new worker to the General Contractor.
from markitdown import MarkItDown
md = MarkItDown()
# Register our new specialist!
md.register_converter(ReverseConverter())
# Now MarkItDown knows how to handle .rev files
# result = md.convert("secret_message.rev")
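When you define a converter, you are interacting with _base_converter.py. The conversation between the Orchestrator and your converter has two steps: the Orchestrator calls accepts() with the stream and its StreamInfo, and if the answer is True, it calls convert() and receives a DocumentConverterResult back.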
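That exchange can be modeled as a small loop. The sketch below is a simplified model, not MarkItDown's actual dispatch code: `dispatch`, `Result`, and both toy converters are hypothetical names, and rewinding the stream after accepts() is an assumption made here so that a converter that peeks at the data does not starve convert().

```python
import io
from types import SimpleNamespace


class Result:
    """Hypothetical stand-in for DocumentConverterResult."""

    def __init__(self, markdown, title=None):
        self.markdown = markdown
        self.title = title


class TextConverter:
    def accepts(self, file_stream, stream_info, **kwargs):
        return (stream_info.extension or "").lower() == ".txt"

    def convert(self, file_stream, stream_info, **kwargs):
        return Result(markdown=file_stream.read().decode("utf-8"))


class ReverseConverter:
    def accepts(self, file_stream, stream_info, **kwargs):
        return (stream_info.extension or "").lower() == ".rev"

    def convert(self, file_stream, stream_info, **kwargs):
        return Result(markdown=file_stream.read().decode("utf-8")[::-1])


def dispatch(converters, file_stream, stream_info):
    """Ask each converter in turn; the first one that accepts does the work."""
    for converter in converters:
        start = file_stream.tell()
        accepted = converter.accepts(file_stream, stream_info)
        # Assumption of this sketch: rewind in case accepts() read from the stream.
        file_stream.seek(start)
        if accepted:
            return converter.convert(file_stream, stream_info)
    raise ValueError("no converter accepted this stream")


converters = [TextConverter(), ReverseConverter()]
rev_result = dispatch(converters, io.BytesIO(b"cba"), SimpleNamespace(extension=".rev"))
txt_result = dispatch(converters, io.BytesIO(b"abc"), SimpleNamespace(extension=".txt"))
print(rev_result.markdown, txt_result.markdown)  # abc abc
```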
You might have noticed DocumentConverterResult. Why not just return a string?
The DocumentConverterResult (defined in _base_converter.py) is a container that holds the Markdown, but allows room for extra data, like the Title of the document.
class DocumentConverterResult:
    def __init__(self, markdown, title=None):
        self.markdown = markdown
        self.title = title
This means if your converter is smart (like a PDF parser), it can extract the document title and pass it along separately from the body text.
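As a rough illustration of passing a title along, the sketch below treats the first non-empty line of a text stream as the title. The `DocumentConverterResult` class here is a local copy of the simplified version shown above so the snippet is self-contained, and `convert_with_title` is a hypothetical helper; a real PDF parser would find its title very differently.

```python
import io


class DocumentConverterResult:
    """Local copy of the simplified class shown above."""

    def __init__(self, markdown, title=None):
        self.markdown = markdown
        self.title = title


def convert_with_title(file_stream):
    """Hypothetical helper: use the first non-empty line as the title."""
    text = file_stream.read().decode("utf-8")
    # next() with a default returns None when every line is blank.
    title = next((line.strip() for line in text.splitlines() if line.strip()), None)
    return DocumentConverterResult(markdown=text, title=title)


result = convert_with_title(io.BytesIO(b"Quarterly Report\n\nRevenue grew.\n"))
empty = convert_with_title(io.BytesIO(b""))
print(result.title)  # Quarterly Report
print(empty.title)   # None
```

Because the title travels in its own field, the caller can use it (for example as a heading or a filename) without parsing the Markdown body again.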
BinaryIO and StreamInfo
The interface doesn't just give you a filename string. It gives you:
- file_stream (BinaryIO): An open data pipe. You can .read() it.
- stream_info (StreamInfo): A neat package containing metadata.
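One practical consequence of receiving a stream instead of a filename is that reading consumes it. The sketch below uses `io.BytesIO` from the standard library as an in-memory binary stream. The `SimpleNamespace` object is only a hypothetical stand-in for StreamInfo, carrying the two fields this chapter mentions.

```python
import io
from types import SimpleNamespace

# An in-memory binary stream behaves like an opened file in "rb" mode.
file_stream = io.BytesIO(b"hello")
# Stand-in for StreamInfo: metadata travels separately from the bytes.
stream_info = SimpleNamespace(extension=".txt", mimetype="text/plain")

first_read = file_stream.read()    # reads everything, cursor moves to the end
second_read = file_stream.read()   # nothing left to read
file_stream.seek(0)                # rewind to the beginning
after_rewind = file_stream.read()

print(first_read, second_read, after_rewind)  # b'hello' b'' b'hello'
```

If your accepts() ever needs to peek at the bytes, it is safest to restore the stream position before returning.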
We will learn exactly how StreamInfo is generated in the next chapter.
In this chapter, you learned:
- DocumentConverter is the abstract blueprint that ensures all tools fit into the system.
- Every converter must answer accepts() and perform convert().
- Converters return a DocumentConverterResult object, not just a plain string.
Now that we have our specialists ready, how does the Orchestrator generate that "ID Card" (StreamInfo) to decide which specialist to call?
Next Chapter: Stream Identification & Routing
Generated by Code IQ