Chapter 1: The MarkItDown Orchestrator

Welcome to the MarkItDown tutorial! In this first chapter, we will explore the heart of the library: the MarkItDown class.

The Problem: File Chaos

Imagine you have a directory full of random files: a PDF report, an Excel spreadsheet, a ZIP archive, and a URL to a blog post. You want to extract the text from all of them and feed it into a Large Language Model (LLM) or a search index.

Writing one script for PDFs, another for Excel, and a third for web scraping is tedious and error-prone. You don't want to worry about how to read a PDF; you just want the text.

The Solution: The General Contractor

This is where the MarkItDown Orchestrator comes in.

Think of the MarkItDown class as a General Contractor (or a Dispatcher).

  1. You (The Client) hand the Contractor a job (a file or URL).
  2. The Contractor looks at the job. "Ah, this is a PDF."
  3. The Contractor checks their contact list of Specialists (Converters).
  4. The Contractor assigns the PDF Specialist to do the work.
  5. You get the finished result (Markdown text).

You don't need to know which specialist did the work; the Orchestrator handles the logistics for you.

Minimal Usage Example

Let's look at how simple it is to use the Orchestrator. We will perform a conversion in just 4 lines of code.

Step 1: Initialization

First, we import the class and hire our "General Contractor" by creating an instance of MarkItDown.

from markitdown import MarkItDown

# Initialize the Orchestrator
# This automatically loads all built-in specialists (converters)
md = MarkItDown()

When this line runs, MarkItDown registers its built-in converters, sets up a web session for fetching URLs, and loads Magika, which it uses to detect file types.

Step 2: The Conversion

Now, we hand it a file. The .convert() method is the magic button.

# We just pass the filename. 
# We don't need to tell it that it's an Excel file.
result = md.convert("sales_data.xlsx")

# Print the text content converted to Markdown
print(result.text_content)

That's it. The Orchestrator inspected sales_data.xlsx, realized it needed the Excel specialist, performed the conversion, and returned the text.
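Because every input goes through the same call, converting a whole directory of mixed files reduces to a loop. The sketch below is one way to write it. `convert_all` is a hypothetical helper, not part of MarkItDown; it only assumes that the object it receives has a `.convert()` method whose result exposes `.text_content`, as in the example above. Failures are collected rather than raised, so one unreadable file does not stop the batch.

```python
from pathlib import Path


def convert_all(converter, directory):
    """Convert every regular file in `directory`; return (texts, errors).

    `converter` is any object with a .convert(path) method whose result
    has a .text_content attribute (for example, a MarkItDown instance).
    This helper is illustrative and not part of the library.
    """
    texts = {}
    errors = {}
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # skip sub-directories
        try:
            texts[path.name] = converter.convert(str(path)).text_content
        except Exception as exc:  # broad on purpose: keep the batch going
            errors[path.name] = repr(exc)
    return texts, errors
```

With the library installed, a call such as `texts, errors = convert_all(MarkItDown(), "reports")` (the directory name is made up) would return the Markdown for each file keyed by file name, plus a record of anything that failed.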

How It Works Under the Hood

What actually happens when you call .convert()? Let's visualize the flow.

sequenceDiagram
    participant User
    participant Orchestrator as MarkItDown Class
    participant Identifier as Stream Identification
    participant Converter as Specific Converter (e.g., PDF)
    User->>Orchestrator: Call .convert("doc.pdf")
    Orchestrator->>Identifier: What file is this?
    Identifier-->>Orchestrator: It looks like a PDF!
    Orchestrator->>Orchestrator: Check list of Converters...
    Orchestrator->>Converter: Can you handle PDF?
    Converter-->>Orchestrator: Yes!
    Orchestrator->>Converter: Convert this stream.
    Converter-->>Orchestrator: Here is the Markdown.
    Orchestrator-->>User: Return Result

1. Registration (Hiring the Specialists)

When you create the MarkItDown object, it calls enable_builtins() by default. This method populates a list inside the Orchestrator with all the built-in converters.

Here is a simplified look at the internal code in _markitdown.py:

def enable_builtins(self, **kwargs) -> None:
    # ... setup code ...

    # Register the specialists!
    self.register_converter(PlainTextConverter())
    self.register_converter(HtmlConverter())
    self.register_converter(PdfConverter())
    self.register_converter(XlsxConverter())
    # ... and many more

The Orchestrator now has a list of "employees" ready to work.

2. Smart Routing (Assigning the Job)

When you call .convert(), the Orchestrator opens the file and tries to figure out what it is. We will cover exactly how it guesses the file type in Stream Identification & Routing.

Once it has a guess, it loops through its list of converters. It asks each one, "Do you accept this file?"

Here is the logic from the _convert method (simplified):

# simplified from packages/markitdown/src/markitdown/_markitdown.py

# Sort converters by priority: lower values (the more specialized converters) are tried first
sorted_registrations = sorted(self._converters, key=lambda x: x.priority)

for converter_reg in sorted_registrations:
    converter = converter_reg.converter
    
    # Ask the converter if it can handle this file
    if converter.accepts(file_stream, stream_info):
        # If yes, do the job!
        return converter.convert(file_stream, stream_info)

Because the list is sorted by priority, the more specific converters get the first chance to claim a file. A CSV file, for example, should be picked up by the CsvConverter before the generic PlainTextConverter ever sees it.
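The interplay between priority and `accepts()` is easier to see in a toy model. The sketch below is not MarkItDown's code: `ToyConverter`, `Registration`, `route`, and the numeric priorities are made up for illustration, and acceptance is decided from the file extension alone. It only assumes, as in the loop above, that lower priority values are tried first.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyConverter:
    name: str
    accepts: Callable[[str], bool]  # decides from the file name alone


@dataclass
class Registration:
    converter: ToyConverter
    priority: float  # lower value = tried earlier


def route(registrations: List[Registration], filename: str) -> Optional[str]:
    """Return the name of the first converter that accepts `filename`."""
    for reg in sorted(registrations, key=lambda r: r.priority):
        if reg.converter.accepts(filename):
            return reg.converter.name
    return None  # no specialist volunteered


registry = [
    # Generic fallback: claims any text-like file, so it gets a late slot.
    Registration(
        ToyConverter("plain_text", lambda f: f.endswith((".txt", ".csv", ".md"))),
        priority=10.0,
    ),
    # Specialist: claims only CSV, so it gets an early slot.
    Registration(ToyConverter("csv", lambda f: f.endswith(".csv")), priority=0.0),
]

print(route(registry, "table.csv"))  # csv
print(route(registry, "notes.txt"))  # plain_text
print(route(registry, "photo.raw"))  # None
```

Swapping the two priority values would make `plain_text` win for `table.csv`, which is why a generic converter belongs in the later slot.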

3. Handling URLs vs Local Files

The Orchestrator is smart enough to handle different input types automatically.

# Handling a URL
result_web = md.convert("https://en.wikipedia.org/wiki/Markdown")

# Handling a local file
result_local = md.convert("/path/to/document.pdf")

The convert method acts as a traffic controller, routing each input to a more specific method such as convert_url or convert_local.
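One plausible way to tell the two kinds of input apart is to inspect the URI scheme. The sketch below is a standalone approximation using only the standard library. `classify_source` is a hypothetical helper, and the real `convert` method may apply different or additional rules.

```python
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Return "url" for http(s) sources and "local" for everything else."""
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        return "url"
    return "local"


print(classify_source("https://en.wikipedia.org/wiki/Markdown"))  # url
print(classify_source("/path/to/document.pdf"))                   # local
print(classify_source("C:\\reports\\q3.xlsx"))                    # local
```

The check uses an allowlist of schemes rather than "any scheme present" because a Windows path such as `C:\reports\q3.xlsx` parses with the one-letter scheme `c`.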

Configuration

The Orchestrator also manages global configuration. To use an AI model to describe images or to enable plugins, you configure the Orchestrator once, and it passes those settings down to the individual converters.

# Example: Enabling third-party plugins
md = MarkItDown(enable_plugins=True)
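How settings travel from the Orchestrator to a specialist can be sketched with keyword arguments. Everything in this toy model is hypothetical: `ToyOrchestrator`, `ImageSpecialist`, and the `describe_images` option are not MarkItDown names. It only illustrates the pattern of storing options once and forwarding them on every conversion.

```python
class ImageSpecialist:
    """Toy converter whose output depends on a forwarded option."""

    def convert(self, name: str, **options) -> str:
        if options.get("describe_images"):
            return f"![{name}]({name}) (description requested)"
        return f"![{name}]({name})"


class ToyOrchestrator:
    """Stores global options once and forwards them on each call."""

    def __init__(self, **options):
        self._options = dict(options)
        self._specialist = ImageSpecialist()

    def convert(self, name: str, **overrides) -> str:
        # Per-call overrides win over the stored defaults.
        merged = {**self._options, **overrides}
        return self._specialist.convert(name, **merged)


plain = ToyOrchestrator()
described = ToyOrchestrator(describe_images=True)

print(plain.convert("chart.png"))      # ![chart.png](chart.png)
print(described.convert("chart.png"))  # ![chart.png](chart.png) (description requested)
print(described.convert("chart.png", describe_images=False))  # ![chart.png](chart.png)
```

The specialist never reads global state. It sees only what the orchestrator hands it, so the same converter class behaves differently under differently configured orchestrators.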

Summary

In this chapter, you learned:

  1. The Role: MarkItDown acts as a "General Contractor" or Orchestrator.
  2. Usage: You initialize it once and call .convert() for any file type.
  3. Process: It maintains a registry of converters, identifies the file type, and routes the work to the correct specialist.

But what exactly does a "Specialist" look like? How do they know if they can accept a file?

In the next chapter, we will look at the contract that every specialist must sign: The DocumentConverter Interface.
