Welcome to the MarkItDown tutorial! In this first chapter, we will explore the heart of the library: the MarkItDown class.
Imagine you have a directory full of random files: a PDF report, an Excel spreadsheet, a ZIP archive, and a URL to a blog post. You want to extract the text from all of them and feed it into a Large Language Model (LLM) or a search index.
Writing one script for PDFs, another for Excel, and yet another for web scraping is exhausting. You don't want to worry about how to read a PDF; you just want the text.
This is where the MarkItDown Orchestrator comes in.
Think of the MarkItDown class as a General Contractor (or a Dispatcher): you hand over a file, and it picks the right specialist (a converter) for the job.
You don't need to know which specialist did the work; the Orchestrator handles the logistics for you.
Let's look at how simple it is to use the Orchestrator. We will perform a conversion in just 4 lines of code.
First, we import the class and hire our "General Contractor" by creating an instance of MarkItDown.
from markitdown import MarkItDown
# Initialize the Orchestrator
# This automatically loads all built-in specialists (converters)
md = MarkItDown()
When you run this, MarkItDown prepares its internals: it sets up a web session and loads helpers such as Magika (for detecting file types).
Now, we hand it a file. The .convert() method is the magic button.
# We just pass the filename.
# We don't need to tell it that it's an Excel file.
result = md.convert("sales_data.xlsx")
# Print the text content converted to Markdown
print(result.text_content)
That's it. The Orchestrator inspected sales_data.xlsx, realized it needed the Excel specialist, performed the conversion, and returned the text.
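Returning to the directory of mixed files from the introduction, the same call can be applied in a loop. The following is a minimal sketch, not part of the library: `convert_all` is a hypothetical helper that accepts any callable shaped like `md.convert` (a path in, an object with a `text_content` attribute out) and records failures instead of stopping at the first file that cannot be converted.

```python
from typing import Callable, Dict, Iterable, Tuple


def convert_all(
    paths: Iterable[str],
    convert: Callable[[str], object],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Hypothetical helper: run `convert` on each path.

    Returns (texts, errors): texts maps path -> Markdown text,
    errors maps path -> error message for files that failed.
    """
    texts: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for path in paths:
        try:
            result = convert(path)
            # Assumes the result exposes `text_content`, as in the example above.
            texts[path] = result.text_content
        except Exception as exc:  # keep going; report the failure afterwards
            errors[path] = f"{type(exc).__name__}: {exc}"
    return texts, errors
```

With the instance created earlier, this could be called as `texts, errors = convert_all(["report.pdf", "sales_data.xlsx"], md.convert)`; the file names here are placeholders.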
What actually happens when you call .convert()? Let's visualize the flow.
When you first create the MarkItDown object, it calls a method called enable_builtins(). This populates a list inside the Orchestrator with all the available tools.
Here is a simplified look at the internal code in _markitdown.py:
def enable_builtins(self, **kwargs) -> None:
    # ... setup code ...
    # Register the specialists!
    self.register_converter(PlainTextConverter())
    self.register_converter(HtmlConverter())
    self.register_converter(PdfConverter())
    self.register_converter(XlsxConverter())
    # ... and many more
The Orchestrator now has a list of "employees" ready to work.
When you call .convert(), the Orchestrator opens the file and tries to figure out what it is. We will cover exactly how it guesses the file type in Stream Identification & Routing.
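As a rough intuition for what such a guess might contain, the sketch below derives hints from the file name alone using Python's standard `mimetypes` module. This is an illustration only: `guess_hints` is a hypothetical function, and the library's own detection (which also involves Magika, as mentioned earlier) is the subject of Stream Identification & Routing.

```python
import mimetypes
import os
from typing import Dict, Optional


def guess_hints(filename: str) -> Dict[str, Optional[str]]:
    """Hypothetical helper: collect type hints from a file name only."""
    # Extension including the dot, lowercased; None when there is no extension.
    extension = os.path.splitext(filename)[1].lower() or None
    # guess_type returns (mimetype, encoding); either may be None.
    mimetype, _encoding = mimetypes.guess_type(filename)
    return {"extension": extension, "mimetype": mimetype}
```

A name-based guess like this is cheap but easy to fool (a renamed file, or a stream with no name at all), which is one reason a content-based detector is useful alongside it.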
Once it has a guess, it loops through its list of converters. It asks each one, "Do you accept this file?"
Here is the logic from the _convert method (simplified):
# simplified from packages/markitdown/src/markitdown/_markitdown.py
# Sort converters by priority (specialized ones first)
sorted_registrations = sorted(self._converters, key=lambda x: x.priority)
for converter_reg in sorted_registrations:
    converter = converter_reg.converter
    # Ask the converter if it can handle this file
    if converter.accepts(file_stream, stream_info):
        # If yes, do the job!
        return converter.convert(file_stream, stream_info)
This loop ensures that the most specific tool is used. For example, if you have a CSV file, the CsvConverter will likely grab it before the generic PlainTextConverter does.
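To see why the sort order matters, here is a self-contained toy version of the same loop. Everything in it (`ToyConverter`, `Registration`, `dispatch`, and the priority numbers) is invented for illustration and is not the library's code. The only idea carried over is that lower priority values are tried first and the first converter that accepts the input wins.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyConverter:
    name: str
    accepts: Callable[[str], bool]  # decides based on the file extension

    def convert(self, text: str) -> str:
        return f"[{self.name}] {text}"


@dataclass
class Registration:
    converter: ToyConverter
    priority: float


def dispatch(registrations: List[Registration], extension: str, text: str) -> str:
    # Lower priority value = asked earlier, mirroring the sorted() call above.
    for reg in sorted(registrations, key=lambda r: r.priority):
        if reg.converter.accepts(extension):
            return reg.converter.convert(text)
    raise ValueError(f"no converter accepts {extension!r}")


registry = [
    # Generic fallback: accepts several text-like extensions, so it is asked last.
    Registration(ToyConverter("plain", lambda ext: ext in {".txt", ".csv"}), priority=10.0),
    # Specific converter: only CSV, so it gets a lower value and is asked first.
    Registration(ToyConverter("csv", lambda ext: ext == ".csv"), priority=0.0),
]

csv_output = dispatch(registry, ".csv", "a,b")    # handled by the "csv" converter
txt_output = dispatch(registry, ".txt", "hello")  # falls through to "plain"
```

Note that the order in which the toy converters were registered does not decide the outcome; only the priority value does. If both had the same priority, the generic one would win here simply because it appears first in the list.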
The Orchestrator is smart enough to handle different input types automatically.
If the input is a string starting with http or https, it downloads the file first.
# Handling a URL
result_web = md.convert("https://en.wikipedia.org/wiki/Markdown")
# Handling a local file
result_local = md.convert("/path/to/document.pdf")
The internal convert method acts as a traffic controller, routing these inputs to convert_url or convert_local seamlessly.
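The routing decision itself can be pictured with a small sketch. This is an illustration under stated assumptions, not the library's implementation: `classify_source` is a hypothetical function that looks only at the URL scheme, and the strings it returns simply name the two methods mentioned above.

```python
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Hypothetical: decide which handler a string input would be routed to."""
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        return "convert_url"    # remote: download first, then convert
    return "convert_local"      # everything else is treated as a local path
```

Checking the parsed scheme rather than using a plain prefix test keeps a file that merely has "http" in its name from being mistaken for a URL.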
The Orchestrator also manages global configuration. If you need to use AI to describe images or handle plugins, you configure the Orchestrator, and it passes those settings down to the individual converters.
# Example: Enabling 3rd party plugins
md = MarkItDown(enable_plugins=True)
In this chapter, you learned:
MarkItDown acts as a "General Contractor" or Orchestrator.
You call .convert() for any file type.
But what exactly does a "Specialist" look like? How do they know if they can accept a file?
In the next chapter, we will look at the contract that every specialist must sign: The DocumentConverter Interface.