In the previous chapter, Stream Identification & Routing, we watched the "Triage Nurse" (the routing logic) identify a file and assign it to a specialist.
Now, we are going to meet the Specialists.
These are the classes that do the heavy lifting. If the Orchestrator is the manager, these are the workers who actually know how to read binary data—whether it's a spreadsheet, a slide deck, or a document—and translate it into Markdown.
Think of MarkItDown as a Translation Agency.
A generic translator can't handle every kind of document that comes through the door. You need Format-Specific Converters.
MarkItDown does not try to reinvent the wheel. Writing a PDF parser from scratch is incredibly difficult. Instead, these converters act as wrappers around existing, powerful open-source libraries.
The libraries involved include pandas (a data analysis library) for spreadsheets, pdfminer and pdfplumber for PDFs, and mammoth for Word documents.
MarkItDown acts as the glue that standardizes their output.
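To make the "glue" idea concrete, here is a minimal sketch of the wrapper pattern. It is not MarkItDown's code: `ConversionResult` and `JsonConverter` are hypothetical names, and Python's built-in `json` module stands in for a third-party library. The point is only that the wrapped library does the parsing while the converter hands back a result with a `markdown` attribute.

```python
import io
import json
from dataclasses import dataclass


@dataclass
class ConversionResult:
    # Hypothetical result type: callers only rely on `.markdown`.
    markdown: str


class JsonConverter:
    """Illustrative wrapper: json does the parsing, we only format the output."""

    def convert(self, file_stream) -> ConversionResult:
        # The wrapped library reads the raw bytes for us.
        data = json.load(file_stream)
        # This sketch assumes the top-level JSON value is an object.
        lines = [f"- **{key}**: {value}" for key, value in data.items()]
        return ConversionResult(markdown="\n".join(lines))


result = JsonConverter().convert(io.BytesIO(b'{"title": "Q3 report", "pages": 12}'))
print(result.markdown)
```

Every specialist in this chapter follows the same shape: read a stream with someone else's parser, then return Markdown.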
XlsxConverter
Let's look at how the XlsxConverter turns a spreadsheet into Markdown.
Markdown doesn't have "sheets" like Excel. So, the converter has to make a choice on how to represent them.
It writes a level-2 heading (for example, ## Sheet1) for each sheet name.
Here is a simplified look at the convert method inside _xlsx_converter.py.
    import pandas as pd

    def convert(self, file_stream, stream_info, **kwargs):
        # 1. Use the 'pandas' library to read the Excel file.
        #    sheet_name=None returns a dict mapping sheet names to DataFrames.
        sheets = pd.read_excel(file_stream, sheet_name=None)
        md_content = ""

        # 2. Loop through every sheet found
        for sheet_name in sheets:
            # Add the sheet name as a header
            md_content += f"## {sheet_name}\n"
            # ... logic to convert rows to table ...
You might wonder: "How do we get the table lines (|---|) right?"
The XlsxConverter actually uses a trick. It converts the data to HTML first, and then uses a helper called HtmlConverter to turn that HTML into Markdown.
            # Convert data to HTML
            html_content = sheets[sheet_name].to_html(index=False)
            # Use a helper to turn HTML -> Markdown
            md_content += self._html_converter.convert_string(html_content).markdown
Result: A complex binary Excel file becomes a clean, readable sequence of Markdown tables, one per sheet.
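To see why going through HTML makes the table lines easy, here is a small sketch of the HTML-to-Markdown step using only Python's standard library. It is a stand-in for illustration, not MarkItDown's HtmlConverter: the class `TableToMarkdown` and the function `html_table_to_markdown` are hypothetical names, and the sketch only handles a single simple table.

```python
from html.parser import HTMLParser


class TableToMarkdown(HTMLParser):
    """Collects <th>/<td> text row by row from a simple HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built
        self._cell = None  # text chunks of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_markdown(html: str) -> str:
    parser = TableToMarkdown()
    parser.feed(html)
    if not parser.rows:
        return ""
    header, body = parser.rows[0], parser.rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        # The separator row is just one '---' per header cell.
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


html = (
    "<table><thead><tr><th>Item</th><th>Price</th></tr></thead>"
    "<tbody><tr><td>Pen</td><td>2</td></tr></tbody></table>"
)
markdown = html_table_to_markdown(html)
print(markdown)
```

Because the HTML already marks where each row and cell begins and ends, the Markdown side only has to join cells with pipes.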
PptxConverter
PowerPoint files (.pptx) are tricky because they aren't just text; they are visual canvases.
The PptxConverter treats the presentation like a book.
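Each slide starts with an HTML comment marking its number (<!-- Slide number: 1 -->), and the slide title becomes a top-level heading (# Title).
What happens to images in a slide deck? The converter tries to describe them!

    # Simplified logic from _pptx_converter.py
    if self._is_picture(shape):
        # Simplified: use the shape's name as the image description
        # (author-written alt text, when present, would be preferred)
        alt_text = shape.name
        # Create a markdown image link (the file name here is illustrative)
        md_content += f"\n![{alt_text}]({shape.name}.jpg)\n"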
Note: In Chapter 6: AI & Intelligence Integration, we will see how MarkItDown can actually use AI to look at the slide image and write a description for it automatically!
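The "presentation as a book" idea can be sketched without any PowerPoint library at all. The example below is an assumption-laden illustration, not the real PptxConverter: `Slide`, `Picture`, and `slides_to_markdown` are hypothetical names, and the `.jpg` file name is made up from the shape name purely for demonstration.

```python
from dataclasses import dataclass, field


@dataclass
class Picture:
    name: str           # shape name, used as a fallback description
    alt_text: str = ""  # author-written description, may be empty


@dataclass
class Slide:
    title: str
    body: list = field(default_factory=list)      # paragraphs of text
    pictures: list = field(default_factory=list)  # Picture objects


def slides_to_markdown(slides):
    parts = []
    for number, slide in enumerate(slides, start=1):
        # Mark the slide boundary with an HTML comment, then emit the title.
        parts.append(f"<!-- Slide number: {number} -->")
        parts.append(f"# {slide.title}")
        parts.extend(slide.body)
        for picture in slide.pictures:
            # Prefer the author's alt text; fall back to the shape name.
            description = picture.alt_text or picture.name
            parts.append(f"![{description}]({picture.name}.jpg)")
    return "\n\n".join(parts)


deck = [
    Slide("Roadmap", ["Ship v2 in May."], [Picture("Chart1", "Revenue by quarter")]),
    Slide("Thanks"),
]
markdown = slides_to_markdown(deck)
print(markdown)
```

The HTML comments are invisible when the Markdown is rendered, but they let a downstream reader (or model) tell where one slide ends and the next begins.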
PdfConverter
PDF is the "Final Boss" of file conversion. Why? Because a PDF doesn't know what a "paragraph" or a "table" is. It only knows instructions like: "Draw the letter 'H' at x=10, y=20."
MarkItDown has to infer paragraphs and tables from those raw positions in order to reconstruct the text.
The PdfConverter follows a multi-step decision process to choose how to read the file.
The code in _pdf_converter.py contains a massive function called _extract_form_content_from_words.
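It uses math to look at the gaps between words: words that sit close together are likely part of the same phrase, while an unusually wide gap suggests a boundary between columns or fields.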
This allows MarkItDown to recreate complex invoices and forms that other converters might turn into a garbled mess of text.
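Here is a minimal sketch of the gap idea, under stated assumptions. It is not the body of _extract_form_content_from_words: the `Word` class, the `split_into_cells` function, and the `gap_threshold` value of 15 are all hypothetical choices made for illustration, and the sketch only handles a single line of text.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x0: float  # left edge of the word on the page
    x1: float  # right edge of the word on the page


def split_into_cells(words, gap_threshold=15.0):
    """Group the words of one line into cells, splitting at wide gaps."""
    cells = []
    current = []
    previous = None
    for word in sorted(words, key=lambda w: w.x0):
        # A gap wider than the threshold starts a new cell.
        if previous is not None and word.x0 - previous.x1 > gap_threshold:
            cells.append(" ".join(current))
            current = []
        current.append(word.text)
        previous = word
    if current:
        cells.append(" ".join(current))
    return cells


line = [
    Word("Invoice", 10, 50),
    Word("No.", 54, 70),
    Word("12345", 200, 235),
    Word("Total", 300, 330),
    Word("$99.00", 334, 370),
]
cells = split_into_cells(line)
row = "| " + " | ".join(cells) + " |"
print(row)
```

The small gaps (4 units) keep "Invoice No." and "Total $99.00" together, while the wide gaps split the line into three cells. A real implementation also has to group words into lines and pick thresholds that adapt to font size.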
What happens if you try to use the XlsxConverter but you don't have pandas installed on your computer?
The converters are designed to be safe. They check for their "tools" before starting work.
    # Inside the convert() method
    if _xlsx_dependency_exc_info is not None:
        # We found that pandas was missing during startup
        raise MissingDependencyException(
            "Could not convert .xlsx. Please install pandas."
        )
This ensures the system doesn't crash unexpectedly; it tells you exactly what is missing.
In this chapter, we looked inside the black box of the Format-Specific Converters.
They wrap existing libraries like pandas, pdfminer, and mammoth, and they reuse helpers like HtmlConverter to avoid rewriting code.
So far, we have only talked about files sitting on your hard drive. But what if the information you want is on the Internet?
In the next chapter, we will see how MarkItDown handles URLs, HTML, and web scraping.
Next Chapter: Web & Remote Content Handlers
Generated by Code IQ