In the previous chapter, Format-Specific Converters, we explored how MarkItDown processes files sitting on your computer, like Excel spreadsheets and PDFs.
But in the modern world, information doesn't just live on hard drives. It lives on the web.
In this chapter, we will introduce Web & Remote Content Handlers: specialized converters designed to go out to the internet, fetch specific URLs, and bring back a clean Markdown report.
Imagine you are writing a research paper. You have:

- A PDF report saved on your computer
- A YouTube video lecture
- A page of Bing search results
For the PDF, you use a standard converter (the "Desk Worker"). But for the YouTube video and Bing results, you need a Field Reporter.
You can't just "open" a YouTube link like a file. You need a tool that understands:

- How to fetch content over the network
- How to parse the page's HTML
- Where the valuable data (like the transcript) actually lives
MarkItDown includes specific "Field Reporters" for these tasks.
In Stream Identification & Routing, we learned that MarkItDown usually looks at file extensions or binary signatures.
For web content, it looks at the URL.
If you pass a string starting with https://, MarkItDown fetches the content. Then, specific converters check that URL to see if they should take the job.
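To make this routing idea concrete, here is an illustrative sketch (the class and function names below are assumptions for illustration, not MarkItDown's actual internals): each converter is asked, in order, whether it accepts the URL, and the first one to say yes takes the job.

```python
def pick_converter(url, converters):
    """Return the first converter whose accepts() claims this URL.

    `converters` is an ordered list of objects with an accepts(url)
    method -- a simplified stand-in for MarkItDown's routing logic.
    """
    for converter in converters:
        if converter.accepts(url):
            return converter
    return None


class YouTubeLike:
    """Claims only YouTube watch-page URLs."""
    def accepts(self, url):
        return "youtube.com/watch?" in url


class FallbackHtml:
    """Claims any http(s) URL as a generic HTML page."""
    def accepts(self, url):
        return url.startswith("http")


# Specific converters are listed before generic ones,
# so the YouTube handler wins for watch URLs.
chosen = pick_converter(
    "https://www.youtube.com/watch?v=abc123",
    [YouTubeLike(), FallbackHtml()],
)
```

Ordering matters here: the specific handler must be checked before the generic HTML fallback, or it would never get the chance to claim the URL.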
Let's look at the most powerful example: The YouTubeConverter. It doesn't just give you the text on the page; it attempts to download the closed captions (subtitles) to give you a full text transcript of the video.
You use the same .convert() method you are used to.
```python
from markitdown import MarkItDown

md = MarkItDown()

# Pass a YouTube URL instead of a filename
result = md.convert("https://www.youtube.com/watch?v=dQw4w9WgXcQ")

# Print the title and transcript
print(result.text_content)
```
The output isn't a mess of HTML code. It is a clean Markdown report.
```markdown
# YouTube

## Rick Astley - Never Gonna Give You Up (Official Music Video)

### Video Metadata
- **Views:** 1400000000
- **Runtime:** PT3M33S

### Transcript
We're no strangers to love
You know the rules and so do I
...
```
**Note:** Fetching transcripts requires the `youtube_transcript_api` library to be installed (`pip install youtube-transcript-api`).
Let's look at how the "Field Reporter" does its job under the hood.

### accepts() Method

Unlike local files, where we check for extensions like `.xlsx`, here we check whether the URL looks like a YouTube link.
Here is simplified logic from `_youtube_converter.py`:

```python
def accepts(self, file_stream, stream_info, **kwargs):
    url = stream_info.url or ""
    # Check if the URL matches the YouTube watch-page pattern
    if "youtube.com/watch?" in url:
        return True
    return False
```
### convert() Method

The conversion happens in two stages:

1. **Metadata:** Use BeautifulSoup to read the title and description from the page source.
2. **Transcript:** Use the YouTube Transcript API to fetch the video's captions.

```python
def convert(self, file_stream, stream_info, **kwargs):
    # 1. Parse HTML for metadata
    soup = bs4.BeautifulSoup(file_stream, "html.parser")
    title = soup.title.string

    # 2. Fetch transcript (simplified)
    video_id = get_video_id(stream_info.url)
    transcript = YouTubeTranscriptApi.get_transcript(video_id)

    # 3. Format as Markdown
    md = f"# {title}\n\n### Transcript\n{transcript}"
    return DocumentConverterResult(markdown=md)
```
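The `get_video_id` helper is not shown in the simplified code above. A minimal stand-in using only the standard library might look like this (a sketch for illustration, not the library's actual implementation):

```python
from urllib.parse import urlparse, parse_qs


def get_video_id(url: str) -> str:
    # Pull the "v" query parameter out of a watch URL, e.g.
    # https://www.youtube.com/watch?v=dQw4w9WgXcQ -> "dQw4w9WgXcQ"
    params = parse_qs(urlparse(url).query)
    return params["v"][0]
```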
Another specialized handler included in MarkItDown is the BingSerpConverter.
If you try to convert a search result page using a standard HTML converter, you get a lot of noise (navigation bars, footers, ads). The specialized Bing converter extracts only the search results.
```python
# A URL representing a Bing search for "Microsoft"
url = "https://www.bing.com/search?q=Microsoft"

result = md.convert(url)
print(result.text_content)
```
This converter specifically looks for the CSS classes that Bing uses to mark search results (such as `b_algo`). Here is simplified logic from `_bing_serp_converter.py`:
```python
def convert(self, file_stream, stream_info, **kwargs):
    soup = BeautifulSoup(file_stream, "html.parser")
    results = []

    # Find only the search result blocks
    for item in soup.find_all(class_="b_algo"):
        # Convert just this block to Markdown
        text = self._markdownify(item)
        results.append(text)

    return DocumentConverterResult(markdown="\n".join(results))
```
By targeting specific HTML classes (`b_algo`), the converter acts like a smart filter, discarding the "junk" and keeping the data.
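You can see the filtering effect in a small standalone snippet (the sample HTML below is fabricated for illustration; real Bing markup is more complex):

```python
from bs4 import BeautifulSoup

sample = """
<div class="b_algo"><h2>Microsoft</h2><p>Official home page.</p></div>
<div class="b_ad"><p>Sponsored: buy now!</p></div>
<div class="b_algo"><h2>Microsoft - Wikipedia</h2></div>
"""

soup = BeautifulSoup(sample, "html.parser")

# Keep only elements whose class list includes "b_algo";
# the "b_ad" block is silently discarded.
results = [item.get_text(" ", strip=True) for item in soup.find_all(class_="b_algo")]

for text in results:
    print(text)
```

Everything that is not inside a `b_algo` block, including the ad, simply never makes it into `results`.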
You can write your own handlers for websites you use often (e.g., a documentation site or a news aggregator).
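For instance, a handler for a hypothetical documentation site might look like the sketch below. The domain name and CSS class are made up, and the class returns a plain string to stay self-contained; in real MarkItDown you would subclass `DocumentConverter` and return a `DocumentConverterResult`.

```python
from bs4 import BeautifulSoup


class DocsSiteConverter:
    """Sketch of a custom handler for a hypothetical docs site."""

    def accepts(self, url: str) -> bool:
        # Only claim URLs from our (made-up) documentation domain
        return "docs.example.com" in url

    def convert(self, html: str) -> str:
        soup = BeautifulSoup(html, "html.parser")
        # Grab only the <div> holding the main article content,
        # ignoring navigation, sidebars, and footers
        main = soup.find("div", class_="article-body")
        return main.get_text(" ", strip=True) if main else ""


converter = DocsSiteConverter()
page = (
    '<div class="nav">Menu</div>'
    '<div class="article-body"><h1>Install</h1><p>Run pip install.</p></div>'
)
text = converter.convert(page)
```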
The recipe is:

1. Subclass `DocumentConverter`.
2. `accepts`: Check `stream_info.url` for your specific domain.
3. `convert`: Use BeautifulSoup to find the specific `<div>` that holds the main content.

In this chapter, you learned:
- Specialized converters like `YouTubeConverter` and `BingSerpConverter` extract specific data (like transcripts) while ignoring noise (like ads).
- The `accepts()` method can check URLs to decide which converter to use.

We have now covered local files and specific web pages. But sometimes, a file is just an image, or a chart with no text. How do we extract information from pixels?
In the next chapter, we will introduce the most powerful feature of MarkItDown: AI Integration.
Next Chapter: AI & Intelligence Integration
Generated by Code IQ