Chapter 5: Web & Remote Content Handlers

In the previous chapter, Format-Specific Converters, we explored how MarkItDown processes files sitting on your computer, like Excel spreadsheets and PDFs.

But in the modern world, information doesn't just live on hard drives. It lives on the web.

In this chapter, we will introduce Web & Remote Content Handlers. These are specialized converters designed to go out to the internet, navigate specific URLs, and bring back a clean report.

The Motivation: The Field Reporter

Imagine you are writing a research paper. You have:

  1. A PDF on your desktop.
  2. A YouTube video lecture.
  3. A page of Bing search results.

For the PDF, you use a standard converter (the "Desk Worker"). But for the YouTube video and Bing results, you need a Field Reporter.

You can't just "open" a YouTube link like a file. You need a tool that understands how that particular site is laid out and which parts of the page are worth bringing back.

MarkItDown includes specific "Field Reporters" for these tasks.

Concept: URL-Based Routing

In Stream Identification & Routing, we learned that MarkItDown usually looks at file extensions or binary signatures.

For web content, it looks at the URL.

If you pass a string starting with http:// or https://, MarkItDown first fetches the content. Each converter then inspects the URL to decide whether it should take the job.
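The routing idea can be sketched in a few lines. This is a minimal illustration, not MarkItDown's actual implementation: `StreamInfo`, `YouTubeLike`, `GenericHtml`, and `pick_converter` are all hypothetical stand-ins, and the only assumption taken from the text is that each converter gets to inspect the URL and say yes or no.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamInfo:
    """Stand-in for the metadata that travels with a fetched stream."""
    url: Optional[str] = None


class YouTubeLike:
    """Toy converter that claims YouTube watch pages by URL."""

    def accepts(self, stream_info: StreamInfo) -> bool:
        return "youtube.com/watch?" in (stream_info.url or "")


class GenericHtml:
    """Toy fallback that claims any web page."""

    def accepts(self, stream_info: StreamInfo) -> bool:
        return (stream_info.url or "").startswith(("http://", "https://"))


def pick_converter(converters, stream_info):
    """Return the first converter that accepts the stream, or None."""
    for converter in converters:
        if converter.accepts(stream_info):
            return converter
    return None


# Specific handlers come before general ones so they get first refusal.
converters = [YouTubeLike(), GenericHtml()]

chosen = pick_converter(converters, StreamInfo(url="https://www.youtube.com/watch?v=abc"))
print(type(chosen).__name__)
```

Ordering matters in a scheme like this: if the generic handler were listed first, it would claim every URL and the specialized one would never run.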

Tutorial: Converting a YouTube Video

Let's look at the most striking example: the YouTubeConverter. Rather than returning only the text on the page, it also attempts to download the closed captions (subtitles), so you get a full text transcript of the video.

Step 1: The Request

You use the same .convert() method you are used to.

from markitdown import MarkItDown

md = MarkItDown()

# Pass a YouTube URL instead of a filename
result = md.convert("https://www.youtube.com/watch?v=dQw4w9WgXcQ")

# Print the Title and Transcript
print(result.text_content)

Step 2: The Output

The output isn't a mess of HTML code. It is a clean Markdown report.

# YouTube
## Rick Astley - Never Gonna Give You Up (Official Music Video)

### Video Metadata
- **Views:** 1400000000
- **Runtime:** PT3M33S

### Transcript
We're no strangers to love
You know the rules and so do I
...

Note: Fetching the transcript requires the youtube-transcript-api package (imported as youtube_transcript_api) to be installed.

How It Works Under the Hood

Let's trace how the "Field Reporter" does its job using a sequence diagram.

sequenceDiagram
    participant User
    participant Orchestrator
    participant YTConverter as YouTube Converter
    participant YouTubeAPI as YouTube / Transcript API
    User->>Orchestrator: .convert("https://youtube.com/...")
    Orchestrator->>Orchestrator: Downloads HTML from URL
    Orchestrator->>YTConverter: accepts(url)?
    YTConverter-->>Orchestrator: Yes, I know this URL format!
    Orchestrator->>YTConverter: convert(html_stream)
    YTConverter->>YTConverter: Extract Title & Description from HTML
    YTConverter->>YouTubeAPI: Request Transcript (Subtitles)
    YouTubeAPI-->>YTConverter: Return text
    YTConverter-->>Orchestrator: Returns formatted Markdown

The accepts() Method

Unlike local files where we check for .xlsx, here we check if the URL looks like a YouTube link.

Here is simplified logic from _youtube_converter.py:

def accepts(self, file_stream, stream_info, **kwargs):
    url = stream_info.url or ""
    
    # Check if the URL matches the YouTube pattern
    if "youtube.com/watch?" in url:
        return True
        
    return False
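A plain substring test is easy to read, but it also matches URLs that merely mention YouTube in a query string. As a hedged alternative, the check could parse the URL and compare its parts. The function name `is_youtube_watch_url` and the host list are assumptions made for this sketch, not part of MarkItDown.

```python
from urllib.parse import urlparse, parse_qs

# Assumption: the hostnames we want to treat as YouTube watch pages.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def is_youtube_watch_url(url: str) -> bool:
    """Return True only for watch-page URLs that carry a video id."""
    parts = urlparse(url or "")
    if parts.scheme not in ("http", "https"):
        return False
    if (parts.hostname or "").lower() not in YOUTUBE_HOSTS:
        return False
    # A watch page has the path /watch and a non-empty "v" query parameter.
    return parts.path == "/watch" and bool(parse_qs(parts.query).get("v"))


print(is_youtube_watch_url("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))
print(is_youtube_watch_url("https://example.com/?next=youtube.com/watch?v=x"))
```

The second URL contains the substring "youtube.com/watch?" yet points at a different host, which is the case the parsed check is meant to reject.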

The convert() Method

The conversion happens in two stages:

  1. HTML Parsing: It uses BeautifulSoup to read the title and description from the page source.
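  2. API Call: It uses a helper library to fetch the actual spoken words.

import bs4
from youtube_transcript_api import YouTubeTranscriptApi

def convert(self, file_stream, stream_info, **kwargs):
    # 1. Parse HTML for metadata (soup.title is None when the page has no <title>)
    soup = bs4.BeautifulSoup(file_stream, "html.parser")
    title = soup.title.string if soup.title else "YouTube"

    # 2. Fetch transcript (simplified). get_video_id is a helper that pulls
    #    the "v" query parameter out of the URL. The call below follows the
    #    original example; method names vary between library versions.
    video_id = get_video_id(stream_info.url)
    segments = YouTubeTranscriptApi.get_transcript(video_id)

    # 3. Join the caption segments into one block of text
    transcript = " ".join(segment["text"] for segment in segments)

    # 4. Format as Markdown
    md = f"# {title}\n\n### Transcript\n{transcript}"

    return DocumentConverterResult(markdown=md)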
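The simplified convert() above relies on a get_video_id helper that is not shown. One plausible way to write it, together with a small function that joins caption segments into readable text, is sketched here. Both functions are illustrative, and the segment shape (dicts with a "text" key) is an assumption about what a transcript library returns; check the version you have installed.

```python
from typing import Optional
from urllib.parse import urlparse, parse_qs


def get_video_id(url: str) -> Optional[str]:
    """Return the value of the "v" query parameter, or None if absent."""
    values = parse_qs(urlparse(url).query).get("v")
    return values[0] if values else None


def join_segments(segments) -> str:
    """Join caption segments (dicts with a "text" key) into one string."""
    return " ".join(seg["text"].strip() for seg in segments if seg["text"].strip())


# Hand-written sample data in the assumed segment shape (not real API output).
sample = [
    {"text": "We're no strangers to love", "start": 18.0, "duration": 3.0},
    {"text": "You know the rules and so do I", "start": 22.0, "duration": 4.0},
]

print(get_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ&t=10s"))
print(join_segments(sample))
```

Parsing the query string instead of slicing the URL keeps the helper working when extra parameters such as a timestamp follow the video id.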

Tutorial: Bing Search Results

Another specialized handler included in MarkItDown is the BingSerpConverter.

If you try to convert a search result page using a standard HTML converter, you get a lot of noise (navigation bars, footers, ads). The specialized Bing converter extracts only the search results.

The Usage

# A URL representing a Bing search for "Microsoft"
url = "https://www.bing.com/search?q=Microsoft"

result = md.convert(url)
print(result.text_content)

The Logic (_bing_serp_converter.py)

This converter specifically looks for the CSS classes that Bing uses for search results (like b_algo).

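from bs4 import BeautifulSoup

def convert(self, file_stream, stream_info, **kwargs):
    soup = BeautifulSoup(file_stream, "html.parser")

    results = []

    # Find only the search result blocks; everything else is ignored
    for item in soup.find_all(class_="b_algo"):
        # Convert just this block to Markdown
        # (_markdownify is a helper that turns one HTML fragment into Markdown)
        text = self._markdownify(item)
        results.append(text)

    # Separate results with a blank line so each renders as its own block
    return DocumentConverterResult(markdown="\n\n".join(results))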

By targeting specific HTML classes (b_algo), the converter acts like a smart filter, discarding the "junk" and keeping the data.
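To see the filtering effect in isolation, here is a small sketch that uses only Python's standard-library HTML parser rather than BeautifulSoup. The `ResultBlockExtractor` class and the sample markup are invented for illustration. The sketch assumes well-formed HTML in which every non-void tag is closed, which real search pages may not guarantee.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ResultBlockExtractor(HTMLParser):
    """Collect the text inside every element whose class list has a target."""

    def __init__(self, target_class: str):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # greater than 0 while inside a matching block
        self.blocks = []    # finished blocks, one string each
        self._parts = []    # text pieces of the block in progress

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.target_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.blocks.append(" ".join(self._parts))
            self._parts = []

    def handle_data(self, data):
        if self.depth and data.strip():
            self._parts.append(data.strip())


html = """
<nav>Images Videos Maps</nav>
<ol>
  <li class="b_algo"><h2><a href="https://example.com/a">First result</a></h2><p>Snippet one.</p></li>
  <li class="b_ad">Sponsored link</li>
  <li class="b_algo"><h2><a href="https://example.com/b">Second result</a></h2><p>Snippet two.</p></li>
</ol>
<footer>Privacy Terms</footer>
"""

extractor = ResultBlockExtractor("b_algo")
extractor.feed(html)
for block in extractor.blocks:
    print(block)
```

The navigation bar, the sponsored item, and the footer never reach the output because they sit outside any element carrying the target class.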

Writing Your Own Web Handler

You can write your own handlers for websites you use often (e.g., a documentation site or a news aggregator).

  1. Inherit from DocumentConverter.
  2. Implement accepts: Check stream_info.url for your specific domain.
  3. Implement convert: Use BeautifulSoup to find the specific <div> that holds the main content.
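The three steps above could look roughly like the skeleton below. To keep it runnable without MarkItDown installed, it defines its own stand-in `DocumentConverter`, `DocumentConverterResult`, and `StreamInfo` classes; in a real handler you would inherit from MarkItDown's own base class instead. `ExampleDocsConverter` and the docs.example.com domain are hypothetical, and the regex title lookup is a placeholder for the BeautifulSoup step that locates the main content.

```python
import io
import re
from urllib.parse import urlparse


class DocumentConverterResult:
    """Stand-in result type (assumption: it carries the Markdown text)."""

    def __init__(self, markdown: str):
        self.markdown = markdown


class DocumentConverter:
    """Stand-in base class exposing the two methods described above."""

    def accepts(self, file_stream, stream_info, **kwargs) -> bool:
        raise NotImplementedError

    def convert(self, file_stream, stream_info, **kwargs):
        raise NotImplementedError


class StreamInfo:
    """Stand-in carrying only the field this sketch reads."""

    def __init__(self, url=None):
        self.url = url


class ExampleDocsConverter(DocumentConverter):
    """Hypothetical handler for pages served from docs.example.com."""

    def accepts(self, file_stream, stream_info, **kwargs) -> bool:
        # Compare the parsed hostname instead of searching the whole URL.
        host = urlparse(stream_info.url or "").hostname or ""
        return host == "docs.example.com"

    def convert(self, file_stream, stream_info, **kwargs):
        html = file_stream.read().decode("utf-8", errors="replace")
        # Placeholder extraction: a real handler would use an HTML parser
        # to find the element that holds the main content.
        match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
        title = match.group(1).strip() if match else "Untitled"
        return DocumentConverterResult(markdown=f"# {title}\n\nSource: {stream_info.url}")


page = io.BytesIO(b"<html><head><title>Install Guide</title></head><body></body></html>")
info = StreamInfo(url="https://docs.example.com/install")
handler = ExampleDocsConverter()
if handler.accepts(page, info):
    print(handler.convert(page, info).markdown)
```

Keeping accepts() cheap and free of side effects is a sensible design choice here, since it may be called for every URL before any single converter is chosen.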

Summary

In this chapter, you learned:

  1. Field Reporters: MarkItDown has converters designed for specific URLs, not just file types.
  2. Smart Scrapers: Tools like YouTubeConverter and BingSerpConverter extract specific data (like transcripts) while ignoring noise (like ads).
  3. URL Routing: The accepts() method can check URLs to decide which converter to use.

We have now covered local files and specific web pages. But sometimes, a file is just an image, or a chart with no text. How do we extract information from pixels?

In the next chapter, we will introduce the most powerful feature of MarkItDown: AI Integration.

Next Chapter: AI & Intelligence Integration

