Chapter 5 · ADVANCED

AI Integration (MCP)



Welcome back!

In Chapter 4: Crawl Orchestration, we built a powerful engine to manage thousands of scrapers like a factory line.

But sometimes, you don't need a factory. Sometimes, you have a very smart "Brain" (an Artificial Intelligence like Claude, GPT-4, or Llama) that needs "Eyes" to see the world.

Standard AI models are like encyclopedias locked in a room without internet access. They know everything about the past but nothing about what is happening on a website right now.

In this chapter, we introduce AI Integration via MCP. This acts as a bridge, allowing an AI agent to "drive" Scrapling's vehicles to fetch and read web pages automatically.

The Motivation: Giving the AI "Eyes"

Imagine you are chatting with an AI assistant and you ask: "Can you summarize the latest article on this news site?"

Without Scrapling, the AI says: "I cannot browse the live web."

With Scrapling's MCP Integration, the flow changes:

  1. You: "Check this URL."
  2. AI: "I need to see that page. Scrapling, can you fetch it for me?"
  3. Scrapling: "Sure! Here is the text content (converted to Markdown)."
  4. AI: "Thanks! Here is the summary..."

We don't need to write a scraper for every site. We just give the AI the Tool to fetch data itself.

The Concept: What is MCP?

MCP stands for Model Context Protocol, an open standard for connecting AI applications to external tools and data. Think of it as a "Universal USB Cable" for AI.

The Toolset

When we turn on this integration, Scrapling exposes its internal Fetchers (which we learned about in Chapter 1: Fetchers Interface) as tools the AI can call.

The AI gets access to three main capabilities:

  1. get: The Motorbike. Fast, for simple HTML pages.
  2. fetch: The Van. Opens a browser for dynamic sites.
  3. stealthy_fetch: The Spy Car. Solves Cloudflare/Captchas automatically.
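The mapping above can be sketched as a simple table. Note that only StealthyFetcher appears later in this chapter; the other fetcher class names here are illustrative assumptions, not confirmed Scrapling internals:

```python
# Conceptual map of MCP tool names to the fetchers behind them.
# Class names other than StealthyFetcher are assumptions for illustration.
TOOLS = {
    "get": "Fetcher",                     # the Motorbike: plain, fast HTTP
    "fetch": "DynamicFetcher",            # the Van: full browser rendering
    "stealthy_fetch": "StealthyFetcher",  # the Spy Car: anti-bot evasion
}

for name, fetcher in TOOLS.items():
    print(f"{name} -> {fetcher}")
```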

Crucially, Scrapling doesn't just hand the AI raw HTML (which is messy). It automatically converts the page to Markdown, a format LLMs parse far more reliably and cheaply.

How to Use It

You don't typically write Python scripts to use the MCP server. Instead, you run the server, and your AI Client connects to it.

However, to see how easy it is to start, here is how you launch it within Python:

from scrapling import ScraplingMCPServer

# Initialize the server wrapper
server = ScraplingMCPServer()

# Start the server over Standard IO (stdio),
# which lets AI apps on this machine talk to it via the command line
if __name__ == "__main__":
    server.serve(http=False, host="localhost", port=8000)

What happens here? The script starts and waits silently. It prints nothing because it is listening on standard input/output for messages from an AI client running on your computer.

Conceptual Workflow

If you were to connect this to an AI Agent (like Claude Desktop), you would simply tell the AI:

"Use the stealthy_fetch tool to read https://example.com."

The AI would automatically format a JSON request, send it to Scrapling, wait for the Markdown, and then read it.
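Under the hood of that conversation, MCP messages are JSON-RPC 2.0. A tool call for the request above might look roughly like this. This is a sketch of the protocol shape, not Scrapling-specific code, and the `id` value is made up:

```python
import json

# Hedged sketch of an MCP "tools/call" request (MCP uses JSON-RPC 2.0)
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "stealthy_fetch",
        "arguments": {"url": "https://example.com"},
    },
}
print(json.dumps(request, indent=2))
```

The AI client builds this message for you; you never type it by hand.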

Under the Hood

How does Scrapling translate a robot request into a browser action?

It uses a translation layer. The AI sends a text request (JSON), Scrapling interprets it, picks the right Fetcher, and cleans up the result.

sequenceDiagram
    participant User
    participant AI as AI Agent
    participant MCP as Scrapling MCP
    participant Fetcher as StealthyFetcher
    participant Web as Internet
    User->>AI: "Read this protected site"
    AI->>MCP: Call tool: stealthy_fetch(url)
    MCP->>Fetcher: Launch Browser (Headless)
    Fetcher->>Web: Request Page & Solve Captcha
    Web-->>Fetcher: Return HTML
    Fetcher-->>MCP: Return Raw Response
    MCP->>MCP: Convert HTML to Markdown
    MCP-->>AI: Return Clean Text
    AI-->>User: "Here is what the site says..."

Internal Code Logic

Let's look inside scrapling/core/ai.py to see how the server is defined. It acts as a wrapper around the Scrapling library.

First, it defines the structure of the data it returns to the AI (ResponseModel). The AI needs to know the status (Did it work?) and the content.

# Simplified from scrapling/core/ai.py
from pydantic import BaseModel

class ResponseModel(BaseModel):
    """What we send back to the AI"""
    status: int
    content: list[str]  # The page text or markdown
    url: str

Next, it registers the functions. Here is how the stealthy_fetch tool is exposed. Notice how it takes the complex arguments we learned in Chapter 1 and exposes them to the AI.

# Simplified from scrapling/core/ai.py

class ScraplingMCPServer:
    @staticmethod
    async def stealthy_fetch(url: str, extraction_type="markdown", **kwargs):
        """
        AI Tool: Fetches high-security pages.
        """
        # 1. Call the Spy Car (StealthyFetcher)
        page = await StealthyFetcher.async_fetch(url, **kwargs)
        
        # 2. Convert HTML to friendly Markdown
        # This uses the Adaptive Parser logic internally
        content = Convertor.extract(page, type=extraction_type)
        
        # 3. Return structured data
        return ResponseModel(status=page.status, content=content, url=page.url)

Finally, the serve method packages these functions using the FastMCP library, which handles the communication protocol.

    def serve(self, http: bool, host: str, port: int):
        # Create the MCP application
        server = FastMCP(name="Scrapling")
        
        # Add the tools so the AI knows they exist
        server.add_tool(self.get)
        server.add_tool(self.fetch)
        server.add_tool(self.stealthy_fetch)
        
        # Start listening (simplified here: the real method selects
        # the stdio or HTTP transport based on the `http` flag)
        server.run()

Why Markdown?

You might wonder why we convert the HTML to Markdown before giving it to the AI.

  1. Token Efficiency: HTML is full of <div>, <span>, and class="xyz". This wastes the AI's memory (context window). Markdown is concise.
  2. Readability: AI models are trained heavily on text. Markdown represents the structure of the document (Headers, Lists, Links) without the code clutter.
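You can see the savings with a tiny comparison. The two snippets below are invented for illustration and represent the same heading and link, first as typical HTML, then as Markdown:

```python
# The same content as cluttered HTML versus clean Markdown
html = (
    '<div class="post"><h1 class="title">Hello</h1>'
    '<ul class="nav"><li><a href="/about" class="link">About</a></li></ul></div>'
)
markdown = "# Hello\n\n- [About](/about)\n"

print(len(html), "characters of HTML")
print(len(markdown), "characters of Markdown")
```

Every character saved is context-window room the AI can spend on the actual content.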

Summary

In this chapter, you learned:

  1. MCP (Model Context Protocol): The standard way to connect AI agents to external tools.
  2. AI Tools: How Scrapling exposes get, fetch, and stealthy_fetch as capabilities for the AI.
  3. Automatic Conversion: How Scrapling translates complex HTML into clean Markdown for the AI to read.

You have now mastered the art of fetching data, parsing it, automating it, and even connecting it to Artificial Intelligence.

However, all these fetchers rely on a browser engine to work. In the final chapter, we will take a deeper look at the engine itselfβ€”the session that keeps track of your cookies, headers, and identity.

Next Chapter: Browser Session Engine


Generated by Code IQ