Chapter 4: Crawl Orchestration

Welcome back!

In Chapter 3: Spider Definition, we created the Mission Plan. We defined a Spider that knows where to go and what to extract.

But having a plan isn't enough. Imagine you want to scrape 10,000 pages: something has to decide how many download at once, how long to wait between them, and what to do when one fails.

In this chapter, we meet the Crawler Engine, the "Traffic Controller" of Scrapling. While the Spider is the employee doing the work, the Engine is the manager ensuring everything runs smoothly, safely, and efficiently.

The Motivation: Managing Chaos

Imagine a busy intersection. Without traffic lights or a policeman, cars would crash, block each other, or get stuck.

In web scraping, the "Traffic" consists of:

  1. Requests: URLs waiting to be visited.
  2. Concurrency: How many pages we download at the same time.
  3. Errors: Websites blocking us or timing out.

The Crawler Engine handles this chaos for you. It ensures you don't accidentally attack a website (Denial of Service) and that you retry failed attempts automatically.

Key Concepts

The Engine works silently in the background, but you can control it using three main levers.

1. Concurrency (Speed Control)

You can tell the Engine: "I want 5 workers downloading pages at the same time."

2. Politeness (Download Delay)

You can tell the Engine: "Wait 2 seconds between requests." This makes your bot look more like a human browsing slowly and less like a machine gun.

3. Checkpoints (Save Game)

The Engine can save its progress to a file. If you stop the script or it crashes, you can run it again, and the Engine will say: "Ah, I remember where we were," and resume exactly from that spot.

Controlling the Engine

You don't write code for the Engine directly. Instead, you configure it inside your Spider Definition.

Setting Speed and Delay

Let's modify our Spider to be polite but efficient.

from scrapling import Spider

class PoliteSpider(Spider):
    name = "gentle_bot"
    start_urls = ["https://books.toscrape.com/"]
    
    # 1. Concurrency: Only 2 pages at a time
    concurrent_requests = 2
    
    # 2. Politeness: Wait 1 second between requests
    download_delay = 1.0

What happens here? When you start this spider, the Engine reads these variables. It creates a "capacity limiter" allowing only 2 active connections. Even if you have 1,000 URLs in the queue, it will only process 2 at once, with each worker pausing 1 second before it fetches.
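That "only 2 at once" behavior can be sketched with a plain stdlib semaphore. This is a stand-in for the Engine's internal limiter, not Scrapling code; the URLs and the shortened delay are illustrative:

```python
import asyncio

CONCURRENT_REQUESTS = 2  # mirrors the spider's concurrent_requests setting
DOWNLOAD_DELAY = 0.01    # shortened stand-in for download_delay

async def fetch(url: str, limiter: asyncio.Semaphore, log: list) -> None:
    # Only CONCURRENT_REQUESTS tasks may hold the semaphore at once
    async with limiter:
        log.append("start " + url)
        await asyncio.sleep(DOWNLOAD_DELAY)  # pretend to download the page
        log.append("done " + url)

async def main() -> list:
    limiter = asyncio.Semaphore(CONCURRENT_REQUESTS)
    log: list = []
    urls = [f"https://example.com/page/{i}" for i in range(6)]
    # All 6 tasks are scheduled, but the semaphore gates them 2 at a time
    await asyncio.gather(*(fetch(u, limiter, log) for u in urls))
    return log

log = asyncio.run(main())
```

Replaying `log` shows that no more than two fetches are ever in flight at the same moment, which is the guarantee the Engine's limiter gives you.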

Enabling Checkpoints (The Save System)

To enable the "Save Game" feature, you pass a folder path when you start the spider.

spider = PoliteSpider()

# 3. Checkpoints: Save progress to 'my_crawl_state' folder
stats = spider.start(crawldir="./my_crawl_state")

What happens here? The Engine creates a folder called ./my_crawl_state. Every few minutes, it saves a snapshot of the "To-Do List" (Scheduler).
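Scrapling's actual on-disk format is internal to the library, but the idea of snapshotting the To-Do List can be sketched with a toy JSON checkpoint. The file name `scheduler.json` and the schema here are made up for illustration:

```python
import json
import tempfile
from pathlib import Path

def save_checkpoint(crawldir: str, pending: list, visited: list) -> None:
    # Persist the scheduler's To-Do List so a crash doesn't lose progress
    state_dir = Path(crawldir)
    state_dir.mkdir(parents=True, exist_ok=True)
    snapshot = {"pending": pending, "visited": visited}
    (state_dir / "scheduler.json").write_text(json.dumps(snapshot))

def load_checkpoint(crawldir: str) -> dict:
    # On restart, resume from the snapshot if one exists
    state_file = Path(crawldir) / "scheduler.json"
    if state_file.exists():
        return json.loads(state_file.read_text())
    return {"pending": [], "visited": []}

crawldir = tempfile.mkdtemp()  # stands in for "./my_crawl_state"
save_checkpoint(
    crawldir,
    pending=["https://books.toscrape.com/catalogue/page-2.html"],
    visited=["https://books.toscrape.com/"],
)
state = load_checkpoint(crawldir)  # "Ah, I remember where we were"
```

On a fresh run the loader simply returns an empty state, so the crawl starts from the beginning; with a snapshot present, only the pending URLs are revisited.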

Under the Hood

How does the Engine juggle all these tasks? It runs an Event Loop.

Think of the Engine as a chef in a kitchen.

  1. Scheduler (Order Ticket): Holds the list of URLs (recipes) to cook.
  2. Fetcher (Stove): Actually cooks the food (downloads the page).
  3. Spider (Plating): Arranges the food (parses the data).

The Engine is the Chef shouting: "Next order! Stove 1 is free! Stove 2 is busy!"

sequenceDiagram
    participant Scheduler as To-Do List
    participant Engine as Crawler Engine
    participant Fetcher as Fetcher Session
    participant Spider as Spider Logic

    loop The Event Loop
        Engine->>Scheduler: Get next Request
        Scheduler-->>Engine: Returns URL
        Engine->>Engine: Check Concurrency Limit
        Engine->>Fetcher: Fetch(URL)
        Fetcher-->>Engine: Return Response (HTML)
        Engine->>Spider: parse(Response)
        Spider-->>Engine: Yield New Items & Links
        Engine->>Scheduler: Add New Links to List
    end

Internal Code Logic

Let's look at the crawl method inside scrapling/spiders/engine.py. This is the heartbeat of the framework.

It uses an asynchronous TaskGroup (from the anyio library) to run tasks in parallel.

# Simplified logic from scrapling/spiders/engine.py
import anyio
from anyio import create_task_group

async def crawl(self):
    self._running = True
    
    # Create a group to manage parallel tasks
    async with create_task_group() as tg:
        while self._running:
            
            # 1. Check if we are doing too much work at once
            if self._active_tasks >= self.spider.concurrent_requests:
                await anyio.sleep(0.01)
                continue

            # 2. Get the next job; stop when the queue is drained
            request = await self.scheduler.dequeue()
            if request is None:
                break
            
            # 3. Start a background task for this request
            self._active_tasks += 1
            tg.start_soon(self._process_request, request)

The Processor

The _process_request function is what runs inside those parallel tasks. It handles the lifecycle of a single URL.

# Simplified logic from scrapling/spiders/engine.py

async def _process_request(self, request):
    # Respect the politeness delay
    await anyio.sleep(self.spider.download_delay)

    try:
        # 1. Fetch the page (See Chapter 1)
        response = await self.session_manager.fetch(request)
        
        # 2. Check if we got banned
        if await self.spider.is_blocked(response):
            # Schedule a retry later
            return await self.retry_blocked_request(request)

        # 3. Send to Spider for parsing (See Chapter 3)
        async for result in self.spider.parse(response):
            self.handle_result(result)

    finally:
        self._active_tasks -= 1

The Blocked Request Handler

One of the Engine's smartest features is handling blocks. If a website sends a "403 Forbidden" or a Captcha, the Engine doesn't just crash.

It checks spider.is_blocked(response). If true, it:

  1. Takes the request.
  2. Lowers its priority (puts it back in the queue, but not at the front).
  3. Increments a retry counter.
  4. Tries again later, hoping the block was temporary or that a different session/proxy will work.
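Those four steps can be sketched as a small re-queue helper. The dict-based request, the priority scheme, and the MAX_RETRIES cap are all illustrative here, not Scrapling's real internals:

```python
MAX_RETRIES = 3  # illustrative cap, not necessarily Scrapling's default

def retry_blocked_request(request: dict, queue: list) -> bool:
    """Re-queue a blocked request at lower priority; give up after MAX_RETRIES."""
    # Step 3: increment the retry counter
    request["retries"] = request.get("retries", 0) + 1
    if request["retries"] > MAX_RETRIES:
        return False  # the block is probably not temporary; drop the request
    # Step 2: lower its priority so fresh requests go first
    request["priority"] = request.get("priority", 0) - 1
    # Step 4: back into the queue, highest priority first
    queue.append(request)
    queue.sort(key=lambda r: r["priority"], reverse=True)
    return True

queue = [{"url": "https://example.com/fresh", "priority": 0}]
blocked = {"url": "https://example.com/blocked", "priority": 0}
retry_blocked_request(blocked, queue)  # the blocked page goes to the back
```

After the call, the fresh request still sits ahead of the retried one; once the counter passes the cap, the helper drops the request instead of looping on a permanent block forever.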

Summary

In this chapter, you learned about Crawl Orchestration:

  1. Crawler Engine: The manager that controls the flow of the crawl.
  2. Concurrency: How to speed up (or slow down) your spider using concurrent_requests.
  3. Checkpoints: How to use crawldir to pause, resume, and save your progress.
  4. Event Loop: The internal cycle of fetching, parsing, and scheduling.

You now have a fully functional scraping system. But what happens when the website is too complex? What if you need to extract data that requires understanding context, or the HTML is too messy for selectors?

In the next chapter, we will bring in the heavy artillery: Artificial Intelligence.

Next Chapter: AI Integration (MCP)


Generated by Code IQ