Welcome back!
In Chapter 3: Spider Definition, we created the Mission Plan. We defined a Spider that knows where to go and what to extract.
But having a plan isn't enough. Imagine you want to scrape 10,000 pages: someone has to decide how many to fetch at once, how long to wait between requests, and what to do when a request fails.
In this chapter, we meet the Crawler Engine, the "Traffic Controller" of Scrapling. While the Spider is the employee doing the work, the Engine is the manager ensuring everything runs smoothly, safely, and efficiently.
Imagine a busy intersection. Without traffic lights or a policeman, cars would crash, block each other, or get stuck.
In web scraping, the "traffic" consists of the requests you send out, the responses coming back, failed attempts waiting to be retried, and the new links each page adds to the queue.
The Crawler Engine handles this chaos for you. It ensures you don't accidentally attack a website (Denial of Service) and that you retry failed attempts automatically.
The Engine works silently in the background, but you can control it using three main levers.
1. Concurrency: You can tell the Engine, "I want 5 workers downloading pages at the same time."
2. Politeness (download delay): You can tell the Engine, "Wait 2 seconds between requests." This makes your bot look more like a human browsing slowly and less like a machine gun.
3. Checkpoints ("Save Game"): The Engine can save its progress to a file. If you stop the script or it crashes, you can run it again, and the Engine will say, "Ah, I remember where we were," and resume exactly from that spot.
You don't write code for the Engine directly. Instead, you configure it inside your Spider Definition.
Let's modify our Spider to be polite but efficient.
```python
from scrapling import Spider

class PoliteSpider(Spider):
    name = "gentle_bot"
    start_urls = ["https://books.toscrape.com/"]

    # 1. Concurrency: Only 2 pages at a time
    concurrent_requests = 2

    # 2. Politeness: Wait 1 second between requests
    download_delay = 1.0
```
What happens here? When you start this spider, the Engine reads these variables. It creates a "capacity limiter" allowing only 2 active connections. Even if you have 1,000 URLs in the queue, it will only process 2 at once, waiting 1 second between them.
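The "capacity limiter" idea can be sketched in a few lines with the standard library. This is an illustrative sketch, not Scrapling's implementation (Scrapling uses anyio internally, and the names below are hypothetical), but the principle is identical: a semaphore lets only N tasks through at once, no matter how many are queued.

```python
import asyncio

# A minimal sketch of the "capacity limiter" idea using asyncio.Semaphore.
# (Scrapling uses anyio internally; the principle is identical.)
CONCURRENT_REQUESTS = 2

active = 0   # how many fetches are running right now
peak = 0     # the highest value `active` ever reached

async def fetch(url: str, limiter: asyncio.Semaphore) -> None:
    global active, peak
    async with limiter:              # at most 2 tasks pass this point at once
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)    # stands in for network I/O + delay
        active -= 1

async def main() -> int:
    limiter = asyncio.Semaphore(CONCURRENT_REQUESTS)
    urls = [f"https://example.com/page/{i}" for i in range(6)]
    await asyncio.gather(*(fetch(u, limiter) for u in urls))
    return peak

peak_seen = asyncio.run(main())
print(peak_seen)  # never exceeds 2
```

Even with 6 URLs in flight, the counter never rises above 2, which is exactly what the Engine guarantees with `concurrent_requests = 2`.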
To enable the "Save Game" feature, you pass a folder path when you start the spider.
```python
spider = PoliteSpider()

# 3. Checkpoints: Save progress to 'my_crawl_state' folder
stats = spider.start(crawldir="./my_crawl_state")
```
What happens here?
The Engine creates a folder called ./my_crawl_state. Every few minutes, it saves a snapshot of the "To-Do List" (Scheduler).
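The snapshot idea itself is simple. Scrapling's real on-disk format is internal to the engine, so the functions and file name below are hypothetical, but they show the principle: persist the pending queue and the set of visited URLs, then restore them on the next run.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sketch of the checkpoint idea. Scrapling's actual on-disk
# format is internal; this only illustrates snapshotting the "To-Do List".
def save_checkpoint(crawldir: Path, pending: list, seen: set) -> None:
    crawldir.mkdir(parents=True, exist_ok=True)
    state = {"pending": pending, "seen": sorted(seen)}
    (crawldir / "state.json").write_text(json.dumps(state))

def load_checkpoint(crawldir: Path):
    path = crawldir / "state.json"
    if not path.exists():
        return None  # fresh crawl: the engine would fall back to start_urls
    state = json.loads(path.read_text())
    return state["pending"], set(state["seen"])

crawldir = Path(tempfile.mkdtemp()) / "my_crawl_state"
save_checkpoint(
    crawldir,
    pending=["https://books.toscrape.com/catalogue/page-3.html"],
    seen={"https://books.toscrape.com/"},
)
pending, seen = load_checkpoint(crawldir)
print(pending, len(seen))
```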
- If you press Ctrl+C to stop, it saves immediately.
- If you run it again with the same crawldir, it skips the start URLs and loads the saved list.

How does the Engine juggle all these tasks? It runs an Event Loop.
Think of the Engine as a chef in a kitchen: the orders are the URLs waiting in the queue, and the stoves are your concurrent workers. The Engine is the Chef shouting: "Next order! Stove 1 is free! Stove 2 is busy!"
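The chef's juggling act can be sketched with Python's event loop. This is an illustrative sketch using asyncio (Scrapling uses anyio, but cooperative switching works the same way): each `await` is a task saying "I'm waiting," which lets the chef move to the next order.

```python
import asyncio

# The chef analogy in code: one event loop (the chef) switches between tasks
# (orders) whenever a task is waiting.
async def cook(order: str, events: list) -> None:
    events.append(f"{order} on stove")
    await asyncio.sleep(0)       # "waiting on the stove": yield to the loop
    events.append(f"{order} plated")

async def kitchen() -> list:
    events = []
    await asyncio.gather(cook("order-1", events), cook("order-2", events))
    return events

events = asyncio.run(kitchen())
print(events)
```

Both orders hit the stove before either is plated: the single-threaded loop interleaves them instead of finishing one before starting the next.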
Let's look at the crawl method inside scrapling/spiders/engine.py. This is the heartbeat of the framework.
It uses an asynchronous TaskGroup (from the anyio library) to run tasks in parallel.
```python
# Simplified logic from scrapling/spiders/engine.py
import anyio
from anyio import create_task_group

async def crawl(self):
    self._running = True
    # Create a group to manage parallel tasks
    async with create_task_group() as tg:
        while self._running:
            # 1. Check if we are doing too much work at once
            if self._active_tasks >= self.spider.concurrent_requests:
                await anyio.sleep(0.01)
                continue

            # 2. Get the next job
            request = await self.scheduler.dequeue()

            # 3. Start a background task for this request
            self._active_tasks += 1
            tg.start_soon(self._process_request, request)
```
The Processor
The _process_request function is what runs inside those parallel tasks. It handles the specific lifecycle of one URL.
```python
# Simplified logic from scrapling/spiders/engine.py
import anyio

async def _process_request(self, request):
    # Respect the politeness delay
    await anyio.sleep(self.spider.download_delay)
    try:
        # 1. Fetch the page (See Chapter 1)
        response = await self.session_manager.fetch(request)

        # 2. Check if we got banned
        if await self.spider.is_blocked(response):
            # Schedule a retry later
            return await self.retry_blocked_request(request)

        # 3. Send to Spider for parsing (See Chapter 3)
        async for result in self.spider.parse(response):
            self.handle_result(result)
    finally:
        # Free the slot whether the request succeeded or failed
        self._active_tasks -= 1
```
One of the Engine's smartest features is handling blocks. If a website sends a "403 Forbidden" or a Captcha, the Engine doesn't just crash.
It checks spider.is_blocked(response). If that returns True, it schedules the request for a later retry (via retry_blocked_request) instead of dropping the URL or crashing the crawl.
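The retry idea can be sketched as a simple loop. Scrapling's actual retry policy lives inside the engine, so the function names and backoff schedule below are hypothetical; the sketch just shows the pattern of re-attempting a blocked request with an increasing delay instead of giving up.

```python
# Hypothetical sketch of "retry with exponential backoff" for blocked requests.
# Scrapling's real retry policy is internal to the engine; names are illustrative.
def crawl_with_retries(fetch, url, max_retries=3, base=2.0):
    for attempt in range(max_retries + 1):
        response = fetch(url)
        if response != "blocked":
            return response
        delay = base ** (attempt + 1)  # 2s, 4s, 8s... (a real engine would sleep here)
    return None  # gave up: every attempt was blocked

# Fake fetcher standing in for the session manager: blocked twice, then succeeds
calls = {"n": 0}
def fake_fetch(url):
    calls["n"] += 1
    return "blocked" if calls["n"] < 3 else "<html>ok</html>"

result = crawl_with_retries(fake_fetch, "https://books.toscrape.com/")
print(result)
```

Here the first two attempts come back blocked, and the third gets through, so the page is still scraped rather than lost.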
In this chapter, you learned about Crawl Orchestration:
- How to control parallel downloads with concurrent_requests.
- How to stay polite with download_delay.
- How to use crawldir to pause, resume, and save your progress.

You now have a fully functional scraping system. But what happens when the website is too complex? What if you need to extract data that requires understanding context, or the HTML is too messy for selectors?
In the next chapter, we will bring in the heavy artillery: Artificial Intelligence.
Next Chapter: AI Integration (MCP)
Generated by Code IQ