Welcome back!
In Chapter 1: Fetchers Interface, we learned how to drive the vehicle to fetch a web page. In Chapter 2: Adaptive Parser, we learned how to read the map to find the data we need.
Now, we need a Mission Plan.
If you only want to scrape one page, a simple script is fine. But what if you want to scrape an entire online store? You need to visit the home page, find product links, visit each product, extract data, and maybe go to the "Next Page" of results.
Doing this with while loops and manual lists gets messy very quickly.
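To make that concrete, here is a hand-rolled crawl in plain Python. The network is faked with a dictionary, and fake_site / fetch_links are made-up names for illustration. Even this toy version needs a queue, a visited set, and deduplication bookkeeping:

```python
# A sketch of manual crawling: the bookkeeping a Spider takes off your hands.
# fetch_links stands in for a real "fetch page, extract links" step.
fake_site = {
    "/": ["/page-1", "/page-2"],
    "/page-1": ["/page-2"],
    "/page-2": [],
}

def fetch_links(url):
    return fake_site[url]

queue = ["/"]     # URLs still to visit
visited = set()   # URLs already processed

while queue:
    url = queue.pop(0)
    if url in visited:
        continue  # skip pages we've already seen
    visited.add(url)
    for link in fetch_links(url):
        queue.append(link)

print(sorted(visited))
```

A Spider moves all of this bookkeeping into the framework, so your code only describes what to extract and where to go next.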
In this chapter, we introduce the Spider Definition. This is the blueprint that organizes your crawling logic into a neat, reusable package.
Imagine you are running a bakery. You don't improvise each cake from scratch; you follow a recipe that lists the ingredients and the steps, in order.
In Scrapling, the Spider class is that recipe. It tells the framework where to start, which pages to visit, and what to do with each page it gets back.
Spider Class
A Spider in Scrapling is just a Python class that inherits from Spider. It acts as the brain of your operation.
It usually consists of three main parts:
- name: A unique ID for your spider.
- start_urls: A list of URLs where the crawler should begin.
- parse(): A function that receives the Response and decides what to do with it.

Let's define a spider that visits a bookstore.
from scrapling.spiders import Spider, Response

class BookSpider(Spider):
    # 1. Identity
    name = "my_first_spider"

    # 2. Starting Point
    start_urls = ["https://books.toscrape.com/"]
What happened here?
We created a blueprint called BookSpider. We gave it a name and told it: "When you start the engine, go to this URL first." We haven't told it what to do with the page yet.
The parse() Method
When Scrapling fetches the URL in start_urls, it sends the downloaded page (the Response) to a method called parse.
We need to define this method to extract data.
# 3. The Action
async def parse(self, response: Response):
    # We use the selector we learned in Chapter 2!
    title = response.css("h1::text").get()

    # We "yield" the result (hand it over to the engine)
    yield {"page_title": title}
Why yield?
In standard Python, return stops a function. But a Spider might find many items on one page. yield allows the Spider to say "Here is one item," and keep working to find more.
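The difference is easy to see in plain Python, independent of Scrapling. A function that returns hands back one value and stops; a generator that yields can hand back many, one at a time:

```python
def parse_with_return(titles):
    # return ends the function at the first item
    return {"page_title": titles[0]}

def parse_with_yield(titles):
    # yield hands over one item, then resumes to find more
    for title in titles:
        yield {"page_title": title}

titles = ["Book A", "Book B", "Book C"]
print(parse_with_return(titles))       # one item only
print(list(parse_with_yield(titles)))  # all three items
```

This is exactly how the engine consumes a Spider's parse(): it iterates over whatever the method yields, whether that is one item or hundreds.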
A real spider moves across the web. To do this, we don't just yield data; we yield Requests.
async def parse(self, response: Response):
    # Find all book links
    for link in response.css("article.product_pod h3 a"):
        # Tell Scrapling: "Go to this link, then run parse_book"
        yield response.follow(link, callback=self.parse_book)

async def parse_book(self, response: Response):
    # This runs on the NEW page
    yield {"title": response.css("h1::text").get()}
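An aside on the links being followed: hrefs scraped from a page are usually relative. A follow-style helper typically resolves them against the current page's URL, much as urllib.parse.urljoin from the standard library does:

```python
from urllib.parse import urljoin

# The page we are currently parsing
page_url = "https://books.toscrape.com/catalogue/page-1.html"

# A relative href as it appears in the page's HTML
relative_href = "a-light-in-the-attic_1000/index.html"

# Resolve it against the page URL to get a fetchable absolute URL
absolute = urljoin(page_url, relative_href)
print(absolute)
```

This is why you can pass the link element straight to the helper without building absolute URLs by hand.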
What happened here?
- response.follow() turns a link (even a relative href) into a new Request for the engine to fetch.
- callback=self.parse_book tells the engine to run parse_book, not parse, on the page that comes back.

Now that we have written the blueprint, how do we turn on the machine?
We simply instantiate the class and call start().
# Create an instance of the spider
my_spider = BookSpider()

# Start the engine
result = my_spider.start()

# Check the results
print(f"Items scraped: {len(result.items)}")
Scrapling handles all the complexity in the background: creating the loop, managing network connections, and saving the data.
When you call start(), you aren't just running a function. You are booting up an entire Crawler Engine.
The Spider is passive; it just holds the logic. The Engine is active; it does the work.
Let's peek inside scrapling/spiders/spider.py to see how the Spider connects to the Engine.
When you initialize a Spider, it sets up a Session Manager (which manages the Fetchers we discussed in Chapter 1).
# Simplified from scrapling/spiders/spider.py
class Spider(ABC):
    def __init__(self):
        # Setup the manager that holds Fetchers
        self._session_manager = SessionManager()
        # Create a default fetcher session
        self.configure_sessions(self._session_manager)

    def start(self):
        # Create the engine, giving it 'self' (the blueprint)
        self._engine = CrawlerEngine(self, self._session_manager)
        # Run the crawl
        return anyio.run(self.__run)
The start_requests method is the spark plug. By default, it just reads your start_urls list, but you can override it if you need to do complex things like logging in before scraping.
# Simplified from scrapling/spiders/spider.py
async def start_requests(self):
    # Loop through the list of strings you provided
    for url in self.start_urls:
        # Turn them into Request objects
        yield Request(url)
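That default behavior can be mimicked in a few lines of plain Python: an async generator that turns URL strings into request objects. The Request dataclass below is a stand-in for illustration, not Scrapling's real class:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Request:
    url: str

start_urls = ["https://books.toscrape.com/"]

async def start_requests():
    # Turn each URL string into a Request object, one at a time
    for url in start_urls:
        yield Request(url)

async def collect():
    # The engine would consume these as they are yielded;
    # here we just gather the URLs into a list
    return [request.url async for request in start_requests()]

urls = asyncio.run(collect())
print(urls)
```

Overriding the real start_requests works the same way: yield whatever Request objects you need, for example a login request before the crawl proper.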
The Spider isn't just about parsing; it also allows you to hook into specific moments of the job.
- on_start: Runs before the first request. Great for opening database connections.
- on_close: Runs after the last request finishes. Great for saving reports or sending notifications.
- on_error: Runs if a request fails (e.g., website down).

async def on_start(self, resuming: bool):
    print("Starting the engines...")

async def on_close(self):
    print("Mission complete. Shutting down.")
In this chapter, you learned:
- start_urls: The entry points for the crawler.
- parse(): The function that processes the response and yields data or new requests.
- yield: How the spider communicates with the engine.

You now have a Spider that defines what to do. But who decides when to do it? Who ensures we don't hit the website too hard? Who handles the queues?
That is the job of the Orchestrator.
Next Chapter: Crawl Orchestration
Generated by Code IQ