Chapter 3: Spider Definition

Welcome back!

In Chapter 1: Fetchers Interface, we learned how to drive the vehicle to fetch a web page. In Chapter 2: Adaptive Parser, we learned how to read the map to find the data we need.

Now, we need a Mission Plan.

If you only want to scrape one page, a simple script is fine. But what if you want to scrape an entire online store? You need to visit the home page, find product links, visit each product, extract data, and maybe go to the "Next Page" of results.

Doing this with while loops and manual lists gets messy very quickly.

In this chapter, we introduce the Spider Definition. This is the blueprint that organizes your crawling logic into a neat, reusable package.

The Motivation: From Script to Blueprint

Imagine you are running a bakery. You don't improvise each loaf from memory; you follow a recipe that lists the ingredients and the steps.

In Scrapling, the Spider class is that recipe. It tells the framework:

  1. Where to start (The ingredients).
  2. What to do when a page arrives (The baking steps).
  3. Where to go next (The next batch).

The Core Concept: The Spider Class

A Spider in Scrapling is just a Python class that inherits from Spider. It acts as the brain of your operation.

It usually consists of three main parts:

  1. name: A unique ID for your spider.
  2. start_urls: A list of URLs where the crawler should begin.
  3. parse(): A function that receives the Response and decides what to do with it.

Step 1: The Basic Setup

Let's define a spider that visits a bookstore.

```python
from scrapling.spiders import Spider, Response

class BookSpider(Spider):
    # 1. Identity
    name = "my_first_spider"

    # 2. Starting Point
    start_urls = ["https://books.toscrape.com/"]
```

What happened here? We created a blueprint called BookSpider. We gave it a name and told it: "When you start the engine, go to this URL first." We haven't told it what to do with the page yet.

Step 2: The Logic (parse)

When Scrapling fetches the URL in start_urls, it sends the downloaded page (the Response) to a method called parse.

We need to define this method to extract data.

```python
    # 3. The Action
    async def parse(self, response: Response):
        # We use the selector we learned in Chapter 2!
        title = response.css("h1::text").get()

        # We "yield" the result (hand it over to the engine)
        yield {"page_title": title}
```

Why yield? In standard Python, return stops a function. But a Spider might find many items on one page. yield allows the Spider to say "Here is one item," and keep working to find more.
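The contrast is easy to see in plain Python, independent of Scrapling. This toy sketch compares a return-based parser with a yield-based one:

```python
def parse_with_return(items):
    # return hands back one value and stops the function immediately
    return {"page_title": items[0]}

def parse_with_yield(items):
    # yield hands over each item, then resumes where it left off
    for item in items:
        yield {"page_title": item}

books = ["Book A", "Book B", "Book C"]

print(parse_with_return(books))       # only the first book
print(list(parse_with_yield(books)))  # all three books
```

The yield-based version is a generator: the engine can consume items one at a time while the spider keeps working.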

Step 3: Following Links

A real spider moves across the web. To do this, we don't just yield data; we yield Requests.

```python
    async def parse(self, response: Response):
        # Find all book links
        for link in response.css("article.product_pod h3 a"):
            # Tell Scrapling: "Go to this link, then run parse_book"
            yield response.follow(link, callback=self.parse_book)

    async def parse_book(self, response: Response):
        # This runs on the NEW page
        yield {"title": response.css("h1::text").get()}
```

What happened here?

  1. The Spider lands on the homepage.
  2. It finds links to books.
  3. Instead of data, it hands Scrapling a "To-Do" note (response.follow).
  4. It says: "Fetch this link, and when you are done, send the result to parse_book."

Use Case: Running the Spider

Now that we have written the blueprint, how do we turn on the machine?

We simply instantiate the class and call start().

```python
# Create an instance of the spider
my_spider = BookSpider()

# Start the engine
result = my_spider.start()

# Check the results
print(f"Items scraped: {len(result.items)}")
```

Scrapling handles all the complexity in the background: creating the loop, managing network connections, and saving the data.

Under the Hood

When you call start(), you aren't just running a function. You are booting up an entire Crawler Engine.

The Spider is passive; it just holds the logic. The Engine is active; it does the work.

```mermaid
sequenceDiagram
    participant Engine as Crawler Engine
    participant Spider as Your Spider Code
    participant Web as Internet
    Engine->>Spider: Get start_urls
    Spider-->>Engine: Returns "http://books.toscrape.com"
    loop Every Request
        Engine->>Web: Fetch URL
        Web-->>Engine: Return HTML
        Engine->>Spider: Call parse(Response)
        Spider-->>Engine: Yield Data OR New Links
        opt If New Links
            Engine->>Engine: Add to To-Do List (Scheduler)
        end
    end
```
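This loop can be modeled in a few lines of plain Python. Everything below (the fake page table, the `("request", ...)` tuples, the `parse` callback) is a toy stand-in for illustration, not Scrapling's real engine:

```python
from collections import deque

# Fake "internet": URL -> list of links found on that page
PAGES = {
    "https://example.com/": ["https://example.com/book-1", "https://example.com/book-2"],
    "https://example.com/book-1": [],
    "https://example.com/book-2": [],
}

def parse(url, links):
    # A spider callback: yield new links as requests, leaf pages as items
    if links:
        for link in links:
            yield ("request", link)
    else:
        yield ("item", {"title": url.rsplit("/", 1)[-1]})

def run(start_urls):
    todo, items = deque(start_urls), []      # the Scheduler's to-do list
    while todo:
        url = todo.popleft()                 # Engine: fetch the next URL
        for kind, value in parse(url, PAGES[url]):
            if kind == "request":
                todo.append(value)           # new link -> back on the to-do list
            else:
                items.append(value)          # data -> collected results
    return items

print(run(["https://example.com/"]))
```

The key idea: yielded requests go back onto the to-do list, yielded items go into the results, and the loop runs until the list is empty.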

Internal Code Logic

Let's peek inside scrapling/spiders/spider.py to see how the Spider connects to the Engine.

When you initialize a Spider, it sets up a Session Manager (which manages the Fetchers we discussed in Chapter 1).

```python
# Simplified from scrapling/spiders/spider.py

class Spider(ABC):
    def __init__(self):
        # Setup the manager that holds Fetchers
        self._session_manager = SessionManager()
        # Create a default fetcher session
        self.configure_sessions(self._session_manager)

    def start(self):
        # Create the engine, giving it 'self' (the blueprint)
        self._engine = CrawlerEngine(self, self._session_manager)
        # Run the crawl
        return anyio.run(self.__run)
```

The start_requests method is the spark plug. By default, it just reads your start_urls list, but you can override it if you need to do complex things like logging in before scraping.

```python
# Simplified from scrapling/spiders/spider.py

async def start_requests(self):
    # Loop through the list of strings you provided
    for url in self.start_urls:
        # Turn them into Request objects
        yield Request(url)
```
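A common reason to override it is authenticating before the crawl begins. The sketch below shows the generator pattern only; `Request` here is a toy dataclass stand-in, and the `headers` argument is an assumption, so check Scrapling's real `Request` signature before relying on it:

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class Request:                      # toy stand-in for Scrapling's Request
    url: str
    headers: dict = field(default_factory=dict)

class LoginFirstSpider:
    start_urls = ["https://example.com/a", "https://example.com/b"]

    async def start_requests(self):
        token = await self.login()  # do the "complex thing" first
        for url in self.start_urls:
            yield Request(url, headers={"Authorization": f"Bearer {token}"})

    async def login(self):
        return "fake-token"         # stand-in for a real login call

async def main():
    spider = LoginFirstSpider()
    return [req async for req in spider.start_requests()]

requests = asyncio.run(main())
print([r.url for r in requests])
```

Because `start_requests` is an async generator, the engine can begin fetching the first request before the rest are even created.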

Lifecycle Hooks

The Spider isn't just about parsing; it also allows you to hook into specific moments of the job.

  1. on_start: Runs before the first request. Great for opening database connections.
  2. on_close: Runs after the last request finishes. Great for saving reports or sending notifications.
  3. on_error: Runs if a request fails (e.g., the website is down).

```python
    async def on_start(self, resuming: bool):
        print("Starting the engines...")

    async def on_close(self):
        print("Mission complete. Shutting down.")
```
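An error hook might look like the sketch below. The `(request, error)` signature is an assumption for illustration, so verify it against Scrapling's source before use; the toy harness simply stands in for the moment the engine would invoke the hook:

```python
import asyncio

class ResilientSpider:
    def __init__(self):
        self.failed = []

    # Assumed signature: the failing request and the exception raised
    async def on_error(self, request, error):
        self.failed.append((request, str(error)))
        print(f"Request {request} failed: {error}")

# Toy harness standing in for the engine's error handling
async def main():
    spider = ResilientSpider()
    try:
        raise ConnectionError("website down")
    except ConnectionError as exc:
        await spider.on_error("https://example.com/", exc)
    return spider.failed

failed = asyncio.run(main())
```

Recording failures like this lets on_close report which URLs need a retry.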

Summary

In this chapter, you learned:

  1. The Spider Class: The blueprint that organizes your scraping logic.
  2. start_urls: The entry points for the crawler.
  3. parse(): The function that processes the response and yields data or new requests.
  4. yield: How the spider communicates with the engine.

You now have a Spider that defines what to do. But who decides when to do it? Who ensures we don't hit the website too hard? Who handles the queues?

That is the job of the Orchestrator.

Next Chapter: Crawl Orchestration


Generated by Code IQ