
Chapter 2: Adaptive Parser

Welcome back! In the previous chapter, Fetchers Interface, we learned how to drive our vehicles (Fetchers) to the target website and download the raw HTML.

Now that we have the page, we face a new problem: Extraction.

Standard web scraping is brittle. You usually tell your scraper: "Go to the 3rd div, inside the 2nd table, and get the text." But what happens when the website owner decides to add a banner or change the layout? Your scraper breaks.

In this chapter, we introduce the Adaptive Parser, the brain of Scrapling that makes your code resilient to change.

The Motivation: The Brittle Scraper

Imagine you are trying to find a specific book in a library. A brittle scraper is handed directions like "third shelf, second row": if the librarian rearranges the shelves, the directions fail even though the book is still there. What you actually want is to recognize the book itself, by its title, cover, and author, no matter where it was moved.

The Concept: The Smart Bloodhound

The core of Scrapling's parsing engine is the Selector class. When you enable Adaptive Mode, it acts like a bloodhound.

Instead of just memorizing the address (CSS or XPath selector) of an element, it memorizes its scent:

  1. Text Content: "Buy Now"
  2. Attributes: class="btn-primary", id="submit"
  3. Hierarchy: It is inside a div, which is inside a form.
  4. Neighbors: It is right after the "Price" tag.

If the specific address changes (e.g., the ID changes from submit to purchase), Scrapling sniffs out the element that looks most similar to the original one.
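To make the idea of a "scent" concrete, here is a sketch of the kind of fingerprint that could be recorded for the button above. This is plain illustrative Python, not Scrapling's actual storage format; the field names are assumptions for this example.

```python
# Illustrative only: a hypothetical trait fingerprint for an element.
# Scrapling's real internal format may differ.
element_traits = {
    "tag": "button",
    "text": "Buy Now",
    "attributes": {"class": "btn-primary", "id": "submit"},
    "path": ["html", "body", "form", "div"],  # hierarchy, outermost first
    "previous_sibling_text": "Price",         # nearest neighbor
}

# Even if the id later changes from "submit" to "purchase", most of
# these traits stay the same, which is what keeps the element findable.
print(element_traits["text"])  # Buy Now
```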

Using the Selector

Let's start with the basics. The Selector class is Scrapling's wrapper around raw HTML.

1. Basic Selection (The Traditional Way)

This is how most libraries work. You provide the HTML and ask for an element using CSS.

from scrapling import Selector

html = '<html><body><div id="content">Hello World</div></body></html>'

# 1. Load the HTML into the engine
page = Selector(html)

# 2. Extract data using CSS
text = page.css('#content::text').get()

print(text) # Output: Hello World

What happened here? We wrapped the raw HTML text in a Selector. We used .css() to find the element with id="content" and .get() to extract the text string.

2. Enabling Adaptive Mode

To make our parser smart, we need to turn on the adaptive flag.

# Enable adaptive mode and provide a URL (used as a key for the database)
page = Selector(html, adaptive=True, url="http://example.com")

Now the parser is connected to a lightweight database (SQLite) where it can store element "scents."
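To see how a URL-keyed trait store can work, here is a minimal sketch built on Python's standard sqlite3 module. The table name, columns, and helper functions are assumptions for illustration; Scrapling's actual schema is internal and may differ.

```python
import json
import sqlite3

# Hypothetical trait store (not Scrapling's real schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE elements (url TEXT, selector TEXT, traits TEXT, "
    "PRIMARY KEY (url, selector))"
)

def save_traits(url, selector, traits):
    # The URL acts as a namespace, so the same selector used on
    # two different sites does not collide.
    conn.execute(
        "INSERT OR REPLACE INTO elements VALUES (?, ?, ?)",
        (url, selector, json.dumps(traits)),
    )

def load_traits(url, selector):
    row = conn.execute(
        "SELECT traits FROM elements WHERE url = ? AND selector = ?",
        (url, selector),
    ).fetchone()
    return json.loads(row[0]) if row else None

save_traits("http://example.com", "#content",
            {"tag": "div", "text": "Hello World", "parent": "body"})
print(load_traits("http://example.com", "#content")["text"])  # Hello World
```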

3. The "Save" Phase (Teaching)

The first time you write a scraper, you need to teach Scrapling what the element looks like. We use the argument auto_save=True.

# We want to track this specific element
# Scrapling will analyze it and save its properties to the DB
element = page.css('#content', auto_save=True)

print("Element saved to database!")

What happened here? Scrapling found the element using #content. Because auto_save was on, it calculated the "scent" (tag is div, text is Hello World, parent is body) and saved it.

4. The "Retrieve" Phase (Relocating)

Months later, the website updates. They changed the ID from #content to #main-text. Your old selector #content now points to nothing!

In standard libraries, your script crashes. In Scrapling, we use adaptive=True.

# The website changed! ID is now 'main-text'
new_html = '<html><body><div id="main-text">Hello World</div></body></html>'
page = Selector(new_html, adaptive=True, url="http://example.com")

# We try the OLD selector, but enable adaptive backup
# Scrapling sees `#content` is missing, so it looks for the closest match
element = page.css('#content', adaptive=True)

print(element.text) # Output: Hello World

What happened here?

  1. Scrapling tried to find #content. It failed.
  2. Because adaptive=True, it checked the database for what #content used to look like.
  3. It scanned the new_html and found a div with text "Hello World".
  4. It calculated a similarity score (e.g., 95% match) and returned that element instead.

Under the Hood

How does the Selector make these decisions? It uses a scoring system based on fuzzy matching.

When adaptive=True is triggered, Scrapling performs a "Relocation" process.

sequenceDiagram
    participant User
    participant Selector
    participant DB as Storage (SQLite)
    participant Logic as Similarity Engine

    User->>Selector: css('#old-id', adaptive=True)
    Selector->>Selector: Try finding '#old-id'
    alt Element Not Found
        Selector->>DB: Retrieve traits for '#old-id'
        DB-->>Selector: Return traits (Text, Tag, Attributes)
        Selector->>Logic: Compare traits vs ALL elements on page
        Logic-->>Selector: Return element with highest score (e.g. 98%)
    end
    Selector-->>User: Return the found element

The Similarity Score

Inside the parser.py file, there is a method called __calculate_similarity_score. It doesn't use AI; it uses a weighted scoring formula.

Here is a simplified explanation of how Scrapling grades an element to see if it's the one you are looking for:

  1. Tag Match: Is it the same tag (e.g., <div>)? (+ Points)
  2. Text Match: Is the text similar? (Uses difflib for fuzzy string matching).
  3. Attribute Match: Do classes/IDs overlap?
  4. Hierarchy Match: Does it have the same parent and siblings?
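The four checks above can be sketched as a single weighted function. The weights and trait names below are illustrative choices, not Scrapling's actual formula; the point is how difflib's fuzzy string matching combines with exact tag, attribute, and hierarchy checks into one score.

```python
from difflib import SequenceMatcher

def similarity_score(saved, candidate):
    """Illustrative weighted score in [0, 1]; real weights differ."""
    score = 0.0
    # 1. Tag match: exact comparison.
    if saved["tag"] == candidate["tag"]:
        score += 0.3
    # 2. Text match: difflib fuzzy ratio in [0, 1].
    score += 0.4 * SequenceMatcher(None, saved["text"], candidate["text"]).ratio()
    # 3. Attribute match: fraction of shared (key, value) pairs.
    shared = set(saved["attrs"].items()) & set(candidate["attrs"].items())
    if saved["attrs"]:
        score += 0.2 * len(shared) / len(saved["attrs"])
    # 4. Hierarchy match: same parent tag.
    if saved["parent"] == candidate["parent"]:
        score += 0.1
    return score

saved = {"tag": "div", "text": "Hello World",
         "attrs": {"id": "content"}, "parent": "body"}
changed = {"tag": "div", "text": "Hello World",
           "attrs": {"id": "main-text"}, "parent": "body"}

# Same tag, same text, same parent, but the id changed:
print(round(similarity_score(saved, changed), 2))  # 0.8
```

The renamed element loses only the attribute portion of the score, so it still scores far higher than unrelated elements on the page.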

Internal Code Logic

Let's look at a simplified version of the code that runs when you call relocate.

# Simplified logic from scrapling/parser.py

def relocate(self, saved_traits, percentage=0):
    best_match = None
    highest_score = 0
    
    # Loop through EVERY element currently on the page
    for node in self.get_all_elements():
        
        # Compare saved traits vs current node
        score = self.calculate_score(saved_traits, node)
        
        # Keep track of the winner
        if score > highest_score:
            highest_score = score
            best_match = node

    # If the winner is good enough, return it
    if highest_score >= percentage:
        return best_match
    return None

This ensures that even if the website changes completely, as long as the data itself (the text or surrounding structure) remains somewhat consistent, Scrapling will find it.
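The relocation loop above can be exercised end to end with plain dictionaries standing in for parsed elements. This is an illustrative sketch under that assumption, not Scrapling's internals, and the toy calculate_score here is deliberately simpler than the real weighted formula.

```python
from difflib import SequenceMatcher

def calculate_score(saved, node):
    # Toy score: fuzzy text similarity plus a bonus for a matching tag.
    score = SequenceMatcher(None, saved["text"], node["text"]).ratio()
    if saved["tag"] == node["tag"]:
        score += 0.5
    return score

def relocate(page_elements, saved_traits, percentage=0.0):
    # Same shape as the simplified parser logic:
    # scan every element, keep the highest scorer.
    best_match, highest_score = None, 0.0
    for node in page_elements:
        score = calculate_score(saved_traits, node)
        if score > highest_score:
            highest_score, best_match = score, node
    return best_match if highest_score >= percentage else None

# The page after a redesign: the id changed, but the text survived.
new_page = [
    {"tag": "div", "id": "banner", "text": "50% off today!"},
    {"tag": "div", "id": "main-text", "text": "Hello World"},
]
saved = {"tag": "div", "text": "Hello World"}

winner = relocate(new_page, saved, percentage=1.0)
print(winner["id"])  # main-text
```

Note the percentage threshold: if no candidate scores high enough, relocate returns None instead of guessing, which is safer than silently extracting the wrong element.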

Limitations

While powerful, the Adaptive Parser is not magic:

  1. Uniqueness: It works best if the element has distinct text or structure. If you have a list of 100 identical items, it might struggle to pick the specific one you wanted without a strict selector.
  2. Performance: Comparing one element against every element on the page takes slightly more CPU time than a direct lookup.

Summary

In this chapter, you learned:

  1. Selector: The engine that wraps HTML to allow data extraction.
  2. Adaptive Parsing: The ability to find elements based on "scent" (traits) rather than just "address" (selectors).
  3. Auto-Save: How to teach Scrapling what an element looks like.
  4. Relocation: How Scrapling recovers data when the website layout changes.

Now that we know how to Fetch pages and Parse data adaptively, we need a way to organize these steps into a structured workflow.

Next Chapter: Spider Definition

