Welcome back! In the previous chapter, Fetchers Interface, we learned how to drive our vehicles (Fetchers) to the target website and download the raw HTML.
Now that we have the page, we face a new problem: Extraction.
Standard web scraping is brittle. You usually tell your scraper: "Go to the 3rd div, inside the 2nd table, and get the text." But what happens when the website owner decides to add a banner or change the layout? Your scraper breaks.
In this chapter, we introduce the Adaptive Parser, the brain of Scrapling that makes your code resilient to change.
Imagine you are trying to find a specific book in a library. If all you memorized was "third shelf, second row," any reshuffling of the shelves leaves you lost. But if you remember the book's title, its cover, and its author, you can find it no matter where it has been moved.
The core of Scrapling's parsing engine is the Selector class. When you enable Adaptive Mode, it acts like a bloodhound.
Instead of just memorizing the address (CSS or XPath selector) of an element, it memorizes its scent:
class="btn-primary", id="submit"div, which is inside a form.
If the specific address changes (e.g., the ID changes from submit to purchase), Scrapling sniffs out the element that looks most similar to the original one.
Let's start with the basics. The Selector is the wrapper around the HTML code.
This is how most libraries work. You provide the HTML and ask for an element using CSS.
from scrapling import Selector
html = '<html><body><div id="content">Hello World</div></body></html>'
# 1. Load the HTML into the engine
page = Selector(html)
# 2. Extract data using CSS
text = page.css('#content::text').get()
print(text) # Output: Hello World
What happened here?
We wrapped the raw HTML text in a Selector. We used .css() to find the element with id="content" and .get() to extract the text string.
To make our parser smart, we need to turn on the adaptive flag.
# Enable adaptive mode and provide a URL (used as a key for the database)
page = Selector(html, adaptive=True, url="http://example.com")
Now the parser is connected to a lightweight database (SQLite) where it can store element "scents."
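To make this concrete, here is a minimal sketch of how such "scents" could be persisted in SQLite, keyed by URL and selector. The schema, table name, and helper functions (save_scent, load_scent) are invented for illustration and are not Scrapling's actual storage layer:

```python
import json
import sqlite3

# Hypothetical schema for illustration only; Scrapling's real storage differs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scents (url TEXT, selector TEXT, traits TEXT, "
    "PRIMARY KEY (url, selector))"
)

def save_scent(url, selector, traits):
    # Store the element's traits as JSON under its (url, selector) key
    conn.execute(
        "INSERT OR REPLACE INTO scents VALUES (?, ?, ?)",
        (url, selector, json.dumps(traits)),
    )

def load_scent(url, selector):
    # Look up previously saved traits; None if we never saw this selector
    row = conn.execute(
        "SELECT traits FROM scents WHERE url = ? AND selector = ?",
        (url, selector),
    ).fetchone()
    return json.loads(row[0]) if row else None

save_scent("http://example.com", "#content",
           {"tag": "div", "text": "Hello World", "parent": "body"})
print(load_scent("http://example.com", "#content"))
```

Using the URL as part of the key means the same selector can be tracked independently on different pages.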
The first time you write a scraper, you need to teach Scrapling what the element looks like. We use the argument auto_save=True.
# We want to track this specific element
# Scrapling will analyze it and save its properties to the DB
element = page.css('#content', auto_save=True)
print("Element saved to database!")
What happened here?
Scrapling found the element using #content. Because auto_save was on, it calculated the "scent" (tag is div, text is Hello World, parent is body) and saved it.
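The "scent" is just structured data about the element. As a standard-library illustration (TraitFinder is a hypothetical helper, not part of Scrapling), here is one way such traits could be computed from raw HTML:

```python
from html.parser import HTMLParser

class TraitFinder(HTMLParser):
    """Record tag, parent tag, and text for the element with a target id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.stack = []        # currently open tags, outermost first
        self.capturing = False
        self.traits = None

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.target_id:
            self.capturing = True
            self.traits = {
                "tag": tag,
                "parent": self.stack[-1] if self.stack else None,
                "text": "",
            }
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        if self.traits is not None and tag == self.traits["tag"]:
            self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.traits["text"] += data

finder = TraitFinder("content")
finder.feed('<html><body><div id="content">Hello World</div></body></html>')
print(finder.traits)
# → {'tag': 'div', 'parent': 'body', 'text': 'Hello World'}
```

These are exactly the kinds of properties (tag, parent, text) described above; a real implementation would record more, such as attributes and sibling positions.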
Months later, the website updates. They changed the ID from #content to #main-text. Your old selector #content now points to nothing!
In standard libraries, your script crashes. In Scrapling, we use adaptive=True.
# The website changed! ID is now 'main-text'
new_html = '<html><body><div id="main-text">Hello World</div></body></html>'
page = Selector(new_html, adaptive=True, url="http://example.com")
# We try the OLD selector, but enable adaptive backup
# Scrapling sees `#content` is missing, so it looks for the closest match
element = page.css('#content', adaptive=True)
print(element.text) # Output: Hello World
What happened here?
Scrapling first tried the old selector #content. It failed. Because adaptive=True was set, it then checked the database for what #content used to look like. Finally, it scanned new_html and found a div with the text "Hello World" that matched the saved "scent".
How does the Selector make these decisions? It uses a scoring system based on fuzzy matching.
When adaptive=True is triggered, Scrapling performs a "Relocation" process.
Inside the parser.py file, there is a method called __calculate_similarity_score. It doesn't use AI; it uses weighted mathematics.
Here is a simplified explanation of how Scrapling grades an element to see if it's the one you are looking for:
- Does it have the same tag name (e.g., <div>)? (+ Points)
- Does its text content match? (+ Points, using difflib for fuzzy string matching)
- Do its attributes (id, class, etc.) match? (+ Points)
- Does it sit in a similar position in the tree (same parent, similar siblings)? (+ Points)
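The fuzzy string matching mentioned above relies on Python's standard difflib module, whose SequenceMatcher.ratio() returns a similarity score between 0.0 (unrelated) and 1.0 (identical):

```python
from difflib import SequenceMatcher

# Identical strings score a perfect 1.0
print(SequenceMatcher(None, "Hello World", "Hello World").ratio())  # → 1.0

# Unrelated strings score close to 0.0
print(SequenceMatcher(None, "submit", "purchase").ratio())

# Small edits (like a renamed class) still score high
print(SequenceMatcher(None, "btn-primary", "btn-primary-lg").ratio())
```

This is why a renamed element can still be recognized: its text rarely changes as drastically as its selectors do.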
Let's look at a simplified version of the code that runs when you call relocate.
# Simplified logic from scrapling/parser.py
def relocate(self, saved_traits, percentage=0):
    best_match = None
    highest_score = 0
    # Loop through EVERY element currently on the page
    for node in self.get_all_elements():
        # Compare saved traits vs current node
        score = self.calculate_score(saved_traits, node)
        # Keep track of the winner
        if score > highest_score:
            highest_score = score
            best_match = node
    # If the winner is good enough, return it
    if highest_score >= percentage:
        return best_match
    return None
This ensures that even if the website changes completely, as long as the data itself (the text or surrounding structure) remains somewhat consistent, Scrapling will find it.
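To see the relocation idea end to end, here is a deliberately simplified, self-contained demonstration. The dictionaries, weights, and helper names (calculate_score, relocate) are illustrative stand-ins for Scrapling's real nodes and scoring:

```python
from difflib import SequenceMatcher

def calculate_score(saved, node):
    # Weighted grading: tag match, parent match, fuzzy text similarity.
    # The 0.3 / 0.2 / 0.5 weights are arbitrary choices for this sketch.
    score = 0.0
    if saved["tag"] == node["tag"]:
        score += 0.3
    if saved.get("parent") == node.get("parent"):
        score += 0.2
    score += 0.5 * SequenceMatcher(None, saved["text"], node["text"]).ratio()
    return score

def relocate(saved_traits, nodes, percentage=0.5):
    # Scan every candidate node and keep the best-scoring one
    best_match, highest_score = None, 0.0
    for node in nodes:
        score = calculate_score(saved_traits, node)
        if score > highest_score:
            highest_score, best_match = score, node
    return best_match if highest_score >= percentage else None

# The "scent" saved back when the page still had id="content"
saved = {"tag": "div", "parent": "body", "text": "Hello World"}

# The redesigned page: the id changed, but the element's traits survived
new_page = [
    {"tag": "div", "parent": "body", "id": "main-text", "text": "Hello World"},
    {"tag": "p", "parent": "body", "id": "footer", "text": "Copyright 2024"},
]

found = relocate(saved, new_page)
print(found["id"])  # → main-text
```

The percentage threshold mirrors the same-named argument in the simplified relocate above: it keeps a barely-similar element from being returned as a false positive.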
While powerful, the Adaptive Parser is not magic: if both an element's text and its surrounding structure change beyond recognition, no amount of fuzzy matching can recover it. Relocation also has a runtime cost, since every element on the page must be scored.
In this chapter, you learned:
- Selector: The engine that wraps HTML to allow data extraction.
- adaptive=True: Connects the parser to a SQLite database of element "scents" so that broken selectors can be relocated.
- auto_save=True: Teaches Scrapling what an element looks like the first time you select it.

Now that we know how to Fetch pages and Parse data adaptively, we need a way to organize these steps into a structured workflow.
Next Chapter: Spider Definition
Generated by Code IQ