Welcome to Chapter 10!
In Chapter 9: Column Transformer, we learned how to process messy data (numbers and text) into a clean matrix. In earlier chapters like Chapter 3: Linear Models, we learned how to train a model.
Up until now, we have been doing these as separate steps.
This works for homework, but in the real world, it is messy and dangerous.
Imagine you own a sandwich shop: every order has to go through the same steps, in the same sequence, every single time.

The Problem: When preprocessing and modeling live in separate steps, you must remember to re-apply every fitted transformer, in the right order, calling transform() every time you get new data. Forget one step, or accidentally call fit() again on new data, and your predictions silently go wrong.
The Solution: The Pipeline. It bundles all your preprocessing steps and your final model into a single object.
We want to build a classifier that:
1. Scales the input features.
2. Classifies the scaled data.
And we want to treat this entire process as one single model.
A Pipeline is a list of steps performed in order.
- Every intermediate step must be a transformer: an object with a fit and a transform method, such as StandardScaler, PCA, or our ColumnTransformer from the previous chapter.
- The final step (the estimator) only needs fit.
- When you call fit() on the Pipeline, it automatically manages the flow of data through the steps.

Let's chain a Scaler and a Classifier together.
We import the parts we need.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Dummy data: 3 samples, 2 features
X_train = [[10.0, 200.0], [10.5, 210.0], [50.0, 500.0]]
y_train = [0, 0, 1]
A Pipeline is defined as a list of tuples: ('Name', Object).
# Create the "conveyor belt"
pipe = Pipeline([
    ('scaler', StandardScaler()),   # Step 1: Scale
    ('clf', LogisticRegression())   # Step 2: Classify
])
This is the magic part. We treat pipe exactly like a standard model.
# We feed RAW data into the pipeline
# It scales it internally, then trains the classifier
pipe.fit(X_train, y_train)
print("Pipeline is trained!")
Now we have new data. We do not need to scale it manually. The pipeline remembers the scale from the training step and applies it automatically.
# New, raw data
X_new = [[12.0, 205.0]]
# The pipeline scales this, then predicts
prediction = pipe.predict(X_new)
print(f"Prediction: {prediction[0]}")
# Output: 0
Result: Clean code, no manual variable handling, and no risk of forgetting a step!
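Even though the pipeline hides the intermediate steps, you can still inspect them after training. A fitted Pipeline exposes its steps by name through the named_steps attribute, so you can check, for example, what means the scaler learned. A quick sketch, reusing the training data from above:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_train = [[10.0, 200.0], [10.5, 210.0], [50.0, 500.0]]
y_train = [0, 0, 1]

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
pipe.fit(X_train, y_train)

# The fitted scaler is still reachable by the name we gave it
scaler = pipe.named_steps['scaler']
print(scaler.mean_)  # per-feature means learned during fit
```

This is handy for debugging: if a prediction looks wrong, you can pull out each step and test it in isolation.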
While Pipeline chains steps sequentially (Step 1 -> Step 2), it has a sibling called FeatureUnion that runs steps in parallel (Step A and Step B on the same input) and concatenates the results.
This is similar to the ColumnTransformer we saw in Chapter 9, but more general.
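To make the difference concrete, here is a small sketch: both transformers receive the same raw input, and their outputs are stacked side by side as columns (the specific transformers here are just illustrative choices).

```python
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = [[10.0, 200.0], [10.5, 210.0], [50.0, 500.0]]

# Run both transformers on the SAME input, in parallel
union = FeatureUnion([
    ('scaled', StandardScaler()),    # produces 2 output columns
    ('pca', PCA(n_components=1))     # produces 1 output column
])

X_union = union.fit_transform(X)
print(X_union.shape)  # (3, 3): 3 samples, 2 + 1 columns
```

A Pipeline's output shape depends only on its last step; a FeatureUnion's output is the horizontal concatenation of all of its steps.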
The Pipeline acts as a traffic controller. It differentiates between Training (fit) and Predicting (predict).
Training (fit)
When you call pipe.fit(X, y):
1. Each intermediate transformer is fitted on the data and then transforms it (fit_transform).
2. The transformed output becomes the input of the next step.
3. The final estimator is fitted on the fully transformed data.

Predicting (predict)
When you call pipe.predict(X):
1. Each intermediate transformer only transforms the data (transform, never fit), using what it learned during training.
2. The final estimator makes the prediction on the transformed data.
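We can verify this "transform only, never re-fit" behavior ourselves. The sketch below compares manually applying the fitted scaler against slicing the pipeline (pipe[:-1] returns a sub-pipeline of all but the last step, supported in modern scikit-learn); both must use the statistics learned from the training data:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_train = np.array([[10.0, 200.0], [10.5, 210.0], [50.0, 500.0]])
y_train = [0, 0, 1]

pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression())])
pipe.fit(X_train, y_train)

# During predict, the scaler only *transforms* -- it does not re-fit
X_new = np.array([[12.0, 205.0]])
manual = pipe.named_steps['scaler'].transform(X_new)  # training mean/std applied
routed = pipe[:-1].transform(X_new)                   # every step except the model

print(np.allclose(manual, routed))  # True
```

If the pipeline re-fitted the scaler on X_new, the two results would differ (a single sample would be scaled to all zeros).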
The code for this resides in sklearn/pipeline.py. It is essentially a loop that passes the output of one step as the input to the next.
Here is a simplified Python conceptualization of the fit method inside the Pipeline class:
# Simplified logic from sklearn/pipeline.py
class SimplePipeline:
    def __init__(self, steps):
        self.steps = steps  # List of (name, object) tuples

    def fit(self, X, y):
        # 1. Loop through all steps EXCEPT the last one
        for name, transformer in self.steps[:-1]:
            # Fit this step and transform the data.
            # The output becomes the input for the next iteration.
            X = transformer.fit_transform(X, y)

        # 2. Get the final step (the model)
        last_step = self.steps[-1][1]

        # 3. Fit the model using the fully transformed data
        last_step.fit(X, y)
        return self
Explanation:
- The for loop iterates through your transformers (like Scalers).
- X is overwritten at every step: the X that leaves step 1 enters step 2.
- Only the fully transformed X ever reaches the final Estimator.
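We can also unroll that loop by hand to convince ourselves that a Pipeline does nothing magical. The sketch below runs the fit_transform chain manually and checks that the resulting classifier matches the one trained inside a real Pipeline:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = [[10.0, 200.0], [10.5, 210.0], [50.0, 500.0]]
y = [0, 0, 1]

# Unrolled: exactly what the loop above does, by hand
scaler = StandardScaler()
Xt = scaler.fit_transform(X)             # intermediate step: fit + transform
clf = LogisticRegression().fit(Xt, y)    # final step: fit only

# The real Pipeline should learn the same coefficients
pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression())]).fit(X, y)

print(np.allclose(clf.coef_, pipe.named_steps['clf'].coef_))  # True
```

The Pipeline's value is not new math; it is bookkeeping you no longer have to do yourself.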
Sometimes, steps need extra information (like "sample weights" for weighted scoring). Modern scikit-learn pipelines support Metadata Routing. This allows you to pass extra arguments like sample_weight into fit(), and the Pipeline intelligently routes them only to the steps that requested them.
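Even without the newer metadata-routing machinery, scikit-learn has long supported targeting a fit argument at a specific step by prefixing it with the step's name and a double underscore. A sketch (the weights here are arbitrary illustrative values):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = [[10.0, 200.0], [10.5, 210.0], [50.0, 500.0]]
y = [0, 0, 1]

pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression())])

# 'clf__sample_weight' is delivered only to the 'clf' step's fit();
# the scaler never sees it.
pipe.fit(X, y, clf__sample_weight=[1.0, 1.0, 5.0])
prediction = pipe.predict([[12.0, 205.0]])
```

Metadata routing generalizes this idea: steps declare which metadata they want (e.g. via set_fit_request), and the Pipeline forwards only what was requested.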
In this chapter, we learned:
- A Pipeline bundles preprocessing transformers and a final estimator into one object.
- Preprocessing parameters (like scaling statistics) are learned during fit and strictly applied during predict.
- The entire workflow reduces to two calls: pipe.fit() and pipe.predict().

Now we have built sophisticated Pipelines. But how do we know they are robust? How do we verify that our code doesn't break when edge cases happen?
We need to learn how to test our tools.
Next Chapter: Testing Utilities
Generated by Code IQ