Welcome to Chapter 3!
In Chapter 1: Base API, we learned how to build a model structure. In Chapter 2: Datasets, we learned how to load data. Now we will combine them to build our first real machine learning models.
Imagine you are trying to predict the price of a house based on its size. You plot your data on a chart:
The Problem: You have a new house size, but no price tag. How do you guess the price?
The Solution: You take a ruler and draw a straight line through the middle of your data points. To predict the price, you just look at where the new house size falls on that line.
In scikit-learn, this family of algorithms is called Linear Models. They are simple, fast, and often the first thing you should try.
We will look at two problems: Regression (predicting a number, like a price) and Classification (predicting a category, like pass/fail).
Linear models are all about finding the best formula that looks like this:
$$ y = w \times X + b $$
Don't worry about the math symbols! Here is the translation:

- $y$: the answer we want (e.g., the price)
- $X$: the input data (e.g., the house size)
- $w$: the weight, i.e. the slope of the line
- $b$: the bias, i.e. where the line sits when $X$ is 0

The model's job is to figure out the best $w$ and $b$ so the line fits your data as closely as possible.
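To make the formula concrete, here is a tiny sketch with made-up numbers (the values 2.0 and 1.0 are purely illustrative, not learned from data):

```python
# The linear formula y = w * X + b, with illustrative values
w = 2.0   # weight: how much y changes per unit of X
b = 1.0   # bias: the value of y when X is 0
X = 5.0
y = w * X + b
print(y)  # 11.0
```

Training a model simply means letting the computer search for the $w$ and $b$ that make this formula match your data.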
Let's predict pizza prices based on their diameter (in inches).
We have 3 pizzas.
from sklearn.linear_model import LinearRegression
import numpy as np
# X must be a 2D array (list of lists)
X_train = [[6], [8], [10]]
y_train = [7, 9, 13]
We instantiate the model and call fit(). The model will now try to draw the best line through these three points.
# Create the "Ruler"
model = LinearRegression()
# Find the best line
model.fit(X_train, y_train)
Now we ask: How much should a 12-inch pizza cost?
# Predict for 12 inches
prediction = model.predict([[12]])
print(f"Predicted Price: ${prediction[0]:.2f}")
# Output: Predicted Price: $15.67
Explanation: The model looked at the trend and extended the line to 12 inches.
We can peek inside to see the math the model learned.
print(f"Weight (w): {model.coef_[0]}")    # ≈ 1.5
print(f"Bias (b): {model.intercept_}")    # ≈ -2.33
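As a quick sanity check (a sketch, not part of the original example), we can plug the learned weight and bias back into the formula $y = w \times X + b$ and confirm it reproduces the model's own prediction:

```python
from sklearn.linear_model import LinearRegression

X_train = [[6], [8], [10]]
y_train = [7, 9, 13]

model = LinearRegression()
model.fit(X_train, y_train)

w = model.coef_[0]       # the learned slope
b = model.intercept_     # the learned bias

# Applying the formula by hand should match model.predict
manual = w * 12 + b
predicted = model.predict([[12]])[0]
print(f"By hand: {manual:.2f}, via predict: {predicted:.2f}")
```

There is no magic inside `predict()`: it is just this multiply-and-add applied to your input.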
Now, let's predict if a student passes an exam based on hours studied.
from sklearn.linear_model import LogisticRegression
# 0 = Fail, 1 = Pass
X_train = [[1], [2], [4], [5]]
y_train = [0, 0, 1, 1]
This creates a "Decision Boundary." If you fall on one side of the line, you are "Fail"; on the other, "Pass."
clf = LogisticRegression()
clf.fit(X_train, y_train)
# Predict for someone who studied 3 hours
result = clf.predict([[3]])
print(f"Prediction: {'Pass' if result[0] == 1 else 'Fail'}")
Output: Likely Fail (depending on the exact math boundary). Note that 3 hours sits exactly between the failing students (1 and 2) and the passing ones (4 and 5), so the prediction lands right on the decision boundary and the result can tip either way depending on the solver.
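Because 3 hours is such a borderline case, a hard label hides how uncertain the model is. We can ask for probabilities instead with `predict_proba` (a standard scikit-learn classifier method; the exact numbers depend on the solver and regularization):

```python
from sklearn.linear_model import LogisticRegression

X_train = [[1], [2], [4], [5]]
y_train = [0, 0, 1, 1]

clf = LogisticRegression()
clf.fit(X_train, y_train)

# Probability of [Fail, Pass] for 3 hours of study
proba = clf.predict_proba([[3]])[0]
print(f"P(Fail) = {proba[0]:.2f}, P(Pass) = {proba[1]:.2f}")
```

Since the training data is symmetric around 3, both probabilities come out close to 0.50, which is exactly why the hard prediction is so fragile here.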
You might wonder: How does the computer know exactly where to put the line? Why didn't it draw a different one?
This involves an Optimizer. The model plays a game called "Minimize the Error."
For simple Linear Regression, there is a direct mathematical formula to find the answer instantly. But for more complex models (like Lasso or ElasticNet), scikit-learn uses an algorithm called Coordinate Descent.
Imagine you are tuning a radio that has 100 knobs (weights). You want to get the clearest signal (lowest error). Instead of turning all the knobs at once, you:

1. Freeze every knob except the first one.
2. Turn that single knob until the signal is as clear as it can get.
3. Move on to the next knob and repeat.
4. Cycle through all the knobs again and again until nothing improves.

This "one knob at a time" approach is very effective, but doing it in plain Python loops is slow.
To make this fast, scikit-learn implements this loop in Cython (C-Extension for Python). The file is located at sklearn/linear_model/_cd_fast.pyx.
Here is a simplified Python representation of what happens inside that C-file:
# Conceptual logic of Coordinate Descent (simplified)
def coordinate_descent(X, y, weights, n_iterations):
    n_features = X.shape[1]
    for i in range(n_iterations):
        # Loop over each feature (knob) one by one
        for feature_idx in range(n_features):
            # 1. Calculate the error with the current weights
            current_prediction = X @ weights
            error = y - current_prediction
            # 2. Update ONLY this specific weight to reduce the error
            #    (the real update also handles regularization terms)
            column = X[:, feature_idx]
            weights[feature_idx] += column @ error / (column @ column)
    return weights
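To connect the idea back to the pizza example, here is a self-contained toy run (my own illustrative implementation, not the actual `_cd_fast.pyx` code) that performs exact coordinate minimization on the pizza data, with a column of ones added so it can learn the bias too:

```python
import numpy as np

def coordinate_descent(X, y, n_iterations=500):
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    for _ in range(n_iterations):
        for j in range(n_features):            # one "knob" at a time
            residual = y - X @ weights
            # Exact update: minimize the squared error w.r.t. weight j alone
            weights[j] += X[:, j] @ residual / (X[:, j] @ X[:, j])
    return weights

# Pizza data, with a column of ones so the model can learn the bias b
X = np.array([[6.0, 1.0], [8.0, 1.0], [10.0, 1.0]])
y = np.array([7.0, 9.0, 13.0])

w, b = coordinate_descent(X, y)
print(f"w = {w:.2f}, b = {b:.2f}")  # converges toward w = 1.5, b ≈ -2.33
```

After enough sweeps, the knob-by-knob updates settle on the same line that LinearRegression found with its direct formula.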
Explanation:
Because _cd_fast.pyx is compiled to machine code, this loop runs millions of times per second, allowing scikit-learn to fit huge datasets quickly.

In this chapter, we learned:

- Linear Regression draws the best straight line through numeric data to predict values like prices.
- Logistic Regression draws a decision boundary to predict categories like pass/fail.
- Under the hood, optimizers such as Coordinate Descent tune the weights one at a time, implemented in Cython for speed.
Now that our model has made predictions, how do we know if they are actually correct? Is being off by $2 good or bad?
We will find out in the next chapter.
Generated by Code IQ