Welcome to Chapter 5!
In Chapter 3: Linear Models, we learned how to predict an answer when we have a "teacher" (labels like Price or Species). In Chapter 4: Metrics, we learned how to measure distances between points.
But what if we have data without any labels? What if we don't have a teacher?
Imagine you run a T-shirt factory. A machine dumps 1,000 shirts into a pile.
You can't use a classifier because you don't have training examples (no y). You just have the raw measurements of the shirts (the X).
The Solution: Clustering. You look at the pile and say, "These ones look similar, I'll put them here. Those ones look huge, I'll put them there."
We are a bank. We have a list of customers with two features: Income and Spending Score.
We want to find distinct "groups" of customers so we can offer them specific credit cards. We don't know what the groups are yet; we just want the algorithm to find them.
Clustering is a type of Unsupervised Learning (learning without labels).
We will use KMeans from scikit-learn.
We'll generate some dummy data representing our customers.
import numpy as np
# We use make_blobs to create clumps of data
from sklearn.datasets import make_blobs
# Generate 3 distinct groups of customers
# X contains [Income, Spending Score]
X, _ = make_blobs(n_samples=15, centers=3, random_state=42)
print(f"Customer 1 data: {X[0]}")
Output: Customer 1 data: [-2.5, 9.0] (just dummy numbers for now). Note that we ignore the second return value by assigning it to _: make_blobs also returns the true group labels, but we pretend we don't know them!
We must tell the model how many groups (n_clusters) to look for.
from sklearn.cluster import KMeans
# We suspect there are 3 types of customers
model = KMeans(n_clusters=3, random_state=42)
# Notice: We only pass X! There is no y.
model.fit(X)
Explanation: The model has now analyzed the geometry of the data and found the best 3 spots to place its centers.
Now we can ask the model: "Which group does each customer belong to?"
# Assign each customer to a group (0, 1, or 2)
labels = model.predict(X)
print("Group assignments:", labels)
# Output: [2 1 0 1 2 ...]
Result: The model successfully sorted the customers. All customers labeled 2 are similar to each other, and different from those labeled 0.
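To sanity-check the result, we can count how many customers landed in each group. Here is a quick sketch using a hypothetical label array shaped like the one predict returns (the numbers below are made up for illustration):

```python
import numpy as np

# A hypothetical label array, shaped like the one model.predict(X) returns
labels = np.array([2, 1, 0, 1, 2, 0, 0, 1, 2, 2, 1, 0, 0, 1, 2])

# np.bincount tallies how many customers fell into group 0, 1, and 2
sizes = np.bincount(labels)
print("Customers per group:", sizes)  # → [5 5 5]
```

If one group swallows almost every customer, that is often a hint that n_clusters was a poor choice.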
We can see the "average" customer for each group.
# The coordinates of the 3 cluster centers
centers = model.cluster_centers_
print("Center of Group 0:", centers[0])
Explanation: This point is the mathematical center of the first cluster.
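The word "center" here is literal: each row of cluster_centers_ is simply the arithmetic mean of the points assigned to that cluster. A tiny numpy sketch with made-up points:

```python
import numpy as np

# Three made-up customers assigned to the same cluster
points = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])

# The cluster center is the column-wise mean of its members
center = points.mean(axis=0)
print(center)  # → [3. 4.]
```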
K-Means is an iterative algorithm. It plays a game of "Hot or Cold" to find the centers: it assigns every point to its nearest center, moves each center to the mean of its assigned points, and repeats until the centers stop moving.
The heavy lifting happens in sklearn/cluster/_kmeans.py.
However, calculating the distance between every point and every center millions of times is slow in pure Python. Scikit-learn optimizes this using Cython.
The core logic resides in a file called _k_means_common.pyx.
# Simplified Python sketch of one K-Means iteration
# (the real version is compiled Cython, but the logic is the same)
import numpy as np

def k_means_single_step(X, centers):
    n_samples = X.shape[0]
    n_clusters = centers.shape[0]
    new_centers = np.zeros_like(centers)
    counts = np.zeros(n_clusters)
    # Iterate over every data point
    for i in range(n_samples):
        # Find the nearest center (e.g., center 0, 1, or 2)
        distances = np.linalg.norm(centers - X[i], axis=1)
        best_center_idx = np.argmin(distances)
        # Add this point's values to the running total for that center
        new_centers[best_center_idx] += X[i]
        counts[best_center_idx] += 1
    # Divide each running total by its count (the "Mean" in K-Means)
    return new_centers / counts[:, np.newaxis]
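Wrapping a loop around that single step gives the full "Hot or Cold" game. Below is a self-contained numpy sketch of the whole algorithm (Lloyd's iteration), not scikit-learn's actual code; the function name and initialization scheme are simplified for illustration:

```python
import numpy as np

def k_means(X, n_clusters, n_iters=100, seed=42):
    """A toy K-Means: repeat assign + average until the centers stop moving."""
    rng = np.random.default_rng(seed)
    # Initialize centers as randomly chosen data points
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its members
        new_centers = centers.copy()
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members) > 0:
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # converged: the centers stopped moving
        centers = new_centers
    return centers, labels
```

Real scikit-learn adds smarter initialization (k-means++), multiple restarts, and empty-cluster handling on top of this core loop.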
MiniBatchKMeans

If you have 10 million customers, waiting for the standard K-Means to check every customer before moving the centers takes too long.
Scikit-learn offers MiniBatchKMeans.
from sklearn.cluster import MiniBatchKMeans
# Faster version for huge datasets
fast_model = MiniBatchKMeans(n_clusters=3, batch_size=100, random_state=42)
fast_model.fit(X)
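The trick behind the mini-batch variant is that each center is nudged toward new points with a learning rate that shrinks as that center sees more data, so one small batch at a time is enough. Here is a simplified numpy sketch of that update rule (the idea from the mini-batch algorithm, not scikit-learn's exact implementation; the function name is invented):

```python
import numpy as np

def minibatch_update(batch, centers, counts):
    """One mini-batch pass: nudge each point's nearest center toward it."""
    for x in batch:
        # Assign the point to its nearest center
        k = np.argmin(np.linalg.norm(centers - x, axis=1))
        counts[k] += 1
        # Per-center learning rate shrinks as the center accumulates points
        lr = 1.0 / counts[k]
        centers[k] = (1.0 - lr) * centers[k] + lr * x
    return centers, counts
```

Because the step size decays per center, early points move a center a lot while later points barely move it, which is why the result approximates full K-Means at a fraction of the cost.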
In this chapter, we learned:
- Clustering is Unsupervised Learning: we fit on the features (X) alone, without any labels (y).
- KMeans finds groups by repeatedly assigning points to their nearest center and moving each center to the mean of its points.
- The heavy distance math is implemented in Cython (_k_means_common.pyx) or sped up using MiniBatchKMeans.

We have successfully grouped our customers! But what if we want to make decisions based on a series of Yes/No questions instead of distances?
In the next chapter, we will learn about Decision Trees.
Generated by Code IQ