Welcome to Chapter 10! In the previous chapter, 4-Classification, we taught a robot how to sort data into specific buckets (like "Cat" or "Dog"). To do that, we had to show the robot thousands of examples that were already labeled.
But what if we have a huge pile of data with no labels?
What if we have a list of music listeners, but we don't know what "Genre" they like? We just know what songs they played. Can the robot organize this messy pile for us without a teacher?
This brings us to the folder 5-Clustering.
Imagine you run a music streaming service.
Clustering is a type of Machine Learning where the robot looks at the data and says, "Hey, these points look like they belong together."
It groups similar items into "blobs" or Clusters.
In Classification (Chapter 9), we used Supervised Learning (we acted as the teacher). In Clustering, we use Unsupervised Learning.
The most popular algorithm in this folder is K-Means.
Imagine throwing K darts at a map, one per group. Each dart is a Centroid. The algorithm moves these darts around until each one sits perfectly in the middle of a group of data points.
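"The middle of a group" is nothing mysterious: a centroid is just the average position of the points in that group. Here is a tiny sketch using three hypothetical fast, loud songs:

```python
import numpy as np

# Three hypothetical fast, loud songs: [Speed, Loudness]
group = np.array([[120, 10], [125, 9], [130, 8]])

# The centroid is simply the column-wise average
centroid = group.mean(axis=0)
print(centroid)  # [125.   9.]
```

The centroid lands at Speed 125, Loudness 9: right in the middle of the three songs.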
To use this folder, we use Scikit-learn inside a notebook (remember notebook.ipynb?).
Let's pretend we have 6 songs. The first number is Speed (BPM), the second is Loudness.
import pandas as pd
# Data: [Speed, Loudness]
songs = [
[120, 10], [125, 9], [130, 8], # Fast & Loud (Techno?)
[60, 2], [65, 3], [55, 1] # Slow & Quiet (Lullabies?)
]
# Create a DataFrame for easier viewing
df = pd.DataFrame(songs, columns=['Speed', 'Loudness'])
print("Music Data Ready!")
Explanation: We created a small list. To a human eye, there are clearly two groups here: the "120s" and the "60s". Let's see if the robot sees it too.
We will use KMeans. We have to tell it how many groups we want (n_clusters).
from sklearn.cluster import KMeans
# 1. Initialize the robot
# We ask for 2 groups (clusters)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)  # random_state makes the result repeatable
# 2. Train the robot on our songs
kmeans.fit(songs)
# 3. Get the labels
print(kmeans.labels_)
Output:
[0 0 0 1 1 1]
Explanation: The robot looked at the data and noticed that the first three songs sit close together (fast and loud) while the last three sit close together (slow and quiet), so it split them into Group 0 and Group 1. Note that the group numbers themselves are arbitrary: on another run, the robot might call the fast songs Group 1 instead.
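We can also peek at where the two "darts" (centroids) landed after training. This is a quick check, continuing the same example (re-stated here so it runs on its own):

```python
from sklearn.cluster import KMeans

songs = [
    [120, 10], [125, 9], [130, 8],  # Fast & Loud
    [60, 2], [65, 3], [55, 1]       # Slow & Quiet
]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(songs)

# Each row is one centroid: [Speed, Loudness]
print(kmeans.cluster_centers_)
```

One centroid ends up near [125, 9] (the middle of the fast/loud songs) and the other near [60, 2] (the middle of the slow/quiet songs), just as the dart analogy promised.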
Now we have a new song. Is it Techno or a Lullaby?
# New Song: Speed 128, Loudness 9
new_song = [[128, 9]]
# Ask the robot which group it belongs to
prediction = kmeans.predict(new_song)
print(f"This song belongs to Group: {prediction[0]}")
Output:
This song belongs to Group: 0
Explanation:
Since Group 0 was our fast/loud group, the robot correctly assigned the new song to that playlist.
How does KMeans figure this out? It uses a process of "Guess and Check": first it guesses by dropping K centroids at random, then it checks by assigning every point to its nearest centroid, then it moves each centroid to the middle of its new group. It repeats this until the centroids stop moving.
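The "Guess and Check" loop can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not scikit-learn's actual implementation (which adds smarter initialization and safeguards):

```python
import numpy as np

def simple_kmeans(points, k, iterations=10, seed=0):
    """A minimal K-Means sketch: guess centroids, then repeatedly check and adjust."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Guess: pick k random data points as the starting centroids (the "darts")
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Check: assign each point to its nearest centroid
        distances = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = distances.argmin(axis=1)
        # Adjust: move each centroid to the middle of its group
        # (for simplicity we assume no group ends up empty)
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

songs = [[120, 10], [125, 9], [130, 8], [60, 2], [65, 3], [55, 1]]
labels, centroids = simple_kmeans(songs, k=2)
print(labels)  # the first three songs share one label, the last three the other
```

Even this toy version separates the fast/loud songs from the slow/quiet ones, because the data forms two well-spaced blobs.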
One of the hardest parts of clustering is picking K.
In the 5-Clustering lessons, you will learn a technique called the Elbow Method. You run the robot multiple times with different K values and measure how "messy" the groups are (this is called Inertia).
import matplotlib.pyplot as plt

inertia_list = []

# Try different numbers of clusters
# (K cannot exceed our 6 songs, so we stop at 6)
for i in range(1, 7):
    km = KMeans(n_clusters=i, n_init=10, random_state=42)
    km.fit(songs)
    inertia_list.append(km.inertia_)

# Plot inertia against K to find the "Elbow"
plt.plot(range(1, 7), inertia_list, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia')
plt.show()
Explanation: As K grows, inertia always shrinks, but at some point adding more clusters barely helps. The bend in the curve where the improvement levels off (the "elbow") is a good choice for K. For our songs, the elbow appears at K = 2, matching the two groups we spotted by eye.
Clustering is powerful because labeling data by hand is expensive. Here, the robot organized the songs into playlists without a single label from us.
In this chapter, we explored 5-Clustering. We learned that clustering is Unsupervised Learning (no teacher, no labels), that KMeans groups data by moving centroids into the middle of each blob, and that the Elbow Method helps us pick a sensible number of clusters.
We have covered numbers (Regression), categories (Classification), and groups (Clustering). But humans don't just communicate in numbers; we communicate in words.
How do we teach a computer to read a book or understand a sentence?