Chapter 10: 5-Clustering

Welcome to Chapter 10! In the previous chapter, 4-Classification, we taught a robot how to sort data into specific buckets (like "Cat" or "Dog"). To do that, we had to show the robot thousands of examples that were already labeled.

But what if we have a huge pile of data with no labels?

What if we have a list of music listeners, but we don't know what "Genre" they like? We just know what songs they played. Can the robot organize this messy pile for us without a teacher?

This brings us to the folder 5-Clustering.

Motivation: The Music Playlist

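Imagine you run a music streaming service. You know which songs each listener played, but nobody has labeled their tastes, and you still want to put similar listeners on similar playlists.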

Clustering is a type of Machine Learning where the robot looks at the data and says, "Hey, these points look like they belong together."

It groups similar items into "blobs" or Clusters.

Key Concepts: Unsupervised Learning

In Classification (Chapter 9), we used Supervised Learning (we acted as the teacher). In Clustering, we use Unsupervised Learning: there are no labels, so the robot has to find the groups on its own.

1. K-Means

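This is the most popular algorithm in this folder. The "K" is the number of groups you ask for, and "Means" refers to the average position of each group.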

2. Centroids

Imagine throwing a dart at a map. That dart is a Centroid. The algorithm moves this dart around until it sits perfectly in the middle of a group of data points.
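To make "the middle of a group" concrete, here is a minimal sketch in plain Python. It assumes the middle means the average of each coordinate; the `centroid` helper and the three sample songs are made up for illustration and are not part of Scikit-learn.

```python
# A centroid is the average position of a group of points.
def centroid(points):
    """Return the mean of each coordinate across all points."""
    n = len(points)
    dims = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dims)]

# Three illustrative songs as [Speed, Loudness]
fast_songs = [[120, 10], [125, 9], [130, 8]]
print(centroid(fast_songs))  # [125.0, 9.0]
```

The dart ends up at Speed 125 and Loudness 9, right in the middle of the three songs.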

How to Use This Abstraction

To use this folder, we use Scikit-learn inside a notebook (remember notebook.ipynb?).

Step 1: Create Dummy Data

Let's pretend we have 6 songs. The first number is Speed (BPM), the second is Loudness.

import pandas as pd

# Data: [Speed, Loudness]
songs = [
    [120, 10], [125, 9], [130, 8],  # Fast & Loud (Techno?)
    [60, 2], [65, 3], [55, 1]       # Slow & Quiet (Lullabies?)
]

# Create a DataFrame for easier viewing
df = pd.DataFrame(songs, columns=['Speed', 'Loudness'])
print("Music Data Ready!")

Explanation: We created a small list. To a human eye, there are clearly two groups here: the "120s" and the "60s". Let's see if the robot sees it too.

Step 2: The Clustering Robot

We will use KMeans. We have to tell it how many groups we want (n_clusters).

from sklearn.cluster import KMeans

# 1. Initialize the robot
# We ask for 2 groups (clusters)
kmeans = KMeans(n_clusters=2)

# 2. Train the robot on our songs
kmeans.fit(songs)

# 3. Get the labels
print(kmeans.labels_)

Output:

[0 0 0 1 1 1]

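Explanation: The robot looked at the data and put the first three songs in one group and the last three in another, just like we did by eye. The group numbers themselves are arbitrary: on your machine the same split may come out as [1 1 1 0 0 0].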
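Because the group numbers carry no meaning by themselves, it can help to check which number landed on the fast songs. The sketch below is one possible way to do that in plain Python; `mean_speed_by_label` is a hypothetical helper, and the labels are copied from the output above rather than computed here.

```python
songs = [[120, 10], [125, 9], [130, 8], [60, 2], [65, 3], [55, 1]]
labels = [0, 0, 0, 1, 1, 1]  # copied from the output above; your run may differ

def mean_speed_by_label(songs, labels):
    """Average the Speed column separately for each group number."""
    totals, counts = {}, {}
    for (speed, _loudness), label in zip(songs, labels):
        totals[label] = totals.get(label, 0) + speed
        counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] for label in totals}

speeds = mean_speed_by_label(songs, labels)
fast_label = max(speeds, key=speeds.get)  # the group with the highest average speed

print(speeds)      # {0: 125.0, 1: 60.0}
print(fast_label)  # 0
```

If your run swaps the numbers, `fast_label` will simply come out as 1 instead.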

Step 3: Predicting a New Song

Now we have a new song. Is it Techno or a Lullaby?

# New Song: Speed 128, Loudness 9
new_song = [[128, 9]]

# Ask the robot which group it belongs to
prediction = kmeans.predict(new_song)

print(f"This song belongs to Group: {prediction[0]}")

Output:

This song belongs to Group: 0

Explanation: Since Group 0 was our fast/loud group, the robot correctly assigned the new song to that playlist.
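Under the hood, a K-Means prediction boils down to "which centroid is nearest?". Here is a rough sketch of that idea in plain Python. The two centroids are assumed values (the averages of our two groups of songs), and `nearest_centroid` is an illustrative helper, not a Scikit-learn function.

```python
import math

# Assumed centroids: index 0 is the fast/loud group, index 1 the slow/quiet group.
centroids = [[125.0, 9.0], [60.0, 2.0]]

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to the point."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

print(nearest_centroid([128, 9], centroids))  # 0
```

The new song sits 3 units away from the fast centroid and about 68 units away from the slow one, so it joins Group 0.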

The Internal Structure: Under the Hood

How does KMeans figure this out? It uses a process of "Guess and Check."

  1. It picks 2 random spots on the chart.
  2. It asks every song: "Which spot are you closer to?"
  3. It moves each spot to the average position of the songs that picked it.
  4. It repeats this until the spots stop moving.

sequenceDiagram
    participant Data as Songs
    participant Robot as K-Means Algo
    participant Center as Centroid (The Dart)
    Robot->>Center: "Start at random location (0,0)"
    Data->>Robot: "We are mostly located at (100, 10)!"
    Note right of Robot: Robot realizes the center is too far away.
    Robot->>Center: "Move closer to the data at (50, 5)"
    Robot->>Center: "Move closer... (90, 9)"
    Center-->>Data: "I am now in the middle of the group."
    Robot-->>Data: "Clustering Complete."
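The "Guess and Check" steps can be sketched in a few lines of plain Python. This is a simplified teaching version, not how Scikit-learn actually implements KMeans. The `k_means` function is hypothetical, and the starting spots are fixed by hand (instead of random) so that every run gives the same answer.

```python
import math

def k_means(points, centers, max_rounds=100):
    """Repeat 'pick the closest spot, then move the spots' until nothing moves."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(max_rounds):
        # Step 2: every point picks the spot it is closest to
        labels = []
        for p in points:
            distances = [math.dist(p, c) for c in centers]
            labels.append(distances.index(min(distances)))

        # Step 3: move each spot to the average of the points that picked it
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # nobody picked this spot, so leave it alone

        # Step 4: stop once the spots no longer move
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

songs = [[120, 10], [125, 9], [130, 8], [60, 2], [65, 3], [55, 1]]

# Step 1: two starting spots, chosen by hand for this example
labels, centers = k_means(songs, centers=[[100, 10], [50, 0]])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)  # [[125.0, 9.0], [60.0, 2.0]]
```

With these starting spots the loop settles after two rounds. A poor starting guess can leave a spot with no songs at all, which is one reason real libraries try several starting positions.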

Deep Dive: The Elbow Method

One of the hardest parts of clustering is picking K.

In the 5-Clustering lessons, you will learn a technique called the Elbow Method. You run the robot multiple times with different K values and measure how "messy" the groups are (this is called Inertia).

import matplotlib.pyplot as plt

inertia_list = []

# Try different numbers of clusters (from 1 to 6)
# K cannot be larger than the number of songs, and we only have 6
for i in range(1, len(songs) + 1):
    km = KMeans(n_clusters=i)
    km.fit(songs)
    inertia_list.append(km.inertia_)

# We would plot this list to find the "Elbow"
print("Inertia calculated for all K values.")

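Explanation: Inertia adds up how far each song sits from the center of its group, so it keeps shrinking as K grows. The "Elbow" is the K where the drop suddenly flattens out; adding more groups after that point barely helps.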
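To see where the Inertia number comes from, here is a small hand calculation in plain Python. It assumes Inertia is the sum of squared distances from each song to the center of its group. The `inertia` helper and the hard-coded centers are illustrative only; they are the averages of the songs in each group.

```python
def inertia(points, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    total = 0.0
    for p, label in zip(points, labels):
        total += sum((a - b) ** 2 for a, b in zip(p, centers[label]))
    return total

songs = [[120, 10], [125, 9], [130, 8], [60, 2], [65, 3], [55, 1]]

# K = 1: one big group, centered on the average of all six songs
one_group = inertia(songs, [0, 0, 0, 0, 0, 0], [[92.5, 5.5]])

# K = 2: the fast group and the slow group, each with its own center
two_groups = inertia(songs, [0, 0, 0, 1, 1, 1], [[125.0, 9.0], [60.0, 2.0]])

print(one_group)   # 6515.0
print(two_groups)  # 104.0
```

Going from one group to two makes the number collapse from 6515 to 104. A drop that large, followed by only small improvements, is the kind of bend the Elbow Method looks for.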

Why this matters for Beginners

Clustering is powerful because data is expensive to label.

Conclusion

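In this chapter, we explored 5-Clustering. We learned that Unsupervised Learning groups data without labels, that K-Means does this by moving Centroids to the middle of each group, and that the Elbow Method helps us pick K.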

We have covered numbers (Regression), categories (Classification), and groups (Clustering). But humans don't just communicate in numbers; we communicate in words.

How do we teach a computer to read a book or understand a sentence?

Next Chapter: 6-NLP


Generated by Code IQ