Welcome to the second chapter of our scikit-learn guide!
In the Base API chapter, we built the "engine" of a machine learning model (our MajorityClassifier). However, an engine without fuel is just a heavy paperweight. In machine learning, that fuel is data.
Imagine you want to practice cooking, but your kitchen is empty. You cannot improve your technique without ingredients. In machine learning, data plays exactly that role.
The Problem: Finding, downloading, and formatting data into valid numerical arrays (matrices) takes a long time. Beginners often get stuck here before they even train a model.
The Solution: Scikit-learn includes the datasets module. It allows you to load high-quality "toy" datasets or fetch real-world data with a single line of code.
We want to test a classifier, but we don't have a CSV file handy. We want to load the classic Iris dataset (measurements of flowers) so we can immediately start working with a model.
The datasets module offers three main ways to get data:
- Loaders (load_*): Small "toy" datasets that come installed with the library. They load instantly. Examples: load_iris (flowers), load_digits (handwritten numbers).
- Fetchers (fetch_*): Larger, real-world datasets. The first time you call them, scikit-learn downloads them from the internet and saves them to your computer. Example: fetch_california_housing (house prices).
- Generators (make_*): Functions that use math to create synthetic random data. Great for testing weird scenarios. Example: make_blobs (creates clumps of data points).

Let's load the Iris dataset. This dataset contains measurements (features) of 150 iris flowers and their species (labels).
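Before we turn to the Iris loader, here is a quick sketch of the generator family. The parameter values below (100 samples, 3 clusters) are arbitrary choices for illustration:

```python
from sklearn.datasets import make_blobs

# Generate 100 synthetic points in 2 dimensions, grouped into 3 clusters.
# random_state makes the "random" data reproducible.
X, y = make_blobs(n_samples=100, n_features=2, centers=3, random_state=42)

print(X.shape)  # (100, 2) -> 100 points, 2 coordinates each
print(y.shape)  # (100,)   -> one cluster label per point
```

Because you control the number of clusters and features, generators are handy for testing how a model behaves on data you fully understand.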
We use the load_iris function.
from sklearn.datasets import load_iris
# Load the dataset
# result is a "Bunch" object (similar to a dictionary)
dataset = load_iris()
# Let's see what is inside
print(list(dataset.keys()))
Output: ['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', ...]
The dataset object holds everything we need:
- data: The measurements (the X).
- target: The species labels (the y).
- DESCR: A description of the data.
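Exploring a few of these keys makes the structure concrete. A short sketch:

```python
from sklearn.datasets import load_iris

dataset = load_iris()

# The names of the 4 measurements taken for each flower
print(dataset.feature_names)

# The names of the 3 species the labels refer to
print(dataset.target_names)  # ['setosa' 'versicolor' 'virginica']

# The raw numbers: 150 flowers, 4 measurements each
print(dataset.data.shape)    # (150, 4)
```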
Most of the time, we just want the data matrices (X) and the labels (y) to feed into our fit() function from Chapter 1. We can ask scikit-learn to return just these two.
# return_X_y=True separates the features and labels automatically
X, y = load_iris(return_X_y=True)
# X is a matrix (150 flowers, 4 measurements each)
print(f"X shape: {X.shape}")
# y is a vector (150 labels)
print(f"y shape: {y.shape}")
Output:
X shape: (150, 4)
y shape: (150,)
We now have perfectly formatted numerical arrays ready for a model!
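To close the loop with Chapter 1, here is a sketch that feeds X and y straight into an estimator's fit(). We use scikit-learn's built-in DummyClassifier as a stand-in for the MajorityClassifier we built there, since with strategy="most_frequent" it behaves the same way:

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier

X, y = load_iris(return_X_y=True)

# Like our MajorityClassifier: always predicts the most common label
model = DummyClassifier(strategy="most_frequent")
model.fit(X, y)

# Every prediction is the same (majority) class
print(model.predict(X[:3]))
```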
When you call load_iris(), it feels like magic, but it is simply reading a file stored deep inside the scikit-learn installation folder.
Here is what happens internally when you ask for a toy dataset.

The Bunch object
You noticed the function returned a Bunch. A Bunch is a custom object defined in sklearn/utils/_bunch.py. It is essentially a Python dictionary that allows you to access keys with a dot (.).
Instead of writing data['target'], you can write data.target.
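You can see both access styles side by side with a tiny hand-made Bunch (the keys here are just example values):

```python
from sklearn.utils import Bunch

b = Bunch(data=[1, 2, 3], target=[0, 1, 0])

print(b["data"])  # dictionary-style access: [1, 2, 3]
print(b.data)     # attribute-style access: the same object
```

Both lines return the same underlying list; the dot syntax is purely a convenience.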
Let's look at a simplified version of how a loader works inside sklearn/datasets/_base.py.
# Simplified logic similar to sklearn/datasets/_base.py
import numpy as np
from sklearn.utils import Bunch
from os.path import join

def load_simple_data(file_name):
    # 1. Locate the CSV file inside the package
    module_path = "sklearn/datasets/data"  # (Conceptual path)
    full_path = join(module_path, file_name)

    # 2. Read the data (often using numpy or python csv)
    # This converts text files into number arrays
    data = np.loadtxt(full_path, delimiter=',')

    # 3. Separate features (X) and target (y)
    # Assuming the last column is the target
    return Bunch(data=data[:, :-1], target=data[:, -1])
Explanation: the real loaders locate a data file bundled inside the package (often compressed, e.g. a .csv.gz file), parse it into NumPy arrays, and return the result wrapped in a Bunch object.
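To see the sketch above actually run, we can point it at a file we create ourselves instead of a file inside the scikit-learn package. The file name and values below are made up for the demo:

```python
import numpy as np
from os.path import join
from tempfile import mkdtemp
from sklearn.utils import Bunch

def load_simple_data(module_path, file_name):
    # Read a comma-separated file into a numeric array,
    # then split off the last column as the target
    data = np.loadtxt(join(module_path, file_name), delimiter=",")
    return Bunch(data=data[:, :-1], target=data[:, -1])

# Demo: write a tiny 3-row dataset (2 features + 1 label per row)
tmp = mkdtemp()
with open(join(tmp, "demo.csv"), "w") as f:
    f.write("5.1,3.5,0\n4.9,3.0,0\n6.3,3.3,1\n")

bunch = load_simple_data(tmp, "demo.csv")
print(bunch.data.shape)  # (3, 2)
print(bunch.target)      # [0. 0. 1.]
```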
For larger datasets (fetch_*), the internal code checks a local folder (usually ~/scikit_learn_data) before downloading.
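You can ask scikit-learn where that cache folder lives on your machine:

```python
from sklearn.datasets import get_data_home

# The directory where fetch_* functions cache downloaded datasets
print(get_data_home())  # typically something like ~/scikit_learn_data
```

Deleting this folder forces the next fetch_* call to re-download the data.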
In this chapter, we learned:
- load_iris gives us small, instant data.
- fetch_* downloads larger, real-world data.
- return_X_y=True splits a dataset directly into data (X) and target (y).

Now that we have the Base API (the blueprint) and Datasets (the fuel), we are ready to build our first mathematical model to actually solve a problem.