Welcome to the Made With ML project!
Before we can build intelligent machines, we need to speak their language. This chapter focuses on the Data Processing Pipeline.
Imagine you are a chef in a high-end restaurant. You want to cook a delicious meal (train a model). However, your ingredients (data) just arrived from the farm: the potatoes are covered in dirt, the carrots have stems, and everything is in different sizes.
You cannot just throw a dirty potato into the oven. You need a Prep Kitchen.
In Machine Learning, this is often called ETL (Extract, Transform, Load). Our goal is to turn raw text files into numerical "tensors" that our model can understand.
First, we need to load our dataset. We are using a CSV file containing titles, descriptions, and tags for various machine learning projects.
We use a library called Ray to load data efficiently, which will help us scale up later.
import ray
# Load the data from a CSV file
ds = ray.data.read_csv("datasets/tags.csv")
# Shuffle the data (randomize the order)
ds = ds.random_shuffle(seed=1234)
# Take a look at the first item
print(ds.take(1))
Result: a list containing one dictionary per row, holding the project's title, description, and tag as raw text.
Before we process the data, we must split it.
We use stratified splitting. This ensures that if 10% of our data is about "computer-vision", our training set and test set each contain roughly 10% "computer-vision" examples, preserving the class balance on both sides of the split.
from madewithml.data import stratify_split
# Split into train (80%) and test (20%)
# Stratify ensures balanced classes based on the "tag" column
train_ds, test_ds = stratify_split(ds, stratify="tag", test_size=0.2)
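To build intuition for what stratify_split is doing, here is a minimal pandas sketch of a stratified split (an illustration of the idea, not the library's actual implementation):

import pandas as pd

def stratified_split_df(df, stratify_col, test_size, seed=1234):
    train_parts, test_parts = [], []
    # Split each class independently so its proportion is preserved in both splits
    for _, group in df.groupby(stratify_col):
        group = group.sample(frac=1.0, random_state=seed)  # shuffle within the class
        n_test = int(len(group) * test_size)
        test_parts.append(group.iloc[:n_test])
        train_parts.append(group.iloc[n_test:])
    return pd.concat(train_parts), pd.concat(test_parts)

Because every class is split with the same ratio, no tag can end up over- or under-represented in either split.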
Raw text is messy. It contains capitalization, punctuation, and "stopwords" (common words like "the", "and", "is" that don't add unique meaning to the topic).
We need to scrub the text clean.
import re

# A small set of common stopwords (a real pipeline would use a fuller list, e.g. from NLTK)
STOPWORDS = {"the", "and", "is", "a", "an", "of", "to", "in"}

def clean_text(text, stopwords=STOPWORDS):
    # Lowercase everything
    text = text.lower()
    # Remove special characters (keep only alphanumeric)
    text = re.sub(r"[^a-z0-9]+", " ", text)
    # Remove stopwords
    words = [word for word in text.split() if word not in stopwords]
    return " ".join(words)
# Example input: "The Great Computer-Vision Project!"
# Example output: "great computer vision project"
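You can sanity-check clean_text against the example above:

print(clean_text("The Great Computer-Vision Project!"))  # great computer vision project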
This is the most critical step. Models do not understand words; they understand numbers. Tokenization converts words into numerical IDs.
We use a BERT tokenizer (specifically SciBERT, a BERT variant trained on scientific text). It splits each word into one or more subword pieces, looks each piece up in a fixed vocabulary, and replaces it with a unique ID number.
from transformers import BertTokenizer
# Load a pre-trained tokenizer
tokenizer = BertTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
# Convert text to numbers
text = "computer vision"
output = tokenizer(text, return_tensors="np")
print(output["input_ids"])
# Output might look like: [[101, 3452, 8910, 102]] (a 2-D array with one row per input text)
Note: The extra numbers at the start (101) and end (102) are the IDs of BERT's special [CLS] and [SEP] tokens, which mark the beginning and end of the sequence for the model.
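To see what those IDs stand for, you can inspect the tokens and decode the IDs back into text (the exact subword splits depend on the SciBERT vocabulary):

# Show the subword tokens the tokenizer produced
print(tokenizer.tokenize("computer vision"))
# Decode the first (and only) row of IDs back into text
print(tokenizer.decode(output["input_ids"][0]))  # "[CLS] computer vision [SEP]"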
To keep our code clean, we wrap all these steps into a single class called CustomPreprocessor. This acts as our "Head Chef" in the prep kitchen, ensuring every piece of data is treated exactly the same way.
The preprocessor handles: combining the title and description into a single text field, cleaning the text, encoding the tags as integers, and tokenizing.
Here is what happens when we run our pipeline:
In our file madewithml/data.py, we define the preprocess function which ties the cleaning and tokenization together.
# From madewithml/data.py
def preprocess(df, class_to_index):
    # 1. Feature engineering: combine title and description
    df["text"] = df.title + " " + df.description
    # 2. Clean the text
    df["text"] = df.text.apply(clean_text)
    # 3. Convert tags to numbers (label encoding)
    df["tag"] = df["tag"].map(class_to_index)
    # 4. Tokenize
    outputs = tokenize(df)
    return outputs
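The tokenize helper in step 4 lives alongside preprocess in madewithml/data.py. Here is a minimal sketch of what it can look like, assuming the SciBERT tokenizer loaded earlier (the repo's actual implementation may differ in its details):

import numpy as np

def tokenize(df):
    # Batch-encode the cleaned text into padded arrays of token IDs
    encoded = tokenizer(df["text"].tolist(), padding="longest", return_tensors="np")
    # Return model-ready arrays: token IDs, attention masks, and integer labels
    return dict(
        ids=encoded["input_ids"],
        masks=encoded["attention_mask"],
        targets=np.array(df["tag"]),
    )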
Finally, the CustomPreprocessor class manages the whole flow. It learns the tags during fit (on training data) and applies the changes during transform.
class CustomPreprocessor:
    def fit(self, ds):
        # Learn all unique tags from the training dataset
        tags = ds.unique(column="tag")
        # Create a dictionary map: {"computer-vision": 0, "mlops": 1, ...}
        self.class_to_index = {tag: i for i, tag in enumerate(tags)}
        return self

    def transform(self, ds):
        # Apply the preprocess function to each batch of rows
        return ds.map_batches(
            preprocess,
            fn_kwargs={"class_to_index": self.class_to_index},
            batch_format="pandas",  # preprocess expects a pandas DataFrame
        )
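Putting it together, a typical usage pattern looks like this (a sketch using the train_ds and test_ds splits from earlier):

preprocessor = CustomPreprocessor()
preprocessor = preprocessor.fit(train_ds)      # learn class_to_index from training data only
train_ds = preprocessor.transform(train_ds)    # apply the same mapping to both splits
test_ds = preprocessor.transform(test_ds)

Fitting only on the training data keeps information about the test set from leaking into the label mapping.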
Congratulations! You have successfully built a pipeline that takes raw, messy text and converts it into clean, organized numerical data.
In our kitchen analogy, the ingredients are washed, peeled, chopped, and measured. They are now ready for the Chef.
In the next chapter, we will build the "Chef": the neural network itself.
Next Step: Model Architecture