Welcome to Chapter 6!
In the previous chapter, Inference & Prediction, we built a "Universal Translator." We successfully loaded our trained model and used it to predict tags for new text.
But here is the big question: Is our model actually smart, or is it just guessing?
Just because the model gives an answer doesn't mean it's right. In this chapter, we will act as the Exam Proctor. We will give our model a "Final Exam" (a dataset it has never seen before) and grade it rigorously to generate a "Report Card."
Imagine a student taking a math class. If they score 90% on the final exam, we might call them a good student.
But what if the exam was 90% simple addition and 10% complex calculus? If the student got all the addition right but failed every calculus problem, they aren't actually good at math; they are just good at addition.
To trust a model in the real world, we need a detailed Report Card:
We don't rely on "Accuracy" alone. We use three specific metrics, precision, recall, and F1 score, to understand how and where the model is right or wrong.
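As a quick illustration of these three metrics (on made-up labels, not the chapter's dataset), scikit-learn computes all of them at once:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Made-up answer key and predictions for a 2-class problem
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# Each class has precision = recall = 3/4 here, so all three
# weighted scores come out to 0.75
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"
)
print(precision, recall, f1)
```

The fourth return value (per-class support) is ignored here; we will use the same function for both the overall and per-class report card below.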
To evaluate the model, we use the Test Set. This is data we set aside in Data Processing Pipeline. The model has never seen this data during training.
We load our best model checkpoint (from Chapter 5).
import ray
from madewithml import predict
from madewithml.predict import TorchPredictor
# 1. Load the Test Set (The Exam)
ds = ray.data.read_csv("datasets/test_dataset.csv")
# 2. Load the Student (Best Model)
# run_id identifies the training run from Chapter 5
best_checkpoint = predict.get_best_checkpoint(run_id=run_id)
predictor = TorchPredictor.from_checkpoint(best_checkpoint)
We run the predictor on the test set. We then extract two lists:

- y_true: The correct answers (from the dataset).
- y_pred: The student's answers (from the model).

import numpy as np
# 1. Get the Answer Key (True Labels)
preprocessor = predictor.get_preprocessor()
preprocessed_ds = preprocessor.transform(ds)
values = preprocessed_ds.select_columns(cols=["targets"]).take_all()
y_true = np.stack([item["targets"] for item in values])
# 2. Get the Student's Answers (Predictions)
predictions = preprocessed_ds.map_batches(predictor).take_all()
y_pred = np.array([d["output"] for d in predictions])
Note: We extract y_pred as a simple list of numbers (e.g., [0, 1, 0...]) to compare easily with the answer key.
Now we calculate the scores. We use a library called scikit-learn to do the heavy math.
This gives us the "Class Average."
from sklearn.metrics import precision_recall_fscore_support
def get_overall_metrics(y_true, y_pred):
# Calculate weighted average of precision, recall, and F1
metrics = precision_recall_fscore_support(y_true, y_pred, average="weighted")
return {
"precision": metrics[0],
"recall": metrics[1],
"f1": metrics[2],
}
This tells us if the model is failing specific subjects.
def get_per_class_metrics(y_true, y_pred, class_to_index):
# Calculate metrics for EACH class separately (average=None)
metrics = precision_recall_fscore_support(y_true, y_pred, average=None)
per_class = {}
for i, _class in enumerate(class_to_index):
per_class[_class] = {
"precision": metrics[0][i],
"recall": metrics[1][i],
"f1": metrics[2][i],
}
return per_class
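For intuition, here is the same average=None call on toy labels (the labels and the class_to_index mapping below are illustrative, not the chapter's real data):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy 3-class problem: class 2 is never predicted at all
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 0])

# average=None returns one array per metric, indexed by class label
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average=None, zero_division=0
)

class_to_index = {"computer-vision": 0, "mlops": 1, "nlp": 2}
for _class, i in class_to_index.items():
    print(_class, round(f1[i], 2))
```

Notice how the never-predicted class scores 0.0, a "failing subject" that a single weighted average would partially hide.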
Sometimes, aggregate metrics hide the truth. A model might be great at "NLP", but terrible at "NLP projects that use LLMs."
Slicing allows us to programmatically define subsets of data and grade them separately. We use a tool called Snorkel to do this easily.
A "Slice" is just a Python function that returns True if a data row belongs to that group.
Example 1: Short Text Does the model fail when the description is too short?
from snorkel.slicing import slicing_function
@slicing_function()
def short_text(x):
"""Projects with titles/descriptions less than 8 words."""
return len(x.text.split()) < 8
Example 2: NLP projects using LLMs Does the model struggle with the specific jargon of Large Language Models?
@slicing_function()
def nlp_llm(x):
"""NLP projects that explicitly mention LLMs or BERT."""
# Check if it is an NLP project
is_nlp = "natural-language-processing" in x.tag
# Check for keywords
has_llm_terms = any(term in x.text.lower() for term in ["llm", "bert"])
return is_nlp and has_llm_terms
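Stripped of the Snorkel decorator, these slicing functions are plain Python predicates over a row. A minimal sketch, using a hypothetical row object in place of a real dataframe row:

```python
from types import SimpleNamespace

def short_text_logic(x):
    """Same predicate as the short_text slicing function."""
    return len(x.text.split()) < 8

def nlp_llm_logic(x):
    """Same predicate as the nlp_llm slicing function."""
    is_nlp = "natural-language-processing" in x.tag
    has_llm_terms = any(term in x.text.lower() for term in ["llm", "bert"])
    return is_nlp and has_llm_terms

# Hypothetical row mimicking one record of the projects dataframe
row = SimpleNamespace(
    text="Fine-tuning BERT for project tag prediction",
    tag="natural-language-processing",
)
print(short_text_logic(row))  # True: only 6 words
print(nlp_llm_logic(row))     # True: NLP project mentioning BERT
```

A single row can belong to several slices at once, which is exactly why we grade each slice independently.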
We apply these functions to our dataframe. If a row is "Short Text," we check the model's accuracy on just that row.
from snorkel.slicing import PandasSFApplier
def get_slice_metrics(y_true, y_pred, ds):
# 1. Convert dataset to Pandas for easy manipulation
df = ds.to_pandas()
# 2. Apply the slice functions
applier = PandasSFApplier([nlp_llm, short_text])
slices = applier.apply(df)
# 3. Calculate metrics for each slice
# (Logic omitted for brevity: we filter y_true/y_pred by the slice mask)
return slice_metrics
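The omitted filtering logic can be sketched as follows. PandasSFApplier.apply() returns a NumPy record array with one 0/1 field per slicing function, so each field works as a boolean mask over the test set (the helper name compute_slice_metrics and the toy masks below are ours, for illustration):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def compute_slice_metrics(y_true, y_pred, slices):
    """Grade each slice separately using its 0/1 mask."""
    slice_metrics = {}
    for slice_name in slices.dtype.names:
        mask = slices[slice_name].astype(bool)
        if not mask.sum():
            continue  # no rows fell into this slice
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true[mask], y_pred[mask], average="weighted", zero_division=0
        )
        slice_metrics[slice_name] = {
            "precision": precision, "recall": recall, "f1": f1}
    return slice_metrics

# Toy masks standing in for applier.apply(df)
slices = np.array(
    [(1, 0), (1, 1), (0, 1)],
    dtype=[("short_text", np.int64), ("nlp_llm", np.int64)],
)
y_true = np.array([0, 1, 1])
y_pred = np.array([0, 1, 0])
print(compute_slice_metrics(y_true, y_pred, slices))
```

Slices with zero members are skipped so we never divide by an empty group.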
What happens when we run the full evaluation command?
The result is a structured JSON file that we can save or log.
{
"overall": {
"precision": 0.94,
"recall": 0.93,
"f1": 0.93
},
"per_class": {
"computer-vision": { "f1": 0.98 },
"mlops": { "f1": 0.75 }
},
"slices": {
"short_text": { "f1": 0.60 },
"nlp_llm": { "f1": 0.95 }
}
}
Interpretation:

- Overall, the model looks strong (F1 ≈ 0.93), but the average hides problems.
- Per class: "computer-vision" is excellent (F1 0.98), while "mlops" lags behind (F1 0.75).
- Slices: the model struggles on "short_text" (F1 0.60), even though "nlp_llm" is fine (F1 0.95).

Actionable Insight: We need to collect more data examples of "Short Text" and "MLOps" projects to retrain the model and fix these weaknesses.
We have successfully audited our model. We didn't just accept a single accuracy score; we broke it down by class and by specific data slices.
Now we know exactly where our model is strong and where it is weak. This gives us the confidence to put it into the real world.
But currently, this model only lives in our Python script. If we want to let the whole world use it (like a website or an app), we need to wrap it up as a web service.
Next Step: Model Serving