Welcome to Chapter 3! In the previous chapter, Model Architecture, we built our "Chef" (the neural network). We combined a pre-trained BERT model with a classifier.
However, right now, our Chef is an amateur. They have the potential to cook (the architecture), but they haven't practiced yet: the BERT part is pre-trained, but the classifier's weights are still random and nothing has been fine-tuned for our task. We need to train them.
Training a model is like sending an athlete to the gym.
If you have a massive dataset (millions of text files), training on a single computer is like one person trying to do 100,000 pushups. It will take weeks.
Distributed Training is the solution. Instead of one person, we hire a team of athletes (multiple CPUs or GPUs). We split the 100,000 pushups so that 10 people do 10,000 each simultaneously. In the ideal case they finish about 10x faster; in practice, coordinating the team costs a little of that speedup.
This chapter explains how we use a library called Ray to coordinate this team.
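We use a specific strategy called Data Parallelism: every worker holds a full copy of the model and trains on a different slice of the data, and the workers combine their gradients after each batch so that all copies stay identical.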
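To make the idea of Data Parallelism concrete, here is a minimal sketch in plain Python. It is a toy illustration, not how Ray or PyTorch implement it: the one-parameter model, the `gradient` helper, and the two hand-made shards are all assumptions made up for this example. It shows that when the shards are equally sized, averaging the workers' gradients gives the same result as computing the gradient on the full batch.

```python
# Toy illustration of data parallelism (not Ray's actual implementation).
# Hypothetical model: y_hat = w * x, with mean squared error as the loss.

def gradient(w, xs, ys):
    """Gradient of the mean squared error with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# One "athlete" doing all the work: gradient over the full batch.
full_grad = gradient(w, xs, ys)

# Two hypothetical workers, each with the same w but half of the data.
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]
worker_grads = [gradient(w, shard_x, shard_y) for shard_x, shard_y in shards]

# Combine the results by averaging the per-worker gradients.
avg_grad = sum(worker_grads) / len(worker_grads)

print(full_grad, avg_grad)
```

Because every worker applies the same averaged gradient, all copies of the model take the same update step and stay in sync.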
We don't want to manually manage cables connecting computers. We use Ray Train.
The main entry point is the TorchTrainer. Think of this as the "Coach" who manages the athletes. We tell the Coach how many workers we want and what training function they should run.
First, we define how big our team is using ScalingConfig.
from ray.train import ScalingConfig
# Define our gym capacity
scaling_config = ScalingConfig(
    num_workers=2,  # Number of "Athletes"
    use_gpu=False,  # Are we using GPUs?
    resources_per_worker={  # Hardware per athlete
        "CPU": 1,
        "GPU": 0,
    },
)
Explanation: Here we are asking for 2 workers, each using 1 CPU. Ray will automatically find these resources on your machine (or cluster).
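As a quick sanity check on what this configuration asks for, the sketch below multiplies the per-worker resources by the number of workers. The `total_worker_resources` helper is hypothetical and not part of Ray; it counts only the workers, and depending on the Ray version the coordinating trainer process may reserve resources of its own.

```python
def total_worker_resources(num_workers, resources_per_worker):
    """Multiply each per-worker resource by the number of workers."""
    return {
        name: amount * num_workers
        for name, amount in resources_per_worker.items()
    }

# Same numbers as the ScalingConfig above.
totals = total_worker_resources(
    num_workers=2,
    resources_per_worker={"CPU": 1, "GPU": 0},
)
print(totals)
```

If the machine or cluster cannot supply these totals, the training job will generally wait for resources rather than start.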
Next, we initialize the Trainer. We pass in the datasets and the function that contains the training logic (train_loop_per_worker).
from ray.train.torch import TorchTrainer
# Initialize the Coach
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"batch_size": 256, "num_epochs": 5},
    scaling_config=scaling_config,
    datasets={"train": train_ds, "val": val_ds},
)
# Start the workout!
results = trainer.fit()
Explanation: When we call trainer.fit(), Ray spins up the 2 workers we requested, sends the data to them, and executes the training loop.
What exactly happens inside train_loop_per_worker? This is the set of instructions every athlete follows.
Let's look at the flow of a single training step (one batch of data).
We define this logic in madewithml/train.py. Let's break down the code into small, understandable pieces.
When the worker starts, it needs to know which piece of the data belongs to it.
# Inside train_loop_per_worker()
import ray.train as train
import ray.train.torch  # makes train.torch.prepare_model available

def train_loop_per_worker(config):
    # 1. Get the specific slice of data for this worker
    train_ds = train.get_dataset_shard("train")
    # 2. Define the Model (The Chef)
    model = FinetunedLLM(...)
    # 3. Prepare model for distributed work
    model = train.torch.prepare_model(model)
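The data slicing in step 1 above can be pictured with a small plain-Python sketch. This is only an illustration of the idea: the `shard` helper and the round-robin split are assumptions for this example, and Ray's own splitting logic works differently under the hood.

```python
def shard(rows, num_workers, rank):
    """Return every num_workers-th row, starting at this worker's rank."""
    return rows[rank::num_workers]

rows = list(range(12))  # stand-in for 12 training examples
num_workers = 4

# Each hypothetical worker asks for its own shard by rank (0..3).
shards = [shard(rows, num_workers, rank) for rank in range(num_workers)]

for rank, worker_rows in enumerate(shards):
    print(f"worker {rank}: {worker_rows}")
```

Every worker ends up with a quarter of the rows, no row appears twice, and together the shards cover the whole dataset.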
get_dataset_shard: This is magic. If there are 4 workers, Ray ensures this function returns only the 1/4th of the data that belongs to this specific worker.
prepare_model: This wraps our PyTorch model (and moves it to the right device) so it knows how to communicate with the other workers and keep every copy in sync by sharing gradients.
Now we loop through the data. This is the core mathematical "workout."
# Inside a helper function: train_step()
for i, batch in enumerate(ds_generator):
    optimizer.zero_grad()  # 1. Clear previous calculations
    z = model(batch)  # 2. Forward Pass (Make a guess)
    # 3. Calculate Error (Loss)
    # Compare guess (z) vs actual answer (targets)
    J = loss_fn(z, batch["targets"])
    J.backward()  # 4. Backward Pass (Calculate corrections)
    optimizer.step()  # 5. Update the model's brain
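The five steps in the loop above can be traced by hand on a tiny model. The sketch below is a plain-Python stand-in, assuming a one-parameter model `z = w * x`, a mean squared error loss, and a made-up learning rate; none of these come from the project's code, and real training would use PyTorch's autograd instead of the hand-written gradient.

```python
# Hypothetical data: the true relationship is y = 2 * x.
batches = [
    ([1.0, 2.0], [2.0, 4.0]),
    ([3.0, 4.0], [6.0, 8.0]),
]

w = 0.0     # the model's single weight, starting untrained
lr = 0.01   # learning rate (an assumed value)
losses = []

for epoch in range(20):
    for xs, ys in batches:
        grad = 0.0                                   # 1. clear previous gradient
        z = [w * x for x in xs]                      # 2. forward pass
        errors = [zi - yi for zi, yi in zip(z, ys)]
        J = sum(e * e for e in errors) / len(xs)     # 3. loss
        grad = sum(2 * e * x for e, x in zip(errors, xs)) / len(xs)  # 4. backward pass
        w -= lr * grad                               # 5. update the weight
    losses.append(J)  # loss of the last batch in this epoch

print(round(w, 3), losses[0], losses[-1])
```

The weight moves toward 2 and the recorded loss shrinks from epoch to epoch, which is the behaviour we hope to see from the real model as well.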
Loss (J): A mathematical score of how wrong the model was. High loss = bad guess.
After every "epoch" (going through the whole dataset once), we need to check if the model is actually learning or just memorizing. We test it on the Validation Set.
# Inside a helper function: eval_step()
model.eval()  # Switch to "Test Mode"
with torch.inference_mode():
    for batch in ds_generator:
        z = model(batch)
        # Calculate loss but DO NOT update weights
        J = loss_fn(z, batch["targets"])
        # Save predictions to compare later
        y_preds.extend(torch.argmax(z, dim=1))
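The last line of the loop above collects the predicted class for each example so it can be compared with the true label later. Here is a minimal plain-Python sketch of that comparison; the `argmax` helper, the logits, and the labels are made-up values for illustration, standing in for `torch.argmax` and real model outputs.

```python
def argmax(values):
    """Index of the largest value, like torch.argmax on a single row."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical model outputs: one row of class scores per example.
logits = [
    [2.0, 0.1, -1.0],
    [0.3, 0.2, 1.5],
    [0.0, 3.0, 0.5],
    [1.2, 1.1, 0.9],
]
y_trues = [0, 2, 1, 1]  # hypothetical correct classes

# The predicted class is the one with the highest score.
y_preds = [argmax(row) for row in logits]

# Accuracy: the fraction of predictions that match the true class.
accuracy = sum(p == t for p, t in zip(y_preds, y_trues)) / len(y_trues)
print(y_preds, accuracy)
```

Accuracy is only one possible metric; the same predictions could feed precision, recall, or any other score.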
We use model.eval() and torch.inference_mode() to switch off training-only behavior such as dropout and gradient tracking, so evaluation cannot change the weights. We just want to measure performance on the validation data.
If our computer crashes after 3 hours of training, we don't want to start over. We save the model's state periodically.
from ray.train import Checkpoint
# Save the model to a temporary folder
model.save(dp=dp)
# Create a Checkpoint object
checkpoint = Checkpoint.from_directory(dp)
# Report progress to the Head Coach
train.report({"loss": val_loss}, checkpoint=checkpoint)
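The save-then-report pattern above rests on a simple idea: write the training state to a directory, then read it back to resume. The sketch below shows that idea with only the standard library. The `save_checkpoint` and `load_checkpoint` helpers, the `state.json` file name, and the stored values are all hypothetical; Ray's `Checkpoint` and the model's own `save` method handle this in the real code.

```python
import json
import tempfile
from pathlib import Path

def save_checkpoint(directory, state):
    """Write the training state to a JSON file inside the directory."""
    path = Path(directory) / "state.json"
    path.write_text(json.dumps(state))
    return path

def load_checkpoint(directory):
    """Read the training state back from the directory."""
    return json.loads((Path(directory) / "state.json").read_text())

# Made-up training state after epoch 3.
state = {"epoch": 3, "weights": [0.1, -0.4], "val_loss": 0.52}

with tempfile.TemporaryDirectory() as dp:
    save_checkpoint(dp, state)
    restored = load_checkpoint(dp)  # what we would do after a crash

next_epoch = restored["epoch"] + 1  # resume from here instead of epoch 0
print(restored, next_epoch)
```

A temporary directory is enough here because the sketch reloads immediately; a real run needs the checkpoint copied somewhere durable before the directory disappears.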
train.report: This sends the current loss score and the saved checkpoint back to the central Ray process. It allows us to track graphs of how our model is improving over time.
When we run the command to train our model, here is the sequence of events:
TorchTrainer spawns workers based on our ScalingConfig.
We have successfully taken our "amateur" model and put it through a rigorous gym session using Distributed Training. By using Ray, we scaled this process across multiple workers, making it fast and efficient.
But wait: we picked somewhat arbitrary values for our training configuration (like the "Learning Rate" or "Batch Size"). How do we know those were the best values to use? Maybe a different learning rate would make the model smarter?
To answer that, we need to run experiments to find the perfect settings.
Next Step: Hyperparameter Tuning
Generated by Code IQ