Concept #13 | Medium | gen-ai-fundamentals

What metrics would you track during model training?

#gen-ai #mlops #training

Answer

Training Metrics & Monitoring

Effective monitoring of a fine-tuning run is often what separates a successful model from a wasted GPU budget.

Core Training Metrics

| Metric | What it measures | Healthy range |
| --- | --- | --- |
| Training loss | Error on training data (cross-entropy) | Steadily decreasing |
| Validation loss | Error on held-out data | Decreasing, then plateauing |
| Perplexity | exp(loss): the model's "surprise" | Lower is better |
| Learning rate | Step size for gradient updates | Follows schedule (warmup → decay) |
| Gradient norm | Magnitude of gradients | ≤ 1.0 after clipping, with no sudden spikes |
| GPU utilisation | Hardware efficiency | > 80% |
| Tokens/second | Training throughput | As high as possible |
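Because perplexity is just exp(loss), it can be derived from any logged loss value rather than tracked separately. A minimal sketch, assuming the loss is a mean per-token cross-entropy in nats (the function name `perplexity` is illustrative, not part of any library):

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


# Example loss values; a loss of 0 corresponds to a perplexity of exactly 1.
for loss in (2.2, 1.7, 1.1):
    print(f"loss={loss:.1f} -> perplexity={perplexity(loss):.2f}")
```

Since the mapping is monotonic, loss and perplexity always rank checkpoints the same way; perplexity is mainly easier to read.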

Overfitting Detection

text
Training loss:    2.1 → 1.8 → 1.5 → 1.3 → 1.1 (good)
Validation loss:  2.2 → 1.9 → 1.7 → 1.8 → 2.1 (overfitting — val loss increases!)

Fix: Stop early at the checkpoint with the lowest validation loss, increase dropout, reduce the number of epochs, or add more training data.
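Early stopping can be expressed as a patience rule on the validation loss. The sketch below is a plain-Python illustration applied to the validation losses from the example above; `early_stopping_step`, `patience` and `min_delta` are hypothetical names chosen for this example, not a library API.

```python
def early_stopping_step(val_losses, patience=2, min_delta=0.0):
    """Return the index of the evaluation at which training would stop, or None.

    Training stops once `patience` consecutive evaluations fail to improve
    on the best validation loss seen so far by more than `min_delta`.
    """
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None


val_losses = [2.2, 1.9, 1.7, 1.8, 2.1]
stop_at = early_stopping_step(val_losses, patience=2)
# The checkpoint worth keeping is the one with the lowest validation loss.
best_at = min(range(len(val_losses)), key=val_losses.__getitem__)
print(f"stop at evaluation {stop_at}, keep checkpoint from evaluation {best_at}")
```

With the Hugging Face Trainer, `EarlyStoppingCallback(early_stopping_patience=...)` plays a similar role; it expects `load_best_model_at_end=True` and a `metric_for_best_model` to be set.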

Using WandB for Experiment Tracking

python
import wandb
from transformers import TrainingArguments, Trainer

wandb.init(
    project="llm-fine-tuning",
    name="llama2-7b-instruction-v1",
    config={
        "model": "meta-llama/Llama-2-7b-hf",
        "lora_r": 16,
        "learning_rate": 2e-4,
        "epochs": 3,
    }
)

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # Effective batch = 32
    learning_rate=2e-4,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    logging_steps=10,
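    eval_strategy="steps",           # Needed for eval_steps and load_best_model_at_end; older transformers releases name this evaluation_strategy
    eval_steps=100,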
    save_steps=500,
    report_to="wandb",               # Automatic WandB logging
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)
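The batch and warmup settings above translate into concrete step counts, which helps when reading the logged curves. A rough sketch of the arithmetic, assuming a hypothetical dataset of 10,000 examples on a single GPU; the Trainer's own rounding may differ slightly, so treat the numbers as estimates:

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum,
                      num_devices, epochs, warmup_ratio):
    """Estimate optimizer step counts from batch and schedule settings."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# num_examples and num_devices are assumptions made for illustration only.
schedule = training_schedule(
    num_examples=10_000, per_device_batch=4, grad_accum=8,
    num_devices=1, epochs=3, warmup_ratio=0.1,
)
print(schedule)
```

Under these assumptions, `logging_steps=10` and `eval_steps=100` give many loss points but only a handful of evaluations per epoch, which is worth knowing before judging a validation curve.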

Common Training Problems & Fixes

| Problem | Symptom | Fix |
| --- | --- | --- |
| Overfitting | Val loss increases while train loss decreases | Reduce epochs, add dropout, more data |
| Loss explosion | Loss suddenly goes to NaN or very high | Reduce LR, clip gradients, check data |
| Gradient norm explosion | Instability | Clip gradients (`max_grad_norm=1.0`) |
| Slow convergence | Loss barely decreasing | Increase LR, check data quality |
| GPU OOM | Out of memory error | Reduce batch size, use gradient checkpointing |
| Underfitting | Both train and val loss high | More epochs, larger model, better data |
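To make the gradient clipping fix concrete, here is a plain-Python sketch of clipping by global L2 norm, the idea behind a setting such as `max_grad_norm=1.0`. It works on a flat list of numbers rather than real tensors, and `clip_by_global_norm` is an illustrative name, not a framework function.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale gradients so their global L2 norm is at most `max_norm`.

    Returns (clipped_grads, norm_before_clipping).
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if not math.isfinite(total_norm):
        # NaN or inf gradients cannot be rescued by clipping.
        raise ValueError("non-finite gradient norm; check data and learning rate")
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


grads = [3.0, 4.0]  # global norm is 5.0, well above the threshold
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(f"norm before={norm_before:.3f}, after={norm_after:.3f}")
```

The pre-clipping norm is the more informative value to log: the clipped norm is capped by construction, so spikes only show up in the value measured before scaling.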

Learning Rate Schedule

python
from transformers import TrainingArguments

# Cosine schedule with warmup (standard for LLM fine-tuning)
training_args = TrainingArguments(
    output_dir="./output",      # Checkpoint directory
    warmup_ratio=0.1,           # Warm up for 10% of steps
    lr_scheduler_type="cosine", # Decay to near-zero
    learning_rate=2e-4,         # Peak LR (for LoRA; use 1e-5 for full FT)
)
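The shape of this schedule is easy to reproduce by hand, which is useful for checking that the logged learning rate looks as expected. A minimal sketch of linear warmup followed by cosine decay to zero; `cosine_lr_with_warmup` is a hypothetical helper, and the library's implementation may differ in rounding details:

```python
import math


def cosine_lr_with_warmup(step, total_steps, peak_lr=2e-4, warmup_ratio=0.1):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # assumed run length, for illustration only
for step in (0, 50, 100, 550, 1000):
    print(f"step {step:4d}: lr={cosine_lr_with_warmup(step, total_steps):.2e}")
```

If the learning rate logged during a run does not rise for the first 10% of steps and then fall smoothly, the schedule settings are probably not being applied as intended.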

Production rule: Never rely on training loss alone. Always evaluate on your downstream task — a lower perplexity does not always mean a better assistant.