What metrics would you track during model training?
#gen-ai #mlops #training
Answer
Training Metrics & Monitoring
Effective monitoring of a fine-tuning run is what separates a successful model from a wasted GPU budget.
Core Training Metrics
| Metric | What it measures | Healthy range |
|---|---|---|
| Training loss | Error on training data (cross-entropy) | Steadily decreasing |
| Validation loss | Error on held-out data | Decreasing, then plateauing |
| Perplexity | exp(loss) — model's "surprise" | Lower is better |
| Learning rate | Step size for gradient updates | Follows schedule (warmup → decay) |
| Gradient norm | Magnitude of gradients | Stable, no sudden spikes; at most the clipping threshold (e.g. 1.0) after clipping |
| GPU utilisation | Hardware efficiency | > 80% |
| Tokens/second | Training throughput | Maximise this |
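
The perplexity row can be made concrete in a few lines. This is a minimal sketch that assumes the logged loss is the mean per-token cross-entropy in nats (natural log); the function name and the sample loss values are illustrative, not taken from a real run.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


# Illustrative validation losses, one per evaluation
for loss in (2.2, 1.9, 1.7):
    print(f"loss={loss:.2f}  perplexity={perplexity(loss):.2f}")
```

Because perplexity is a monotonic function of the loss, it carries no extra information; it is mainly easier to read as "the model is about as uncertain as a uniform choice among N tokens".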
Overfitting Detection
```text
Training loss:   2.1 → 1.8 → 1.5 → 1.3 → 1.1  (good)
Validation loss: 2.2 → 1.9 → 1.7 → 1.8 → 2.1  (overfitting: val loss increases!)
```
Fix: Early stopping, increase dropout, reduce epochs, add more training data.
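
Early stopping can be sketched as a patience counter over the validation losses. The helper below is hypothetical (its name, `patience`, and `min_delta` are assumptions for illustration) and is applied to the validation series shown above.

```python
from typing import Optional, Sequence


def early_stop_index(val_losses: Sequence[float], patience: int = 2,
                     min_delta: float = 0.0) -> Optional[int]:
    """Return the index of the evaluation at which training would stop,
    or None if the patience budget is never exhausted."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # New best validation loss: reset the counter
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None


val_losses = [2.2, 1.9, 1.7, 1.8, 2.1]
stop_at = early_stop_index(val_losses, patience=2)
best_at = min(range(len(val_losses)), key=val_losses.__getitem__)
print(f"stop at evaluation {stop_at}, restore checkpoint from evaluation {best_at}")
```

With the Hugging Face `Trainer`, the same idea should be available through `EarlyStoppingCallback(early_stopping_patience=...)`, used together with `load_best_model_at_end=True`.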
Using WandB for Experiment Tracking
```python
import wandb
from transformers import TrainingArguments, Trainer

# Start a WandB run and record the hyperparameters for this experiment
wandb.init(
    project="llm-fine-tuning",
    name="llama2-7b-instruction-v1",
    config={
        "model": "meta-llama/Llama-2-7b-hf",
        "lora_r": 16,
        "learning_rate": 2e-4,
        "epochs": 3,
    },
)

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,     # Effective batch = 4 x 8 = 32 per device
    learning_rate=2e-4,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    logging_steps=10,
    # Without a step-based eval strategy, eval_steps is ignored and
    # load_best_model_at_end fails. Named evaluation_strategy in transformers < 4.41.
    eval_strategy="steps",
    eval_steps=100,
    save_steps=500,                    # Must be a multiple of eval_steps
    report_to="wandb",                 # Automatic WandB logging
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)
```
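
To sanity-check a configuration like this one before launching, it helps to work out the step counts by hand. The sketch below reuses the batch size, accumulation, epoch, warmup, and eval values from the configuration; `num_devices` and `num_train_examples` are assumptions made up for illustration, and the step count is an approximation rather than the exact number the `Trainer` would report.

```python
import math

# Values from the training configuration
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
num_train_epochs = 3
warmup_ratio = 0.1
eval_steps = 100

# Assumptions for illustration only
num_devices = 1
num_train_examples = 10_000

# Examples consumed per optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Approximate optimizer steps per epoch and in total
steps_per_epoch = math.ceil(num_train_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
# Steps spent in linear warmup, and number of evaluations that will run
warmup_steps = math.ceil(total_steps * warmup_ratio)
num_evals = total_steps // eval_steps

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {total_steps} ({steps_per_epoch} per epoch)")
print(f"warmup steps: {warmup_steps}, evaluations: {num_evals}")
```

If `num_evals` comes out very small, `eval_steps` is too coarse for the run length and overfitting may go unnoticed between evaluations.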
Common Training Problems & Fixes
| Problem | Symptom | Fix |
|---|---|---|
| Overfitting | Val loss increases while train loss decreases | Reduce epochs, add dropout, more data |
| Loss explosion | Loss suddenly goes to NaN or very high | Reduce LR, clip gradients, check data |
| Gradient norm explosion | Gradient norm spikes, unstable loss | Clip gradients (e.g. `max_grad_norm=1.0`) |
| Slow convergence | Loss barely decreasing | Increase LR, check data quality |
| GPU OOM | Out of memory error | Reduce batch size, use gradient checkpointing |
| Underfitting | Both train and val loss high | More epochs, larger model, better data |
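
Gradient clipping, the fix listed for exploding gradients, rescales all gradients together when their combined norm exceeds a threshold. The following is a dependency-free sketch of clipping by global L2 norm, written to resemble what `torch.nn.utils.clip_grad_norm_` does; the function name and the toy gradients are assumptions for illustration.

```python
import math
from typing import List, Tuple


def clip_by_global_norm(grads: List[List[float]],
                        max_norm: float = 1.0) -> Tuple[List[List[float]], float]:
    """Scale all gradients so their combined L2 norm is at most max_norm.

    Returns (clipped_grads, norm_before_clipping)."""
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # Scale factor is capped at 1.0 so small gradients are left untouched
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm


# Toy gradients for two parameter tensors
grads = [[3.0, 4.0], [0.0, 12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for tensor in clipped for g in tensor))
print(f"norm before: {norm_before:.2f}, norm after: {norm_after:.2f}")
```

The norm before clipping is the more useful value to plot: a run whose pre-clip norm keeps spiking is unstable even if clipping hides the effect on the weights.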
Learning Rate Schedule
```python
from transformers import TrainingArguments

# Cosine schedule with warmup (a common default for LLM fine-tuning)
training_args = TrainingArguments(
    output_dir="./output",
    warmup_ratio=0.1,              # Warm up for 10% of steps
    lr_scheduler_type="cosine",    # Decay to near-zero
    learning_rate=2e-4,            # Peak LR (for LoRA; use around 1e-5 for full fine-tuning)
)
```
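
To see what the learning-rate curve should look like in the dashboard, the schedule can be computed directly. This is a minimal sketch of linear warmup followed by cosine decay to zero, assuming a peak of 2e-4 and a hypothetical run of 1,000 optimizer steps with 100 warmup steps (10%); the function name is illustrative.

```python
import math


def cosine_lr_with_warmup(step: int, total_steps: int, warmup_steps: int,
                          peak_lr: float = 2e-4) -> float:
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, from 0.0 to 1.0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps, warmup_steps = 1000, 100  # hypothetical run length
for step in (0, 50, 100, 550, 1000):
    lr = cosine_lr_with_warmup(step, total_steps, warmup_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

If the logged learning rate does not follow this shape (for example, it stays flat or never reaches the peak), the scheduler or the total step count is probably misconfigured.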
Production rule: Never rely on training loss alone. Always evaluate on your downstream task — a lower perplexity does not always mean a better assistant.