What metrics would you track during model training?

Question

Accepted Answer

## Training Metrics & Monitoring

Effective monitoring of a fine-tuning run is often what separates a successful model from a wasted GPU budget.

### Core Training Metrics

| Metric | What it measures | Healthy range |
|--------|-----------------|---------------|
| **Training loss** | Error on training data (cross-entropy) | Steadily decreasing |
| **Validation loss** | Error on held-out data | Decreasing, then plateauing |
| **Perplexity** | exp(loss) — model's "surprise" | Lower is better |
| **Learning rate** | Step size for gradient updates | Follows schedule (warmup → decay) |
| **Gradient norm** | Magnitude of gradients | Stable, with no sudden spikes; clip at a threshold such as 1.0 |
| **GPU utilisation** | Hardware efficiency | > 80% |
| **Tokens/second** | Training throughput | Maximise this |
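
The perplexity row follows directly from the loss, so it rarely needs separate instrumentation. A minimal sketch, assuming the logged loss is a mean per-token cross-entropy in nats (the helper name `perplexity` and the sample losses are illustrative, not part of any library):

```python
import math


def perplexity(mean_cross_entropy_loss: float) -> float:
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_cross_entropy_loss)


# Illustrative validation losses, one per evaluation.
eval_losses = [2.2, 1.9, 1.7]
for loss in eval_losses:
    print(f"loss={loss:.2f}  perplexity={perplexity(loss):.2f}")
```

Because the mapping is monotonic, loss and perplexity always rank checkpoints the same way; perplexity is simply easier to read as "effective number of choices per token".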

### Overfitting Detection

```
Training loss:    2.1 → 1.8 → 1.5 → 1.3 → 1.1 (good)
Validation loss:  2.2 → 1.9 → 1.7 → 1.8 → 2.1 (overfitting — val loss increases!)
```

**Fix:** Early stopping, increase dropout, reduce epochs, add more training data.
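
Early stopping can be sketched in a few lines of plain Python. This is a simplified illustration rather than a library implementation: `early_stopping_step`, `patience`, and `min_delta` are names chosen for this example, and the validation curve is the one shown above.

```python
def early_stopping_step(val_losses, patience=2, min_delta=0.0):
    """Return the index of the evaluation at which to stop, or None.

    Stops once `patience` consecutive evaluations fail to improve on the
    best validation loss seen so far by more than `min_delta`.
    """
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None


val_losses = [2.2, 1.9, 1.7, 1.8, 2.1]  # validation curve from the example above
stop_at = early_stopping_step(val_losses, patience=2)
best_at = val_losses.index(min(val_losses))
print(f"stop at evaluation {stop_at}, restore checkpoint from evaluation {best_at}")
```

With the Hugging Face `Trainer`, the same idea is available through `EarlyStoppingCallback` and its `early_stopping_patience` argument, which relies on `load_best_model_at_end=True` and `metric_for_best_model` being set.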

### Using WandB for Experiment Tracking

```python
import wandb
from transformers import TrainingArguments, Trainer

wandb.init(
    project="llm-fine-tuning",
    name="llama2-7b-instruction-v1",
    config={
        "model": "meta-llama/Llama-2-7b-hf",
        "lora_r": 16,
        "learning_rate": 2e-4,
        "epochs": 3,
    }
)

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # Effective batch = 32
    learning_rate=2e-4,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
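    logging_steps=10,
    # Without a step-based eval strategy, eval_steps is ignored and
    # load_best_model_at_end cannot work. Older transformers releases
    # name this argument evaluation_strategy.
    eval_strategy="steps",
    eval_steps=100,
    save_steps=500,                  # Must be a multiple of eval_steps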
    report_to="wandb",               # Automatic WandB logging
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)
```
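
It helps to know how many optimizer steps these settings imply before choosing `eval_steps` and `save_steps`. The arithmetic below is an approximation (exact rounding depends on the trainer and on whether the last partial batch is dropped), and the dataset size of 10,000 examples is a made-up figure for illustration:

```python
import math


def schedule_summary(num_examples, per_device_batch=4, grad_accum=8,
                     num_devices=1, epochs=3, warmup_ratio=0.1):
    """Approximate step counts implied by the batch and epoch settings."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# num_examples=10_000 is a hypothetical dataset size.
summary = schedule_summary(num_examples=10_000)
print(summary)
print("evaluations at eval_steps=100:", summary["total_steps"] // 100)
```

If the run turns out to have only a few hundred steps, an `eval_steps` of 100 gives very few points on the validation curve, so a smaller interval may be worth the extra evaluation time.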

### Common Training Problems & Fixes

| Problem | Symptom | Fix |
|---------|---------|-----|
| **Overfitting** | Val loss increases while train loss decreases | Reduce epochs, add dropout, more data |
| **Loss explosion** | Loss suddenly goes to NaN or very high | Reduce LR, clip gradients, check data |
| **Gradient norm explosion** | Gradient norm spikes, loss becomes unstable | Clip gradients (`max_grad_norm=1.0`) |
| **Slow convergence** | Loss barely decreasing | Increase LR, check data quality |
| **GPU OOM** | Out of memory error | Reduce batch size, use gradient checkpointing |
| **Underfitting** | Both train and val loss high | More epochs, larger model, better data |
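
Gradient clipping, the fix listed for exploding gradients, rescales the whole gradient vector when its norm exceeds a threshold. The sketch below shows the idea on a flat list of numbers; `clip_by_global_norm` is a name invented for this example, and real frameworks apply the same scaling across all parameter tensors.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale gradient values so their combined L2 norm is at most max_norm.

    Returns the clipped values and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


grads = [3.0, 4.0]  # toy gradient with L2 norm 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(f"norm before={norm_before:.3f}  norm after={norm_after:.3f}")
```

The norm before clipping is the more useful value to log: a clipped norm sits at the threshold by construction, while the unclipped one shows how close the run is to instability.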

### Learning Rate Schedule

```python
from transformers import TrainingArguments

# Cosine schedule with warmup (a common default for LLM fine-tuning)
training_args = TrainingArguments(
    output_dir="./output",
    warmup_ratio=0.1,           # Warm up for 10% of steps
    lr_scheduler_type="cosine", # Decay to near-zero
    learning_rate=2e-4,         # Peak LR for LoRA; ~1e-5 is a common starting point for full FT
)
```
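
To see what this schedule does to the learning rate, the curve can be reproduced in plain Python. This is a sketch of linear warmup followed by a half-cosine decay, assuming 1,000 total steps for illustration; `lr_at_step` is a name made up for this example, not a library function.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_ratio=0.1):
    """Learning rate under linear warmup followed by cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical run length
for step in (0, 50, 100, 550, 1000):
    print(f"step {step:4d}: lr = {lr_at_step(step, total_steps):.2e}")
```

Plotting the logged learning rate against this shape is a quick sanity check that the scheduler is configured as intended.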

> **Production rule:** Never rely on training loss alone. Always evaluate on your downstream task — a lower perplexity does not always mean a better assistant.


