Explain instruction tuning. Why is it important for chat models?
Answer
Instruction Tuning & Direct Preference Optimisation (DPO)
These two techniques are how raw pretrained models are turned into helpful assistants.
Instruction Tuning (Supervised Fine-tuning / SFT)
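What it is: Fine-tuning a base LLM on examples of instructions paired with ideal responses. A base model only predicts the next token, so given a question it may simply continue the text rather than answer it. Instruction tuning teaches the model to follow instructions, format outputs correctly, and behave like an assistant, which is what makes it usable as a chat model.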
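```python
from datasets import Dataset
from trl import SFTTrainer

# Training data format: one "text" field per example,
# using Llama-2/Mistral-style [INST] markers.
training_data = [
    {
        "text": "<s>[INST] Summarise this article in 3 bullet points. [/INST] • Point 1\n• Point 2\n• Point 3 </s>"
    },
    {
        "text": "<s>[INST] Write a Python function to reverse a string. [/INST] ```python\ndef reverse(s: str) -> str:\n    return s[::-1]\n``` </s>"
    },
    # Typically 10K–100K such examples
]

# `model` is assumed to be a base model loaded beforehand.
# Note: recent TRL releases expect dataset_text_field and the sequence-length
# limit on SFTConfig (passed as args=...) rather than on SFTTrainer itself.
trainer = SFTTrainer(
    model=model,
    train_dataset=Dataset.from_list(training_data),
    dataset_text_field="text",
    max_seq_length=2048,
)
trainer.train()
```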
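The training strings above follow a fixed single-turn template, so they are usually generated from raw instruction/response pairs rather than written by hand. The following is a minimal sketch of that step. The helper names `format_example` and `build_sft_records` and the `instruction`/`response` keys are hypothetical, and the markers are simply copied from the example data; other model families use different templates.

```python
from typing import Dict, List

# Markers copied from the example data above (an assumption about the
# target model; other model families use different chat templates).
BOS, EOS = "<s>", "</s>"
INST_OPEN, INST_CLOSE = "[INST]", "[/INST]"


def format_example(instruction: str, response: str) -> Dict[str, str]:
    """Wrap one instruction/response pair in the single-turn template."""
    text = (
        f"{BOS}{INST_OPEN} {instruction.strip()} {INST_CLOSE} "
        f"{response.strip()} {EOS}"
    )
    return {"text": text}


def build_sft_records(pairs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Convert raw {'instruction', 'response'} dicts into {'text': ...} records."""
    return [format_example(p["instruction"], p["response"]) for p in pairs]


# Hypothetical raw data, mirroring the first training example above.
raw_pairs = [
    {
        "instruction": "Summarise this article in 3 bullet points.",
        "response": "• Point 1\n• Point 2\n• Point 3",
    },
]

records = build_sft_records(raw_pairs)
print(records[0]["text"])
```

In practice, the tokenizer's own chat template is usually a safer source for these markers than hard-coded strings, because a mismatch between training and inference formatting tends to degrade the model's responses.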
Reinforcement Learning from Human Feedback (RLHF)
What it is: After SFT, human annotators rank model responses by preference. A reward model is trained on these rankings, then used to fine-tune the LLM via RL (typically PPO).
Pipeline: Pretraining → SFT → Reward Model → PPO fine-tuning
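To make the reward-model step concrete, here is a minimal sketch of how a ranking can be expanded into pairs and scored. It assumes the commonly used Bradley-Terry style pairwise loss, `-log(sigmoid(r_chosen - r_rejected))`; the text above does not say which loss is used, and the function names and reward values are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Tuple


def ranking_to_pairs(ranked_best_first: List[str]) -> List[Tuple[str, str]]:
    """Expand one annotator ranking into (preferred, less_preferred) pairs."""
    # combinations() keeps input order, so the first item of each pair
    # is always the one ranked higher.
    return list(combinations(ranked_best_first, 2))


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical ranking of three responses, best first.
pairs = ranking_to_pairs(["response_a", "response_b", "response_c"])

# Hypothetical scalar scores from a reward model.
loss_correct_order = pairwise_reward_loss(2.0, 0.0)  # chosen scored higher
loss_wrong_order = pairwise_reward_loss(0.0, 2.0)    # chosen scored lower
print(pairs)
print(loss_correct_order < loss_wrong_order)
```

The loss is small when the reward model scores the preferred response higher and grows as the ordering is violated, which is what pushes the reward model toward the annotators' rankings.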
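Problem: PPO is complex, unstable, and expensive. Training typically keeps the policy, a frozen reference model, the reward model, and a value model in memory at once, and results are sensitive to hyperparameters. This led to DPO.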
Direct Preference Optimisation (DPO)
What it is: A simpler alternative to RLHF that directly fine-tunes the model on preference pairs (chosen vs rejected responses) without a separate reward model.
```python
from datasets import Dataset
from trl import DPOConfig, DPOTrainer

# Preference dataset format
preference_data = [
    {
        "prompt": "Write a haiku about the ocean.",
        "chosen": "Waves crash on the shore\nSalt air fills my tired lungs\nPeace in endless blue",
        "rejected": "The ocean is big and has lots of water and waves.",
    },
    # Typically 5K–50K preference pairs
]

# `model` is assumed to be the SFT model being trained;
# `ref_model` is assumed to be a frozen copy of it.
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,  # Original SFT model (frozen reference)
    args=DPOConfig(
        beta=0.1,  # Strength of the pull toward ref_model; lower = more aggressive updates
        learning_rate=5e-7,
    ),
    train_dataset=Dataset.from_list(preference_data),
)
dpo_trainer.train()
```
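What the trainer optimises can be sketched for a single preference pair. The snippet below computes the DPO loss, `-log(sigmoid(beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))))`, in plain Python. The log-probabilities are made-up numbers standing in for the summed token log-probs of each response, and the function names are illustrative rather than part of any library.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


# Made-up log-probs: the policy starts identical to the reference...
loss_at_start = dpo_loss(-12.0, -10.0, -12.0, -10.0)
# ...then shifts probability toward the chosen response.
loss_after_shift = dpo_loss(-9.0, -14.0, -12.0, -10.0)
print(loss_at_start, loss_after_shift)
```

When the policy equals the reference, the loss is `log(2)` regardless of `beta`. It falls only as the policy raises the chosen response relative to the rejected one, measured against the reference, which is how DPO avoids training a separate reward model.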
Comparison: SFT vs RLHF vs DPO vs KTO
| Method | Requires Reward Model | Complexity | Stability | Quality |
|---|---|---|---|---|
| SFT | No | Low | High | Good baseline |
| RLHF (PPO) | Yes | Very High | Low | Highest ceiling when well tuned |
| DPO | No | Medium | High | Near-RLHF |
| KTO (Kahneman-Tversky Optimisation) | No | Low | High | Reported as comparable to DPO; needs only unpaired good/bad labels |
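KTO (Kahneman-Tversky Optimisation) in the table differs from DPO mainly in its data: it learns from individual responses labelled as desirable or undesirable rather than from chosen/rejected pairs. A rough sketch of converting DPO-style pairs into that unpaired form follows. The `prompt`/`completion`/`label` field names are an assumption modelled on common unpaired-preference layouts, and the helper name is hypothetical, so check your training library's expected schema.

```python
from typing import Dict, List, Union

Record = Dict[str, Union[str, bool]]


def pairs_to_unpaired(preference_data: List[Dict[str, str]]) -> List[Record]:
    """Split each (chosen, rejected) pair into two independently labelled examples."""
    unpaired: List[Record] = []
    for example in preference_data:
        # The chosen response becomes a desirable example...
        unpaired.append(
            {"prompt": example["prompt"], "completion": example["chosen"], "label": True}
        )
        # ...and the rejected response becomes an undesirable one.
        unpaired.append(
            {"prompt": example["prompt"], "completion": example["rejected"], "label": False}
        )
    return unpaired


# Same preference pair as in the DPO example.
preference_data = [
    {
        "prompt": "Write a haiku about the ocean.",
        "chosen": "Waves crash on the shore\nSalt air fills my tired lungs\nPeace in endless blue",
        "rejected": "The ocean is big and has lots of water and waves.",
    },
]

unpaired_data = pairs_to_unpaired(preference_data)
print(len(unpaired_data))
```

Because each record stands alone, this kind of data can also come from simple thumbs-up/thumbs-down feedback, without requiring two responses to the same prompt.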
The Full Training Pipeline
```text
Pretraining (next-token prediction, massive data)
    ↓
Supervised Fine-tuning / SFT (instruction-response pairs)
    ↓
Preference Optimisation (RLHF or DPO)
    ↓
Safety Tuning (Constitutional AI, RLAIF)
```
Key insight: Instruction tuning mostly doesn't teach new knowledge; it teaches the model how to use its existing knowledge in a helpful, structured way. Knowledge comes largely from pretraining; behaviour comes from instruction tuning.