Explain instruction tuning. Why is it important for chat models?

Question

Accepted Answer

## Instruction Tuning & Direct Preference Optimisation (DPO)

These two techniques are how raw pretrained models are turned into helpful assistants.

### Instruction Tuning (Supervised Fine-tuning / SFT)

**What it is:** Fine-tuning a base LLM on examples of instructions paired with ideal responses. This teaches the model to follow instructions, format outputs correctly, and behave like an assistant.

**Why it matters for chat models:** A base model is trained only to predict the next token, so when given a question it may simply continue the text (for example, by listing more questions) instead of answering it.

```python
from datasets import Dataset
from trl import SFTTrainer

FENCE = "`" * 3  # built indirectly so the nested fence does not close this code block

# Training data format: prompt and ideal response in a single string
training_data = [
    {
        "text": "<s>[INST] Summarise this article in 3 bullet points. [/INST] "
                "• Point 1\n• Point 2\n• Point 3</s>"
    },
    {
        "text": "<s>[INST] Write a Python function to reverse a string. [/INST] "
                f"{FENCE}python\ndef reverse(s: str) -> str:\n    return s[::-1]\n{FENCE}</s>"
    },
    # Typically 10K–100K such examples
]

# `model` is assumed to be an already-loaded base causal LM.
# Newer TRL releases expect `dataset_text_field` and the sequence-length
# limit on `SFTConfig` (passed as `args=`) rather than on the trainer itself.
trainer = SFTTrainer(
    model=model,
    train_dataset=Dataset.from_list(training_data),
    dataset_text_field="text",
    max_seq_length=2048,
)
trainer.train()
```

### Reinforcement Learning from Human Feedback (RLHF)

**What it is:** After SFT, human annotators rank model responses by preference. A **reward model** is trained on these rankings, then used to fine-tune the LLM via RL (typically PPO).

**Pipeline:** Pretraining → SFT → Reward Model → PPO fine-tuning

**Problem:** PPO is complex, unstable, and expensive. This led to DPO.

### Direct Preference Optimisation (DPO)

**What it is:** A simpler alternative to RLHF that directly fine-tunes the model on preference pairs (chosen vs rejected responses) without a separate reward model.

```python
from datasets import Dataset
from trl import DPOConfig, DPOTrainer

# Preference dataset format
preference_data = [
    {
        "prompt": "Write a haiku about the ocean.",
        "chosen": "Waves crash on the shore\nSalt air fills my tired lungs\nPeace in endless blue",
        "rejected": "The ocean is big and has lots of water and waves.",
    },
    # Typically 5K–50K preference pairs
]

# `model` is assumed to be the SFT model being trained; `ref_model` is a frozen copy of it.
# Depending on the TRL version, the tokenizer must also be passed to the trainer.
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,  # Original SFT model (frozen reference)
    args=DPOConfig(
        beta=0.1,  # Strength of the KL-style penalty towards ref_model; lower = more aggressive
        learning_rate=5e-7,
    ),
    train_dataset=Dataset.from_list(preference_data),
)
dpo_trainer.train()
```

### Comparison: SFT vs RLHF vs DPO

| Method | Requires Reward Model | Complexity | Stability | Quality |
|--------|-----------------------|------------|-----------|---------|
| **SFT** | No | Low | High | Good baseline |
| **RLHF (PPO)** | Yes | Very High | Low | Best when tuned well (used for GPT-4) |
| **DPO** | No | Medium | High | Near-RLHF |
| **KTO** | No | Low | High | Simpler than DPO |

KTO (Kahneman-Tversky Optimisation) is a related method that learns from individual responses labelled good or bad, rather than from chosen/rejected pairs.

### The Full Training Pipeline

```
Pretraining (next-token prediction, massive data)
        ↓
Supervised Fine-tuning / SFT (instruction-response pairs)
        ↓
Preference Optimisation (RLHF or DPO)
        ↓
Safety Tuning (Constitutional AI, RLAIF)
```

> **Key insight:** Instruction tuning doesn't teach new knowledge; it teaches the model *how to use* its existing knowledge in a helpful, structured way. Knowledge comes from pretraining; behaviour comes from instruction tuning.
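
To make the DPO objective described above more concrete, the following sketch computes the loss for a single preference pair from sequence log-probabilities. It is a minimal illustration under stated assumptions, not TRL's implementation: the function names (`log_sigmoid`, `dpo_loss`) and all log-probability values are hypothetical and chosen only for demonstration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # Implicit rewards: how much more (or less) likely the policy makes each
    # response compared with the frozen reference model, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Pairwise logistic objective: push the chosen reward above the rejected one.
    return -log_sigmoid(chosen_reward - rejected_reward)


# Hypothetical log-probabilities, made up for illustration only.
at_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)      # policy identical to reference
after_update = dpo_loss(-10.0, -18.0, -12.0, -15.0)  # policy now favours "chosen"

print(f"loss when policy equals reference: {at_start:.4f}")
print(f"loss after favouring chosen:       {after_update:.4f}")
```

When the policy equals the reference model, the reward margin is zero and the loss is ln 2 (about 0.6931). The loss falls as the policy raises the likelihood of the chosen response relative to the rejected one. No separate reward model appears anywhere: the log-probability ratios against the reference play that role, which is why DPO needs only the policy and a frozen copy of the SFT model.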


