What are all the different ways to fine-tune an LLM?

Question

Accepted Answer

## All the Different Ways to Fine-Tune an LLM

Fine-tuning adapts a pre-trained LLM to a specific task, domain, or behavior. There are many approaches — from full weight updates to tiny parameter tweaks — each with different cost, data, and performance trade-offs.

---

## Overview

| Method | Updates | Data Needed | Cost | Best For |
|--------|---------|-------------|------|----------|
| **Full Fine-Tuning** | All weights | Large labeled dataset | Very high | Max performance, ample GPU |
| **SFT (Instruction Tuning)** | All or PEFT | Instruction-output pairs | Medium | Task-following behavior |
| **LoRA / QLoRA** | 0.1–1% params | Medium dataset | Low | Most production fine-tuning |
| **PEFT (various)** | <1% params | Medium dataset | Low | Memory-constrained GPUs |
| **RLHF** | Policy + reward model | Human preferences | High | Alignment, safety |
| **DPO** | Policy weights | Preference pairs | Medium | Alignment without reward model |
| **ORPO** | Policy weights | Preference pairs | Low | Simpler alignment than DPO |
| **Continued Pre-training** | All weights | Unlabeled domain text | Very high | Domain adaptation |
| **DAPT / TAPT** | All weights | Domain/task text | High | Specialized knowledge |
| **Multi-task Fine-Tuning** | All or PEFT | Multiple task datasets | High | Generalist task models |

---

## 1. Full Fine-Tuning

Updates **all model weights** end-to-end on a task-specific dataset. Highest performance ceiling but requires significant GPU memory and compute.

```python
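from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token   # padding is needed for batching

# Assumes each JSONL record has a "text" field; adapt to your schema
dataset = load_dataset("json", data_files="train.jsonl")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

# Trainer expects token ids, not raw strings
tokenized = dataset["train"].map(
    tokenize, batched=True, remove_columns=dataset["train"].column_names
)

training_args = TrainingArguments(
    output_dir="./full-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    fp16=True,
    save_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    # Pads each batch and copies input_ids into labels for next-token prediction
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
)
trainer.train()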
```

**Pros:** Best performance, no architectural constraints
**Cons:** Needs far more GPU memory than the weights alone (roughly 16 bytes per parameter with mixed-precision Adam, before activations); risk of catastrophic forgetting
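
To make the memory cost concrete, here is a back-of-the-envelope sketch. It assumes the common mixed-precision Adam setup (16-bit weights and gradients, plus fp32 master weights and two fp32 Adam moments) and ignores activations, so treat the result as a lower bound. The function name is illustrative, not part of any library.

```python
def full_finetune_memory_gb(n_params: float) -> dict:
    """Rough GPU memory for mixed-precision Adam training, excluding activations."""
    bytes_per_param = {
        "fp16_weights": 2,
        "fp16_gradients": 2,
        "fp32_master_weights": 4,
        "adam_first_moment": 4,
        "adam_second_moment": 4,
    }
    gb = {name: n_params * b / 1e9 for name, b in bytes_per_param.items()}
    gb["total"] = sum(gb.values())   # 16 bytes per parameter in this setup
    return gb

estimate = full_finetune_memory_gb(3e9)   # a 3B-parameter model
print(f"{estimate['total']:.0f} GB before activations")
```

Sharding (ZeRO/FSDP), 8-bit optimizers, or activation checkpointing change these numbers, which is why the parameter-efficient methods below are usually the first thing to try.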

---

## 2. Supervised Fine-Tuning (SFT) / Instruction Tuning

Fine-tunes on **instruction-response pairs** to teach the model to follow task instructions. This is the "Stage 1" of most alignment pipelines (used to create Alpaca, Vicuna, etc.).

```python
from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

# Dataset format: {"instruction": "...", "output": "..."}
# (train.jsonl is a placeholder file name)
dataset = load_dataset("json", data_files="train.jsonl", split="train")

def format_prompt(example):
    return f"""### Instruction:
{example['instruction']}

### Response:
{example['output']}"""

# Note: recent TRL releases take the sequence-length setting via SFTConfig
# rather than as an SFTTrainer keyword; check the version you have installed
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    formatting_func=format_prompt,
    max_seq_length=2048,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
)
trainer.train()
```

**Pros:** Teaches instruction-following with modest data (~1K–100K examples)
**Cons:** Model learns to imitate, not necessarily to be helpful/safe
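
As a rough illustration of how an instruction-response pair becomes a training target, the sketch below shows one common variant, completion-only loss, where prompt tokens are masked out so only the response contributes to the loss. The token ids are toy values standing in for tokenizer output, and `build_sft_example` is a hypothetical helper; whether a given trainer masks the prompt depends on its configuration.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def build_sft_example(prompt_ids, response_ids, eos_id):
    """Concatenate prompt and response; mask the prompt so only the response is learned."""
    input_ids = list(prompt_ids) + list(response_ids) + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids) + [eos_id]
    return {"input_ids": input_ids, "labels": labels}

# Toy token ids standing in for a real tokenizer's output
example = build_sft_example(prompt_ids=[11, 12, 13], response_ids=[21, 22], eos_id=2)
print(example["labels"])
```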

---

## 3. LoRA Fine-Tuning

Injects **low-rank trainable matrices** alongside frozen linear layers, most commonly the attention projections. The most popular PEFT method. See Q155 for full PEFT coverage.

```python
from peft import get_peft_model, LoraConfig, TaskType

lora_config = LoraConfig(
    r=16,                          # rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 0.15% of total
```

**Pros:** Near full fine-tuning quality, zero inference overhead (weights merge back)
**Cons:** Rank `r` must be tuned; very low `r` limits model expressiveness
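
The "weights merge back" point can be checked on a toy example. The sketch below uses plain Python lists with made-up shapes and values; it computes the adapter forward pass `W x + (alpha / r) * B A x` and compares it with the output of the merged weight `W + (alpha / r) * B A`. In real LoRA, `B` starts at zero; non-zero values are used here only so the comparison is meaningful.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained."""
    scale = alpha / r
    base = matmul(W, x)
    update = matmul(B, matmul(A, x))
    return [[b[0] + scale * u[0]] for b, u in zip(base, update)]

def merge_lora(W, A, B, alpha, r):
    """Fold the adapter into the base weight: W' = W + (alpha / r) * B A."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

# Toy shapes: d_out = 3, d_in = 4, rank r = 2
r, alpha = 2, 4
W = [[1, 0, 2, 1], [0, 1, 0, 3], [2, 2, 1, 0]]   # frozen base weight (3 x 4)
A = [[1, 0, 1, 0], [0, 1, 0, 1]]                 # trainable (r x d_in)
B = [[1, 0], [0, 1], [1, 1]]                     # trainable (d_out x r)
x = [[1], [2], [3], [4]]                         # input column vector

adapter_out = lora_forward(W, A, B, x, alpha, r)
merged_out = matmul(merge_lora(W, A, B, alpha, r), x)
trainable = r * (4 + 3)   # parameters in A and B
frozen = 3 * 4            # parameters in W
print(adapter_out == merged_out, trainable, frozen)
```

At these toy sizes the adapter is larger than the base weight; the savings appear when `d_in` and `d_out` are in the thousands and `r` stays small.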

---

## 4. QLoRA Fine-Tuning

LoRA on top of a **4-bit quantized** base model. This brings models in the 65–70B range within reach of a single 48–80 GB GPU, and smaller models (roughly 7–13B) within reach of a single 24 GB consumer GPU.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto"
)
# Then apply LoRA config as above
```

**Pros:** Fine-tune 70B models on a single A100 (80GB) or even 2x RTX 3090s
**Cons:** Slower training than LoRA on unquantized model; some accuracy loss
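
To show where the accuracy loss comes from, here is a simplified sketch of blockwise 4-bit quantization. It uses uniform absmax quantization as a stand-in: NF4 uses non-uniform levels chosen for normally distributed weights, and real implementations pack two codes per byte. All function names and values are illustrative.

```python
def quantize_block(values, levels=7):
    """Absmax-quantize one block to signed integers in [-levels, levels] (fits in 4 bits)."""
    absmax = max(abs(v) for v in values)
    scale = absmax / levels if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate float values from integer codes and the per-block scale."""
    return [c * scale for c in codes]

def quantize_blockwise(weights, block_size=4):
    """Quantize block by block so an outlier only degrades its own block."""
    blocks = [weights[i:i + block_size] for i in range(0, len(weights), block_size)]
    return [quantize_block(b) for b in blocks]

weights = [0.12, -0.40, 0.03, 0.27, 1.50, -0.75, 0.10, 0.00]
quantized = quantize_blockwise(weights)
restored = [v for codes, scale in quantized for v in dequantize_block(codes, scale)]
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"max round-trip error: {max_error:.4f}")
```

The base weights stay in this compressed form during training; only the LoRA matrices are kept in higher precision and updated.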

---

## 5. RLHF (Reinforcement Learning from Human Feedback)

A 3-stage process that aligns models using human preference signals. Used to create ChatGPT, Claude, and Gemini.

```mermaid
graph LR
    A[Pre-trained LLM] --> B[SFT Model]
    B --> C[Reward Model Training]
    B --> D[RL with PPO]
    C --> D
    D --> E[Aligned Model]
    H[Human Preferences] --> C
```

**Stage 1 — SFT:** Fine-tune on curated demonstration data
**Stage 2 — Reward Model:** Train a model to score responses by quality
**Stage 3 — PPO:** Use RL to optimize the SFT model against the reward model
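
For Stage 2, a common choice is a pairwise (Bradley-Terry style) loss that pushes the reward of the preferred response above the rejected one. The following is a minimal sketch of that objective for a single pair, assuming the reward model outputs one scalar per response; it is not tied to any particular library.

```python
import math

def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)) for one preference pair."""
    margin = reward_chosen - reward_rejected
    # Equivalent to log(1 + exp(-margin)), written to avoid overflow
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

print(round(pairwise_reward_loss(2.0, 0.0), 4))   # ranking is right: small loss
print(round(pairwise_reward_loss(0.0, 2.0), 4))   # ranking is wrong: large loss
```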

```python
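# Sketch using TRL's legacy PPOTrainer API (trl < 0.12); newer releases restructured it
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

ppo_config = PPOConfig(
    model_name="sft-model",
    learning_rate=1.41e-5,
    batch_size=16,
    mini_batch_size=4
)

ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=AutoModelForCausalLMWithValueHead.from_pretrained("sft-model"),      # policy being optimized
    ref_model=AutoModelForCausalLMWithValueHead.from_pretrained("sft-model"),  # frozen copy for the KL penalty
    tokenizer=tokenizer,
    dataset=dataset   # tokenized prompts with an "input_ids" column
)

# Training loop
for batch in ppo_trainer.dataloader:
    query_tensors = batch["input_ids"]   # list of 1-D prompt tensors
    response_tensors = ppo_trainer.generate(
        query_tensors, return_prompt=False, max_new_tokens=200
    )
    # reward_model is assumed to be a callable defined elsewhere (the Stage 2 model)
    # that returns one scalar reward tensor per response
    rewards = reward_model(response_tensors)
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)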
```

**Pros:** Strong alignment with human preferences; widely used in production at major labs
**Cons:** Requires large human preference datasets; complex, unstable training

---

## 6. DPO (Direct Preference Optimization)

Directly optimizes the LLM on preference pairs (chosen vs rejected responses) **without a separate reward model**. Simpler and more stable than RLHF + PPO.

```python
from trl import DPOTrainer, DPOConfig
from datasets import Dataset

# Dataset format: {prompt, chosen, rejected}
preference_data = Dataset.from_dict({
    "prompt": ["Explain neural networks"],
    "chosen": ["A neural network is a system of layers..."],   # preferred response
    "rejected": ["Neural networks are like brains..."]          # worse response
})

dpo_config = DPOConfig(
    beta=0.1,              # KL penalty — higher = stay closer to reference model
    learning_rate=5e-7,
    num_train_epochs=1,
    per_device_train_batch_size=2
)

dpo_trainer = DPOTrainer(
    model=sft_model,
    ref_model=sft_model_ref,    # frozen reference (the SFT checkpoint)
    args=dpo_config,
    train_dataset=preference_data,
    tokenizer=tokenizer
)
dpo_trainer.train()
```

**Pros:** No reward model, no RL instability; much simpler than RLHF
**Cons:** Sensitive to `beta`; requires high-quality preference data
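
The objective behind the trainer can be written in a few lines. This sketch computes the DPO loss for one preference pair from summed sequence log-probabilities under the policy and the frozen reference; the function names and the numbers in the usage lines are illustrative only.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))

# Policy identical to the reference: no preference learned yet
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has raised the chosen response and lowered the rejected one
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(round(start, 4), round(improved, 4))
```

A larger `beta` makes the loss react more strongly to any drift from the reference, which is why it acts like a KL penalty.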

---

## 7. ORPO (Odds Ratio Preference Optimization)

Combines SFT and preference alignment into a **single training step**: no separate SFT phase and no reference model. Newer than DPO, and lighter on memory because only one model is loaded.

```python
from trl import ORPOTrainer, ORPOConfig

orpo_config = ORPOConfig(
    learning_rate=8e-6,
    beta=0.1,              # weight on the odds-ratio term (lambda in the ORPO paper)
    num_train_epochs=3,
    per_device_train_batch_size=4
)

orpo_trainer = ORPOTrainer(
    model=base_model,         # start from base, not SFT model
    args=orpo_config,
    train_dataset=preference_data,
    tokenizer=tokenizer
)
orpo_trainer.train()
```

**Pros:** Single-stage training (SFT + alignment together); no reference model needed
**Cons:** Less established than DPO; fewer community resources
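
The single-stage objective is the usual SFT loss on the chosen response plus a weighted odds-ratio term. Below is a minimal sketch for one pair, assuming length-normalized (average per-token) log-probabilities as inputs; `lam` stands for the odds-ratio weight, and the function names are illustrative.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))

def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) with p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(chosen_avg_logp: float, rejected_avg_logp: float, lam: float = 0.1) -> float:
    """SFT negative log-likelihood on the chosen response plus the odds-ratio penalty."""
    sft_term = -chosen_avg_logp
    odds_ratio_term = -log_sigmoid(log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp))
    return sft_term + lam * odds_ratio_term

tied = orpo_loss(-1.0, -1.0)        # model cannot tell the two responses apart
separated = orpo_loss(-0.5, -2.0)   # chosen response is clearly more likely
print(round(tied, 4), round(separated, 4))
```

Because both terms depend only on the model being trained, no reference model has to be held in memory.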

---

## 8. Continued Pre-training (Domain Adaptive)

Continues the **pre-training objective** (next-token prediction) on large amounts of **unlabeled domain text** to infuse domain knowledge before task fine-tuning.

```python
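from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token   # padding is needed for batching

# Load raw domain documents (no labels needed)
# e.g., medical papers, legal documents, code repositories
domain_texts = load_dataset("text", data_files="medical_corpus.txt")

# The "text" loader yields one example per line under a "text" column
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = domain_texts["train"].map(tokenize, batched=True, remove_columns=["text"])

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False   # causal LM, not masked LM
)

training_args = TrainingArguments(
    output_dir="./domain-pretrained",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=5e-5         # starting point only; tune for your model and corpus
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=data_collator
)
trainer.train()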
```

**Pros:** Teaches model facts/vocabulary specific to a domain
**Cons:** Requires a large domain corpus (GBs to TBs); expensive; usually needs a follow-up SFT stage before the model is useful for tasks
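
With large unlabeled corpora, documents are often concatenated and cut into fixed-length blocks so that no compute is spent on padding. The sketch below shows that packing step on toy token ids; `pack_into_blocks` is a hypothetical helper, not a library function.

```python
def pack_into_blocks(tokenized_docs, block_size, eos_id):
    """Concatenate documents (separated by EOS) and cut the stream into equal-length blocks."""
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(eos_id)   # marks the document boundary
    usable = (len(stream) // block_size) * block_size   # drop the incomplete tail
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]

docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]   # toy token ids
blocks = pack_into_blocks(docs, block_size=4, eos_id=0)
print(blocks)
```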

---

## 9. Multi-Task Fine-Tuning

Fine-tunes on **multiple tasks simultaneously** using a shared model. Reduces catastrophic forgetting across the included tasks and produces a generalist task model (e.g., FLAN-T5).

```python
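from datasets import interleave_datasets, load_dataset
from transformers import Trainer

# Each source has its own schema, so map every task to shared "input"/"target" columns
summarization = load_dataset("cnn_dailymail", "3.0.0", split="train")
translation = load_dataset("opus_books", "en-fr", split="train")
qa = load_dataset("squad", split="train")

def to_summarize(ex):
    return {"input": f"summarize: {ex['article']}", "target": ex["highlights"]}

def to_translate(ex):
    pair = ex["translation"]
    return {"input": f"translate to French: {pair['en']}", "target": pair["fr"]}

def to_qa(ex):
    return {
        "input": f"answer question: {ex['question']} context: {ex['context']}",
        "target": ex["answers"]["text"][0],
    }

tasks = [
    summarization.map(to_summarize, remove_columns=summarization.column_names),
    translation.map(to_translate, remove_columns=translation.column_names),
    qa.map(to_qa, remove_columns=qa.column_names),
]

# Sample the tasks with equal probability; plain concatenation would let
# the largest dataset dominate
mixed_dataset = interleave_datasets(tasks, probabilities=[1/3, 1/3, 1/3], seed=42)

# Tokenize mixed_dataset (as in section 1) before handing it to Trainer;
# model and training_args are assumed to be defined as in section 1
trainer = Trainer(model=model, args=training_args, train_dataset=mixed_dataset)
trainer.train()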
```

**Pros:** Better generalization; handles multiple tasks with one model
**Cons:** Task interference possible; requires carefully balanced dataset mixing
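
One common way to balance the mix is temperature-scaled sampling, where each task is drawn with probability proportional to its size raised to `1 / temperature`. The sketch below illustrates this; the dataset sizes are made-up numbers and `mixing_probabilities` is a hypothetical helper.

```python
def mixing_probabilities(sizes, temperature=1.0):
    """Sampling probability per task, proportional to size ** (1 / temperature)."""
    weights = {task: n ** (1.0 / temperature) for task, n in sizes.items()}
    total = sum(weights.values())
    return {task: w / total for task, w in weights.items()}

sizes = {"summarize": 8000, "translate": 1500, "qa": 500}   # illustrative counts
proportional = mixing_probabilities(sizes, temperature=1.0)
flattened = mixing_probabilities(sizes, temperature=2.0)
print(proportional)
print(flattened)
```

A temperature of 1 samples in proportion to dataset size; higher temperatures move the mix toward uniform, giving small tasks more weight at the risk of overfitting them.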

---

## Fine-Tuning Strategy Decision Guide

```
Do you have unlimited GPU budget?
    → Full Fine-Tuning (maximum performance)

Memory constrained but need quality fine-tuning?
    → LoRA (standard) or QLoRA (very large models)

Need instruction-following from a base model?
    → SFT (Instruction Tuning) + LoRA

Need to align model with human preferences?
    → DPO (simpler) or RLHF (if you have human raters)

Want SFT + alignment in one shot?
    → ORPO

Model lacks domain knowledge (medical, legal, code)?
    → Continued Pre-training → then SFT

Serving many tasks from one model?
    → Multi-Task Fine-Tuning or Adapter Layers (swap per task)
```

> **Interview tip:** In practice, most production pipelines combine multiple stages: **Continued Pre-training** (domain knowledge) → **SFT** (instruction following) → **DPO or RLHF** (alignment). You rarely use just one method in isolation.

Learn more at [HuggingFace TRL Library](https://huggingface.co/docs/trl) and [DeepLearning.AI Fine-Tuning Course](https://www.deeplearning.ai/short-courses/finetuning-large-language-models/).
