Concept #156 · Hard · extended-ai-concepts

What are all the different ways to fine-tune an LLM?

#fine-tuning #lora #qlora #sft #rlhf #dpo #orpo #peft #llm

Answer

All the Different Ways to Fine-Tune an LLM

Fine-tuning adapts a pre-trained LLM to a specific task, domain, or behavior. There are many approaches — from full weight updates to tiny parameter tweaks — each with different cost, data, and performance trade-offs.


Overview

| Method | Updates | Data Needed | Cost | Best For |
| --- | --- | --- | --- | --- |
| Full Fine-Tuning | All weights | Large labeled dataset | Very high | Max performance, ample GPU |
| SFT (Instruction Tuning) | All or PEFT | Instruction-output pairs | Medium | Task-following behavior |
| LoRA / QLoRA | 0.1–1% params | Medium dataset | Low | Most production fine-tuning |
| PEFT (various) | <1% params | Medium dataset | Low | Memory-constrained GPUs |
| RLHF | Policy + reward model | Human preferences | High | Alignment, safety |
| DPO | Policy weights | Preference pairs | Medium | Alignment without reward model |
| ORPO | Policy weights | Preference pairs | Low | Simpler alignment than DPO |
| Continued Pre-training | All weights | Unlabeled domain text | Very high | Domain adaptation |
| DAPT / TAPT | All weights | Domain/task text | High | Specialized knowledge |
| Multi-task Fine-Tuning | All or PEFT | Multiple task datasets | High | Generalist task models |

1. Full Fine-Tuning

Updates all model weights end-to-end on a task-specific dataset. Highest performance ceiling but requires significant GPU memory and compute.

python
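from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

dataset = load_dataset("json", data_files="train.jsonl")

# Assumes each JSON line has a "text" field; adjust to your schema
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset["train"].map(
    tokenize, batched=True, remove_columns=dataset["train"].column_names
)

training_args = TrainingArguments(
    output_dir="./full-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    fp16=True,
    save_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    # Pads each batch and copies input_ids into labels for causal LM training
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
)
trainer.train()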

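Pros: Best performance, no architectural constraints
Cons: Needs GPU memory several times the model size (weights, gradients, and optimizer states: roughly 16 bytes per parameter with Adam in mixed precision, before activations); risk of catastrophic forgetting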
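To make the memory cost concrete, here is a back-of-the-envelope estimate. It is a sketch under stated assumptions (Adam optimizer, mixed precision, model states only); activations, batch size, and framework overhead are ignored, so real usage is higher. The function names are illustrative, not part of any library.

```python
# Rough per-parameter memory for full fine-tuning with Adam in mixed precision.
# Assumption: half-precision weights and gradients plus fp32 master weights and
# two fp32 Adam moments. Activations and framework overhead are NOT included.
BYTES_PER_PARAM = {
    "half_precision_weights": 2,
    "half_precision_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}


def full_finetune_memory_gb(num_params: float) -> float:
    """Estimate memory (GB, 1 GB = 1e9 bytes) for model states during training."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9


def weights_only_memory_gb(num_params: float, bytes_per_weight: int = 2) -> float:
    """Memory for the weights alone, e.g. half-precision inference."""
    return num_params * bytes_per_weight / 1e9


if __name__ == "__main__":
    for name, n in [("3B", 3e9), ("8B", 8e9), ("70B", 70e9)]:
        print(
            f"{name}: weights {weights_only_memory_gb(n):.0f} GB, "
            f"training states {full_finetune_memory_gb(n):.0f} GB"
        )
```

Under these assumptions a 3B model needs about 6 GB just to hold half-precision weights but roughly 48 GB of model states to train, which is why the parameter-efficient methods below exist.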


2. Supervised Fine-Tuning (SFT) / Instruction Tuning

Fine-tunes on instruction-response pairs to teach the model to follow task instructions. This is the "Stage 1" of most alignment pipelines (used to create Alpaca, Vicuna, etc.).

python
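from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

# Dataset format: {"instruction": "...", "output": "..."}
# "train.jsonl" is a placeholder path
dataset = load_dataset("json", data_files="train.jsonl", split="train")

def format_prompt(example):
    return f"""### Instruction:
{example['instruction']}

### Response:
{example['output']}"""

trainer = SFTTrainer(
    model=model,
    # The sequence-length cap (2048 in this example) is set on SFTConfig; the
    # argument is named max_seq_length or max_length depending on the TRL version
    args=SFTConfig(output_dir="./sft-lora"),
    train_dataset=dataset,
    formatting_func=format_prompt,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
)
trainer.train()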

Pros: Teaches instruction-following with modest data (~1K–100K examples)
Cons: Model learns to imitate, not necessarily to be helpful/safe
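A common (optional) refinement in SFT is to compute the loss only on the response tokens, so the model is not trained to reproduce the instruction itself. The sketch below shows the idea on plain token-ID lists. The function name and the toy IDs are hypothetical; the one real convention used is that PyTorch's cross-entropy loss ignores label `-100` by default.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_sft_example(prompt_ids: List[int], response_ids: List[int]) -> Dict[str, List[int]]:
    """Concatenate prompt and response; mask prompt positions out of the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return {"input_ids": input_ids, "labels": labels}


if __name__ == "__main__":
    # Toy token IDs standing in for a tokenized instruction and its response
    example = build_sft_example(prompt_ids=[11, 12, 13], response_ids=[21, 22])
    print(example)
```

Whether to mask the prompt is a design choice: training on the full sequence also works and is what the formatting function above produces unless masking is configured.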


3. LoRA Fine-Tuning

Injects trainable low-rank matrices alongside frozen weight matrices, most often the attention projections. The most popular PEFT method. See Q155 for full PEFT coverage.

python
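from transformers import AutoModelForCausalLM
from peft import get_peft_model, LoraConfig, TaskType

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

lora_config = LoraConfig(
    r=16,                          # rank
    lora_alpha=32,                 # scaling: the update is multiplied by lora_alpha / r
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# e.g. trainable params: ~0.15% of total (exact figure depends on model and target modules)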

Pros: Near full fine-tuning quality; zero inference overhead (weights merge back)
Cons: Rank r must be tuned; a very low r limits model expressiveness
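The "weights merge back" claim can be checked numerically. The following is a minimal pure-Python sketch, not the PEFT implementation: it builds a small frozen matrix W, low-rank factors A and B, and shows that the adapter path and the merged matrix give the same output. All names and sizes are illustrative; B is given small random values here to stand in for a trained adapter (in practice it starts at zero).

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(w, delta, scale):
    """Return w + scale * delta, element-wise."""
    return [[wij + scale * dij for wij, dij in zip(wr, dr)] for wr, dr in zip(w, delta)]


def lora_param_fraction(d_in: int, d_out: int, r: int) -> float:
    """Trainable fraction for one adapted matrix: r * (d_in + d_out) / (d_in * d_out)."""
    return r * (d_in + d_out) / (d_in * d_out)


random.seed(0)
d_out, d_in, r, alpha = 6, 8, 2, 4
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]      # frozen, d_out x d_in
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]       # trainable, r x d_in
B = [[random.gauss(0, 0.01) for _ in range(r)] for _ in range(d_out)]      # trainable, d_out x r
x = [[random.gauss(0, 1)] for _ in range(d_in)]                            # input column vector

scale = alpha / r
# Adapter path used during training: W x + (alpha / r) * B (A x)
adapter_out = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)
# Merged path used at inference: (W + (alpha / r) * B A) x
W_merged = add_scaled(W, matmul(B, A), scale)
merged_out = matmul(W_merged, x)

if __name__ == "__main__":
    max_diff = max(abs(a[0] - m[0]) for a, m in zip(adapter_out, merged_out))
    print(f"max difference between adapter and merged output: {max_diff:.2e}")
    print(f"trainable fraction for a 4096x4096 matrix at r=16: {lora_param_fraction(4096, 4096, 16):.4%}")
```

Because the merged matrix has the same shape as W, serving it costs nothing extra; the trade-off is that a merged adapter can no longer be swapped out per task.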


4. QLoRA Fine-Tuning

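LoRA on top of a 4-bit quantized base model. The base weights stay frozen in 4-bit form while the LoRA adapters train in higher precision. This makes it possible to fine-tune models in the 65–70B range on a single 48–80 GB GPU, and smaller models on a single consumer GPU.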

python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto"
)
# Then call peft's prepare_model_for_kbit_training(model) and apply the LoRA config as above

Pros: Fine-tune 70B models on a single A100 (80GB) or even 2x RTX 3090s
Cons: Slower training than LoRA on unquantized model; some accuracy loss
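To give a feel for what 4-bit storage means, here is a simplified sketch of blockwise absmax quantization. It is deliberately not NF4: it uses evenly spaced integer levels instead of the normal-float levels that the `nf4` option selects, and all function names are illustrative. It does show the two ingredients that matter for memory: 4-bit codes per weight plus one scale per block.

```python
from typing import List, Tuple


def quantize_block(values: List[float], levels: int = 7) -> Tuple[List[int], float]:
    """Absmax-quantize one block to signed integers in [-levels, levels] (fits in 4 bits)."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero for an all-zero block
    codes = [round(v / absmax * levels) for v in values]
    return codes, absmax


def dequantize_block(codes: List[int], absmax: float, levels: int = 7) -> List[float]:
    """Map integer codes back to approximate float values."""
    return [c * absmax / levels for c in codes]


def quantize(weights: List[float], block_size: int = 64) -> List[Tuple[List[int], float]]:
    """Split weights into blocks, each with its own scale."""
    return [
        quantize_block(weights[start:start + block_size])
        for start in range(0, len(weights), block_size)
    ]


def dequantize(blocks: List[Tuple[List[int], float]]) -> List[float]:
    out: List[float] = []
    for codes, absmax in blocks:
        out.extend(dequantize_block(codes, absmax))
    return out


def bits_per_weight(block_size: int = 64, scale_bits: int = 32) -> float:
    """Storage cost: 4-bit code plus the per-block scale amortized over the block."""
    return 4 + scale_bits / block_size


if __name__ == "__main__":
    weights = [0.6, -1.0, 0.25, 0.0, 2.0, -0.1, 0.03, -2.0]
    blocks = quantize(weights, block_size=4)
    print("restored:", [round(v, 3) for v in dequantize(blocks)])
    print("bits per weight at block size 64:", bits_per_weight(64))
```

Per-block scales keep one outlier from ruining the precision of the whole tensor. Double quantization (`bnb_4bit_use_double_quant=True` above) goes one step further and quantizes those scales as well.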


5. RLHF (Reinforcement Learning from Human Feedback)

A 3-stage process that aligns models using human preference signals. Used to create ChatGPT, Claude, and Gemini.

Stage 1: SFT. Fine-tune on curated demonstration data.
Stage 2: Reward Model. Train a model to score responses by quality.
Stage 3: PPO. Use RL to optimize the SFT model against the reward model.

python
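# Sketch against the legacy TRL PPO API; newer TRL releases restructure PPOTrainer.
# Assumes `tokenizer` and a tokenized `dataset` (with "input_ids") already exist, and
# that `score_responses` is your own helper that runs the Stage 2 reward model and
# returns one scalar tensor per response.
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

ppo_config = PPOConfig(
    model_name="sft-model",
    learning_rate=1.41e-5,
    batch_size=16,
    mini_batch_size=4
)

ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=AutoModelForCausalLMWithValueHead.from_pretrained("sft-model"),
    ref_model=AutoModelForCausalLMWithValueHead.from_pretrained("sft-model"),  # frozen copy for the KL penalty
    tokenizer=tokenizer,
    dataset=dataset
)

# Training loop
for batch in ppo_trainer.dataloader:
    query_tensors = batch["input_ids"]
    response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, max_new_tokens=200)
    rewards = score_responses(query_tensors, response_tensors)  # hypothetical helper
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)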

Pros: Strong alignment with human values; used in production by all major labs
Cons: Requires large human preference datasets; complex, unstable training
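The reward model in Stage 2 is typically trained with a pairwise loss on (chosen, rejected) responses, and Stage 3 usually subtracts a KL-style penalty so the policy does not drift far from the SFT model. The sketch below shows both formulas on plain floats. It assumes the common Bradley-Terry formulation; the function names and the `kl_coef` default are illustrative, not taken from a specific library.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise loss: small when the chosen response is scored above the rejected one."""
    return -log_sigmoid(score_chosen - score_rejected)


def kl_penalized_reward(
    reward: float, logp_policy: float, logp_ref: float, kl_coef: float = 0.1
) -> float:
    """Reward seen by the RL step: reward-model score minus a penalty for leaving the reference."""
    return reward - kl_coef * (logp_policy - logp_ref)


if __name__ == "__main__":
    print("loss, correct ranking:", round(reward_model_loss(2.0, 0.0), 4))
    print("loss, wrong ranking:  ", round(reward_model_loss(0.0, 2.0), 4))
    print("shaped reward:", kl_penalized_reward(1.0, logp_policy=-4.0, logp_ref=-5.0))
```

The penalty term is the reason the PPO trainer takes a reference model in addition to the policy.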


6. DPO (Direct Preference Optimization)

Directly optimizes the LLM on preference pairs (chosen vs rejected responses) without a separate reward model. Simpler and more stable than RLHF + PPO.

python
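from trl import DPOTrainer, DPOConfig
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# "sft-model" is a placeholder path for your SFT checkpoint
sft_model = AutoModelForCausalLM.from_pretrained("sft-model")
sft_model_ref = AutoModelForCausalLM.from_pretrained("sft-model")
tokenizer = AutoTokenizer.from_pretrained("sft-model")

# Dataset format: {prompt, chosen, rejected}
preference_data = Dataset.from_dict({
    "prompt": ["Explain neural networks"],
    "chosen": ["A neural network is a system of layers..."],   # preferred response
    "rejected": ["Neural networks are like brains..."]          # worse response
})

dpo_config = DPOConfig(
    output_dir="./dpo-model",
    beta=0.1,              # KL penalty: higher = stay closer to reference model
    learning_rate=5e-7,
    num_train_epochs=1,
    per_device_train_batch_size=2
)

dpo_trainer = DPOTrainer(
    model=sft_model,
    ref_model=sft_model_ref,    # frozen reference (the SFT checkpoint)
    args=dpo_config,
    train_dataset=preference_data,
    processing_class=tokenizer  # older TRL releases call this argument tokenizer
)
dpo_trainer.train()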

Pros: No reward model, no RL instability; much simpler than RLHF
Cons: Sensitive to beta; requires good quality preference data
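The role of beta is easiest to see in the loss itself. Below is a minimal sketch of the DPO objective for a single preference pair, written on plain floats. The inputs are assumed to be summed log-probabilities of each full response under the policy and the frozen reference; the function and argument names are illustrative.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair."""
    # How much more (or less) likely each response is under the policy than the reference
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The loss falls as the policy raises chosen relative to rejected, scaled by beta
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


if __name__ == "__main__":
    # Policy identical to the reference: loss is log(2), about 0.6931
    print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))
    # Policy has shifted probability toward the chosen response: loss drops
    print(round(dpo_loss(-8.0, -14.0, -10.0, -12.0), 4))
```

The implicit reward here is beta times the log-ratio against the reference, which is why no separate reward model is needed.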


7. ORPO (Odds Ratio Preference Optimization)

Combines SFT and preference alignment into a single training step: no separate SFT phase and no reference model. Newer than DPO and cheaper per step, since no reference model has to be held in memory or run forward.

python
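from trl import ORPOTrainer, ORPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

orpo_config = ORPOConfig(
    output_dir="./orpo-model",
    learning_rate=8e-6,
    beta=0.1,              # lambda in the ORPO paper: weight of the odds-ratio term
    num_train_epochs=3,
    per_device_train_batch_size=4
)

orpo_trainer = ORPOTrainer(
    model=base_model,         # start from base, not SFT model
    args=orpo_config,
    train_dataset=preference_data,   # same {prompt, chosen, rejected} format as DPO
    processing_class=tokenizer       # older TRL releases call this argument tokenizer
)
orpo_trainer.train()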

Pros: Single-stage training (SFT + alignment together); no reference model needed
Cons: Less established than DPO; fewer community resources
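How the two objectives combine can be sketched in a few lines. The version below follows the usual description of ORPO: an SFT term on the chosen response plus a weighted odds-ratio term. Inputs are assumed to be length-averaged log-probabilities (strictly below zero) of each response under the model being trained; names and the `lam` default are illustrative.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(chosen_avg_logp: float, rejected_avg_logp: float, lam: float = 0.1) -> float:
    """SFT loss on the chosen response plus lam times the odds-ratio penalty."""
    sft_term = -chosen_avg_logp  # ordinary negative log-likelihood of the chosen response
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    odds_ratio_term = -log_sigmoid(log_odds_ratio)  # small when chosen is far more likely
    return sft_term + lam * odds_ratio_term


if __name__ == "__main__":
    print("rejected nearly as likely:", round(orpo_loss(-0.5, -0.6), 4))
    print("rejected much less likely:", round(orpo_loss(-0.5, -2.0), 4))
```

Both terms come from the same forward passes of one model, which is what removes the need for a reference model.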


8. Continued Pre-training (Domain Adaptive)

Continues the pre-training objective (next-token prediction) on large amounts of unlabeled domain text to infuse domain knowledge before task fine-tuning.

python
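from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
tokenizer.pad_token = tokenizer.eos_token  # the collator needs a pad token

# Load raw domain documents (no labels needed)
# e.g., medical papers, legal documents, code repositories
domain_texts = load_dataset("text", data_files="medical_corpus.txt")
tokenized = domain_texts["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=["text"],
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False   # causal LM, not masked
)

training_args = TrainingArguments(
    output_dir="./domain-pretrained",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=5e-5         # higher LR than the fine-tuning examples: still pre-training
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=data_collator
)
trainer.train()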

Pros: Teaches model facts/vocabulary specific to a domain
Cons: Requires large domain corpus (GBs to TBs); expensive; must follow with SFT
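With corpora this large, documents are usually concatenated and cut into fixed-length blocks rather than padded one by one, so no compute is spent on padding. A minimal sketch of that packing step on plain token-ID lists follows; the function name, block size, and EOS ID are illustrative assumptions.

```python
from typing import List


def pack_into_blocks(tokenized_docs: List[List[int]], block_size: int, eos_id: int) -> List[List[int]]:
    """Join documents with an EOS separator and split the stream into equal-length blocks."""
    stream: List[int] = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(eos_id)  # marks the document boundary inside a block
    usable = (len(stream) // block_size) * block_size  # drop the incomplete tail
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]


if __name__ == "__main__":
    docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    for block in pack_into_blocks(docs, block_size=4, eos_id=0):
        print(block)
```

Each block then serves as both input and label for next-token prediction. Dropping the tail is the simplest policy; carrying it over to the next batch of documents is an alternative.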


9. Multi-Task Fine-Tuning

Fine-tunes on multiple tasks simultaneously using a shared model. Because every task stays in the training mix, this reduces catastrophic forgetting across those tasks and produces a generalist task model (e.g., FLAN-T5).

python
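from datasets import concatenate_datasets, load_dataset

# Mix multiple task datasets with task prefix prompts
summarization = load_dataset("cnn_dailymail", "3.0.0", split="train")
translation = load_dataset("opus_books", "en-fr", split="train")
qa = load_dataset("squad", split="train")

def to_text_pair(dataset, task, get_input, get_target):
    # Map every dataset to a shared {"input", "target"} schema so they can be concatenated
    return dataset.map(
        lambda x: {"input": f"{task}: {get_input(x)}", "target": get_target(x)},
        remove_columns=dataset.column_names,
    )

mixed_dataset = concatenate_datasets([
    to_text_pair(summarization, "summarize",
                 lambda x: x["article"], lambda x: x["highlights"]),
    to_text_pair(translation, "translate to French",
                 lambda x: x["translation"]["en"], lambda x: x["translation"]["fr"]),
    to_text_pair(qa, "answer question",
                 lambda x: f"{x['question']} context: {x['context']}",
                 lambda x: x["answers"]["text"][0]),
]).shuffle(seed=42)

# Plain concatenation weights tasks by dataset size; for balanced sampling use
# datasets.interleave_datasets with explicit probabilities instead.
# Tokenize mixed_dataset, then train as in the earlier sections:
# trainer = Trainer(model=model, train_dataset=mixed_dataset, ...)
# trainer.train()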

Pros: Better generalization; handles multiple tasks with one model
Cons: Task interference possible; requires carefully balanced dataset mixing
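"Carefully balanced mixing" usually comes down to choosing a sampling probability per task. The sketch below shows two common knobs, a per-task size cap and a temperature, on hypothetical dataset sizes. The function name and the numbers are illustrative, not taken from the datasets above.

```python
from typing import Dict, Optional


def mixing_rates(
    sizes: Dict[str, int], cap: Optional[int] = None, temperature: float = 1.0
) -> Dict[str, float]:
    """Turn dataset sizes into per-task sampling probabilities.

    cap:         treat any dataset larger than this as if it had `cap` examples
    temperature: 1.0 keeps size-proportional sampling; larger values flatten it
    """
    weights = {}
    for task, n in sizes.items():
        effective = min(n, cap) if cap is not None else n
        weights[task] = effective ** (1.0 / temperature)
    total = sum(weights.values())
    return {task: w / total for task, w in weights.items()}


if __name__ == "__main__":
    sizes = {"summarize": 300_000, "translate": 100_000, "answer": 100_000}  # hypothetical
    print("proportional:", mixing_rates(sizes))
    print("capped:      ", mixing_rates(sizes, cap=100_000))
    print("temperature 2:", mixing_rates(sizes, temperature=2.0))
```

The resulting probabilities can be passed to `datasets.interleave_datasets(..., probabilities=[...])` to sample tasks at those rates instead of by raw dataset size.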


Fine-Tuning Strategy Decision Guide

text
Do you have unlimited GPU budget?
    → Full Fine-Tuning (maximum performance)

Memory constrained but need quality fine-tuning?
    → LoRA (standard) or QLoRA (very large models)

Need instruction-following from a base model?
    → SFT (Instruction Tuning) + LoRA

Need to align model with human preferences?
    → DPO (simpler) or RLHF (if you have human raters)

Want SFT + alignment in one shot?
    → ORPO

Model lacks domain knowledge (medical, legal, code)?
    → Continued Pre-training → then SFT

Serving many tasks from one model?
    → Multi-Task Fine-Tuning or Adapter Layers (swap per task)

Interview tip: In practice, most production pipelines combine multiple stages: Continued Pre-training (domain knowledge) → SFT (instruction following) → DPO or RLHF (alignment). You rarely use just one method in isolation.

Learn more at HuggingFace TRL Library and DeepLearning.AI Fine-Tuning Course.