What are all the different ways to fine-tune an LLM?
Answer
All the Different Ways to Fine-Tune an LLM
Fine-tuning adapts a pre-trained LLM to a specific task, domain, or behavior. There are many approaches — from full weight updates to tiny parameter tweaks — each with different cost, data, and performance trade-offs.
Overview
| Method | Updates | Data Needed | Cost | Best For |
|---|---|---|---|---|
| Full Fine-Tuning | All weights | Large labeled dataset | Very high | Max performance, ample GPU |
| SFT (Instruction Tuning) | All or PEFT | Instruction-output pairs | Medium | Task-following behavior |
| LoRA / QLoRA | 0.1–1% params | Medium dataset | Low | Most production fine-tuning |
| PEFT (various) | <1% params | Medium dataset | Low | Memory-constrained GPUs |
| RLHF | Policy + reward model | Human preferences | High | Alignment, safety |
| DPO | Policy weights | Preference pairs | Medium | Alignment without reward model |
| ORPO | Policy weights | Preference pairs | Low | Simpler alignment than DPO |
| Continued Pre-training | All weights | Unlabeled domain text | Very high | Domain adaptation |
| DAPT / TAPT | All weights | Domain/task text | High | Specialized knowledge |
| Multi-task Fine-Tuning | All or PEFT | Multiple task datasets | High | Generalist task models |
1. Full Fine-Tuning
Updates all model weights end-to-end on a task-specific dataset. Highest performance ceiling but requires significant GPU memory and compute.
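```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-3.2-3B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for padded batches

# Assumes each JSONL record has a "text" field; adjust to your schema.
dataset = load_dataset("json", data_files="train.jsonl")
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True),
    batched=True,
    remove_columns=dataset["train"].column_names,
)

training_args = TrainingArguments(
    output_dir="./full-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    fp16=True,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    # mlm=False: the collator copies input_ids into labels for causal LM
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```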
Pros: Best performance, no architectural constraints
Cons: Needs GPU memory several times the size of the model itself, because gradients and optimizer states are stored alongside the weights; risk of catastrophic forgetting
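To get a feel for the memory cost, here is a rough back-of-the-envelope sketch. It assumes mixed-precision training with Adam and a commonly quoted breakdown of 16 bytes per parameter; activations, batch size, and framework overhead are ignored, and the 3e9 parameter count is just a round figure for a 3B model.

```python
# Assumed per-parameter cost for mixed-precision training with Adam.
BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "adam_first_moment": 4,
    "adam_second_moment": 4,
}


def training_memory_gb(n_params: float) -> float:
    """Weights + gradients + optimizer state, excluding activations."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9


def inference_memory_gb(n_params: float, bytes_per_weight: int = 2) -> float:
    """Memory for the weights alone at the given precision."""
    return n_params * bytes_per_weight / 1e9


n_params = 3e9  # round figure for a 3B-parameter model
print(f"weights only: {inference_memory_gb(n_params):.0f} GB")
print(f"full fine-tuning state: {training_memory_gb(n_params):.0f} GB")
```

Under these assumptions the training state is about eight times the size of the half-precision weights, which is why parameter-efficient methods are attractive on smaller GPUs.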
2. Supervised Fine-Tuning (SFT) / Instruction Tuning
Fine-tunes on instruction-response pairs to teach the model to follow task instructions. This is the "Stage 1" of most alignment pipelines (used to create Alpaca, Vicuna, etc.).
```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM
from trl import SFTTrainer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

# Dataset format: {"instruction": "...", "output": "..."}
# The file name is a placeholder.
dataset = load_dataset("json", data_files="train.jsonl", split="train")


def format_prompt(example):
    """Render one record as a single training string."""
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )


# The sequence-length cap (max_seq_length=2048 in the original call) is set on
# SFTTrainer in older TRL releases and on SFTConfig in newer ones; check the
# version you have installed.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    formatting_func=format_prompt,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```
Pros: Teaches instruction-following with modest data (~1K–100K examples)
Cons: Model learns to imitate, not necessarily to be helpful/safe
3. LoRA Fine-Tuning
Injects low-rank trainable matrices into frozen attention layers. The most popular PEFT method. See Q155 for full PEFT coverage.
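```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

lora_config = LoraConfig(
    r=16,  # rank of the low-rank update
    lora_alpha=32,  # the update is scaled by lora_alpha / r
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints the trainable share, a fraction of a percent here
```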
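To make the low-rank idea concrete, here is a small pure-Python sketch (no PEFT involved) of a LoRA-style update. It assumes the usual formulation W + (alpha / r) * B A. The dimensions, helper names, and random values are illustrative only.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def add_scaled(X, Y, scale):
    """Return X + scale * Y elementwise."""
    return [[x + scale * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def random_matrix(rows, cols, rng, std=1.0):
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_out, d_in, r, alpha = 32, 32, 2, 4
scale = alpha / r

W = random_matrix(d_out, d_in, rng)    # frozen pretrained weight
B = random_matrix(d_out, r, rng, 0.1)  # trainable (values stand in for a trained adapter)
A = random_matrix(r, d_in, rng, 0.1)   # trainable
x = random_matrix(d_in, 1, rng)        # one input column vector

# Adapter path used during training: W x + scale * B (A x)
adapter_out = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Merged path used at inference: (W + scale * B A) x
W_merged = add_scaled(W, matmul(B, A), scale)
merged_out = matmul(W_merged, x)

max_diff = max(abs(a[0] - m[0]) for a, m in zip(adapter_out, merged_out))
trainable = d_out * r + r * d_in
print(f"max difference between paths: {max_diff:.2e}")
print(f"trainable values: {trainable} of {d_out * d_in}")
```

Both paths give the same output up to floating-point error, which is why a merged adapter adds no inference cost. The trainable share here is large only because the toy matrices are tiny; it shrinks as the layer width grows relative to r.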
Pros: Near full fine-tuning quality; zero inference overhead (the adapter weights merge back into the base weights)
Cons: The rank r is a hyperparameter that has to be chosen for the task
4. QLoRA Fine-Tuning
LoRA on top of a 4-bit quantized base model. Enables fine-tuning 70B-class models on a single high-memory GPU such as an 80GB A100.
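```python
import torch
from peft import prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 after dequantization
    bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
# Then apply the LoRA config as above with get_peft_model(model, lora_config)
```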
Pros: Fine-tune 70B models on a single A100 (80GB) or even 2x RTX 3090s
Cons: Slower training than LoRA on an unquantized model; some accuracy loss
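The memory saving comes from storing each weight as a 4-bit code plus a per-block scale. The sketch below illustrates that idea with plain uniform absmax quantization. It is not the NF4 scheme that bitsandbytes uses (NF4 picks non-uniform levels), and the block size and helper names are assumptions for illustration.

```python
import random

LEVELS = 7  # signed codes in [-7, 7] fit in 4 bits


def quantize(weights, block_size=64):
    """Split weights into blocks; store integer codes plus one absmax scale per block."""
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0
        codes = [round(v / absmax * LEVELS) for v in block]
        blocks.append((codes, absmax))
    return blocks


def dequantize(blocks):
    """Reconstruct approximate float weights from codes and scales."""
    restored = []
    for codes, absmax in blocks:
        restored.extend(c / LEVELS * absmax for c in codes)
    return restored


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(256)]

blocks = quantize(weights)
restored = dequantize(blocks)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"max reconstruction error: {max_error:.5f}")
```

The reconstruction error per weight is bounded by half a quantization step of its block, which gives a sense of where the "some accuracy loss" comes from.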
5. RLHF (Reinforcement Learning from Human Feedback)
A 3-stage process that aligns models using human preference signals. Used to create ChatGPT, Claude, and Gemini.
Stage 1 (SFT): Fine-tune on curated demonstration data
Stage 2 (Reward Model): Train a model to score responses by quality
Stage 3 (PPO): Use RL to optimize the SFT model against the reward model
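For Stage 2, a common choice is a pairwise (Bradley-Terry style) loss that pushes the score of the preferred response above the score of the rejected one. The sketch below shows that loss on plain floats; the function name and the example scores are illustrative, and a real reward model would produce the scores from a network.

```python
import math


def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """-log(sigmoid(score_chosen - score_rejected)), written as log(1 + exp(-margin))."""
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))


# The loss is small when the chosen response already scores higher,
# and large when the ranking is the wrong way around.
print(pairwise_reward_loss(2.0, 0.0))
print(pairwise_reward_loss(0.0, 2.0))
```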
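```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# Note: this follows the legacy PPOTrainer interface from older TRL releases;
# newer releases changed the constructor and the training loop.
# "sft-model" is a placeholder path. `dataset` (torch-formatted "input_ids")
# and `reward_model` (a callable that scores one response) are assumed to exist.
tokenizer = AutoTokenizer.from_pretrained("sft-model")

ppo_config = PPOConfig(
    model_name="sft-model",
    learning_rate=1.41e-5,
    batch_size=16,
    mini_batch_size=4,
)

ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=AutoModelForCausalLMWithValueHead.from_pretrained("sft-model"),
    ref_model=AutoModelForCausalLMWithValueHead.from_pretrained("sft-model"),  # frozen reference
    tokenizer=tokenizer,
    dataset=dataset,
    # Keep each batch as lists of per-example tensors, as generate/step expect.
    data_collator=lambda rows: {key: [row[key] for row in rows] for key in rows[0]},
)

# Training loop
for batch in ppo_trainer.dataloader:
    query_tensors = batch["input_ids"]  # list of 1-D tensors
    response_tensors = ppo_trainer.generate(
        query_tensors, return_prompt=False, max_new_tokens=200
    )
    # One scalar reward tensor per response.
    rewards = [torch.as_tensor(reward_model(r), dtype=torch.float) for r in response_tensors]
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```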
Pros: Strong alignment with human preferences; widely used in production by major labs
Cons: Requires large human preference datasets; complex, unstable training
6. DPO (Direct Preference Optimization)
Directly optimizes the LLM on preference pairs (chosen vs rejected responses) without a separate reward model. Simpler and more stable than RLHF + PPO.
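```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# "sft-model" is a placeholder for your SFT checkpoint.
sft_model = AutoModelForCausalLM.from_pretrained("sft-model")
sft_model_ref = AutoModelForCausalLM.from_pretrained("sft-model")
tokenizer = AutoTokenizer.from_pretrained("sft-model")

# Dataset format: {prompt, chosen, rejected}
preference_data = Dataset.from_dict({
    "prompt": ["Explain neural networks"],
    "chosen": ["A neural network is a system of layers..."],  # preferred response
    "rejected": ["Neural networks are like brains..."],  # worse response
})

dpo_config = DPOConfig(
    output_dir="./dpo-model",
    beta=0.1,  # KL penalty: higher = stay closer to the reference model
    learning_rate=5e-7,
    num_train_epochs=1,
    per_device_train_batch_size=2,
)

dpo_trainer = DPOTrainer(
    model=sft_model,
    ref_model=sft_model_ref,  # frozen reference (the SFT checkpoint)
    args=dpo_config,
    train_dataset=preference_data,
    tokenizer=tokenizer,  # newer TRL releases name this argument processing_class
)
dpo_trainer.train()
```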
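What the trainer optimizes can be written in a few lines. The sketch below computes the DPO loss for a single preference pair from summed sequence log-probabilities under the policy and the frozen reference. The function name and the example numbers are illustrative, not taken from TRL.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: loss equals log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the chosen response: loss drops.
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

Because beta multiplies the log-ratio difference, it controls how strongly a given shift away from the reference moves the loss.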
Pros: No reward model, no RL instability; much simpler than RLHF
Cons: Sensitive to the choice of beta
7. ORPO (Odds Ratio Preference Optimization)
Combines SFT and preference alignment into a single training step, with no separate SFT phase and no reference model. It is newer than DPO and cheaper per step, because it skips the reference-model forward pass.
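```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

# Start from a base model, not an SFT checkpoint. The checkpoint name is
# assumed to match the earlier sections.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

orpo_config = ORPOConfig(
    output_dir="./orpo-model",
    learning_rate=8e-6,
    beta=0.1,  # weight of the odds-ratio term (lambda in the ORPO paper)
    num_train_epochs=3,
    per_device_train_batch_size=4,
)

orpo_trainer = ORPOTrainer(
    model=base_model,
    args=orpo_config,
    train_dataset=preference_data,  # same {prompt, chosen, rejected} format as for DPO
    tokenizer=tokenizer,  # newer TRL releases name this argument processing_class
)
orpo_trainer.train()
```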
Pros: Single-stage training (SFT + alignment together); no reference model needed
Cons: Less established than DPO; fewer community resources
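As a rough illustration of how the two objectives are combined, the sketch below computes an ORPO-style loss for one preference pair. It assumes the inputs are length-normalized (average per-token) log-probabilities, so they must be negative. The function names and the weight `lam` are illustrative.

```python
import math


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp), with avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(chosen_avg_logp: float, rejected_avg_logp: float, lam: float = 0.1) -> float:
    """SFT negative log-likelihood on the chosen response plus a weighted odds-ratio penalty."""
    sft_term = -chosen_avg_logp
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    odds_ratio_term = math.log1p(math.exp(-log_odds_ratio))  # -log(sigmoid(log odds ratio))
    return sft_term + lam * odds_ratio_term


# Same likelihood for both responses versus a clearly less likely rejected response.
print(orpo_loss(-0.5, -0.5))
print(orpo_loss(-0.5, -2.0))
```

The SFT term depends only on the chosen response, while the penalty term shrinks as the rejected response becomes less likely relative to the chosen one. No reference model appears anywhere in the computation.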
8. Continued Pre-training (Domain Adaptive)
Continues the pre-training objective (next-token prediction) on large amounts of unlabeled domain text to infuse domain knowledge before task fine-tuning.
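```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-3.2-3B"  # assumed; any causal LM checkpoint works
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load raw domain documents (no labels needed),
# e.g., medical papers, legal documents, code repositories
domain_texts = load_dataset("text", data_files="medical_corpus.txt")
tokenized = domain_texts["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True),
    batched=True,
    remove_columns=["text"],
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # causal LM, not masked
)

training_args = TrainingArguments(
    output_dir="./domain-pretrained",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=5e-5,  # higher LR than SFT: still a pre-training phase
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=data_collator,
)
trainer.train()
```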
Pros: Teaches model facts/vocabulary specific to a domain
Cons: Requires large domain corpus (GBs to TBs); expensive; must follow with SFT
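No labels are needed because the targets are just the same tokens shifted by one position. The sketch below shows one common way to prepare such data: concatenating tokenized documents into fixed-length blocks and deriving input/target pairs from each block. Packing is an assumption here rather than something the text above prescribes, and the token ids are made up.

```python
def pack_into_blocks(tokenized_docs, block_size, eos_id):
    """Concatenate documents (separated by eos_id) and cut the stream into equal blocks."""
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(eos_id)
    usable = (len(stream) // block_size) * block_size  # drop the incomplete tail
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]


def next_token_pairs(block):
    """Inputs are all tokens but the last; targets are the same tokens shifted by one."""
    return block[:-1], block[1:]


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # made-up token ids
blocks = pack_into_blocks(docs, block_size=4, eos_id=0)
inputs, targets = next_token_pairs(blocks[0])
print(blocks)
print(inputs, targets)
```

In the Trainer-based code above, the shift is handled inside the model when `labels` equals `input_ids`, so this sketch only makes the objective explicit.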
9. Multi-Task Fine-Tuning
Fine-tunes one shared model on multiple tasks simultaneously. This reduces catastrophic forgetting and produces a generalist task model (e.g., FLAN-T5).
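```python
from datasets import concatenate_datasets, load_dataset
from transformers import Trainer

# Mix multiple task datasets with task prefix prompts
summarization = load_dataset("cnn_dailymail", "3.0.0", split="train")
translation = load_dataset("opus_books", "en-fr", split="train")
qa = load_dataset("squad", split="train")


# Map every task onto the same {"input", "target"} schema so the datasets
# can be concatenated.
def to_summarization(ex):
    return {"input": f"summarize: {ex['article']}", "target": ex["highlights"]}


def to_translation(ex):
    pair = ex["translation"]
    return {"input": f"translate to French: {pair['en']}", "target": pair["fr"]}


def to_qa(ex):
    return {
        "input": f"answer question: {ex['question']} context: {ex['context']}",
        "target": ex["answers"]["text"][0],
    }


# Plain concatenation keeps each task at its original size; downsample or use
# interleave_datasets if the tasks should be balanced.
mixed_dataset = concatenate_datasets([
    summarization.map(to_summarization, remove_columns=summarization.column_names),
    translation.map(to_translation, remove_columns=translation.column_names),
    qa.map(to_qa, remove_columns=qa.column_names),
]).shuffle(seed=42)

# `model` is a model loaded earlier. The remaining Trainer arguments were
# elided in the original, and mixed_dataset must be tokenized before training.
trainer = Trainer(model=model, train_dataset=mixed_dataset)
trainer.train()
```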
Pros: Better generalization; handles multiple tasks with one model
Cons: Task interference possible; requires carefully balanced dataset mixing
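One simple way to reason about balancing is temperature-scaled mixing, where each task is sampled in proportion to its size raised to the power 1/T. The sketch below is an illustration of that idea only. The task sizes are hypothetical and the function name is not from any library.

```python
def mixing_weights(sizes, temperature=1.0):
    """Sampling probability per task, proportional to size ** (1 / temperature)."""
    scaled = {task: n ** (1.0 / temperature) for task, n in sizes.items()}
    total = sum(scaled.values())
    return {task: value / total for task, value in scaled.items()}


# Hypothetical dataset sizes, chosen only to show the effect.
sizes = {"summarize": 8000, "translate": 1000, "qa": 1000}

proportional = mixing_weights(sizes, temperature=1.0)  # follows raw dataset sizes
flattened = mixing_weights(sizes, temperature=3.0)     # small tasks get a larger share
print(proportional)
print(flattened)
```

With T = 1 the largest task dominates the mix; raising T moves the weights toward uniform, which gives small tasks more exposure at the risk of repeating their examples.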
Fine-Tuning Strategy Decision Guide
```text
Do you have unlimited GPU budget?
  → Full Fine-Tuning (maximum performance)

Memory constrained but need quality fine-tuning?
  → LoRA (standard) or QLoRA (very large models)

Need instruction-following from a base model?
  → SFT (Instruction Tuning) + LoRA

Need to align model with human preferences?
  → DPO (simpler) or RLHF (if you have human raters)

Want SFT + alignment in one shot?
  → ORPO

Model lacks domain knowledge (medical, legal, code)?
  → Continued Pre-training → then SFT

Serving many tasks from one model?
  → Multi-Task Fine-Tuning or Adapter Layers (swap per task)
```
Interview tip: In practice, most production pipelines combine multiple stages: Continued Pre-training (domain knowledge) → SFT (instruction following) → DPO or RLHF (alignment). You rarely use just one method in isolation.
Learn more in the Hugging Face TRL library documentation and the DeepLearning.AI fine-tuning course.