What's the difference between LoRA and full fine-tuning?
Answer
LoRA vs Full Fine-tuning
LoRA (Low-Rank Adaptation) is the dominant technique for fine-tuning LLMs efficiently. Understanding it is essential for any Gen AI engineer.
The Problem with Full Fine-tuning
Full fine-tuning updates every model parameter (billions of weights), which requires enormous GPU memory and compute. For a 7B model trained in fp32, the gradients alone take ~28GB of VRAM, before counting the weights themselves and the optimizer state.
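To make that memory figure concrete, here is a rough back-of-the-envelope sketch. It assumes fp32 training with Adam (4 bytes per weight, 4 per gradient, 8 for Adam's two moment estimates) and ignores activations. The function name and byte counts are illustrative assumptions, not part of any library.

```python
# Rough fp32 training footprint for full fine-tuning with Adam.
# Assumption: 4 bytes for each weight, 4 for its gradient, and
# 8 for Adam's two moment estimates. Activations are ignored.
BYTES_PER_PARAM = {"weights": 4, "gradients": 4, "adam_states": 8}


def full_finetune_memory_gb(num_params: float) -> dict:
    """Return the per-component footprint in GB (1 GB = 1e9 bytes)."""
    breakdown = {
        name: num_params * nbytes / 1e9
        for name, nbytes in BYTES_PER_PARAM.items()
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


memory_7b = full_finetune_memory_gb(7e9)
for name, gb in memory_7b.items():
    print(f"{name:12s} {gb:6.1f} GB")
```

Under these assumptions the gradients account for about 28 GB and the total comes to about 112 GB for a 7B model. Mixed-precision training or a different optimizer would change the per-parameter byte counts.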
How LoRA Works
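LoRA freezes the original weights and adds small low-rank adapter matrices to specific layers (typically attention projections):

$$W' = W_0 + BA$$

Where:
- $W_0 \in \mathbb{R}^{d \times k}$ is the frozen original weight matrix
- $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are the trainable adapters
- $r \ll \min(d, k)$ is the rank (typically 4–64)

Instead of updating $d \times k$ parameters per adapted matrix, you only train $r(d + k)$ parameters. Because most layers get no adapter at all, the trainable count across the whole model is often 100–1,000× smaller.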
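The parameter saving can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the frozen weight has shape d × k and the two adapters have shapes d × r and r × k. The 4096 × 4096 shape and r = 16 are example values chosen for illustration, and the helper names are hypothetical.

```python
def full_params(d: int, k: int) -> int:
    """Parameters updated when the whole d x k weight matrix is trained."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Parameters in the two adapters: B is d x r and A is r x k."""
    return r * (d + k)


# Example values (assumed): one square attention projection, rank 16.
d = k = 4096
r = 16

per_matrix_full = full_params(d, k)
per_matrix_lora = lora_params(d, k, r)
reduction = per_matrix_full / per_matrix_lora

print(f"full: {per_matrix_full:,}  lora: {per_matrix_lora:,}  reduction: {reduction:.0f}x")
```

For this one matrix the adapters are 128× smaller than the weight they adapt. The whole-model ratio is larger still, because only a few matrices per layer receive adapters while every other weight stays frozen.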
LoRA with PEFT + HuggingFace
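```python
import torch
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA: quantise the frozen base model to 4-bit NF4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # Prepares the quantised model for stable training

lora_config = LoraConfig(
    r=16,                                 # Rank: higher = more capacity, more params
    lora_alpha=32,                        # Scaling factor: the adapter update is scaled by alpha / r
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs total parameter counts. With r=16 on q_proj and v_proj of
# this model, 8,388,608 parameters are trainable, well under 1% of the total.
```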
QLoRA: LoRA + 4-bit Quantisation
QLoRA combines LoRA with 4-bit NF4 quantisation of the frozen base model, while the small adapters are trained in higher precision. The QLoRA paper reports fine-tuning a 65B model on a single 48GB GPU this way.
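A quick sanity check shows why 4-bit quantisation makes such large models fit. This sketch counts weight storage only and ignores quantisation constants, adapters, activations and optimizer state. `weight_memory_gb` is a hypothetical helper, not a library function.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for size_name, num_params in [("7B", 7e9), ("70B", 70e9)]:
    fp16_gb = weight_memory_gb(num_params, 16)
    nf4_gb = weight_memory_gb(num_params, 4)
    print(f"{size_name}: fp16 weights {fp16_gb:.1f} GB, 4-bit weights {nf4_gb:.1f} GB")
```

Under these assumptions, the weights of a 70B-class model shrink from roughly 140 GB in fp16 to roughly 35 GB at 4 bits, which is what leaves headroom on a 48 GB card for the adapters and activations.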
Comparison Table
| Feature | Full Fine-tuning | LoRA | QLoRA |
|---|---|---|---|
| Trainable params | 100% | 0.1–1% | 0.1–1% |
| VRAM (7B model) | ~112GB | ~16GB | ~6GB |
| Training speed | Slowest | 2–3× faster | Somewhat slower than LoRA (dequantisation overhead) |
| Quality | Best | Near full fine-tuning on most tasks | Slightly below LoRA |
| Adapter portability | ❌ Full model | ✅ Tiny adapter file (~30MB) | ✅ Tiny adapter file |
| Multi-task | Separate model per task | Swap adapters at runtime | Swap adapters |
Merging LoRA Adapters
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the unquantised base model, then attach the trained adapter
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base_model, "./lora-adapter")

# Fold the adapter into the base weights and drop the PEFT wrappers
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
```
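What merging does numerically can be illustrated on a toy example. This is a sketch in plain Python lists, not PEFT's implementation. It assumes the adapter update is scaled by alpha / r, and the matrices, shapes and helper names are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(w, delta, scale):
    """Return w + scale * delta, element by element."""
    return [
        [wi + scale * di for wi, di in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy shapes: d = 2 outputs, k = 3 inputs, rank r = 1.
W0 = [[1.0, 0.0, 2.0],
      [0.0, 1.0, -1.0]]   # frozen weight, d x k
B = [[0.5],
     [-1.0]]              # trainable adapter, d x r
A = [[1.0, 2.0, 3.0]]     # trainable adapter, r x k
alpha, r = 2.0, 1
scale = alpha / r

x = [[1.0], [2.0], [3.0]]  # input column vector, k x 1

# Unmerged: the base path and the adapter path are computed separately.
unmerged = add_scaled(matmul(W0, x), matmul(B, matmul(A, x)), scale)

# Merged: fold the scaled adapter product into the weight once, up front.
W_merged = add_scaled(W0, matmul(B, A), scale)
merged = matmul(W_merged, x)

print(unmerged, merged)
```

Both paths give the same output, so a merged model needs no extra matrix multiplications at inference time. The trade-off is that the adapter can no longer be swapped out without reloading the original base weights.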
When to choose LoRA: Almost always. The only reason to choose full fine-tuning over LoRA is if you need maximum possible quality and have the GPU budget. QLoRA is the default choice for most production fine-tuning.