What's the difference between LoRA and full fine-tuning?

Question

Accepted Answer

## LoRA vs Full Fine-tuning

LoRA (Low-Rank Adaptation) is one of the most widely used techniques for fine-tuning LLMs efficiently: it trains a small set of added weights instead of the whole model.

### The Problem with Full Fine-tuning

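Full fine-tuning updates **all** model parameters (billions of weights), which requires enormous GPU memory and compute. For a 7B model in fp32, the gradients alone take ~28GB of VRAM; add the weights and the Adam optimizer states and the total reaches ~112GB before counting activations.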
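As a rough back-of-the-envelope check of these figures, the sketch below adds up the memory for weights, gradients, and optimizer state. It assumes fp32 throughout and an Adam-style optimizer that keeps two extra values per parameter; activations and framework overhead are ignored, so real usage is higher. The function name is illustrative.

```python
def full_finetune_memory_gb(n_params: float, bytes_per_value: int = 4) -> dict:
    """Rough memory estimate for full fine-tuning (activations excluded).

    Assumes weights, gradients, and both Adam moment buffers are stored
    at the same precision (fp32 by default).
    """
    per_copy = n_params * bytes_per_value / 1e9  # decimal GB for one full copy
    return {
        "weights": per_copy,
        "gradients": per_copy,
        "adam_states": 2 * per_copy,  # first and second moment estimates
        "total": 4 * per_copy,
    }


estimate = full_finetune_memory_gb(7e9)
for name, gb in estimate.items():
    print(f"{name:>12}: {gb:.0f} GB")
```

Under these assumptions a 7B model comes to roughly 28 GB for gradients alone and about 112 GB in total, which is the figure used for full fine-tuning in the comparison table below.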

### How LoRA Works

LoRA freezes the original weights and adds small **low-rank adapter matrices** to specific layers (typically attention projections):

$$W' = W + \Delta W = W + BA$$

Where:
- $W \in \mathbb{R}^{d \times d}$ is the frozen original weight matrix
- $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ are the trainable adapters
- $r \ll d$ is the rank (typically 4–64)

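Instead of updating $d^2$ parameters, you only train $2 \times d \times r$ parameters per adapted matrix. With $d = 4096$ and $r = 16$ that is 131,072 instead of ~16.8M (128× fewer), and because only some matrices get adapters, the trainable share of the whole model is often **100–1,000× smaller**.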
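The update rule above can be sketched in a few lines of NumPy. This is a toy illustration with small dimensions, not the PEFT implementation; the `alpha / r` scaling and the zero-initialised `B` follow the usual LoRA convention and are assumptions here, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 4, 8               # toy sizes; real models use d in the thousands

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                 # trainable, zero init so that BA = 0 at the start


def lora_forward(x: np.ndarray) -> np.ndarray:
    """Compute W'x = Wx + (alpha / r) * B(Ax) without materialising BA."""
    return W @ x + (alpha / r) * (B @ (A @ x))


x = rng.normal(size=(d,))

# Before any training step the adapter contributes nothing.
assert np.allclose(lora_forward(x), W @ x)

full_params = W.size                 # d * d
lora_params = A.size + B.size        # 2 * d * r
print(full_params, lora_params, full_params / lora_params)
```

Computing `B @ (A @ x)` rather than `(B @ A) @ x` keeps the extra cost proportional to $d \times r$ instead of $d^2$, which is the point of the low-rank factorisation.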

### LoRA with PEFT + HuggingFace

```python
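import torch
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA-style loading: quantise the frozen base model to 4-bit NF4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # Prepare the quantised model for training

lora_config = LoraConfig(
    r=16,                                 # Rank: higher = more capacity, more params
    lora_alpha=32,                        # Scaling factor: the update BA is multiplied by alpha / r
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.05,                    # Dropout on the adapter input during training
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Expect 8,388,608 trainable params for r=16 on q_proj and v_proj (roughly 0.12% of ~6.7B);
# the printed totals and percentage depend on how your PEFT version counts 4-bit weights.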
```
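The trainable-parameter count reported by `print_trainable_parameters()` can be estimated by hand. The sketch below assumes the Llama-2-7B shapes (hidden size 4096, 32 decoder layers, square `q_proj` and `v_proj` matrices); these values are not stated in the answer and should be checked against the model config. The function name is illustrative.

```python
def lora_trainable_params(hidden: int, rank: int, n_layers: int, modules_per_layer: int) -> int:
    """Count LoRA parameters when every adapted matrix is hidden x hidden."""
    per_matrix = 2 * hidden * rank  # A is rank x hidden, B is hidden x rank
    return per_matrix * modules_per_layer * n_layers


# Assumed Llama-2-7B shapes: hidden size 4096, 32 layers, q_proj and v_proj adapted.
for rank in (8, 16):
    print(rank, f"{lora_trainable_params(4096, rank, 32, 2):,}")
```

With these assumed shapes, rank 8 gives 4,194,304 trainable parameters and rank 16 gives 8,388,608.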

### QLoRA: LoRA + 4-bit Quantisation

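QLoRA combines LoRA with 4-bit NF4 quantisation of the frozen base model, while the adapters themselves are trained in higher precision. This brings models in the 65B–70B range within reach of a **single 48GB GPU** (the QLoRA paper reports this for a 65B model).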
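To see why 4-bit storage matters at this scale, the sketch below estimates the memory taken by the base weights alone at different precisions. It ignores quantisation constants, the adapters, optimizer state, and activations, so it is a lower bound rather than a full budget; the function name is illustrative.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the model weights alone, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9


for n_params in (65e9, 70e9):
    for bits in (16, 4):
        gb = weight_memory_gb(n_params, bits)
        print(f"{n_params / 1e9:.0f}B at {bits}-bit: {gb:.1f} GB")
```

At 16-bit a 70B model needs about 140 GB for weights, while at 4-bit it needs about 35 GB, which leaves headroom on a 48GB card for adapters and activations.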

### Comparison Table

| Feature | Full Fine-tuning | LoRA | QLoRA |
|---------|----------------|------|-------|
| **Trainable params** | 100% | 0.1–1% | 0.1–1% |
| **VRAM (7B model)** | ~112GB | ~16GB | ~6GB |
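| **Training speed** | Slowest | Typically faster (fewer gradients and optimizer states) | Often somewhat slower than LoRA (dequantisation overhead) |
| **Quality vs full fine-tuning** | Best (reference) | Near-identical | Slightly below LoRA |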
| **Adapter portability** | ❌ Full model | ✅ Tiny adapter file (~30MB) | ✅ Tiny adapter file |
| **Multi-task** | Separate model per task | Swap adapters at runtime | Swap adapters |
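The "swap adapters at runtime" row can be sketched with PEFT's named adapters. This is a minimal sketch, not a production loader: the helper name, adapter names, and directory paths are hypothetical, and it assumes each adapter was trained against the same base model.

```python
def load_with_adapters(base_id: str, adapters: dict):
    """Load one base model and attach several LoRA adapters to it.

    `adapters` maps an adapter name to the directory it was saved in.
    Imports are local so the helper can be defined before the libraries are loaded.
    """
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
    names = list(adapters)
    # The first adapter wraps the base model; the rest are added by name.
    model = PeftModel.from_pretrained(base, adapters[names[0]], adapter_name=names[0])
    for name in names[1:]:
        model.load_adapter(adapters[name], adapter_name=name)
    return model


# Hypothetical usage (paths and names are placeholders):
# model = load_with_adapters(
#     "meta-llama/Llama-2-7b-hf",
#     {"summarise": "./adapters/summarise", "sql": "./adapters/sql"},
# )
# model.set_adapter("sql")        # route forward passes through the "sql" adapter
# model.set_adapter("summarise")  # switch tasks without reloading the base weights
```

Because the base weights are loaded once and shared, each additional task costs only the size of its adapter.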

### Merging LoRA Adapters

```python
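import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model in half precision (not 4-bit) so the adapter can be merged cleanly
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
)
model = PeftModel.from_pretrained(base_model, "./lora-adapter")  # Attach the trained adapter
merged_model = model.merge_and_unload()  # Fold BA into W and remove the adapter layers
merged_model.save_pretrained("./merged-model")  # Standalone checkpoint, loadable without PEFT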
```
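Conceptually, merging folds the low-rank update into the base weight so that inference needs a single matrix multiply per layer. The NumPy sketch below checks that equivalence on toy matrices; it illustrates the idea rather than what `merge_and_unload()` runs internally, and the `alpha / r` scaling is the usual LoRA convention, assumed here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 32, 4, 8

W = rng.normal(size=(d, d))   # frozen base weight
A = rng.normal(size=(r, d))   # stands in for a trained adapter
B = rng.normal(size=(d, r))
scale = alpha / r

x = rng.normal(size=(d,))

# Unmerged: base path plus the low-rank adapter path.
y_adapter = W @ x + scale * (B @ (A @ x))

# Merged: fold the update into one d x d matrix, after which A and B can be dropped.
W_merged = W + scale * (B @ A)
y_merged = W_merged @ x

assert np.allclose(y_adapter, y_merged)
print(W_merged.shape)
```

The merged matrix has the same shape as the original weight, so the merged model has no extra inference latency, at the cost of no longer being able to swap the adapter out.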

> **When to choose LoRA:** Almost always. The only reason to choose full fine-tuning over LoRA is if you need maximum possible quality and have the GPU budget. QLoRA is the default choice for most production fine-tuning.
