Concept #11 · Hard · gen-ai-fundamentals

What's the difference between LoRA and full fine-tuning?

#gen-ai #fine-tuning #lora

Answer

LoRA vs Full Fine-tuning

LoRA (Low-Rank Adaptation) is the dominant technique for fine-tuning LLMs efficiently. Understanding it is essential for any Gen AI engineer.

The Problem with Full Fine-tuning

Full fine-tuning updates all model parameters, billions of weights, which requires enormous GPU memory and compute. For a 7B model in fp32 (4 bytes per parameter), the gradients alone take ~28GB of VRAM, on top of the weights themselves and the optimizer states.
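As a rough check on that figure, the sketch below estimates training memory from bytes per parameter. It assumes fp32 values (4 bytes each) and an Adam-style optimizer that keeps two extra states per parameter; activations are ignored, and the function name is illustrative rather than part of any library.

```python
def full_finetune_memory_gb(n_params: float, bytes_per_value: int = 4) -> dict:
    """Rough memory breakdown (in GB, 1 GB = 1e9 bytes) for full fine-tuning.

    Assumes weights, gradients, and two Adam moment buffers all use
    `bytes_per_value` bytes per parameter; activations are not counted.
    """
    per_copy = n_params * bytes_per_value / 1e9
    breakdown = {
        "weights": per_copy,
        "gradients": per_copy,
        "adam_states": 2 * per_copy,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


estimate = full_finetune_memory_gb(7e9)
for name, gb in estimate.items():
    print(f"{name:12s} {gb:6.1f} GB")
```

Under these assumptions the gradients account for 28 GB and the total reaches 112 GB before any activations are stored. Mixed-precision setups change the per-value sizes, so treat the numbers as an order-of-magnitude guide.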

How LoRA Works

LoRA freezes the original weights and adds small low-rank adapter matrices to specific layers (typically attention projections):

$$W' = W + \Delta W = W + BA$$

Where:

  • $W \in \mathbb{R}^{d \times d}$ is the frozen original weight matrix
  • $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ are the trainable adapters
  • $r \ll d$ is the rank (typically 4–64)
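A minimal sketch of the update above, written with plain Python lists so it runs without dependencies. The helper names and the tiny dimensions (d = 4, r = 2) are illustrative. Initialising B to zero and A to small random values follows the usual LoRA convention; real implementations use tensors and typically scale BA by alpha / r, which is left out here to match the formula.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


random.seed(0)
d, r = 4, 2

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen, d x d
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # trainable, r x d
B = [[0.0] * r for _ in range(d)]                                  # trainable, d x r, starts at zero

delta_W = matmul(B, A)       # d x d matrix with rank at most r
W_adapted = add(W, delta_W)  # W' = W + BA

# With B = 0 the adapter contributes nothing, so training starts from the base model.
assert W_adapted == W
```

Only A and B would receive gradients during training; W is never modified, which is why the same base weights can be shared across many adapters.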

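Instead of updating $d^2$ parameters per adapted matrix, you train only $2 \times d \times r$, a reduction by a factor of $d / (2r)$. For $d = 4096$ and $r = 16$ that is 128× fewer per matrix. Because adapters are usually attached to only a few layers, the trainable share of the whole model often falls to around 0.1%, roughly 1,000× fewer.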
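The per-matrix arithmetic can be checked with a short sketch. The function name is illustrative, and d = 4096 is an assumption chosen to match the hidden size of a typical 7B-class model.

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable parameter counts for one d x d matrix."""
    full = d * d      # every entry of W is updated
    lora = 2 * d * r  # B is d x r, A is r x d
    return full, lora


for r in (4, 16, 64):
    full, lora = lora_param_counts(d=4096, r=r)
    print(f"r={r:2d}: full={full:,} lora={lora:,} reduction={full / lora:.0f}x")
```

The reduction factor is d / (2r), so it shrinks as the rank grows: higher ranks buy capacity at the cost of more trainable parameters.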

LoRA with PEFT + HuggingFace

python
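import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA: quantise the frozen base model to 4-bit NF4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # PEFT helper that readies a quantised model for training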

lora_config = LoraConfig(
    r=16,                          # Rank — higher = more capacity, more params
    lora_alpha=32,                 # Scaling factor: the adapter update BA is multiplied by lora_alpha / r
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
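# With r=16 on q_proj and v_proj of Llama-2-7B (32 layers, d=4096) this reports
# trainable params: 8,388,608 (2 * 4096 * 16 per projection * 2 projections * 32 layers),
# about 0.12% of the ~6.7B unquantised total. The "all params" figure that is printed
# depends on how your PEFT version counts packed 4-bit weights.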

QLoRA: LoRA + 4-bit Quantisation

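QLoRA combines LoRA with 4-bit NF4 (NormalFloat) quantisation of the frozen base model, while the adapters are trained in higher precision. At 4 bits per weight the base weights of a 70B model take roughly 35GB, which is what makes fine-tuning a model of that size on a single 48GB GPU possible.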
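The memory saving comes almost entirely from the base weights, which a quick calculation makes concrete. This sketch counts weights only: it ignores quantisation constants, the adapters, optimizer states, and activations, all of which add overhead on top. The function name and the choice of model sizes are illustrative.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the base weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for n_params in (7e9, 70e9):
    for bits in (16, 4):
        gb = weight_memory_gb(n_params, bits)
        print(f"{n_params / 1e9:.0f}B at {bits:2d}-bit: {gb:6.1f} GB")
```

At 16 bits a 70B model needs 140 GB for weights alone, far beyond a single 48GB card; at 4 bits it needs 35 GB, which leaves some headroom for adapters and activations.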

Comparison Table

| Feature | Full Fine-tuning | LoRA | QLoRA |
|---|---|---|---|
| Trainable params | 100% | 0.1–1% | 0.1–1% |
| VRAM (7B model) | ~112GB | ~16GB | ~6GB |
| Training speed | Slowest | 2–3× faster | Similar to LoRA |
| Quality vs base | Best | Near-identical | Slightly below LoRA |
| Adapter portability | ❌ Full model | ✅ Tiny adapter file (~30MB) | ✅ Tiny adapter file |
| Multi-task | Separate model per task | Swap adapters at runtime | Swap adapters |

Merging LoRA Adapters

python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model, attach the adapter, then merge it into the weights
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base_model, "./lora-adapter")
merged_model = model.merge_and_unload()  # Merges adapter into base weights
merged_model.save_pretrained("./merged-model")
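To see why merging does not change the model's outputs (up to floating-point rounding), the toy check below compares applying the adapter as a side path against folding it into the weights. It uses plain Python lists; the helper names and dimensions are illustrative, and the alpha / r scaling is assumed to follow the lora_alpha convention. This is a sketch of the underlying algebra, not of how merge_and_unload is implemented internally.

```python
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


random.seed(0)
d, r, alpha = 4, 2, 4
scale = alpha / r

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # base weights, d x d
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]  # adapter, r x d
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]  # adapter, d x r
x = [random.gauss(0, 1) for _ in range(d)]                      # one input vector

# Unmerged: base path plus a separate low-rank path, y = Wx + scale * B(Ax)
base_out = matvec(W, x)
adapter_out = matvec(B, matvec(A, x))
y_unmerged = [b + scale * a for b, a in zip(base_out, adapter_out)]

# Merged: fold the adapter into the weights once, W' = W + scale * BA
BA = matmul(B, A)
W_merged = [[w + scale * ba for w, ba in zip(rw, rba)] for rw, rba in zip(W, BA)]
y_merged = matvec(W_merged, x)

max_diff = max(abs(u - m) for u, m in zip(y_unmerged, y_merged))
print(f"max difference: {max_diff:.2e}")
```

The two outputs agree up to rounding error. After merging, inference costs the same as the base model because the extra low-rank path is gone, at the price of no longer being able to swap adapters on that copy of the weights.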

When to choose LoRA: almost always. Full fine-tuning is worth choosing only if you need the maximum possible quality and have the GPU budget for it. When GPU memory is the main constraint, QLoRA is the default choice for most production fine-tuning.