Concept #155 · Hard · extended-ai-concepts

What is PEFT? What are all the different types of PEFT?

#peft #fine-tuning #lora #qlora #prompt-tuning #adapters #llm

Answer

What is PEFT?

PEFT (Parameter-Efficient Fine-Tuning) is a collection of techniques for adapting large pre-trained models to specific tasks by updating only a small fraction of model parameters instead of all weights.

Why PEFT Matters

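Fully fine-tuning a 7B-parameter model with Adam in mixed precision needs roughly 112 GB of GPU memory for weights, gradients, and optimizer states alone (about 16 bytes per parameter), before counting activations. PEFT methods such as QLoRA can reduce this to under 16 GB while achieving comparable performance.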

| Approach | Trainable Params | GPU Memory | Performance |
|---|---|---|---|
| Full fine-tuning | 100% | Very high | Baseline |
| PEFT (LoRA) | 0.1–1% | Low | Near-baseline |
| PEFT (QLoRA) | 0.1–1% | Very low | Near-baseline |
| Prompt Tuning | <0.01% | Minimal | Good |
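
The ~112 GB figure for a 7B model can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes mixed-precision training with Adam at about 16 bytes per parameter and ignores activations; the byte breakdown and the helper names are illustrative assumptions, not measurements.

```python
# Rough training-memory estimate; ignores activations, buffers, and fragmentation.
BYTES_PER_PARAM_FULL_FT = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}

def full_finetune_gb(num_params: float) -> float:
    """Memory in GB (1e9 bytes) for weights, gradients, and Adam states."""
    return num_params * sum(BYTES_PER_PARAM_FULL_FT.values()) / 1e9

def frozen_weights_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory in GB for frozen base weights stored at the given precision."""
    return num_params * bits_per_weight / 8 / 1e9

params_7b = 7e9
print(f"Full fine-tuning:  {full_finetune_gb(params_7b):.0f} GB")
print(f"Frozen fp16 base:  {frozen_weights_gb(params_7b, 16):.0f} GB")
print(f"Frozen 4-bit base: {frozen_weights_gb(params_7b, 4):.1f} GB")
```

With PEFT, the large per-parameter cost of gradients and optimizer states applies only to the small set of trainable parameters, which is where most of the saving comes from.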

Types of PEFT

1. LoRA (Low-Rank Adaptation)

The most widely used PEFT method. Instead of updating the weight matrix W directly, LoRA freezes W and adds two small trainable matrices A and B whose rank is much smaller than the layer dimension (`r << d`).

$$W' = W + \Delta W = W + BA$$

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$.
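
To make the shapes in the formula concrete, here is a toy sketch in plain Python. It assumes the initialization described in the LoRA paper (A random, B zero), so the adapted weight equals the frozen weight before any training; the sizes and the `matmul` helper are illustrative only.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

d, k, r = 6, 4, 2  # toy sizes; real layers have d and k in the thousands
random.seed(0)

W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen, d x k
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable, r x k
B = [[0.0] * r for _ in range(d)]                                  # trainable, d x r, zero init

delta_W = matmul(B, A)  # d x k update whose rank is at most r
W_prime = [[w + dw for w, dw in zip(w_row, dw_row)]
           for w_row, dw_row in zip(W, delta_W)]

full_params = d * k        # parameters in a dense update of W
lora_params = r * (d + k)  # parameters in B and A together
print(f"delta_W shape: {len(delta_W)} x {len(delta_W[0])}")
print(f"dense update: {full_params} params, LoRA update: {lora_params} params")
print("W' equals W at initialization:", W_prime == W)
```

At toy sizes the saving is small, but the LoRA count grows with d + k while the dense count grows with d * k, so the gap widens quickly for real layers.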

```python
from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                        # rank: controls adapter capacity
    lora_alpha=32,               # scaling factor; the update BA is scaled by lora_alpha / r
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts; for this config expect a few
# million trainable parameters, roughly 0.15% of the model.
```

Best for: Most fine-tuning tasks; merges back into base model at inference (zero overhead).
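
The "zero overhead" claim follows from linearity: applying the frozen weight and the low-rank path separately gives the same output as applying one merged matrix. A minimal sketch with hand-picked toy values (the helpers are illustrative, and the `lora_alpha / r` scaling is omitted for brevity):

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # frozen weight, 3 x 2
B = [[0.5], [-1.0], [2.0]]                # trained LoRA factor, 3 x 1
A = [[1.0, -2.0]]                         # trained LoRA factor, 1 x 2
x = [3.0, 1.0]

# During training: base path plus low-rank side path.
unmerged = [b + l for b, l in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# For deployment: fold BA into W once, then use a single matrix.
BA = matmul(B, A)
W_merged = [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, BA)]
merged = matvec(W_merged, x)

print(unmerged)  # [5.5, 12.0, 23.0]
print(merged)    # [5.5, 12.0, 23.0]
```

In the PEFT library, the corresponding operation on a LoRA-wrapped model is `merge_and_unload()`.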


2. QLoRA (Quantized LoRA)

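LoRA applied on top of a 4-bit quantized base model. It combines bitsandbytes NF4 quantization with LoRA adapters: the base weights stay frozen in 4-bit while only the adapters train in higher precision, which makes fine-tuning 65B–70B models feasible on a single high-memory GPU.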

```python
from transformers import BitsAndBytesConfig, AutoModelForCausalLM
from peft import get_peft_model, prepare_model_for_kbit_training, LoraConfig, TaskType
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4; usually better than fp4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,     # nested quantization saves extra memory
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training (e.g. enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
```

Best for: Fine-tuning very large models (30B–70B+) on a single GPU: roughly a 24 GB consumer card for ~30B models, and a 48 GB or larger card for 65B–70B.
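
The memory effect of 4-bit storage and double quantization can be estimated with simple arithmetic. This sketch assumes the settings described in the QLoRA paper (one quantization constant per block of 64 weights, with the constants themselves quantized to 8 bits in groups of 256); the function names are illustrative, and activations and adapter states are not counted.

```python
def quant_constant_bits(block_size=64, second_block_size=256, double_quant=True):
    """Per-parameter overhead, in bits, of storing quantization constants."""
    if not double_quant:
        return 32 / block_size  # one fp32 constant per block of weights
    # 8-bit constants per block, plus one fp32 constant per group of constants
    return 8 / block_size + 32 / (block_size * second_block_size)

def quantized_base_gb(num_params, double_quant=True):
    """Approximate size in GB (1e9 bytes) of a 4-bit quantized base model."""
    bits_per_param = 4 + quant_constant_bits(double_quant=double_quant)
    return num_params * bits_per_param / 8 / 1e9

for name, n in (("7B", 7e9), ("70B", 70e9)):
    plain = quantized_base_gb(n, double_quant=False)
    nested = quantized_base_gb(n, double_quant=True)
    print(f"{name}: {plain:.1f} GB without double quantization, {nested:.1f} GB with")
```

Under these assumptions, double quantization cuts the constant overhead from 0.5 to about 0.127 bits per parameter, which adds up to a few GB at the 70B scale.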


3. Prefix Tuning

Prepends trainable continuous vectors (a "prefix") to the key and value matrices of every attention layer. The base model weights are fully frozen.

```python
from peft import PrefixTuningConfig, get_peft_model, TaskType

# `model` is a freshly loaded base model, not one already wrapped by another PEFT config
prefix_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,       # number of prefix tokens prepended
    prefix_projection=True,      # use MLP reparameterization during training
)

model = get_peft_model(model, prefix_config)
```

Best for: Sequence-to-sequence tasks (summarization, translation). It trains more parameters than prompt tuning, since there is a prefix for every layer rather than only at the input, but is more expressive.
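
A rough parameter count shows how prefix tuning compares with prompt tuning. The sketch assumes a hypothetical model with 28 layers and hidden size 3072, key/value vectors as wide as the hidden size, and no reparameterization MLP; real counts depend on the architecture and the library's implementation.

```python
def prompt_tuning_params(num_virtual_tokens, hidden_size):
    """Soft prompt at the input only: one vector per virtual token."""
    return num_virtual_tokens * hidden_size

def prefix_tuning_params(num_virtual_tokens, hidden_size, num_layers):
    """One key prefix and one value prefix per layer."""
    return num_layers * 2 * num_virtual_tokens * hidden_size

hidden_size, num_layers, num_virtual_tokens = 3072, 28, 20  # assumed model shape

prompt = prompt_tuning_params(num_virtual_tokens, hidden_size)
prefix = prefix_tuning_params(num_virtual_tokens, hidden_size, num_layers)
print(f"prompt tuning: {prompt:,} params")
print(f"prefix tuning: {prefix:,} params ({prefix // prompt}x more)")
```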


4. Prompt Tuning

Learns a set of soft prompt embeddings prepended to the input. Unlike hard prompts (text), these are continuous vectors optimized via backpropagation. Only the prompt embeddings are trained.

```python
from peft import PromptTuningConfig, PromptTuningInit, get_peft_model, TaskType

prompt_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=8,                         # learnable token count
    prompt_tuning_init=PromptTuningInit.TEXT,     # initialize from real text
    prompt_tuning_init_text="Classify the sentiment of this review:",
    # Must be the base model's own tokenizer; this assumes `model` is the
    # Llama-3.2-3B base model from the LoRA example.
    tokenizer_name_or_path="meta-llama/Llama-3.2-3B",
)

model = get_peft_model(model, prompt_config)
```

Best for: Very large models (>10B), where prompt tuning works well. It trains very few parameters (roughly 8–100 virtual token embeddings).
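
Mechanically, prompt tuning only changes what enters the first layer: trainable vectors are concatenated in front of the frozen token embeddings. A toy sketch with made-up sizes (all names and values here are illustrative):

```python
import random

random.seed(0)
hidden_size = 4         # toy embedding width
num_virtual_tokens = 3  # length of the soft prompt

# Frozen token embedding table for a toy 5-token vocabulary.
embedding_table = [[random.gauss(0, 1) for _ in range(hidden_size)] for _ in range(5)]

# Trainable soft prompt: one free vector per virtual token.
soft_prompt = [[random.gauss(0, 0.02) for _ in range(hidden_size)]
               for _ in range(num_virtual_tokens)]

def build_model_inputs(token_ids):
    """Prepend the soft prompt vectors to the embedded input tokens."""
    token_embeddings = [embedding_table[t] for t in token_ids]
    return soft_prompt + token_embeddings

inputs = build_model_inputs([2, 0, 4])
trainable_params = num_virtual_tokens * hidden_size
print(len(inputs), "input vectors,", trainable_params, "trainable values")
```

Only `soft_prompt` would receive gradients; the embedding table and every transformer layer stay frozen.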


5. P-Tuning / P-Tuning v2

Uses a small LSTM or MLP to generate virtual prompt tokens instead of directly optimizing embeddings. P-Tuning v2 applies prompts at every transformer layer (similar to prefix tuning).

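```python
from peft import PromptEncoderConfig, get_peft_model, TaskType

# PromptEncoderConfig implements P-Tuning (the prompt-encoder variant).
# With SEQ_CLS, `model` should be a sequence-classification model.
p_tuning_config = PromptEncoderConfig(
    task_type=TaskType.SEQ_CLS,
    num_virtual_tokens=20,
    encoder_hidden_size=128,     # hidden size of the prompt encoder
)

model = get_peft_model(model, p_tuning_config)
```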

Best for: NLU tasks (classification, NER) where prefix tuning underperforms.


6. Adapter Layers

Inserts small bottleneck feed-forward modules between the layers of a frozen transformer. Each adapter has a down-projection, nonlinearity, and up-projection with a residual connection.

```text
Input → LayerNorm → Down(d → r) → ReLU → Up(r → d) → + residual → Output
```

```python
# Using the Adapters library (successor to adapter-transformers): pip install adapters
from transformers import AutoModelForSequenceClassification
import adapters

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
adapters.init(model)                     # add adapter support to the Hugging Face model

model.add_adapter("sentiment_task")      # add adapter modules using the library's default config
model.train_adapter("sentiment_task")    # freeze everything except the adapter
```

Best for: Multi-task scenarios where you swap different adapter modules per task without reloading the base model.
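
The bottleneck structure can be sketched in a few lines of plain Python. This version omits the LayerNorm and biases for brevity and assumes the up-projection starts at zero, so the adapter initially passes its input through unchanged; the helper names and sizes are illustrative.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def adapter_forward(h, W_down, W_up):
    """Down-project, apply ReLU, up-project, then add the residual."""
    z = [max(0.0, a) for a in matvec(W_down, h)]  # d -> r with nonlinearity
    update = matvec(W_up, z)                      # r -> d
    return [hi + ui for hi, ui in zip(h, update)]

d, r = 8, 2  # hidden size and bottleneck size (toy values)
random.seed(0)
W_down = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(r)]  # r x d
W_up = [[0.0] * r for _ in range(d)]                                   # d x r, zero init

h = [float(i) for i in range(d)]
out = adapter_forward(h, W_down, W_up)

print("identity at init:", out == h)
print("adapter params (no biases):", 2 * d * r)
```

Because each adapter is a separate small module, several task-specific adapters can sit alongside one frozen base model and be swapped at request time.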


7. IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations)

Introduces learned scaling vectors that rescale keys, values, and feed-forward activations. Extremely parameter-efficient, with even fewer parameters than LoRA.

```python
from peft import IA3Config, get_peft_model, TaskType

ia3_config = IA3Config(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["k_proj", "v_proj", "down_proj"],
    feedforward_modules=["down_proj"],
)

model = get_peft_model(model, ia3_config)
model.print_trainable_parameters()
# trainable params: ~0.01% of total
```

Best for: Few-shot scenarios where you want minimal parameters with strong generalization.
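
The rescaling itself is an elementwise multiply by a learned vector, initialized to ones so the model starts out unchanged. The sketch below also estimates the parameter count for a hypothetical model shape (28 layers, key/value width 1024, feed-forward width 8192); those dimensions and the function names are assumptions for illustration.

```python
def ia3_rescale(activations, scale):
    """Elementwise rescaling of an activation vector by a learned vector."""
    return [a * s for a, s in zip(activations, scale)]

def ia3_param_count(num_layers, key_dim, value_dim, ffn_dim):
    """One scaling vector each for keys, values, and the feed-forward activation."""
    return num_layers * (key_dim + value_dim + ffn_dim)

activations = [0.5, -1.0, 2.0, 4.0]

ones = [1.0] * len(activations)  # initialization: leaves activations unchanged
print(ia3_rescale(activations, ones) == activations)

learned = [2.0, 1.0, 0.5, 1.0]   # after training: amplify some, inhibit others
print(ia3_rescale(activations, learned))

total = ia3_param_count(num_layers=28, key_dim=1024, value_dim=1024, ffn_dim=8192)
print(f"{total:,} trainable parameters")
```

For a model with about 3 billion parameters, a count in the hundreds of thousands lands near the ~0.01% quoted above.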


8. BitFit

The simplest PEFT method: it fine-tunes only the bias terms of the model, leaving all weight matrices frozen.

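```python
# Manual BitFit: freeze everything except biases.
# Assumes `model` has bias terms (e.g. BERT). LLaMA-style models define their
# linear layers without biases, so nothing would remain trainable there.
for name, param in model.named_parameters():
    if "bias" not in name:
        param.requires_grad = False

# Bias params are typically below 0.1% of total model parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable / total * 100:.2f}%")  # ~0.08% for BERT-sized models
```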

Best for: Simple classification tasks with small datasets where even LoRA is overkill.
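
To see why the bias share is so small, the same name-based filter can be run over parameter counts for a simplified BERT-base-like encoder. The parameter names and the layer layout below are hypothetical simplifications (no LayerNorm, position embeddings, or task head), so the result is only a rough estimate.

```python
def bert_like_param_counts(num_layers=12, hidden=768, ffn=3072, vocab=30522):
    """Map hypothetical parameter names to element counts."""
    counts = {"embeddings.word_embeddings.weight": vocab * hidden}
    for i in range(num_layers):
        prefix = f"encoder.layer.{i}"
        for proj in ("query", "key", "value", "output"):
            counts[f"{prefix}.attention.{proj}.weight"] = hidden * hidden
            counts[f"{prefix}.attention.{proj}.bias"] = hidden
        counts[f"{prefix}.ffn.up.weight"] = hidden * ffn
        counts[f"{prefix}.ffn.up.bias"] = ffn
        counts[f"{prefix}.ffn.down.weight"] = ffn * hidden
        counts[f"{prefix}.ffn.down.bias"] = hidden
    return counts

counts = bert_like_param_counts()
trainable = sum(n for name, n in counts.items() if "bias" in name)
total = sum(counts.values())
print(f"bias params: {trainable:,} of {total:,} ({trainable / total * 100:.3f}%)")
```

A weight matrix grows with the product of its dimensions while its bias grows with only one of them, which is why the fraction stays below 0.1%.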


PEFT Methods Comparison

| Method | Trainable % | Added Inference Cost | Best Use Case |
|---|---|---|---|
| LoRA | 0.1–1% | None (mergeable) | General fine-tuning |
| QLoRA | 0.1–1% | Quantization overhead | Large models on small GPUs |
| Prefix Tuning | <1% | Extra tokens per layer | Seq2seq tasks |
| Prompt Tuning | <0.01% | Extra prepended tokens | Very large models |
| P-Tuning v2 | <1% | Extra tokens per layer | NLU classification |
| Adapters | 1–4% | Extra adapter layers in the forward pass | Multi-task serving |
| IA³ | <0.01% | Scaling vectors | Few-shot adaptation |
| BitFit | ~0.08% | None | Simple classification |

Choosing the Right PEFT Method

```text
Large model (>30B) on limited GPU?
    → QLoRA

Need zero inference overhead (merge weights)?
    → LoRA

Serving many tasks on one base model?
    → Adapter Layers (swap per request)

Minimal params, few-shot setting?
    → IA³ or Prompt Tuning

Simple classification, small dataset?
    → BitFit

Seq2seq (translation, summarization)?
    → Prefix Tuning
```
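
The same guide can be written as a small helper that checks the questions in order. The function name, its flags, and the LoRA fallback are hypothetical; the fallback is an assumption based on LoRA being the general-purpose choice.

```python
def choose_peft_method(model_size_b, limited_gpu=False, need_merged_weights=False,
                       many_tasks=False, minimal_params=False,
                       simple_classification=False, seq2seq=False):
    """Return a suggested PEFT method, checking the guide's questions top to bottom."""
    if model_size_b > 30 and limited_gpu:
        return "QLoRA"
    if need_merged_weights:
        return "LoRA"
    if many_tasks:
        return "Adapter Layers"
    if minimal_params:
        return "IA3 or Prompt Tuning"
    if simple_classification:
        return "BitFit"
    if seq2seq:
        return "Prefix Tuning"
    return "LoRA"  # assumed default when no question applies

print(choose_peft_method(70, limited_gpu=True))
print(choose_peft_method(7, many_tasks=True))
print(choose_peft_method(0.1, simple_classification=True))
```

Real projects often match several questions at once, so the order of the checks is itself a design choice rather than a fixed rule.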

Interview tip: LoRA is the de facto standard PEFT method in production. QLoRA extends it to large models by adding 4-bit quantization. Know the rank `r` hyperparameter: lower `r` = fewer params but less capacity; higher `r` = more expressive but approaches full fine-tuning.
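
The rank trade-off is easy to quantify for a single weight matrix. This sketch assumes a square 4096 x 4096 layer, a typical but hypothetical size, and compares the LoRA parameter count r * (d + k) with the d * k parameters of a dense update.

```python
def lora_params(d, k, r):
    """Parameters in the two LoRA factors B (d x r) and A (r x k)."""
    return r * (d + k)

d = k = 4096  # assumed layer shape
full = d * k  # parameters in a dense update of the same matrix

for r in (4, 16, 64, 256):
    n = lora_params(d, k, r)
    print(f"r={r:>3}: {n:>9,} params ({n / full:.2%} of the full matrix)")

# Rank at which LoRA would use as many parameters as the dense update.
break_even_rank = full / (d + k)
print("break-even rank:", break_even_rank)
```

The count grows linearly in `r`, so doubling the rank doubles the adapter size; typical ranks sit far below the break-even point.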

Learn more at HuggingFace PEFT Library and LoRA Paper.