What is PEFT? What are all the different types of PEFT?
Answer
What is PEFT?
PEFT (Parameter-Efficient Fine-Tuning) is a collection of techniques for adapting large pre-trained models to specific tasks by updating only a small fraction of model parameters instead of all weights.
Why PEFT Matters
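Fully fine-tuning a 7B-parameter model with Adam in mixed precision needs roughly 112 GB of GPU memory for weights, gradients, and optimizer state alone (about 16 bytes per parameter), before counting activations. PEFT methods can reduce this to under 16 GB while achieving comparable performance.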
| Approach | Trainable Params | GPU Memory | Performance |
|---|---|---|---|
| Full fine-tuning | 100% | Very high | Baseline |
| PEFT (LoRA) | 0.1–1% | Low | Near-baseline |
| PEFT (QLoRA) | 0.1–1% | Very low | Near-baseline |
| Prompt Tuning | <0.01% | Minimal | Good |
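The memory figures above can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes mixed-precision training with Adam at 16 bytes per trainable parameter (fp16 weights and gradients, fp32 master weights, two fp32 Adam moments) and ignores activations and framework overhead. The helper names and the 0.5% trainable fraction are illustrative assumptions, not measurements.

```python
# Rough training-memory estimate; activations and framework overhead are ignored.
BYTES_FP16 = 2
BYTES_FP32 = 4


def full_finetune_bytes(n_params: int) -> int:
    """fp16 weights + fp16 grads + fp32 master weights + two fp32 Adam moments."""
    per_param = BYTES_FP16 + BYTES_FP16 + BYTES_FP32 + 2 * BYTES_FP32  # 16 bytes
    return n_params * per_param


def lora_finetune_bytes(n_params: int, trainable_fraction: float) -> int:
    """Frozen fp16 base weights + full training state for the adapter params only."""
    trainable = round(n_params * trainable_fraction)
    return n_params * BYTES_FP16 + full_finetune_bytes(trainable)


n = 7_000_000_000
print(f"Full fine-tuning:      {full_finetune_bytes(n) / 1e9:.0f} GB")
print(f"LoRA (0.5% trainable): {lora_finetune_bytes(n, 0.005) / 1e9:.1f} GB")
```

Under these assumptions the frozen base weights dominate the LoRA budget, which is why quantizing them (as QLoRA does) gives the next large saving.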
Types of PEFT
1. LoRA (Low-Rank Adaptation)
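The most widely used PEFT method. Instead of updating the weight matrix $W$ directly, LoRA freezes $W$ and adds two small trainable matrices $A$ and $B$ whose rank $r \ll d$:

$$W' = W + BA$$

where $W \in \mathbb{R}^{d \times k}$, $B \in \mathbb{R}^{d \times r}$, and $A \in \mathbb{R}^{r \times k}$.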
```python
from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                  # rank: controls adapter capacity
    lora_alpha=32,                         # scaling factor: the update is scaled by lora_alpha / r
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # which layers to adapt
)

model = get_peft_model(model, lora_config)
# Prints trainable vs. total parameter counts; expect well under 1% trainable
model.print_trainable_parameters()
```
Best for: Most fine-tuning tasks; merges back into base model at inference (zero overhead).
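To see why merging adds no inference cost, here is a minimal pure-Python sketch with toy dimensions. It assumes the usual LoRA formulation, in which the low-rank update is scaled by `alpha / r`. The matrix helpers and the random values are made up for illustration.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def scale(X, s):
    return [[s * a for a in row] for row in X]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_out, d_in, r, alpha = 6, 8, 2, 4

W = rand_matrix(d_out, d_in, rng)   # frozen pretrained weight
B = rand_matrix(d_out, r, rng)      # trainable (typically zero-initialized before training)
A = rand_matrix(r, d_in, rng)       # trainable
x = rand_matrix(d_in, 1, rng)       # one input column vector

# Adapter path used during training: y = W x + (alpha / r) * B (A x)
y_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / r))

# Merged path for inference: fold the update into a single matrix
W_merged = add(W, scale(matmul(B, A), alpha / r))
y_merged = matmul(W_merged, x)

max_diff = max(abs(a[0] - b[0]) for a, b in zip(y_adapter, y_merged))
lora_params = r * (d_in + d_out)    # parameters in A and B
full_params = d_in * d_out          # parameters in W
print(max_diff < 1e-9, lora_params, full_params)
```

The two paths agree up to floating-point rounding, so a merged model runs exactly one matrix multiply per layer, the same as the base model.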
2. QLoRA (Quantized LoRA)
LoRA applied on top of a 4-bit quantized base model. Combines bitsandbytes NF4 quantization with LoRA adapters, which makes fine-tuning 70B models feasible on a single GPU.
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4: better than fp4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,         # nested quantization saves extra memory
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto",
)
# Common preparation step before training a k-bit quantized model
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
```
Best for: Fine-tuning very large models (30B–70B+) on limited hardware, such as a single high-memory GPU or a few consumer GPUs.
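A rough sketch of what 4-bit storage costs per weight, and what double quantization saves. The block sizes (64 weights per first-level block, 256 constants per second-level block) are assumptions taken from the settings reported in the QLoRA paper. The function names are hypothetical, and the estimate covers frozen base weights only, not adapters, optimizer state, or activations.

```python
def bits_per_param(block_size=64, second_block=256, double_quant=True):
    """Storage cost per weight for 4-bit blockwise quantization.

    Each block of `block_size` weights shares one scale constant. Without
    double quantization that constant is fp32. With it, the constant is
    stored in 8 bits and one extra fp32 constant is shared by `second_block`
    first-level constants.
    """
    weight_bits = 4.0
    if double_quant:
        constant_bits = 8 / block_size + 32 / (block_size * second_block)
    else:
        constant_bits = 32 / block_size
    return weight_bits + constant_bits


def weight_memory_gb(n_params, bits):
    """Convert a per-parameter bit cost into gigabytes of weight storage."""
    return n_params * bits / 8 / 1e9


n = 70_000_000_000
for dq in (False, True):
    bits = bits_per_param(double_quant=dq)
    print(f"double_quant={dq}: {bits:.3f} bits/param, "
          f"{weight_memory_gb(n, bits):.1f} GB for 70B weights")
```

Under these assumptions a 70B model's weights drop from about 140 GB in fp16 to under 40 GB, which is what brings it within reach of a single large GPU.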
3. Prefix Tuning
Prepends trainable continuous vectors (a "prefix") to the key and value matrices of every attention layer. The base model weights are fully frozen.
```python
from peft import PrefixTuningConfig, get_peft_model, TaskType

prefix_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,      # number of prefix tokens prepended
    prefix_projection=True,     # use MLP reparameterization during training
)
model = get_peft_model(model, prefix_config)
```
Best for: Sequence-to-sequence tasks (summarization, translation). Slightly more parameters than prompt tuning but more expressive.
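The mechanism can be illustrated with a toy single-head attention in pure Python. This is only a sketch: the vectors are made up, and a real implementation holds a separate prefix for every layer and head. The point is that prefix keys and values join the attention computation while the token keys and values, produced by the frozen model, are untouched.

```python
import math


def attention(query, keys, values):
    """Single-query scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)                              # subtract max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Keys/values computed from the real input tokens by the frozen model.
token_keys = [[1.0, 0.0], [0.0, 1.0]]
token_values = [[1.0, 2.0], [3.0, 4.0]]

# Trainable prefix vectors for this layer (values here are made up).
prefix_keys = [[0.5, 0.5]]
prefix_values = [[10.0, -10.0]]

query = [1.0, 1.0]
plain = attention(query, token_keys, token_values)
with_prefix = attention(query, prefix_keys + token_keys, prefix_values + token_values)
print(plain, with_prefix)
```

Training adjusts only `prefix_keys` and `prefix_values`, steering what each query attends to without modifying any model weight.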
4. Prompt Tuning
Learns a set of soft prompt embeddings prepended to the input. Unlike hard prompts (text), these are continuous vectors optimized via backpropagation. Only the prompt embeddings are trained.
```python
from peft import PromptTuningConfig, PromptTuningInit, get_peft_model, TaskType

prompt_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=8,                        # learnable token count
    prompt_tuning_init=PromptTuningInit.TEXT,    # initialize from real text
    prompt_tuning_init_text="Classify the sentiment of this review:",
    tokenizer_name_or_path="gpt2",               # should match the base model's tokenizer
)
model = get_peft_model(model, prompt_config)
```
Best for: Very large models (>10B), where soft prompts work well. Trains very few parameters (roughly 8–100 prompt embeddings).
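As a minimal sketch of the idea, soft prompts are just extra embedding vectors placed in front of the real token embeddings, so the trainable parameter count is the number of virtual tokens times the hidden size. The function names and the toy sizes below are illustrative assumptions.

```python
def prepend_soft_prompt(soft_prompt, token_embeddings):
    """Return the embedding sequence the frozen model actually sees."""
    return soft_prompt + token_embeddings


def prompt_tuning_params(num_virtual_tokens, hidden_size):
    """One trainable vector of length hidden_size per virtual token."""
    return num_virtual_tokens * hidden_size


hidden_size = 4
num_virtual_tokens = 3

# Trainable: one vector per virtual token (zeros here; real initializations vary).
soft_prompt = [[0.0] * hidden_size for _ in range(num_virtual_tokens)]
# Frozen: embeddings looked up for the 2 real input tokens.
token_embeddings = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]

sequence = prepend_soft_prompt(soft_prompt, token_embeddings)
print(len(sequence), prompt_tuning_params(num_virtual_tokens, hidden_size))
print(prompt_tuning_params(20, 4096))  # e.g. 20 virtual tokens at hidden size 4096
```

The cost side is that the virtual tokens consume part of the context window on every request.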
5. P-Tuning / P-Tuning v2
Uses a small LSTM or MLP to generate virtual prompt tokens instead of directly optimizing embeddings. P-Tuning v2 applies prompts at every transformer layer (similar to prefix tuning).
```python
from peft import PromptEncoderConfig, get_peft_model, TaskType

p_tuning_config = PromptEncoderConfig(
    task_type=TaskType.SEQ_CLS,
    num_virtual_tokens=20,
    encoder_hidden_size=128,    # MLP hidden size
)
model = get_peft_model(model, p_tuning_config)
```
Best for: NLU tasks (classification, NER) where prefix tuning underperforms.
6. Adapter Layers
Inserts small bottleneck feed-forward modules between the layers of a frozen transformer. Each adapter has a down-projection, nonlinearity, and up-projection with a residual connection.
```text
Input → LayerNorm → Down(d → r) → ReLU → Up(r → d) → + residual → Output
```
```python
# Uses the `adapters` library (the successor to adapter-transformers)
from transformers import AutoModelForSequenceClassification
import adapters

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
adapters.init(model)                     # add adapter support to the Hugging Face model
model.add_adapter("sentiment_task")      # insert adapter modules for this task
model.train_adapter("sentiment_task")    # freeze everything except the adapter
```
Best for: Multi-task scenarios where you swap different adapter modules per task without reloading the base model.
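The bottleneck computation itself is small enough to sketch in pure Python. This toy version omits LayerNorm and biases, and the weights are made up. It also shows a common design choice, assumed here rather than taken from the text above: with a zero-initialized up-projection, the adapter starts as an identity function, so training begins from the unmodified base model.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def relu(v):
    return [max(0.0, x) for x in v]


def adapter_forward(h, W_down, W_up):
    """Bottleneck adapter: h + Up(ReLU(Down(h))). LayerNorm and biases omitted."""
    z = relu(matvec(W_down, h))      # project d -> r
    delta = matvec(W_up, z)          # project r -> d
    return [a + b for a, b in zip(h, delta)]   # residual connection


d, r = 4, 2
h = [1.0, -2.0, 0.5, 3.0]
W_down = [[0.1, 0.0, 0.2, 0.0],
          [0.0, 0.3, 0.0, 0.1]]                  # shape (r, d)
W_up_zero = [[0.0] * r for _ in range(d)]        # shape (d, r), zero-initialized
W_up = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]

identity_out = adapter_forward(h, W_down, W_up_zero)
adapted_out = adapter_forward(h, W_down, W_up)
adapter_params = 2 * d * r                       # weights only
print(identity_out, adapted_out, adapter_params)
```

Because the adapter sits in the forward path as an extra module, it cannot be folded into the base weights the way a LoRA update can, which is the source of its added inference cost.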
7. IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations)
Introduces learned scaling vectors that rescale keys, values, and feed-forward activations. Extremely parameter-efficient, with even fewer parameters than LoRA.
```python
from peft import IA3Config, get_peft_model, TaskType

ia3_config = IA3Config(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["k_proj", "v_proj", "down_proj"],
    feedforward_modules=["down_proj"],
)
model = get_peft_model(model, ia3_config)
model.print_trainable_parameters()
# trainable params: ~0.01% of total
```
Best for: Few-shot scenarios where you want minimal parameters with strong generalization.
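A minimal sketch of the rescaling, assuming one learned vector each for keys, values, and the feed-forward hidden activation, all initialized to ones so that the model starts out unchanged. The function names and numbers are illustrative.

```python
def ia3_rescale(activations, scale):
    """Elementwise rescaling of one activation vector by a learned vector."""
    return [a * s for a, s in zip(activations, scale)]


def ia3_params_per_layer(d_k, d_v, d_ff):
    """One scaling vector each for keys, values, and the feed-forward hidden activation."""
    return d_k + d_v + d_ff


keys = [0.5, -1.0, 2.0]
init_scale = [1.0, 1.0, 1.0]       # assumed initialization: ones, so nothing changes at first
learned_scale = [2.0, 0.1, 0.5]    # made-up values: amplify, inhibit, dampen

unchanged = ia3_rescale(keys, init_scale)
rescaled = ia3_rescale(keys, learned_scale)
print(unchanged, rescaled)
print(ia3_params_per_layer(d_k=4096, d_v=4096, d_ff=11008))  # illustrative layer sizes
```

Each layer adds only a few vectors rather than matrices, which is where the very small trainable-parameter share comes from.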
8. BitFit
The simplest PEFT method: it fine-tunes only the bias terms of the model, leaving all weight matrices frozen.
```python
# Manual BitFit: freeze everything except biases
for name, param in model.named_parameters():
    if "bias" not in name:
        param.requires_grad = False

# Bias params are ~0.1% of total model parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable / total * 100:.2f}%")  # ~0.08%
```
Best for: Simple classification tasks with small datasets where even LoRA is overkill.
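The small trainable share follows from simple counting: a linear layer with `d_in` inputs and `d_out` outputs has `d_in * d_out` weights but only `d_out` biases, so biases make up `1 / (d_in + 1)` of its parameters. The helper below is an illustrative sketch for a single layer, not a whole-model count.

```python
def linear_bias_fraction(d_in, d_out):
    """Share of a linear layer's parameters that are biases."""
    weights = d_in * d_out
    biases = d_out
    return biases / (weights + biases)


for d in (768, 4096):
    print(d, f"{linear_bias_fraction(d, d) * 100:.3f}%")
```

This assumes the model has bias terms at all. Some architectures omit them from most linear layers, which leaves BitFit little or nothing to train.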
PEFT Methods Comparison
| Method | Trainable % | Added Inference Cost | Best Use Case |
|---|---|---|---|
| LoRA | 0.1–1% | None (mergeable) | General fine-tuning |
| QLoRA | 0.1–1% | Quantization overhead | Large models on small GPUs |
| Prefix Tuning | <1% | Extra tokens per layer | Seq2seq tasks |
| Prompt Tuning | <0.01% | Prepend tokens | Very large models |
| P-Tuning v2 | <1% | Extra tokens per layer | NLU classification |
| Adapters | 1–4% | Extra layers in each block | Multi-task serving |
| IA³ | <0.01% | Scaling vectors | Few-shot adaptation |
| BitFit | ~0.08% | None | Simple classification |
Choosing the Right PEFT Method
```text
Large model (>30B) on limited GPU?            → QLoRA
Need zero inference overhead (merge weights)? → LoRA
Serving many tasks on one base model?         → Adapter Layers (swap per request)
Minimal params, few-shot setting?             → IA³ or Prompt Tuning
Simple classification, small dataset?         → BitFit
Seq2seq (translation, summarization)?         → Prefix Tuning
```
Interview tip: LoRA is the de facto standard PEFT method in production. QLoRA extends it to large models by adding 4-bit quantization. Know the rank hyperparameter `r`: lower `r` = fewer params but less capacity; higher `r` = more expressive but approaches full fine-tuning.
Learn more at HuggingFace PEFT Library and LoRA Paper.