What is PEFT? What are all the different types of PEFT?

Question

Accepted Answer

## What is PEFT?

**PEFT (Parameter-Efficient Fine-Tuning)** is a collection of techniques for adapting large pre-trained models to specific tasks by updating only a **small fraction of model parameters** instead of all weights.

### Why PEFT Matters

Fully fine-tuning a 7B-parameter model requires roughly 112 GB of GPU memory with mixed-precision Adam (about 16 bytes per parameter for weights, gradients, and optimizer states, before activations). PEFT methods, especially when combined with quantization, can reduce this to under 16 GB while achieving comparable performance.

| Approach | Trainable Params | GPU Memory | Performance |
|----------|-----------------|------------|-------------|
| **Full fine-tuning** | 100% | Very high | Baseline |
| **PEFT (LoRA)** | 0.1–1% | Low | Near-baseline |
| **PEFT (QLoRA)** | 0.1–1% | Very low | Near-baseline |
| **Prompt Tuning** | <0.01% | Minimal | Good |

---

## Types of PEFT

### 1. LoRA (Low-Rank Adaptation)

The most widely used PEFT method. Instead of updating the weight matrix **W** directly, LoRA freezes **W** and learns a low-rank update formed by two small trainable matrices **B** and **A** of rank `r`, where `r << d`.

$$W' = W + \Delta W = W + BA$$

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$.

```python
from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # rank: controls adapter capacity
    lora_alpha=32,                        # scaling factor; the update is scaled by lora_alpha / r
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts; for this config the
# trainable share is roughly 0.15% (exact numbers depend on the model).
```

**Best for:** Most fine-tuning tasks. The adapter can be merged back into the base model for inference, so it adds zero overhead.

---

### 2. QLoRA (Quantized LoRA)

LoRA applied on top of a **4-bit quantized** base model. It combines bitsandbytes NF4 quantization with LoRA adapters, which makes fine-tuning models in the 65B–70B range feasible on a single high-memory GPU (the QLoRA paper reports 65B on one 48 GB GPU).
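To make the saving from 4-bit quantization concrete, here is a rough back-of-the-envelope sketch of the memory needed for the base weights alone. It is an estimate under stated assumptions, not a measurement: the helper `weight_memory_gb` is hypothetical, and the figures ignore quantization constants, activations, the LoRA adapter itself, and optimizer state.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate storage for the model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Compare 16-bit storage with 4-bit storage for two model sizes.
for name, num_params in (("7B", 7e9), ("70B", 70e9)):
    fp16_gb = weight_memory_gb(num_params, 16)
    four_bit_gb = weight_memory_gb(num_params, 4)
    print(f"{name}: 16-bit ~ {fp16_gb:.1f} GB, 4-bit ~ {four_bit_gb:.1f} GB")
```

The 4x reduction in weight storage is what brings a 70B base model from roughly 140 GB down to roughly 35 GB, leaving room for the small trainable adapter on a single large GPU.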
```python
import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM
from peft import get_peft_model, LoraConfig, TaskType, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4; generally preferred over fp4
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls during training
    bnb_4bit_use_double_quant=True,         # nested quantization saves extra memory
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training (e.g., casts norm layers, enables input grads)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
```

**Best for:** Fine-tuning large models on limited hardware: roughly 7B–30B on a consumer GPU, and up to around 70B on a single 48–80 GB GPU.

---

### 3. Prefix Tuning

Prepends **trainable continuous vectors** (a "prefix") to the key and value matrices of every attention layer. The base model weights are fully frozen.

```python
from peft import PrefixTuningConfig, get_peft_model, TaskType

# Assumes `model` is an already loaded base model
prefix_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,   # number of prefix tokens prepended
    prefix_projection=True,  # use MLP reparameterization during training
)

model = get_peft_model(model, prefix_config)
```

**Best for:** Sequence-to-sequence tasks (summarization, translation). It uses slightly more parameters than prompt tuning but is more expressive.

---

### 4. Prompt Tuning

Learns a set of **soft prompt embeddings** prepended to the input. Unlike hard prompts (text), these are continuous vectors optimized via backpropagation. Only the prompt embeddings are trained.
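The mechanism can be illustrated without any library. The sketch below is a toy model of the idea, not how PEFT implements it: `HIDDEN_SIZE`, `prepend_soft_prompt`, and the zero-valued stand-in embeddings are all hypothetical, and plain Python lists take the place of tensors.

```python
import random

HIDDEN_SIZE = 16        # hypothetical embedding width (real models use far more)
NUM_VIRTUAL_TOKENS = 8  # number of learnable soft-prompt vectors

random.seed(0)

# The only trainable parameters: one vector per virtual token.
soft_prompt = [
    [random.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)]
    for _ in range(NUM_VIRTUAL_TOKENS)
]


def prepend_soft_prompt(prompt, token_embeddings):
    """Return the sequence the frozen model sees: [soft prompt; input tokens]."""
    return prompt + token_embeddings


# Stand-in for the frozen embeddings of a 5-token input.
token_embeddings = [[0.0] * HIDDEN_SIZE for _ in range(5)]

model_input = prepend_soft_prompt(soft_prompt, token_embeddings)
trainable_params = NUM_VIRTUAL_TOKENS * HIDDEN_SIZE

print(len(model_input), trainable_params)
```

The trainable parameter count is just `num_virtual_tokens * hidden_size`, independent of the size of the frozen model, which is why prompt tuning sits at the very low end of the trainable-parameter range.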
```python
from peft import PromptTuningConfig, PromptTuningInit, get_peft_model, TaskType

# Assumes `model` is an already loaded base model
prompt_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=8,                       # learnable token count
    prompt_tuning_init=PromptTuningInit.TEXT,   # initialize from real text
    prompt_tuning_init_text="Classify the sentiment of this review:",
    tokenizer_name_or_path="gpt2",              # should match the base model's tokenizer
)

model = get_peft_model(model, prompt_config)
```

**Best for:** Very large models (>10B), where it comes close to full fine-tuning quality. It trains very few parameters (only the ~8–100 prompt embeddings).

---

### 5. P-Tuning / P-Tuning v2

Uses a small **LSTM or MLP** to generate virtual prompt tokens instead of directly optimizing embeddings. P-Tuning v2 applies prompts at every transformer layer (similar to prefix tuning).

```python
from peft import PromptEncoderConfig, get_peft_model, TaskType

# Assumes `model` is a loaded sequence-classification model
p_tuning_config = PromptEncoderConfig(
    task_type=TaskType.SEQ_CLS,
    num_virtual_tokens=20,
    encoder_hidden_size=128,  # hidden size of the prompt encoder
)

model = get_peft_model(model, p_tuning_config)
```

**Best for:** NLU tasks (classification, NER) where prefix tuning underperforms.

---

### 6. Adapter Layers

Inserts small **bottleneck feed-forward modules** inside the layers of a frozen transformer. Each adapter has a down-projection, a nonlinearity, and an up-projection, wrapped in a residual connection.

```
Input → LayerNorm → Down(d → r) → ReLU → Up(r → d) → + residual → Output
```

```python
# Bottleneck adapters via the `adapters` library (successor to adapter-transformers)
from transformers import AutoModelForSequenceClassification
import adapters

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

adapters.init(model)                   # add adapter support to the Hugging Face model
model.add_adapter("sentiment_task")    # insert a new adapter with the default config
model.train_adapter("sentiment_task")  # freeze everything except the adapter
```

**Best for:** Multi-task scenarios where you swap different adapter modules per task without reloading the base model.

---

### 7. IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations)

Introduces **learned scaling vectors** that rescale keys, values, and feed-forward activations. It is extremely parameter-efficient, with even fewer trainable parameters than LoRA.

```python
from peft import IA3Config, get_peft_model, TaskType

# Assumes `model` is an already loaded base model
ia3_config = IA3Config(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["k_proj", "v_proj", "down_proj"],
    feedforward_modules=["down_proj"],  # must be a subset of target_modules
)

model = get_peft_model(model, ia3_config)
model.print_trainable_parameters()
# trainable params: ~0.01% of total
```

**Best for:** Few-shot scenarios where you want minimal parameters with strong generalization.

---

### 8. BitFit

The simplest PEFT method: it fine-tunes only the **bias terms** of the model and leaves all weight matrices frozen.

```python
# Manual BitFit: train only the bias terms, freeze everything else.
# Assumes `model` has bias parameters (e.g., BERT); Llama-style models
# define their linear layers without biases, leaving nothing to train.
for name, param in model.named_parameters():
    param.requires_grad = "bias" in name

# Bias params are typically ~0.1% or less of total model parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable / total * 100:.2f}%")  # e.g., ~0.08%
```

**Best for:** Simple classification tasks with small datasets where even LoRA is overkill.
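One practical caveat for BitFit is that it only works when the model actually has bias parameters. The sketch below estimates the bias share from a mapping of parameter names to sizes. The helper `bias_fraction` and both parameter dictionaries are hypothetical stand-ins for illustration, not real model inventories.

```python
def bias_fraction(param_sizes: dict) -> float:
    """Fraction of parameters that BitFit would train (names ending in 'bias')."""
    total = sum(param_sizes.values())
    bias = sum(n for name, n in param_sizes.items() if name.endswith("bias"))
    return bias / total if total else 0.0


# Hypothetical BERT-style block: every linear layer carries a bias vector.
bert_style = {
    "attention.query.weight": 768 * 768,
    "attention.query.bias": 768,
    "intermediate.dense.weight": 768 * 3072,
    "intermediate.dense.bias": 3072,
}

# Hypothetical Llama-style block: linear layers are defined without biases.
llama_style = {
    "self_attn.q_proj.weight": 4096 * 4096,
    "mlp.up_proj.weight": 4096 * 11008,
}

print(f"BERT-style:  {bias_fraction(bert_style) * 100:.2f}% trainable")
print(f"Llama-style: {bias_fraction(llama_style) * 100:.2f}% trainable")
```

Running a check like this on `model.named_parameters()` before training is a cheap way to confirm that BitFit has something to optimize for a given architecture.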
---

## PEFT Methods Comparison

| Method | Trainable % | Added Inference Cost | Best Use Case |
|--------|------------|---------------------|---------------|
| **LoRA** | 0.1–1% | None (mergeable) | General fine-tuning |
| **QLoRA** | 0.1–1% | Quantization overhead | Large models on small GPUs |
| **Prefix Tuning** | <1% | Extra tokens per layer | Seq2seq tasks |
| **Prompt Tuning** | <0.01% | Prepended tokens | Very large models |
| **P-Tuning v2** | <1% | Extra tokens per layer | NLU classification |
| **Adapters** | 1–4% | Extra layers in the forward pass | Multi-task serving |
| **IA³** | <0.01% | Negligible (scaling vectors) | Few-shot adaptation |
| **BitFit** | ~0.08% | None | Simple classification |

---

## Choosing the Right PEFT Method

```
Large model (>30B) on limited GPU?            → QLoRA
Need zero inference overhead (merge weights)? → LoRA
Serving many tasks on one base model?         → Adapter Layers (swap per request)
Minimal params, few-shot setting?             → IA³ or Prompt Tuning
Simple classification, small dataset?         → BitFit
Seq2seq (translation, summarization)?         → Prefix Tuning
```

> **Interview tip:** LoRA is the de facto standard PEFT method in production. QLoRA extends it to large models by adding 4-bit quantization. Know the rank `r` hyperparameter: lower `r` means fewer parameters but less capacity; higher `r` is more expressive, but its trainable parameter count and memory cost move toward those of full fine-tuning.

Learn more at [HuggingFace PEFT Library](https://huggingface.co/docs/peft) and [LoRA Paper](https://arxiv.org/abs/2106.09685).
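The rank trade-off mentioned in the interview tip can be made concrete with a little arithmetic. The sketch below is illustrative only: it counts LoRA parameters for a single hypothetical 4096 × 4096 projection at several ranks, then checks on a tiny integer example that merging `W + BA` gives the same output as keeping the adapter separate. The helpers `lora_param_count`, `matmul`, and `add` are written here for the example and are not part of any library.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)


def matmul(X, Y):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two equally shaped nested lists."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


# 1) Parameter count versus rank for one hypothetical 4096 x 4096 weight matrix.
d = k = 4096
full = d * k
for r in (4, 16, 64):
    n = lora_param_count(d, k, r)
    print(f"r={r:>2}: {n:,} trainable vs {full:,} full ({100 * n / full:.2f}%)")

# 2) Merge check: (W + BA) x equals W x + B (A x).
W = [[1, 2, 3], [4, 5, 6]]  # frozen weight, 2 x 3
B = [[1], [2]]              # 2 x 1 (rank r = 1)
A = [[1, 2, 0]]             # 1 x 3
x = [[1], [1], [1]]         # input column vector

y_merged = matmul(add(W, matmul(B, A)), x)
y_unmerged = add(matmul(W, x), matmul(B, matmul(A, x)))
print(y_merged == y_unmerged)
```

Because the merged and unmerged paths agree, a trained LoRA adapter can be folded into the base weights once and served with no extra layers, which is the reason for the "None (mergeable)" entry in the comparison table.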
