How to fine tune an LLM?

Question

Accepted Answer

## How to Fine-Tune an LLM

**Fine-tuning** adapts a pre-trained LLM to a specific task or domain by continuing training on a curated dataset. It lets you customize model behavior without training from scratch.

---

## Fine-Tuning Approaches

| Approach | Description | VRAM (rough; varies with model size) | Best For |
|----------|-------------|------|----------|
| **Full Fine-tuning** | Update all weights | 80GB+ | Max accuracy, large budget |
| **LoRA** | Train low-rank adapter matrices | 16–24GB | Most production use cases |
| **QLoRA** | LoRA on quantized (4-bit) model | 8–12GB | Consumer GPUs |
| **Prefix Tuning** | Prepend trainable prefix vectors at each layer | 8GB | Minimal parameter change |
| **Prompt Tuning** | Tune soft prompt embeddings only | 4GB | Lightest approach |
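
The VRAM column depends strongly on model size. As a rough sketch of where such numbers come from, weight memory can be estimated as parameter count times bytes per parameter. The helper name `weight_memory_gb` and the bytes-per-parameter constants are assumptions (common rules of thumb, not measurements), and activations, adapters, and framework overhead are ignored.

```python
# Rough weight-memory estimate. The constants are rules of thumb (assumptions);
# activations, LoRA adapters, and CUDA overhead are NOT included.
BYTES_PER_PARAM = {
    "full_ft_adam_mixed": 16.0,  # bf16 weights + grads, fp32 master copy + two Adam moments
    "bf16_frozen": 2.0,          # frozen 16-bit base weights (plain LoRA)
    "nf4_frozen": 0.5,           # frozen 4-bit base weights (QLoRA)
}

def weight_memory_gb(n_params: float, mode: str) -> float:
    """Approximate memory for the weights in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[mode] / 1e9

n_params = 3e9  # a 3B-parameter model, chosen only for illustration
for mode in BYTES_PER_PARAM:
    print(f"{mode:>20}: {weight_memory_gb(n_params, mode):6.1f} GB")
```

Real usage is higher than these figures because activations grow with batch size and sequence length.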

---

## Step-by-Step Fine-Tuning with QLoRA

### Step 1 — Install Dependencies

```bash
pip install transformers peft datasets trl bitsandbytes accelerate
```

### Step 2 — Prepare Dataset

```python
from datasets import Dataset

data = [
    {"instruction": "Summarize the text.", "input": "AI is transforming...", "output": "AI is changing industries."},
    {"instruction": "Translate to French.", "input": "Hello world",           "output": "Bonjour le monde"},
]

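def format_prompt(example):
    # Build one training string per example; "\n" escapes keep each literal on one line.
    return {
        "text": (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    }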

dataset = Dataset.from_list(data).map(format_prompt)
```
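
Training and inference prompts need to share exactly the same template, otherwise the model sees a format it was not tuned on. One way to guarantee that, sketched here with a hypothetical `build_prompt` helper (not part of any library), is to render both from a single template string:

```python
# Hypothetical helper: one template shared by training and inference.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"

def build_prompt(example: dict, include_output: bool = True) -> str:
    """Render an example; leave the answer off when building an inference prompt."""
    prompt = TEMPLATE.format(
        instruction=example["instruction"],
        input=example.get("input", ""),
    )
    if include_output:
        prompt += example["output"]
    return prompt

example = {
    "instruction": "Translate to French.",
    "input": "Hello world",
    "output": "Bonjour le monde",
}
train_text = build_prompt(example)                        # prompt + answer
infer_text = build_prompt(example, include_output=False)  # prompt only
print(train_text)
```

The inference prompt is then always a strict prefix of the training text.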

### Step 3 — Load Model in 4-bit (QLoRA)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_id = "meta-llama/Llama-3.2-3B"
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
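tokenizer = AutoTokenizer.from_pretrained(model_id)
# Some tokenizers (Llama among them) ship without a pad token; reuse EOS so batches can be padded.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token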
```
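
To give a feel for what 4-bit storage means, here is a toy sketch of blockwise absmax quantization to 16 uniform levels. This is a simplification and an assumption on my part: the NF4 type selected above uses a non-uniform codebook designed for normally distributed weights, and the function names are hypothetical.

```python
def quantize_block(values, n_levels=16):
    """Map floats in [-absmax, absmax] to integer codes 0..n_levels-1 (uniform grid)."""
    scale = max(abs(v) for v in values) or 1.0  # avoid dividing by zero for an all-zero block
    codes = [round((v / scale + 1) / 2 * (n_levels - 1)) for v in values]
    return codes, scale

def dequantize_block(codes, scale, n_levels=16):
    """Invert the mapping; the result is only an approximation of the original values."""
    return [(c / (n_levels - 1) * 2 - 1) * scale for c in codes]

weights = [0.12, -0.5, 0.03, 0.9, -0.77]  # arbitrary example values
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, round(max_error, 4))
```

Each weight is stored as a 4-bit code plus one shared scale per block, which is why the frozen base model shrinks to roughly a quarter of its 16-bit size.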

### Step 4 — Configure LoRA Adapters

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                        # rank — higher = more capacity
    lora_alpha=32,               # scaling factor
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts; with this config well under 1% of the weights are trainable.
```
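
The trainable-parameter count can be checked by hand: each adapted linear layer of shape `d_in -> d_out` gains two matrices, `A` (`r x d_in`) and `B` (`d_out x r`). The sketch below assumes Llama-3.2-3B-style dimensions (hidden size 3072, 28 layers, `q_proj` 3072 -> 3072, `v_proj` 3072 -> 1024); those numbers are my assumption and should be checked against the model's `config.json`. The function name is hypothetical.

```python
def lora_param_count(r: int, shapes: list, n_layers: int) -> int:
    """Count LoRA parameters: r * (d_in + d_out) per adapted module, per layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * n_layers

# Assumed (d_in, d_out) shapes for q_proj and v_proj; verify against config.json.
shapes = [(3072, 3072), (3072, 1024)]
trainable = lora_param_count(r=16, shapes=shapes, n_layers=28)
print(f"{trainable:,}")
```

Doubling `r` doubles this count, which is the memory cost referred to by "higher = more capacity".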

### Step 5 — Train with SFTTrainer

```python
from trl import SFTTrainer
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./fine-tuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,  # matches bnb_4bit_compute_dtype above; use fp16=True on GPUs without bfloat16 support
    logging_steps=10,
    save_strategy="epoch",
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
)

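# Keyword names below follow older TRL releases. Newer releases expect
# dataset_text_field and the sequence-length limit on trl.SFTConfig, and take
# the tokenizer via processing_class, so check the version you have installed.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=args,
    max_seq_length=512,
)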

trainer.train()
```
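
The batch size, accumulation, warmup, and scheduler settings interact. The sketch below approximates how many optimizer steps the run takes and what learning rate a warmup-plus-cosine schedule yields at a given step. It mirrors the general shape of such a schedule rather than the exact Trainer implementation; the function names and the 10,000-example dataset size are hypothetical.

```python
import math

def training_steps(n_examples: int, per_device_bs: int, grad_accum: int,
                   epochs: int, n_devices: int = 1) -> int:
    """Approximate optimizer steps: batches per epoch (rounded up) times epochs."""
    effective_bs = per_device_bs * grad_accum * n_devices
    return math.ceil(n_examples / effective_bs) * epochs

def lr_at(step: int, total_steps: int, peak_lr: float, warmup_ratio: float) -> float:
    """Linear warmup to peak_lr, then cosine decay towards zero."""
    warmup = math.ceil(total_steps * warmup_ratio)
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Same settings as the TrainingArguments above, on a hypothetical 10,000-example dataset.
total = training_steps(n_examples=10_000, per_device_bs=4, grad_accum=4, epochs=3)
for step in (0, 57, total // 2, total):
    print(step, f"{lr_at(step, total, peak_lr=2e-4, warmup_ratio=0.03):.2e}")
```

With these settings the effective batch size is 16, giving 1,875 optimizer steps, of which the first 57 are warmup.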

### Step 6 — Save and Merge Adapters

```python
# Save LoRA adapters only
trainer.model.save_pretrained("./lora-adapters")

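# Optional: merge adapters back into the base model.
# Reload the base weights in bf16 first; merging directly into 4-bit weights can add rounding error.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
merged = PeftModel.from_pretrained(base, "./lora-adapters")
merged = merged.merge_and_unload()
merged.save_pretrained("./merged-model")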
```
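
Merging works because a LoRA layer computes `W x + (alpha / r) * B (A x)`, which equals `(W + (alpha / r) * B A) x`. The toy example below checks that identity with plain Python lists; the matrix values are arbitrary and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b, scale=1.0):
    """Element-wise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Toy sizes: d_out=2, d_in=3, rank r=1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen base weight (d_out x d_in)
B = [[0.5], [-1.0]]                       # LoRA B (d_out x r)
A = [[1.0, 2.0, 0.0]]                     # LoRA A (r x d_in)
alpha, r = 2.0, 1
scale = alpha / r

x = [[1.0], [1.0], [1.0]]                 # one input column vector

# Adapter path: base output plus the scaled low-rank correction.
y_adapter = add(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Merged path: fold the correction into the weight once, then do a single matmul.
W_merged = add(W, matmul(B, A), scale)
y_merged = matmul(W_merged, x)

print(y_adapter, y_merged)
```

After merging, inference needs no extra adapter computation, at the cost of no longer being able to swap adapters on a shared base model.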

### Step 7 — Evaluate

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="./merged-model", tokenizer=tokenizer)
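# Use the same template as in training, with the response left empty.
prompt = "### Instruction:\nSummarize.\n\n### Input:\nAI is transforming...\n\n### Response:\n"
output = pipe(prompt, max_new_tokens=64)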
print(output[0]["generated_text"])
```
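
By default the text-generation pipeline returns the prompt together with the completion, so scoring usually starts by cutting the prompt off. The sketch below does that and computes a crude exact-match score; both helper names are hypothetical, and exact match is only a sensible metric for tasks with a single correct answer (translation of short phrases, classification), not for open-ended summaries.

```python
RESPONSE_MARKER = "### Response:\n"

def extract_response(generated_text: str) -> str:
    """Return only the text after the last response marker, trimmed."""
    _, _, tail = generated_text.rpartition(RESPONSE_MARKER)
    return tail.strip()

def exact_match(predictions: list, references: list) -> float:
    """Fraction of predictions equal to the reference, ignoring case and extra spaces."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)

# A made-up generation, standing in for output[0]["generated_text"].
fake_generation = (
    "### Instruction:\nTranslate to French.\n\n"
    "### Input:\nHello world\n\n"
    "### Response:\nBonjour le monde \n"
)
prediction = extract_response(fake_generation)
score = exact_match([prediction], ["bonjour le monde"])
print(prediction, score)
```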

---

## Key Hyperparameters to Tune

| Parameter | Typical Range | Effect |
|-----------|--------------|--------|
| `r` (LoRA rank) | 8–64 | Higher = more capacity, more memory |
| `learning_rate` | 1e-4 – 5e-4 | Too high = instability; too low = slow convergence |
| `num_train_epochs` | 1–5 | More = overfitting risk |
| `max_seq_length` | 512–4096 | Longer = more VRAM |

---

## Best Practices

* Start with **QLoRA** — best cost/quality tradeoff for most tasks
* Use a consistent **instruction-tuning format** (system/instruction/response) for chat tasks, and reuse the same template at inference time
* Keep datasets **small but high quality** (1K–50K examples is often enough)
* Always **evaluate on a held-out set** to detect overfitting
* Use `wandb` or TensorBoard to monitor training loss
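
The held-out evaluation advice can be sketched in plain Python: carve off a validation slice before training, then watch whether validation loss stops improving. The helper names, the 10% split, the patience of 2, and the loss values are all illustrative assumptions.

```python
import random

def train_val_split(examples: list, val_fraction: float = 0.1, seed: int = 42):
    """Shuffle a copy deterministically and carve off a validation slice."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

def is_overfitting(val_losses: list, patience: int = 2) -> bool:
    """True if validation loss has not beaten its earlier best for `patience` evaluations."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return all(loss >= best_before for loss in val_losses[-patience:])

train, val = train_val_split(list(range(100)))
print(len(train), len(val))

# Made-up validation losses, one per epoch: improvement stalls after the third.
print(is_overfitting([1.20, 1.00, 0.90, 0.95, 0.97]))
```

With the `datasets` library, `Dataset.train_test_split(test_size=0.1, seed=42)` performs the same kind of split on the dataset object directly.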

> **Tip:** For production, fine-tune on your task-specific data, then use RAG on top for dynamic knowledge — you get the best of both worlds.
