Concept #147 · Hard · extended-ai-concepts

How do you fine-tune an LLM?

#fine-tuning #lora #qlora #llm #training

Answer

How to Fine-Tune an LLM

Fine-tuning adapts a pre-trained LLM to a specific task or domain by continuing training on a curated dataset. It lets you customize model behavior without training from scratch.


Fine-Tuning Approaches

| Approach | Description | VRAM | Best For |
|---|---|---|---|
| Full Fine-tuning | Update all weights | 80GB+ | Max accuracy, large budget |
| LoRA | Train low-rank adapter matrices | 16–24GB | Most production use cases |
| QLoRA | LoRA on quantized (4-bit) model | 8–12GB | Consumer GPUs |
| Prefix Tuning | Prepend trainable tokens | 8GB | Minimal parameter change |
| Prompt Tuning | Tune soft prompt embeddings only | 4GB | Lightest approach |
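
To make the LoRA and QLoRA rows concrete, the adapter update can be written as follows. This is the usual LoRA formulation; the symbols W_0, A, B, d and k are notation introduced here for illustration, while r and α correspond to the `r` and `lora_alpha` settings in Step 4.

```latex
W = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Only A and B are trained, so each adapted matrix adds r(d + k) trainable parameters instead of the d·k parameters in W_0, which stays frozen. In QLoRA, the frozen W_0 is additionally stored in 4-bit precision.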

Step-by-Step Fine-Tuning with QLoRA

Step 1 — Install Dependencies

bash
pip install transformers peft datasets trl bitsandbytes accelerate

Step 2 — Prepare Dataset

python
from datasets import Dataset

data = [
    {"instruction": "Summarize the text.", "input": "AI is transforming...", "output": "AI is changing industries."},
    {"instruction": "Translate to French.", "input": "Hello world",           "output": "Bonjour le monde"},
]

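def format_prompt(example):
    # Build one training string per example; "\n" escapes keep the f-string valid
    text = (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example['input']}\n\n"
        f"### Response:\n{example['output']}"
    )
    return {"text": text}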

dataset = Dataset.from_list(data).map(format_prompt)
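
Before formatting, it can help to drop incomplete and duplicated records. The sketch below is one possible way to do that in plain Python; `clean_records` and its `required` parameter are hypothetical names introduced here, and the rules (non-blank instruction and output, exact-duplicate removal) are assumptions rather than a fixed recipe.

```python
def clean_records(records, required=("instruction", "output")):
    """Drop records with blank required fields, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        if any(not str(rec.get(key, "")).strip() for key in required):
            continue  # skip incomplete examples
        key = tuple(str(rec.get(f, "")).strip() for f in ("instruction", "input", "output"))
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        cleaned.append(rec)
    return cleaned

raw = [
    {"instruction": "Translate to French.", "input": "Hello world", "output": "Bonjour le monde"},
    {"instruction": "Translate to French.", "input": "Hello world", "output": "Bonjour le monde"},
    {"instruction": "Summarize the text.", "input": "AI is transforming...", "output": ""},
]
print(len(clean_records(raw)))  # 1: one duplicate and one blank output removed
```

The cleaned list can then be passed to `Dataset.from_list` in place of `data`.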

Step 3 — Load Model in 4-bit (QLoRA)

python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_id = "meta-llama/Llama-3.2-3B"
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
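tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    # Base Llama tokenizers typically define no pad token; reuse EOS so batches can be padded
    tokenizer.pad_token = tokenizer.eos_token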
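
A rough back-of-the-envelope calculation shows why 4-bit loading matters. The sketch below estimates memory for the weights alone; the 3-billion parameter count is an assumed round figure, and the estimate ignores quantization constants, activations, gradients, and optimizer state, so real usage will be higher.

```python
def weight_memory_gb(num_params: int, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Assumed parameter count for a ~3B model (illustrative, not read from the checkpoint)
num_params = 3_000_000_000

for label, bits in [("fp32", 32), ("bf16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(num_params, bits):.1f} GB")
```

Under these assumptions the weights shrink from about 6 GB in bf16 to about 1.5 GB in 4-bit, which leaves more headroom for activations and adapter training.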

Step 4 — Configure LoRA Adapters

python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                        # rank — higher = more capacity
    lora_alpha=32,               # scaling factor
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts; with this config only a small
# fraction of the weights (well under 1%) is trainable
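
The trainable-parameter count can be estimated by hand. The sketch below applies r × (d_in + d_out) per adapted matrix to a hypothetical model with 32 layers and square 4096×4096 `q_proj` and `v_proj` matrices; those shapes are assumptions chosen for illustration and are not the dimensions of the checkpoint loaded above.

```python
def lora_param_count(shapes, r):
    """Trainable parameters added by LoRA: r * (d_in + d_out) per adapted matrix."""
    return sum(r * (d_in + d_out) for d_in, d_out in shapes)

# Hypothetical model: 32 layers, each with square 4096x4096 q_proj and v_proj
num_layers = 32
per_layer = [(4096, 4096), (4096, 4096)]  # (d_in, d_out) for q_proj, v_proj

trainable = lora_param_count(per_layer * num_layers, r=16)
full = sum(d_in * d_out for d_in, d_out in per_layer * num_layers)

print(f"LoRA params:           {trainable:,}")   # 8,388,608
print(f"Adapted-matrix params: {full:,}")        # 1,073,741,824
print(f"Ratio:                 {trainable / full:.2%}")
```

Doubling `r` doubles the adapter size, which is the capacity and memory trade-off noted in the config comments.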

Step 5 — Train with SFTTrainer

python
from trl import SFTTrainer
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./fine-tuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,                   # matches bnb_4bit_compute_dtype=torch.bfloat16 (needs a bf16-capable GPU)
    logging_steps=10,
    save_strategy="epoch",
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
)

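# Note: these keyword arguments follow older TRL releases. Newer versions move
# dataset_text_field and max_seq_length into SFTConfig and replace tokenizer
# with processing_class, so check the docs for your installed version.
trainer = SFTTrainer(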
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=args,
    max_seq_length=512,
)

trainer.train()
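
To see how the batch, accumulation, warmup, and scheduler settings interact, the sketch below computes the effective batch size, an approximate number of optimizer steps, and the learning rate under linear warmup followed by cosine decay. The function names and the 10,000-example dataset size are hypothetical, and the Trainer's own step rounding may differ slightly.

```python
import math

def training_schedule(num_examples, per_device_batch, grad_accum, epochs,
                      warmup_ratio, num_devices=1):
    """Return (effective batch size, total optimizer steps, warmup steps)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps

def lr_at(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Hypothetical dataset size; the other values mirror the TrainingArguments above
batch, total, warmup = training_schedule(10_000, 4, 4, 3, 0.03)
print(batch, total, warmup)  # 16 1875 57
print(lr_at(warmup, total, warmup, 2e-4))  # peak learning rate at the end of warmup
```

With these settings each optimizer step sees 16 examples per device, so raising `gradient_accumulation_steps` is a way to grow the effective batch without using more VRAM.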

Step 6 — Save and Merge Adapters

python
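# Save LoRA adapters only
trainer.model.save_pretrained("./lora-adapters")

# Optional: merge adapters back into the base model.
# Reload the base weights in bf16 first: `model` is already PEFT-wrapped and 4-bit quantized.
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
merged = PeftModel.from_pretrained(base, "./lora-adapters")
merged = merged.merge_and_unload()
merged.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")  # keep the tokenizer next to the merged weights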
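
Conceptually, merging folds the scaled low-rank product back into the frozen weight. The sketch below illustrates that arithmetic on a tiny hand-made example in plain Python; it is not PEFT's implementation, and `matmul`, `merge_lora`, and the matrices are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w0, lora_b, lora_a, alpha, r):
    """Return W0 + (alpha / r) * B @ A."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]

# Tiny hypothetical example: a 2x2 weight with a rank-1 adapter
w0 = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]   # shape (2, 1)
lora_a = [[0.5, -0.5]]    # shape (1, 2)

merged_w = merge_lora(w0, lora_b, lora_a, alpha=2, r=1)
print(merged_w)  # [[2.0, -1.0], [2.0, -1.0]]
```

After merging, inference needs a single matrix multiply per layer, which is why a merged model has no adapter overhead at serving time.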

Step 7 — Evaluate

python
from transformers import pipeline

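pipe = pipeline("text-generation", model="./merged-model", tokenizer=tokenizer)

# Use the same prompt layout as in training, leaving the response empty
prompt = "### Instruction:\nSummarize.\n\n### Input:\nAI is transforming...\n\n### Response:\n"
output = pipe(prompt, max_new_tokens=64)
print(output[0]["generated_text"])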
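
By default the text-generation pipeline returns the prompt together with the completion, so it is often useful to isolate the model's answer before scoring it. The helper below is a minimal sketch that assumes the `### Response:` marker from the training format; `extract_response` is a hypothetical name, and passing `return_full_text=False` to the pipeline is an alternative.

```python
RESPONSE_MARKER = "### Response:\n"

def extract_response(generated_text: str) -> str:
    """Return only the text after the last response marker, stripped of whitespace."""
    _, sep, completion = generated_text.rpartition(RESPONSE_MARKER)
    # If the marker is missing, fall back to the full text
    return completion.strip() if sep else generated_text.strip()

sample = (
    "### Instruction:\nSummarize.\n\n"
    "### Input:\nAI is transforming...\n\n"
    "### Response:\nAI is changing industries."
)
print(extract_response(sample))  # AI is changing industries.
```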

Key Hyperparameters to Tune

| Parameter | Typical Range | Effect |
|---|---|---|
| r (LoRA rank) | 8–64 | Higher = more capacity, more memory |
| learning_rate | 1e-4 – 5e-4 | Too high = instability |
| num_train_epochs | 1–5 | More = overfitting risk |
| max_seq_length | 512–4096 | Longer = more VRAM |

Best Practices

  • Start with QLoRA — best cost/quality tradeoff for most tasks
  • Use instruction-tuning format (system/instruction/response) for chat tasks
  • Keep datasets small but high quality (1K–50K examples is often enough)
  • Always evaluate on a held-out set to detect overfitting
  • Use wandb or TensorBoard to monitor training loss
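
The held-out set mentioned in the list can be carved off before training. The sketch below is a plain-Python version with a fixed seed; `train_val_split`, the 10% fraction, and the sample records are assumptions for illustration, and the `datasets` library's `Dataset.train_test_split` offers similar behavior.

```python
import random

def train_val_split(records, val_fraction=0.1, seed=42):
    """Shuffle a copy of the records and split off a held-out validation set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

# Hypothetical records in the same instruction/input/output shape used earlier
records = [{"instruction": f"Task {i}", "input": "", "output": f"Answer {i}"}
           for i in range(100)]
train, val = train_val_split(records)
print(len(train), len(val))  # 90 10
```

If validation loss starts rising while training loss keeps falling, that is the usual sign of overfitting and a cue to stop earlier or reduce `num_train_epochs`.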

Tip: For production, fine-tune on your task-specific data, then use RAG on top for dynamic knowledge — you get the best of both worlds.