Concept #5 · Hard · gen-ai-fundamentals

Explain quantization in LLMs. Why is it important?

#gen-ai #quantization #llm

Answer

Model Quantisation

Quantisation reduces the numerical precision of model weights and/or activations (for example, from 16-bit floats to 8-bit or 4-bit values) to save memory and speed up inference, ideally with only a small loss in accuracy.
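To make the idea concrete, here is a minimal sketch of symmetric "absmax" INT8 quantisation in NumPy. It is one simple scheme chosen for illustration, not the method of any particular library, and the function names are made up for this example.

```python
import numpy as np

def quantize_absmax_int8(x: np.ndarray):
    """Symmetric per-tensor quantisation: one scale maps floats to int8 codes."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # guard against an all-zero tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float values from the int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)

q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)

print("bytes before:", weights.nbytes)  # 4 bytes per value (FP32)
print("bytes after: ", q.nbytes)        # 1 byte per value (INT8)
print("max abs error:", float(np.max(np.abs(weights - restored))))
```

The rounding error per value is at most half a quantisation step (`scale / 2`), which is why a single outlier that inflates the scale hurts every other value in the tensor.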

Why Quantisation Matters

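A 70B-parameter model in FP32 needs about 280 GB of VRAM for its weights alone (70 billion parameters × 4 bytes each). Most practitioners can't afford that. Quantisation makes large models runnable on much more modest hardware, down to consumer GPUs for the smaller models: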

| Precision | Bits | 7B model VRAM | 70B model VRAM |
|---|---|---|---|
| FP32 | 32 | ~28 GB | ~280 GB |
| FP16 / BF16 | 16 | ~14 GB | ~140 GB |
| INT8 | 8 | ~7 GB | ~70 GB |
| INT4 (NF4) | 4 | ~3.5 GB | ~35 GB |
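The figures in the table follow from simple arithmetic: parameters × bits per parameter ÷ 8 gives bytes. The sketch below reproduces them, assuming decimal gigabytes and counting weights only (activations, the KV cache, and quantisation metadata such as scales would add to this). The helper name is illustrative.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("INT4/NF4", 4)]:
    print(f"{name:10s} 7B: {weight_memory_gb(7e9, bits):6.1f} GB   "
          f"70B: {weight_memory_gb(70e9, bits):6.1f} GB")
```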

Common Quantisation Types

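FP16 / BF16: Half-precision (16-bit) floats. BF16 keeps the 8 exponent bits of FP32, so it has a much wider dynamic range than FP16, at the cost of fewer mantissa bits and therefore lower precision. BF16 is the usual choice for training and fine-tuning on hardware that supports it.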
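The range difference follows directly from the bit layouts (FP16: 5 exponent bits and 10 mantissa bits; BF16: 8 exponent bits and 7 mantissa bits). A small pure-Python sketch, with illustrative helper names, computes the largest finite value and the spacing near 1.0 for each format:

```python
def max_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-754-style binary float format."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exponent = bias  # the all-ones exponent is reserved for inf/NaN
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent

def epsilon(mantissa_bits: int) -> float:
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -mantissa_bits

fp16_max = max_finite(exponent_bits=5, mantissa_bits=10)
bf16_max = max_finite(exponent_bits=8, mantissa_bits=7)

print(f"FP16: max {fp16_max:.4g}, epsilon {epsilon(10):.3g}")
print(f"BF16: max {bf16_max:.4g}, epsilon {epsilon(7):.3g}")
```

FP16 overflows just above 65504, which is why FP16 training typically needs loss scaling, while BF16 trades precision for a range comparable to FP32.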

INT8: 8-bit integer. Usually minimal quality loss, a 2× memory reduction relative to FP16 (4× relative to FP32), and faster inference on hardware with INT8 kernels.

INT4 / NF4: 4-bit integer or 4-bit NormalFloat. Aggressive compression with some quality trade-off. NF4 (introduced with QLoRA) spaces its 16 levels to suit weights that are roughly normally distributed, which pretrained weights usually are.
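The idea behind a normal-distribution-aware 4-bit format can be sketched with a codebook whose 16 levels sit at quantiles of a standard normal, so levels are denser near zero where most weights lie. This is a simplified illustration, not the exact NF4 table used by QLoRA or bitsandbytes (which, among other things, includes an exact zero), and all names here are hypothetical.

```python
from statistics import NormalDist
import numpy as np

def normal_quantile_codebook(n_levels: int = 16) -> np.ndarray:
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    # Mid-point probabilities keep inv_cdf away from 0 and 1, where it is infinite.
    probs = (np.arange(n_levels) + 0.5) / n_levels
    levels = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return levels / np.max(np.abs(levels))

def quantize_block(block: np.ndarray, codebook: np.ndarray):
    """Absmax-normalise a 1-D block, then store the index of the nearest level."""
    scale = float(np.max(np.abs(block)))
    scale = scale if scale > 0 else 1.0
    normalised = block / scale
    idx = np.argmin(np.abs(normalised[:, None] - codebook[None, :]), axis=1)
    # Indices fit in 4 bits; real implementations pack two of them per byte.
    return idx.astype(np.uint8), scale

def dequantize_block(idx: np.ndarray, scale: float, codebook: np.ndarray) -> np.ndarray:
    """Look up each level and undo the block normalisation."""
    return codebook[idx] * scale

codebook = normal_quantile_codebook()
rng = np.random.default_rng(0)
block = rng.normal(0.0, 0.02, size=64)  # one block of 64 synthetic weights

idx, scale = quantize_block(block, codebook)
restored = dequantize_block(idx, scale, codebook)

gaps = np.diff(codebook)
print("gap near zero:", gaps[7], " gap at the edge:", gaps[0])
print("max abs error:", float(np.max(np.abs(block - restored))))
```

Storing one scale per small block, rather than per tensor, limits how far a single outlier can stretch the quantisation grid.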

Quantising with bitsandbytes + HuggingFace

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantisation (NF4 format)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",         # NF4 suits roughly normally distributed weights
    bnb_4bit_compute_dtype="bfloat16", # Dequantise and compute in BF16 for stability
    bnb_4bit_use_double_quant=True,    # Also quantise the scales; saves ~0.4 bits per parameter
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=bnb_config,
    device_map="auto",                 # Place layers on the available devices automatically
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")

# The 13B weights now take roughly 7 GB of VRAM instead of ~26 GB in FP16
```

Post-Training Quantisation (PTQ) vs Quantisation-Aware Training (QAT)

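| Method | When applied | Accuracy | Cost |
|---|---|---|---|
| PTQ | After training (simple rounding needs no data; stronger variants use a small calibration set) | Good for INT8, acceptable for INT4 | No training cost |
| QAT | During training (simulates quantisation in the forward pass) | Better than PTQ at the same precision | Requires a training or fine-tuning run |
| GPTQ | One-shot PTQ with calibration data | Close to full-precision quality at INT4 on many models | Low (calibration only) |
| AWQ | Activation-aware PTQ with calibration data | Strong quality at INT4, often on par with or better than GPTQ | Low (calibration only) |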
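As a rough illustration of what "calibration" means in PTQ, the sketch below picks a fixed INT8 scale from the magnitudes observed on a few calibration batches, then reuses that scale on new data. It is a generic static-scale scheme on synthetic data, not the GPTQ or AWQ algorithm, and the function names and the 99.9th-percentile clipping choice are assumptions made for this example.

```python
import numpy as np

def calibrate_scale(calibration_batches, percentile: float = 99.9) -> float:
    """Pick an INT8 scale from the magnitudes seen on calibration data."""
    magnitudes = np.abs(np.concatenate([b.ravel() for b in calibration_batches]))
    # Clipping at a high percentile keeps rare outliers from stretching the grid.
    clip_value = float(np.percentile(magnitudes, percentile))
    return clip_value / 127.0 if clip_value > 0 else 1.0

def quantize_with_scale(x: np.ndarray, scale: float) -> np.ndarray:
    """Quantise new data with the fixed, pre-computed scale (outliers saturate)."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

rng = np.random.default_rng(0)
calibration_batches = [rng.normal(size=(32, 128)) for _ in range(8)]  # synthetic stand-in
scale = calibrate_scale(calibration_batches)

new_activations = rng.normal(size=(4, 128))
q = quantize_with_scale(new_activations, scale)

print("calibrated scale:", scale)
print("codes range:", int(q.min()), "to", int(q.max()))
```

QAT differs in that this quantise-then-dequantise step is simulated inside the training loop, so the weights can adapt to the rounding error instead of absorbing it after the fact.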

Rule of thumb: use BF16 for training and INT8 or INT4 for inference. NF4 (via bitsandbytes) is the usual choice for HuggingFace + LoRA (QLoRA) workflows.