Explain quantization in LLMs. Why is it important?
Answer
Model Quantisation
Quantisation reduces the numerical precision of model weights and/or activations to save memory and speed up inference, ideally with only a small loss in accuracy.
Why Quantisation Matters
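A 70B-parameter model stored in FP32 needs about 280 GB for the weights alone (4 bytes per parameter), before activations and the KV cache are counted. That is far beyond any single consumer GPU. Quantisation shrinks the weights enough that large models become runnable on much more modest hardware: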
| Precision | Bits | 7B model VRAM | 70B model VRAM |
|---|---|---|---|
| FP32 | 32 | ~28 GB | ~280 GB |
| FP16 / BF16 | 16 | ~14 GB | ~140 GB |
| INT8 | 8 | ~7 GB | ~70 GB |
| INT4 / NF4 | 4 | ~3.5 GB | ~35 GB |
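The figures in the table follow from simple arithmetic: parameters × bits per parameter ÷ 8 gives bytes. A minimal sketch, assuming decimal gigabytes and counting the weights only (the helper name `weight_memory_gb` is made up for illustration):

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


# Reproduce the rows of the table for a 7B and a 70B model.
for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("4-bit", 4)]:
    small = weight_memory_gb(7e9, bits)
    large = weight_memory_gb(70e9, bits)
    print(f"{name:>9}: 7B = {small:5.1f} GB, 70B = {large:6.1f} GB")
```

Real usage is higher than these numbers because activations, the KV cache, and per-block quantisation constants also take memory.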
Common Quantisation Types
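FP16 / BF16: 16-bit floating point. BF16 keeps FP32's 8-bit exponent, so it covers a much wider numerical range than FP16, at the cost of fewer mantissa bits (lower precision). BF16 is the usual choice for training and fine-tuning on hardware that supports it.
INT8: 8-bit integer. Usually little quality loss, a 2× memory reduction relative to FP16, and potentially faster inference where the hardware and kernels support INT8 arithmetic.
INT4 / NF4: 4-bit integer or 4-bit NormalFloat. Aggressive compression with a more noticeable quality trade-off. NF4 (introduced with QLoRA) is designed for weights that are approximately normally distributed.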
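To make the integer formats concrete, here is a minimal sketch of symmetric "absmax" INT8 quantisation with a single scale for the whole tensor. It is an illustration only: the function names are hypothetical, and production libraries typically use per-channel or per-block scales plus outlier handling.

```python
def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using one per-tensor scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


weights = [0.12, -0.5, 0.33, 0.01, -0.27]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("codes:", q)
print("scale:", scale, "max round-trip error:", max_error)
```

The round-trip error is bounded by half a quantisation step (`scale / 2`). With only 16 levels instead of 255, a 4-bit format has a much larger step, which is why 4-bit schemes lean on small blocks and non-uniform levels such as NF4.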
Quantising with bitsandbytes + HuggingFace
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantisation (NF4 format)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NF4 suits normally distributed weights
    bnb_4bit_compute_dtype="bfloat16",  # Compute in BF16 for stability
    bnb_4bit_use_double_quant=True,     # Double quantisation saves ~0.4 bits per parameter
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")

# Model weights now fit in ~7 GB of VRAM instead of ~26 GB in FP16
```
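The "~0.4 bits" saving from double quantisation can be checked with a little arithmetic. The sketch below assumes the block sizes reported in the QLoRA paper (64 weights per block, and 256 first-level constants per second-level block); these values are assumptions about the configuration and are not read from the library.

```python
WEIGHT_BLOCK = 64      # weights sharing one quantisation constant (assumed)
CONSTANT_BLOCK = 256   # first-level constants sharing one FP32 constant (assumed)

# Without double quantisation: one FP32 constant per block of weights.
overhead_plain = 32 / WEIGHT_BLOCK

# With double quantisation: constants are stored in 8 bits, plus one FP32
# second-level constant for every CONSTANT_BLOCK first-level constants.
overhead_double = 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONSTANT_BLOCK)

saving = overhead_plain - overhead_double
print(f"overhead without double quantisation: {overhead_plain:.3f} bits/param")
print(f"overhead with double quantisation:    {overhead_double:.3f} bits/param")
print(f"saving:                               {saving:.3f} bits/param")
```

Under these assumptions the overhead drops from 0.5 to roughly 0.127 bits per parameter, a saving of about 0.37 bits.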
Post-Training Quantisation (PTQ) vs Quantisation-Aware Training (QAT)
| Method | When Applied | Accuracy | Cost |
|---|---|---|---|
| PTQ | After training (plain round-to-nearest needs no data; GPTQ and AWQ use a small calibration set) | Good for INT8, acceptable for INT4 | No training cost |
| QAT | During training (simulates quantisation) | Better than PTQ at same precision | Requires fine-tuning run |
| GPTQ | One-shot PTQ with calibration data | Close to FP16 quality at INT4 on many models | Low (calibration only) |
| AWQ | Activation-aware PTQ with calibration data | Often among the best PTQ results at INT4 | Low (calibration only) |
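The difference between the two families can be sketched in a few lines. PTQ picks quantisation parameters after training, here by a simple absmax calibration over sample values. QAT instead inserts a quantise-then-dequantise ("fake quantisation") step into the forward pass so the training loss sees the rounding error. Both helpers below are hypothetical illustrations on a signed 4-bit grid, not any library's API, and real QAT additionally needs a gradient rule (commonly a straight-through estimator) that this sketch omits.

```python
def calibrate_scale(samples, qmax=7):
    """PTQ-style calibration: derive the scale from observed values (absmax)."""
    return max(abs(s) for s in samples) / qmax


def fake_quantize(x, scale, qmin=-8, qmax=7):
    """Quantise to a signed 4-bit grid, then immediately dequantise."""
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale


calibration = [0.9, -1.4, 0.3, 0.7, -0.2]
scale = calibrate_scale(calibration)

for x in [0.33, -1.0, 10.0]:
    print(x, "->", fake_quantize(x, scale))
```

Values outside the calibrated range, such as `10.0` here, are clamped to the largest representable level, which is one reason the choice of calibration data matters for PTQ methods.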
Rule of thumb: Use BF16 for training, INT4/INT8 for inference. NF4 via bitsandbytes is the usual choice for Hugging Face QLoRA-style workflows (LoRA adapters trained on top of a 4-bit base model).