Explain quantization in LLMs. Why is it important?

Question

Accepted Answer

## Model Quantisation

**Quantisation** reduces the numerical precision of model weights and/or activations (for example, from 16-bit floats to 8-bit or 4-bit values) to save memory and speed up inference, ideally with only a small loss in accuracy.

### Why Quantisation Matters

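A 70B-parameter model in FP32 needs about **280 GB** for its weights alone (4 bytes per parameter), before counting activations or the KV cache. Few practitioners have that much VRAM. Quantisation makes large models runnable on far more modest hardware (the figures below count weights only):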

| Precision | Bits | 7B model VRAM | 70B model VRAM |
|-----------|------|---------------|----------------|
| FP32 | 32 | ~28 GB | ~280 GB |
| FP16 / BF16 | 16 | ~14 GB | ~140 GB |
| INT8 | 8 | ~7 GB | ~70 GB |
| INT4 (NF4) | 4 | ~3.5 GB | ~35 GB |
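
The figures in the table follow from one line of arithmetic: parameters × bits ÷ 8 gives bytes. A minimal sketch, assuming decimal gigabytes and counting weights only; `weight_memory_gb` is a hypothetical helper name, not a library function.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only, in decimal gigabytes."""
    bytes_total = n_params * bits_per_param / 8
    return bytes_total / 1e9

# Reproduce the rows of the table above.
for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("INT4/NF4", 4)]:
    print(f"{name:10s} 7B: {weight_memory_gb(7e9, bits):6.1f} GB   "
          f"70B: {weight_memory_gb(70e9, bits):6.1f} GB")
```

Real usage is somewhat higher, since activations, the KV cache, and quantisation constants also take memory.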

### Common Quantisation Types

**FP16 / BF16**: Half-precision floats. BF16 keeps FP32's 8-bit exponent, so it covers a much wider dynamic range than FP16, at the cost of fewer mantissa bits (lower precision). Standard for training and fine-tuning.
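
The range-versus-precision trade-off can be seen with the standard library alone. This is a sketch, not how frameworks implement it: `to_bf16` and `to_fp16` are hypothetical helpers, and the BF16 conversion truncates a float32 to its top 16 bits, whereas real hardware usually rounds to nearest.

```python
import struct

def to_bf16(x: float) -> float:
    """Round-trip through BF16 by keeping only the top 16 bits of a float32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision; values too large become inf."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except OverflowError:
        return float("inf")

big, fine = 70000.0, 1.001
print(to_fp16(big), to_bf16(big))    # FP16 overflows (max 65504); BF16 keeps the magnitude
print(to_fp16(fine), to_bf16(fine))  # FP16 keeps more digits; BF16 collapses to 1.0
```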

**INT8**: 8-bit integer. Usually very little quality loss, half the memory of FP16 (a quarter of FP32), and potentially faster inference on hardware and kernels with native INT8 support.
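
One simple way to map floats to INT8 is symmetric "absmax" scaling: divide by the largest magnitude, multiply by 127, and round. The sketch below assumes a single scale for the whole tensor; production libraries typically use per-channel or per-block scales and handle outliers separately. The function names are illustrative.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric absmax quantisation: map floats onto integers in [-127, 127]."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.9, -1.27]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_err)
```

The round-trip error per weight is at most half a quantisation step (`scale / 2`), which is why a single large outlier, by inflating `scale`, hurts every other weight in the tensor.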

**INT4 / NF4**: 4-bit integer or 4-bit NormalFloat. Aggressive compression with a larger quality trade-off. NF4 (introduced with QLoRA) places its 16 levels at quantiles of a normal distribution, which suits the roughly normally distributed weights of trained networks.
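
The idea behind a normal-quantile codebook can be sketched as follows. This is an illustration only: the 16 levels here are evenly spaced normal quantiles rescaled to [-1, 1], which is not the exact NF4 table (the published table is built slightly differently and includes an exact zero). All names are hypothetical.

```python
from statistics import NormalDist

def normal_codebook(levels: int = 16) -> list[float]:
    """Illustrative codebook: normal quantiles rescaled to [-1, 1] (not the real NF4 table)."""
    nd = NormalDist()
    raw = [nd.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    peak = max(abs(v) for v in raw)
    return [v / peak for v in raw]

def quantize_block(block: list[float], codebook: list[float]) -> tuple[list[int], float]:
    """Scale a block by its absmax, then store the index of the nearest codebook level."""
    absmax = max(abs(w) for w in block) or 1.0
    idx = [min(range(len(codebook)), key=lambda i: abs(codebook[i] - w / absmax))
           for w in block]
    return idx, absmax

def dequantize_block(idx: list[int], absmax: float, codebook: list[float]) -> list[float]:
    """Look each 4-bit index up in the codebook and undo the scaling."""
    return [codebook[i] * absmax for i in idx]

codebook = normal_codebook()
block = [0.02, -0.4, 0.11, 0.8]
idx, absmax = quantize_block(block, codebook)
print(idx, dequantize_block(idx, absmax, codebook))
```

Because the levels are denser near zero, small weights (the majority under a normal distribution) are represented more finely than a uniform INT4 grid would allow.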

### Quantising with bitsandbytes + HuggingFace

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantisation (NF4 format)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",         # NF4 suits roughly normally distributed weights
    bnb_4bit_compute_dtype="bfloat16", # Dequantise to BF16 for matmuls, for numerical stability
    bnb_4bit_use_double_quant=True,    # Also quantise the scale constants: saves ~0.4 bits per parameter
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")

# The 13B weights now take roughly 7 GB of VRAM instead of ~26 GB in FP16
```
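
The "~0.4 bits" saving from double quantisation in the config above can be checked by hand. The block sizes below (64 weights per block, 256 scale constants per second-level block, 8-bit quantised constants) are the values described in the QLoRA paper and are assumed here rather than read from the library.

```python
BLOCK = 64     # weights sharing one scale constant (assumed, per the QLoRA paper)
BLOCK2 = 256   # scale constants sharing one second-level constant (assumed)

# Without double quantisation: one FP32 scale per block of weights.
single = 32 / BLOCK

# With double quantisation: 8-bit scales, plus one FP32 constant per 256 scales.
double = 8 / BLOCK + 32 / (BLOCK * BLOCK2)

saving = single - double
print(f"overhead without double quant: {single:.3f} bits/param")
print(f"overhead with double quant:    {double:.3f} bits/param")
print(f"saving:                        {saving:.3f} bits/param")
```

Under these assumptions the saving is a little under 0.4 bits per parameter, which adds up to several hundred megabytes on a 13B model.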

### Post-Training Quantisation (PTQ) vs Quantisation-Aware Training (QAT)

| Method | When Applied | Accuracy | Cost |
|--------|-------------|----------|------|
| **PTQ** | After training (no retraining; some methods use a small calibration set) | Good for INT8, acceptable for INT4 | Zero training cost |
| **QAT** | During training (simulates quantisation in the forward pass) | Generally better than PTQ at the same precision | Requires fine-tuning run |
| **GPTQ** | One-shot PTQ with calibration data | Near-FP16 quality at INT4 | Low (calibration only) |
| **AWQ** | Activation-aware PTQ with calibration data | Often among the best at INT4 | Low (calibration only) |
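
To make "simulates quantisation" concrete, here is a toy QAT loop for a single weight. It is a sketch under simplifying assumptions: a fixed 4-bit grid with step 0.25, plain SGD, and a hand-written straight-through estimator (the gradient treats the rounding step as the identity). Real QAT uses a framework's autograd and learned or calibrated scales.

```python
def fake_quant(w: float, scale: float = 0.25, qmin: int = -8, qmax: int = 7) -> float:
    """Quantise to a signed 4-bit grid and immediately dequantise."""
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale

# Fit y = w * x, where the forward pass only ever sees the quantised weight.
data = [(1.0, 0.8), (2.0, 1.6), (3.0, 2.4)]   # generated with w = 0.8
w, lr = 0.0, 0.02
for _ in range(200):
    for x, y in data:
        pred = fake_quant(w) * x     # forward: use the quantised weight
        grad = 2 * (pred - y) * x    # backward: straight-through, rounding treated as identity
        w -= lr * grad               # update the full-precision "shadow" weight

print(w, fake_quant(w))
```

The target 0.8 is not on the grid, so the shadow weight settles near the boundary between the two closest levels (0.75 and 1.0) and the deployed weight is one of those two. The point is that training adapts to the grid it will be served on, which PTQ cannot do.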

> **Rule of thumb:** Use BF16 for training, INT4/INT8 for inference. NF4 via bitsandbytes is the usual choice for Hugging Face QLoRA workflows (a 4-bit base model plus LoRA adapters).
