What is quantization and what are all the different ways to do it?
Answer
What is Quantization in AI?
Quantization reduces the numerical precision of a model's weights (and sometimes its activations) from 32-bit floats (FP32) to lower-precision formats (FP16/BF16, INT8, INT4). This shrinks the model, speeds up inference, and reduces memory usage, usually at the cost of a small loss in accuracy.
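To make the idea concrete, here is a minimal sketch of symmetric (absmax) INT8 quantization in plain Python. The function names and sample weights are made up for illustration, and real libraries work on tensors with per-channel or per-block scales rather than a single list.

```python
def quantize_int8(values):
    """Symmetric (absmax) quantization of a list of floats to INT8 codes."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0  # float step per integer code
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


weights = [0.423, -1.27, 0.056, 0.9]  # illustrative values only
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(restored)
print(max_error)
```

The rounding error of each weight is at most half a quantization step (`scale / 2`), which is where the "small trade-off in accuracy" comes from.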
Why Quantization Matters
```text
FP32 model (LLaMA 3.2 3B):   ~12 GB VRAM
BF16 model:                  ~6 GB VRAM
INT8 model:                  ~3 GB VRAM
INT4 model (QLoRA/GGUF):     ~1.5 GB VRAM  ← runs on a laptop!
```
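These figures follow from parameter count times bytes per weight. The sketch below reproduces them, assuming "3B" means exactly 3 billion parameters and counting weights only (activations, KV cache, and quantization scales add to the real footprint).

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 3e9  # assumption: "3B" rounded to 3 billion parameters

for name, bits in [("FP32", 32), ("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```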
Precision Formats Overview
| Format | Bits | Range | Use Case |
|---|---|---|---|
| FP32 | 32 | ±3.4×10³⁸ | Training (full precision) |
| BF16 | 16 | ±3.4×10³⁸ | Training / inference (same range as FP32) |
| FP16 | 16 | ±65,504 | Inference (narrower range, can overflow) |
| FP8 | 8 | Two variants (E4M3, E5M2) | H100 training & inference |
| INT8 | 8 | -128 to 127 | Inference (good accuracy/speed balance) |
| INT4 | 4 | -8 to 7 | Aggressive compression, slight quality loss |
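The ranges in the table can be derived from the bit layouts. The following sketch assumes signed two's-complement integers and IEEE-style floats whose top exponent code is reserved for infinity/NaN. That convention holds for FP32, FP16, and BF16, but not for the FP8 E4M3 variant, which is therefore left out.

```python
def int_range(bits):
    """Range of a signed two's-complement integer with the given bit width."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def float_max(exponent_bits, mantissa_bits):
    """Largest finite value of an IEEE-style float format."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exponent = (2 ** exponent_bits - 2) - bias  # top exponent code is inf/NaN
    return (2 - 2 ** -mantissa_bits) * 2.0 ** max_exponent


print(int_range(8))       # INT8
print(int_range(4))       # INT4
print(float_max(5, 10))   # FP16: 5 exponent bits, 10 mantissa bits
print(float_max(8, 7))    # BF16: 8 exponent bits, 7 mantissa bits
print(float_max(8, 23))   # FP32: 8 exponent bits, 23 mantissa bits
```

BF16 keeps FP32's 8 exponent bits, which is why the two share nearly the same range, while FP16's 5 exponent bits cap it at 65,504.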
Types of Quantization
1. Post-Training Quantization (PTQ)
Quantize after training — no retraining needed. Fastest to apply.
```python
# Simple FP16 inference (half precision)
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    torch_dtype=torch.float16,  # FP16
    device_map="auto",
)
```
2. BitsAndBytes Quantization (INT8 / INT4)
The easiest way to quantize HuggingFace models on consumer GPUs:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# INT8 quantization
bnb_int8 = BitsAndBytesConfig(load_in_8bit=True)

# INT4 quantization (NF4, best for LLMs)
bnb_int4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",       # NormalFloat4, best quality
    bnb_4bit_use_double_quant=True,  # extra compression
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=bnb_int4,
    device_map="auto",
)
```
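The "extra compression" from double quantization can be estimated with a little arithmetic. The sketch below assumes the settings described in the QLoRA paper (one scale per block of 64 weights, 8-bit first-level scales, and one FP32 second-level scale per 256 first-level scales). These numbers are not stated in the text above, and the helper function is hypothetical.

```python
def scale_overhead_bits(block_size, scale_bits, second_block=None, second_scale_bits=32):
    """Extra bits per weight spent on storing quantization scales."""
    overhead = scale_bits / block_size
    if second_block is not None:
        # The first-level scales are themselves quantized in blocks,
        # and each such block needs one higher-precision scale.
        overhead += second_scale_bits / (block_size * second_block)
    return overhead


plain = scale_overhead_bits(block_size=64, scale_bits=32)
double = scale_overhead_bits(block_size=64, scale_bits=8, second_block=256)

print(f"FP32 scales:         {plain:.3f} bits per weight")
print(f"double quantization: {double:.3f} bits per weight")
```

Under these assumptions the scale overhead drops from 0.5 to about 0.127 bits per weight, so a 4-bit model costs roughly 4.13 rather than 4.5 bits per weight.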
3. GPTQ (Post-Training Quantization with Calibration)
Quantizes weights per-layer using a small calibration dataset. Better accuracy than naive INT4:
```python
from transformers import AutoModelForCausalLM, GPTQConfig, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

gptq_config = GPTQConfig(
    bits=4,
    dataset="wikitext2",  # calibration dataset
    tokenizer=tokenizer,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=gptq_config,
    device_map="auto",
)
model.save_pretrained("./llama3-gptq-4bit")
```
4. AWQ (Activation-Aware Weight Quantization)
Protects salient weights (weights that strongly affect output) during quantization. Often better than GPTQ:
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.2-3B"
quant_path = "./llama3-awq-4bit"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
```
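The intuition behind protecting salient weights can be shown with a toy example. This is a simplified sketch, not the AutoAWQ implementation: the weights, activations, and the scaling factor of 2 are hand-picked so the effect is visible, whereas AWQ searches for per-channel scales using calibration activations.

```python
def quantize_row(weights, bits=4):
    """Symmetric round-to-nearest quantization with one scale for the row."""
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(w) for w in weights) / qmax
    return [round(w / step) * step for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [0.35, 1.00, -0.45, 0.12]
activations = [10.0, 0.1, 0.1, 0.1]  # channel 0 sees large activations: "salient"
exact = dot(weights, activations)

# Plain round-to-nearest quantization.
err_plain = abs(dot(quantize_row(weights), activations) - exact)

# Activation-aware variant: scale the salient weight up before quantizing
# and its activation down by the same factor, so w * x is unchanged.
channel_scales = [2.0, 1.0, 1.0, 1.0]  # hand-picked; found by search in practice
scaled_w = [w * s for w, s in zip(weights, channel_scales)]
scaled_x = [x / s for x, s in zip(activations, channel_scales)]
err_scaled = abs(dot(quantize_row(scaled_w), scaled_x) - exact)

print(f"plain: {err_plain:.4f}  scaled: {err_scaled:.4f}")
```

In this example the output error falls from roughly 0.64 to roughly 0.08. The scaled weight stays below the row maximum, so the quantization step is unchanged while the rounding error on the salient channel is multiplied by a smaller activation.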
5. GGUF / llama.cpp (CPU-Friendly Quantization)
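GGUF is the format used by llama.cpp:

```bash
# Build llama.cpp (recent versions build with CMake rather than make)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Convert to GGUF; the converter expects a local model directory,
# so ./Llama-3.2-3B is assumed to hold the downloaded checkpoint
pip install -r requirements.txt
python3 convert_hf_to_gguf.py ./Llama-3.2-3B --outfile llama3.gguf

# Quantize to Q4_K_M (best quality/size tradeoff for 4-bit)
./build/bin/llama-quantize llama3.gguf llama3-Q4_K_M.gguf Q4_K_M
```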
```python
# Run with Python via llama-cpp-python
from llama_cpp import Llama

llm = Llama(model_path="llama3-Q4_K_M.gguf", n_ctx=4096)
output = llm("Explain quantization in simple terms:", max_tokens=200)
print(output["choices"][0]["text"])
```
6. FP8 Quantization (H100 / Modern Hardware)
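FP8 is a recent 8-bit floating-point format with native support on NVIDIA H100 GPUs:

```python
from transformers import AutoModelForCausalLM, FbgemmFp8Config

# FP8 weight quantization; requires an H100 or newer GPU.
# Passing torch_dtype=torch.float8_e4m3fn alone only casts the stored
# weights, and most ops have no FP8 kernel, so a quantization config is
# used instead. FbgemmFp8Config is one option (needs the fbgemm-gpu package).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=FbgemmFp8Config(),
    device_map="auto",
)
```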
7. Quantization-Aware Training (QAT)
Simulate quantization during training so the model adapts to reduced precision:
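```python
import torch.quantization

# Assumes `model` is an existing float nn.Module set up for eager-mode
# quantization (QuantStub/DeQuantStub around the region to quantize).
model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
torch.quantization.prepare_qat(model, inplace=True)  # insert fake-quant modules

# ... normal training loop ...

model.eval()
torch.quantization.convert(model, inplace=True)  # convert to quantized
```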
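What "simulating quantization" means can be sketched without any framework. The helper below is a hypothetical illustration of a fake-quantize step, not the PyTorch implementation: the value is rounded to the integer grid and immediately converted back to a float, so the forward pass sees the rounding error during training.

```python
def fake_quantize(x, scale, qmin=-128, qmax=127):
    """Quantize then immediately dequantize.

    The result is still a float, but it can only take values on the
    integer grid defined by `scale`, clamped to [qmin, qmax] codes.
    """
    code = max(qmin, min(qmax, round(x / scale)))
    return code * scale


# Forward pass of a one-weight "layer" as QAT would see it.
weight, scale, x = 0.37, 0.1, 2.0
y_float = weight * x                       # full-precision output
y_qat = fake_quantize(weight, scale) * x   # output with simulated INT8 weight

print(y_float, y_qat)
```

Rounding has zero gradient almost everywhere, so QAT implementations typically use a straight-through estimator in the backward pass, treating the rounding step as the identity so the underlying float weight keeps receiving updates.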
Comparison of Quantization Methods
| Method | Bits | Quality Loss | Speed Gain (approx.) | VRAM Saved (vs. FP32) | Use Case |
|---|---|---|---|---|---|
| FP16 | 16 | Negligible | 1.5× | 50% | Standard inference |
| BF16 | 16 | Negligible | 1.5× | 50% | Training + inference |
| INT8 (bitsandbytes) | 8 | Very low | 2× | 75% | General inference |
| INT4 NF4 | 4 | Low | 3–4× | 87% | Consumer GPU inference |
| GPTQ 4-bit | 4 | Low | 3–4× | 87% | Production INT4 |
| AWQ 4-bit | 4 | Very low | 3–4× | 87% | Best INT4 accuracy |
| GGUF Q4_K_M | 4 | Low | 3× | 87% | CPU / Apple Silicon |
| FP8 | 8 | Very low | 2× | 75% | H100 training & inference |
| QAT | 4–8 | Minimal | 3–4× | 75–87% | Best accuracy for size |
How to Choose
```text
Running on CPU / Apple Silicon?           → GGUF (Q4_K_M or Q5_K_M)
Running on consumer GPU (RTX 3090/4090)?  → bitsandbytes INT4 (NF4) or GPTQ
Running on datacenter GPU (A100)?         → AWQ or GPTQ (best accuracy)
Running on H100?                          → FP8
Need to fine-tune a quantized model?      → QLoRA (INT4 NF4 via bitsandbytes)
```
Rule of thumb: Start with `Q4_K_M` (GGUF) for CPU or `NF4` (bitsandbytes) for GPU. Move to AWQ if you need better accuracy at the same bit width.
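For scripting, the selection guide could be encoded as a small lookup. This is only a sketch: the function name and the hardware labels (`"cpu"`, `"apple_silicon"`, `"consumer_gpu"`, `"a100"`, `"h100"`) are hypothetical names chosen for illustration.

```python
def choose_quantization(hardware, fine_tune=False):
    """Return the method the selection guide suggests for a hardware class."""
    if fine_tune:
        # Fine-tuning a quantized model takes priority over the hardware rule.
        return "QLoRA (INT4 NF4 via bitsandbytes)"
    suggestions = {
        "cpu": "GGUF (Q4_K_M or Q5_K_M)",
        "apple_silicon": "GGUF (Q4_K_M or Q5_K_M)",
        "consumer_gpu": "bitsandbytes INT4 (NF4) or GPTQ",
        "a100": "AWQ or GPTQ",
        "h100": "FP8",
    }
    if hardware not in suggestions:
        raise ValueError(f"unknown hardware class: {hardware!r}")
    return suggestions[hardware]


print(choose_quantization("cpu"))
print(choose_quantization("consumer_gpu", fine_tune=True))
```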