What are all the types of quantization in AI?
#gen-ai#quantization
Answer
Types of Quantization in AI
Quantization methods differ along three axes: when quantization is applied, which numeric format is used, and how much accuracy is traded for speed and memory.
By Format (Data Type)
| Format | Bits | Description |
|---|---|---|
| FP32 | 32 | Full precision float (standard training) |
| TF32 | 19 effective | Tensor Float 32 — NVIDIA A100 training |
| BF16 | 16 | Brain Float 16 — same range as FP32, good for training |
| FP16 | 16 | Half precision — inference and mixed-precision training |
| FP8 | 8 | 8-bit float — H100 training, two variants (E4M3, E5M2) |
| INT8 | 8 | 8-bit integer — inference |
| INT4 | 4 | 4-bit integer — aggressive compression |
| NF4 | 4 | NormalFloat 4 (QLoRA): levels placed at quantiles of a normal distribution, which matches typical weight distributions |
| INT2 | 2 | 2-bit — experimental, high quality loss |
| Binary | 1 | 1-bit (BitNet) — research stage |
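To make the integer rows of the table concrete, here is a minimal sketch of symmetric (absmax) INT8 quantization in plain Python. The function names `quantize_int8` and `dequantize_int8` are illustrative only and do not come from any library. Real toolkits typically work per tensor, per channel, or per block and may add a zero point.

```python
def quantize_int8(values):
    """Symmetric absmax quantization of a list of floats to INT8 codes."""
    absmax = max(abs(v) for v in values)
    # One float scale maps the largest magnitude onto the code 127
    scale = absmax / 127 if absmax > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map INT8 codes back to approximate float values."""
    return [c * scale for c in codes]


weights = [0.42, -1.27, 0.05, 0.9]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
print(codes)     # integer codes in [-127, 127]
print(restored)  # floats close to the original weights
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the rounding error that lower-bit formats make larger.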
By When Applied
| Type | When | How |
|---|---|---|
| Post-Training Quantization (PTQ) | After training | Apply to pre-trained model — no retraining |
| Quantization-Aware Training (QAT) | During training | Simulate quantization error during training |
| Dynamic Quantization | At inference time | Quantize activations dynamically per forward pass |
| Static Quantization | Before inference | Pre-compute activation scales using calibration data |
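The difference between the dynamic and static rows comes down to where the activation scale comes from. The sketch below assumes simple absmax scaling to INT8. The helper names and the calibration values are made up for illustration.

```python
def absmax_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` onto qmax."""
    absmax = max(abs(v) for v in values)
    return absmax / qmax if absmax > 0 else 1.0


def quantize(values, scale, qmax=127):
    """Round to the integer grid and clamp to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


# Static: one scale computed ahead of time from calibration batches
calibration_batches = [[0.5, -2.0, 1.0], [1.5, -0.5, 2.54]]
static_scale = absmax_scale([v for batch in calibration_batches for v in batch])


def quantize_static(activations):
    """Reuse the precomputed calibration scale."""
    return quantize(activations, static_scale), static_scale


def quantize_dynamic(activations):
    """Recompute the scale from the activations of this forward pass."""
    scale = absmax_scale(activations)
    return quantize(activations, scale), scale


new_input = [0.1, -0.2, 0.3]
dyn_codes, dyn_scale = quantize_dynamic(new_input)
sta_codes, sta_scale = quantize_static(new_input)
print(dyn_codes, sta_codes)
```

For this small-magnitude input the dynamic path uses the full integer range, while the static path uses only a few codes. In exchange, the static path does no extra work at inference time.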
Post-Training Quantization (PTQ)
Simplest — quantize a trained model with no retraining:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# PTQ: apply 4-bit quantization to a pre-trained model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
)
# Weights stored in 4-bit, computation in BF16
```
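As a rough illustration of what a normal-quantile 4-bit format such as `nf4` is doing, the sketch below builds 16 levels from quantiles of a standard normal distribution and snaps weights to the nearest level. This is a simplified construction, not the exact table used by bitsandbytes (which, for instance, includes an exact zero), and all function names here are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_levels(bits=4):
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    # Midpoint probabilities avoid the infinite quantiles at p=0 and p=1
    probs = [(i + 0.5) / n for i in range(n)]
    raw = [NormalDist().inv_cdf(p) for p in probs]
    top = max(abs(v) for v in raw)
    return [v / top for v in raw]


def quantize_to_levels(weights, levels):
    """Normalize by absmax, then store the index of the nearest level."""
    absmax = max(abs(w) for w in weights) or 1.0
    codes = [
        min(range(len(levels)), key=lambda i: abs(w / absmax - levels[i]))
        for w in weights
    ]
    return codes, absmax


def dequantize(codes, absmax, levels):
    """Look up each level and undo the absmax normalization."""
    return [levels[c] * absmax for c in codes]


levels = normal_quantile_levels()
codes, absmax = quantize_to_levels([0.5, -1.0, 0.02, 0.3], levels)
print(codes, dequantize(codes, absmax, levels))
```

The levels are packed more densely near zero than at the tails, so small weights (which are the most common) get finer resolution than a uniform INT4 grid would give them.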
Quantization-Aware Training (QAT)
Better quality — trains with quantization simulation:
```python
import torch
import torch.ao.quantization as tq

model = MyModel()  # placeholder: your nn.Module with QuantStub/DeQuantStub
model.train()      # prepare_qat expects the model in training mode

# Add fake quantization nodes during training
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
model_qat = tq.prepare_qat(model)

# Train as normal; the model learns to be robust to quantization
trainer.train(model_qat)  # placeholder for your own training loop

# Convert to a fully quantized model
model_qat.eval()
model_quantized = tq.convert(model_qat)
```
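The idea behind the fake quantization nodes can be shown with a one-parameter toy in plain Python. The forward pass sees the weight snapped to the integer grid, while the update is applied to a full-precision latent weight as if rounding had a gradient of 1 (the straight-through estimator). The names, learning rate, and target value are assumptions chosen for illustration. Real frameworks do this per tensor with learned or observed scales.

```python
def fake_quantize(w, scale, qmax=127):
    """Quantize then dequantize: the result is a float that sits on the INT grid."""
    q = max(-qmax, min(qmax, round(w / scale)))
    return q * scale


def train_weight(target, scale=0.1, lr=0.05, steps=200):
    """Fit a single weight to `target` while the forward pass sees quantized values."""
    w = 0.0  # latent full-precision weight
    for _ in range(steps):
        w_q = fake_quantize(w, scale)  # forward pass uses the quantized weight
        grad = 2 * (w_q - target)      # gradient of (w_q - target)^2 w.r.t. w_q
        w -= lr * grad                 # straight-through: treat d(w_q)/d(w) as 1
    return w, fake_quantize(w, scale)


latent, deployed = train_weight(target=0.37)
print(latent, deployed)
```

The deployed weight ends up on a grid point next to the target, and the latent weight hovers near the rounding boundary between the two neighbouring grid points.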
GGUF Format (llama.cpp / Ollama)
Most local model tools use GGUF format with named quantization levels:
| GGUF Name | Bits per weight | Quality |
|---|---|---|
| Q2_K | ~2.6 bits | Very low |
| Q3_K_M | ~3.4 bits | Low |
| Q4_0 | 4 bits (~4.5 with per-block scale) | Medium |
| Q4_K_M | ~4.5 bits | Good (recommended) |
| Q5_K_M | ~5.7 bits | Better |
| Q6_K | ~6.6 bits | High |
| Q8_0 | 8 bits (~8.5 with per-block scale) | Near lossless |
| F16 | 16 bits | Full half-precision |
```bash
# Ollama uses GGUF quantization
ollama pull llama3.1:8b                 # Uses Q4_K_M by default
ollama pull llama3.1:8b-instruct-q8_0   # Q8_0 (better quality, larger); check the model's tag list for exact names
```
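The fractional bits-per-weight figures come from block-wise storage: each block of weights shares a scale, and that scale adds overhead on top of the nominal bit width. The sketch below works through the arithmetic for the simple `Q4_0` and `Q8_0` layouts, which store one 16-bit scale per block of 32 weights. The 8-billion parameter count is a round-number assumption, and the K-quant variants use more elaborate layouts that this formula does not cover.

```python
def bits_per_weight(weight_bits, block_size, scale_bits=16):
    """Effective bits per weight when each block stores one shared scale."""
    return weight_bits + scale_bits / block_size


def model_size_gb(n_params, bpw):
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return n_params * bpw / 8 / 1e9


q4_0 = bits_per_weight(4, 32)  # 4 + 16/32
q8_0 = bits_per_weight(8, 32)  # 8 + 16/32
print(q4_0, q8_0)
print(model_size_gb(8e9, q4_0), model_size_gb(8e9, q8_0))
```

Actual GGUF files differ somewhat from these estimates because some tensors are kept at higher precision and the file also carries metadata.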
GPTQ vs AWQ vs GGUF
| Format | Creator | Approach | Best For |
|---|---|---|---|
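| GPTQ | IST Austria | Layer-wise PTQ that uses second-order (Hessian) information to minimize output error | GPU inference |
| AWQ | MIT | Activation-aware scaling that protects the most salient weight channels | GPU inference (faster) |
| GGUF | llama.cpp | Single-file format with block-wise quants, CPU+GPU hybrid execution | Local inference (Ollama) |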
| bitsandbytes | Tim Dettmers | 4/8-bit with NF4 | Fine-tuning (QLoRA) |
Practical Recommendation
```text
For production API inference → BF16 or FP16 on GPU
For local 7-8B models on consumer GPU → Q4_K_M or INT4
For local 7-8B on CPU → Q4_0 or Q5_K_M
For fine-tuning on limited VRAM → QLoRA (NF4)
For edge/mobile → INT8 with NPU
```