Concept #127 · Hard · extended-ai-concepts

What are all the types of quantization in AI?

#gen-ai #quantization

Answer

Types of Quantization in AI

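Quantization stores a model's weights (and sometimes its activations) at lower numeric precision to cut memory use and speed up inference. The methods differ in when quantization is applied, which number format is used, and how much accuracy is traded for performance.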
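
As a rough illustration of the basic idea, here is a minimal sketch of symmetric "absmax" INT8 quantization in plain Python. The function names and sample weights are made up for this example; real libraries work per channel or per block on tensors, but the arithmetic is the same in spirit.

```python
def quantize_int8(values):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # one float scale for the whole group
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [qi * scale for qi in q]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # illustrative values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is at most half a quantization step (scale / 2)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The largest-magnitude weight maps to 127, and every other weight lands on the nearest multiple of `scale`, which is where the quality loss comes from.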

By Format (Data Type)

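| Format | Bits | Description |
| --- | --- | --- |
| FP32 | 32 | Full-precision float; the standard training baseline |
| TF32 | 19 (stored in 32) | TensorFloat-32: FP32 exponent range with a 10-bit mantissa; used for matrix math on NVIDIA A100 and later GPUs |
| BF16 | 16 | bfloat16 ("brain float"): same exponent range as FP32 with less precision; well suited to training |
| FP16 | 16 | Half precision; inference and mixed-precision training |
| FP8 | 8 | 8-bit float in two variants (E4M3, E5M2); supported for training on H100 |
| INT8 | 8 | 8-bit integer; inference |
| INT4 | 4 | 4-bit integer; aggressive compression |
| NF4 | 4 | NormalFloat 4: quantization levels spaced to match normally distributed weights (used by QLoRA) |
| INT2 | 2 | 2-bit integer; experimental, with severe quality loss |
| Binary | 1 | 1-bit weights (BitNet); research stage |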
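
To make the bit widths concrete, the following sketch estimates weight memory for a model at several precisions. It is a back-of-the-envelope calculation only: `weight_memory_gb` is a hypothetical helper, the 8-billion parameter count is an illustrative assumption, and activations, KV cache, and per-block scale metadata are ignored.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 8e9  # assumed model size for illustration

for name, bits in [("FP32", 32), ("BF16/FP16", 16), ("INT8", 8), ("INT4/NF4", 4)]:
    print(f"{name:>10}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Halving the bit width halves the weight footprint, which is why a 4-bit model can fit on hardware where the 16-bit version cannot.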

By When Applied

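| Type | When | How |
| --- | --- | --- |
| Post-Training Quantization (PTQ) | After training | Quantize a pre-trained model; no retraining |
| Quantization-Aware Training (QAT) | During training | Simulate quantization error in the forward pass so the model learns to tolerate it |
| Dynamic Quantization | At inference time | Weights are quantized ahead of time; activation scales are computed on the fly for each forward pass |
| Static Quantization | Before inference | Activation scales are pre-computed from calibration data |

Dynamic and static quantization are both PTQ variants; they differ in how the activation scales are obtained.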
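
The difference between static and dynamic activation scaling can be sketched in a few lines. This is a simplified illustration with made-up numbers and hypothetical helper names, not how any particular framework implements it: the static path fixes one scale from calibration data, while the dynamic path recomputes the scale for each input.

```python
def absmax_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` to qmax."""
    m = max(abs(v) for v in values)
    return m / qmax if m > 0 else 1.0


def quantize(values, scale, qmax=127):
    """Round to integers and clip to the representable range."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


# Static: compute the scale once, offline, from representative calibration data
calibration_batches = [[0.1, -0.8, 0.4], [0.9, -0.2, 0.3], [-1.0, 0.5, 0.05]]
static_scale = absmax_scale([v for batch in calibration_batches for v in batch])

# A new input containing an outlier that calibration never saw
new_activations = [2.0, -0.5, 0.25]

static_q = quantize(new_activations, static_scale)    # outlier is clipped
dynamic_scale = absmax_scale(new_activations)         # Dynamic: scale from this input
dynamic_q = quantize(new_activations, dynamic_scale)  # outlier fits, coarser steps

print("static :", static_q, "-> outlier restored as", static_q[0] * static_scale)
print("dynamic:", dynamic_q, "-> outlier restored as", dynamic_q[0] * dynamic_scale)
```

Static scaling avoids per-request overhead but depends on calibration data that covers the real input range; dynamic scaling adapts to each input at the cost of extra work in every forward pass.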

Post-Training Quantization (PTQ)

The simplest approach: quantize an already-trained model, with no retraining.

python
import torch  # needed for torch.bfloat16 below
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# PTQ: apply 4-bit quantization to pre-trained model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config
)
# Weights stored in 4-bit, computation in BF16
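
To show why a "normal float" format suits neural-network weights, here is a simplified sketch of the idea behind NF4: place the 16 levels at quantiles of a standard normal distribution, then quantize each block by scaling with its absolute maximum and picking the nearest level. This is an assumption-laden illustration, not the exact bitsandbytes code table (the real NF4 table is constructed slightly differently and includes an exact zero), and the helper names are hypothetical.

```python
from statistics import NormalDist


def normal_float_levels(bits=4):
    """Levels at evenly spaced normal quantiles, normalized to [-1, 1]."""
    n = 2 ** bits
    probs = [(i + 0.5) / n for i in range(n)]  # avoid 0 and 1 (infinite quantiles)
    quantiles = [NormalDist().inv_cdf(p) for p in probs]
    largest = max(abs(v) for v in quantiles)
    return [v / largest for v in quantiles]


LEVELS = normal_float_levels()


def quantize_block(block):
    """Return one 4-bit code per weight plus the block's absmax scale."""
    absmax = max(abs(w) for w in block) or 1.0
    codes = [
        min(range(len(LEVELS)), key=lambda i: abs(LEVELS[i] - w / absmax))
        for w in block
    ]
    return codes, absmax


def dequantize_block(codes, absmax):
    return [LEVELS[c] * absmax for c in codes]


block = [0.02, -0.15, 0.4, -0.01, 0.07, 0.0, -0.3, 0.11]  # illustrative weights
codes, absmax = quantize_block(block)
restored = dequantize_block(codes, absmax)
print(codes)
print([round(r, 3) for r in restored])
```

Because the levels are denser near zero, where most weights of a trained network sit, the same 4 bits lose less information than evenly spaced INT4 levels would.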

Quantization-Aware Training (QAT)

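QAT usually gives better quality than PTQ at the same bit width, at the cost of a training run: the model trains with quantization simulated in the forward pass.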

python
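import torch
import torch.ao.quantization as tq  # current namespace; torch.quantization is the older alias

class MyModel(torch.nn.Module):  # placeholder model; stubs mark where tensors enter/leave INT8
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.fc = torch.nn.Linear(16, 4)
        self.dequant = tq.DeQuantStub()
    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = MyModel().train()  # prepare_qat expects training mode
# Add fake-quantization nodes that simulate INT8 rounding during training
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
model_qat = tq.prepare_qat(model)

# Train as normal (replace this toy loop with your own) so the model learns to be robust to quantization
optimizer = torch.optim.SGD(model_qat.parameters(), lr=0.01)
for _ in range(10):
    optimizer.zero_grad()
    model_qat(torch.randn(8, 16)).pow(2).mean().backward()
    optimizer.step()

# Switch to eval mode, then convert to a fully quantized INT8 model
model_quantized = tq.convert(model_qat.eval())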
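
What the fake-quantization nodes do can be sketched without any framework. In the toy example below, the forward pass uses a weight rounded to a coarse grid, while the gradient is applied to a full-precision "latent" weight as if the rounding were the identity function (the straight-through estimator, a common choice for QAT). The grid step, learning rate, and target value are illustrative assumptions.

```python
def fake_quant(w, step=0.1):
    """Quantize then dequantize: snap to the grid but keep a float."""
    return round(w / step) * step


# Toy task: learn w so that w * x matches 0.73 * x; 0.73 is not on the 0.1 grid
data = [(x, 0.73 * x) for x in (0.5, 1.0, 1.5, 2.0)]

w = 0.0    # latent full-precision weight, updated by the optimizer
lr = 0.05  # learning rate

for epoch in range(200):
    for x, y in data:
        w_q = fake_quant(w)   # forward pass sees the quantized weight
        error = w_q * x - y
        grad = 2 * error * x  # straight-through: treat d(w_q)/d(w) as 1
        w -= lr * grad        # update the latent weight

final = fake_quant(w)
print("latent weight:", w, "quantized weight:", final)
```

The latent weight settles near the boundary between the two grid points that bracket the target, so the quantized weight the model would ship ends up within one grid step of the ideal value.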

GGUF Format (llama.cpp / Ollama)

Most local model tools use GGUF format with named quantization levels:

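| GGUF Name | Bits per weight (approx.) | Quality |
| --- | --- | --- |
| Q2_K | ~2.6 bits | Very low |
| Q3_K_M | ~3.9 bits | Low |
| Q4_0 | ~4.5 bits | Medium |
| Q4_K_M | ~4.8 bits | Good (recommended) |
| Q5_K_M | ~5.7 bits | Better |
| Q6_K | ~6.6 bits | High |
| Q8_0 | ~8.5 bits | Near lossless |
| F16 | 16 bits | Full half-precision |

Effective bits per weight run above the nominal width because each block of weights also stores scale metadata; exact figures vary by model.
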
bash
# Ollama uses GGUF quantization
ollama pull llama3.1:8b                     # Uses Q4_K_M by default
ollama pull llama3.1:8b-instruct-q8_0       # Q8_0 (better quality, larger); tag names vary per model, check its tag list
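
The effective bits per weight of a simple block format can be worked out from its layout. The sketch below assumes the layout used by GGUF's Q4_0 and Q8_0 types (blocks of 32 weights sharing one 16-bit scale); the function name is hypothetical, and the K-quant types use a more elaborate super-block layout that this formula does not cover.

```python
def block_bits_per_weight(weight_bits, block_size=32, scale_bits=16):
    """Effective bits per weight when each block stores one shared scale."""
    return (block_size * weight_bits + scale_bits) / block_size


q4_0 = block_bits_per_weight(4)  # → 4.5
q8_0 = block_bits_per_weight(8)  # → 8.5

print(f"Q4_0: {q4_0} bits/weight, Q8_0: {q8_0} bits/weight")
```

The shared scale adds half a bit per weight at this block size, which is why "4-bit" files are somewhat larger than a naive 4-bits-per-parameter estimate suggests.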

GPTQ vs AWQ vs GGUF

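| Format | Creator | Approach | Best For |
| --- | --- | --- | --- |
| GPTQ | IST Austria | Layer-wise weight quantization that minimizes each layer's output error using calibration data | GPU inference |
| AWQ | MIT | Activation-aware: protects the weights that matter most according to activation statistics | GPU inference (often faster than GPTQ) |
| GGUF | llama.cpp | File format with block-wise quantization; runs on CPU with optional GPU offload | Local inference (Ollama) |
| bitsandbytes | Tim Dettmers | On-the-fly 8-bit and 4-bit (NF4) quantization at load time | Fine-tuning (QLoRA) |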

Practical Recommendation

text
For production API inference → BF16 or FP16 on GPU
For local 7-8B models on a consumer GPU → Q4_K_M (GGUF) or INT4 (GPTQ/AWQ)
For local 7-8B models on CPU → Q4_0 or Q5_K_M
For fine-tuning on limited VRAM → QLoRA (NF4)
For edge/mobile → INT8 on an NPU
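
For local inference, the choice often comes down to "the highest-quality level that fits in memory". The sketch below encodes that rule of thumb. Everything in it is an assumption for illustration: the helper name, the approximate bits-per-weight figures, and the flat 1.5 GB allowance for KV cache and runtime buffers, which in practice depends on context length and runtime.

```python
# (name, approximate bits per weight), ordered from highest to lowest quality
GGUF_LEVELS = [
    ("Q8_0", 8.5),
    ("Q6_K", 6.6),
    ("Q5_K_M", 5.7),
    ("Q4_K_M", 4.8),
    ("Q3_K_M", 3.9),
    ("Q2_K", 2.6),
]


def pick_gguf_level(num_params, memory_gb, overhead_gb=1.5):
    """Return the highest-quality level whose weights fit, or None if none do."""
    budget_bytes = (memory_gb - overhead_gb) * 1e9
    for name, bits_per_weight in GGUF_LEVELS:
        if num_params * bits_per_weight / 8 <= budget_bytes:
            return name
    return None


for memory_gb in (24, 8, 6, 2):
    print(f"{memory_gb:>2} GB -> {pick_gguf_level(8e9, memory_gb)}")
```

Treat the result as a starting point and validate quality on your own prompts, since the acceptable level depends on the task as much as on the hardware.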