What are all the types of quantization in AI?

Question

Accepted Answer

## Types of Quantization in AI

Quantization methods differ along three axes: the numeric format used, the point in the model's lifecycle at which quantization is applied, and the resulting tradeoff between accuracy and performance.

### By Format (Data Type)

| Format | Bits | Description |
|--------|------|-------------|
| **FP32** | 32 | Full-precision float (standard training baseline) |
| **TF32** | 19 (stored in 32) | TensorFloat-32: FP32 exponent range with a 10-bit mantissa, used for Tensor Core math on NVIDIA A100 and later |
| **BF16** | 16 | Brain Float 16: same exponent range as FP32 with fewer mantissa bits, good for training |
| **FP16** | 16 | Half precision: inference and mixed-precision training |
| **FP8** | 8 | 8-bit float: H100-class training and inference, two variants (E4M3, E5M2) |
| **INT8** | 8 | 8-bit integer: inference |
| **INT4** | 4 | 4-bit integer: aggressive compression |
| **NF4** | 4 | NormalFloat 4: levels spaced for normally distributed weights (used by QLoRA) |
| **INT2** | 2 | 2-bit: experimental, high quality loss |
| **Binary** | 1 | 1-bit weights (BitNet): research stage |
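
To make the integer rows concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. The function names `quantize_int8` and `dequantize` are illustrative and not taken from any library; real toolkits usually use per-channel or per-block scales rather than one scale per tensor.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # one float scale for the whole tensor
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [v * scale for v in q]


weights = [0.02, -0.5, 0.31, 1.27, -0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per weight is bounded by half the scale, which is why a single outlier (which inflates the scale) hurts every other weight in the tensor.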

### By When Applied

| Type | When | How |
|------|------|-----|
| **Post-Training Quantization (PTQ)** | After training | Quantize an already trained model with no retraining (optionally using a small calibration set) |
| **Quantization-Aware Training (QAT)** | During training | Simulate quantization error in the forward pass so the weights adapt to it |
| **Dynamic Quantization** | At inference time | PTQ variant: weights are quantized ahead of time, activation scales are computed on the fly for each forward pass |
| **Static Quantization** | Before inference | PTQ variant: activation scales are pre-computed from calibration data and then fixed |
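
The difference between static and dynamic activation scales can be sketched in a few lines of plain Python. This is an illustration under simple assumptions (absmax scaling, made-up numbers, hypothetical helper names), not how any particular framework implements it.

```python
def absmax_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` onto qmax."""
    return max(abs(v) for v in values) / qmax


def quantize(values, scale, qmax=127):
    """Round to integers and clip to the representable range."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


# Static: one scale is fixed ahead of time from calibration batches.
calibration_batches = [[0.1, -0.8, 0.5], [0.3, 1.0, -0.6]]
static_scale = absmax_scale([v for batch in calibration_batches for v in batch])

# Dynamic: the scale is recomputed from each incoming activation tensor.
activations = [0.2, -0.4, 2.0]  # contains a value outside the calibration range
dynamic_scale = absmax_scale(activations)

static_q = quantize(activations, static_scale)    # 2.0 exceeds the range and is clipped
dynamic_q = quantize(activations, dynamic_scale)  # 2.0 defines the range, so nothing is clipped

static_restored = [v * static_scale for v in static_q]
dynamic_restored = [v * dynamic_scale for v in dynamic_q]
print(static_restored, dynamic_restored)
```

Static scales avoid the per-batch overhead but clip inputs that fall outside what calibration saw; dynamic scales adapt to each input at the cost of extra work at inference time.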

### Post-Training Quantization (PTQ)

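The simplest approach: quantize a trained model with no retraining. The example below loads a checkpoint with its weights quantized to NF4 through bitsandbytes: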

```python
import torch  # needed for torch.bfloat16 below
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# PTQ: apply 4-bit quantization to pre-trained model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config
)
# Weights stored in 4-bit, computation in BF16
```
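
As a rough illustration of the idea behind a "normal float" 4-bit type, the sketch below builds 16 levels from quantiles of a standard normal distribution and quantizes a block of weights to the nearest level. This is a simplification with hypothetical helper names: the actual NF4 table used by bitsandbytes is constructed differently (it is asymmetric and contains an exact zero).

```python
from statistics import NormalDist


def normal_quantile_levels(bits=4):
    """Build 2**bits levels from evenly spaced quantiles of N(0, 1), scaled to [-1, 1].

    Simplification: not the real NF4 lookup table.
    """
    n = 2 ** bits
    quantiles = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(v) for v in quantiles)
    return [v / largest for v in quantiles]


def quantize_block(block, levels):
    """Absmax-normalize a block, then store the index of the nearest level."""
    absmax = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
    indices = [
        min(range(len(levels)), key=lambda i: abs(w / absmax - levels[i]))
        for w in block
    ]
    return indices, absmax


def dequantize_block(indices, absmax, levels):
    """Look up each level and rescale by the block's absmax."""
    return [levels[i] * absmax for i in indices]


levels = normal_quantile_levels()
block = [0.01, -0.03, 0.12, -0.40, 0.05, 0.00, -0.07, 0.22]
indices, absmax = quantize_block(block, levels)
restored = dequantize_block(indices, absmax, levels)
print(indices, restored)
```

Because the levels are denser near zero, small weights (the majority in a roughly normal distribution) get finer resolution than they would with evenly spaced INT4 levels.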

### Quantization-Aware Training (QAT)

QAT usually preserves more accuracy than PTQ at the same bit width, because the model trains with simulated quantization in the loop:

```python
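import torch
import torch.ao.quantization as tq

model = MyModel()  # placeholder: your own nn.Module
model.train()      # prepare_qat expects the model in training mode

# Add fake quantization nodes so training sees quantization error
model.qconfig = tq.get_default_qat_qconfig('fbgemm')  # x86 CPU backend
model_qat = tq.prepare_qat(model)

# Train as normal so the model learns to be robust to quantization
trainer.train(model_qat)  # placeholder for your own training loop

# Switch to eval mode, then convert to a fully quantized model
model_qat.eval()
model_quantized = tq.convert(model_qat)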
```
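
What the fake quantization nodes do can be sketched without any framework. The toy example below, with a hypothetical `fake_quant` helper and a single trainable weight, quantizes and dequantizes in the forward pass and uses a straight-through estimator for the gradient. It is meant only to show the mechanism, not to mirror PyTorch's implementation.

```python
def fake_quant(w, scale=0.25, qmin=-8, qmax=7):
    """Quantize then immediately dequantize, so the forward pass sees rounding error."""
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


# Toy problem: fit y = w * x to a single (x, target) pair.
x, target = 1.0, 0.75
w = 0.0   # latent full-precision weight that the optimizer updates
lr = 0.1

for step in range(50):
    w_q = fake_quant(w)          # forward pass uses the quantized weight
    error = w_q * x - target
    # Straight-through estimator: treat d(w_q)/d(w) as 1, since round()
    # has zero gradient almost everywhere.
    grad = 2 * error * x
    w -= lr * grad

print(w, fake_quant(w))
```

The latent weight keeps moving in full precision until its quantized value reproduces the target, after which the gradient vanishes. Only the quantized value would be kept after conversion.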

### GGUF Format (llama.cpp / Ollama)

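Many local inference tools (llama.cpp, Ollama) use the GGUF file format, which labels quantization levels with names like these (bits per weight are approximate whole-model averages):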

| GGUF Name | Bits per weight | Quality |
|-----------|----------------|---------|
| Q2_K | ~2.6 bits (base type; whole-model average is higher) | Very low |
| Q3_K_M | ~3.9 bits | Low |
| Q4_0 | ~4.5 bits | Medium |
| Q4_K_M | ~4.8 bits | **Good (recommended)** |
| Q5_K_M | ~5.7 bits | Better |
| Q6_K | ~6.6 bits | High |
| Q8_0 | ~8.5 bits | Near lossless |
| F16 | 16 bits | Unquantized half precision |
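
Block-wise types carry a shared scale per block, so their real cost per weight is slightly above the nominal bit width. The sketch below works through that arithmetic for 32-weight blocks with one FP16 scale (the layout used by Q4_0 and Q8_0) and shows a simplified 4-bit block quantizer. The helper names are illustrative, and the rounding rule is not the exact one llama.cpp uses.

```python
def bits_per_weight(block_size, weight_bits, scale_bits=16):
    """Storage cost per weight when each block carries one shared scale."""
    return (block_size * weight_bits + scale_bits) / block_size


def quantize_block_4bit(block):
    """Simplified symmetric 4-bit block quantization (not the exact Q4_0 rule)."""
    absmax = max(abs(w) for w in block)
    scale = absmax / 7 if absmax > 0 else 1.0  # map the block onto integers in [-7, 7]
    q = [max(-8, min(7, round(w / scale))) for w in block]
    return q, scale


q4_0_bpw = bits_per_weight(block_size=32, weight_bits=4)  # 4-bit values + FP16 scale
q8_0_bpw = bits_per_weight(block_size=32, weight_bits=8)  # 8-bit values + FP16 scale

block = [0.7, -0.3, 0.12, 0.0]  # shortened block for readability
q, scale = quantize_block_4bit(block)
print(q4_0_bpw, q8_0_bpw, q, scale)
```

The K-quant types (names ending in `_K`) use larger super-blocks with more elaborate scale layouts, and the `_S`/`_M`/`_L` suffixes mix types across tensors, which is why their averages are not round numbers.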

```bash
# Ollama uses GGUF quantization
ollama pull llama3.1:8b                  # default tag, a Q4_K_M build
ollama pull llama3.1:8b-instruct-q8_0    # Q8_0 (better quality, larger); tag names vary per model
```

### GPTQ vs AWQ vs GGUF

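| Name | Creator | Approach | Best For |
|------|---------|----------|----------|
| **GPTQ** | IST Austria | Layer-wise weight quantization with error compensation, using calibration data | GPU inference |
| **AWQ** | MIT | Activation-aware: protects the weight channels that matter most for activations | GPU inference (often faster kernels) |
| **GGUF** | llama.cpp | File format with block-wise quantization types, runs on CPU with optional GPU offload | Local inference (Ollama) |
| **bitsandbytes** | Tim Dettmers | Library for on-the-fly 8-bit and 4-bit (NF4) quantization at load time | Fine-tuning (QLoRA) |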

### Practical Recommendation

```
For production API inference → BF16 or FP16 on GPU
For local 7-8B models on consumer GPU → Q4_K_M (GGUF) or 4-bit GPTQ/AWQ
For local 7-8B on CPU → Q4_0 or Q5_K_M
For fine-tuning on limited VRAM → QLoRA (NF4)
For edge/mobile → INT8 on an NPU
```
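
A quick way to sanity-check these choices against available memory is to estimate the size of the weights alone. The sketch below assumes a nominal 8 billion parameters and ignores the KV cache, activations, and per-block scale overhead, so treat the results as lower bounds.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate size of the weights alone (excludes KV cache and activations)."""
    return num_params * bits_per_weight / 8 / 2**30


params_8b = 8e9  # nominal parameter count for an "8B" model (assumption)
for name, bpw in [("BF16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{name}: {weight_memory_gib(params_8b, bpw):.1f} GiB")
```

Halving the bit width halves the weight footprint, which is what moves a 7-8B model from datacenter-class GPUs into the range of consumer cards.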
