What is quantization and what are all the different ways to do it?

Question

Accepted Answer

## What is Quantization in AI?

**Quantization** reduces the numerical precision of model weights (and often activations) from FP32 to 16-bit floats (FP16/BF16), or further to 8-bit and 4-bit formats (FP8, INT8, INT4). This shrinks the model, lowers memory use, and can speed up inference, usually at a small cost in accuracy.
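
To make the definition concrete, here is a minimal sketch of symmetric (absmax) INT8 quantization of a few weights in plain Python. The names `quantize_int8` and `dequantize` are illustrative and not taken from any library, and real toolkits usually apply a scale per channel or per group rather than one scale for the whole tensor.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    # The largest magnitude maps to 127; fall back to 1.0 if all weights are zero
    scale = (max(abs(w) for w in weights) / 127) or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.03, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)         # [42, -127, 3, 90]
print(restored)  # close to the originals, within half a quantization step
```

Only the integers and the single scale need to be stored, which is where the memory saving comes from.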

---

## Why Quantization Matters

```
FP32 model (LLaMA 3.2 3B):   ~12 GB VRAM
BF16 model:                    ~6 GB VRAM
INT8 model:                    ~3 GB VRAM
INT4 model (QLoRA/GGUF):       ~1.5 GB VRAM  ← runs on a laptop!
```
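
These figures follow from simple arithmetic: parameter count times bytes per parameter. A rough sketch, assuming a round 3 billion parameters and counting weights only (activations, KV cache, and quantization metadata such as scales are ignored); `weight_memory_gb` is an illustrative helper name.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 3e9  # assumed round figure for a "3B" model
for name, bits in [("FP32", 32), ("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_memory_gb(num_params, bits):.1f} GB")
```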

---

## Precision Formats Overview

| Format | Bits | Range | Use Case |
|--------|------|-------|----------|
| **FP32** | 32 | ±3.4×10³⁸ | Training (full precision) |
| **BF16** | 16 | ±3.4×10³⁸ | Training / inference (same range as FP32) |
| **FP16** | 16 | ±65,504 | Inference (narrower range, can overflow) |
| **FP8** | 8 | ±448 (E4M3) / ±57,344 (E5M2) | H100 training & inference |
| **INT8** | 8 | -128 to 127 | Inference (good accuracy/speed balance) |
| **INT4** | 4 | -8 to 7 | Aggressive compression, slight quality loss |

---

## Types of Quantization

### 1. Post-Training Quantization (PTQ)

Quantize an already-trained model, with no retraining needed. This is the fastest approach to apply. The simplest case is loading the weights in half precision:

```python
# Simple FP16 inference (half precision)
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    torch_dtype=torch.float16,   # FP16
    device_map="auto"
)
```

---

### 2. BitsAndBytes Quantization (INT8 / INT4)

The easiest way to quantize HuggingFace models on consumer GPUs:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# INT8 quantization
bnb_int8 = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit quantization with NF4 (NormalFloat4)
bnb_int4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # 4-bit type designed for normally distributed weights
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16  # dtype used for matmuls after dequantization
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=bnb_int4,
    device_map="auto"
)
```

---

### 3. GPTQ (Post-Training Quantization with Calibration)

Quantizes weights layer by layer, using a small calibration dataset to minimize each layer's output error. This usually gives better accuracy than naive round-to-nearest INT4:

```python
from transformers import AutoModelForCausalLM, GPTQConfig, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

gptq_config = GPTQConfig(
    bits=4,
    dataset="wikitext2",     # calibration dataset
    tokenizer=tokenizer
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=gptq_config,
    device_map="auto"
)
model.save_pretrained("./llama3-gptq-4bit")
tokenizer.save_pretrained("./llama3-gptq-4bit")  # keep the tokenizer next to the weights
```

---

### 4. AWQ (Activation-Aware Weight Quantization)

Uses activation statistics to find salient weights (those that most strongly affect the output) and protects them during quantization. Often reported to match or beat GPTQ at the same bit width:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.2-3B"
quant_path = "./llama3-awq-4bit"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

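# 4-bit weights, groups of 128 sharing a scale and zero point, GEMM kernel layout
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)  # keep the tokenizer next to the weights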
```
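
The `q_group_size` and `zero_point` options refer to group-wise asymmetric quantization, the storage scheme that AWQ and similar methods build on. The sketch below shows only that scheme in plain Python, not AWQ's activation-aware scaling; the helper names and the tiny group size of 4 are assumptions made for readability.

```python
def quantize_group(values, bits=4):
    """Asymmetric quantization of one group to unsigned ints in [0, 2**bits - 1]."""
    qmax = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0      # avoid a zero scale for constant groups
    zero_point = round(-lo / scale)      # integer code that represents 0.0
    q = [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize_group(q, scale, zero_point):
    """Map the integer codes back to approximate floats."""
    return [(v - zero_point) * scale for v in q]

def quantize_groupwise(weights, group_size=4, bits=4):
    """Quantize consecutive groups, each with its own scale and zero point."""
    return [quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]

weights = [0.1, -0.2, 0.05, 0.3, 2.0, -1.5, 0.7, 1.1]
groups = quantize_groupwise(weights, group_size=4)
for q, scale, zero_point in groups:
    print(q, round(scale, 4), zero_point)
```

Because the second group spans a much wider range than the first, it gets a coarser scale; giving each group its own scale keeps small weights from being crushed by large ones elsewhere in the tensor.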

---

### 5. GGUF / llama.cpp (CPU-Friendly Quantization)

GGUF is the file format used by `llama.cpp`. It enables running quantized LLMs on CPU or Apple Silicon:

```bash
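# Build llama.cpp (recent versions use CMake rather than make)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release
pip install -r requirements.txt   # dependencies for the conversion script

# The converter expects a local model directory, so download the weights first
huggingface-cli download meta-llama/Llama-3.2-3B --local-dir ./Llama-3.2-3B
python3 convert_hf_to_gguf.py ./Llama-3.2-3B --outfile llama3.gguf

# Quantize to Q4_K_M (a popular quality/size tradeoff for 4-bit)
./build/bin/llama-quantize llama3.gguf llama3-Q4_K_M.gguf Q4_K_M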
```

```python
# Run with Python via llama-cpp-python
from llama_cpp import Llama

llm = Llama(model_path="llama3-Q4_K_M.gguf", n_ctx=4096)
output = llm("Explain quantization in simple terms:", max_tokens=200)
print(output["choices"][0]["text"])
```

---

### 6. FP8 Quantization (H100 / Modern Hardware)

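FP8 is a newer 8-bit floating-point format with native support on NVIDIA H100-class GPUs. Simply passing `torch_dtype=torch.float8_e4m3fn` is unreliable because most operators have no kernels for raw FP8 tensors, so use a dedicated quantization config instead:

```python
from transformers import AutoModelForCausalLM, FbgemmFp8Config

# FP8 weights via FBGEMM kernels; requires an H100-class GPU,
# the fbgemm-gpu package, and a recent transformers release
fp8_config = FbgemmFp8Config()

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    torch_dtype="auto",
    quantization_config=fp8_config,
    device_map="auto"
)
```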

---

### 7. Quantization-Aware Training (QAT)

Simulate quantization during training so the model adapts to reduced precision:

```python
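import torch.nn as nn
import torch.ao.quantization as tq  # current home of the torch.quantization API

# Toy model for illustration; eager-mode QAT needs QuantStub/DeQuantStub
# to mark where tensors enter and leave the quantized region
model = nn.Sequential(tq.QuantStub(), nn.Linear(16, 16), nn.ReLU(), tq.DeQuantStub())

model.train()
model.qconfig = tq.get_default_qat_qconfig("fbgemm")  # x86 CPU backend
tq.prepare_qat(model, inplace=True)  # insert fake-quantization modules

# ... normal training loop ...

model.eval()
tq.convert(model, inplace=True)  # convert to a real INT8 model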
```
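
Under the hood, QAT typically inserts "fake quantization" into the forward pass and uses a straight-through estimator (STE) in the backward pass, treating the rounding step as if its gradient were 1. The toy below illustrates that idea on a single scalar weight in plain Python. The step size, learning rate, and target are arbitrary assumptions chosen so the result is easy to follow, and this is not how `prepare_qat` is implemented.

```python
def fake_quant(w, step=0.25):
    """Snap w to the nearest multiple of `step` (simulated low precision)."""
    return round(w / step) * step

def train_with_ste(target=0.75, lr=0.1, steps=50):
    """Fit y = w * x on one sample while the forward pass only sees fake_quant(w)."""
    w, x = 0.0, 1.0
    y = target * x
    for _ in range(steps):
        w_q = fake_quant(w)      # forward: use the quantized weight
        error = w_q * x - y
        grad = 2 * error * x     # backward (STE): treat d(w_q)/dw as 1
        w -= lr * grad           # update the full-precision "shadow" weight
    return w, fake_quant(w)

w, w_q = train_with_ste()
print(w_q)  # 0.75
```

The full-precision weight stops moving once its quantized version reproduces the target, which is the sense in which the model "adapts" to reduced precision.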

---

## Comparison of Quantization Methods

| Method | Bits | Quality Loss | Speed Gain (approx.) | VRAM Saved (vs FP32) | Use Case |
|--------|------|-------------|------------|------------|----------|
| **FP16** | 16 | Negligible | 1.5× | 50% | Standard inference |
| **BF16** | 16 | Negligible | 1.5× | 50% | Training + inference |
| **INT8 (bitsandbytes)** | 8 | Very low | 2× | 75% | General inference |
| **INT4 NF4** | 4 | Low | 3–4× | 87% | Consumer GPU inference |
| **GPTQ 4-bit** | 4 | Low | 3–4× | 87% | Production INT4 |
| **AWQ 4-bit** | 4 | Very low | 3–4× | 87% | Best INT4 accuracy |
| **GGUF Q4_K_M** | 4 | Low | 3× | 87% | CPU / Apple Silicon |
| **FP8** | 8 | Very low | 2× | 75% | H100 training & inference |
| **QAT** | 4–8 | Minimal | 3–4× | 75–87% | Best accuracy for size |

Speed gains are rough and depend heavily on hardware, kernels, and batch size. bitsandbytes in particular mainly saves memory and can run slower than FP16.

---

## How to Choose

```
Running on CPU / Apple Silicon?
    → GGUF (Q4_K_M or Q5_K_M)

Running on consumer GPU (RTX 3090/4090)?
    → bitsandbytes INT4 (NF4) or GPTQ

Running on datacenter GPU (A100)?
    → AWQ or GPTQ (best accuracy)

Running on H100?
    → FP8

Need to fine-tune a quantized model?
    → QLoRA (INT4 NF4 via bitsandbytes)
```
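
The decision tree above can also be written as a small lookup, which is handy if the choice needs to live in a deployment script. This is only a sketch: the function name `suggest_method` and the hardware labels are invented for illustration, and the fine-tuning case is checked first as one possible ordering.

```python
def suggest_method(hardware, fine_tune=False):
    """Return the quantization method suggested by the guide for a given setup.

    `hardware` is one of the hypothetical labels: "cpu", "apple_silicon",
    "consumer_gpu", "datacenter_gpu", "h100".
    """
    if fine_tune:
        return "QLoRA (INT4 NF4 via bitsandbytes)"
    suggestions = {
        "cpu": "GGUF (Q4_K_M or Q5_K_M)",
        "apple_silicon": "GGUF (Q4_K_M or Q5_K_M)",
        "consumer_gpu": "bitsandbytes INT4 (NF4) or GPTQ",
        "datacenter_gpu": "AWQ or GPTQ",
        "h100": "FP8",
    }
    if hardware not in suggestions:
        raise ValueError(f"unknown hardware label: {hardware!r}")
    return suggestions[hardware]

print(suggest_method("apple_silicon"))
print(suggest_method("consumer_gpu", fine_tune=True))
```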

> **Rule of thumb:** Start with `Q4_K_M` (GGUF) for CPU or `NF4` (bitsandbytes) for GPU. Move to AWQ if you need better accuracy at the same bit width.
