What is Quantization in AI/LLM?
Quantization is the process of reducing the precision of a model's numerical values (weights and, optionally, activations) from higher-bit formats such as float32 to lower-bit formats such as int8 or int4. This substantially reduces model size and memory requirements, usually with only a small loss in quality.
Why Quantization Matters
A 70B-parameter model at full precision (FP32) needs about 280 GB of GPU memory for the weights alone (70 billion parameters × 4 bytes), which is beyond most hardware. Quantization makes it practical:
| Precision | Bytes/param | 70B Model Size | VRAM Needed |
|---|---|---|---|
| FP32 | 4 bytes | 280 GB | 4-8 × A100 80GB |
| FP16/BF16 | 2 bytes | 140 GB | 2-4 × A100 80GB |
| INT8 | 1 byte | 70 GB | 1 × A100 80GB |
| INT4 | 0.5 bytes | 35 GB | 2 × RTX 4090 |
| INT2 | 0.25 bytes | 17.5 GB | Experimental |
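The sizes in the table follow from multiplying the parameter count by the bytes per parameter. A minimal sketch of that arithmetic, assuming decimal gigabytes (1 GB = 10^9 bytes) and counting weights only (no KV cache, activations, or framework overhead). The function name `weight_memory_gb` is made up for illustration:

```python
# Bytes per parameter for each precision, taken from the table above.
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16/BF16": 2.0,
    "INT8": 1.0,
    "INT4": 0.5,
    "INT2": 0.25,
}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Estimate weight storage in decimal GB (weights only)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    # Reproduce the "70B Model Size" column.
    for precision in BYTES_PER_PARAM:
        print(f"{precision:>9}: {weight_memory_gb(70e9, precision):6.1f} GB")
```

Real deployments need headroom on top of these figures, so treat them as a lower bound rather than a sizing guide.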
How Quantization Works
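Original weight: 0.45892 (FP32: 32 bits, high precision)
↓
INT8 quantization (symmetric, assuming the largest absolute weight in the tensor is max_value = 1.0):
scale = max_value / 127 ≈ 0.007874
quantized = round(0.45892 / scale) = round(58.28) = 58 (INT8: 8 bits)
To use: dequantized = 58 × scale ≈ 0.457 (error ≈ 0.002, usually acceptable)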
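The same steps can be sketched in plain Python. This is a minimal illustration of symmetric, per-tensor INT8 quantization, not how any particular library implements it. The function names and the sample weights are assumptions chosen so that the largest absolute value is 1.0:

```python
def quantize_int8(weights, max_value=None):
    """Symmetric INT8 quantization: map [-max_value, max_value] onto [-127, 127]."""
    if max_value is None:
        max_value = max(abs(w) for w in weights)
    scale = max_value / 127 if max_value > 0 else 1.0
    # Round to the nearest integer step and clamp to the range used here.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.45892, -1.0, 0.1234, 0.0]  # illustrative values; max |w| = 1.0
    codes, scale = quantize_int8(weights)
    restored = dequantize_int8(codes, scale)
    for w, q, r in zip(weights, codes, restored):
        print(f"{w:+.5f} -> {q:+4d} -> {r:+.5f} (error {abs(w - r):.5f})")
```

Only the integer codes and one scale per tensor need to be stored, which is where the memory saving comes from.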
Types at a Glance
| Type | Precision | Size reduction | Quality loss |
|---|---|---|---|
| FP32 → FP16 | 16-bit float | 2x smaller | Minimal |
| FP32 → BF16 | 16-bit brain float | 2x smaller | Minimal |
| FP32 → INT8 | 8-bit integer | 4x smaller | Very small |
| FP32 → INT4 | 4-bit integer | 8x smaller | Small-medium |
| FP32 → INT2 | 2-bit integer | 16x smaller | Significant |
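To get a feel for why quality loss grows as the bit width shrinks, the sketch below measures the round-trip error of symmetric integer quantization at 8, 4, and 2 bits. It is an assumption-laden toy: the weights are synthetic Gaussian samples, a single scale is used for the whole tensor, and `roundtrip_error` is a made-up helper. Practical schemes usually use per-group scales and lose less than this suggests.

```python
import random


def roundtrip_error(weights, bits):
    """Mean absolute error after symmetric quantization to `bits` bits."""
    levels = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4, 1 for INT2
    max_value = max(abs(w) for w in weights)
    if max_value == 0:
        return 0.0  # all-zero tensor quantizes exactly
    scale = max_value / levels
    total = 0.0
    for w in weights:
        q = max(-levels, min(levels, round(w / scale)))
        total += abs(w - q * scale)
    return total / len(weights)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    weights = [rng.gauss(0.0, 0.02) for _ in range(10_000)]  # synthetic weights
    for bits in (8, 4, 2):
        print(f"INT{bits}: mean abs error = {roundtrip_error(weights, bits):.6f}")
```

The floating-point rows of the table (FP16, BF16) are not covered by this sketch, since they change the number format rather than snapping values to an integer grid.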
Using Quantization in Practice
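from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization (QLoRA-style); requires the bitsandbytes and accelerate packages
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for the matrix multiplications
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # NormalFloat4, designed for normally distributed weights
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # gated repo: accept the license on Hugging Face first
    quantization_config=bnb_config,
    device_map="auto",
)
# Weights take roughly 5-6 GB of VRAM instead of ~16 GB in BF16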
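The `nf4` quant type is motivated by the observation that trained weights are roughly normally distributed, so it pays to place more quantization levels near zero. The sketch below illustrates that idea only: it builds 16 levels from evenly spaced quantiles of a standard normal and compares them with a uniform grid. It is a simplified stand-in, and the actual NF4 code table in bitsandbytes differs in detail (for example, it represents zero exactly). All function names here are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_levels(num_levels=16):
    """Place levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    normal = NormalDist()
    # Midpoint probabilities avoid the infinite quantiles at 0 and 1.
    probs = [(i + 0.5) / num_levels for i in range(num_levels)]
    raw = [normal.inv_cdf(p) for p in probs]
    largest = max(abs(v) for v in raw)
    return [v / largest for v in raw]


def uniform_levels(num_levels=16):
    """Evenly spaced levels on [-1, 1], as a plain integer grid would give."""
    step = 2 / (num_levels - 1)
    return [-1 + i * step for i in range(num_levels)]


def nearest_level(value, levels):
    """Quantize one normalized value by snapping it to the closest level."""
    return min(levels, key=lambda level: abs(level - value))


if __name__ == "__main__":
    print("quantile:", [round(v, 3) for v in normal_quantile_levels()])
    print("uniform: ", [round(v, 3) for v in uniform_levels()])
```

Printing both lists shows the quantile-based levels bunching around zero, where most weights lie, while the uniform grid spends as many levels on the sparse tails as on the dense center.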
Ollama (Easiest Quantized Models)
# Download pre-quantized models
ollama pull llama3.1:8b # 4-bit quantized ~4.7GB
ollama pull llama3.1:8b-instruct-q8_0 # 8-bit ~8.5GB
# The quantization level is encoded in the model tag
# q4_0, q4_K_M, q5_K_M, q8_0: the number is the nominal bits per weight; more bits = better quality, larger download
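Because the bit width is embedded in the tag, a rough size estimate can be derived from the name alone. The helper below is a hypothetical sketch, not part of Ollama: it pulls the nominal bit width out of a tag with a regular expression and multiplies by an assumed parameter count. Actual downloads are somewhat larger than this estimate, since the formats also store per-block scaling data.

```python
import re


def nominal_bits(tag: str) -> int:
    """Extract the nominal bit width from a tag such as 'q4_K_M' or 'q8_0'."""
    match = re.search(r"q(\d+)_", tag, re.IGNORECASE)
    if match is None:
        raise ValueError(f"no quantization suffix found in {tag!r}")
    return int(match.group(1))


def approx_size_gb(num_params: float, bits: int) -> float:
    """Lower-bound size estimate: parameters x nominal bits, in decimal GB."""
    return num_params * bits / 8 / 1e9


if __name__ == "__main__":
    num_params = 8e9  # assumed parameter count for an "8b" model
    for tag in ["q4_0", "q4_K_M", "q5_K_M", "q8_0"]:
        bits = nominal_bits(tag)
        print(f"{tag:>7}: {bits} bits, at least ~{approx_size_gb(num_params, bits):.1f} GB")
```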
Quality vs Efficiency Trade-off
Quality
↑
│ FP32 ●────────────────────── (highest quality, highest cost)
│ BF16 ●──────────────────── (nearly same quality, 2x smaller)
│ INT8 ●──────────────── (small quality drop, 4x smaller)
│ INT4 ●─────────── (noticeable drop on edge cases, 8x smaller)
│ INT2 ●──── (significant quality loss, experimental)
└──────────────────────────────→ Memory Efficiency
When to Quantize
| Situation | Quantization Level |
|---|---|
| Production cloud (VRAM available) | BF16 or FP16 |
| Consumer GPU (RTX 3080/4090) | INT4 or INT8 |
| Apple Silicon Mac | FP16 via MPS or INT4 |
| Edge device / phone | INT4 or INT8 with NPU |
| Research (maximum accuracy) | FP32 |
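When memory is the binding constraint, the choice can be reduced to "the highest precision whose weights fit". The sketch below encodes that rule under stated assumptions: weights only, decimal gigabytes, and an optional `reserve_gb` for KV cache and activations whose right value depends on the workload. `pick_precision` is a hypothetical helper, and it ignores the non-memory factors (latency, hardware support, accuracy targets) that the table also reflects.

```python
# Candidate precisions from highest to lowest, with bytes per parameter.
PRECISIONS = [
    ("FP32", 4.0),
    ("FP16/BF16", 2.0),
    ("INT8", 1.0),
    ("INT4", 0.5),
]


def pick_precision(num_params: float, vram_gb: float, reserve_gb: float = 0.0):
    """Return the highest precision whose weights fit in vram_gb - reserve_gb.

    Returns None when even INT4 does not fit.
    """
    budget_bytes = (vram_gb - reserve_gb) * 1e9
    for name, bytes_per_param in PRECISIONS:
        if num_params * bytes_per_param <= budget_bytes:
            return name
    return None


if __name__ == "__main__":
    print(pick_precision(70e9, 80))                # one 80 GB accelerator
    print(pick_precision(8e9, 24))                 # one 24 GB consumer GPU
    print(pick_precision(8e9, 24, reserve_gb=10))  # same GPU, 10 GB held back
```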