What is quantization in AI/LLM?

Question

Accepted Answer

## What is Quantization in AI/LLM?

**Quantization** is the process of reducing the precision of a model's numerical values (weights and, in some schemes, activations) from higher-bit formats such as FP32 or FP16 to lower-bit formats such as INT8 or INT4. This shrinks model size and memory requirements substantially, usually at a small cost in quality.

### Why Quantization Matters

A 70B-parameter model at full precision (FP32) needs about 280 GB of GPU memory for its weights alone (70 billion × 4 bytes), which is more than most hardware offers. Quantization makes it practical:

| Precision | Bytes/param | 70B Model Size | VRAM Needed |
|-----------|-------------|---------------|-------------|
| FP32 | 4 bytes | 280 GB | 4-8 × A100 80GB |
| FP16/BF16 | 2 bytes | 140 GB | 2-4 × A100 80GB |
| INT8 | 1 byte | 70 GB | 1 × A100 80GB (little headroom) |
| INT4 | 0.5 bytes | 35 GB | 2 × RTX 4090 |
| INT2 | 0.25 bytes | 17.5 GB | Experimental |
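
The sizes in the table follow from multiplying the parameter count by the bytes per parameter. A minimal sketch of that arithmetic is below; the function name is hypothetical, and the estimate covers weights only (KV cache, activations, and framework overhead come on top).

```python
# Bytes per parameter for each precision listed in the table.
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16/BF16": 2.0,
    "INT8": 1.0,
    "INT4": 0.5,
    "INT2": 0.25,
}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return weight storage in GB (1 GB = 1e9 bytes); weights only."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# Reproduce the "70B Model Size" column.
for precision in BYTES_PER_PARAM:
    print(f"{precision:>9}: {weight_memory_gb(70e9, precision):6.1f} GB")
```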

### How Quantization Works

```
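Original weight: 0.45892 (FP32, 32 bits, high precision)
                    ↓
INT8 quantization (symmetric, assuming the largest |weight| in the tensor is 1.0):
  scale = max_abs_value / 127 = 1.0 / 127 ≈ 0.007874
  quantized = round(0.45892 / scale) = round(58.28) = 58 (INT8, 8 bits)

To use: dequantized = 58 × scale ≈ 0.457 (error ≈ 0.002, never more than scale / 2)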
```
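
The same arithmetic can be sketched in plain Python. This is a minimal illustration of symmetric per-tensor quantization, not how any particular library implements it; the function names and the sample weights (chosen so the largest magnitude is 1.0) are assumptions made for the example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid dividing by zero
    # Round to the nearest integer and clamp to the symmetric INT8 range.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]

weights = [0.45892, -1.0, 0.25, 0.003]  # illustrative values; max |w| = 1.0
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

print(codes[0], round(restored[0], 3))  # prints: 58 0.457
```

Note that the smallest weight (0.003) rounds to code 0, which is why outlier values that inflate the scale tend to hurt accuracy for everything else in the tensor.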

### Types at a Glance

| Type | Precision | Size reduction | Quality loss |
|------|-----------|---------------|-------------|
| FP32 → FP16 | 16-bit float | 2x smaller | Minimal |
| FP32 → BF16 | 16-bit brain float | 2x smaller | Minimal |
| FP32 → INT8 | 8-bit integer | 4x smaller | Very small |
| FP32 → INT4 | 4-bit integer | 8x smaller | Small-medium |
| FP32 → INT2 | 2-bit integer | 16x smaller | Significant |
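
To get a feel for why the quality-loss column grows as the bit width shrinks, the sketch below measures round-trip error on synthetic weights. The Gaussian weights, the seed, and the function name are assumptions for illustration; real models and real quantizers (per-channel or block-wise scales, NF4, and so on) behave better than this naive per-tensor scheme.

```python
import random

def quantization_rmse(weights, bits):
    """RMS error after symmetric round-to-nearest quantization to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    scale = max(abs(w) for w in weights) / qmax
    squared_error = 0.0
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))
        squared_error += (w - q * scale) ** 2
    return (squared_error / len(weights)) ** 0.5

random.seed(0)
# Synthetic stand-in for a weight tensor; not taken from a real model.
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

for bits in (8, 4, 2):
    print(f"INT{bits}: RMSE = {quantization_rmse(weights, bits):.6f}")
```

The error should rise sharply from 8 to 4 to 2 bits, mirroring the ordering in the table.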

### Using Quantization in Practice

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization (QLoRA-style); a 7-8B model fits in roughly 6 GB of VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"  # NormalFloat4; usually more accurate than plain INT4 for LLM weights
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto"
)
# The 4-bit model needs roughly 5-6 GB of VRAM instead of ~16 GB in BF16
```

### Ollama (Easiest Quantized Models)

```bash
# Download pre-quantized models
ollama pull llama3.1:8b           # 4-bit quantized ~4.7GB
ollama pull llama3.1:8b-instruct-q8_0  # 8-bit ~8.5GB

# The quantization level is part of the model tag
# q4_0, q4_K_M, q5_K_M, q8_0: the number is bits per weight; more bits = better quality, larger file
```
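
The download sizes quoted in the comments are a little larger than a flat 4 or 8 bits per weight would suggest, because these formats store weights in small blocks that each carry their own scale. A rough sketch of that arithmetic follows. It assumes the llama.cpp-style layout of 32 weights plus one 16-bit scale per block for `q4_0` and `q8_0`, and a parameter count of about 8.03 billion; neither figure is stated in this answer, and real files also include some tensors stored in other types.

```python
def bits_per_weight(weight_bits: int, block_size: int = 32, scale_bits: int = 16) -> float:
    """Effective bits per weight when each block stores one shared scale."""
    return (block_size * weight_bits + scale_bits) / block_size

def file_size_gb(num_params: float, weight_bits: int) -> float:
    """Approximate file size in GB (1 GB = 1e9 bytes) for a block-quantized model."""
    return num_params * bits_per_weight(weight_bits) / 8 / 1e9

NUM_PARAMS = 8.03e9  # assumed parameter count for an "8B" model

for name, bits in (("q4_0", 4), ("q8_0", 8)):
    print(f"{name}: {bits_per_weight(bits):.1f} bits/weight, "
          f"~{file_size_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions `q8_0` comes out at about 8.5 GB, in line with the figure above. The default `llama3.1:8b` tag is somewhat larger than the `q4_0` estimate because it uses a mixed-precision 4-bit variant rather than plain `q4_0`.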

### Quality vs Efficiency Trade-off

```
Quality
  ↑
  │  FP32 ●────────────────────── (highest quality, highest cost)
  │  BF16 ●──────────────────── (nearly same quality, 2x smaller)
  │  INT8 ●──────────────── (small quality drop, 4x smaller)
  │  INT4 ●─────────── (noticeable drop on edge cases, 8x smaller)
  │  INT2 ●──── (significant quality loss, experimental)
  └──────────────────────────────→ Memory Efficiency
```

### When to Quantize

| Situation | Quantization Level |
|---------|-------------------|
| Production cloud (VRAM available) | BF16 or FP16 |
| Consumer GPU (RTX 3080/4090) | INT4 or INT8 |
| Apple Silicon Mac | FP16 via MPS or INT4 |
| Edge device / phone | INT4 or INT8 with NPU |
| Research (maximum accuracy) | FP32 |
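
As a first pass, the choice can be reduced to "the highest precision whose weights fit in the available VRAM". The sketch below does that with a hypothetical helper; the 20% headroom reserved for KV cache and activations is an assumed rule of thumb, not a measured value, and real deployments should be checked against actual memory use.

```python
# Candidate precisions, ordered from highest to lowest.
PRECISIONS = [("FP32", 4.0), ("BF16/FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]

def highest_precision_that_fits(num_params, vram_gb, headroom=0.2):
    """Return the highest precision whose weights fit in VRAM, or None.

    `headroom` is the fraction of VRAM kept free for KV cache and activations.
    """
    budget_bytes = vram_gb * 1e9 * (1 - headroom)
    for name, bytes_per_param in PRECISIONS:
        if num_params * bytes_per_param <= budget_bytes:
            return name
    return None

print(highest_precision_that_fits(8e9, 24))   # 8B model on one 24 GB GPU
print(highest_precision_that_fits(70e9, 48))  # 70B model on 2 × 24 GB
print(highest_precision_that_fits(70e9, 24))  # 70B model on one 24 GB GPU
```

With these inputs the helper returns `BF16/FP16`, `INT4`, and `None` respectively; the last case means even INT4 weights do not fit and a smaller model or more memory is needed.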
