What is quantization in AI/LLM?

Answer

Quantization reduces the precision of a model's numerical values (weights and, optionally, activations) from higher-bit formats such as float32 to lower-bit formats such as int8 or int4. This shrinks model size and memory requirements by roughly 4x to 8x, usually with only a small loss in quality.

Why Quantization Matters

A 70B-parameter model at full precision (FP32) needs about 280 GB of GPU memory for its weights alone (70 billion parameters × 4 bytes), which is far more than most hardware can hold. Quantization makes it practical:

| Precision | Bytes/param | 70B model size | VRAM needed |
|---|---|---|---|
| FP32 | 4 bytes | 280 GB | 4-8 × A100 80GB |
| FP16/BF16 | 2 bytes | 140 GB | 2-4 × A100 80GB |
| INT8 | 1 byte | 70 GB | 1 × A100 80GB |
| INT4 | 0.5 bytes | 35 GB | 2 × RTX 4090 |
| INT2 | 0.25 bytes | 17.5 GB | Experimental |
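
The model sizes in the table follow from simple arithmetic: parameters × bits per parameter ÷ 8. A minimal sketch of that calculation is below; the function name `weight_memory_gb` is made up for illustration, and the result covers weights only (activations, KV cache, and framework overhead add to it).

```python
# Bits per parameter for each precision listed in the table above.
BITS_PER_PARAM = {"FP32": 32, "FP16/BF16": 16, "INT8": 8, "INT4": 4, "INT2": 2}


def weight_memory_gb(num_params, bits_per_param):
    """Return the memory needed for the weights alone, in decimal GB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for name, bits in BITS_PER_PARAM.items():
        print(f"{name}: {weight_memory_gb(70e9, bits):.1f} GB")
```

Swapping `70e9` for another parameter count (for example `8e9`) gives the corresponding estimate for a smaller model.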

How Quantization Works

text
Original weight: 0.45892 (FP32, 32 bits, high precision)
INT8 quantization (symmetric; assumes the largest |weight| in the tensor is 1.0):
  scale = max_value / 127 = 1.0 / 127 ≈ 0.007874
  quantized = round(0.45892 / scale) = round(58.28) = 58 (INT8, 8 bits)

  To use: dequantized = 58 × scale ≈ 0.457 (error ≈ 0.002, acceptable)
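
The same round trip can be sketched in a few lines of plain Python. This is an illustrative absmax (symmetric) scheme rather than what any particular library does internally. The helper names are hypothetical, and the sample tensor is chosen so that its largest absolute weight is 1.0, which reproduces the value 58 from the walkthrough above.

```python
def quantize_absmax_int8(weights):
    """Symmetric INT8 quantization: map [-max_abs, max_abs] onto [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


# Illustrative tensor (assumption): its largest |weight| is 1.0.
weights = [0.45892, -1.0, 0.25, 0.0]
quantized, scale = quantize_absmax_int8(weights)
restored = dequantize(quantized, scale)

if __name__ == "__main__":
    print(quantized)
    print([round(r, 4) for r in restored])
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is why one large outlier weight hurts: it inflates `scale` for every other value in the tensor.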

Types at a Glance

| Type | Precision | Size reduction | Quality loss |
|---|---|---|---|
| FP32 → FP16 | 16-bit float | 2x smaller | Minimal |
| FP32 → BF16 | 16-bit brain float | 2x smaller | Minimal |
| FP32 → INT8 | 8-bit integer | 4x smaller | Very small |
| FP32 → INT4 | 4-bit integer | 8x smaller | Small-medium |
| FP32 → INT2 | 2-bit integer | 16x smaller | Significant |
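
Sub-byte formats reach their size reduction by packing several values into one byte. The sketch below shows the idea for signed 4-bit integers, two per byte, which is where the figure of 0.5 bytes per parameter comes from. The function names and the low-nibble-first layout are assumptions made for illustration; real formats also store per-block scales alongside the packed values.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    if len(values) % 2:
        values = values + [0]  # pad to an even count without mutating the input
    packed = bytearray()
    for low, high in zip(values[0::2], values[1::2]):
        packed.append((low & 0x0F) | ((high & 0x0F) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Reverse pack_int4, restoring the sign of each 4-bit value."""
    out = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:count]


values = [3, -8, 7, -1, 0]
packed = pack_int4(values)

if __name__ == "__main__":
    print(len(values), "values stored in", len(packed), "bytes")
    print(unpack_int4(packed, len(values)))
```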

Using Quantization in Practice

python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

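# 4-bit quantization (QLoRA-style NF4; lets an 8B model fit on a consumer GPU)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in BF16 after dequantization
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4"               # NormalFloat4, suited to normally distributed weights
)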

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto"
)
# Weights take roughly 5-6 GB of VRAM instead of ~16 GB in BF16

Ollama (Easiest Quantized Models)

bash
# Download pre-quantized models
ollama pull llama3.1:8b           # 4-bit quantized ~4.7GB
ollama pull llama3.1:8b-instruct-q8_0  # 8-bit ~8.5GB

# The quantization level is encoded in the model tag
# q4_0, q4_K_M, q5_K_M, q8_0: the digit is bits per weight; more bits = better quality, larger download

Quality vs Efficiency Trade-off

text
Quality
  │  FP32 ●────────────────────── (highest quality, highest cost)
  │  BF16 ●──────────────────── (nearly same quality, 2x smaller)
  │  INT8 ●──────────────── (small quality drop, 4x smaller)
  │  INT4 ●─────────── (noticeable drop on edge cases, 8x smaller)
  │  INT2 ●──── (significant quality loss, experimental)
  └──────────────────────────────→ Memory Efficiency

When to Quantize

| Situation | Quantization level |
|---|---|
| Production cloud (VRAM available) | BF16 or FP16 |
| Consumer GPU (RTX 3080/4090) | INT4 or INT8 |
| Apple Silicon Mac | FP16 via MPS or INT4 |
| Edge device / phone | INT4 or INT8 with NPU |
| Research (maximum accuracy) | FP32 |
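
As a rough way to apply the table, the sketch below picks the highest precision whose weights fit in a given VRAM budget. It is a rule of thumb, not a sizing tool. The function name `pick_precision` and the 1.2x overhead factor (headroom for activations and KV cache) are assumptions made for illustration, and real requirements depend on context length and batch size.

```python
# Candidate precisions, ordered from highest to lowest quality.
BYTES_PER_PARAM = [("FP32", 4.0), ("BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]


def pick_precision(num_params, vram_gb, overhead=1.2):
    """Return the highest precision that fits in vram_gb, or None if none fits.

    overhead is an assumed multiplier covering memory beyond the weights.
    """
    for name, bytes_per_param in BYTES_PER_PARAM:
        needed_gb = num_params * bytes_per_param * overhead / 1e9
        if needed_gb <= vram_gb:
            return name
    return None


if __name__ == "__main__":
    print(pick_precision(8e9, 24))   # 8B model on a 24 GB card
    print(pick_precision(70e9, 48))  # 70B model across 2 × 24 GB cards
```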