Concept #151 | Hard | extended-ai-concepts

What is quantization and what are all the different ways to do it?

#quantization #int4 #int8 #gguf #gptq #awq #fp8 #llm

Answer

What is Quantization in AI?

Quantization reduces the numerical precision of model weights (and sometimes activations) from high-precision floats (FP32) to lower-precision formats (FP16/BF16, FP8, INT8, INT4). This shrinks model size, reduces memory usage, and often speeds up inference, usually at a small cost in accuracy.
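To make the idea concrete, here is a minimal sketch of symmetric (absmax) INT8 quantization in NumPy. The function names and the sample weights are illustrative assumptions, not part of any library; real toolkits add per-channel scales, zero points, and packed storage on top of this.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric (absmax) quantization to INT8. Assumes x is not all zeros."""
    scale = np.max(np.abs(x)) / 127.0            # one float scale for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Map INT8 codes back to approximate float values."""
    return q.astype(np.float32) * scale

weights = np.array([0.42, -1.27, 0.05, 0.9], dtype=np.float32)  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = float(np.max(np.abs(weights - restored)))
print(q, scale, max_error)
```

Each weight is stored as one byte plus a shared scale, and the rounding error per weight is at most half a quantization step (`scale / 2`).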


Why Quantization Matters

text
FP32 model (LLaMA 3.2 3B):   ~12 GB VRAM
BF16 model:                    ~6 GB VRAM
INT8 model:                    ~3 GB VRAM
INT4 model (QLoRA/GGUF):       ~1.5 GB VRAM  ← runs on a laptop!
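The figures above follow from simple arithmetic: parameters times bits per weight. A small sketch, assuming a round 3 billion parameters and counting weights only (activations, KV cache, and per-block scale overhead are ignored):

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 3e9  # assumed round parameter count for a ~3B model

for name, bits in [("FP32", 32), ("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Actual VRAM use at inference time will be higher than these numbers, since the runtime also needs room for activations and the KV cache.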

Precision Formats Overview

| Format | Bits | Range | Use Case |
|---|---|---|---|
| FP32 | 32 | ±3.4×10³⁸ | Training (full precision) |
| BF16 | 16 | ±3.4×10³⁸ | Training / inference (same range as FP32) |
| FP16 | 16 | ±65,504 | Inference (narrower range, can overflow) |
| FP8 | 8 | Two variants (E4M3, E5M2) | H100 training & inference |
| INT8 | 8 | -128 to 127 | Inference (good accuracy/speed balance) |
| INT4 | 4 | -8 to 7 | Aggressive compression, slight quality loss |
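The ranges in this table can be derived from the bit layout. The sketch below assumes IEEE-754-style formats, where the top exponent code is reserved for infinity and NaN; the helper names are illustrative. FP8 E4M3 is left out because its common "fn" variant reuses that top exponent code and reaches ±448 instead.

```python
def max_finite(exp_bits: int, man_bits: int) -> float:
    """Largest finite value of an IEEE-754-style binary float format."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias        # top exponent code is reserved for inf/NaN
    return (2 - 2.0 ** -man_bits) * 2.0 ** max_exp

def int_range(bits: int):
    """Range of a signed two's-complement integer."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

# (exponent bits, mantissa bits) per format
formats = {"FP32": (8, 23), "BF16": (8, 7), "FP16": (5, 10), "FP8 E5M2": (5, 2)}
for name, (e, m) in formats.items():
    print(f"{name}: max finite value {max_finite(e, m):.4g}")
print("INT8:", int_range(8), "INT4:", int_range(4))
```

BF16 keeps FP32's 8 exponent bits, which is why its range matches FP32 while its precision (7 mantissa bits) is much lower.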

Types of Quantization

1. Post-Training Quantization (PTQ)

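Quantize an already trained model; no retraining is needed, which makes this the fastest approach to apply. Most of the methods below (bitsandbytes, GPTQ, AWQ, GGUF) are PTQ variants. The simplest form is loading the weights in half precision: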

python
# Simple FP16 inference (half precision)
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    torch_dtype=torch.float16,   # FP16
    device_map="auto"
)
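Beyond a plain dtype cast, integer PTQ has to choose how many scales to use. The following NumPy sketch compares one scale per tensor with one scale per output channel, using round-to-nearest INT8. The weight matrix is synthetic, with row magnitudes chosen (as an assumption for illustration) to differ by orders of magnitude.

```python
import numpy as np

def quantize_dequantize(w: np.ndarray, axis=None) -> np.ndarray:
    """Round-to-nearest INT8 'fake quantization': quantize, then map back to float.

    axis=None uses one scale for the whole tensor; axis=1 uses one scale per row
    (per output channel).
    """
    absmax = np.max(np.abs(w), axis=axis, keepdims=True)
    scale = np.maximum(absmax, 1e-12) / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)
# Hypothetical "trained" weights: four output channels with very different magnitudes.
w = rng.standard_normal((4, 256)) * np.array([[0.01], [0.1], [1.0], [10.0]])

err_per_tensor = float(np.mean((w - quantize_dequantize(w)) ** 2))
err_per_channel = float(np.mean((w - quantize_dequantize(w, axis=1)) ** 2))
print(f"per-tensor MSE:  {err_per_tensor:.6f}")
print(f"per-channel MSE: {err_per_channel:.6f}")
```

With a single scale, the largest row dictates the step size and the small rows collapse toward zero; per-channel scales avoid that at the cost of storing one extra float per row.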

2. BitsAndBytes Quantization (INT8 / INT4)

The easiest way to quantize HuggingFace models on consumer GPUs:

python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 8-bit quantization (LLM.int8())
bnb_int8 = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit quantization with NF4 (NormalFloat4), a 4-bit data type
# designed for roughly normally distributed weights
bnb_int4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 (alternative: "fp4")
    bnb_4bit_use_double_quant=True,      # also quantize the per-block scales
    bnb_4bit_compute_dtype=torch.bfloat16  # matmuls run in BF16 after dequantization
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=bnb_int4,
    device_map="auto"
)
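What NF4 does to a weight tensor can be sketched in a few lines of NumPy: split the weights into blocks, normalize each block by its absolute maximum, and snap every value to the nearest entry of a fixed 16-value codebook. The codebook below is rounded to four decimals and is assumed from the published NF4 values; the function names and block size of 64 are illustrative, and double quantization of the per-block scales is not shown.

```python
import numpy as np

# NF4 code values rounded to 4 decimals (assumed from the published NF4 table).
NF4 = np.array([-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
                0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0])

def nf4_quantize(w: np.ndarray, block_size: int = 64):
    """Return 4-bit codebook indices and one absmax scale per block."""
    blocks = w.reshape(-1, block_size)
    absmax = np.max(np.abs(blocks), axis=1, keepdims=True)
    normalized = blocks / np.maximum(absmax, 1e-12)          # now in [-1, 1]
    codes = np.argmin(np.abs(normalized[..., None] - NF4), axis=-1).astype(np.uint8)
    return codes, absmax

def nf4_dequantize(codes: np.ndarray, absmax: np.ndarray, shape) -> np.ndarray:
    """Look up each code and rescale by its block's absmax."""
    return (NF4[codes] * absmax).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 128)).astype(np.float32)   # stand-in for a weight matrix
codes, absmax = nf4_quantize(w)
w_hat = nf4_dequantize(codes, absmax, w.shape)
print("mean abs error:", float(np.mean(np.abs(w - w_hat))))
```

The codebook levels are denser near zero, where most weights of a trained network tend to sit, which is the motivation for NF4 over an evenly spaced 4-bit grid.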

3. GPTQ (Post-Training Quantization with Calibration)

Quantizes weights layer by layer, using a small calibration dataset to minimize each layer's output error. Usually more accurate than naive round-to-nearest INT4:

python
from transformers import AutoModelForCausalLM, GPTQConfig, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

gptq_config = GPTQConfig(
    bits=4,
    dataset="wikitext2",     # calibration dataset
    tokenizer=tokenizer
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=gptq_config,
    device_map="auto"
)
model.save_pretrained("./llama3-gptq-4bit")
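The core idea behind GPTQ can be sketched in NumPy: quantize one weight column at a time and push the resulting error onto the columns that have not been quantized yet, guided by second-order statistics of the calibration inputs. This is a simplified illustration under stated assumptions (explicit matrix inverse, fixed per-row 4-bit grid, no grouping or Cholesky tricks), not the optimized algorithm the libraries implement; `gptq_like` and `rtn` are hypothetical names.

```python
import numpy as np

def rtn(w, scale):
    """Round-to-nearest onto a signed 4-bit grid with the given scale."""
    return np.clip(np.round(w / scale), -8, 7) * scale

def gptq_like(W, X, damp=0.01):
    """Quantize W (out x in) column by column, compensating with calibration inputs X (n x in)."""
    W = W.copy()
    scale = np.max(np.abs(W), axis=1) / 7.0                  # fixed per-row grid
    H = X.T @ X                                              # input second-moment matrix
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])     # damping for numerical stability
    Hinv = np.linalg.inv(H)
    Q = np.zeros_like(W)
    for i in range(W.shape[1]):
        Q[:, i] = rtn(W[:, i], scale)
        err = (W[:, i] - Q[:, i]) / Hinv[i, i]
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])       # shift error to later columns
        Hinv -= np.outer(Hinv[:, i], Hinv[i, :]) / Hinv[i, i]  # drop column i from the problem
    return Q

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 32)) @ rng.standard_normal((32, 32))  # correlated inputs
W = rng.standard_normal((16, 32))
row_scale = np.max(np.abs(W), axis=1) / 7.0

Q_gptq = gptq_like(W, X)
Q_rtn = rtn(W, row_scale[:, None])
print("RTN output error:      ", np.linalg.norm(X @ (W - Q_rtn).T))
print("GPTQ-like output error:", np.linalg.norm(X @ (W - Q_gptq).T))
```

On correlated inputs like these, the compensated version usually gives a lower layer-output error than plain round-to-nearest, though that is a tendency rather than a guarantee for every matrix.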

4. AWQ (Activation-Aware Weight Quantization)

Protects salient weights (those multiplied by large activations, which strongly affect the output) by rescaling them before quantization. Often matches or beats GPTQ in accuracy:

python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.2-3B"
quant_path = "./llama3-awq-4bit"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
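The activation-aware idea can be illustrated with a small NumPy sketch: input channels that see large activations are scaled up before rounding (so their relative rounding error shrinks) and scaled back afterwards, with the scaling strength `alpha` picked by a grid search on calibration data. This is a simplified sketch under assumptions (per-row 4-bit grid, no grouping, synthetic data), and `awq_like` is a hypothetical name, not the AutoAWQ API.

```python
import numpy as np

def rtn_int4(w):
    """Per-row round-to-nearest onto a signed 4-bit grid."""
    scale = np.maximum(np.max(np.abs(w), axis=1, keepdims=True), 1e-12) / 7.0
    return np.clip(np.round(w / scale), -8, 7) * scale

def awq_like(W, X):
    """Search a per-input-channel scale s = act**alpha that minimizes output error on X."""
    act = np.mean(np.abs(X), axis=0)           # average activation magnitude per input channel
    ref = X @ W.T                              # full-precision layer output
    best_err, best_W = np.inf, None
    for alpha in np.linspace(0.0, 1.0, 11):    # alpha = 0 is plain round-to-nearest
        s = act ** alpha
        s = s / np.sqrt(s.max() * s.min())     # keep scales centered around 1
        W_hat = rtn_int4(W * s) / s            # effective weights seen by unscaled inputs
        err = np.linalg.norm(ref - X @ W_hat.T)
        if err < best_err:
            best_err, best_W = err, W_hat
    return best_W, best_err

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 64))
X[:, :4] *= 20.0                               # a few hypothetical "salient" input channels
W = rng.standard_normal((32, 64))

W_awq, err_awq = awq_like(W, X)
err_rtn = np.linalg.norm(X @ W.T - X @ rtn_int4(W).T)
print("RTN error:", err_rtn, "AWQ-like error:", err_awq)
```

Because `alpha = 0` reproduces plain rounding, the searched result can never be worse than round-to-nearest on the calibration set. In a real deployment the inverse scale is folded into the preceding operation rather than divided out of the weights.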

5. GGUF / llama.cpp (CPU-Friendly Quantization)

GGUF is the file format used by llama.cpp, which enables running quantized LLMs on CPU or Apple Silicon:

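bash
# Convert to GGUF and quantize
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt   # dependencies of the conversion script
# Recent llama.cpp versions build with CMake; binaries land in build/bin
cmake -B build && cmake --build build --config Release

# The converter reads a local copy of the Hugging Face checkpoint
# (/path/to/Llama-3.2-3B is a placeholder for your download location)
python3 convert_hf_to_gguf.py /path/to/Llama-3.2-3B --outfile llama3.gguf

# Quantize to Q4_K_M (a popular quality/size trade-off for 4-bit)
./build/bin/llama-quantize llama3.gguf llama3-Q4_K_M.gguf Q4_K_M

python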
# Run with Python via llama-cpp-python
from llama_cpp import Llama

llm = Llama(model_path="llama3-Q4_K_M.gguf", n_ctx=4096)
output = llm("Explain quantization in simple terms:", max_tokens=200)
print(output["choices"][0]["text"])
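GGUF quantization types store weights in small blocks, each with its own scale. The sketch below is written in the spirit of llama.cpp's basic Q4_0 type (32 weights per block, one FP16 scale, two 4-bit codes per byte), but it is an illustration with hypothetical helper names, not the exact on-disk layout. The K-quants such as Q4_K_M use a more elaborate scheme.

```python
import numpy as np

BLOCK = 32  # weights per block, as in llama.cpp's basic Q4_0 and Q8_0 types

def bits_per_weight(weight_bits: int, scale_bits: int = 16, block: int = BLOCK) -> float:
    """Effective storage cost once the per-block FP16 scale is included."""
    return (block * weight_bits + scale_bits) / block

def pack_block_4bit(x: np.ndarray):
    """Quantize one block of 32 floats to 16 packed bytes plus an FP16 scale."""
    scale = np.float16(np.max(np.abs(x)) / 7.0)
    codes = (np.clip(np.round(x / np.float32(scale)), -8, 7) + 8).astype(np.uint8)
    packed = codes[:16] | (codes[16:] << 4)      # low nibble: first half, high nibble: second half
    return packed, scale

def unpack_block_4bit(packed: np.ndarray, scale) -> np.ndarray:
    """Recover approximate floats from the packed nibbles."""
    codes = np.concatenate([packed & 0x0F, packed >> 4]).astype(np.float32) - 8.0
    return codes * np.float32(scale)

rng = np.random.default_rng(0)
block = rng.standard_normal(BLOCK).astype(np.float32)
packed, scale = pack_block_4bit(block)
restored = unpack_block_4bit(packed, scale)

print("bytes per block:", packed.nbytes + 2)                # 16 data bytes + 2 scale bytes
print("4-bit blocks:", bits_per_weight(4), "bits/weight")
print("8-bit blocks:", bits_per_weight(8), "bits/weight")
```

The per-block scale is why a "4-bit" GGUF file is somewhat larger than 4 bits per parameter would suggest.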

6. FP8 Quantization (H100 / Modern Hardware)

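FP8 is a newer 8-bit floating-point format with native hardware support on NVIDIA H100 and later GPUs:

python
from transformers import AutoModelForCausalLM, FbgemmFp8Config

# Passing torch_dtype=torch.float8_e4m3fn only casts the stored weights, and most
# layers have no FP8 kernels, so a dedicated FP8 quantization config is used instead.
# FbgemmFp8Config needs the fbgemm-gpu package and an H100-class GPU
# (compute capability 9.0 or higher).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=FbgemmFp8Config(),
    torch_dtype="auto",
    device_map="auto"
)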
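To see what FP8 E4M3 does to values numerically, the sketch below simulates its rounding in NumPy: clamp to the format's maximum of 448, then keep only a 3-bit mantissa. It deliberately ignores subnormals and NaN handling, and the function names are illustrative assumptions rather than any library's API.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value of the FP8 E4M3 ("fn") variant

def fake_fp8_e4m3(x: np.ndarray) -> np.ndarray:
    """Round to the nearest value with a 3-bit mantissa (no subnormal or NaN handling)."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    mantissa, exponent = np.frexp(x)            # x = mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    mantissa = np.round(mantissa * 16) / 16     # 1 implicit + 3 explicit mantissa bits
    return np.ldexp(mantissa, exponent)

def quantize_fp8(x: np.ndarray):
    """Per-tensor scaling so the largest magnitude maps to E4M3_MAX, then rounding."""
    scale = np.max(np.abs(x)) / E4M3_MAX
    return fake_fp8_e4m3(x / scale), scale

samples = np.array([1.0, 1.06, 1.07, 300.0, 1000.0])
print(fake_fp8_e4m3(samples))

rng = np.random.default_rng(0)
w = rng.standard_normal(1024)
w_fp8, scale = quantize_fp8(w)
print("mean relative error:", float(np.mean(np.abs(w - w_fp8 * scale) / (np.abs(w) + 1e-12))))
```

Unlike INT8, the spacing between representable values grows with magnitude, so the relative error stays roughly constant across the range, while values above 448 (before scaling) saturate.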

7. Quantization-Aware Training (QAT)

Simulate quantization during training so the model adapts to reduced precision:

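python
import torch.ao.quantization as tq

# `model` is assumed to be an existing float nn.Module set up for eager-mode
# quantization (QuantStub/DeQuantStub around the region to quantize).
model.train()
model.qconfig = tq.get_default_qat_qconfig("fbgemm")  # x86 CPU backend
tq.prepare_qat(model, inplace=True)                   # insert fake-quantization modules

# ... normal training loop ...

model.eval()
tq.convert(model, inplace=True)  # swap in real INT8 quantized modules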
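The mechanism underneath QAT is "fake quantization" with a straight-through estimator: the forward pass uses rounded weights, while the gradient is applied to full-precision shadow weights as if rounding were the identity. A minimal NumPy sketch on a toy linear regression follows; the data, target weights, step size, and learning rate are all assumptions chosen for illustration.

```python
import numpy as np

def fake_quant(w: np.ndarray, step: float = 0.1) -> np.ndarray:
    """Simulated signed 4-bit quantization: snap to a grid of `step`, clamp to 16 levels."""
    return np.clip(np.round(w / step), -8, 7) * step

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 4))
w_true = np.array([0.5, -0.3, 0.2, -0.6])      # hypothetical target weights
y = X @ w_true

w = np.zeros(4)                                # full-precision "shadow" weights
initial_loss = float(np.mean((X @ fake_quant(w) - y) ** 2))

for _ in range(200):
    w_q = fake_quant(w)                        # forward pass sees quantized weights
    grad = 2 * X.T @ (X @ w_q - y) / len(X)    # gradient with respect to w_q
    w -= 0.05 * grad                           # straight-through: apply it to w directly

final_loss = float(np.mean((X @ fake_quant(w) - y) ** 2))
print(f"loss with quantized weights: {initial_loss:.4f} -> {final_loss:.4f}")
```

Because the loss is always measured with the quantized weights, training steers the shadow weights toward values that still work after rounding, which is what lets QAT recover accuracy that PTQ may lose at low bit widths.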

Comparison of Quantization Methods

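| Method | Bits | Quality Loss | Speed Gain | VRAM Saved (vs FP32) | Use Case |
|---|---|---|---|---|---|
| FP16 | 16 | Negligible | 1.5× | 50% | Standard inference |
| BF16 | 16 | Negligible | 1.5× | 50% | Training + inference |
| INT8 (bitsandbytes) | 8 | Very low | | 75% | General inference |
| INT4 NF4 | 4 | Low | 3–4× | 87% | Consumer GPU inference |
| GPTQ 4-bit | 4 | Low | 3–4× | 87% | Production INT4 |
| AWQ 4-bit | 4 | Very low | 3–4× | 87% | Best INT4 accuracy |
| GGUF Q4_K_M | 4 | Low | | 87% | CPU / Apple Silicon |
| FP8 | 8 | Very low | | 75% | H100 training |
| QAT | 4–8 | Minimal | 3–4× | 75–87% | Best accuracy for size |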

How to Choose

text
Running on CPU / Apple Silicon?
    → GGUF (Q4_K_M or Q5_K_M)

Running on consumer GPU (RTX 3090/4090)?
    → bitsandbytes INT4 (NF4) or GPTQ

Running on datacenter GPU (A100)?
    → AWQ or GPTQ (best accuracy)

Running on H100?
    → FP8

Need to fine-tune a quantized model?
    → QLoRA (INT4 NF4 via bitsandbytes)

Rule of thumb: Start with Q4_K_M (GGUF) for CPU or NF4 (bitsandbytes) for GPU. Move to AWQ if you need better accuracy at the same bit width.