What are the various number formats used in AI models, and what do abbreviations like FP32, BF16, MXFP8, NVFP4, INT4, and Q4 mean?
Answer
AI Number Formats: From FP32 to NVFP4
A number format defines how bits represent a numerical value — how many bits for the sign, exponent, and mantissa. This directly determines precision, dynamic range, memory usage, and inference speed.
Format Family Overview
AI number formats:
- Floating-point
  - Wide (32-bit): FP32, TF32
  - Narrow (4-16 bit): FP16, BF16, FP8, MXFP8, NVFP4, FP4
- Integer
  - Signed: INT8, INT4
  - Unsigned: UINT8
The Complete Format Table
| Format | Bits | Sign | Exponent | Mantissa | Range / values | Introduced with |
|---|---|---|---|---|---|---|
| FP32 | 32 | 1 | 8 | 23 | ±3.4×10³⁸ | Baseline (training) |
| TF32 | 19 (stored in 32) | 1 | 8 | 10 | ±3.4×10³⁸ | Ampere (2020) |
| FP16 | 16 | 1 | 5 | 10 | ±65,504 | Volta (2017) |
| BF16 | 16 | 1 | 8 | 7 | ±3.4×10³⁸ | TPUv2 (2018) |
| FP8 E4M3 | 8 | 1 | 4 | 3 | ±448 | Hopper (2022) |
| FP8 E5M2 | 8 | 1 | 5 | 2 | ±57,344 | Hopper (2022) |
| MXFP8 | 8 + shared block scale | 1 | 4 or 5 | 3 or 2 | Per-block scaled | Blackwell (2024) |
| INT8 | 8 | 1 | n/a | 7 | −128 to +127 | Turing (2018) |
| FP4 | 4 | 1 | 2 | 1 | 0 to ±6 | Experimental |
| NVFP4 | 4 + shared block scale | 1 | 2 | 1 | 0 to ±6, times block scale | Blackwell (2024) |
| INT4 | 4 | 1 | n/a | 3 | −8 to +7 | Various |
| Q4 / Q4_K_M | ~4.5 | n/a | n/a | n/a | Blockwise (GGUF) | llama.cpp |
Deep Dive: What Each Bit Layout Means
FP32 (IEEE 754 Single Precision) — The reference standard:
- Bit layout: `[S] [EEEEEEEE] [MMMMMMMMMMMMMMMMMMMMMMM]` (1 + 8 + 23 bits)
- Precision: ~7 decimal digits
- Dynamic range: 1.2×10⁻³⁸ to 3.4×10³⁸ (normal values)
- Use: training reference and master weights; rarely needed for inference
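To make the bit layout concrete, here is a minimal sketch that splits a value into its FP32 sign, exponent, and mantissa fields using Python's standard `struct` module. The helper names `fp32_fields` and `fp32_value` are illustrative, and the reconstruction formula only covers normal values (not zero, subnormals, infinity, or NaN).

```python
import struct

def fp32_fields(x: float) -> tuple[int, int, int]:
    """Round x to FP32 and return (sign, biased exponent, mantissa) as integers."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF        # 23 bits, implicit leading 1 for normal values
    return sign, exponent, mantissa

def fp32_value(sign: int, exponent: int, mantissa: int) -> float:
    """Rebuild a normal FP32 value: (-1)^s * 2^(e - 127) * (1 + m / 2^23)."""
    return (-1.0) ** sign * 2.0 ** (exponent - 127) * (1 + mantissa / 2 ** 23)

s, e, m = fp32_fields(-6.5)
print(s, e, m)               # 1 129 5242880
print(fp32_value(s, e, m))   # -6.5
```

The same decomposition applies to every floating-point format in the table; only the field widths and the exponent bias change.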
BF16 (Brain Float) — Same exponent range as FP32, with a much shorter mantissa:
- Bit layout: `[S] [EEEEEEEE] [MMMMMMM]` (1 + 8 + 7 bits)
- Precision: ~2 decimal digits (less than FP16)
- Dynamic range: same as FP32 (±3.4×10³⁸), which is its key advantage
- Use: training (popular on TPUs, A100 and newer, RTX 40xx)

Why BF16 is usually preferred over FP16 for training: it keeps FP32's 8-bit exponent, so very large and very small values remain representable. The mantissa shrinks from 23 bits to 7, which costs precision, but training is noise-tolerant.
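Because BF16 is the top 16 bits of an FP32 pattern, conversion can be sketched as a rounded truncation. The following is a minimal illustration using round-to-nearest-even; the function names are made up for this example, and NaN inputs are not handled.

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Round x (as FP32) to BF16 and return the 16-bit pattern. NaN is not handled."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Round to nearest, ties to even: add 0x7FFF plus the lowest kept bit,
    # then drop the 16 low-order mantissa bits.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def bf16_bits_to_float(b: int) -> float:
    """Widen a BF16 bit pattern back to a float by zero-filling the low 16 bits."""
    return struct.unpack(">f", struct.pack(">I", b << 16))[0]

print(bf16_bits_to_float(fp32_to_bf16_bits(3.14159)))  # 3.140625
print(bf16_bits_to_float(fp32_to_bf16_bits(1e38)))     # still finite, unlike FP16
```

The first result shows the precision cost (only about two to three significant digits survive), while the second shows that magnitudes far beyond FP16's 65,504 limit are still representable.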
FP16 (IEEE Half Precision) — Narrow range, narrow precision:
- Bit layout: `[S] [EEEEE] [MMMMMMMMMM]` (1 + 5 + 10 bits)
- Precision: ~3 decimal digits
- Dynamic range: ±6.10×10⁻⁵ (smallest normal value) to ±65,504; subnormals extend down to about 6×10⁻⁸
- Use: inference, some training (requires loss scaling)

Problem: the 5-bit exponent is too narrow for many gradients. Gradients below 6.10×10⁻⁵ lose precision as subnormals, and gradients below about 6×10⁻⁸ underflow to zero in FP16.
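The underflow problem and the loss-scaling workaround can be demonstrated with the standard library alone, since `struct` supports IEEE half precision through the `e` format code. This is a sketch: the gradient value and the scale factor of 1024 are arbitrary choices for illustration, not recommended settings.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8                        # hypothetical tiny gradient
print(to_fp16(grad))               # 0.0, the value underflows

loss_scale = 1024.0                # hypothetical loss-scale factor
scaled = to_fp16(grad * loss_scale)
recovered = scaled / loss_scale    # unscale in higher precision
print(recovered)                   # close to 1e-8 again
```

Multiplying the loss by a constant shifts all gradients into FP16's representable range; dividing by the same constant afterwards, in FP32, restores their magnitude.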
FP8 (Hopper) — Two variants for different tasks:
| Variant | Exponent | Mantissa | Range | Precision | Use |
|---|---|---|---|---|---|
| E4M3 | 4 bits | 3 bits | ±0.00195 to ±448 | ~1 decimal digit | Forward pass (weights, activations) |
| E5M2 | 5 bits | 2 bits | ±0.000015 to ±57,344 | ~0.5 decimal digit | Backward pass (gradients need range) |
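
The range figures in this table follow from the bit widths. The sketch below derives them under two conventions: E5M2 reserves the top exponent code for infinity and NaN (as IEEE formats do), while E4M3 reserves only a single all-ones pattern for NaN, which is why it reaches 448. The function name and the `ieee_like` flag are illustrative.

```python
def fp8_limits(exp_bits: int, man_bits: int, ieee_like: bool) -> tuple[float, float]:
    """Return (smallest positive subnormal, largest finite value) for a small float format."""
    bias = 2 ** (exp_bits - 1) - 1
    min_subnormal = 2.0 ** (1 - bias) * 2.0 ** (-man_bits)
    if ieee_like:
        # Top exponent code is reserved for inf/NaN (E5M2 convention).
        max_exp = (2 ** exp_bits - 2) - bias
        max_frac = 2.0 - 2.0 ** (-man_bits)
    else:
        # Only exponent=all-ones with mantissa=all-ones is NaN (E4M3 convention),
        # so the top exponent is usable with every other mantissa.
        max_exp = (2 ** exp_bits - 1) - bias
        max_frac = 2.0 - 2.0 ** (1 - man_bits)
    return min_subnormal, 2.0 ** max_exp * max_frac

print(fp8_limits(4, 3, ieee_like=False))     # (0.001953125, 448.0)
print(fp8_limits(5, 2, ieee_like=True)[1])   # 57344.0
```

The lower bounds in the table (about 0.00195 and 0.000015) are the smallest subnormals, 2⁻⁹ and 2⁻¹⁶ respectively.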
NVIDIA FP4 / NVFP4 (E2M1) — The smallest floating format in production:
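- Bit layout: `[S] [EE] [M]` (1 + 2 + 1 bits)
- 16 encodings, 15 distinct values (7 positive, 7 negative, and zero, which has both a +0 and a −0 encoding): 0, ±0.5, ±1.0, ±1.5, ±2.0, ±3.0, ±4.0, ±6.0

Key properties:
- Non-uniform spacing: dense near zero, sparse at extremes
- The raw E2M1 grid only reaches ±6, so NVFP4 pairs each block of 16 elements with a shared FP8 (E4M3) scale factor, plus a per-tensor FP32 scale
- Used on Blackwell B200 Tensor Cores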
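The full E2M1 value grid is small enough to enumerate. This sketch decodes all 16 four-bit codes, assuming an exponent bias of 1 and a subnormal step of 0.5; the function name `decode_e2m1` is made up for this example.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code: bit 3 = sign, bits 2-1 = exponent, bit 0 = mantissa."""
    sign = -1.0 if code & 0b1000 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 0b1
    if exponent == 0:
        magnitude = 0.5 * mantissa                       # subnormal: 0 or 0.5
    else:
        magnitude = 2.0 ** (exponent - 1) * (1 + 0.5 * mantissa)
    return sign * magnitude

values = [decode_e2m1(code) for code in range(16)]
print(values[:8])        # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
print(len(set(values)))  # 15, because +0 and -0 compare equal
```

The gaps between neighbours grow from 0.5 near zero to 2.0 at the top of the range, which is the non-uniform spacing described above.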
MXFP8 (Microscaling FP8) — Block-based scaling:
Unlike regular FP8, MXFP8 groups values into blocks (typically 32 elements). Each block shares one 8-bit scale factor, which is a power of two, and each element is stored as E4M3 or E5M2 relative to that scale.

Block of 32 values: `[scale_8bit] [v1_e4m3] [v2_e4m3] ... [v32_e4m3]`

Benefit: the shared scale lets every block use the element format's full range, so an outlier in one block does not force coarse quantization on the rest of the tensor. The cost is only 8 extra bits per block.
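A rough sketch of block scaling with E4M3 elements follows. It is a simplification under stated assumptions: the shared scale is chosen as a power of two from the block's largest magnitude, elements are rounded to the nearest E4M3 value by table lookup (ties are not handled as round-to-even), and out-of-range values saturate at ±448. All names here are illustrative, not an actual library API.

```python
import math

E4M3_EMAX = 8  # 2**8 is the largest power of two not exceeding the E4M3 maximum of 448

def e4m3_grid() -> list[float]:
    """All non-negative finite E4M3 values (bias 7; exponent 15 with mantissa 7 is NaN)."""
    grid = []
    for e in range(16):
        for m in range(8):
            if e == 15 and m == 7:
                continue
            grid.append(m / 8 * 2.0 ** -6 if e == 0 else (1 + m / 8) * 2.0 ** (e - 7))
    return grid

GRID = e4m3_grid()

def round_e4m3(x: float) -> float:
    """Nearest E4M3 value to x, saturating at +/-448."""
    nearest = min(GRID, key=lambda g: abs(g - abs(x)))
    return math.copysign(nearest, x)

def quantize_block(block: list[float]) -> tuple[float, list[float]]:
    """Return (shared power-of-two scale, E4M3-rounded elements) for one block."""
    amax = max(abs(v) for v in block)
    scale = 1.0 if amax == 0 else 2.0 ** (math.floor(math.log2(amax)) - E4M3_EMAX)
    return scale, [round_e4m3(v / scale) for v in block]

def dequantize_block(scale: float, elements: list[float]) -> list[float]:
    return [scale * q for q in elements]

block = [0.001 * (i + 1) for i in range(32)]   # hypothetical block of 32 small values
scale, elements = quantize_block(block)
print(scale == 2.0 ** -13)                     # True
print(dequantize_block(scale, elements)[:3])
```

Without the shared scale, most of these values would fall into E4M3's subnormal range or below it; with it, every element lands in the normal range and keeps its three mantissa bits.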
INT8 / INT4 — Pure integer quantization:
| Format | Bits | Values | Step Size |
|---|---|---|---|
| INT8 | 8 | -128 to +127 | Uniform (× scale) |
| INT4 | 4 | -8 to +7 | Uniform (× scale) |
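
The "uniform (× scale)" step size can be illustrated with symmetric absmax quantization, one common scheme among several. This is a minimal per-tensor sketch; real INT4 deployments usually use per-group scales and calibration data, which are omitted here, and the function names are illustrative.

```python
def quantize_symmetric(weights: list[float], bits: int) -> tuple[list[int], float]:
    """Map floats to signed integers using one shared scale (absmax scheme)."""
    qmax = 2 ** (bits - 1) - 1            # 127 for INT8, 7 for INT4
    amax = max(abs(w) for w in weights)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate floats: every level is an integer multiple of scale."""
    return [v * scale for v in q]

weights = [1.0, -1.0, 0.0, 0.3]           # hypothetical weights
for bits in (8, 4):
    q, scale = quantize_symmetric(weights, bits)
    print(bits, q, dequantize(q, scale))
```

With 8 bits the value 0.3 is recovered almost exactly, while with 4 bits it snaps to the nearest of only 16 evenly spaced levels, which is why INT4 depends so heavily on a well-chosen scale.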
GGML/llama.cpp Q4_K_M — Blockwise quantization for local LLMs:
GGML naming convention:
- Q = quantized
- 4 = roughly 4 bits per weight
- K = "K-quant" (super-block scheme whose per-block scales are themselves quantized)
- M = medium (balanced quality/size mix)

Other variants:
- Q4_0: legacy 4-bit (groups of 32, fp16 scale)
- Q4_K_S: small (smaller file, slightly lower quality)
- Q4_K_M: medium (recommended default)
- Q5_K_M: 5-bit (higher quality, ~20% larger)
- Q8_0: 8-bit (highest quality, ~2x larger than Q4)
- IQ4_XS: "I-quant" 4-bit, extra small (strong quality for its size)
Visual: Precision vs Range Trade-off
Approximate precision in decimal digits, from highest to lowest (the original chart plotted precision against dynamic range on a log scale from 10⁻⁵ to 10³⁸):
- FP32: ~7
- FP16: ~3
- BF16: ~2 (better range than FP16, lower precision)
- FP8 E4M3: ~1
- NVFP4: ~0.5
- INT4: 16 uniformly spaced levels, with range set entirely by the scale factor
The Quantization Pipeline in Practice
Quick Memory Guide Per Billion Parameters
| Format | Bits/Weight | GB per 1B params | 7B model | 70B model |
|---|---|---|---|---|
| FP32 | 32 | 4.0 GB | 28 GB | 280 GB |
| FP16 / BF16 | 16 | 2.0 GB | 14 GB | 140 GB |
| INT8 | 8 | 1.0 GB | 7 GB | 70 GB |
| FP8 | 8 | 1.0 GB | 7 GB | 70 GB |
| Q4_K_M | ~4.5 | ~0.56 GB | ~4 GB | ~40 GB |
| INT4 / NVFP4 (before scale overhead) | 4 | 0.5 GB | 3.5 GB | 35 GB |
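
The table values come from simple arithmetic: parameters × bits per weight ÷ 8, expressed in decimal gigabytes. The sketch below reproduces a few rows and shows how a shared per-block scale raises the effective bits per weight, using the block sizes stated earlier (32 elements with an fp16 scale for Q4_0, 32 elements with an 8-bit scale for MXFP8). It counts weights only, with no activations or KV cache, and the function names are illustrative.

```python
def model_size_gb(params: float, bits_per_weight: float) -> float:
    """Weight storage only, in decimal GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

def effective_bits(element_bits: int, scale_bits: int, block_size: int) -> float:
    """Bits per weight once a shared per-block scale is amortized over the block."""
    return element_bits + scale_bits / block_size

print(model_size_gb(7e9, 16))      # 14.0  (FP16 / BF16, 7B model)
print(model_size_gb(70e9, 4))      # 35.0  (4-bit, 70B model, no scales)
print(effective_bits(4, 16, 32))   # 4.5   (Q4_0-style block)
print(effective_bits(8, 8, 32))    # 8.25  (MXFP8-style block)
```

This is why blockwise 4-bit formats land near 4.5 bits per weight rather than exactly 4.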
Summary: When to Use Each Format
| Format | Training | Inference | Key Reason |
|---|---|---|---|
| FP32 | Gold standard (master weights) | Rarely | Too large, more precision than inference needs |
| BF16 | Recommended (A100+, H100, RTX 40xx) | Good | Same range as FP32, half the size |
| FP16 | Possible (needs loss scaling) | Good | Wide hardware support |
| TF32 | Drop-in for FP32 matmuls on Ampere+ (may need a framework flag) | N/A | Up to ~8x peak Tensor Core throughput vs FP32, same range |
| FP8 | Emerging (H100, B200) | Emerging | 2x smaller than FP16, native on Hopper/Blackwell |
| MXFP8 | Future (Blackwell) | Future | Per-block scaling outperforms plain FP8 |
| INT8 | Rare (QAT) | Standard | 4x smaller than FP32, works on most GPUs |
| INT4 | No | Common | 8x smaller than FP32, needs calibration |
| NVFP4 | No | Blackwell-native | ~7x smaller than FP32, non-uniform grid with per-block FP8 scales |
| Q4_K_M | No | Local LLMs (llama.cpp) | Best quality/size trade-off for CPUs/Macs |
The TL;DR: FP32 = training reference. BF16 = training standard (same range as FP32). INT8/INT4 = inference compression. FP8/NVFP4/MXFP8 = the future frontier. Q4_K_M = what you use to run Llama on your laptop.
Learn more at NVIDIA FP8 Training Whitepaper and GGML Quantization Types.