What are the various number formats used in AI models, and what do abbreviations like FP32, BF16, MXFP8, NVFP4, INT4, and Q4 mean?

#gen-ai #quantization #number-formats #fp32 #bf16 #fp8 #int4 #nvfp4 #precision #inference

Answer

AI Number Formats: From FP32 to NVFP4

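A number format defines how bits represent a numerical value. For floating-point formats that means how many bits go to the sign, exponent, and mantissa; for integer formats it means the bit width and whether the value is signed. This directly determines precision, dynamic range, memory usage, and inference speed.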

Format Family Overview

text
                          AI Number Formats
                  _____________|_____________
                 |                           |
            Floating-Point                  Integer
         ________|_________              _____|_____
        |                  |            |           |
    Wide (32-bit)     Narrow (4-16 bit)  Signed       Unsigned
    FP32, TF32         FP16, BF16        INT8, INT4    UINT8
                       FP8, MXFP8
                       NVFP4, FP4

The Complete Format Table

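| Format | Bits | Sign | Exponent | Mantissa | Range | In Use Since |
|---|---|---|---|---|---|---|
| FP32 | 32 | 1 | 8 | 23 | ±3.4×10³⁸ | Always (training) |
| TF32 | 19 | 1 | 8 | 10 | ±3.4×10³⁸ | Ampere (2020) |
| FP16 | 16 | 1 | 5 | 10 | ±65,504 | Volta (2017) |
| BF16 | 16 | 1 | 8 | 7 | ±3.4×10³⁸ | TPUv2 (2018) |
| FP8 E4M3 | 8 | 1 | 4 | 3 | ±448 | Hopper (2022) |
| FP8 E5M2 | 8 | 1 | 5 | 2 | ±57,344 | Hopper (2022) |
| MXFP8 | 8 (+ shared block scale) | 1 | 4 or 5 | 3 or 2 | Per-block scaled | Blackwell (2024) |
| INT8 | 8 | 1 | n/a | 7 (magnitude) | −128 to +127 | Turing (2018) |
| FP4 | 4 | 1 | 2 | 1 | 0 to ±6 | Experimental |
| NVFP4 | 4 (+ shared block scale) | 1 | 2 | 1 | 0 to ±6 before scaling | Blackwell (2024) |
| INT4 | 4 | 1 | n/a | 3 (magnitude) | −8 to +7 | Various |
| Q4 / Q4_K_M | ~4.5 | n/a | n/a | n/a | Blockwise (GGUF) | llama.cpp |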

Deep Dive: What Each Bit Layout Means

FP32 (IEEE 754 Single Precision) — The reference standard:

text
Bit layout: [S] [EEEEEEEE] [MMMMMMMMMMMMMMMMMMMMMMM]
            1b    8b                 23b

Precision: ~7 decimal digits
Dynamic range: 1.2×10⁻³⁸ to 3.4×10³⁸
Use: Training (gold standard); rarely needed for inference, where narrower formats are usually sufficient
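
To make the bit layout concrete, here is a minimal sketch that splits a value into its three FP32 fields using only the Python standard library. The helper name `fp32_fields` is made up for illustration.

```python
import struct

def fp32_fields(x: float) -> tuple[int, int, int]:
    """Round x to FP32 and return its (sign, exponent, mantissa) bit fields."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    sign = bits >> 31                   # 1 bit
    exponent = (bits >> 23) & 0xFF      # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF          # 23 bits, implicit leading 1 for normal values
    return sign, exponent, mantissa

sign, exponent, mantissa = fp32_fields(-6.5)
# -6.5 = -1.625 * 2**2, so the stored exponent is 2 + 127 = 129
print(sign, exponent, f"{mantissa:023b}")
```

The same shift-and-mask approach works for any of the narrower floating-point layouts below; only the field widths and the bias change.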

BF16 (Brain Float), same 8-bit exponent as FP32 with the mantissa cut from 23 bits to 7:

text
Bit layout: [S] [EEEEEEEE] [MMMMMMM]
            1b    8b          7b

Precision: ~2 decimal digits (less than FP16!)
Dynamic range: Same as FP32 (±3.4×10³⁸) ← KEY advantage
Use: Training (popular: TPUs, A100+, RTX 40xx)

# Why BF16 > FP16 for training:
# Same exponent as FP32 means BF16 can represent very large/small numbers
# A 7-bit mantissa (vs 23 in FP32) means less precision, but training is noise-tolerant
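
Because BF16 keeps the FP32 sign and exponent fields unchanged, a BF16 value is simply the upper 16 bits of the FP32 pattern. The sketch below illustrates that with round-to-nearest-even; the function names are hypothetical, and NaN inputs are not handled specially.

```python
import struct

def to_bf16_bits(x: float) -> int:
    """Round an FP32 value to BF16 (nearest, ties to even); return the 16-bit pattern."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Rounding bias: the discarded low 16 bits round to nearest, ties to even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF

def from_bf16_bits(b: int) -> float:
    """Widen a BF16 pattern to a float by zero-filling the low 16 bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

print(from_bf16_bits(to_bf16_bits(3.14159265)))  # 3.140625: only ~2-3 digits survive
print(from_bf16_bits(to_bf16_bits(1e38)))        # still finite: range matches FP32
```

FP16 would overflow on the second value, since its largest finite magnitude is 65,504.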

FP16 (IEEE Half Precision) — Narrow range, narrow precision:

text
Bit layout: [S] [EEEEE] [MMMMMMMMMM]
            1b   5b        10b

Precision: ~3 decimal digits
Dynamic range: ±6.10×10⁻⁵ (smallest normal) to ±65,504; subnormals extend down to ~6×10⁻⁸
Use: Inference, some training (requires loss scaling)

# Problem: FP16 exponent range is too small for gradients
# Gradients below 6.10×10⁻⁵ lose precision (subnormal), and below ~6×10⁻⁸ they underflow to zero in FP16!
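
The underflow problem and the loss-scaling workaround can be reproduced with the standard library's half-precision `struct` format (`"e"`). This is a minimal sketch: the gradient value and the scale of 1024 are hypothetical, and real frameworks usually adjust the scale dynamically.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    (y,) = struct.unpack("<e", struct.pack("<e", x))
    return y

grad = 1e-8            # hypothetical tiny gradient
loss_scale = 1024.0    # hypothetical power-of-two loss scale

unscaled = to_fp16(grad)                 # below the smallest FP16 subnormal: becomes 0.0
scaled = to_fp16(grad * loss_scale)      # survives as a subnormal FP16 value
recovered = scaled / loss_scale          # unscale in higher precision
print(unscaled, scaled, recovered)
```

Multiplying the loss by the scale shifts every gradient up by the same factor, so small values land inside the representable FP16 range and are divided back out before the weight update.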

FP8 (Hopper) — Two variants for different tasks:

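| Variant | Exponent | Mantissa | Range (smallest subnormal to max) | Precision | Use |
|---|---|---|---|---|---|
| E4M3 | 4 bits | 3 bits | ±0.00195 to ±448 | ~1 decimal digit | Forward pass (weights, activations) |
| E5M2 | 5 bits | 2 bits | ±0.000015 to ±57,344 | ~0.5 decimal digit | Backward pass (gradients need range) |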
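
The range limits in the table follow from the bit widths. The sketch below derives them, assuming the common FP8 conventions: E5M2 reserves its top exponent code for infinity/NaN like IEEE formats, while E4M3 reserves only the single all-ones pattern for NaN. The function name `fp8_limits` is hypothetical.

```python
def fp8_limits(exp_bits: int, man_bits: int, ieee_like: bool) -> tuple[float, float, float]:
    """Return (smallest subnormal, smallest normal, largest finite) magnitudes."""
    bias = 2 ** (exp_bits - 1) - 1
    min_subnormal = 2.0 ** (1 - bias - man_bits)
    min_normal = 2.0 ** (1 - bias)
    if ieee_like:
        # Top exponent code is reserved for inf/NaN (E5M2).
        max_exp = (2 ** exp_bits - 2) - bias
        max_significand = 2 - 2.0 ** (-man_bits)
    else:
        # Only the all-ones mantissa at the top exponent is NaN (E4M3).
        max_exp = (2 ** exp_bits - 1) - bias
        max_significand = 2 - 2.0 ** (1 - man_bits)
    return min_subnormal, min_normal, max_significand * 2.0 ** max_exp

print("E4M3:", fp8_limits(4, 3, ieee_like=False))
print("E5M2:", fp8_limits(5, 2, ieee_like=True))
```

Giving up one mantissa bit for one more exponent bit is what moves the maximum from 448 to 57,344, which is why E5M2 is the usual choice for gradients.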

NVIDIA FP4 / NVFP4 (E2M1) — The smallest floating format in production:

text
Bit layout: [S] [EE] [M]
            1b  2b   1b

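16 bit patterns, 15 distinct values (+0 and -0 are the same value, plus 7 positive and 7 negative magnitudes):
0, ±0.5, ±1.0, ±1.5, ±2.0, ±3.0, ±4.0, ±6.0

Key properties:
- Non-uniform spacing: dense near zero, sparse at extremes
- Plain FP4 (E2M1) tops out at ±6, so NVFP4 pairs each block of 16 values with an FP8 (E4M3) scale, plus one FP32 scale per tensor
- Used on Blackwell B200 Tensor Cores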
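
The value list above can be reproduced by decoding all E2M1 bit patterns. This is a minimal sketch assuming an exponent bias of 1 and subnormals at exponent code 0; the function names are hypothetical, and the tie-breaking in `nearest_e2m1` is arbitrary (hardware rounding may differ).

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if code & 0b1000 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 0b1
    if exponent == 0:                      # subnormal: no implicit leading 1
        return sign * (mantissa / 2)
    return sign * (1 + mantissa / 2) * 2.0 ** (exponent - 1)

positive = [decode_e2m1(c) for c in range(8)]
print(positive)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def nearest_e2m1(x: float) -> float:
    """Pick the closest representable value; inputs beyond +/-6 saturate."""
    return min((decode_e2m1(c) for c in range(16)), key=lambda v: abs(v - x))

print(nearest_e2m1(2.4), nearest_e2m1(100.0))  # 2.0 6.0
```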

MXFP8 (Microscaling FP8) — Block-based scaling:

text
Unlike regular FP8, which typically uses one scale per tensor, MXFP8 groups values into blocks of 32 elements.
Each block shares one 8-bit power-of-two scale factor (E8M0), and each element
is stored as E4M3 or E5M2 within the block.

Block of 32 values:
[scale_8bit] [v1_e4m3] [v2_e4m3] ... [v32_e4m3]

Benefit: Per-block scaling absorbs each block's overall magnitude, so an
         outlier only costs precision inside its own block rather than
         across the whole tensor.
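
The block layout above can be sketched in a few lines. This is an illustration under stated assumptions, not the reference algorithm: the shared scale is chosen as the smallest power of two that keeps the block maximum within E4M3's ±448, the scale exponent is not clamped to an 8-bit range, and all function names are hypothetical.

```python
import math

E4M3_MAX = 448.0

def round_to_e4m3(x: float) -> float:
    """Round to the nearest E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    _, e = math.frexp(a)               # a = m * 2**e with 0.5 <= m < 1
    exponent = max(e - 1, -6)          # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)       # 3 mantissa bits: 8 steps per power of two
    return math.copysign(min(round(a / step) * step, E4M3_MAX), x)

def quantize_block(block: list[float]) -> tuple[int, list[float]]:
    """Return (shared power-of-two exponent, E4M3 element values) for one block."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Assumption: smallest power-of-two scale that keeps amax within +/-448.
    shared_exp = math.ceil(math.log2(amax / E4M3_MAX))
    scale = 2.0 ** shared_exp
    return shared_exp, [round_to_e4m3(v / scale) for v in block]

def dequantize_block(shared_exp: int, elements: list[float]) -> list[float]:
    """Multiply each element back by the shared power-of-two scale."""
    return [e * 2.0 ** shared_exp for e in elements]

block = [0.001 * i for i in range(32)]     # hypothetical block of 32 small values
shared_exp, elements = quantize_block(block)
restored = dequantize_block(shared_exp, elements)
print(shared_exp, max(abs(r - x) for r, x in zip(restored, block)))
```

Without the shared scale, most of these values would fall into E4M3's subnormal range or round to zero; with it, every element keeps its full 3-bit mantissa.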

INT8 / INT4 — Pure integer quantization:

| Format | Bits | Values | Step Size |
|---|---|---|---|
| INT8 | 8 | -128 to +127 | Uniform (× scale) |
| INT4 | 4 | -8 to +7 | Uniform (× scale) |
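
The "× scale" in the table refers to a floating-point scale that maps the integer grid back to real values. Below is a minimal sketch of symmetric per-tensor quantization; the weights and function names are hypothetical, and production schemes often use per-channel or per-group scales plus calibration data.

```python
def quantize_symmetric(values: list[float], bits: int) -> tuple[float, list[int]]:
    """Symmetric per-tensor quantization to signed integers of the given width."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1   # e.g. -8..7 for INT4
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(qmin, min(qmax, round(v / scale))) for v in values]
    return scale, q

def dequantize(scale: float, q: list[int]) -> list[float]:
    """Map integer codes back to approximate real values."""
    return [scale * x for x in q]

weights = [0.02, -0.5, 0.31, 0.7, -0.07]    # hypothetical weights
for bits in (8, 4):
    scale, q = quantize_symmetric(weights, bits)
    print(bits, q, [round(w, 3) for w in dequantize(scale, q)])
```

With INT4 the step is about 0.1 here, so 0.02 collapses to zero, while INT8 keeps it as a small nonzero code. That uniform step is the main contrast with the non-uniform floating-point grids above.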

GGML/llama.cpp Q4_K_M — Blockwise quantization for local LLMs:

text
GGML naming convention:
Q = Quantized
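4 = nominal 4-bit weights (roughly 4.5 or more bits per weight once block scales are counted)
K = "K-quant" (super-blocks of 256 weights with quantized sub-block scales)
M = Medium (keeps some tensors at higher precision; sits between S and L in size)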

Other variants:
Q4_0    - Legacy 4-bit (groups of 32, fp16 scale)
Q4_K_S  - Small (smaller file, slightly lower quality)
Q4_K_M  - Medium (recommended default)
Q5_K_M  - 5-bit (higher quality, ~20% larger)
Q8_0    - 8-bit (highest quality, ~2x larger than Q4)
IQ4_XS  - "i-quant" 4-bit (~4.25 bits/weight, non-uniform levels, usually built with an importance matrix)
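
The reason a "4-bit" format costs more than 4 bits per weight is the per-block scale. A minimal sketch of the arithmetic for the simple one-scale-per-block layouts, using the Q4_0 description above (groups of 32, fp16 scale) and assuming Q8_0 uses the same block size and scale type; the function name is hypothetical.

```python
def bits_per_weight(block_size: int, weight_bits: int, scale_bits: int) -> float:
    """Average storage per weight for a block format with one shared scale per block."""
    return (block_size * weight_bits + scale_bits) / block_size

print(bits_per_weight(32, 4, 16))   # Q4_0-style block: 4.5
print(bits_per_weight(32, 8, 16))   # Q8_0-style block (assumed layout): 8.5
```

K-quants use a more elaborate nested layout, so this formula does not apply to them directly, but the same idea (payload bits plus amortized metadata) explains their fractional bit counts.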

Visual: Precision vs Range Trade-off

text
Precision (decimal digits)
    ^
  7 |                                          ● FP32
    |
  3 |                         ● FP16
    |
  2 |                                          ● BF16
    |                                            (FP32 range, lower precision)
  1 |                ● FP8 E4M3
    |
0.5 |      ● NVFP4
    |
    +------------------------------------------------> Largest finite value (log scale)
          10¹       10³       10⁵                10³⁸
    INT4: 16 uniformly spaced levels; range is set entirely by the scale factor
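
The rounded precision figures can be approximated from the mantissa widths alone. The sketch below uses a rule of thumb (each stored mantissa bit is worth log10(2) ≈ 0.301 decimal digits); this is an approximation chosen for illustration, and counting the implicit leading bit would give slightly higher numbers.

```python
import math

MANTISSA_BITS = {"FP32": 23, "FP16": 10, "BF16": 7,
                 "FP8 E4M3": 3, "FP8 E5M2": 2, "NVFP4": 1}

def decimal_digits(mantissa_bits: int) -> float:
    """Rule of thumb: stored mantissa bits times log10(2)."""
    return mantissa_bits * math.log10(2)

for name, bits in MANTISSA_BITS.items():
    print(f"{name:9s} {decimal_digits(bits):.1f} digits")
```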

The Quantization Pipeline in Practice

Quick Memory Guide Per Billion Parameters

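| Format | Bits/Weight | GB per 1B params | 7B model | 70B model |
|---|---|---|---|---|
| FP32 | 32 | 4.0 GB | 28 GB | 280 GB |
| FP16 / BF16 | 16 | 2.0 GB | 14 GB | 140 GB |
| INT8 | 8 | 1.0 GB | 7 GB | 70 GB |
| FP8 | 8 | 1.0 GB | 7 GB | 70 GB |
| Q4_K_M | ~4.5 | ~0.56 GB | ~4 GB | ~40 GB |
| INT4 / NVFP4 (elements only, excluding scales) | 4 | 0.5 GB | 3.5 GB | 35 GB |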
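
The table is straightforward arithmetic: parameters × bits per weight ÷ 8 gives bytes. A minimal sketch, using decimal gigabytes (10⁹ bytes) and counting weights only; KV cache, activations, and runtime overhead come on top. The function name is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes (10**9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9

formats = [("FP32", 32), ("BF16", 16), ("INT8", 8), ("Q4_K_M", 4.5), ("INT4", 4)]
for name, bits in formats:
    print(f"{name:7s} 7B: {weight_memory_gb(7e9, bits):5.1f} GB   "
          f"70B: {weight_memory_gb(70e9, bits):6.1f} GB")
```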

Summary: When to Use Each Format

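| Format | Training | Inference | Key Reason |
|---|---|---|---|
| FP32 | Gold standard (master weights) | Rarely needed | Largest footprint, more precision than most inference needs |
| BF16 | Recommended (A100+, H100, RTX 40xx) | Good | Same range as FP32, half the size |
| FP16 | Possible (needs loss scaling) | Good | Wide hardware support |
| TF32 | Drop-in on Ampere+ for FP32 matmuls (some frameworks require opting in) | N/A | Up to ~8x peak matmul throughput vs FP32 on A100, same range |
| FP8 | Emerging (H100, B200) | Emerging | 2x smaller than FP16, native on Hopper/Blackwell |
| MXFP8 | Future (Blackwell) | Future | Per-block scaling tolerates outliers better than per-tensor FP8 scaling |
| INT8 | Rare (QAT) | Standard | 4x smaller than FP32, works on most GPUs |
| INT4 | No | Common | 8x smaller than FP32, needs calibration |
| NVFP4 | No | Blackwell-native | ~7x smaller than FP32 including scales, non-uniform grid with per-block FP8 scales |
| Q4_K_M | No | Local LLMs (llama.cpp) | Good quality/size trade-off for CPUs/Macs |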

The TL;DR: FP32 = training reference. BF16 = training standard (same range as FP32). INT8/INT4 = inference compression. FP8/NVFP4/MXFP8 = the future frontier. Q4_K_M = what you use to run Llama on your laptop.

Learn more at NVIDIA FP8 Training Whitepaper and GGML Quantization Types.