What are the various number formats used in AI models, and what do abbreviations like FP32, BF16, MXFP8, NVFP4, INT4, and Q4 mean?

Question

Accepted Answer

## AI Number Formats: From FP32 to NVFP4

A **number format** defines how bits represent a numerical value — how many bits for the sign, exponent, and mantissa. This directly determines precision, dynamic range, memory usage, and inference speed.

### Format Family Overview

```
                          AI Number Formats
                  _____________|_____________
                 |                           |
            Floating-Point                  Integer
         ________|_________              _____|_____
        |                  |            |           |
    Wide (32-bit)     Narrow (4-16 bit)  Signed       Unsigned
    FP32, TF32         FP16, BF16        INT8, INT4    UINT8
                       FP8, MXFP8
                       NVFP4, FP4
```

### The Complete Format Table

| Format | Bits | Sign | Exponent | Mantissa | Range | In Use Since |
|--------|------|------|----------|----------|--------|-------------|
| **FP32** | 32 | 1 | 8 | 23 | ±3.4×10³⁸ | Always (training) |
| **TF32** | 19 | 1 | 8 | 10 | ±3.4×10³⁸ | Ampere (2020) |
| **FP16** | 16 | 1 | 5 | 10 | ±65,504 | Volta (2017) |
| **BF16** | 16 | 1 | 8 | 7 | ±3.4×10³⁸ | TPUv2 (2018) |
| **FP8 E4M3** | 8 | 1 | 4 | 3 | ±448 | Hopper (2022) |
| **FP8 E5M2** | 8 | 1 | 5 | 2 | ±57,344 | Hopper (2022) |
| **MXFP8** | 8 per element + one shared 8-bit scale per 32-element block | 1 | 4 or 5 | 3 or 2 | Per-block scaled | Blackwell (2024) |
| **INT8** | 8 | 1 | — | 7 | −128 to +127 | Turing (2018) |
| **FP4** | 4 | 1 | 2 | 1 | ±0 to ±6 | Experimental |
| **NVFP4** | 4 per element + one shared FP8 scale per 16-element block | 1 | 2 | 1 | ±0 to ±6 (before block scaling) | Blackwell (2024) |
| **INT4** | 4 | 1 | — | 3 | −8 to +7 | Various |
| **Q4 / Q4_K_M** | ~4.5 | — | — | — | Blockwise (GGUF) | llama.cpp |

### Deep Dive: What Each Bit Layout Means

**FP32 (IEEE 754 Single Precision)** — The reference standard:

```
Bit layout: [S] [EEEEEEEE] [MMMMMMMMMMMMMMMMMMMMMMM]
            1b    8b                 23b

Precision: ~7 decimal digits
Dynamic range: 1.2×10⁻³⁸ to 3.4×10³⁸
Use: Training (gold standard); rarely needed for inference
```
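To make the layout concrete, here is a minimal sketch that splits an FP32 value into its three bit fields using only Python's standard library. The helper names `fp32_fields` and `fp32_from_fields` are made up for this illustration, and the rebuild formula only covers normal numbers (not zero, subnormals, infinities, or NaN).

```python
import struct

def fp32_fields(x: float) -> tuple:
    """Return (sign, biased_exponent, mantissa) of x stored as FP32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF        # 23 bits, implicit leading 1 for normals
    return sign, exponent, mantissa

def fp32_from_fields(sign: int, exponent: int, mantissa: int) -> float:
    """Rebuild a normal FP32 value: (-1)^s * 1.m * 2^(e - 127)."""
    return (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)

print(fp32_fields(-6.5))                    # (1, 129, 5242880)
print(fp32_from_fields(1, 129, 5242880))    # -6.5
```

Here -6.5 is stored as -1.625 × 2², so the biased exponent is 2 + 127 = 129 and the mantissa field holds the fractional part 0.625.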

**BF16 (Brain Float)**: same exponent range as FP32, with 7 mantissa bits instead of 23.

```
Bit layout: [S] [EEEEEEEE] [MMMMMMM]
            1b    8b          7b

Precision: ~2 decimal digits (less than FP16!)
Dynamic range: Same as FP32 (±3.4×10³⁸) ← KEY advantage
Use: Training (popular: TPUs, A100+, RTX 40xx)

# Why BF16 is usually preferred over FP16 for training:
# The same 8-bit exponent as FP32 lets BF16 represent very large/small numbers
# The 7-bit mantissa (vs. 23 in FP32) means less precision, but training is noise-tolerant
```
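Because BF16 is simply the top 16 bits of an FP32 value, the conversion can be sketched with plain bit manipulation. The function names below are hypothetical, the rounding mode shown is round-to-nearest-even, and NaN inputs are not handled in this sketch.

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Round an FP32 value to BF16 and return the 16-bit pattern.
    Sketch only: NaN inputs are not handled."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    # Round to nearest, ties to even: add 0x7FFF plus the lowest kept bit
    bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF

def bf16_bits_to_float(b: int) -> float:
    """Widen a BF16 bit pattern back to a float by zero-filling the low 16 bits."""
    (x,) = struct.unpack(">f", struct.pack(">I", b << 16))
    return x

print(bf16_bits_to_float(fp32_to_bf16_bits(3.14159)))   # 3.140625
print(bf16_bits_to_float(fp32_to_bf16_bits(1e38)))      # still finite, within ~1% of 1e38
```

The first line shows the precision loss (only about 2 to 3 significant digits survive), while the second shows that a value far beyond FP16's maximum of 65,504 is still representable.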

**FP16 (IEEE Half Precision)** — Narrow range, narrow precision:

```
Bit layout: [S] [EEEEE] [MMMMMMMMMM]
            1b   5b        10b

Precision: ~3 decimal digits
Dynamic range: ±6.10×10⁻⁵ (smallest normal) to ±65,504
Use: Inference, some training (requires loss scaling)

# Problem: FP16 exponent range is too small for gradients
# Gradients below 6.10×10⁻⁵ become subnormal and lose precision;
# below about 6×10⁻⁸ they underflow to zero with FP16!
```
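The underflow problem and the loss-scaling workaround can be reproduced with Python's `struct` module, which supports IEEE half precision through the `"e"` format code. The gradient value and the scale factor of 1024 below are arbitrary numbers chosen for illustration, and `to_fp16` is a helper invented for this sketch.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value.
    struct raises OverflowError if x is too large for FP16."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8                  # hypothetical tiny gradient
loss_scale = 1024.0          # hypothetical scale factor (a power of two)

naive = to_fp16(grad)                  # too small for FP16: becomes 0.0
scaled = to_fp16(grad * loss_scale)    # scaled gradient survives the cast
recovered = scaled / loss_scale        # unscale in higher precision

print(naive)       # 0.0
print(recovered)   # close to 1e-8
```

In real mixed-precision training the loss is multiplied by the scale before the backward pass, and the gradients are divided by it before the optimizer step; the arithmetic is the same as in this sketch.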

**FP8 (Hopper)** — Two variants for different tasks:

| Variant | Exponent | Mantissa | Range (smallest subnormal to max) | Precision | Use |
|---------|----------|----------|-------|-----------|-----|
| **E4M3** | 4 bits | 3 bits | ±0.00195 to ±448 | ~1 decimal digit | Forward pass (weights, activations) |
| **E5M2** | 5 bits | 2 bits | ±0.000015 to ±57,344 | ~0.5 decimal digit | Backward pass (gradients need range) |
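
The ranges in this table can be checked by decoding every 8-bit pattern. The sketch below assumes the commonly used conventions: E4M3 with bias 7 and no infinities (only the all-ones pattern is NaN), and E5M2 with bias 15 and IEEE-style Inf/NaN in the top exponent. The function names are made up for this example.

```python
def decode_minifloat(code: int, exp_bits: int, man_bits: int, bias: int) -> float:
    """Decode a sign/exponent/mantissa bit pattern (special values not handled)."""
    sign = -1.0 if (code >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (code >> man_bits) & ((1 << exp_bits) - 1)
    man = code & ((1 << man_bits) - 1)
    if exp == 0:   # subnormal: no implicit leading 1
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

def finite_values_e4m3() -> list:
    # Assumption: exponent 1111 with mantissa 111 is NaN, everything else is finite
    return [decode_minifloat(c, 4, 3, 7) for c in range(256) if (c & 0x7F) != 0x7F]

def finite_values_e5m2() -> list:
    # Assumption: exponent 11111 is reserved for Inf/NaN, as in IEEE formats
    return [decode_minifloat(c, 5, 2, 15) for c in range(256) if ((c >> 2) & 0x1F) != 0x1F]

for name, values in [("E4M3", finite_values_e4m3()), ("E5M2", finite_values_e5m2())]:
    smallest = min(v for v in values if v > 0)
    print(name, "smallest positive:", smallest, "max:", max(values))
```

E4M3 spends its extra mantissa bit on finer steps, while E5M2 spends its extra exponent bit on a maximum that is 128 times larger.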

**NVIDIA FP4 / NVFP4 (E2M1)** — The smallest floating format in production:

```
Bit layout: [S] [EE] [M]
            1b  2b   1b

16 encodings = 8 magnitudes × 2 signs (15 distinct values, since +0 and −0 coincide):
0, ±0.5, ±1.0, ±1.5, ±2.0, ±3.0, ±4.0, ±6.0

Key properties:
- Non-uniform spacing: dense near zero, sparse at extremes
- NVFP4 pairs these 4-bit elements with a shared FP8 (E4M3) scale per 16-element block, plus a per-tensor FP32 scale
- Used on Blackwell B200 Tensor Cores
```
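The E2M1 value grid listed above follows directly from the bit layout. Here is a small sketch that decodes all 4-bit codes (exponent bias 1) and rounds an arbitrary number onto the grid. `decode_e2m1` and `round_to_e2m1` are illustrative names, and the tie-breaking rule is an arbitrary choice for this sketch rather than what any particular hardware does.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 pattern: 1 sign, 2 exponent, 1 mantissa bit, bias 1."""
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:                       # subnormal: 0 or 0.5
        return sign * man * 0.5
    return sign * (1 + man / 2) * 2.0 ** (exp - 1)

magnitudes = [decode_e2m1(c) for c in range(8)]
print(magnitudes)   # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def round_to_e2m1(x: float) -> float:
    """Nearest grid value; ties go to the smaller magnitude in this sketch.
    Inputs beyond ±6 saturate to ±6."""
    nearest = min(magnitudes, key=lambda m: abs(abs(x) - m))
    return nearest if x >= 0 else -nearest

print(round_to_e2m1(5.2), round_to_e2m1(-0.7))   # 6.0 -0.5
```

The step between neighbours grows from 0.5 near zero to 2.0 at the top of the range, which is the non-uniform spacing mentioned above.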

**MXFP8 (Microscaling FP8)** — Block-based scaling:

```
Unlike regular FP8, MXFP8 groups values into blocks of 32 elements.
Each block gets a shared 8-bit power-of-two scale factor (E8M0), then each
element uses E4M3 or E5M2 within the block.

Block of 32 values:
[scale_8bit] [v1_e4m3] [v2_e4m3] ... [v32_e4m3]

Benefit: The shared scale sets the range for each block, so an outlier
         only costs precision inside its own block instead of across
         the whole tensor.
```
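As a rough sketch of the block-scaling idea, the code below picks a power-of-two scale for one block from its largest magnitude and then saturates the scaled values to the E4M3 range. The scale rule (floor of log2 of the block maximum, minus the exponent of the largest E4M3 value) is one common choice and is an assumption here, as are the function names; rounding each element to an actual E4M3 code is left out.

```python
import math

E4M3_MAX = 448.0    # largest finite E4M3 magnitude
E4M3_EMAX = 8       # exponent of that value (448 = 1.75 * 2**8)

def mx_block_scale(block: list) -> float:
    """Pick a shared power-of-two scale for one block (assumed rule, see text)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0
    return 2.0 ** (math.floor(math.log2(amax)) - E4M3_EMAX)

def scale_block(block: list) -> tuple:
    """Divide by the shared scale and saturate to the E4M3 range.
    Per-element rounding to E4M3 is omitted in this sketch."""
    scale = mx_block_scale(block)
    scaled = [max(-E4M3_MAX, min(E4M3_MAX, v / scale)) for v in block]
    return scale, scaled

block = [0.001 * (i + 1) for i in range(32)]   # made-up block of 32 small values
scale, scaled = scale_block(block)
print(scale, max(scaled))
```

With this rule the largest element of a block always lands between 256 and 512 before saturation, so small-valued blocks use the upper part of the E4M3 range instead of collapsing toward zero.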

**INT8 / INT4** — Pure integer quantization:

| Format | Bits | Values | Step Size |
|--------|------|--------|-----------|
| **INT8** | 8 | -128 to +127 | Uniform (× scale) |
| **INT4** | 4 | -8 to +7 | Uniform (× scale) |
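
The "Uniform (× scale)" step size means each stored integer is multiplied by one floating-point scale to recover an approximate weight. A minimal sketch of symmetric per-tensor quantization follows; the weights are made up and the function names are illustrative, and real toolchains typically use per-channel or per-group scales and calibration data instead.

```python
def quantize_symmetric(values: list, bits: int) -> tuple:
    """Symmetric quantization to signed integers of the given bit width."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(qmin, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q: list, scale: float) -> list:
    """Map the stored integers back to approximate real values."""
    return [qi * scale for qi in q]

weights = [0.02, -0.5, 0.31, 1.27]      # made-up example weights
q8, s8 = quantize_symmetric(weights, 8)
q4, s4 = quantize_symmetric(weights, 4)
print(q8)   # [2, -50, 31, 127]
print(q4)   # [0, -3, 2, 7]
```

With only 16 levels, INT4 rounds the small weight 0.02 all the way to zero, which is why 4-bit schemes usually rely on small groups with their own scales.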

**GGML/llama.cpp Q4_K_M** — Blockwise quantization for local LLMs:

```
GGML naming convention:
Q = Quantized
4 = roughly 4 bits per weight
K = "K-quant" (super-blocks of 256 weights with quantized per-block scales)
M = Medium mix (some tensors kept at higher precision than in the S mix)

Other variants:
Q4_0    — Legacy 4-bit (groups of 32, fp16 scale)
Q4_K_S  — Small (faster, slightly lower quality)
Q4_K_M  — Medium (recommended default)
Q5_K_M  — 5-bit (higher quality, ~20% larger)
Q8_0    — 8-bit (highest quality, ~2x larger than Q4)
IQ4_XS  — Importance-matrix 4-bit (best quality for size)
```
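The fractional bit counts come from amortizing per-block metadata over the weights in the block. The arithmetic can be sketched as below. The Q4_0 layout is the one stated above (32 weights plus an fp16 scale); the Q8_0 and Q4_K layouts used here are assumptions based on my reading of the llama.cpp block structures, and `bits_per_weight` is a hypothetical helper.

```python
def bits_per_weight(n_weights: int, weight_bits: int, overhead_bits: int) -> float:
    """Effective storage cost when a block of weights shares metadata."""
    return (n_weights * weight_bits + overhead_bits) / n_weights

# Q4_0: 32 weights x 4 bits + one fp16 scale
q4_0 = bits_per_weight(32, 4, 16)

# Q8_0 (assumed layout): 32 weights x 8 bits + one fp16 scale
q8_0 = bits_per_weight(32, 8, 16)

# Q4_K (assumed layout): 256 weights x 4 bits, 8 sub-blocks each with a
# 6-bit scale and 6-bit minimum, plus two fp16 super-block factors
q4_k = bits_per_weight(256, 4, 8 * (6 + 6) + 2 * 16)

print(q4_0, q8_0, q4_k)   # 4.5 8.5 4.5
```

Mixes such as Q4_K_M store some tensors in a higher-precision type, so the file-level average ends up somewhat above the per-block figure.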

### Visual: Precision vs Range Trade-off

```
Precision (decimal digits)
    ^
  7 |  FP32 ●
    |
  3 |  FP16 ●
    |
  2 |        ● BF16
    |            (better range, lower precision)
  1 |                  ● FP8 E4M3
    |
0.5 |                              ● NVFP4
    |
    +------------------------------------------> Dynamic Range (log scale)
    10⁻⁵    10⁻³    10⁻¹    10¹     10³     10⁵     10³⁸
                        INT4 ──────> (16 values, uniform)
```

### The Quantization Pipeline in Practice

```mermaid
graph LR
    T[Training: FP32 weights] --> Q1[Post-Training Quantization]
    Q1 --> F16[FP16 - 2x smaller]
    Q1 --> BF16[BF16 - 2x smaller, same range]
    Q1 --> INT8[INT8 - 4x smaller]
    Q1 --> INT4[INT4 - 8x smaller]
    Q1 --> NVFP4[NVFP4 - 8x smaller, non-uniform]
    INT8 --> Calib[Calibration needed]
    INT4 --> Calib
    NVFP4 --> Native[Native on Blackwell GPUs]
```

### Quick Memory Guide Per Billion Parameters

| Format | Bits/Weight | GB per 1B params | 7B model | 70B model |
|--------|------------|-----------------|----------|-----------|
| FP32 | 32 | 4.0 GB | 28 GB | 280 GB |
| FP16 / BF16 | 16 | 2.0 GB | 14 GB | 140 GB |
| INT8 | 8 | 1.0 GB | 7 GB | 70 GB |
| FP8 | 8 | 1.0 GB | 7 GB | 70 GB |
| Q4_K_M | ~4.5 | ~0.56 GB | ~4 GB | ~40 GB |
| INT4 / NVFP4 | 4 (plus scale overhead) | ~0.5 GB | ~3.5 GB | ~35 GB |
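
The figures in this table are straightforward arithmetic: parameters × bits per weight ÷ 8 gives bytes. A small sketch, assuming decimal gigabytes (1 GB = 10⁹ bytes) and counting weights only, with no KV cache, activations, or runtime overhead:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage only, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

formats = [("FP32", 32), ("BF16", 16), ("INT8", 8), ("Q4_K_M", 4.5), ("INT4", 4)]
for name, bits in formats:
    gb_7b = weight_memory_gb(7e9, bits)
    gb_70b = weight_memory_gb(70e9, bits)
    print(f"{name:7s} 7B: {gb_7b:6.2f} GB   70B: {gb_70b:7.2f} GB")
```

Actual memory use at inference time is higher than these numbers, since the runtime also needs room for the KV cache and activations.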

### Summary: When to Use Each Format

| Format | Training | Inference | Key Reason |
|--------|----------|-----------|------------|
| **FP32** | Gold standard (master weights) | Rarely | Too large, unnecessary precision for most inference |
| **BF16** | Recommended (A100+, H100, RTX 40xx) | Good | Same range as FP32, half the size |
| **FP16** | Possible (needs loss scaling) | Good | Wide hardware support |
| **TF32** | Tensor Core mode for FP32 code on Ampere+ (framework defaults vary) | N/A | Up to ~8x peak matmul throughput vs. FP32, same range |
| **FP8** | Emerging (H100, B200) | Emerging | Half the size of FP16, native on Hopper/Blackwell |
| **MXFP8** | Future (Blackwell) | Future | Per-block scaling outperforms plain FP8 |
| **INT8** | Rare (QAT) | Standard | 4x smaller, works on most GPUs |
| **INT4** | No | Common | 8x smaller, needs calibration |
| **NVFP4** | No | Blackwell-native | ~8x smaller than FP32, non-uniform grid, per-block FP8 scales |
| **Q4_K_M** | No | Local LLMs (llama.cpp) | Best quality/size trade-off for CPUs/Macs |

> **The TL;DR:** FP32 = training reference. BF16 = training standard (same range as FP32). INT8/INT4 = inference compression. FP8/NVFP4/MXFP8 = the future frontier. Q4_K_M = what you use to run Llama on your laptop.

Learn more in [FP8 Formats for Deep Learning](https://arxiv.org/abs/2209.05433) and the [llama.cpp k-quants pull request](https://github.com/ggerganov/llama.cpp/pull/1684).
