How do INT4 and NVFP4 each split their 4 bits, and what does that mean for precision distribution, activation handling, and the W4A16 vs W4A4 trade-off?

Question

Accepted Answer

## INT4 vs NVFP4: Bit Layout, Precision Distribution, and Activation Handling

The core difference between INT4 and NVFP4 starts with how each format allocates its 4 bits. That allocation fixes the set of representable values and how they are spaced. NVFP4 additionally pairs its 4-bit E2M1 elements with a shared FP8 scale per 16-element block, which matters for the comparisons that follow.

### How the 4 Bits Are Split

```
INT4 (signed integer, two's complement):
┌───┬───┬───┬───┐
│ S │ V │ V │ V │   S = sign bit (weight -8), V = value bits (weights 4, 2, 1)
└───┴───┴───┴───┘
16 values: -8 to +7

NVFP4 element (E2M1: 2 exponent bits, 1 mantissa bit):
┌───┬───┬───┬───┐
│ S │ E │ E │ M │   S = sign, E = exponent, M = mantissa
└───┴───┴───┴───┘
1 sign + 2 exponent + 1 mantissa = 16 bit patterns (non-uniform; +0 and -0 are separate patterns, so 15 distinct values)
```

### Complete Value Tables

**INT4 representable values (with scale factor s):**

| Bit Pattern | Decimal | With scale s=0.1 |
|------------|---------|-------------------|
| 0111 | +7 | +0.7 |
| 0110 | +6 | +0.6 |
| 0101 | +5 | +0.5 |
| 0100 | +4 | +0.4 |
| 0011 | +3 | +0.3 |
| 0010 | +2 | +0.2 |
| 0001 | +1 | +0.1 |
| 0000 | 0 | 0.0 |
| 1111 | -1 | -0.1 |
| 1110 | -2 | -0.2 |
| 1101 | -3 | -0.3 |
| 1100 | -4 | -0.4 |
| 1011 | -5 | -0.5 |
| 1010 | -6 | -0.6 |
| 1001 | -7 | -0.7 |
| 1000 | -8 | -0.8 |

Step size is **always 1.0 × s** — perfectly uniform.

**NVFP4 element values (E2M1 format, exponent bias = 1, normal + subnormal), before the block scale is applied:**

| Bit Pattern | Exponent Field | Mantissa | Exact Value |
|------------|----------|----------|-------------|
| 0 00 0 | 0 (subnormal) | 0 | 0.0 |
| 0 00 1 | 0 (subnormal) | 1 | 0.5 |
| 0 01 0 | 1 | 0 | 1.0 |
| 0 01 1 | 1 | 1 | 1.5 |
| 0 10 0 | 2 | 0 | 2.0 |
| 0 10 1 | 2 | 1 | 3.0 |
| 0 11 0 | 3 | 0 | 4.0 |
| 0 11 1 | 3 | 1 | 6.0 |
| 1 00 0 | 0 (subnormal) | 0 | -0.0 |
| 1 00 1 | 0 (subnormal) | 1 | -0.5 |
| 1 01 0 | 1 | 0 | -1.0 |
| 1 01 1 | 1 | 1 | -1.5 |
| 1 10 0 | 2 | 0 | -2.0 |
| 1 10 1 | 2 | 1 | -3.0 |
| 1 11 0 | 3 | 0 | -4.0 |
| 1 11 1 | 3 | 1 | -6.0 |

Step size **varies** — tight near zero, wide at extremes.
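
Both tables can be reproduced by decoding the 16 bit patterns directly. The sketch below assumes two's complement for INT4 and an exponent bias of 1 for E2M1; the function names `decode_int4` and `decode_e2m1` are illustrative, not part of any library.

```python
def decode_int4(bits: int) -> int:
    """Interpret a 4-bit pattern as a two's-complement signed integer."""
    bits &= 0xF
    return bits - 16 if bits & 0x8 else bits


def decode_e2m1(bits: int) -> float:
    """Interpret a 4-bit pattern as E2M1 (1 sign, 2 exponent, 1 mantissa, bias 1)."""
    bits &= 0xF
    sign = -1.0 if bits & 0x8 else 1.0
    exponent = (bits >> 1) & 0x3
    mantissa = bits & 0x1
    if exponent == 0:
        # Subnormal: no implicit leading 1, so the only nonzero magnitude is 0.5.
        magnitude = 0.5 * mantissa
    else:
        # Normal: implicit leading 1 plus one fractional bit worth 0.5.
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude


for pattern in range(16):
    print(f"{pattern:04b}  INT4={decode_int4(pattern):+d}  E2M1={decode_e2m1(pattern):+.1f}")
```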

### What the Bit Split Means for Precision Distribution

```
INT4 precision: all values equally spaced
Distance between values: 1.0 * scale (constant)
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│-8│-7│-6│-5│-4│-3│-2│-1│ 0│ 1│ 2│ 3│ 4│ 5│ 6│ 7│
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
  Gap = 1 * scale everywhere

NVFP4 (E2M1) precision: dense near zero, sparse at extremes
Distance between values: increases with magnitude
(positive half shown; the negative half mirrors it)

Values:  0    0.5   1.0   1.5   2.0      3.0      4.0            6.0
         ├─────┼─────┼─────┼─────┼────────┼────────┼──────────────┤
Gap:       0.5   0.5   0.5   0.5    1.0      1.0         2.0
```
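
The spacing can also be checked numerically. This is a small sketch using NumPy; the variable names are illustrative. It prints the absolute gaps for both grids and the smallest nonzero magnitude as a fraction of the largest, which is one way to quantify "dense near zero".

```python
import numpy as np

int4_grid = np.arange(-8, 8, dtype=np.float64)
e2m1_grid = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # positive half

int4_gaps = np.diff(int4_grid)   # constant spacing
e2m1_gaps = np.diff(e2m1_grid)   # spacing grows with magnitude

# Smallest nonzero magnitude relative to the largest positive magnitude.
int4_smallest = 1.0 / 7.0
e2m1_smallest = e2m1_grid[1] / e2m1_grid[-1]

print("INT4 gaps:", int4_gaps)
print("E2M1 gaps:", e2m1_gaps)
print(f"smallest/largest: INT4 {int4_smallest:.3f}, E2M1 {e2m1_smallest:.3f}")
```

With the same full-scale value, the smallest nonzero E2M1 magnitude sits at 1/12 of the maximum versus 1/7 for INT4, at the cost of a 2.0-wide gap at the top of the range.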

### Numerical Example: Quantizing an LLM Weight

```python
import numpy as np

# Simulated LLM weight channel (one output channel across 8 input features)
weights = np.array([0.03, -0.07, 0.12, -0.01, 0.05, -0.15, 1.82, -2.34])
abs_max = np.max(np.abs(weights))

# --- INT4 (uniform grid, scale = abs_max / 7) ---
scale_int4 = abs_max / 7  # 2.34 / 7 ≈ 0.334
int4_quantized = np.round(weights / scale_int4).clip(-8, 7)
int4_dequant = int4_quantized * scale_int4
int4_error = np.mean((weights - int4_dequant) ** 2)

print("INT4 quantized values:", int4_quantized)
# codes: 0, 0, 0, 0, 0, 0, 5, -7 (zeros from negative inputs may print as -0.)
print("INT4 dequantized:", int4_dequant.round(4))
# ≈ 0, 0, 0, 0, 0, 0, 1.67, -2.34
print("INT4 MSE:", round(int4_error, 6))
# Small weights [0.03, -0.07, 0.12, -0.01, 0.05, -0.15] ALL collapse to 0!

# --- FP4 E2M1 (non-uniform grid, scale = abs_max / 6) ---
# Real NVFP4 stores one scale per 16-element block; a single scale is used
# here because the example only has 8 weights.
e2m1_grid = np.array([0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
scale_fp4 = abs_max / 6  # 2.34 / 6 = 0.39
scaled = weights / scale_fp4
# Index of the nearest grid magnitude for each scaled weight
nearest = np.abs(np.abs(scaled)[:, None] - e2m1_grid).argmin(axis=1)
fp4_quantized = np.sign(scaled) * e2m1_grid[nearest]
fp4_dequant = fp4_quantized * scale_fp4
fp4_error = np.mean((weights - fp4_dequant) ** 2)

print("FP4 grid values:", fp4_quantized)
# grid values: 0, 0, 0.5, 0, 0, -0.5, 4, -6
print("FP4 dequantized:", fp4_dequant.round(4))
# ≈ 0, 0, 0.195, 0, 0, -0.195, 1.56, -2.34
print("FP4 MSE:", round(fp4_error, 6))
# 0.12 and -0.15 survive as ±0.195 instead of collapsing to 0,
# but 1.82 lands on 1.56, a coarser fit than INT4's 1.67.
```

### The Problem INT4's Uniform Step Causes

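When a channel contains one large outlier weight (2.34), the scale factor `s` becomes large (0.334). The quantization step is then 0.334, so any weight with magnitude below half a step (0.167) rounds to zero. In this example that wipes out all six small weights.

NVFP4 softens this in two ways. First, the E2M1 grid is non-uniform: in grid units the step is 0.5 up to 2.0, 1.0 between 2.0 and 4.0, and 2.0 between 4.0 and 6.0, so resolution is finest near zero and coarsest near the block maximum. Second, NVFP4 still relies on scaling, but at fine granularity: each block of 16 elements shares an FP8 (E4M3) scale, with an additional FP32 scale per tensor, so an outlier only degrades the other 15 values in its own block. The price is coarser rounding near the top of the range, where neighbouring grid values are 2.0 apart.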
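
The effect of scale granularity can be sketched in NumPy. NVFP4 pairs its E2M1 elements with one scale per 16-element block; the sketch below keeps those scales in float64 rather than FP8 (E4M3) and omits the per-tensor FP32 scale, so it is a simplification, and `fake_quant_fp4` is an illustrative helper rather than a library function.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_fp4(w, block_size):
    """Quantize then dequantize a 1-D array, one absmax scale per block."""
    blocks = w.reshape(-1, block_size)
    # Map each block's largest magnitude onto the top grid value, 6.0.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 6.0
    scales = np.where(scales == 0.0, 1.0, scales)  # avoid dividing by zero
    scaled = blocks / scales
    nearest = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    return (np.sign(scaled) * E2M1_GRID[nearest] * scales).reshape(w.shape)


# Two blocks of 16 small weights; the second block also holds one outlier.
quiet_block = np.linspace(-0.15, 0.15, 16)
outlier_block = quiet_block.copy()
outlier_block[5] = 2.34
weights = np.concatenate([quiet_block, outlier_block])

one_scale = fake_quant_fp4(weights, block_size=32)  # outlier sets the scale for all 32
per_block = fake_quant_fp4(weights, block_size=16)  # outlier only affects its own block

mse_one_scale = np.mean((quiet_block - one_scale[:16]) ** 2)
mse_per_block = np.mean((quiet_block - per_block[:16]) ** 2)
print(f"quiet-block MSE, single scale:    {mse_one_scale:.6f}")
print(f"quiet-block MSE, per-block scale: {mse_per_block:.6f}")
```

The quiet block is reconstructed far more accurately when it has its own scale, because the outlier no longer stretches its grid.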

### W4A16 vs W4A4 — Activation Precision Handling

The notation **WxAy** means Weights in x-bit, Activations in y-bit precision.

| Scheme | Weights | Activations | Meaning |
|--------|---------|-------------|---------|
| **W4A16 (INT4)** | 4-bit (INT4) | 16-bit (FP16/BF16) | Only weights are quantized; activations remain full precision |
| **W4A16 (NVFP4)** | 4-bit (NVFP4) | 16-bit (FP16/BF16) | Same as above, but NVFP4 weight format |
| **W4A4 (NVFP4)** | 4-bit (NVFP4) | 4-bit (NVFP4) | BOTH weights AND activations quantized to 4-bit |

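**INT4 is most often deployed as W4A16** (for example in GPTQ- and AWQ-style weight-only quantization). Activations usually stay in FP16/BF16 because:
1. Weight-only kernels dequantize INT4 weights to FP16 on the fly and run a regular FP16 matmul, so no activation quantizer is needed
2. Activations contain outliers that a uniform 16-level grid handles poorly, and quantizing them adds per-token (or per-group) scales that must be computed at run time
3. Hardware support for INT4 × INT4 Tensor Core matmul is uneven: it exists on some NVIDIA generations (Turing, Ampere) but is not the low-precision path that recent architectures emphasize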

```python
# INT4 = W4A16: only weights are quantized
def int4_forward(activation_fp16, weight_int4, scale):
    weight_fp16 = weight_int4 * scale       # W: 4-bit → 16-bit (dequantize)
    return activation_fp16 @ weight_fp16     # A: stays 16-bit

# NVFP4 W4A4: both weights and activations are quantized.
# quantize_to_nvfp4 and fp4_matmul are placeholders for a hardware or
# library kernel, not real function names.
def nvfp4_w4a4_forward(activation_fp16, weight_nvfp4):
    activation_nvfp4 = quantize_to_nvfp4(activation_fp16)  # A: 16-bit → 4-bit
    return fp4_matmul(activation_nvfp4, weight_nvfp4)      # W4A4 native, higher-precision accumulate
```
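
The two forward passes can be simulated without FP4 hardware by "fake quantizing": rounding values to the E2M1 grid and immediately converting back to floating point. The sketch below assumes 16-element blocks along the reduction dimension and keeps block scales in float64; `fake_quant_fp4`, `w4a16_forward`, and `w4a4_forward` are illustrative names, not a real kernel API.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_fp4(x, block_size=16):
    """Quantize-dequantize along the last axis, one absmax scale per block."""
    blocks = x.reshape(*x.shape[:-1], -1, block_size)
    scales = np.abs(blocks).max(axis=-1, keepdims=True) / 6.0
    scales = np.where(scales == 0.0, 1.0, scales)  # avoid dividing by zero
    scaled = blocks / scales
    nearest = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    return (np.sign(scaled) * E2M1_GRID[nearest] * scales).reshape(x.shape)


def w4a16_forward(activations, weights):
    """Weights are 4-bit; activations are left untouched."""
    # Transpose so blocks run along the reduction (input-feature) dimension.
    return activations @ fake_quant_fp4(weights.T).T


def w4a4_forward(activations, weights):
    """Both matmul operands are rounded to the 4-bit grid."""
    return fake_quant_fp4(activations) @ fake_quant_fp4(weights.T).T


rng = np.random.default_rng(0)
activations = rng.standard_normal((4, 64))
weights = rng.standard_normal((64, 8)) * 0.05
reference = activations @ weights

for name, forward in [("W4A16", w4a16_forward), ("W4A4", w4a4_forward)]:
    out = forward(activations, weights)
    rel_err = np.linalg.norm(out - reference) / np.linalg.norm(reference)
    print(f"{name} relative error: {rel_err:.4f}")
```

On random data like this, W4A4 typically shows the larger error because it adds activation rounding on top of weight rounding, though the exact numbers depend on the inputs.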

### What W4A4 Means for Activations

With W4A4, the activations fed into each matmul are also quantized to 4 bits. The matmul still accumulates and writes its output in higher precision, so each layer re-quantizes its input:

```
Standard W4A16 (INT4 or NVFP4):
Input (FP16) → MatMul (W4 × A16) → Output (FP16) → LayerNorm → MatMul (W4 × A16) → ...
                                              ↑
                                    Activations remain FP16 between layers

NVFP4 W4A4:
Input (FP16) → Quantize to NVFP4 → MatMul (W4 × A4) → Output (FP16) → LayerNorm → Quantize to NVFP4 → MatMul (W4 × A4) → ...
                     ↑                                                                  ↑
          Matmul inputs are 4-bit; outputs, norms, and residuals stay in higher precision
```

| Activation Aspect | W4A16 | W4A4 |
|------------------|-------|------|
| **Activation memory** | 2 bytes/element | About 0.56 bytes/element for quantized matmul inputs (4-bit value plus a shared FP8 scale per 16 elements) |
| **KV cache size** | FP16 by default (2 bytes/element) | Not set by W4A4 itself; KV-cache quantization is a separate choice |
| **Activation bandwidth** | Full FP16 bandwidth | Roughly 1/4 of FP16 for quantized matmul inputs |
| **MatMul input type** | FP16 Tensor Core input | FP4 Tensor Core input (Blackwell, e.g. B200) |
| **Between-layer precision** | Full FP16 | Re-quantized at each matmul input, so quantization error accumulates across layers |
| **Attention softmax precision** | FP16 (full) | Must upcast, then re-quantize |
| **Residual connections** | FP16 accumulation | FP16 accumulation (then quantize for next layer) |
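
The per-element figures follow from simple arithmetic once the shared block scale is amortized. The sketch below assumes one 8-bit scale per 16 elements and ignores the per-tensor FP32 scale, which is negligible for large tensors; the tensor shape is a hypothetical example.

```python
def bytes_per_element(value_bits, scale_bits=0, block_size=1):
    """Average storage per element, amortizing one shared scale over each block."""
    return (value_bits + scale_bits / block_size) / 8


fp16_bytes = bytes_per_element(16)
nvfp4_bytes = bytes_per_element(4, scale_bits=8, block_size=16)

# Hypothetical activation tensor: 4096 tokens x 8192 hidden features.
elements = 4096 * 8192
MIB = 1024 ** 2
print(f"FP16:  {fp16_bytes} B/element, {elements * fp16_bytes / MIB:.0f} MiB")
print(f"NVFP4: {nvfp4_bytes} B/element, {elements * nvfp4_bytes / MIB:.0f} MiB")
print(f"ratio: {fp16_bytes / nvfp4_bytes:.2f}x")
```

The block scale costs an extra half bit per element, so the saving over FP16 is closer to 3.6x than a clean 4x.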

### Activation Quantization Error

The biggest risk with W4A4 is activation quantization error compounding across layers. With W4A16, only weight quantization error compounds. With W4A4, both weight AND activation errors compound:

```python
# W4A16: only weight error
# Layer 1: y1 = A_fp16 @ W4_approx  (input clean, weights quantized)
# Layer 2: y2 = y1_fp16 @ W4_approx (activations not re-quantized; error ≈ 2 * weight_error)
# Layer N: total error grows roughly like N * weight_error (a first-order estimate)

# W4A4: weight + activation error
# Layer 1: y1 = quant(A_fp16) @ W4_approx  (BOTH quantized)
# Layer 2: y2 = quant(y1) @ W4_approx        (carries layer 1 error forward)
# Layer N: each layer adds a fresh activation-rounding error on top of the weight error,
#          so the total can grow faster than in the W4A16 case in deep models
```
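
A toy experiment makes the accumulation visible. This is a sketch under strong assumptions: a stack of random linear layers with RMS normalization and no attention or nonlinearity, float64 block scales, and illustrative helper names. It tracks the relative error against an unquantized reference after each layer.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_fp4(x, block_size=16):
    """Quantize-dequantize along the last axis, one absmax scale per block."""
    blocks = x.reshape(*x.shape[:-1], -1, block_size)
    scales = np.abs(blocks).max(axis=-1, keepdims=True) / 6.0
    scales = np.where(scales == 0.0, 1.0, scales)
    scaled = blocks / scales
    nearest = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    return (np.sign(scaled) * E2M1_GRID[nearest] * scales).reshape(x.shape)


def rms_norm(h):
    """Rescale each row to unit root-mean-square so magnitudes stay stable."""
    return h / np.sqrt(np.mean(h ** 2, axis=-1, keepdims=True))


def relative_errors(x, layers, quantize_activations):
    """Run reference and quantized stacks in lockstep; return error after each layer."""
    h_ref, h_q, errors = x, x, []
    for w in layers:
        h_ref = rms_norm(h_ref @ w)
        a = fake_quant_fp4(h_q) if quantize_activations else h_q
        h_q = rms_norm(a @ fake_quant_fp4(w.T).T)
        errors.append(float(np.linalg.norm(h_q - h_ref) / np.linalg.norm(h_ref)))
    return errors


rng = np.random.default_rng(0)
hidden, depth = 64, 8
x = rng.standard_normal((4, hidden))
layers = [rng.standard_normal((hidden, hidden)) / np.sqrt(hidden) for _ in range(depth)]

errors_w4a16 = relative_errors(x, layers, quantize_activations=False)
errors_w4a4 = relative_errors(x, layers, quantize_activations=True)
for i, (e16, e4) in enumerate(zip(errors_w4a16, errors_w4a4), start=1):
    print(f"layer {i}: W4A16 {e16:.4f}   W4A4 {e4:.4f}")
```

Random matrices are not a trained network, so the absolute numbers say little about real perplexity; the sketch only illustrates how per-layer rounding carries forward.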

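### Real-World Impact on Large Models

No benchmark source is given for the figures in this table, so read them as indicative of the trend rather than as measured results; actual degradation depends on the model, the calibration method, and the evaluation set.

| Model (layers) | W4A16 PPL Degradation | W4A4 PPL Degradation |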
|-------------|----------------------|---------------------|
| 7B (32 layers) | +0.15 PPL | +0.45 PPL |
| 13B (40 layers) | +0.12 PPL | +0.38 PPL |
| 70B (80 layers) | +0.08 PPL | +0.52 PPL (worse than 7B W4A16!) |
| 405B (126 layers) | +0.05 PPL | +0.71 PPL (activation error compounds deeply) |

> **Key insight:** W4A4 is a throughput and bandwidth optimization, not a quality optimization. It shrinks matmul input activations by roughly 3.5x relative to FP16 and lets Blackwell-class GPUs use FP4 Tensor Cores; KV-cache savings require separate KV-cache quantization. The deeper the model, the more layers there are for activation quantization error to accumulate. For very deep models, a W4A8 configuration (4-bit weights, 8-bit activations such as FP8) is sometimes used as a compromise between W4A16 quality and W4A4 savings.

### Summary: Bit Allocation Decisions

| Format | Sign | Exponent/Value | Mantissa | Step Behavior | Activation Format |
|--------|------|---------------|----------|---------------|-------------------|
| INT4 | 1 bit (two's complement) | 3 value bits | None | Uniform | Usually A16 (FP16/BF16) |
| NVFP4 | 1 bit | 2 exponent bits | 1 mantissa bit | Non-uniform (roughly logarithmic) | A16 or A4 |

The 2 exponent bits in NVFP4 are what give it the non-uniform precision distribution: they let each element pick one of a few power-of-two ranges on top of the shared block scale. They do not remove the need for scaling. NVFP4 still stores an FP8 (E4M3) scale per 16-element block plus an FP32 scale per tensor, much as INT4 stores a scale per group or channel. The 1 mantissa bit gives 2 levels of precision within each exponent range.

Learn more at [NVIDIA Blackwell Whitepaper](https://resources.nvidia.com/en-us-blackwell-architecture) and [IEEE 754 FP4 Standard Draft](https://arxiv.org/abs/2312.12383).
