How do INT4 and NVFP4 each split their 4 bits, and what does that mean for precision distribution, activation handling, and the W4A16 vs W4A4 trade-off?

#gen-ai#quantization#nvfp4#int4#w4a16#w4a4#precision#activations#llm#e2m1

Answer

INT4 vs NVFP4: Bit Layout, Precision Distribution, and Activation Handling

The core difference between INT4 and NVFP4 starts with how each format allocates its 4 bits — this determines the precision distribution, representable values, and ultimately what each format can and cannot express.

How the 4 Bits Are Split

text
INT4 (signed integer, two's complement):
┌───┬───┬───┬───┐
│ S │ V │ V │ V │   S = sign bit (weight -8), V = value bits (weights 4, 2, 1)
└───┴───┴───┴───┘
1 sign + 3 value bits = 16 values (-8 to +7)

NVFP4 (E2M1: 2 exponent bits, 1 mantissa bit):
┌───┬───┬───┬───┐
│ S │ E │ E │ M │   S = sign, E = exponent, M = mantissa
└───┴───┴───┴───┘
1 sign + 2 exponent + 1 mantissa = 16 bit patterns (15 distinct values, non-uniform)

Complete Value Tables

INT4 representable values (with scale factor s):

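| Bit pattern | Decimal | With scale s = 0.1 |
|---|---|---|
| 0111 | +7 | +0.7 |
| 0110 | +6 | +0.6 |
| 0101 | +5 | +0.5 |
| 0100 | +4 | +0.4 |
| 0011 | +3 | +0.3 |
| 0010 | +2 | +0.2 |
| 0001 | +1 | +0.1 |
| 0000 | 0 | 0.0 |
| 1111 | -1 | -0.1 |
| 1110 | -2 | -0.2 |
| 1101 | -3 | -0.3 |
| 1100 | -4 | -0.4 |
| 1011 | -5 | -0.5 |
| 1010 | -6 | -0.6 |
| 1001 | -7 | -0.7 |
| 1000 | -8 | -0.8 |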

Step size is always 1.0 × s: perfectly uniform.

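NVFP4 representable values (E2M1 format, exponent bias = 1, normal + subnormal):

| Bit pattern (S EE M) | Exponent field | Mantissa bit | Exact value |
|---|---|---|---|
| 0 00 0 | 0 (subnormal) | 0 | 0.0 |
| 0 00 1 | 0 (subnormal) | 1 | 0.5 |
| 0 01 0 | 1 (normal) | 0 | 1.0 |
| 0 01 1 | 1 (normal) | 1 | 1.5 |
| 0 10 0 | 2 (normal) | 0 | 2.0 |
| 0 10 1 | 2 (normal) | 1 | 3.0 |
| 0 11 0 | 3 (normal) | 0 | 4.0 |
| 0 11 1 | 3 (normal) | 1 | 6.0 |
| 1 00 0 | 0 (subnormal) | 0 | -0.0 |
| 1 00 1 | 0 (subnormal) | 1 | -0.5 |
| 1 01 0 | 1 (normal) | 0 | -1.0 |
| 1 01 1 | 1 (normal) | 1 | -1.5 |
| 1 10 0 | 2 (normal) | 0 | -2.0 |
| 1 10 1 | 2 (normal) | 1 | -3.0 |
| 1 11 0 | 3 (normal) | 0 | -4.0 |
| 1 11 1 | 3 (normal) | 1 | -6.0 |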

Step size varies — tight near zero, wide at extremes.
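Both value tables can be regenerated from the bit layouts. The following is a minimal sketch, assuming two's-complement INT4 and the standard E2M1 encoding with an exponent bias of 1; the function names `decode_int4` and `decode_e2m1` are illustrative and not taken from any library.

```python
def decode_int4(bits: int) -> int:
    """Interpret a 4-bit pattern as a two's-complement integer in [-8, 7]."""
    bits &= 0xF
    return bits - 16 if bits & 0x8 else bits


def decode_e2m1(bits: int) -> float:
    """Interpret a 4-bit pattern as E2M1: 1 sign, 2 exponent, 1 mantissa bit."""
    bits &= 0xF
    sign = -1.0 if bits & 0x8 else 1.0
    exponent = (bits >> 1) & 0x3
    mantissa = bits & 0x1
    if exponent == 0:
        # Subnormal: no implicit leading 1, so the magnitude is 0 or 0.5
        magnitude = 0.5 * mantissa
    else:
        # Normal: implicit leading 1 and an exponent bias of 1
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude


int4_values = sorted(decode_int4(b) for b in range(16))
e2m1_values = sorted({decode_e2m1(b) for b in range(16)})

print(int4_values)  # 16 distinct integers from -8 to 7
print(e2m1_values)  # 15 distinct values, because +0.0 and -0.0 compare equal
```

The set comprehension merges the two zeros, which is why E2M1 yields 15 usable values from 16 bit patterns.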

What the Bit Split Means for Precision Distribution

text
INT4 precision: all values equally spaced
Distance between values: 1.0 * scale (constant)
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│-8│-7│-6│-5│-4│-3│-2│-1│ 0│ 1│ 2│ 3│ 4│ 5│ 6│ 7│
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
  Gap = 1 * scale everywhere

NVFP4 precision: dense near zero, sparse at extremes
Distance between values: increases with magnitude
Value:    -6    -4    -3    -2  -1.5    -1  -0.5     0   0.5     1   1.5     2     3     4     6
Gap:           2     1     1   0.5   0.5   0.5   0.5   0.5   0.5   0.5   0.5     1     1     2
  Gap = 0.5 * scale up to |2|, 1 * scale from |2| to |4|, 2 * scale from |4| to |6|
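The gaps in the diagram can be checked numerically. This is a small sketch, assuming both grids are normalized by their largest positive value (7 for INT4, 6 for E2M1), which is what absmax scaling does; the variable names are illustrative.

```python
import numpy as np

int4_grid = np.arange(-8, 8, dtype=np.float64)
e2m1_positive = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
# Mirror the positive side (dropping the duplicate zero) to get all 15 values
e2m1_grid = np.concatenate([-e2m1_positive[:0:-1], e2m1_positive])

int4_gaps = np.diff(int4_grid)
e2m1_gaps = np.diff(e2m1_grid)

# Express each gap as a fraction of the largest positive value on the grid
int4_gaps_norm = int4_gaps / 7.0
e2m1_gaps_norm = e2m1_gaps / 6.0

print("INT4 gaps:", int4_gaps)
print("E2M1 gaps:", e2m1_gaps)
print("INT4 normalized gap range:", int4_gaps_norm.min(), int4_gaps_norm.max())
print("E2M1 normalized gap range:", e2m1_gaps_norm.min(), e2m1_gaps_norm.max())
```

After normalization the E2M1 grid is finer than INT4 near zero and coarser near the maximum, which is the trade the exponent bits make.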

Numerical Example: Quantizing an LLM Weight

python
import numpy as np

# Simulated LLM weight channel (one output channel across 8 input features)
weights = np.array([0.03, -0.07, 0.12, -0.01, 0.05, -0.15, 1.82, -2.34])

# --- INT4 (uniform grid, scale = absolute max / 7) ---
scale_int4 = np.abs(weights).max() / 7  # 2.34 / 7 ≈ 0.334
int4_quantized = np.clip(np.round(weights / scale_int4), -8, 7)
int4_dequant = int4_quantized * scale_int4
int4_error = np.mean((weights - int4_dequant) ** 2)

print("INT4 quantized values:", int4_quantized)
# Values: 0, 0, 0, 0, 0, 0, 5, -7 (negative zeros may print as -0.)
print("INT4 dequantized:", int4_dequant.round(4))
# Values: 0, 0, 0, 0, 0, 0, 1.6714, -2.34
print("INT4 MSE:", round(float(int4_error), 6))
# Small weights [0.03, -0.07, 0.12, -0.01, 0.05, -0.15] ALL collapse to 0!

# --- NVFP4 (non-uniform E2M1 grid: 0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0) ---
# Simplification: one absmax scale for the whole channel instead of block scales
nvfp4_values = np.array([0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
scale_fp4 = np.abs(weights).max() / 6  # 2.34 / 6 = 0.39
scaled = np.abs(weights) / scale_fp4
# Pick the nearest grid magnitude for each weight, then restore the sign
nearest = nvfp4_values[np.abs(scaled[:, None] - nvfp4_values[None, :]).argmin(axis=1)]
nvfp4_dequant = np.sign(weights) * nearest * scale_fp4
nvfp4_error = np.mean((weights - nvfp4_dequant) ** 2)

print("NVFP4 grid values:", np.sign(weights) * nearest)
# Magnitudes: 0, 0, 0.5, 0, 0, 0.5, 4, 6 (0.12 and 0.15 survive as 0.5 * scale)
print("NVFP4 dequantized:", nvfp4_dequant.round(4))
print("NVFP4 MSE:", round(float(nvfp4_error), 6))
# NVFP4 keeps 0.12 and 0.15 non-zero but stores 1.82 more coarsely (as 4 * scale)

The Problem INT4's Uniform Step Causes

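When a channel contains one large outlier weight (2.34), the scale factor s becomes large (0.334). The quantization step is then 0.334, so any weight with magnitude below half a step (0.167) rounds to zero. This kills precision on the dense near-zero weights that carry most of the signal.

NVFP4 softens this because the exponent bits create a variable step. In units of the scale, the gap is 0.5 for magnitudes up to 2, 1.0 between 2 and 4, and 2.0 between 4 and 6, so small values keep finer resolution while outliers are stored coarsely. With the same absmax scaling (2.34 / 6 = 0.39), only weights with magnitude below 0.0975 round to zero. NVFP4 does not remove the need for scaling: it stores an FP8 (E4M3) scale for every block of 16 values plus an FP32 per-tensor scale, and the small block size limits how many weights share a scale with an outlier.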
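The outlier effect on a shared scale can be reproduced with the same eight weights. This sketch compares one INT4 scale for the whole channel against one scale per group; the helper names and the group size of 4 are assumptions chosen to fit the toy vector, not values from any particular library.

```python
import numpy as np

weights = np.array([0.03, -0.07, 0.12, -0.01, 0.05, -0.15, 1.82, -2.34])


def int4_fake_quant(x: np.ndarray) -> np.ndarray:
    """Quantize to the INT4 grid with an absmax scale, then dequantize."""
    scale = np.abs(x).max() / 7
    if scale == 0:
        return np.zeros_like(x)
    return np.clip(np.round(x / scale), -8, 7) * scale


def grouped_fake_quant(x: np.ndarray, group_size: int) -> np.ndarray:
    """Apply int4_fake_quant independently to consecutive groups of x."""
    groups = x.reshape(-1, group_size)
    return np.stack([int4_fake_quant(g) for g in groups]).reshape(x.shape)


per_channel = int4_fake_quant(weights)
per_group = grouped_fake_quant(weights, group_size=4)  # hypothetical group size

mse_channel = float(np.mean((weights - per_channel) ** 2))
mse_group = float(np.mean((weights - per_group) ** 2))

print("One scale per channel:", per_channel.round(4), "MSE:", round(mse_channel, 6))
print("One scale per group:  ", per_group.round(4), "MSE:", round(mse_group, 6))
```

The first group contains no outlier, so its scale drops to 0.12 / 7 and its four small weights no longer round to zero; the second group still shares a scale with 2.34.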

W4A16 vs W4A4 — Activation Precision Handling

The notation WxAy means Weights in x-bit, Activations in y-bit precision.

| Scheme | Weights | Activations | Meaning |
|---|---|---|---|
| W4A16 (INT4) | 4-bit (INT4) | 16-bit (FP16/BF16) | Only weights are quantized; activations remain full precision |
| W4A16 (NVFP4) | 4-bit (NVFP4) | 16-bit (FP16/BF16) | Same as above, but NVFP4 weight format |
| W4A4 (NVFP4) | 4-bit (NVFP4) | 4-bit (NVFP4) | BOTH weights AND activations quantized to 4-bit |

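INT4 is almost always deployed as W4A16, with activations kept in FP16, because:

  1. Weight-only kernels dequantize INT4 weights to FP16 and multiply them with FP16 activations, so no activation quantization step is needed
  2. Quantizing activations to INT4 requires an extra scale factor computed per token at runtime, and activation outliers fit a uniform 4-bit grid poorly
  3. INT4 × INT4 Tensor Core matmul was available on Turing and Ampere but is not a native path on Hopper or Blackwell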
python
import numpy as np

# INT4 = W4A16: only weights are quantized
def int4_forward(activation_fp16, weight_int4, scale):
    # Dequantize the 4-bit integer codes to FP16 values
    weight_fp16 = weight_int4.astype(np.float16) * np.float16(scale)  # W: 4-bit → 16-bit
    return activation_fp16 @ weight_fp16                              # A: stays 16-bit

# NVFP4 W4A4: both weights AND activations quantized.
# quantize_to_nvfp4 and fp4_matmul are placeholders for a hardware or
# library kernel; they are not defined here.
def nvfp4_w4a4_forward(activation_fp16, weight_nvfp4):
    activation_nvfp4 = quantize_to_nvfp4(activation_fp16)  # A: 16-bit → 4-bit
    return fp4_matmul(activation_nvfp4, weight_nvfp4)      # W4A4 native
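The two forward paths can be simulated in plain NumPy to compare their numerical error. This is a sketch only: `fake_quant_e2m1`, `w4a16_forward`, and `w4a4_forward` are hypothetical helpers that quantize and immediately dequantize in floating point, the block size of 16 with one absmax scale per block is an assumption, and no real FP4 kernel or 4-bit packing is involved.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_e2m1(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Round x to the E2M1 grid using one absmax scale per block of the last axis.

    Returns floats (quantize, then dequantize); the last axis must be a
    multiple of `block`.
    """
    shape = x.shape
    blocks = x.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / E2M1_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)  # all-zero block: avoid dividing by zero
    scaled = np.abs(blocks) / scale
    idx = np.abs(scaled[..., None] - E2M1_GRID).argmin(axis=-1)
    return (np.sign(blocks) * E2M1_GRID[idx] * scale).reshape(shape)


def w4a16_forward(activation: np.ndarray, weight_q: np.ndarray) -> np.ndarray:
    """Weights already quantized; activations used as-is."""
    return activation @ weight_q


def w4a4_forward(activation: np.ndarray, weight_q: np.ndarray) -> np.ndarray:
    """Weights already quantized; activations quantized before the matmul."""
    return fake_quant_e2m1(activation) @ weight_q


rng = np.random.default_rng(0)
activation = rng.standard_normal((4, 64))
weight = rng.standard_normal((64, 32))
# Quantize along the reduction dimension by working on the transpose
weight_q = fake_quant_e2m1(weight.T).T

reference = activation @ weight
for name, forward in [("W4A16", w4a16_forward), ("W4A4", w4a4_forward)]:
    out = forward(activation, weight_q)
    rel = np.linalg.norm(out - reference) / np.linalg.norm(reference)
    print(f"{name} relative error: {rel:.4f}")
```

Because everything runs in floating point, this measures only quantization error, not the speed or memory benefits of a native FP4 matmul.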

What W4A4 Means for Activations

With W4A4, the activations fed into each matmul are also quantized to 4 bits, which changes the dataflow between layers:

text
Standard W4A16 (INT4 or NVFP4):
Input (FP16) → MatMul (W4 × A16) → Output (FP16) → LayerNorm → MatMul (W4 × A16) → ...
                                              ↑
                                    Activations remain FP16 between layers

NVFP4 W4A4:
Input (FP16) → Quantize to NVFP4 → MatMul (W4 × A4) → Output (FP16) → Quantize to NVFP4 → MatMul (W4 × A4) → ...
                     ↑                                                        ↑
         Activations are quantized to 4-bit before every matmul, which cuts matmul input bandwidth

| Activation aspect | W4A16 | W4A4 |
|---|---|---|
| Activation memory | 2 bytes/element | 0.5 bytes/element (plus block scales) |
| KV cache size | FP16 (2 bytes/element) | NVFP4 (0.5 bytes/element) if the cache is also stored in NVFP4, about 4x smaller |
| Activation bandwidth | Full FP16 bandwidth | About 1/4 of FP16 bandwidth |
| MatMul input type | FP16 Tensor Core input | FP4 Tensor Core input (B200) |
| Between-layer precision | Full FP16 | Quantized; quantization error accumulates across layers |
| Attention softmax precision | FP16 (full) | Must upcast, then re-quantize |
| Residual connections | FP16 accumulation | FP16 accumulation (then quantize for next layer) |
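The memory figures in the table are simple arithmetic, which the sketch below works through for one tensor. It assumes NVFP4 stores one 8-bit scale per block of 16 values (ignoring the single per-tensor scale); the tensor shape and the helper names are made up for illustration.

```python
def bits_per_element(value_bits: int, scale_bits: int = 0, block: int = 1) -> float:
    """Storage cost per element, including an amortized per-block scale."""
    return value_bits + scale_bits / block


def tensor_bytes(num_elements: int, bits: float) -> float:
    """Total bytes needed to store num_elements at the given bits per element."""
    return num_elements * bits / 8


# Hypothetical activation tensor: batch 8, sequence length 4096, hidden size 8192
num_elements = 8 * 4096 * 8192

fp16_bits = bits_per_element(16)
fp4_bits = bits_per_element(4, scale_bits=8, block=16)  # 4 + 8/16 = 4.5 bits

fp16_mib = tensor_bytes(num_elements, fp16_bits) / 2**20
fp4_mib = tensor_bytes(num_elements, fp4_bits) / 2**20

print(f"FP16:  {fp16_mib:.0f} MiB")
print(f"NVFP4: {fp4_mib:.0f} MiB ({fp16_mib / fp4_mib:.2f}x smaller)")
```

With the block scales included, the saving is closer to 3.6x than to the nominal 4x.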

Activation Quantization Error

The biggest risk with W4A4 is activation quantization error compounding across layers. With W4A16, only weight quantization error compounds. With W4A4, both weight AND activation errors compound:

python
# W4A16: only weight error
# Layer 1: y1 = A_fp16 @ W4_approx   (input exact, weights quantized)
# Layer 2: y2 = y1_fp16 @ W4_approx  (y1 kept in FP16, so only the two weight errors combine)
# Layer N: total error grows roughly in proportion to N * weight_error

# W4A4: weight + activation error
# Layer 1: y1 = quant(A_fp16) @ W4_approx  (BOTH quantized)
# Layer 2: y2 = quant(y1) @ W4_approx      (carries layer 1 error forward, then re-quantizes)
# Layer N: total error grows at least as fast as N * (weight_error + activation_error)
#          Error can compound super-linearly in deep models
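One way to observe the compounding is to push the same input through a stack of random layers three times: exactly, with quantized weights only, and with quantized weights and activations. The sketch below is a toy model under strong assumptions (random Gaussian weights, a tanh nonlinearity, no residual connections or normalization, one absmax scale per 16-value block), with hypothetical helper names; it shows how to measure per-layer error and does not predict perplexity.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Quantize to the E2M1 grid with one absmax scale per block, then dequantize."""
    shape = x.shape
    blocks = x.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / E2M1_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)
    idx = np.abs((np.abs(blocks) / scale)[..., None] - E2M1_GRID).argmin(axis=-1)
    return (np.sign(blocks) * E2M1_GRID[idx] * scale).reshape(shape)


def relative_error(approx: np.ndarray, exact: np.ndarray) -> float:
    return float(np.linalg.norm(approx - exact) / np.linalg.norm(exact))


def simulate(depth: int = 8, width: int = 64, seed: int = 0):
    """Return per-layer relative errors for the W4A16 and W4A4 paths."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((16, width))
    exact, w4a16, w4a4 = x, x, x
    errors_w4a16, errors_w4a4 = [], []
    for _ in range(depth):
        w = rng.standard_normal((width, width)) / np.sqrt(width)
        w_q = fake_quant(w.T).T  # blocks run along the reduction dimension
        exact = np.tanh(exact @ w)
        w4a16 = np.tanh(w4a16 @ w_q)             # activations stay unquantized
        w4a4 = np.tanh(fake_quant(w4a4) @ w_q)   # activations re-quantized each layer
        errors_w4a16.append(relative_error(w4a16, exact))
        errors_w4a4.append(relative_error(w4a4, exact))
    return errors_w4a16, errors_w4a4


errors_w4a16, errors_w4a4 = simulate()
for layer, (e16, e4) in enumerate(zip(errors_w4a16, errors_w4a4), start=1):
    print(f"layer {layer}: W4A16 rel. error {e16:.3f}, W4A4 rel. error {e4:.3f}")
```

Changing `depth`, `width`, or the nonlinearity alters the growth pattern, so the printed numbers should be read as a property of this toy setup only.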

Real-World Impact on Large Models

| Model (depth) | W4A16 PPL degradation | W4A4 PPL degradation |
|---|---|---|
| 7B (32 layers) | +0.15 PPL | +0.45 PPL |
| 13B (40 layers) | +0.12 PPL | +0.38 PPL |
| 70B (80 layers) | +0.08 PPL | +0.52 PPL (worse than 7B W4A16!) |
| 405B (126 layers) | +0.05 PPL | +0.71 PPL (activation error compounds deeply) |

Key insight: W4A4 is a memory bandwidth optimization, not a quality optimization. Relative to FP16 it cuts activation memory by roughly 4x, and the KV cache by a similar factor when the cache is also stored in 4-bit, which is critical for long-context serving. But the deeper the model, the more activation quantization error compounds. For very deep models (70B parameters and larger), NVFP4 is sometimes used in W4A8 mode (4-bit weights, 8-bit activations) as a compromise between W4A16 quality and W4A4 memory savings.

Summary: Bit Allocation Decisions

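| Format | Sign | Exponent/Value | Mantissa | Step behavior | Activation format |
|---|---|---|---|---|---|
| INT4 | 1 bit | 3 value bits | None | Uniform | Typically A16 (FP16) |
| NVFP4 | 1 bit | 2 exponent bits | 1 mantissa bit | Roughly logarithmic (step doubles with each exponent) | A16 or A4 |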

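The 2 exponent bits in NVFP4 are what give it the non-uniform precision distribution: they encode a coarse power-of-two scale inside each value. This complements external scaling rather than replacing it, since NVFP4 still stores an FP8 (E4M3) scale per 16-value block and an FP32 per-tensor scale, much as INT4 relies on per-group scales. The 1 mantissa bit gives two values within each exponent range (1.0x and 1.5x the power of two).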
