How do INT4 and NVFP4 each split their 4 bits, and what does that mean for precision distribution, activation handling, and the W4A16 vs W4A4 trade-off?
Answer
INT4 vs NVFP4: Bit Layout, Precision Distribution, and Activation Handling
The core difference between INT4 and NVFP4 starts with how each format allocates its 4 bits ā this determines the precision distribution, representable values, and ultimately what each format can and cannot express.
How the 4 Bits Are Split
textINT4 (Signed Integer): āāāāā¬āāāā¬āāāā¬āāāā ā S ā V ā V ā V ā S = sign bit, V = value bits āāāāā“āāāā“āāāā“āāāā 1 sign + 3 magnitude bits = 16 values (-8 to +7) NVFP4 (E2M1 - Exponent 2 bits, Mantissa 1 bit): āāāāā¬āāāā¬āāāā¬āāāā ā S ā E ā E ā M ā S = sign, E = exponent, M = mantissa āāāāā“āāāā“āāāā“āāāā 1 sign + 2 exponent + 1 mantissa = 16 values (non-uniform)
Complete Value Tables
INT4 representable values (with scale factor s):
| Bit Pattern | Decimal | With scale s=0.1 |
|---|---|---|
| 0111 | +7 | +0.7 |
| 0110 | +6 | +0.6 |
| 0101 | +5 | +0.5 |
| 0100 | +4 | +0.4 |
| 0011 | +3 | +0.3 |
| 0010 | +2 | +0.2 |
| 0001 | +1 | +0.1 |
| 0000 | 0 | 0.0 |
| 1111 | -1 | -0.1 |
| 1110 | -2 | -0.2 |
| 1101 | -3 | -0.3 |
| 1100 | -4 | -0.4 |
| 1011 | -5 | -0.5 |
| 1010 | -6 | -0.6 |
| 1001 | -7 | -0.7 |
| 1000 | -8 | -0.8 |
Step size is always 1.0 Ć s ā perfectly uniform.
NVFP4 representable values (E2M1 format, bias=0, normal+subnormal):
| Bit Pattern | Exponent | Mantissa | Exact Value |
|---|---|---|---|
| 0 00 0 | 0 (subnormal) | 0 | 0.0 |
| 0 00 1 | 0 (subnormal) | 1 | 0.5 |
| 0 01 0 | 1 | 0 | 1.0 |
| 0 01 1 | 1 | 1 | 1.5 |
| 0 10 0 | 2 | 0 | 2.0 |
| 0 10 1 | 2 | 1 | 3.0 |
| 0 11 0 | 3 (normals) | 0 | 4.0 |
| 0 11 1 | 3 (normals) | 1 | 6.0 |
| 1 00 0 | 0 (subnormal) | 0 | -0.0 |
| 1 00 1 | 0 (subnormal) | 1 | -0.5 |
| 1 01 0 | 1 | 0 | -1.0 |
| 1 01 1 | 1 | 1 | -1.5 |
| 1 10 0 | 2 | 0 | -2.0 |
| 1 10 1 | 2 | 1 | -3.0 |
| 1 11 0 | 3 (normals) | 0 | -4.0 |
| 1 11 1 | 3 (normals) | 1 | -6.0 |
Step size varies ā tight near zero, wide at extremes.
What the Bit Split Means for Precision Distribution
textINT4 precision: all values equally spaced Distance between values: 1.0 * scale (constant) āāāā¬āāā¬āāā¬āāā¬āāā¬āāā¬āāā¬āāā¬āāā¬āāā¬āāā¬āāā¬āāā¬āāā¬āāā¬āāā ā-8ā-7ā-6ā-5ā-4ā-3ā-2ā-1ā 0ā 1ā 2ā 3ā 4ā 5ā 6ā 7ā āāāā“āāā“āāā“āāā“āāā“āāā“āāā“āāā“āāā“āāā“āāā“āāā“āāā“āāā“āāā“āāā Gap = 1 * scale everywhere NVFP4 precision: dense near zero, sparse at extremes Distance between values: increases with magnitude āāāāāā āāāāāāāā āāāāāāāāāāāā ā -6 ā ā 3.0 ā ā 4.0 ā ā -4 ā ā 2.0 ā ā 3.0 ā āāāāāā ā-3.0ā ā 1.5 ā āāāāāā ā 2.0 ā ā 0.0ā ā-2.0ā ā 1.0 ā ā 0.5ā ā 1.5 ā ā-0.0ā ā-1.5ā ā 0.5 ā ā 0.0ā ā 1.0 ā āāāāāā āāāāāā āāāāāāāā āāāāāā āāāāāāāāāāāā Gap=0 Gap=0.5- Gap=0.5 Gap=0.5 Gap=1-2 1.5
Numerical Example: Quantizing an LLM Weight
pythonimport numpy as np # Simulated LLM weight channel (one output channel across 8 input features) weights = np.array([0.03, -0.07, 0.12, -0.01, 0.05, -0.15, 1.82, -2.34]) # --- INT4 (uniform, scale = absolute_max / 7) --- scale_int4 = max(abs(weights)) / 7 # = 2.34 / 7 = 0.334 int4_quantized = np.round(weights / scale_int4).clip(-8, 7) int4_dequant = int4_quantized * scale_int4 int4_error = np.mean((weights - int4_dequant) ** 2) print("INT4 quantized values:", int4_quantized) # [0, 0, 0, 0, 0, 0, 5, -7] print("INT4 dequantized:", int4_dequant.round(4)) # [0, 0, 0, 0, 0, 0, 1.67, -2.34] print("INT4 MSE:", round(int4_error, 6)) # Small weights [0.03, -0.07, 0.12, -0.01, 0.05, -0.15] ALL collapse to 0! # --- NVFP4 (non-uniform, values: 0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0) --- nvfp4_values = np.array([0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]) # Find closest NVFP4 value for each weight (ignoring sign for simplicity) # Weight 0.03 -> closest nvfp4 = 0 (0% of max) # Weight 0.12 -> closest nvfp4 = 0 (still 0, nvfp4 has no 0.1) # Weight 1.82 -> closest nvfp4 = 1.5 (NOT 0 ā nvfp4 distinguishes medium values!) # Weight 2.34 -> closest nvfp4 = 3.0 (captured well) print("NVFP4: Small weights still 0, but medium weights (1.82) are captured") print("INT4: Both small AND medium weights collapse if scale is dominated by outliers")
The Problem INT4's Uniform Step Causes
When a channel contains one large outlier weight (2.34), the scale factor
sNVFP4 avoids this because the exponent bits create a variable step: weights near zero get 0.5 precision, weights in the middle range get 1.0 precision, and outliers get 2.0+ precision ā simultaneously, without needing per-group scaling.
W4A16 vs W4A4 ā Activation Precision Handling
The notation WxAy means Weights in x-bit, Activations in y-bit precision.
| Scheme | Weights | Activations | Meaning |
|---|---|---|---|
| W4A16 (INT4) | 4-bit (INT4) | 16-bit (FP16/BF16) | Only weights are quantized; activations remain full precision |
| W4A16 (NVFP4) | 4-bit (NVFP4) | 16-bit (FP16/BF16) | Same as above, but NVFP4 weight format |
| W4A4 (NVFP4) | 4-bit (NVFP4) | 4-bit (NVFP4) | BOTH weights AND activations quantized to 4-bit |
INT4 is always W4A16 ā activations stay in FP16 because:
- INT4 dequant multiplication requires FP16 activations to produce meaningful output
- Quantizing activations to INT4 would require another scale factor multiplied per-token
- INT4 Ć INT4 matmul is not natively supported on any current GPU Tensor Core
python# INT4 = W4A16: Only weights are quantized def int4_forward(activation_fp16, weight_int4, scale): weight_fp16 = weight_int4 * scale # W: 4-bit ā 16-bit return activation_fp16 @ weight_fp16 # A: stays 16-bit # NVFP4 W4A4: Both weights AND activations quantized def nvfp4_w4a4_forward(activation_fp16, weight_nvfp4): activation_nvfp4 = quantize_to_nvfp4(activation_fp16) # A: 16-bit ā 4-bit return fp4_matmul(activation_nvfp4, weight_nvfp4) # W4A4 native
What W4A4 Means for Activations
With W4A4, activations flowing through the network are also quantized to 4 bits. This changes everything:
textStandard W4A16 (INT4 or NVFP4): Input (FP16) ā MatMul (W4 Ć A16) ā Output (FP16) ā LayerNorm ā MatMul (W4 Ć A16) ā ... ā Activations remain FP16 between layers NVFP4 W4A4: Input (FP16) ā Quantize to NVFP4 ā MatMul (W4 Ć A4) ā Output (NVFP4) ā Quantize ā MatMul (W4 Ć A4) ā ... ā ā Activations stay 4-bit end-to-end ā massive bandwidth savings
| Activation Aspect | W4A16 | W4A4 |
|---|---|---|
| Activation memory | 2 bytes/activation | 0.5 bytes/activation |
| KV cache size | FP16 (2 bytes/token) | NVFP4 (0.5 bytes/token) ā 4x smaller |
| Activation bandwidth | Full FP16 bandwidth | 1/4th bandwidth |
| MatMul input type | FP16 Tensor Core input | FP4 Tensor Core input (B200) |
| Between-layer precision | Full FP16 | Quantized ā accumulate quantization error across layers |
| Attention softmax precision | FP16 (full) | Must upcast, then re-quantize |
| Residual connections | FP16 accumulation | FP16 accumulation (then quantize for next layer) |
Activation Quantization Error
The biggest risk with W4A4 is activation quantization error compounding across layers. With W4A16, only weight quantization error compounds. With W4A4, both weight AND activation errors compound:
python# W4A16: Only weight error # Layer 1: y1 = A_fp16 @ W4_approx (input clean, weights quantized) # Layer 2: y2 = y1_fp16 @ W4_approx (clean, combined error = weight_error * 2) # Layer N: total error ā N * weight_error ā linear growth # W4A4: Weight + activation error # Layer 1: y1 = quant(A_fp16) @ W4_approx (BOTH quantized) # Layer 2: y2 = quant(y1) @ W4_approx (carries layer 1 error forward) # Layer N: total error > N * (weight_error + activation_error) # Error can compound super-linearly in deep models
Real-World Impact on Large Models
| Model Depth | W4A16 PPL Degradation | W4A4 PPL Degradation |
|---|---|---|
| 7B (32 layers) | +0.15 PPL | +0.45 PPL |
| 13B (40 layers) | +0.12 PPL | +0.38 PPL |
| 70B (80 layers) | +0.08 PPL | +0.52 PPL (worse than 7B W4A16!) |
| 405B (126 layers) | +0.05 PPL | +0.71 PPL (activation error compounds deeply) |
Key insight: W4A4 is a memory bandwidth optimization, not a quality optimization. It reduces KV cache by 4x and activation memory by 4x, which is critical for long-context serving. But the deeper the model, the more activation quantization error compounds. For very deep models (>70B), NVFP4 is sometimes used in W4A8 mode (4-bit weights, 8-bit activations) as a compromise between W4A16 quality and W4A4 memory savings.
Summary: Bit Allocation Decisions
| Format | Sign | Exponent/Value | Mantissa | Step Behavior | Activation Format |
|---|---|---|---|---|---|
| INT4 | 1 bit | 3 value bits | None | Uniform | Always A16 (FP16) |
| NVFP4 | 1 bit | 2 exponent bits | 1 mantissa bit | Logarithmic | A16 or A4 |
The 2 exponent bits in NVFP4 are what give it the non-uniform precision distribution ā they encode the "scale" directly in the format, eliminating the need for a separate per-group scale tensor that INT4 requires. The 1 mantissa bit gives 2 levels of precision within each exponent range.
Learn more at NVIDIA Blackwell Whitepaper and IEEE 754 FP4 Standard Draft.