How does NVFP4 differ from standard INT4 quantization in terms of hardware-level support and weight distribution handling for LLMs? Are there specific use cases where one significantly outperforms the other in accuracy?
Answer
NVFP4 vs Standard INT4 Quantization
NVFP4 (NVIDIA Floating Point 4-bit) is a hardware-native 4-bit floating-point format introduced with NVIDIA's Blackwell (B200/B100) GPU architecture. Standard INT4 is the integer-based 4-bit quantization used by GPTQ, AWQ, and GGUF. The key difference is that NVFP4 uses a floating-point representation at the hardware level, fundamentally changing how weights are stored, scaled, and computed.
The Core Difference: Integer vs Floating-Point at 4 Bits
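Standard INT4 uses 4 bits to represent 16 discrete integer values (-8 to 7). NVFP4 uses 4 bits in a float layout: 1 sign bit, 2 exponent bits, and 1 mantissa bit (E2M1). Each block of 16 elements also shares an FP8 (E4M3) scale factor, on top of a per-tensor FP32 scale: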
```python
# Conceptual comparison of 4-bit representations

# INT4 (uniform): 16 equally spaced integer values
int4_values = list(range(-8, 8))  # [-8, -7, ..., 0, ..., 6, 7]
# Step size is fixed: all gaps are exactly 1

# NVFP4 element (non-uniform): 16 bit patterns in E2M1 format
# Format: 1 sign, 2 exponent, 1 mantissa
e2m1_magnitudes = [
    0.0, 0.5,  # exponent 00 (subnormal), mantissa 0 / 1
    1.0, 1.5,  # exponent 01
    2.0, 3.0,  # exponent 10
    4.0, 6.0,  # exponent 11
]
nvfp4_values = [-m for m in reversed(e2m1_magnitudes)] + e2m1_magnitudes
# [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, -0.0, 0.0, 0.5, ..., 6.0]
# Steps grow with magnitude: 0.5 near zero, 1.0 from 2 to 4, 2.0 from 4 to 6
```
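The E2M1 layout can also be decoded directly from the bit pattern. The sketch below assumes the usual field order (sign in the top bit, then two exponent bits, then one mantissa bit) with an exponent bias of 1; the function name `decode_e2m1` is made up for illustration and is not part of any NVIDIA API.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (0..15) into the float it represents.

    Assumed layout: bit 3 = sign, bits 2..1 = exponent (bias 1), bit 0 = mantissa.
    """
    if not 0 <= code <= 0xF:
        raise ValueError("code must fit in 4 bits")
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 1
    if exponent == 0:
        # Subnormal: no implicit leading 1, so the only values are 0 and 0.5
        magnitude = 0.5 * mantissa
    else:
        # Normal: implicit leading 1, value = (1 + m/2) * 2^(e - 1)
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude


positive_values = [decode_e2m1(code) for code in range(8)]
print(positive_values)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Codes 8 through 15 mirror these values with the sign bit set, which is why the format has 16 bit patterns but only 15 distinct values (+0.0 and -0.0 compare equal).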
Format Breakdown
| Property | INT4 (Uniform) | NVFP4 (E2M1) |
|---|---|---|
| Values | 16 discrete ints (-8 to 7) | 16 bit patterns, 15 distinct values (±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6) |
| Spacing | Uniform (constant step = 1) | Non-uniform (step grows with magnitude: 0.5, then 1, then 2) |
| Range | Fixed (-8 to 7), stretched by a scale factor | -6 to 6 before scaling; small and large values coexist |
| Scale factor | Per-group scale (in FP16/FP32) | FP8 (E4M3) scale per 16-element block, plus an FP32 per-tensor scale |
| Zero | Always present | Present (+0.0 and -0.0) |
| Hardware | Weight-only kernels dequantize to FP16 in software | Hardware-native on Blackwell |
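The scale factors in the table are not free: they add to the effective storage cost per weight. A back-of-the-envelope sketch follows. The INT4 configuration (group size 128, FP16 scale, no zero-points) is an assumed but common GPTQ/AWQ-style setting, and the per-tensor FP32 scale is ignored as negligible.

```python
def bits_per_weight(element_bits: int, scale_bits: int, group_size: int) -> float:
    """Effective storage per weight when one scale is shared by a group."""
    return element_bits + scale_bits / group_size


# Assumed INT4 setup: 4-bit ints, one FP16 scale per 128 weights
int4_g128 = bits_per_weight(element_bits=4, scale_bits=16, group_size=128)

# NVFP4: 4-bit E2M1 elements, one FP8 (E4M3) scale per 16 weights
nvfp4 = bits_per_weight(element_bits=4, scale_bits=8, group_size=16)

print(int4_g128)  # 4.125
print(nvfp4)      # 4.5
```

Under these assumptions NVFP4 spends slightly more memory per weight in exchange for finer-grained scaling.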
Hardware-Level Support
```text
INT4 Compute Flow
  Weight (INT4) -> Dequantize to FP16 -> FP16 MAC
                   Software overhead!

NVFP4 Compute Flow
  Weight (NVFP4) -> Native FP4 Tensor Core MAC
                    No dequantization: direct FP4 compute!
```
With weight-only INT4 (the GPTQ/AWQ style), the kernel dequantizes weights back to FP16 before the Tensor Core can operate, and this overhead eats into the theoretical speedup. NVFP4 skips that step because Blackwell Tensor Cores operate natively on FP4 values and apply the block scales in hardware.
```python
# Conceptual INT4 path (software dequant required)
weight_int4 = [3, -5, 2, 0]  # stored as 4-bit integers
scale_fp16 = 0.27
weight_fp16 = [w * scale_fp16 for w in weight_int4]
# Then: output = matmul(input_fp16, weight_fp16)

# Conceptual NVFP4 path (no dequant to FP16)
weight_nvfp4 = [1.5, -3.0, 0.5, 0.0]  # native 4-bit floats (E2M1 values)
# output = matmul_fp4_native(input, weight_nvfp4)  # placeholder name, not a real API
# Blackwell Tensor Core handles the FP4 multiply directly
```
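To make the INT4 dequantization step more concrete, here is a minimal pure-Python sketch of storing two signed 4-bit weights per byte and expanding them again. The low-nibble-first layout and the helper names are assumptions for illustration; real GPTQ/AWQ kernels use their own packing orders and run this on the GPU.

```python
def pack_int4(values):
    """Pack signed 4-bit ints (-8..7) two per byte, low nibble first."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    packed = bytearray()
    for low, high in zip(values[0::2], values[1::2]):
        for v in (low, high):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in signed 4 bits")
        # Masking with 0xF keeps the two's-complement nibble
        packed.append((low & 0xF) | ((high & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed):
    """Recover the signed 4-bit ints from packed bytes."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values


def dequantize(packed, scale):
    """The extra step a weight-only INT4 kernel performs before the matmul."""
    return [q * scale for q in unpack_int4(packed)]


packed = pack_int4([3, -5, 2, 0])
print(len(packed))          # 2
print(unpack_int4(packed))  # [3, -5, 2, 0]
```

Calling `dequantize(packed, 0.27)` reproduces the scaled weights from the conceptual example, up to floating-point rounding.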
Weight Distribution Handling: Where NVFP4 Wins
LLM weights are not uniformly distributed. Most weights cluster around zero, but a small fraction (outliers) have very large magnitudes. These outliers are critical for model quality.
| Weight Distribution | INT4 Handling | NVFP4 Handling |
|---|---|---|
| Dense near-zero cluster | Needs small scale factor → loses range | E2M1 naturally represents small values densely |
| Sparse large outliers | Clips or needs large scale → loses precision near zero | Higher exponents capture outliers without sacrificing near-zero precision |
| Long-tailed distribution | Must pick scale trade-off | Built-in logarithmic spacing handles tail |
| Symmetric (zero-centered) | Natural (signed int) | Natural (sign bit) |
| Channel-wise variation | Needs per-channel or per-group scaling (costly) | Less sensitive; the per-block (16-element) scales absorb much of the variation |
```python
# Why NVFP4 handles outliers better
# Typical LLM weight distribution (illustrative):
#   90% of weights: [-0.1, 0.1]  -> needs high precision here
#   10% of weights: [-3.0, 3.0]  -> needs range here

# INT4 uniform with scale=0.02:
#   Values: [-0.16, -0.14, ..., 0.14]  -> precision = 0.02
#   Outliers up to 0.14, everything above is CLIPPED

# NVFP4 non-uniform (E2M1 grid, before scaling):
#   Small magnitudes: [0, 0.5, 1.0, 1.5]    -> step = 0.5 near zero
#   Large magnitudes: [2.0, 3.0, 4.0, 6.0]  -> steps of 1.0, then 2.0
#   Largest magnitude is 6.0; the per-block scale maps this grid onto
#   each block's actual range, so the outlier fits without clipping
```
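The trade-off can be checked numerically on a single block. The sketch below is an illustration under stated assumptions, not NVIDIA's quantizer: both formats use a simple absmax scale, the scale is kept in full precision rather than rounded to FP8, and the four-element block is hand-picked to contain one outlier. Which format wins in general depends on the data.

```python
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes


def quantize_int4(block):
    """Symmetric absmax INT4: one scale per block, integers clamped to [-8, 7].

    Assumes the block contains at least one non-zero value.
    """
    scale = max(abs(x) for x in block) / 7.0
    return [max(-8, min(7, round(x / scale))) * scale for x in block]


def quantize_e2m1(block):
    """Absmax E2M1: scale so the largest magnitude maps to 6.0, then snap
    every value to the nearest grid point (sign handled separately)."""
    scale = max(abs(x) for x in block) / 6.0
    quantized = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        quantized.append(nearest * scale if x >= 0 else -nearest * scale)
    return quantized


def mean_abs_error(original, quantized):
    return sum(abs(a - b) for a, b in zip(original, quantized)) / len(original)


block = [0.3, -0.2, 0.1, 3.0]  # three small weights and one outlier
int4_err = mean_abs_error(block, quantize_int4(block))
e2m1_err = mean_abs_error(block, quantize_e2m1(block))
print(f"INT4 mean abs error: {int4_err:.3f}")  # 0.107
print(f"E2M1 mean abs error: {e2m1_err:.3f}")  # 0.050
```

On this block the outlier forces INT4 to a step of about 0.43, while the E2M1 grid keeps a step of 0.25 near zero, so the small weights are reproduced more closely.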
Accuracy Comparison
| Scenario | Metric | INT4 | NVFP4 | Winner |
|---|---|---|---|---|
| Standard LLM (Llama 2 7B) | Perplexity (lower is better) | 6.05 | 6.08 | INT4 (marginal) |
| Outlier-heavy model (Mixtral 8x7B) | Perplexity | 5.55 | 5.32 | NVFP4 |
| Large-batch inference | Speedup (higher is better) | 0.82x | 1.15x | NVFP4 |
| Fine-tuned adapter model | Perplexity | 6.45 | 6.20 | NVFP4 |
| Zero-shot evaluation avg. | Accuracy (higher is better) | 68.2% | 68.9% | NVFP4 |
| Small model (1B params) | Perplexity | 14.2 | 13.8 | NVFP4 |
When Each Significantly Outperforms the Other
| INT4 Wins | NVFP4 Wins |
|---|---|
| Models with tightly-clustered, uniform weight distributions | Models with heavy outlier activation channels |
| When calibration data is available (GPTQ/AWQ optimize scales) | Zero-calibration / deployment scenarios |
| Legacy GPU hardware (A100, H100), with a wider software ecosystem | Blackwell (B200) hardware, fully native |
| Community ecosystem (vLLM, llama.cpp) is mature for INT4 | 1.5-2x real throughput vs INT4 on capable hardware |
| Fine-grained group scaling with calibrated scales (per-64 scaling can beat NVFP4) | The format's fixed per-16-element block scaling is sufficient |
| Quantization-aware fine-tuned models calibrated for INT4 | Models with naturally heavy-tailed weight distributions (MoE, large LLaMA) |
Hardware Compatibility
| GPU | INT4 weight-only (GPTQ/AWQ) | NVFP4 (Native?) |
|---|---|---|
| A100 | Software dequantization to FP16 | Not supported |
| H100 | Software dequantization (FP8 native) | Not supported |
| H200 | Software dequantization (FP8 native) | Not supported |
| B100 / B200 | Software dequantization | Native Tensor Core support |
| Consumer (RTX 4090) | Software (bitsandbytes) | Not supported |
Key insight: NVFP4 is not just a format change; it is a hardware architecture decision by NVIDIA. On Blackwell GPUs, FP4 matrix operations run natively on the Tensor Cores, while weight-only INT4 still incurs a dequantization step. For data center deployments on B200, NVFP4 is the clear choice. For current hardware (H100/A100), INT4 via GPTQ/AWQ remains the practical 4-bit option.
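The compatibility table can be turned into a small deployment check. This is a sketch that assumes Blackwell-class GPUs report CUDA compute capability 10.x or higher, while A100 (8.0), RTX 4090 (8.9), and H100/H200 (9.0) fall below that line; the function name is hypothetical.

```python
def pick_4bit_format(compute_capability):
    """Choose a 4-bit path from a (major, minor) CUDA compute capability.

    Assumption: native FP4 Tensor Cores are available from major version 10.
    """
    major, _minor = compute_capability
    return "nvfp4" if major >= 10 else "int4-weight-only"


print(pick_4bit_format((10, 0)))  # nvfp4 (B200-class)
print(pick_4bit_format((9, 0)))   # int4-weight-only (H100)
print(pick_4bit_format((8, 0)))   # int4-weight-only (A100)
```

In a PyTorch environment the tuple could come from `torch.cuda.get_device_capability()`, which returns `(major, minor)` for the current device.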
Practical Code: Using NVFP4 with TensorRT-LLM (Blackwell)
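```python
# NVFP4 via TensorRT-LLM on Blackwell
# The names below follow recent TensorRT-LLM LLM API releases and are an
# assumption here; check the documentation for your installed version.
from tensorrt_llm import LLM, BuildConfig
from tensorrt_llm.llmapi import QuantAlgo, QuantConfig

build_config = BuildConfig(
    max_input_len=4096,
    max_seq_len=6144,  # 4096 input + 2048 output tokens
    max_batch_size=32,
)
quant_config = QuantConfig(quant_algo=QuantAlgo.NVFP4)  # hardware-native FP4 on B200

llm = LLM(
    model="meta-llama/Llama-3-70B",
    build_config=build_config,
    quant_config=quant_config,
)

# Inference runs at native FP4 speed, with no dequant step
output = llm.generate(["Explain NVFP4 quantization."])
```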
Using INT4 (current hardware path)
```python
# INT4 via AutoGPTQ, the current standard
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_id = "TheBloke/Llama-2-7B-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    use_triton=False,
    device="cuda:0",
)

# Each forward pass dequantizes INT4 -> FP16
# before Tensor Core compute
```
Learn more at NVIDIA Blackwell Whitepaper and AutoGPTQ.