What is the internal difference between using INT4 vs NVFP4? Which is better, what internal optimizations have been done, and is this related to model training base or usage/agent architecture?
Answer
Internal Differences Between INT4 and NVFP4
The difference between INT4 and NVFP4 is not just about format: it goes down to the silicon level, affecting how matrix math is executed inside the GPU, whether a separate dequantization step is needed, and ultimately what workloads each is optimized for.
Internal Mathematical Pipeline
```text
INT4 Inference Pipeline (weight-only, on any GPU):
  INT4 Weights --> Dequantize (INT4 -> FP16) --+
                                               +--> FP16 MatMul --> Output
  FP16 Input ----------------------------------+
  ^ Additional scale read + 2 FP16 multiplies per weight element
    (one for dequant, one for the matmul)

NVFP4 Inference Pipeline (on Blackwell):
  NVFP4 Weights ---------------------------+
                                           +--> Native FP4 Tensor Core --> Output
  Input (quantized to FP4 for this path) --+
  ^ FP4 multiply-accumulate in hardware
    No separate software dequant step
```
Silicon-Level Internal Operations
| Internal Aspect | INT4 (Software Path) | NVFP4 (Hardware Path) |
|---|---|---|
| Execution unit | Standard Tensor Core (FP16 input) | FP4-native Tensor Core (FP4 input) |
| Dequant step | Done in software inside the kernel (scale read + 1 FP16 multiply per weight element) | Done inside the Tensor Core (block scales applied in hardware) |
| Weight fetch | INT4 weights + one FP16 scale per group (64 to 128 weights) | FP4 weights + one FP8 scale per 16-weight block |
| Compute pipeline | 2-stage: dequant → matmul | 1-stage: matmul only |
| Memory traffic per weight | About 4.1 to 4.25 bits (weights + FP16 group scales) | About 4.5 bits (weights + FP8 block scales) |
| Warp occupancy | Can be lower (dequant uses registers) | Can be higher (more registers left for actual compute) |
| Throughput (effective) | Below the theoretical peak; depends on how well the kernel hides dequant | Closer to the theoretical FP4 peak, since no software dequant competes with the matmul |
```python
# Internal operation difference, illustrated as one scalar inner-loop step.

# INT4 (weight-only): dequantization runs in software inside the kernel.
# Each input flows through this for EVERY weight element.
def int4_inner_loop(input_fp16, weight_int4, scale_fp16, accumulator=0.0):
    # Step 1: Dequantize (internal overhead, one extra FP16 multiply)
    weight_fp16 = weight_int4 * scale_fp16
    # Step 2: Standard multiply-accumulate
    # Two multiplies per weight element in total, not one
    return input_fp16 * weight_fp16 + accumulator


# NVFP4: the Tensor Core consumes FP4 values and their block scale directly.
# Python has no FP4 type, so this models only the arithmetic result. On
# Blackwell the scaling and the multiply-accumulate happen inside the
# hardware matmul instruction, not as separate software steps.
def nvfp4_inner_loop(input_value, weight_fp4, block_scale, accumulator=0.0):
    return input_value * (weight_fp4 * block_scale) + accumulator
```
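The storage cost of the scales can be estimated with a little arithmetic. The sketch below assumes one FP16 scale per INT4 group and one FP8 scale per 16-element NVFP4 block; it ignores INT4 zero-points and the NVFP4 per-tensor FP32 scale, and the helper name `effective_bits_per_weight` is made up for illustration.

```python
def effective_bits_per_weight(value_bits, scale_bits, group_size):
    """Average storage per weight when every group shares one scale."""
    return value_bits + scale_bits / group_size


# INT4 with an FP16 scale shared by 128 weights
int4_g128 = effective_bits_per_weight(4, 16, 128)
# INT4 with an FP16 scale shared by 64 weights
int4_g64 = effective_bits_per_weight(4, 16, 64)
# NVFP4 with an FP8 scale shared by 16 weights
nvfp4 = effective_bits_per_weight(4, 8, 16)

print(int4_g128, int4_g64, nvfp4)  # 4.125 4.25 4.5
```

Under these assumptions NVFP4 moves slightly more bytes per weight than INT4, so its advantage comes from the hardware compute path rather than from a smaller memory footprint.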
Which Is Better? The Answer Depends on Context
| Context | Better Choice | Why |
|---|---|---|
| Pre-Blackwell GPUs (H100/A100) | INT4 | No native NVFP4 Tensor Core support, so there is no real choice |
| Blackwell GPUs (B200), production serving | NVFP4 | Native FP4 throughput, no software dequant overhead |
| Blackwell GPUs, training | Neither | Use FP8/BF16 for training; FP4 is for inference |
| Model with heavy outliers | NVFP4 | Non-uniform FP format handles tail distribution |
| Uniform weight distribution | INT4 | Calibrated INT4 can match or exceed NVFP4 quality |
| CPU inference (llama.cpp) | INT4 (GGUF) | NVFP4 requires NVIDIA GPU silicon |
| Agent architectures (multi-step) | NVFP4 | Lower per-token latency = faster agent pipelines |
| Batch serving (high throughput) | NVFP4 | 1.5-2x effective throughput on Blackwell |
| Ecosystem maturity | INT4 | GPTQ/AWQ tooling is stable; NVFP4 tooling is new |
Internal Optimizations: What Each Format Optimizes For
INT4 internal optimizations:
| Optimization | Details |
|---|---|
| Per-group scaling (group_size=64/128) | Fine-grained calibration: each group gets its own FP16 scale |
| GPTQ layer-wise Hessian | Optimal rounding decisions using second-order information |
| AWQ activation-awareness | Identifies salient weight channels and scales them up |
| ExLlamaV2 4-bit kernel fusion | Fuses dequant + matmul into a single custom CUDA kernel |
| Marlin kernel | Specialized INT4 kernel achieving 90%+ of theoretical bandwidth |
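As a rough illustration of per-group scaling, here is a minimal NumPy sketch of symmetric round-to-nearest INT4 quantization with one FP16 scale per group. It is not the GPTQ or AWQ algorithm (both choose roundings and scales more carefully), and the function names are invented for this example.

```python
import numpy as np


def quantize_int4_groups(weights, group_size=128):
    """Symmetric round-to-nearest INT4 quantization, one scale per group."""
    groups = weights.reshape(-1, group_size)
    # Map the largest magnitude in each group to 7, the INT4 positive limit.
    max_abs = np.abs(groups).max(axis=1, keepdims=True)
    scales = np.where(max_abs == 0, 1.0, max_abs / 7.0)
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)


def dequantize_int4_groups(q, scales):
    """The software dequant step: integer value times its group scale."""
    return q.astype(np.float32) * scales.astype(np.float32)


rng = np.random.default_rng(0)
w = (rng.standard_normal((4, 256)) * 0.02).astype(np.float32)
q, scales = quantize_int4_groups(w)
recon = dequantize_int4_groups(q, scales).reshape(w.shape)
print("max abs error:", float(np.abs(recon - w).max()))
```

The reconstruction error per weight is bounded by roughly half a quantization step, which is why smaller groups (and therefore smaller scales) improve accuracy at the cost of more scale storage.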
NVFP4 internal optimizations:
| Optimization | Details |
|---|---|
| Native FP4 Tensor Core | New SM (Streaming Multiprocessor) instructions on Blackwell |
| Zero-dequant pipeline | Eliminates the separate software dequant stage; scaling is applied inside the Tensor Core |
| Reduced register pressure | Scale handling moves into the Tensor Core, freeing registers that a software dequant path would use |
| Block-level FP4 quantization | Quantized in 16-element blocks, each with its own FP8 (E4M3) scale |
| Unified memory path | Weights and activations can share the same FP4 format, which simplifies the data path |
| Micro-tensor scaling | Two-level scaling: one FP8 scale per 16-element block plus one FP32 scale per tensor |
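To make block-level FP4 quantization concrete, the following NumPy sketch simulates it in floating point. It assumes the E2M1 value grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and 16-element blocks; as a simplification the block scale stays in float32 instead of being rounded to FP8, and the per-tensor scale is omitted. `fake_quantize_nvfp4` is a hypothetical name, not a library function.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)


def fake_quantize_nvfp4(weights, block_size=16):
    """Round weights to the E2M1 grid with one scale per block (simulation)."""
    blocks = weights.astype(np.float32).reshape(-1, block_size)
    max_abs = np.abs(blocks).max(axis=1, keepdims=True)
    # Scale each block so its largest magnitude maps to 6.0, the E2M1 maximum.
    # Real NVFP4 stores this scale in FP8; it is kept in float32 here.
    scales = np.where(max_abs == 0, 1.0, max_abs / 6.0).astype(np.float32)
    scaled = blocks / scales
    # Pick the nearest grid magnitude for every element, then restore the sign.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    quantized = np.sign(scaled) * E2M1_GRID[idx]
    return (quantized * scales).reshape(weights.shape), scales


block = np.array([0.1] * 15 + [12.0], dtype=np.float32)
dequantized, block_scales = fake_quantize_nvfp4(block)
print(dequantized[-1], dequantized[0], block_scales[0, 0])
```

The grid is denser near zero and sparser near the maximum, so a large value in a block keeps its magnitude while small neighbors lose resolution. That non-uniform spacing is what the text means by a floating-point format handling tail distributions.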
Training Base vs Inference vs Agent Architecture
Is NVFP4/INT4 used for model training? Generally no. In this comparison both are post-training, inference-time quantization formats. Base models are typically trained in FP32, BF16, or FP8 (on H100 and newer), and 4-bit training remains largely experimental.
```text
Training Pipeline:
  Full weights (BF16/FP32) -> Forward pass -> Loss -> Backward -> Update
        |
        v  (after training is complete)
Inference Pipeline:
  Full weights -> POST-TRAINING QUANTIZATION (INT4/NVFP4) -> Deploy
```
Inference / Usage Architecture: Both INT4 and NVFP4 apply only to the forward pass during inference. They have zero impact on:
- Model architecture (number of layers, attention heads, etc.)
- Training methodology (pre-training, SFT, RLHF)
- Prompt engineering or chain-of-thought behavior
```python
# Quantization is a deployment step, not a training step.
# NOTE: config, train_base_model, apply_int4_quantization,
# apply_nvfp4_quantization, and serve_model are placeholders for your own
# pipeline, not functions from a real library.

# Step 1: Train with BF16 (standard)
model = train_base_model(config)  # returns a BF16 model

# Step 2: Post-training quantization for deployment.
# This is where INT4 or NVFP4 comes in, after training. Pick one:
use_nvfp4 = True  # for example, True when deploying on Blackwell
if use_nvfp4:
    quantized_model = apply_nvfp4_quantization(model)  # Option B
else:
    quantized_model = apply_int4_quantization(model)   # Option A

# Step 3: Deploy the quantized model for inference
serve_model(quantized_model)
```
Agent Architecture: Quantization format significantly impacts agent performance because agents make multiple sequential LLM calls:
```text
Agent Loop (multi-step):
  User Query -> LLM Call 1 (Plan) -> LLM Call 2 (Tool call)
             -> LLM Call 3 (Reason) -> LLM Call 4 (Respond) -> User

Each call goes through the quantization pipeline.
NVFP4's lower per-call latency adds up across 4+ calls.
```
| Agent Pattern | INT4 Latency Impact | NVFP4 Latency Impact |
|---|---|---|
| ReAct (reason + act) | 3-5 calls × 150ms | 3-5 calls × 100ms |
| Multi-agent debate | 6-10 calls × 150ms | 6-10 calls × 100ms |
| Tool-use orchestration | 2-8 calls × 150ms | 2-8 calls × 100ms |
| Chain-of-thought agents | 1 call × 300ms | 1 call × 200ms |
Key insight: For single-turn inference, the illustrative figures above put the INT4 vs NVFP4 latency gap at roughly 30% (150 ms vs 100 ms per call). The relative gap stays the same for multi-turn agents, but the absolute gap grows with every sequential call: at 10 calls it is 1.5 seconds vs 1.0 second per interaction, which is large enough to matter for user experience.
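The arithmetic behind that comparison is simple enough to check directly. This sketch reuses the illustrative 150 ms and 100 ms per-call figures from the table and assumes the calls run strictly one after another; the function name is made up for this example.

```python
def agent_latency_ms(num_calls, per_call_ms):
    """Total latency when LLM calls run strictly one after another."""
    return num_calls * per_call_ms


for calls in (1, 5, 10):
    int4_total = agent_latency_ms(calls, 150)
    nvfp4_total = agent_latency_ms(calls, 100)
    print(f"{calls:>2} calls: INT4 {int4_total} ms, NVFP4 {nvfp4_total} ms, "
          f"saved {int4_total - nvfp4_total} ms")
```

The saving per call is constant at 50 ms, so the total saving scales linearly with the number of sequential calls.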
Decision Flowchart
```text
Need to quantize a model for inference?
├── Running on H100/A100?
│   └── Use INT4 (GPTQ/AWQ): NVFP4 not available
├── Running on B200 (Blackwell)?
│   ├── Agent-like multi-turn workload?
│   │   └── Use NVFP4: compound latency savings
│   ├── Batch throughput critical?
│   │   └── Use NVFP4: 1.5-2x effective throughput
│   ├── Outlier-heavy model (MoE, very large)?
│   │   └── Use NVFP4: FP format handles outliers
│   └── Simple chat / single-turn?
│       └── INT4 (AWQ) is fine: maturity advantage
├── Running on CPU / edge?
│   └── Use INT4 (GGUF): only option
└── Training?
    └── None: use BF16/FP8. INT4/NVFP4 are inference-only
```
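The same flowchart can be written as a small lookup function, which is handy if the choice has to be made in a deployment script. The hardware and workload labels below are invented for this sketch and do not correspond to any real API.

```python
def choose_quant_format(hardware, workload="chat", training=False):
    """Return the format suggested by the flowchart above.

    hardware: "hopper_ampere", "blackwell", or "cpu" (labels invented here).
    workload: "agent", "batch", "outlier_heavy", or "chat" (also invented).
    """
    if training:
        return "none (train in BF16/FP8)"
    if hardware == "cpu":
        return "INT4 (GGUF)"
    if hardware == "hopper_ampere":
        return "INT4 (GPTQ/AWQ)"
    if hardware == "blackwell":
        if workload in {"agent", "batch", "outlier_heavy"}:
            return "NVFP4"
        return "INT4 (AWQ)"
    raise ValueError(f"unknown hardware label: {hardware!r}")


print(choose_quant_format("blackwell", "agent"))
print(choose_quant_format("hopper_ampere"))
```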
Learn more at NVIDIA Blackwell Architecture and TensorRT-LLM Quantization Guide.