What is the internal difference between using INT4 vs NVFP4? Which is better, what internal optimizations have been done, and is this related to model training base or usage/agent architecture?

#gen-ai #quantization #nvfp4 #int4 #nvidia #blackwell #llm #agent-architecture #optimization #silicon

Answer

Internal Differences Between INT4 and NVFP4

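The difference between INT4 and NVFP4 is not just about format. INT4 stores 4-bit integers with a per-group scale, while NVFP4 stores 4-bit floating-point (E2M1) values with an FP8 scale per 16-value block and an FP32 scale per tensor. The difference goes down to the silicon level: it affects how matrix math is executed inside the GPU, whether dequantization runs in software or is handled inside the Tensor Core, and ultimately which workloads each format suits.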

Internal Mathematical Pipeline

```text
INT4 Inference Pipeline (weight-only, on any GPU):
── INT4 Weights ──┐
                  ├─→ Dequantize (INT4 → FP16) ──→ FP16 MatMul ──→ Output
── FP16 Input ────┘      │
                    Additional scale read
                    + 1 FP16 multiply
                    per weight element

NVFP4 Inference Pipeline (on Blackwell):
── NVFP4 Weights ─┐
                  ├─→ Native FP4 Tensor Core ──→ Output
── FP4 Input ─────┘      │
                    Activations quantized to FP4 before the matmul
                    Block scales applied inside the Tensor Core
                    No separate software dequant pass
```

Silicon-Level Internal Operations

| Internal Aspect | INT4 (Software Path) | NVFP4 (Hardware Path) |
|---|---|---|
| Execution unit | Standard Tensor Core (FP16 input) | FP4-native Tensor Core (FP4 input) |
| Dequant cost | In software, per weight element (scale read + 1 FP16 multiply) | None in software (block scales applied inside the Tensor Core) |
| Weight fetch | INT4 weights + one FP16 scale per group | NVFP4 weights + one FP8 scale per 16-value block |
| Compute pipeline | 2-stage: dequant → matmul | 1-stage: matmul only |
| Memory bandwidth per token | Weight tensor + FP16 scale tensor (about 4.1-4.25 bits per weight) | Weight tensor + FP8 scale tensor (about 4.5 bits per weight) |
| Warp occupancy | Lower (dequant uses registers) | Higher (more registers for actual compute) |
| Throughput (effective) | Roughly 60-75% of theoretical INT4 speed (kernel-dependent estimate) | Close to theoretical FP4 speed (no dequant stage competing for cycles) |
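
The storage side of this comparison can be checked with simple arithmetic. The sketch below assumes one FP16 scale per INT4 group with no zero-point, and one FP8 scale per 16-value block for NVFP4; the per-tensor FP32 scale is ignored because it is amortized over the whole tensor. The function name is illustrative.

```python
def bits_per_weight(value_bits: int, scale_bits: int, group_size: int) -> float:
    """Effective storage per weight: payload bits plus the amortized scale."""
    return value_bits + scale_bits / group_size

int4_g128 = bits_per_weight(4, 16, 128)  # INT4, FP16 scale per 128 weights
int4_g64 = bits_per_weight(4, 16, 64)    # INT4, FP16 scale per 64 weights
nvfp4 = bits_per_weight(4, 8, 16)        # NVFP4, FP8 scale per 16 weights

print(f"INT4 g128: {int4_g128} bits/weight")  # 4.125
print(f"INT4 g64:  {int4_g64} bits/weight")   # 4.25
print(f"NVFP4:     {nvfp4} bits/weight")      # 4.5
```

Under these assumptions NVFP4 carries slightly more scale metadata per weight than group-128 INT4, in exchange for much finer-grained scaling.
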
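
```python
# Internal operation difference illustrated (scalar model of one weight element)

# INT4: dequantization is done in software inside the kernel.
# Each token input flows through this for EVERY weight:
def int4_inner_loop(input_fp16, weight_int4, scale_fp16, accumulator=0.0):
    # Step 1: Dequantize (internal overhead)
    weight_fp16 = weight_int4 * scale_fp16  # Extra FP16 multiply

    # Step 2: Standard multiply-accumulate
    # (one extra multiply per weight element on top of the MAC itself)
    return input_fp16 * weight_fp16 + accumulator

# NVFP4: hardware-native block-scaled operation.
# On Blackwell this maps to a Tensor Core MMA instruction that consumes the
# FP4 operands together with their block scales. This function models only
# the arithmetic, not the instruction.
def nvfp4_inner_loop(input_fp4, weight_fp4, input_scale, weight_scale, accumulator=0.0):
    # No software dequant step: the scales are applied by the hardware
    return (input_fp4 * weight_fp4) * (input_scale * weight_scale) + accumulator

# Both paths give the same numeric result for matching values
print(int4_inner_loop(2.0, 3, 0.5))          # 3.0
print(nvfp4_inner_loop(2.0, 3.0, 1.0, 0.5))  # 3.0
```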
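
One way to see why the scale does not have to be applied weight by weight: inside a block that shares one scale, the scale factors out of the dot product. The NumPy sketch below checks that identity numerically. It models only the arithmetic, not a real kernel, and the block size, scale value, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
block = 16                              # elements sharing one scale
x = rng.standard_normal(block)          # activations
q = rng.integers(-7, 8, size=block)     # 4-bit integer codes in [-7, 7]
scale = 0.05                            # shared per-block scale

# Two-stage path: dequantize every weight, then multiply-accumulate.
two_stage = np.dot(x, q * scale)

# Fused path: accumulate on the raw codes, apply the scale once per block.
fused = scale * np.dot(x, q)

print(np.isclose(two_stage, fused))
```

The two-stage form spends one multiply per weight on dequantization, while the fused form spends one per block.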

Which Is Better? The Answer Depends on Context

| Context | Better Choice | Why |
|---|---|---|
| Pre-Blackwell GPUs (H100/A100) | INT4 | No native FP4 Tensor Cores, so NVFP4 is not a practical option |
| Blackwell GPUs (B200), production serving | NVFP4 | Native FP4 throughput, no software dequant overhead |
| Blackwell GPUs, training | Neither | Use FP8/BF16 for training; FP4 is mainly for inference |
| Model with heavy outliers | NVFP4 | Non-uniform FP format handles tail distribution |
| Uniform weight distribution | INT4 | Calibrated INT4 can match or exceed NVFP4 quality |
| CPU inference (llama.cpp) | INT4 (GGUF) | NVFP4 requires NVIDIA GPU silicon |
| Agent architectures (multi-step) | NVFP4 | Lower per-token latency = faster agent pipelines |
| Batch serving (high throughput) | NVFP4 | 1.5-2x effective throughput on Blackwell |
| Ecosystem maturity | INT4 | GPTQ/AWQ tooling is stable; NVFP4 tooling is new |

Internal Optimizations — What Each Format Optimizes For

INT4 internal optimizations:

| Optimization | Details |
|---|---|
| Per-group scaling (group_size=64/128) | Fine-grained calibration: each group gets its own FP16 scale |
| GPTQ layer-wise Hessian | Optimal rounding decisions using second-order information |
| AWQ activation-awareness | Identifies salient weight channels and scales them up |
| ExLlamaV2 4-bit kernel fusion | Fuses dequant + matmul into a single custom CUDA kernel |
| Marlin kernel | Specialized INT4 kernel achieving 90%+ of theoretical bandwidth |
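
Per-group scaling can be sketched in a few lines of NumPy. This is plain symmetric round-to-nearest quantization, not GPTQ or AWQ (both add calibration on top), and the function names are illustrative rather than taken from any library.

```python
import numpy as np

def quantize_int4_groups(w: np.ndarray, group_size: int = 128):
    """Symmetric per-group INT4 quantization: one scale per group of weights."""
    groups = w.reshape(-1, group_size)
    # Map the largest magnitude in each group to the symmetric INT4 limit of 7.
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid division by zero
    codes = np.clip(np.round(groups / scales), -7, 7).astype(np.int8)
    return codes, scales

def dequantize_int4_groups(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """The step an INT4 inference kernel performs before the matmul."""
    return (codes * scales).reshape(-1)

rng = np.random.default_rng(0)
weights = rng.standard_normal(1024).astype(np.float32)
codes, scales = quantize_int4_groups(weights, group_size=128)
restored = dequantize_int4_groups(codes, scales)
max_err = np.abs(weights - restored).max()

print("codes:", codes.shape, "scales:", scales.shape)
print("max abs error:", float(max_err))
```

With round-to-nearest, the error of each weight is bounded by half of its group's scale, which is why smaller groups give better accuracy at the cost of more scale metadata.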

NVFP4 internal optimizations:

| Optimization | Details |
|---|---|
| Native FP4 Tensor Core | New Tensor Core instructions in the Blackwell SM (Streaming Multiprocessor) |
| No software dequant pass | Block scales are applied inside the Tensor Core instead of in a separate kernel stage |
| Reduced register pressure | Dequantized weights are not staged in registers, leaving an estimated 15-20% more register capacity for compute |
| Block-level FP4 quantization | Values are quantized in blocks of 16, each with its own FP8 (E4M3) scale |
| Unified memory path | Weights and activations can share the same FP4 format, so both matmul operands feed the Tensor Core directly |
| Micro-tensor scaling | Two-level scaling: an FP8 scale per 16-value block plus an FP32 scale per tensor, on top of the exponent bits in each E2M1 value |
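
The FP4 value grid and block scaling can be illustrated with a short NumPy sketch. It decodes the 4-bit E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit) and rounds each 16-value block to that grid. Two simplifying assumptions: the block scale is kept in full precision instead of being stored as FP8, and the per-tensor FP32 scale is omitted. The function names are illustrative.

```python
import numpy as np

def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if code & 0b1000 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 0b1
    if exponent == 0:  # subnormal range: 0 or 0.5
        return sign * mantissa * 0.5
    return sign * (1.0 + mantissa / 2.0) * 2.0 ** (exponent - 1)

# All representable magnitudes, derived from the 16 possible codes.
GRID = np.array(sorted({abs(decode_e2m1(c)) for c in range(16)}))

def quantize_fp4_blocks(x: np.ndarray, block_size: int = 16):
    """Scale each block so its max magnitude maps to 6, then round to the grid."""
    blocks = x.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / GRID[-1]
    scales = np.where(scales == 0, 1.0, scales)  # avoid division by zero
    scaled = blocks / scales
    # Nearest grid magnitude for every element; the sign is restored afterwards.
    idx = np.abs(np.abs(scaled)[..., None] - GRID).argmin(axis=-1)
    return np.sign(scaled) * GRID[idx], scales

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
fp4_values, scales = quantize_fp4_blocks(x)
restored = (fp4_values * scales).reshape(-1)

print(GRID.tolist())  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
print("max abs error:", float(np.abs(x - restored).max()))
```

The grid is dense near zero and sparse near the maximum, which is the non-uniform spacing that helps with outlier-heavy distributions.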

Training Base vs Inference vs Agent Architecture

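Is NVFP4/INT4 used for model training? In standard practice, no: both are post-training, inference-time quantization formats. Base models are normally trained in BF16 or FP32, or in FP8 on Hopper-class and newer GPUs such as the H100. Training directly in 4-bit is an active research area, not the default workflow.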

```text
Training Pipeline:
Full weights (BF16/FP32) → Forward pass → Loss → Backward → Update
                                      ↓
                              After training complete
                                      ↓
Inference Pipeline:                   ↓
Full weights → POST-TRAINING QUANTIZATION (INT4/NVFP4) → Deploy
```

Inference / Usage Architecture: Both INT4 and NVFP4 are applied at deployment and change only how the forward pass is computed during inference. They do not change:

  • Model architecture (number of layers, attention heads, etc.)
  • Training methodology (pre-training, SFT, RLHF)
  • Prompt engineering or chain-of-thought techniques (output quality can shift slightly, so re-run evaluations after quantizing)
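
```python
# Quantization is a deployment step, not a training step.
# Every helper here is a hypothetical placeholder that only tags the model
# with its current precision; none of them is a real library call.

def train_base_model(config):
    return {"name": config["name"], "dtype": "bf16"}

def apply_int4_quantization(model):
    return {**model, "dtype": "int4"}

def apply_nvfp4_quantization(model):
    return {**model, "dtype": "nvfp4"}

def serve_model(model):
    print(f"serving {model['name']} in {model['dtype']}")

# Step 1: Train with BF16 (standard)
model = train_base_model({"name": "demo-llm"})

# Step 2: POST-training quantization for deployment (pick ONE option)
quantized_model = apply_int4_quantization(model)       # Option A
# quantized_model = apply_nvfp4_quantization(model)    # Option B

# Step 3: Deploy the quantized model for inference
serve_model(quantized_model)
```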

Agent Architecture: Quantization format significantly impacts agent performance because agents make multiple sequential LLM calls:

```text
Agent Loop (multi-step):
User Query → LLM Call 1 (Plan) → LLM Call 2 (Tool call) →
LLM Call 3 (Reason) → LLM Call 4 (Respond) → User

Each call goes through the quantization pipeline.
NVFP4's lower per-token latency compounds across 4+ calls.
```

| Agent Pattern | INT4 Latency Impact | NVFP4 Latency Impact |
|---|---|---|
| ReAct (reason + act) | 3-5 calls × 150ms | 3-5 calls × 100ms |
| Multi-agent debate | 6-10 calls × 150ms | 6-10 calls × 100ms |
| Tool-use orchestration | 2-8 calls × 150ms | 2-8 calls × 100ms |
| Chain-of-thought agents | 1 call × 300ms | 1 call × 200ms |

Key insight: Using the illustrative per-call figures above (150 ms vs 100 ms), NVFP4 cuts latency by roughly one third on a single call. The relative gap stays the same for multi-turn agent architectures, but the absolute gap grows with every sequential LLM call: at 10 calls it is 1.5 seconds vs 1.0 second of model latency, which matters for user experience.
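
The arithmetic behind these totals is a simple product, sketched below. The 150 ms and 100 ms per-call figures are the illustrative values from the table above, not measurements, and the function name is hypothetical.

```python
def agent_latency_ms(num_calls: int, per_call_ms: float) -> float:
    """Total model latency for sequential LLM calls (tool and network time excluded)."""
    return num_calls * per_call_ms

INT4_MS, NVFP4_MS = 150.0, 100.0  # illustrative per-call figures

for calls in (1, 5, 10):
    int4_total = agent_latency_ms(calls, INT4_MS)
    nvfp4_total = agent_latency_ms(calls, NVFP4_MS)
    saved = int4_total - nvfp4_total
    print(f"{calls:>2} calls: INT4 {int4_total:.0f} ms, "
          f"NVFP4 {nvfp4_total:.0f} ms, saved {saved:.0f} ms")
```

The saving scales linearly with the number of sequential calls, so the benefit is largest for long agent loops.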

Decision Flowchart

```text
Need to quantize a model for inference?
├── Running on H100/A100?
│   └── Use INT4 (GPTQ/AWQ): NVFP4 not available
├── Running on B200 (Blackwell)?
│   ├── Agent-like multi-turn workload?
│   │   └── Use NVFP4: cumulative latency savings
│   ├── Batch throughput critical?
│   │   └── Use NVFP4: 1.5-2x effective throughput
│   ├── Outlier-heavy model (MoE, very large)?
│   │   └── Use NVFP4: FP format handles outliers
│   └── Simple chat / single-turn?
│       └── INT4 (AWQ) is fine: maturity advantage
├── Running on CPU / edge?
│   └── Use INT4 (GGUF): only option
└── Training?
    └── None: use BF16/FP8. INT4/NVFP4 are primarily inference formats
```
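
The flowchart above can also be written as a small lookup function. This is only a sketch of the same decision logic; the hardware and workload labels are made-up strings for illustration, not identifiers from any NVIDIA tool.

```python
def choose_quant_format(hardware: str, workload: str = "chat") -> str:
    """Return a suggested format, following the flowchart.

    hardware: "hopper", "ampere", "blackwell", or "cpu" (illustrative labels)
    workload: "agent", "batch", "outlier_heavy", "chat", or "training"
    """
    if workload == "training":
        return "BF16/FP8"
    if hardware in ("hopper", "ampere"):
        return "INT4 (GPTQ/AWQ)"
    if hardware == "cpu":
        return "INT4 (GGUF)"
    if hardware == "blackwell":
        if workload in ("agent", "batch", "outlier_heavy"):
            return "NVFP4"
        return "INT4 (AWQ)"
    raise ValueError(f"unknown hardware: {hardware!r}")

print(choose_quant_format("blackwell", "agent"))  # NVFP4
print(choose_quant_format("hopper", "batch"))     # INT4 (GPTQ/AWQ)
print(choose_quant_format("cpu"))                 # INT4 (GGUF)
```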

Learn more at NVIDIA Blackwell Architecture and TensorRT-LLM Quantization Guide.