What is the internal difference between using INT4 vs NVFP4? Which is better, what internal optimizations have been done, and is this related to model training base or usage/agent architecture?

Question

Accepted Answer

## Internal Differences Between INT4 and NVFP4

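The difference between INT4 and NVFP4 goes beyond the number format, down to the **silicon level**. INT4 is typically a weight-only storage format that a kernel expands to FP16/BF16 before a standard Tensor Core matmul. NVFP4 stores 4-bit E2M1 floating-point values with a shared scale per 16-element block, and Blackwell's FP4 Tensor Cores consume it directly, applying the block scales in hardware. This changes how matrix math executes inside the GPU, whether a software dequantization step is needed, and which workloads each format suits.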

### Internal Mathematical Pipeline

```
INT4 Weight-Only Inference Pipeline (typical GPTQ/AWQ kernels, most GPUs):
── INT4 Weights + group scales ──┐
                                 ├─→ Dequantize (INT4 → FP16) ──→ FP16 MatMul ──→ Output
── FP16 Input ───────────────────┘      │
                                   Extra scale read + one multiply
                                   per weight element (usually fused
                                   into the matmul kernel)

NVFP4 Inference Pipeline (on Blackwell):
── NVFP4 Weights + FP8 block scales ──┐
                                      ├─→ Block-scaled FP4 Tensor Core MMA ──→ Output
── Input quantized to NVFP4 ──────────┘      │
                                        Block scales applied inside the Tensor Core
                                        No separate software dequant stage
```

### Silicon-Level Internal Operations

| Internal Aspect | INT4 (Weight-Only, Software Dequant) | NVFP4 (Hardware Path) |
|-----------------|---------------------|----------------------|
| **Execution unit** | Standard Tensor Core (FP16/BF16 operands) | FP4-capable Tensor Core (FP4 operands) |
| **Dequant work** | Scale read + 1 multiply per weight element inside the kernel | None in software; block scales are applied by the Tensor Core |
| **Weight fetch** | INT4 weights + one FP16 scale per group (plus a zero-point if asymmetric) | E2M1 weights + one FP8 (E4M3) scale per 16-element block + one per-tensor FP32 scale |
| **Compute pipeline** | 2-stage: dequant → matmul (usually fused into one kernel) | 1-stage: block-scaled matmul |
| **Memory traffic per weight** | About 4.1 to 4.3 bits (group size 128 or 64, FP16 scale) | About 4.5 bits (block size 16, FP8 scale) |
| **Register use** | Higher (scales and dequantized values held in registers) | Lower (no software dequant) |
| **Throughput (effective)** | Bounded by the FP16 Tensor Core rate; helps most when decoding is memory-bound | Can reach the higher FP4 Tensor Core rate when activations are also FP4 |
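
As a rough check on the memory row of the table, the average storage per weight follows from the scale layout. The sketch below assumes symmetric INT4 with one FP16 scale per 128-weight group, and NVFP4 as NVIDIA describes it publicly (4-bit elements, one FP8 scale per 16-element block, with the single per-tensor FP32 scale ignored as negligible). `bits_per_weight` is a helper invented for this illustration.

```python
def bits_per_weight(element_bits: int, scale_bits: int, group_size: int) -> float:
    """Average storage per weight when each group of weights shares one scale."""
    return element_bits + scale_bits / group_size

# Assumed layouts (see the lead-in): these are illustrative, not measured values.
int4_g128 = bits_per_weight(element_bits=4, scale_bits=16, group_size=128)
nvfp4_b16 = bits_per_weight(element_bits=4, scale_bits=8, group_size=16)

print(f"INT4  (group=128, FP16 scale): {int4_g128} bits/weight")
print(f"NVFP4 (block=16,  FP8 scale):  {nvfp4_b16} bits/weight")
```

Under these assumptions NVFP4 is not smaller than group-wise INT4, so any speed advantage would come from the Tensor Core path rather than from reading fewer weight bytes.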

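```python
# Internal operation difference, emulated with NumPy (a numerical sketch, not a GPU kernel)
import numpy as np

# INT4 weight-only path: the kernel dequantizes, then runs a standard FP16 matmul.
# This happens for EVERY weight that each input token touches:
def int4_inner_loop(input_fp16, weight_int4, scale_fp16, accumulator=0.0):
    # Step 1: Dequantize (overhead: one extra multiply per weight element)
    weight_fp16 = weight_int4.astype(np.float16) * np.float16(scale_fp16)

    # Step 2: Standard multiply-accumulate, accumulated in FP32
    return accumulator + np.dot(input_fp16.astype(np.float32), weight_fp16.astype(np.float32))
    # 2 multiplies per weight element (dequant + MAC), not 1

# NVFP4 path: both operands are FP4 (E2M1) values; each 16-element block shares a scale.
# On Blackwell a block-scaled Tensor Core MMA does this in hardware; here it is emulated.
def nvfp4_inner_loop(input_e2m1, input_scale, weight_e2m1, weight_scale, accumulator=0.0):
    # No per-element dequant step: multiply the raw 4-bit values ...
    block_dot = np.dot(input_e2m1.astype(np.float32), weight_e2m1.astype(np.float32))
    # ... and apply the two block scales once per block
    return accumulator + block_dot * np.float32(input_scale) * np.float32(weight_scale)
```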

### Which Is Better? The Answer Depends on Context

| Context | Better Choice | Why |
|---------|---------------|-----|
| **Hopper/Ampere GPUs (H100/A100)** | INT4 | No native FP4 Tensor Cores, so NVFP4 brings no hardware speedup |
| **Blackwell GPUs (B200), production serving** | NVFP4 | Native FP4 Tensor Core throughput, no software dequant step |
| **Blackwell GPUs, training** | Usually neither | BF16/FP8 remain the default; 4-bit training is still an emerging technique |
| **Model with heavy outliers** | NVFP4 | Non-uniform FP format handles tail distribution |
| **Uniform weight distribution** | INT4 | Calibrated INT4 can match or exceed NVFP4 quality |
| **CPU inference (llama.cpp)** | INT4 (GGUF) | NVFP4 requires NVIDIA GPU silicon |
| **Agent architectures (multi-step)** | NVFP4 (on Blackwell) | Lower per-call latency adds up across sequential calls |
| **Batch serving (high throughput)** | NVFP4 (on Blackwell) | Higher FP4 Tensor Core throughput; measure the actual gain on your own workload |
| **Ecosystem maturity** | INT4 | GPTQ/AWQ tooling is stable; NVFP4 tooling is new |

### Internal Optimizations — What Each Format Optimizes For

**INT4 internal optimizations:**

| Optimization | Details |
|-------------|---------|
| Per-group scaling (group_size=64/128) | Fine-grained calibration — each group gets its own FP16 scale |
| GPTQ layer-wise Hessian | Error-compensating rounding that uses approximate second-order (Hessian) information, one layer at a time |
| AWQ activation-awareness | Uses activation statistics to identify salient weight channels and scales them up before quantization |
| ExLlamaV2 4-bit kernel fusion | Fuses dequant + matmul into a single custom CUDA kernel |
| Marlin kernel | Specialized INT4 × FP16 kernel designed to stay close to the memory-bandwidth limit at small to moderate batch sizes |
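
To make the per-group scaling row concrete, here is a minimal sketch of symmetric round-to-nearest INT4 quantization with one FP16 scale per group. It is the simple baseline that GPTQ and AWQ improve on, not an implementation of either; the function names are hypothetical, and the codes are kept one per `int8` rather than packed two per byte.

```python
import numpy as np

def quantize_int4_groups(weights: np.ndarray, group_size: int = 128):
    """Symmetric round-to-nearest INT4 quantization with one scale per group."""
    groups = weights.astype(np.float32).reshape(-1, group_size)
    # Map the largest magnitude in each group to 7 (the INT4 range is -8..7).
    max_abs = np.abs(groups).max(axis=1, keepdims=True)
    scales = np.where(max_abs > 0, max_abs / 7.0, 1.0)
    codes = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return codes, scales.astype(np.float16)

def dequantize_int4_groups(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate weights: the step an INT4 kernel runs before the matmul."""
    return codes.astype(np.float32) * scales.astype(np.float32)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
codes, scales = quantize_int4_groups(w)
w_hat = dequantize_int4_groups(codes, scales).reshape(-1)
print("groups:", codes.shape[0], "max abs error:", float(np.abs(w - w_hat).max()))
```

A smaller `group_size` lowers the reconstruction error at the cost of storing more scales.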

**NVFP4 internal optimizations:**

| Optimization | Details |
|-------------|---------|
| Native FP4 Tensor Core | New Tensor Core instructions on Blackwell SMs (Streaming Multiprocessors) that take FP4 operands |
| Hardware-applied scaling | Block scales are applied inside the Tensor Core, so no separate software dequant stage is needed |
| Reduced register pressure | No dequantized FP16 weight copies held in registers, leaving more room for compute |
| Block-level FP4 quantization | Values are quantized in 16-element blocks, each sharing one FP8 (E4M3) scale |
| Shared low-precision path | Weights and activations can both be NVFP4, so the matmul runs entirely on FP4 operands |
| Two-level micro-block scaling | Per-block FP8 scales plus a per-tensor FP32 scale extend the narrow range of the 4-bit E2M1 values |
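
The block-level quantization row can be illustrated with a small NumPy emulation. This is a sketch under simplifying assumptions: it rounds each 16-element block onto the E2M1 value grid after scaling the block maximum to 6, but keeps the block scale in FP32 instead of FP8 (E4M3), omits the per-tensor FP32 scale, and breaks ties toward the smaller magnitude. `quantize_fp4_blocks` is a hypothetical name.

```python
import numpy as np

# Non-negative values representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quantize_fp4_blocks(x: np.ndarray, block_size: int = 16):
    """Round each block to the E2M1 grid after scaling its max magnitude to 6."""
    blocks = x.astype(np.float32).reshape(-1, block_size)
    max_abs = np.abs(blocks).max(axis=1, keepdims=True)
    scales = np.where(max_abs > 0, max_abs / 6.0, 1.0).astype(np.float32)
    scaled = blocks / scales
    # Pick the nearest grid magnitude for every element, then restore the sign.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    values = np.sign(scaled) * E2M1_GRID[idx]
    return values, scales

rng = np.random.default_rng(0)
x = rng.normal(size=64).astype(np.float32)
values, scales = quantize_fp4_blocks(x)
x_hat = (values * scales).reshape(-1)
print("magnitudes used:", np.unique(np.abs(values)))
print("max abs error:", float(np.abs(x - x_hat).max()))
```

The grid is denser near zero and coarser near the block maximum, which is the non-uniform spacing that distinguishes FP4 from INT4's evenly spaced levels.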

### Training Base vs Inference vs Agent Architecture

**Is NVFP4/INT4 used for model training?**
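Usually not. In most deployments these are **inference-time** formats applied by post-training quantization, while training runs in FP32, BF16, or FP8 (on H100-class and newer GPUs). There are exceptions: quantization-aware training and QLoRA-style fine-tuning involve 4-bit weights, and NVIDIA has reported research on pretraining with NVFP4. Training a base model in 4-bit is still not standard practice.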

```
Training Pipeline:
Full weights (BF16/FP32) → Forward pass → Loss → Backward → Update
                                      ↓
                              After training complete
                                      ↓
Inference Pipeline:                   ↓
Full weights → POST-TRAINING QUANTIZATION (INT4/NVFP4) → Deploy
```

**Inference / Usage Architecture:**
Both INT4 and NVFP4 are typically applied to the forward pass at inference time. They do not change:
- Model architecture (number of layers, attention heads, etc.)
- Training methodology (pre-training, SFT, RLHF)
- How you write prompts or structure chain-of-thought

Quantization error can still shift output quality slightly, so evaluate accuracy after quantizing.

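```python
# Quantization is usually a deployment step, not a training step.
# Every function here is a hypothetical placeholder that only tracks the weight format.

def train_base_model(config):
    return {"name": config["name"], "weights": "bf16"}  # training yields BF16 weights

def apply_int4_quantization(model):
    return {**model, "weights": "int4"}    # stands in for e.g. GPTQ or AWQ

def apply_nvfp4_quantization(model):
    return {**model, "weights": "nvfp4"}   # stands in for an NVFP4 export step

def serve_model(model):
    print(f"serving {model['name']} with {model['weights']} weights")

# Step 1: Train with BF16 (standard)
model = train_base_model({"name": "demo-model"})

# Step 2: POST-training quantization for deployment
# This is where INT4 or NVFP4 comes in, after training (pick one option)
quantized_model = apply_int4_quantization(model)      # Option A
# quantized_model = apply_nvfp4_quantization(model)   # Option B

# Step 3: Deploy the quantized model for inference
serve_model(quantized_model)
```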

**Agent Architecture:**
Quantization format can noticeably affect end-to-end agent latency because agents make multiple sequential LLM calls, so any per-call saving is multiplied by the number of calls:

```
Agent Loop (multi-step):
User Query → LLM Call 1 (Plan) → LLM Call 2 (Tool call) →
LLM Call 3 (Reason) → LLM Call 4 (Respond) → User

Each call goes through the quantization pipeline.
NVFP4's lower per-token latency compounds across 4+ calls.
```

| Agent Pattern | INT4 Latency (illustrative) | NVFP4 Latency (illustrative) |
|--------------|-------------------|---------------------|
| ReAct (reason + act) | 3-5 calls × 150ms | 3-5 calls × 100ms |
| Multi-agent debate | 6-10 calls × 150ms | 6-10 calls × 100ms |
| Tool-use orchestration | 2-8 calls × 150ms | 2-8 calls × 100ms |
| Chain-of-thought agents | 1 call × 300ms | 1 call × 200ms |

> **Key insight:** Using the illustrative figures above (150 ms vs 100 ms per call), NVFP4 saves about a third of the latency on a single-turn request. For multi-turn agent architectures with 5-10 sequential LLM calls, the saving adds up linearly: about 1.5 seconds vs 1.0 second for a 10-call interaction, which matters for user experience. Actual numbers depend on the model, batch size, and hardware.
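
The arithmetic behind the table is a simple product of call count and per-call latency. The sketch below reuses the illustrative 150 ms and 100 ms figures and ignores tool execution and network time; `agent_latency_ms` is a helper made up for this example.

```python
def agent_latency_ms(num_calls: int, per_call_ms: float) -> float:
    """Total latency of sequential LLM calls, ignoring tool and network time."""
    return num_calls * per_call_ms

for calls in (1, 5, 10):
    int4 = agent_latency_ms(calls, 150)   # illustrative INT4 per-call latency
    nvfp4 = agent_latency_ms(calls, 100)  # illustrative NVFP4 per-call latency
    print(f"{calls:>2} calls: INT4 {int4:.0f} ms, NVFP4 {nvfp4:.0f} ms, "
          f"saved {int4 - nvfp4:.0f} ms")
```

The relative saving stays constant at one third; only the absolute saving grows with the number of calls.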

### Decision Flowchart

```
Need to quantize a model for inference?
├── Running on H100/A100?
│   └── Use INT4 (GPTQ/AWQ) — NVFP4 not available
├── Running on B200 (Blackwell)?
│   ├── Agent-like multi-turn workload?
│   │   └── Use NVFP4 — compound latency savings
│   ├── Batch throughput critical?
│   │   └── Use NVFP4 (higher FP4 Tensor Core throughput; benchmark your workload)
│   ├── Outlier-heavy model (MoE, very large)?
│   │   └── Use NVFP4 — FP format handles outliers
│   └── Simple chat / single-turn?
│       └── INT4 (AWQ) is fine — maturity advantage
├── Running on CPU / edge?
│   └── Use INT4 (GGUF) — only option
└── Training?
    └── Default to BF16/FP8. 4-bit training (QAT, NVFP4 pretraining) is still specialized
```
```
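
The same flowchart can be written as a small lookup function, which may be easier to adapt or test. This is only a transcription of the branches above; `choose_format` and its string labels are hypothetical, not part of any library.

```python
def choose_format(hardware: str, workload: str = "chat") -> str:
    """Return the format the flowchart suggests (hypothetical helper)."""
    if workload == "training":
        return "BF16/FP8"
    if hardware in {"cpu", "edge"}:
        return "INT4 (GGUF)"
    if hardware in {"h100", "a100"}:
        return "INT4 (GPTQ/AWQ)"
    if hardware == "b200":
        # On Blackwell, latency-, throughput-, or outlier-sensitive workloads favor NVFP4.
        if workload in {"agent", "batch", "outlier-heavy"}:
            return "NVFP4"
        return "INT4 (AWQ)"
    raise ValueError(f"unknown hardware: {hardware!r}")

print(choose_format("b200", "agent"))
print(choose_format("h100"))
```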

Learn more at [NVIDIA Blackwell Architecture](https://resources.nvidia.com/en-us-blackwell-architecture) and [TensorRT-LLM Quantization Guide](https://github.com/NVIDIA/TensorRT-LLM).
