What is Floating Point 8 (FP8) quantization?

Question

Accepted Answer

## FP8 Quantization in AI

**FP8 (8-bit floating point)** is a quantization format that stores each value in 8 bits as a sign, an exponent, and a mantissa. It sits between FP16/BF16 (16 bits, higher precision) and INT8 (also 8 bits, but an integer format with uniform spacing and limited dynamic range).

### Why FP8?

FP8 addresses the key limitation of INT8: integer formats space their values uniformly, so they handle the wide dynamic range of neural network activations poorly. FP8 keeps a floating-point exponent within 8 bits, so small and large values retain similar relative precision.
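To make the INT8 side of this concrete, here is a minimal sketch of symmetric absmax INT8 quantization in plain Python. The function name `int8_absmax_roundtrip` and the sample activation values are illustrative assumptions, not taken from any library.

```python
def int8_absmax_roundtrip(values):
    """Quantize to INT8 with one shared absmax scale, then dequantize."""
    amax = max(abs(v) for v in values)
    if amax == 0:
        return list(values)  # nothing to scale
    scale = amax / 127.0  # size of one integer step in real units
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return [q * scale for q in quantized]


# Made-up activations spanning five orders of magnitude
activations = [0.001, 0.02, 1.5, 100.0]
restored = int8_absmax_roundtrip(activations)

for original, approx in zip(activations, restored):
    print(f"{original:>8} -> {approx:.4f}")
```

The two smallest values collapse to zero because the step size (about 0.79 here) is set by the largest value. A floating-point format spends some bits on an exponent instead, so its step size shrinks with the magnitude of the value.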

### FP8 Variants

There are two common FP8 variants, which trade mantissa bits for exponent bits:

| Format | Exponent bits | Mantissa bits | Best For |
|--------|--------------|--------------|---------|
| **E4M3** | 4 | 3 | Weights and activations in the forward pass (narrower range, higher precision) |
| **E5M2** | 5 | 2 | Gradients in the backward pass (wider range needed) |
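The range and precision trade-off in the table can be worked out from the bit counts. The sketch below is a hedged illustration: `fp8_limits` is a hypothetical helper, and it assumes the commonly used conventions in which E5M2 reserves its top exponent code for inf/NaN (as IEEE formats do) while E4M3 only reserves the single all-ones pattern for NaN.

```python
def fp8_limits(exp_bits, man_bits, reserves_top_exponent):
    """Derive range and precision figures for a small float format."""
    bias = 2 ** (exp_bits - 1) - 1
    min_subnormal = 2.0 ** (1 - bias - man_bits)
    if reserves_top_exponent:
        # Top exponent code is inf/NaN, so the largest usable one is one below
        max_exp = (2 ** exp_bits - 2) - bias
        max_significand = 2 - 2.0 ** -man_bits
    else:
        # Top exponent is usable; only the all-ones mantissa there is NaN
        max_exp = (2 ** exp_bits - 1) - bias
        max_significand = 2 - 2.0 ** (1 - man_bits)
    return {
        "min_subnormal": min_subnormal,
        "max": max_significand * 2.0 ** max_exp,
        "epsilon": 2.0 ** -man_bits,  # gap between 1.0 and the next value
    }


e4m3 = fp8_limits(4, 3, reserves_top_exponent=False)
e5m2 = fp8_limits(5, 2, reserves_top_exponent=True)
print("E4M3:", e4m3)  # max 448.0, epsilon 0.125
print("E5M2:", e5m2)  # max 57344.0, epsilon 0.25
```

Under these assumptions E5M2 reaches values 128 times larger than E4M3, while E4M3 resolves values near 1.0 twice as finely.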

For comparison:
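- FP32: 1 sign + 8 exponent + 23 mantissa = 32 bits
- BF16: 1 sign + 8 exponent + 7 mantissa = 16 bits
- FP16: 1 sign + 5 exponent + 10 mantissa = 16 bits
- FP8 E4M3: 1 sign + 4 exponent + 3 mantissa = 8 bits
- FP8 E5M2: 1 sign + 5 exponent + 2 mantissa = 8 bits
- INT8: no exponent or mantissa, just an 8-bit integer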
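As an illustration of the sign/exponent/mantissa layout listed above, the following sketch decodes a single FP8 byte into a Python float. `decode_fp8` is a hypothetical helper written for this answer, and the special-value handling (E5M2 with inf/NaN, E4M3 with NaN only) is an assumption based on the commonly used FP8 conventions.

```python
def decode_fp8(byte, exp_bits=4, man_bits=3):
    """Decode one FP8 byte (E4M3 by default, or E5M2) into a float."""
    assert (exp_bits, man_bits) in ((4, 3), (5, 2)) and 0 <= byte <= 0xFF
    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1
    exp_max = (1 << exp_bits) - 1
    man_max = (1 << man_bits) - 1

    if exp_bits == 5 and exp == exp_max:
        # E5M2: top exponent encodes inf (mantissa 0) or NaN
        return sign * float("inf") if man == 0 else float("nan")
    if exp_bits == 4 and exp == exp_max and man == man_max:
        # E4M3: no inf, only the all-ones pattern is NaN
        return float("nan")
    if exp == 0:
        # Subnormal: no implicit leading 1
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


print(decode_fp8(0x38))        # 0 0111 000 in E4M3
print(decode_fp8(0x7E))        # largest finite E4M3 value
print(decode_fp8(0x7B, 5, 2))  # largest finite E5M2 value
```

Enumerating all 256 byte values with this function is a simple way to inspect how unevenly the representable numbers are spaced.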

### Hardware Support

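FP8 is natively supported on:
- **NVIDIA H100**: dedicated FP8 Tensor Cores (Hopper)
- **NVIDIA H200**: the same Hopper FP8 Tensor Cores, paired with larger and faster HBM3e memory
- **NVIDIA Ada Lovelace GPUs** (e.g. L40S, RTX 4090): FP8 Tensor Cores
- **AMD MI300X**: FP8 support

On H100, peak dense Tensor Core throughput in FP8 is roughly 2x that of FP16/BF16 and roughly 4x that of TF32. End-to-end speedups are usually smaller, since not every operation runs in FP8.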
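A quick way to check for native FP8 on an NVIDIA GPU is to look at its CUDA compute capability. The sketch below assumes NVIDIA's numbering, where A100 is 8.0, Ada Lovelace is 8.9, and Hopper (H100/H200) is 9.0; `has_native_fp8` is a hypothetical helper, and it says nothing about AMD hardware.

```python
# Lowest NVIDIA compute capability with FP8 Tensor Cores (Ada Lovelace)
FP8_MIN_CAPABILITY = (8, 9)


def has_native_fp8(capability):
    """Return True if a (major, minor) compute capability has FP8 Tensor Cores."""
    return tuple(capability) >= FP8_MIN_CAPABILITY


for name, cap in [("A100", (8, 0)), ("RTX 4090", (8, 9)), ("H100", (9, 0))]:
    print(f"{name}: native FP8 = {has_native_fp8(cap)}")
```

With PyTorch installed, the tuple for the current device can be obtained from `torch.cuda.get_device_capability()` and passed straight in.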

### Using FP8 with Transformer Engine

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

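# FP8 training recipe with delayed scaling
fp8_recipe = recipe.DelayedScaling(
    margin=0,
    fp8_format=recipe.Format.HYBRID,  # E4M3 in the forward pass, E5M2 for gradients
    amax_history_len=16,
    amax_compute_algo="max",
)

# Replace standard layers with FP8-aware layers
model = te.Linear(in_features=1024, out_features=1024)
criterion = torch.nn.MSELoss()

# Dummy batch for illustration; FP8 GEMMs expect dimensions divisible by 16
input_tensor = torch.randn(32, 1024, device="cuda")
target = torch.randn(32, 1024, device="cuda")

# Run the forward pass under FP8 autocast; the backward pass runs outside it
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    output = model(input_tensor)
loss = criterion(output, target)
loss.backward()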
```
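The `DelayedScaling` recipe picks each tensor's scaling factor from a history of recent absolute maxima (amax) instead of from the tensor currently being cast. The sketch below illustrates that idea in plain Python. The `DelayedScaler` class is hypothetical and heavily simplified, not Transformer Engine's implementation; it assumes the update `scale = fp8_max / amax / 2**margin`, with `amax` taken as the maximum over the history window.

```python
from collections import deque

E4M3_MAX = 448.0  # largest finite E4M3 value


class DelayedScaler:
    """Toy per-tensor delayed scaling: each step uses a scale from earlier steps."""

    def __init__(self, fp8_max=E4M3_MAX, history_len=16, margin=0):
        self.fp8_max = fp8_max
        self.margin = margin
        self.history = deque(maxlen=history_len)  # recent amax values
        self.scale = 1.0

    def step(self, values):
        """Return the scale to use now, then update it for the next step."""
        scale_used = self.scale
        self.history.append(max(abs(v) for v in values))
        amax = max(self.history)  # the "max" amax_compute_algo
        if amax > 0:
            self.scale = self.fp8_max / amax / (2 ** self.margin)
        return scale_used


scaler = DelayedScaler(history_len=2)
for batch in ([1.0, -2.0], [0.5], [0.25]):
    used = scaler.step(batch)
    print(f"used scale {used}, next scale {scaler.scale}")
```

Because the scale is already known before the tensor is produced, the cast to FP8 does not have to wait for a reduction over the current tensor. The cost is that a sudden spike can exceed the range implied by the history, which is what `margin` is meant to leave headroom for.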

### FP8 Inference with vLLM

```python
from vllm import LLM, SamplingParams

# Load model with FP8 quantization for fast inference
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    quantization="fp8",  # quantize the 16-bit checkpoint to FP8 at load time
    dtype="auto"
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain FP8 quantization"], sampling_params)
print(outputs[0].outputs[0].text)
```

### FP8 vs Other Formats

| Format | Bits | Approx. throughput vs FP32 | Quality |
|--------|------|-------------------|---------|
| FP32 | 32 | 1x | Highest |
| BF16 | 16 | ~2x | Near-lossless |
| FP16 | 16 | ~2x | Near-lossless |
| INT8 | 8 | ~4x | Good (activations tricky) |
| **FP8 E4M3** | **8** | **~4x** | **Very good** |
| INT4 | 4 | ~8x | Acceptable |
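The Bits column also determines how much memory the weights alone occupy. As a rough worked example, the sketch below computes weight storage for a hypothetical model with exactly 8 billion parameters; `weight_gib` is an illustrative helper, and the estimate ignores scaling factors, the KV cache, and activations.

```python
BITS_PER_PARAM = {
    "FP32": 32, "BF16": 16, "FP16": 16,
    "INT8": 8, "FP8 E4M3": 8, "INT4": 4,
}


def weight_gib(num_params, fmt):
    """Approximate weight storage in GiB for a given format."""
    return num_params * BITS_PER_PARAM[fmt] / 8 / 2**30


num_params = 8_000_000_000  # hypothetical 8B-parameter model
for fmt in BITS_PER_PARAM:
    print(f"{fmt:>9}: {weight_gib(num_params, fmt):6.2f} GiB")
```

Going from FP16 to FP8 halves the weight footprint, which is often as important for inference as the raw compute throughput.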

### When to Use FP8

| Use Case | Recommendation |
|---------|---------------|
| **H100 training** | FP8 — best throughput/quality ratio |
| **H100 inference** | FP8 or FP16 both viable |
| **Older GPUs (A100)** | FP16 or INT8 (no native FP8) |
| **Consumer GPUs** | INT4 (GGUF) or INT8 |
| **Maximum accuracy** | FP32 (research) or BF16 (production) |

FP8 is a strong default for training and serving large models on H100/H200 hardware: it roughly doubles peak Tensor Core throughput over FP16 while typically staying close to FP16 accuracy.
