FP8 Quantization in AI
FP8 (8-bit floating point) is a number format that represents floating-point values in 8 bits. As a quantization target it offers a middle ground between FP16 (16 bits, higher precision) and INT8 (an 8-bit integer of the same size, but with a fixed, evenly spaced value grid).
Why FP8?
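FP8 addresses a key limitation of INT8: integer values are evenly spaced, so a single scale factor has to cover both large outliers and small values, and the wide dynamic range of neural network activations is handled poorly. FP8 keeps an exponent, so the spacing between representable values grows with magnitude and more dynamic range is preserved in the same 8 bits.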
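The dynamic-range argument can be illustrated with a small pure-Python simulation. This is a minimal sketch, not how any library implements quantization: it assumes per-tensor abs-max scaling, and the helper names (`e4m3_values`, `quantize_fp8`, `quantize_int8`) and the sample activations are made up for illustration.

```python
def e4m3_values():
    """All finite non-negative values of FP8 E4M3 (exponent bias 7, max 448)."""
    vals = set()
    for exp in range(16):
        for man in range(8):
            if exp == 15 and man == 7:
                continue  # this bit pattern encodes NaN, not a number
            if exp == 0:
                vals.add(man / 8 * 2.0 ** -6)  # zero and subnormals
            else:
                vals.add((1 + man / 8) * 2.0 ** (exp - 7))
    return sorted(vals)

E4M3 = e4m3_values()

def quantize_fp8(x, scale):
    """Scale x, snap to the nearest E4M3 value, then scale back.

    Ties go to the smaller magnitude here; real hardware rounds to nearest even.
    """
    target = min(abs(x) * scale, E4M3[-1])
    nearest = min(E4M3, key=lambda v: abs(v - target))
    return (nearest if x >= 0 else -nearest) / scale

def quantize_int8(x, scale):
    """Scale x, round to an integer in [-127, 127], then scale back."""
    q = max(-127, min(127, round(x * scale)))
    return q / scale

activations = [0.01, 0.05, 0.3, 1.2, 60.0]  # one large outlier
amax = max(abs(a) for a in activations)
int8_scale = 127 / amax  # map the outlier to the largest INT8 value
fp8_scale = 448 / amax   # map the outlier to the largest E4M3 value

for a in activations:
    i8 = quantize_int8(a, int8_scale)
    f8 = quantize_fp8(a, fp8_scale)
    print(f"{a:>6}: int8={i8:.4f}  fp8={f8:.4f}")
```

With the outlier setting the scale, the smallest activations collapse to zero under INT8, while E4M3 keeps them within a few percent of their original value.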
FP8 Variants
There are two FP8 variants, trading mantissa bits for exponent bits:
| Format | Exponent bits | Mantissa bits | Best For |
|---|---|---|---|
| E4M3 | 4 | 3 | Weights and activations in the forward pass (more precision, narrower range) |
| E5M2 | 5 | 2 | Gradients in the backward pass (wider range, less precision) |
For comparison:
- FP32: 1 sign + 8 exponent + 23 mantissa = 32 bits
- FP16: 1 sign + 5 exponent + 10 mantissa = 16 bits
- FP8 E4M3: 1 sign + 4 exponent + 3 mantissa = 8 bits
- FP8 E5M2: 1 sign + 5 exponent + 2 mantissa = 8 bits
- INT8: no exponent, just an 8-bit integer (256 evenly spaced values)
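To make the bit layouts concrete, here is a small decoder that turns an 8-bit pattern into its value. It is a sketch that assumes the commonly used FP8 conventions (exponent bias 7 for E4M3 with a single NaN pattern and no infinities, bias 15 for E5M2 with IEEE-style infinities and NaNs); the function names are hypothetical.

```python
def decode_fp8(byte, exp_bits, man_bits, bias, ieee_specials):
    """Decode one FP8 bit pattern (sign | exponent | mantissa) to a float."""
    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    all_ones = (1 << exp_bits) - 1
    if exp == all_ones:
        if ieee_specials:
            # E5M2 reserves the top exponent for infinity and NaN
            return sign * float("inf") if man == 0 else float("nan")
        if man == (1 << man_bits) - 1:
            # E4M3 only treats exponent and mantissa all ones as NaN
            return float("nan")
    if exp == 0:
        # Subnormal: no implicit leading 1
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

def decode_e4m3(byte):
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7, ieee_specials=False)

def decode_e5m2(byte):
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15, ieee_specials=True)

print("E4M3 max finite:", decode_e4m3(0b0_1111_110))
print("E5M2 max finite:", decode_e5m2(0b0_11110_11))
print("E4M3 min subnormal:", decode_e4m3(0b0_0000_001))
print("E5M2 min subnormal:", decode_e5m2(0b0_00000_01))
```

The extra exponent bit gives E5M2 a much larger maximum value, while the extra mantissa bit gives E4M3 finer steps between neighbouring values.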
Hardware Support
FP8 is natively supported on:
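- NVIDIA H100: dedicated FP8 Tensor Cores
- NVIDIA H200: the same Hopper FP8 Tensor Cores as the H100, paired with larger and faster HBM3e memory
- AMD MI300X: FP8 support
On H100, peak dense FP8 Tensor Core throughput is about 2x that of FP16/BF16 and about 4x that of TF32. End-to-end speedups are usually smaller, because not every operation in a model runs in FP8.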
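For NVIDIA GPUs, one way to check for native FP8 is the CUDA compute capability. The helper below is a hypothetical sketch that assumes FP8 Tensor Cores are present from compute capability 8.9 (Ada) and 9.0 (Hopper) onward; it does not cover AMD hardware. In PyTorch the `(major, minor)` tuple can be obtained from `torch.cuda.get_device_capability()`.

```python
def has_native_fp8(compute_capability):
    """Return True if an NVIDIA GPU with this (major, minor) capability
    is assumed to have FP8 Tensor Cores (8.9 and newer)."""
    return tuple(compute_capability) >= (8, 9)

print("H100 (9, 0):", has_native_fp8((9, 0)))
print("A100 (8, 0):", has_native_fp8((8, 0)))
```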
Using FP8 with Transformer Engine
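```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# FP8 training recipe with delayed scaling
fp8_recipe = recipe.DelayedScaling(
    margin=0,
    fp8_format=recipe.Format.HYBRID,  # E4M3 in the forward pass, E5M2 for gradients
    amax_history_len=16,
    amax_compute_algo="max",
)

# Replace standard layers with FP8-aware layers
model = te.Linear(in_features=1024, out_features=1024).cuda()

# Placeholder data and loss, for illustration only
input_tensor = torch.randn(32, 1024, device="cuda")
target = torch.randn(32, 1024, device="cuda")
criterion = torch.nn.MSELoss()

# Run the forward pass under FP8 autocast; the backward pass runs outside it
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    output = model(input_tensor)
loss = criterion(output, target)
loss.backward()
```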
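The `margin`, `amax_history_len` and `amax_compute_algo` settings control how the scaling factor is derived. The toy class below sketches the general idea of delayed scaling under simplifying assumptions; `DelayedScaler` is a hypothetical helper, not Transformer Engine's implementation, which tracks this state per tensor inside its own modules.

```python
from collections import deque

FP8_MAX = {"e4m3": 448.0, "e5m2": 57344.0}  # largest finite value per format

class DelayedScaler:
    """Toy delayed-scaling tracker for a single tensor."""

    def __init__(self, fmt="e4m3", history_len=16, margin=0):
        self.fp8_max = FP8_MAX[fmt]
        self.history = deque(maxlen=history_len)  # recent abs-max values
        self.margin = margin
        self.scale = 1.0

    def update(self, amax):
        """Record this step's abs-max and derive the scale for the next step."""
        self.history.append(amax)
        amax_used = max(self.history)  # corresponds to amax_compute_algo="max"
        if amax_used > 0:
            # Map the largest recent value to fp8_max, backed off by 2**margin
            self.scale = self.fp8_max / amax_used / 2 ** self.margin
        return self.scale

scaler = DelayedScaler(history_len=4)
for step_amax in [2.0, 8.0, 1.0, 1.0, 1.0, 1.0]:
    print(scaler.update(step_amax))
```

Because the scale comes from a window of past steps rather than the current tensor, a single spike keeps the scale conservative until it drops out of the history.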
FP8 Inference with vLLM
```python
from vllm import LLM, SamplingParams

# Load model with FP8 quantization for fast inference
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    quantization="fp8",  # FP8 inference
    dtype="auto",
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain FP8 quantization"], sampling_params)
print(outputs[0].outputs[0].text)
```
FP8 vs Other Formats
| Format | Bits | Throughput vs FP32 | Quality |
|---|---|---|---|
| FP32 | 32 | 1x | Highest |
| BF16 | 16 | ~2x | Near-lossless |
| FP16 | 16 | ~2x | Near-lossless |
| INT8 | 8 | ~4x | Good (activations tricky) |
| FP8 E4M3 | 8 | ~4x | Very good |
| INT4 | 4 | ~8x | Acceptable |
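The Bits column also translates directly into weight memory. As a rough sketch, assuming a round 8 billion parameters and counting weights only (no activations, KV cache, or scaling metadata), the footprint per format can be estimated as follows; `weight_memory_gb` is a hypothetical helper.

```python
BITS = {"FP32": 32, "BF16": 16, "FP16": 16, "INT8": 8, "FP8 E4M3": 8, "INT4": 4}

def weight_memory_gb(num_params, bits):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9

for fmt, bits in BITS.items():
    print(f"{fmt:>9}: {weight_memory_gb(8e9, bits):.0f} GB")
```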
When to Use FP8
| Use Case | Recommendation |
|---|---|
| H100 training | FP8 — best throughput/quality ratio |
| H100 inference | FP8 or FP16 both viable |
| Older GPUs (A100) | FP16 or INT8 (no native FP8) |
| Consumer GPUs | INT4 (GGUF) or INT8 |
| Maximum accuracy | FP32 (research) or BF16 (production) |
FP8 is a strong default for training and serving large models on H100/H200 hardware: it delivers high throughput while typically maintaining near-FP16 accuracy.