
What is 8-bit floating point (FP8) quantization?


Answer

FP8 Quantization in AI

FP8 (8-bit floating point) is a quantization format that stores each number in 8 bits as a sign, an exponent, and a mantissa. It sits between FP16 (16 bits, higher precision) and INT8 (8-bit integer, uniformly spaced values with no exponent).

Why FP8?

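FP8 addresses a key limitation of INT8: integer formats space their values uniformly, so a single scale factor cannot represent both the very small and the very large values found in neural network activations. FP8 keeps an exponent field, so its representable values are spaced roughly logarithmically and it preserves far more dynamic range in the same 8 bits.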
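
To make the dynamic-range argument concrete, here is a minimal pure-Python sketch that compares symmetric per-tensor INT8 quantization with a simulated E4M3 rounding on values spanning several orders of magnitude. The helper names (`quantize_int8`, `round_to_e4m3`) and the sample values are illustrative assumptions, and the E4M3 rounding is a simplified simulation (round to nearest, saturate at 448), not a bit-exact hardware model.

```python
import math

E4M3_MAX = 448.0      # largest finite E4M3 value (assumed OCP-style E4M3)
E4M3_MIN_EXP = -6     # exponent of the smallest normal E4M3 value
MANTISSA_BITS = 3


def quantize_int8(x, scale):
    """Symmetric INT8: round to the nearest multiple of `scale`, clamp to [-127, 127]."""
    q = max(-127, min(127, round(x / scale)))
    return q * scale


def round_to_e4m3(x):
    """Simulate rounding to E4M3: 3 mantissa bits, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    _, e = math.frexp(mag)                 # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_EXP)         # floor at the subnormal range
    step = 2.0 ** (exp - MANTISSA_BITS)    # spacing between representable values
    return sign * min(round(mag / step) * step, E4M3_MAX)


values = [0.02, 0.5, 3.0, 100.0]
# Per-tensor INT8 scale is dictated by the largest magnitude in the tensor
int8_scale = max(abs(v) for v in values) / 127

for v in values:
    i8 = quantize_int8(v, int8_scale)
    f8 = round_to_e4m3(v)
    print(f"{v:>6}: INT8 rel. err {abs(i8 - v) / v:.3f}, "
          f"E4M3 rel. err {abs(f8 - v) / v:.3f}")
```

With these sample values, INT8 rounds 0.02 to zero because its step size is set by the largest value, while the simulated E4M3 keeps every value within a few percent. In practice FP8 tensors are scaled too, but the scale only has to place the tensor inside the format's range.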

FP8 Variants

There are two FP8 variants, trading mantissa bits for exponent bits:

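| Format | Exponent bits | Mantissa bits | Best for |
|--------|---------------|---------------|----------|
| E4M3 | 4 | 3 | Weights and activations (narrower range, higher precision) |
| E5M2 | 5 | 2 | Gradients (wider range needed) |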
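
The range/precision trade-off can be worked out directly from the bit counts. The sketch below assumes the commonly used conventions in which E5M2 reserves its top exponent field for inf/NaN (IEEE-like) while E4M3 reserves only the all-ones bit pattern for NaN; other variants (for example the FNUZ encodings) use different biases and limits. The function name `fp8_range` is illustrative.

```python
def fp8_range(exp_bits, man_bits, ieee_like):
    """Return (max_finite, min_normal, min_subnormal) for a 1-sign-bit FP8 layout."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp_field = 2 ** exp_bits - 1
    if ieee_like:
        # Top exponent field is reserved for inf/NaN (E5M2 convention)
        max_exp = (max_exp_field - 1) - bias
        max_significand = 2 - 2.0 ** -man_bits
    else:
        # Only exponent=all-ones with mantissa=all-ones is NaN (E4M3 convention),
        # so the top exponent is usable with every mantissa except the largest
        max_exp = max_exp_field - bias
        max_significand = 2 - 2.0 ** -(man_bits - 1)
    max_finite = max_significand * 2.0 ** max_exp
    min_normal = 2.0 ** (1 - bias)
    min_subnormal = 2.0 ** (1 - bias - man_bits)
    return max_finite, min_normal, min_subnormal


for name, args in [("E4M3", (4, 3, False)), ("E5M2", (5, 2, True))]:
    max_finite, min_normal, min_subnormal = fp8_range(*args)
    print(f"{name}: max={max_finite}, min normal={min_normal}, "
          f"min subnormal={min_subnormal}")
```

Under these assumptions E4M3 tops out at 448 and E5M2 at 57344, which is why the wider-range E5M2 is the usual choice for gradients.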

For comparison:

  • FP32: 1 sign + 8 exponent + 23 mantissa = 32 bits
  • FP16: 1 sign + 5 exponent + 10 mantissa = 16 bits
  • FP8 E4M3: 1 sign + 4 exponent + 3 mantissa = 8 bits
  • FP8 E5M2: 1 sign + 5 exponent + 2 mantissa = 8 bits
  • INT8: no exponent field, just an 8-bit integer with uniformly spaced values
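
As an illustration of these layouts, the following sketch decodes an 8-bit pattern into its sign, exponent, and mantissa fields and reconstructs the value. The function `decode_fp8` is a hypothetical helper; it assumes the usual bias of 2^(exponent bits - 1) - 1 and deliberately ignores special values (NaN, infinity).

```python
def decode_fp8(byte, exp_bits, man_bits):
    """Decode an 8-bit pattern as a finite FP8 value (special values not handled)."""
    bias = 2 ** (exp_bits - 1) - 1
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp_field = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man_field = byte & ((1 << man_bits) - 1)
    if exp_field == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 1 - bias
        return sign * (man_field / 2 ** man_bits) * 2.0 ** (1 - bias)
    # Normal: implicit leading 1 plus the fractional mantissa
    return sign * (1 + man_field / 2 ** man_bits) * 2.0 ** (exp_field - bias)


# Underscores separate sign | exponent | mantissa for readability
print(decode_fp8(0b0_0111_000, 4, 3))   # E4M3 pattern for 1.0
print(decode_fp8(0b0_01111_00, 5, 2))   # E5M2 pattern for 1.0
print(decode_fp8(0b0_1111_110, 4, 3))   # largest finite E4M3 value, 448.0
```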

Hardware Support

FP8 is natively supported on:

  • NVIDIA H100: dedicated FP8 Tensor Cores
  • NVIDIA H200: the same Hopper-generation FP8 Tensor Cores as H100, paired with larger, faster HBM3e memory
  • AMD MI300X: FP8 support

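On H100, peak FP8 Tensor Core throughput is roughly 2x that of FP16/BF16 and roughly 4x that of TF32 (the Tensor Core mode used for FP32 matrix math). End-to-end speedups are usually smaller, since not every operation runs in FP8.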

Using FP8 with Transformer Engine

python
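import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# FP8 training recipe. HYBRID uses E4M3 in the forward pass
# (weights, activations) and E5M2 in the backward pass (gradients).
fp8_recipe = recipe.DelayedScaling(
    margin=0,
    fp8_format=recipe.Format.HYBRID,
    amax_history_len=16,
    amax_compute_algo="max",
)

# Replace standard layers with FP8-aware layers (requires an FP8-capable GPU)
model = te.Linear(in_features=1024, out_features=1024).cuda()

# Placeholder batch, target, and loss, for illustration only
input_tensor = torch.randn(32, 1024, device="cuda")
target = torch.randn(32, 1024, device="cuda")
criterion = torch.nn.MSELoss()

# Run the forward pass under FP8 autocast
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    output = model(input_tensor)

# Compute the loss and backpropagate outside the autocast context
loss = criterion(output, target)
loss.backward()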
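
The recipe parameters (`margin`, `amax_history_len`, `amax_compute_algo`) control how the per-tensor scaling factor is derived. The sketch below is a simplified, pure-Python illustration of the delayed-scaling idea, not Transformer Engine's actual implementation: it keeps a rolling window of absolute maxima and derives a scale that maps the largest recent value onto the FP8 maximum. The class `DelayedScaler` is hypothetical.

```python
from collections import deque

E4M3_MAX = 448.0  # largest finite E4M3 value


class DelayedScaler:
    """Toy per-tensor delayed scaling: the scale is derived from past amax values."""

    def __init__(self, fp8_max=E4M3_MAX, margin=0, amax_history_len=16):
        self.fp8_max = fp8_max
        self.margin = margin
        self.history = deque(maxlen=amax_history_len)  # rolling amax window
        self.scale = 1.0

    def update(self, tensor_values):
        """Record this step's amax and return the scale to use on the next step."""
        self.history.append(max(abs(v) for v in tensor_values))
        amax = max(self.history)  # corresponds to amax_compute_algo="max"
        if amax > 0:
            # Each unit of margin leaves one extra power of two of headroom
            self.scale = self.fp8_max / amax / (2 ** self.margin)
        return self.scale


scaler = DelayedScaler(amax_history_len=2)
for step_values in ([0.5, -2.0], [1.0, -7.0], [0.25, 0.125], [0.5, -0.25]):
    print(scaler.update(step_values))
```

Because the scale comes from earlier steps rather than the current tensor, no extra pass over the data is needed before casting; the cost is that a sudden spike can saturate until the history catches up.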

FP8 Inference with vLLM

python
from vllm import LLM, SamplingParams

# Load model with FP8 quantization for fast inference
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    quantization="fp8",  # FP8 inference
    dtype="auto"
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain FP8 quantization"], sampling_params)
print(outputs[0].outputs[0].text)
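
One practical effect of FP8 inference is a smaller weight footprint. The following back-of-the-envelope sketch estimates weight memory only, assuming a nominal 8 billion parameters for the model above; it ignores the KV cache, activations, scaling factors, and any layers left in higher precision. The function name `weight_memory_gib` is illustrative.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2 ** 30


NOMINAL_PARAMS = 8e9  # assumed round figure for an "8B" model

for name, bits in [("FP16/BF16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gib(NOMINAL_PARAMS, bits):.1f} GiB")
```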

FP8 vs Other Formats

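| Format | Bits | Throughput vs FP32 | Quality |
|--------|------|--------------------|---------|
| FP32 | 32 | 1x | Highest |
| BF16 | 16 | ~2x | Near-lossless |
| FP16 | 16 | ~2x | Near-lossless |
| INT8 | 8 | ~4x | Good (activations tricky) |
| FP8 E4M3 | 8 | ~4x | Very good |
| INT4 | 4 | ~8x | Acceptable |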

When to Use FP8

| Use case | Recommendation |
|----------|----------------|
| H100 training | FP8 (best throughput/quality ratio) |
| H100 inference | FP8 or FP16 both viable |
| Older GPUs (A100) | FP16 or INT8 (no native FP8) |
| Consumer GPUs | INT4 (GGUF) or INT8 |
| Maximum accuracy | FP32 (research) or BF16 (production) |
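
The hardware-dependent rows can be turned into a simple runtime check. This sketch assumes native FP8 Tensor Cores begin at CUDA compute capability 8.9 (Ada) and 9.0 (Hopper), with A100 at 8.0; the function `pick_precision` is a hypothetical helper, and the returned strings are just labels, not arguments to any particular library.

```python
def pick_precision(compute_capability):
    """Suggest a precision label from a (major, minor) CUDA compute capability."""
    if compute_capability >= (8, 9):
        return "fp8"   # native FP8 Tensor Cores assumed available
    return "fp16"      # no native FP8: fall back to FP16 (or INT8/INT4 quantization)


# With PyTorch installed, the tuple can be obtained via:
#   torch.cuda.get_device_capability()
for name, cc in [("H100", (9, 0)), ("A100", (8, 0))]:
    print(name, pick_precision(cc))
```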

FP8 is often the preferred format for training and serving large models on H100/H200 hardware: it roughly doubles peak Tensor Core throughput relative to FP16 while typically keeping accuracy close to FP16.