What are the different model formats for LLMs?

#gen-ai #llm #model-formats #gguf #gptq #awq #safetensors #onnx #quantization #deployment

Answer

LLM Model Formats

When you download an LLM, the model weights come in a specific file format. Each format serves different purposes — some optimize for inference speed, others for storage efficiency, and some for cross-framework compatibility.

Overview of Model Formats

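| Format | Type | Compression | Best For |
| --- | --- | --- | --- |
| PyTorch (.bin / .pt) | Raw weights | None | Training, research |
| SafeTensors (.safetensors) | Raw weights | None | Safe loading (no pickle) |
| GGUF (.gguf) | Quantized (can also hold F16/F32) | Yes (typically 2- to 8-bit) | CPU inference with optional GPU offload (llama.cpp) |
| GPTQ (.pt / .safetensors) | Quantized | Yes (INT4/INT8) | GPU inference |
| AWQ (.safetensors / .pt) | Quantized | Yes (INT4) | GPU inference (often faster than GPTQ) |
| ONNX (.onnx) | Export format | Optional | Cross-framework deployment |
| MLX (folder bundle) | Raw/Quantized | Optional | Apple Silicon inference |
| CoreML (.mlpackage / compiled .mlmodelc) | Raw/Quantized | Yes | iOS/macOS on-device |
| TensorFlow (.h5 / SavedModel) | Raw weights | None | TensorFlow ecosystem |
| ExLlamaV2 (.safetensors) | Quantized | Yes (variable, roughly 2 to 8 bits per weight) | High-throughput GPU inference |
| HQQ (.safetensors) | Quantized | Yes (INT4; other bit widths available) | Fast on-the-fly quantization without calibration data |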

Raw/Uncompressed Formats

python
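# PyTorch .bin: the traditional, pickle-based format
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # gated repo: accept the license on the Hub first
    torch_dtype="auto",
    use_safetensors=False,  # force the .bin shards; recent versions prefer .safetensors
)
# Loads pytorch_model-00001-of-00002.bin, pytorch_model-00002-of-00002.bin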
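
A caveat with `.bin` and `.pt` checkpoints is that they are serialized with Python's pickle, and unpickling can call arbitrary functions. The sketch below is only an illustration of that mechanism: the `Payload` class is hypothetical and uses a harmless `print`, where a malicious file could invoke anything importable.

```python
import pickle


class Payload:
    """Hypothetical stand-in for an object hidden inside an untrusted checkpoint."""

    def __reduce__(self):
        # pickle stores this (callable, args) pair and calls it during load.
        return (print, ("this ran while the file was being loaded",))


blob = pickle.dumps(Payload())

# Loading does not give back a Payload; it executes the stored call instead.
result = pickle.loads(blob)
print(type(result).__name__)
```

The value returned by `pickle.loads` is whatever the stored call returns, so the "checkpoint" here never yields a `Payload` object at all.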

SafeTensors is now the preferred raw format — it avoids pickle-based deserialization (security risk) and supports zero-copy loading:

python
# SafeTensors: safer, faster
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    use_safetensors=True,  # Default for most modern models
)
# Loads the sharded model-0000X-of-0000N.safetensors files
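
To make the "no pickle" point concrete, here is a minimal sketch of how a `.safetensors` file can be inspected with the standard library alone. It relies on the published layout (an 8-byte little-endian header length, a JSON header, then raw tensor bytes); the helper name `read_safetensors_header` and the tiny hand-built demo file are assumptions for illustration, not part of any library.

```python
import json
import struct
import tempfile
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without touching tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        return json.loads(f.read(header_len))


# Build a tiny file by hand so the example runs without downloading a model.
header = {"w": {"dtype": "F32", "shape": [2], "data_offsets": [0, 8]}}
header_bytes = json.dumps(header).encode("utf-8")
tensor_bytes = struct.pack("<2f", 1.0, 2.0)  # two float32 values = 8 bytes

demo_path = Path(tempfile.mkdtemp()) / "demo.safetensors"
demo_path.write_bytes(struct.pack("<Q", len(header_bytes)) + header_bytes + tensor_bytes)

parsed = read_safetensors_header(demo_path)
print(parsed["w"]["shape"], parsed["w"]["data_offsets"])
```

Because the header is plain JSON with byte offsets, a loader can map tensor data directly from disk, and nothing in the file is executed.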

Quantized Formats — The Practical Choice

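| Format | Bits | Quality Loss | Approx. memory (7B model) | Tool |
| --- | --- | --- | --- | --- |
| GGUF Q4_K_M | 4-bit | Low | ~4.3 GB | llama.cpp, Ollama |
| GGUF Q8_0 | 8-bit | Negligible | ~7.7 GB | llama.cpp |
| GPTQ INT4 | 4-bit | Low | ~3.9 GB | AutoGPTQ, vLLM |
| AWQ INT4 | 4-bit | Very Low | ~3.9 GB | AutoAWQ, vLLM |
| ExLlamaV2 4bpw | 4-bit | Very Low | ~3.9 GB | ExLlamaV2 |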
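
The memory figures above follow roughly from parameters times bits per weight. The sketch below shows that arithmetic as a lower bound; the function name `weight_memory_gb` is hypothetical, and it deliberately ignores quantization scales, unquantized layers, the KV cache, and activations, which is why real files come out somewhat larger.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Lower-bound estimate of weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# A nominal 7B-parameter model at a few common precisions.
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(7e9, bits):.1f} GB")
```

Treat the result as a floor when sizing hardware, and leave headroom for context length and batch size.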

GGUF — The CPU/Edge King

bash
# Download a GGUF model
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf

# Run with llama.cpp (current builds name the binary llama-cli; older builds use ./main)
./llama-cli -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -p "What is RAG?" -n 256

# Or with Ollama (which uses GGUF internally)
ollama pull mistral
ollama run mistral "What is RAG?"
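
A GGUF file is a single self-describing binary, which can be checked before handing it to llama.cpp. The sketch below reads only the fixed-size header, assuming the little-endian layout used by GGUF version 2 and later (magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata counts). The helper name and the fake file with made-up counts are illustrative assumptions.

```python
import struct
import tempfile
from pathlib import Path

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path):
    """Read the fixed 24-byte GGUF header (assumes little-endian, version >= 2)."""
    with open(path, "rb") as f:
        raw = f.read(24)
    if len(raw) < 24 or raw[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, metadata_kv_count = struct.unpack("<IQQ", raw[4:24])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


# Fake header with made-up counts so the example runs without a real model file.
fake_path = Path(tempfile.mkdtemp()) / "fake.gguf"
fake_path.write_bytes(GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24))

info = read_gguf_header(fake_path)
print(info)
```

Pointing `read_gguf_header` at a downloaded `.gguf` file is a quick way to confirm the download is not an HTML error page or a truncated transfer.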

GGUF quantization levels:

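| Quant | Approx. bits | Quality |
| --- | --- | --- |
| Q2_K | 2.5 | Very low (aggressive) |
| Q3_K_S/M/L | 3 | Low |
| Q4_K_S/M | 4 | Good (best value) |
| Q5_K_S/M | 5 | Very good |
| Q6_K | 6 | Excellent |
| Q8_0 | 8 | Near-perfect |
| F16 | 16 | Lossless |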
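
To see why quality drops as the bit count shrinks, the sketch below quantizes one small block of weights with a shared scale and measures the round-trip error. This is a deliberately simplified round-to-nearest scheme, not llama.cpp's actual K-quant algorithm, and the sample weights are made up.

```python
def quantize_block(values, bits):
    """Symmetric round-to-nearest quantization of one block with a shared scale."""
    qmax = 2 ** (bits - 1) - 1  # largest integer level, e.g. 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid a zero scale
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize_block(q, scale):
    """Map integer levels back to approximate float weights."""
    return [level * scale for level in q]


def max_error(values, bits):
    """Largest absolute difference after a quantize/dequantize round trip."""
    q, scale = quantize_block(values, bits)
    restored = dequantize_block(q, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


weights = [0.12, -0.53, 0.97, -0.08, 0.31, -0.76, 0.44, 0.05]  # made-up block
for bits in (8, 4, 2):
    print(f"{bits}-bit max error: {max_error(weights, bits):.4f}")
```

Real formats reduce this error with smaller blocks, per-block minimums, and mixed precision across layers, which is part of why Q4_K_M holds up better than a naive 4-bit scheme would.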

GPTQ vs AWQ — GPU Quantized Formats

Both achieve 4-bit inference on GPU, but they differ in approach:

| Feature | GPTQ | AWQ |
| --- | --- | --- |
| Quantization method | Layer-wise, one-shot | Activation-aware |
| Inference speed | Fast | Faster (~1.1-1.4x) |
| Quality (INT4) | Good | Very good |
| Library | auto-gptq | autoawq |
| vLLM support | Yes | Yes |
| Memory | Similar | Slightly lower |

python
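from transformers import AutoModelForCausalLM

# Load GPTQ model (needs optimum plus a GPTQ backend such as auto-gptq installed)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",
    device_map="auto",
)

# Load AWQ model (needs autoawq installed)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    device_map="auto",
)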
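
Since both loaders look identical, it can help to check which scheme a checkpoint actually uses before loading it. The sketch below inspects a parsed `config.json` dictionary. It assumes the `quantization_config` block with `quant_method` and `bits` keys that Transformers typically writes for quantized models; older repositories may keep these settings in a separate file instead, and the sample dictionaries here are made up.

```python
def describe_quantization(config):
    """Summarize the quantization settings found in a parsed config.json dict."""
    qc = config.get("quantization_config")
    if not qc:
        return "unquantized (no quantization_config found)"
    method = qc.get("quant_method", "unknown")
    bits = qc.get("bits", "?")
    return f"{method}, {bits}-bit"


# Made-up examples of what a parsed config.json might contain.
gptq_config = {"model_type": "mistral", "quantization_config": {"quant_method": "gptq", "bits": 4}}
awq_config = {"model_type": "mistral", "quantization_config": {"quant_method": "awq", "bits": 4}}
plain_config = {"model_type": "mistral"}

for cfg in (gptq_config, awq_config, plain_config):
    print(describe_quantization(cfg))
```

In practice the dictionary would come from `json.load` on the repository's `config.json`.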

ONNX — Cross-Platform Standard

ONNX exports the model to an intermediate representation that ONNX Runtime can execute across platforms and hardware backends. This is useful for Triton Inference Server, Azure ML, and Windows deployment:

bash
# Export to ONNX (the -with-past task keeps the KV cache for fast generation)
optimum-cli export onnx --model mistralai/Mistral-7B-v0.1 --task text-generation-with-past ./mistral-onnx

python
# Load with ONNX Runtime
from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained("./mistral-onnx")

Choosing the Right Format

| Use Case | Recommended Format |
| --- | --- |
| Research / Fine-tuning | SafeTensors (.safetensors) |
| CPU inference (laptop/server) | GGUF (Q4_K_M) |
| Apple Silicon (M1/M2/M3) | MLX or GGUF |
| GPU serving (vLLM / TGI) | AWQ or GPTQ |
| Edge / mobile | CoreML, GGUF (small quants) |
| Cross-platform / enterprise | ONNX |
| Max throughput (batching) | ExLlamaV2 |
| Local dev (LM Studio / Ollama) | GGUF |

Rule of thumb: Start with SafeTensors for flexibility. For CPU/local use, go GGUF. For GPU serving at scale, pick AWQ. The format should match your deployment target — not your ego.
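
The rule of thumb can be written down as a small lookup, which is handy in deployment scripts. This is only a sketch of the table above: the target keys and the function name `recommend_format` are hypothetical labels chosen for illustration.

```python
# Hypothetical target labels mapped to the recommendations in the table above.
RECOMMENDED_FORMAT = {
    "fine-tuning": "SafeTensors",
    "cpu": "GGUF (Q4_K_M)",
    "apple-silicon": "MLX or GGUF",
    "gpu-serving": "AWQ or GPTQ",
    "mobile": "CoreML or GGUF (small quants)",
    "cross-platform": "ONNX",
    "local-dev": "GGUF",
}


def recommend_format(target):
    """Return the suggested format, falling back to SafeTensors for flexibility."""
    return RECOMMENDED_FORMAT.get(target, "SafeTensors")


print(recommend_format("cpu"))
print(recommend_format("something-else"))
```

The fallback mirrors the advice to start with SafeTensors when the deployment target is not yet decided.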

Learn more from the llama.cpp GitHub repository, AutoAWQ, and Hugging Face Optimum.