What are all the different model formats for LLMs?
Answer
LLM Model Formats
When you download an LLM, the model weights come in a specific file format. Each format serves different purposes — some optimize for inference speed, others for storage efficiency, and some for cross-framework compatibility.
Overview of Model Formats
| Format | Type | Compression | Best For |
|---|---|---|---|
| PyTorch (.bin / .pt) | Raw weights | None | Training, research |
| SafeTensors (.safetensors) | Raw weights | None | Safe loading (no pickle) |
| GGUF (.gguf) | Quantized (also F16/F32) | Yes (2- to 8-bit) | CPU and hybrid CPU/GPU inference (llama.cpp) |
| GPTQ (.pt/.safetensors) | Quantized | Yes (INT4/INT8) | GPU inference |
| AWQ (.safetensors) | Quantized | Yes (INT4) | GPU inference (often faster than GPTQ) |
| ONNX (.onnx) | Export format | Optional | Cross-framework deployment |
| MLX (folder bundle) | Raw/Quantized | Optional | Apple Silicon inference |
| CoreML (.mlpackage / .mlmodelc) | Raw/Quantized | Optional | iOS/macOS on-device |
| TensorFlow (.h5 / SavedModel) | Raw weights | None | TensorFlow ecosystem |
| ExLlamaV2 / EXL2 (.safetensors) | Quantized | Yes (mixed 2- to 8-bit) | High-throughput GPU inference |
| HQQ (.safetensors) | Quantized | Yes (1- to 8-bit) | Fast, calibration-free quantization |
Raw/Uncompressed Formats
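```python
# PyTorch .bin: the traditional pickle-based format
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype="auto",
    use_safetensors=False,  # force the legacy .bin shards; safetensors is preferred when both exist
)
# Loads sharded checkpoints such as pytorch_model-00001-of-00002.bin
```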
SafeTensors is now the preferred raw format — it avoids pickle-based deserialization (security risk) and supports zero-copy loading:
```python
# SafeTensors: safer, faster
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    use_safetensors=True,  # Default for most modern models
)
# Loads shards named like model-00001-of-0000N.safetensors
```
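To make the "no pickle" point concrete, here is a minimal sketch of the safetensors file layout: an 8-byte little-endian header length, a JSON header describing each tensor, then the raw tensor bytes. The helper names and the `demo.weight` tensor are hypothetical and exist only for illustration; real code should use the `safetensors` library rather than this hand-rolled reader.

```python
import json
import struct


def build_safetensors_bytes(tensors: dict) -> bytes:
    """Serialize {name: (dtype, shape, raw_bytes)} using the safetensors layout."""
    header = {}
    payload = b""
    for name, (dtype, shape, raw) in tensors.items():
        start = len(payload)
        payload += raw
        # Offsets are relative to the start of the data section, not the file.
        header[name] = {"dtype": dtype, "shape": shape, "data_offsets": [start, len(payload)]}
    header_bytes = json.dumps(header).encode("utf-8")
    # 8-byte little-endian header length, then the JSON header, then raw tensor data.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def read_safetensors_header(blob: bytes) -> dict:
    """Parse only the JSON header; no tensor data is copied and no code is executed."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8 : 8 + header_len].decode("utf-8"))


# Two float32 values (8 bytes) stored under a hypothetical tensor name.
blob = build_safetensors_bytes({"demo.weight": ("F32", [2], struct.pack("<2f", 1.0, 2.0))})
header = read_safetensors_header(blob)
print(header)
```

Because the header is plain JSON with byte offsets, a loader can memory-map the file and slice out tensors directly, which is where the zero-copy behavior comes from.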
Quantized Formats — The Practical Choice
| Format | Bits | Quality Loss | Approx. memory (7B model) | Tool |
|---|---|---|---|---|
| GGUF Q4_K_M | 4-bit | Low | ~4.3 GB | llama.cpp, Ollama |
| GGUF Q8_0 | 8-bit | Negligible | ~7.7 GB | llama.cpp |
| GPTQ INT4 | 4-bit | Low | ~3.9 GB | AutoGPTQ, vLLM |
| AWQ INT4 | 4-bit | Very Low | ~3.9 GB | AutoAWQ, vLLM |
| ExLlamaV2 4bpw | 4-bit | Very Low | ~3.9 GB | ExLlamaV2 |
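The memory column can be sanity-checked with simple arithmetic: parameters times bits per weight, divided by 8 to get bytes. The function below is a rough sketch with a hypothetical name; it counts weights only and ignores quantization scales, layers kept at higher precision, the KV cache, and runtime overhead, which is why real figures land somewhat above the raw estimate.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Estimate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumes a round 7 billion parameters; real "7B" models differ slightly.
for bits in (4, 8, 16):
    print(f"{bits:>2}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

A raw 4-bit estimate for 7 billion parameters is 3.5 GB, so the ~3.9 to ~4.3 GB entries in the table are consistent with a few extra fractional bits per weight plus overhead.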
GGUF — The CPU/Edge King
```bash
# Download a GGUF model
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf

# Run with llama.cpp (the binary is llama-cli in current builds; older builds call it ./main)
./llama-cli -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -p "What is RAG?" -n 256

# Or with Ollama (which uses GGUF internally)
ollama pull mistral
```
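After a download, a quick way to confirm the file really is GGUF is to read its fixed-size header. The sketch below assumes the GGUF v2/v3 layout (4-byte magic `GGUF`, a uint32 version, then uint64 tensor and metadata counts, all little-endian); the function name is hypothetical and the sample bytes are fabricated in memory purely to exercise the parser.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(blob: bytes) -> dict:
    """Decode the magic, version, tensor count, and metadata key/value count."""
    if blob[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file: bad magic bytes")
    # "<IQQ" = little-endian uint32 + two uint64 values, 20 bytes with no padding.
    version, tensor_count, kv_count = struct.unpack("<IQQ", blob[4:24])
    return {"version": version, "tensor_count": tensor_count, "metadata_kv_count": kv_count}


# Synthetic header for illustration only; the counts are made-up numbers.
sample = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
print(read_gguf_header(sample))
```

For a real file, passing the first 24 bytes (`open(path, "rb").read(24)`) is enough, so the check is cheap even for multi-gigabyte models.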
GGUF quantization levels:
| Quant | Nominal bits | Quality |
|---|---|---|
| Q2_K | 2 | Very low (aggressive) |
| Q3_K_S/M/L | 3 | Low |
| Q4_K_S/M | 4 | Good (best value) |
| Q5_K_S/M | 5 | Very good |
| Q6_K | 6 | Excellent |
| Q8_0 | 8 | Near-perfect |
| F16 | 16 | Unquantized (half precision) |
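To show what these levels mean mechanically, here is a simplified sketch of block-wise quantization in the style of Q8_0: weights are split into blocks of 32, and each block stores one scale plus 32 signed 8-bit codes. This is an illustration under stated assumptions, not llama.cpp's implementation: the function names are hypothetical, the scale is kept as a Python float rather than a 16-bit float, and the K-quant variants use more elaborate nested schemes.

```python
BLOCK_SIZE = 32  # values per block, as in Q8_0


def quantize_block(values: list) -> tuple:
    """Return (scale, codes) for one block using absolute-max scaling."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 1.0
    codes = [round(v / scale) for v in values]  # each code fits in a signed byte
    return scale, codes


def dequantize_block(scale: float, codes: list) -> list:
    """Reconstruct approximate weights from the scale and integer codes."""
    return [scale * c for c in codes]


# Deterministic stand-in for a block of weights.
block = [((i * 37) % 64 - 32) / 10 for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))

# Storage cost: 32 one-byte codes plus one 16-bit scale per block.
bits_per_weight = (BLOCK_SIZE * 8 + 16) / BLOCK_SIZE
print(f"max error {max_error:.5f}, {bits_per_weight} bits per weight")
```

The per-block scale is why effective storage runs slightly above the nominal bit width: with a 16-bit scale per 32 weights, this 8-bit scheme works out to 8.5 bits per weight.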
GPTQ vs AWQ — GPU Quantized Formats
Both achieve 4-bit inference on GPU, but they differ in approach:
| Feature | GPTQ | AWQ |
|---|---|---|
| Quantization method | Layer-wise, one-shot | Activation-aware |
| Inference speed | Fast | Often faster (~1.1-1.4x, kernel-dependent) |
| Quality (INT4) | Good | Very good |
| Library | AutoGPTQ | AutoAWQ |
| vLLM support | Yes | Yes |
| Memory | Similar | Slightly lower |
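```python
from transformers import AutoModelForCausalLM

# Load GPTQ model (needs a GPTQ backend installed, e.g. optimum + auto-gptq)
gptq_model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",
    device_map="auto",
)

# Load AWQ model (needs the autoawq package installed)
awq_model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    device_map="auto",
)
```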
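The two calls look identical because the quantization scheme is recorded in the checkpoint's `config.json` under a `quantization_config` entry, whose `quant_method` field names the scheme. The sketch below inspects that field without loading any weights; the function name is hypothetical and the example config is an illustrative stand-in, not copied from a real repository.

```python
import json


def detect_quant_method(config_text: str) -> str:
    """Return the quantization method named in a config.json string, or 'none'."""
    config = json.loads(config_text)
    quant = config.get("quantization_config")
    if quant is None:
        return "none"
    return quant.get("quant_method", "unknown")


# Illustrative config; the field values here are assumptions for the demo.
example_config = json.dumps({
    "model_type": "mistral",
    "quantization_config": {"quant_method": "awq", "bits": 4, "group_size": 128},
})
print(detect_quant_method(example_config))
```

Checking this field before deployment is a cheap way to confirm which backend package a checkpoint will need.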
ONNX — Cross-Platform Standard
ONNX exports the model to an intermediate representation that runs on any ONNX-compatible runtime, such as ONNX Runtime, which makes it useful for Triton Inference Server, Azure ML, and Windows deployment:
```bash
# Export to ONNX (the -with-past task variant includes the KV cache needed for fast generation)
optimum-cli export onnx --model mistralai/Mistral-7B-v0.1 --task text-generation-with-past ./mistral-onnx
```

```python
# Load with ONNX Runtime
from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained("./mistral-onnx")
```
Choosing the Right Format
| Use Case | Recommended Format |
|---|---|
| Research / Fine-tuning | SafeTensors (.safetensors) |
| CPU inference (laptop/server) | GGUF (Q4_K_M) |
| Apple Silicon (M1/M2/M3) | MLX or GGUF |
| GPU serving (vLLM / TGI) | AWQ or GPTQ |
| Edge / mobile | CoreML, GGUF (small quants) |
| Cross-platform / enterprise | ONNX |
| Max throughput (batching) | ExLlamaV2 |
| Local dev (LM Studio / Ollama) | GGUF |
Rule of thumb: Start with SafeTensors for flexibility. For CPU/local use, go GGUF. For GPU serving at scale, pick AWQ. The format should match your deployment target — not your ego.
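The decision table can also be encoded as a small lookup, which is handy in a deployment script. This is only a sketch: the target keys and function name are hypothetical labels chosen here, and the mapping covers a subset of the rows above.

```python
# Keys are hypothetical shorthand for the use cases in the table.
RECOMMENDED_FORMAT = {
    "fine-tuning": "SafeTensors",
    "cpu": "GGUF (Q4_K_M)",
    "apple-silicon": "MLX or GGUF",
    "gpu-serving": "AWQ or GPTQ",
    "mobile": "CoreML or GGUF (small quants)",
    "cross-platform": "ONNX",
    "local-dev": "GGUF",
}


def recommend_format(target: str) -> str:
    """Look up the suggested format for a deployment target (case-insensitive)."""
    key = target.strip().lower()
    if key not in RECOMMENDED_FORMAT:
        known = ", ".join(sorted(RECOMMENDED_FORMAT))
        raise ValueError(f"unknown target {target!r}; expected one of: {known}")
    return RECOMMENDED_FORMAT[key]


print(recommend_format("gpu-serving"))
```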
Learn more at llama.cpp GitHub, AutoAWQ, and HuggingFace Optimum.