What are all the different model formats in LLM?

Question

Accepted Answer

## LLM Model Formats

When you download an LLM, the model weights come in a specific file format. Each format serves different purposes — some optimize for inference speed, others for storage efficiency, and some for cross-framework compatibility.

### Overview of Model Formats

| Format | Type | Compression | Best For |
|--------|------|-------------|----------|
| **PyTorch (.bin / .pt)** | Raw weights | None | Training, research |
| **SafeTensors (.safetensors)** | Raw weights | None | Safe loading (no pickle) |
| **GGUF (.gguf)** | Usually quantized | Yes (roughly 2-bit to 8-bit; F16/F32 also possible) | CPU and hybrid CPU/GPU inference (llama.cpp) |
| **GPTQ (usually .safetensors)** | Quantized | Yes (typically INT4, sometimes INT8) | GPU inference |
| **AWQ (usually .safetensors)** | Quantized | Yes (INT4) | GPU inference (often faster than GPTQ) |
| **ONNX (.onnx)** | Export format | Optional | Cross-framework deployment |
| **MLX (folder bundle)** | Raw/Quantized | Optional | Apple Silicon inference |
| **CoreML (.mlpackage / .mlmodelc)** | Raw/Quantized | Optional | iOS/macOS on-device |
| **TensorFlow (.h5 / SavedModel)** | Raw weights | None | TensorFlow ecosystem |
| **ExLlamaV2 / EXL2 (.safetensors)** | Quantized | Yes (variable, roughly 2 to 8 bits per weight) | High-throughput GPU inference |
| **HQQ (.safetensors)** | Quantized | Yes (INT4) | Fast, calibration-free quantization |

### Raw/Uncompressed Formats

```python
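# PyTorch .bin: the traditional, pickle-based format
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # gated repo: accept the license on the Hub first
    torch_dtype="auto",
    use_safetensors=False,  # force the .bin shards; .safetensors is preferred when present
)
# Loads pytorch_model-00001-of-00002.bin, pytorch_model-00002-of-00002.bin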
```

**SafeTensors** is now the preferred raw format: it avoids pickle-based deserialization, which can execute arbitrary code when you load an untrusted file, and it supports zero-copy (memory-mapped) loading:

```python
# SafeTensors: safer, and usually faster to load
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    use_safetensors=True,  # Already the default when the repo ships .safetensors files
)
# Loads the sharded model-0000x-of-0000y.safetensors files
```
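To see why SafeTensors loading involves no code execution, it helps to look at the file layout: an 8-byte little-endian header length, a JSON header describing each tensor, then the raw tensor bytes. The sketch below writes and reads such a file using only the standard library. The function names and the `demo.weight` tensor are made up for illustration; in practice you would use the `safetensors` package rather than hand-rolling this.

```python
import json
import struct
import tempfile
from pathlib import Path


def write_tiny_safetensors(path, name, floats):
    """Write one 1-D float32 tensor using the safetensors layout (illustrative helper)."""
    data = struct.pack(f"<{len(floats)}f", *floats)
    header = {
        name: {"dtype": "F32", "shape": [len(floats)], "data_offsets": [0, len(data)]}
    }
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte little-endian header size
        f.write(header_bytes)                           # JSON header: names, dtypes, shapes, offsets
        f.write(data)                                   # raw tensor bytes


def read_safetensors_header(path):
    """Return the JSON header without reading any tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


demo_path = Path(tempfile.mkdtemp()) / "demo.safetensors"
write_tiny_safetensors(demo_path, "demo.weight", [1.0, 2.0, 3.0])
header = read_safetensors_header(demo_path)
print(header)
```

Because the header is plain JSON and the payload is raw bytes at known offsets, a loader can memory-map the file and slice tensors out of it, which is where the zero-copy behavior comes from.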

### Quantized Formats — The Practical Choice

| Format | Bits | Quality Loss | Approx. weight size (7B model) | Tool |
|--------|------|-------------|-----------------|------|
| **GGUF Q4_K_M** | 4-bit | Low | ~4.3 GB | llama.cpp, Ollama |
| **GGUF Q8_0** | 8-bit | Negligible | ~7.7 GB | llama.cpp |
| **GPTQ INT4** | 4-bit | Low | ~3.9 GB | AutoGPTQ, vLLM |
| **AWQ INT4** | 4-bit | Very Low | ~3.9 GB | AutoAWQ, vLLM |
| **ExLlamaV2 4bpw** | 4-bit | Very Low | ~3.9 GB | ExLlamaV2 |
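
The size figures above follow from simple arithmetic: parameter count times bits per weight, divided by 8. Here is a minimal sketch, assuming a nominal 7 billion parameters and counting only the weights (the helper name is hypothetical).

```python
def estimate_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 7e9  # nominal "7B" model; real parameter counts differ slightly
for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>6}: ~{estimate_weight_gb(n_params, bits):.1f} GB")
```

Real files come out somewhat larger than this estimate because quantized formats also store per-group scales, often keep some tensors at higher precision, and "7B" models usually have slightly more than 7 billion parameters. At inference time the KV cache and activations add further memory on top of the weights.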

### GGUF — The CPU/Edge King

```bash
# Download a GGUF model
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf

# Run with llama.cpp (the binary is `llama-cli` in current builds; older builds call it `./main`)
./llama-cli -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -p "What is RAG?" -n 256

# Or with Ollama (which uses GGUF internally)
ollama pull mistral
ollama run mistral "What is RAG?"
```
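
A GGUF file is a single self-describing container: it starts with the magic bytes `GGUF`, a format version, and counts of the tensors and metadata key/value pairs that follow. The sketch below reads just those fixed-size fields with the standard library, assuming GGUF version 2 or later (where the counts are 64-bit). The function name is hypothetical and the demo header is fabricated, not taken from a real model.

```python
import struct
import tempfile
from pathlib import Path


def read_gguf_header(path):
    """Read the fixed-size fields at the start of a GGUF file (version 2 or later)."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        # uint32 version, uint64 tensor count, uint64 metadata key/value count
        version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    return {"version": version, "tensor_count": tensor_count, "metadata_kv_count": kv_count}


# Demo on a fabricated header (the counts below are made up for illustration).
demo_path = Path(tempfile.mkdtemp()) / "demo.gguf"
demo_path.write_bytes(b"GGUF" + struct.pack("<IQQ", 3, 291, 24))
info = read_gguf_header(demo_path)
print(info)
```

Pointing `read_gguf_header` at a downloaded `.gguf` file is a quick sanity check that the download is a valid GGUF file rather than an HTML error page.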

**GGUF quantization levels:**

| Quant | Bits (nominal) | Quality |
|-------|----------------|---------|
| Q2_K | ~2.5 | Very low (aggressive) |
| Q3_K_S/M/L | 3 | Low |
| Q4_K_S/M | 4 | Good (best value) |
| Q5_K_S/M | 5 | Very good |
| Q6_K | 6 | Excellent |
| Q8_0 | 8 | Near-perfect |
| F16 | 16 | No quantization loss (half-precision weights) |
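
To make the quantization levels concrete, here is a simplified sketch of block-wise absmax quantization in the spirit of `Q8_0`: weights are split into blocks of 32, and each block stores 8-bit integers plus one shared scale. This is an illustration in plain Python, not llama.cpp's actual code, and the K-quants (`Q4_K_M` and friends) use a more elaborate super-block layout.

```python
import random

BLOCK_SIZE = 32  # weights per block, as in Q8_0


def quantize_block(block):
    """Absmax-quantize one block to int8 values plus a single float scale."""
    amax = max(abs(w) for w in block)
    scale = amax / 127 if amax > 0 else 1.0
    q = [round(w / scale) for w in block]
    return scale, q


def dequantize_block(scale, q):
    """Recover approximate float weights from the integers and the block scale."""
    return [scale * v for v in q]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]
scale, q = quantize_block(weights)
restored = dequantize_block(scale, q)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

# 8 bits per weight plus one 16-bit scale shared by the whole block
bits_per_weight = (8 * BLOCK_SIZE + 16) / BLOCK_SIZE
print(f"max abs error: {max_err:.6f}, effective bits/weight: {bits_per_weight}")
```

The per-block scale is why the effective bits per weight end up a little above the nominal figure, and why the file sizes are slightly larger than a naive bits-times-parameters estimate.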

### GPTQ vs AWQ — GPU Quantized Formats

Both achieve 4-bit inference on GPU, but they differ in approach:

| Feature | GPTQ | AWQ |
|---------|------|-----|
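| **Quantization method** | Layer-wise, one-shot, with Hessian-based error correction (needs calibration data) | Activation-aware weight scaling (needs a small calibration set) |
| **Inference speed** | Fast | Often faster (~1.1-1.4x, depending on kernel, GPU, and batch size) |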
| **Quality (INT4)** | Good | Very good |
| **Library** | `auto-gptq` | `autoawq` |
| **vLLM support** | Yes | Yes |
| **Memory** | Similar | Slightly lower |

```python
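# Both loaders go through transformers; the quantization config is read from the repo.
from transformers import AutoModelForCausalLM

# Load GPTQ model (requires `optimum` plus a GPTQ backend such as `auto-gptq`)
gptq_model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",
    device_map="auto",
)

# Load AWQ model (requires `autoawq`)
awq_model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    device_map="auto",
)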
```
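
The "activation-aware" idea can be shown with a toy experiment: if one input channel carries much larger activations than the others, scaling its weight up before quantization (and its activation down by the same factor) shrinks that channel's contribution to the output error. The sketch below is an illustration under simplified assumptions (a single 4-weight group, symmetric 4-bit rounding, a fixed scale factor of 2). It is not how AutoAWQ is implemented, and all names are hypothetical.

```python
import random


def quantize_int4(weights):
    """Symmetric 4-bit absmax quantization of one weight group, returned dequantized."""
    scale = max(abs(w) for w in weights) / 7
    return [scale * max(-7, min(7, round(w / scale))) for w in weights]


def output_error(weights, acts, channel_scales):
    """Absolute gap between the exact dot product and the quantized, channel-scaled one."""
    exact = sum(w * x for w, x in zip(weights, acts))
    scaled_w = [w * s for w, s in zip(weights, channel_scales)]
    scaled_x = [x / s for x, s in zip(acts, channel_scales)]
    approx = sum(w * x for w, x in zip(quantize_int4(scaled_w), scaled_x))
    return abs(exact - approx)


random.seed(0)
acts = [10.0, 0.1, 0.1, 0.1]  # channel 0 carries much larger activations
plain_total = scaled_total = 0.0
trials = 2000
for _ in range(trials):
    # Last weight is pinned at 1.0 so the group's quantization step stays the same.
    weights = [random.uniform(-0.5, 0.5) for _ in range(3)] + [1.0]
    plain_total += output_error(weights, acts, [1.0, 1.0, 1.0, 1.0])
    scaled_total += output_error(weights, acts, [2.0, 1.0, 1.0, 1.0])

print(f"mean error, no scaling:        {plain_total / trials:.4f}")
print(f"mean error, channel 0 scaled:  {scaled_total / trials:.4f}")
```

In a real AWQ pipeline the per-channel scales are searched using calibration activations rather than fixed by hand, and the inverse scaling is folded into the preceding operation so inference pays no extra cost.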

### ONNX — Cross-Platform Standard

ONNX is an intermediate representation: once exported, the model runs on ONNX Runtime across its execution providers (CPU, CUDA, DirectML, and others), which is useful for Triton Inference Server, Azure ML, and Windows deployment:

```bash
# Export to ONNX (the -with-past task also exports the KV-cache inputs used during generation)
optimum-cli export onnx --model mistralai/Mistral-7B-v0.1 --task text-generation-with-past ./mistral-onnx
```

```python
# Load with ONNX Runtime
from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained("./mistral-onnx")
```

### Choosing the Right Format

| Use Case | Recommended Format |
|----------|-------------------|
| **Research / Fine-tuning** | SafeTensors (.safetensors) |
| **CPU inference (laptop/server)** | GGUF (Q4_K_M) |
| **Apple Silicon (M1/M2/M3)** | MLX or GGUF |
| **GPU serving (vLLM / TGI)** | AWQ or GPTQ |
| **Edge / mobile** | CoreML, GGUF (small quants) |
| **Cross-platform / enterprise** | ONNX |
| **Max throughput (batching)** | ExLlamaV2 |
| **Local dev (LM Studio / Ollama)** | GGUF |

> **Rule of thumb:** Start with SafeTensors for flexibility. For CPU/local use, go GGUF. For GPU serving at scale, pick AWQ. The format should match your deployment target — not your ego.

Learn more at [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp), [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), and [HuggingFace Optimum](https://huggingface.co/docs/optimum).
