Does the speed of running local AI (LLM) models on a GPU depend more on the type of GPU used, or on how much VRAM the GPU has?
Answer
GPU Speed vs VRAM for Local LLM Inference
The Short Answer
Both GPU architecture and VRAM matter, but for different reasons. VRAM is the gatekeeper — it determines which models you can run at all. GPU architecture determines speed — once the model fits, memory bandwidth, Tensor Cores, and cache hierarchy drive tokens/second.
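If a model doesn't fit in VRAM, inference either fails to load or spills to CPU and system RAM, which is often 10x or more slower because system RAM bandwidth is a fraction of VRAM bandwidth. If it fits, a GPU with higher memory bandwidth and the same VRAM can produce up to 2-3x more tokens/second.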
VRAM: What Models Can You Run?
| VRAM Role | Per-Billion-Parameter Cost |
|---|---|
| Model weights (FP16) | ~2 GB per billion parameters |
| Model weights (Q4_K_M) | ~0.5 GB per billion parameters |
| KV cache | ~0.5-1.5 GB per 8K tokens (model-dependent) |
| Overhead (activations, buffers) | ~1-2 GB |
```python
# Estimate VRAM needed for local LLM inference
def estimate_vram(model_params_B, quant_bits, context_len, batch_size=1, kv_gb_per_8k=1.0):
    weight_gb = model_params_B * (quant_bits / 8)
    # KV cache: ~0.5-1.5 GB per 8K tokens depending on the model; 1.0 is a mid-range guess
    kv_cache_gb = (context_len / 8192) * kv_gb_per_8k * batch_size
    overhead_gb = 2.0  # activations, CUDA context, buffers
    return weight_gb + kv_cache_gb + overhead_gb

# Llama 3.1 8B at Q4 with 8K context
print(f"8B Q4 8K: {estimate_vram(8, 4, 8192):.1f} GB")     # 7.0 GB
print(f"8B FP16 8K: {estimate_vram(8, 16, 8192):.1f} GB")  # 19.0 GB
print(f"70B Q4 4K: {estimate_vram(70, 4, 4096):.1f} GB")   # 37.5 GB
```
| GPU | VRAM | What Fits (Q4 quantization) |
|---|---|---|
| RTX 3060 | 12 GB | 7-8B params (Llama 3.1 8B, Mistral 7B) |
| RTX 4070 | 12 GB | 7-8B params + moderate context |
| RTX 4080 | 16 GB | 13B params or 7B with long context |
| RTX 4090 | 24 GB | 34B params; Mixtral 8x7B MoE only at ~3-bit quantization or with partial CPU offload |
| RTX 6000 Ada | 48 GB | 70B params (Llama 3 70B Q4) |
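| A100 | 40/80 GB | 34B Q4 (40 GB); 70B Q4-Q6 (80 GB). 70B FP16 needs ~140 GB and does not fit |
| Mac M2 Ultra | 192 GB (unified) | 70B-180B Q4 (slower than a discrete GPU). 405B Q4 needs ~200 GB and does not fit |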
GPU Architecture: What Determines Speed
For a model that fits in VRAM, these factors determine tokens/second:
| Architectural Factor | What It Controls | Impact on Speed |
|---|---|---|
| Memory Bandwidth | How fast data moves VRAM → GPU | Dominant factor — most inference is memory-bound |
| Tensor Cores | Matrix multiplication throughput | Affects prompt processing (prefill) and batched inference |
| L2 Cache Size | Reduces expensive VRAM round-trips | Lower latency per token |
| Clock Speed | Overall compute rate | Modest effect (5-10%) |
| Arch Generation | New instruction support (FP8, FP4) | Efficiency gains on newer architectures |
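To see why single-token decode is memory-bound while prompt prefill leans on Tensor Cores, a rough roofline-style sketch can help. It assumes each weight is read once per forward pass and that a pass costs about 2 FLOPs per parameter per token. The function name and the 150 TFLOPS peak are illustrative assumptions, not measured values.

```python
# Rough roofline sketch: is a forward pass limited by memory or by compute?
# Assumptions (not from any benchmark): weights are read once per pass,
# ~2 FLOPs per parameter per token, and peak figures are fully achievable.

def pass_time_ms(params_b, bytes_per_param, tokens, bandwidth_gb_s, peak_tflops):
    """Return (memory_ms, compute_ms) for one forward pass over `tokens` tokens."""
    weight_bytes = params_b * 1e9 * bytes_per_param
    memory_ms = weight_bytes / (bandwidth_gb_s * 1e9) * 1e3
    flops = 2 * params_b * 1e9 * tokens
    compute_ms = flops / (peak_tflops * 1e12) * 1e3
    return memory_ms, compute_ms

BANDWIDTH_GB_S = 1008  # RTX 4090 memory bandwidth
PEAK_TFLOPS = 150      # illustrative placeholder for tensor throughput

# 8B parameters at Q4 (~0.5 bytes per parameter): decode (1 token) vs prefill (512 tokens)
for tokens in (1, 512):
    mem_ms, cmp_ms = pass_time_ms(8, 0.5, tokens, BANDWIDTH_GB_S, PEAK_TFLOPS)
    bound = "memory" if mem_ms > cmp_ms else "compute"
    print(f"{tokens:>4} tokens: memory {mem_ms:.2f} ms, compute {cmp_ms:.2f} ms -> {bound}-bound")
```

Under these assumptions, one token per pass spends far longer waiting on weight reads than on math, while a 512-token prompt flips the balance toward compute.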
The Memory Bandwidth Formula
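Memory bandwidth is the primary bottleneck for single-user local inference, because generating each token requires reading essentially all of the model weights from VRAM:
tokens/second ≈ memory bandwidth (GB/s) / (model size (GB) × 1.2)
The 1.2x overhead factor accounts for KV cache reads/writes during autoregressive generation.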
```python
# Token speed estimation (bandwidth in GB/s, model size in GB)
def estimate_tps(bandwidth_gb_s, model_gb):
    return bandwidth_gb_s / (model_gb * 1.2)

# RTX 4090 (1008 GB/s) with Llama 8B Q4 (~5 GB)
print(f"4090: {estimate_tps(1008, 5):.0f} tok/s")  # ~168 tok/s
# RTX 3070 (448 GB/s) with Llama 8B Q4 (~5 GB)
print(f"3070: {estimate_tps(448, 5):.0f} tok/s")   # ~75 tok/s
# RTX 4070 (504 GB/s) with Llama 8B Q4 (~5 GB)
print(f"4070: {estimate_tps(504, 5):.0f} tok/s")   # ~84 tok/s
```
Real-World Performance Comparison
| GPU | VRAM | Bandwidth | Tensor Cores | L2 Cache | Llama 3.1 8B Q4 (tok/s) | Gen |
|---|---|---|---|---|---|---|
| RTX 3070 | 8 GB | 448 GB/s | 184 (3rd gen) | 4 MB | ~75 | Ampere |
| RTX 3080 | 10 GB | 760 GB/s | 272 (3rd gen) | 5 MB | ~110 | Ampere |
| RTX 4070 | 12 GB | 504 GB/s | 184 (4th gen) | 36 MB | ~85 | Ada |
| RTX 4080 | 16 GB | 717 GB/s | 304 (4th gen) | 64 MB | ~115 | Ada |
| RTX 4090 | 24 GB | 1008 GB/s | 512 (4th gen) | 72 MB | ~165 | Ada |
| Mac M2 Ultra | 192 GB | 800 GB/s | N/A (ANE) | — | ~25 | Apple |
Key insight: The RTX 4070 has more VRAM (12 GB) than the RTX 3080 (10 GB), but the RTX 3080 is faster (110 vs 85 tok/s) because of its higher memory bandwidth (760 vs 504 GB/s). VRAM doesn't make you fast — it makes you capable.
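A quick way to check this against the table is to divide tok/s by bandwidth and by VRAM for each card and see which ratio stays roughly constant. The sketch below copies the VRAM, bandwidth, and tok/s columns from the table above (the M2 Ultra row is left out because it is not a discrete GPU). The variable and function names are illustrative.

```python
# Rows copied from the comparison table: (name, vram_gb, bandwidth_gb_s, tok_s)
GPUS = [
    ("RTX 3070", 8, 448, 75),
    ("RTX 3080", 10, 760, 110),
    ("RTX 4070", 12, 504, 85),
    ("RTX 4080", 16, 717, 115),
    ("RTX 4090", 24, 1008, 165),
]

def spread(values):
    """Ratio of largest to smallest value; 1.0 means perfectly proportional."""
    return max(values) / min(values)

tok_per_bandwidth = [tok_s / bw for _, _, bw, tok_s in GPUS]
tok_per_vram = [tok_s / vram for _, vram, _, tok_s in GPUS]

for (name, *_), per_bw, per_vram in zip(GPUS, tok_per_bandwidth, tok_per_vram):
    print(f"{name}: {per_bw:.3f} tok/s per GB/s, {per_vram:.1f} tok/s per GB of VRAM")

print(f"spread per GB/s: {spread(tok_per_bandwidth):.2f}x")
print(f"spread per GB of VRAM: {spread(tok_per_vram):.2f}x")
```

A spread near 1 means throughput scales almost linearly with that column. With these rows the bandwidth spread comes out noticeably smaller than the VRAM spread, which is the pattern the key insight describes.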
When VRAM Becomes the Speed Bottleneck
VRAM indirectly affects speed in these scenarios:
| Scenario | Problem | Impact |
|---|---|---|
| Model doesn't fit | Falls back to CPU/RAM or fails | Often 10x or more slower |
| VRAM is nearly full | Fragmentation causes extra allocations | 10-30% slower |
| Long context (worse without GQA) | KV cache competes with weights for VRAM | Limits max context length |
| Multi-GPU split | Layer distribution overhead | 5-15% slower than single GPU |
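For the long-context row, the KV cache size can be estimated from the model's architecture. This is a minimal sketch, assuming an FP16 cache that stores one key and one value vector per layer, per KV head, per token. The function name is illustrative, and the Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) comes from its published configuration.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len,
                 bytes_per_value=2, batch_size=1):
    """KV cache size in GiB: keys + values for every layer, KV head, and token."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_len * batch_size / 2**30

# Llama 3.1 8B uses grouped-query attention: 32 layers, 8 KV heads, head_dim 128
print(f"8K context:   {kv_cache_gib(32, 8, 128, 8192):.1f} GiB")    # 1.0 GiB
print(f"128K context: {kv_cache_gib(32, 8, 128, 131072):.1f} GiB")  # 16.0 GiB

# Same shape without GQA (32 KV heads), to show what GQA saves
print(f"8K context, no GQA: {kv_cache_gib(32, 32, 128, 8192):.1f} GiB")  # 4.0 GiB
```

At 128K tokens the cache alone is several times larger than the Q4 weights of an 8B model, which is why long context limits what fits even when the weights themselves are small.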
Decision Framework
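One way to frame the decision in code, following the priority order this answer argues for: first drop GPUs that cannot hold the model, then pick the highest memory bandwidth among those that can. This is only a sketch. The function and variable names are made up for illustration, the specs are the ones listed in this answer, and the VRAM requirement is a number you would estimate separately.

```python
# (name, vram_gb, bandwidth_gb_s), specs as listed in this answer
CANDIDATES = [
    ("RTX 3070", 8, 448),
    ("RTX 3080", 10, 760),
    ("RTX 4070", 12, 504),
    ("RTX 4080", 16, 717),
    ("RTX 4090", 24, 1008),
]

def pick_gpu(required_vram_gb, candidates=CANDIDATES):
    """Return the highest-bandwidth GPU that fits the model, or None if none fits."""
    fits = [gpu for gpu in candidates if gpu[1] >= required_vram_gb]
    if not fits:
        return None  # model would spill to system RAM or fail to load
    return max(fits, key=lambda gpu: gpu[2])  # among GPUs that fit, bandwidth decides

# Hypothetical shortlist of two similarly priced cards
shortlist = [gpu for gpu in CANDIDATES if gpu[0] in ("RTX 3080", "RTX 4070")]
print(pick_gpu(9, shortlist))   # both fit, so the RTX 3080 wins on bandwidth
print(pick_gpu(11, shortlist))  # only the RTX 4070 has enough VRAM
print(pick_gpu(40))             # nothing in the list fits: None
```

The sketch ignores price, power, and software support, so treat it as a way to order the questions rather than a buying tool.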
Practical Recommendations
For 7-8B models (Mistral, Llama 3.1, Qwen 2.5):
- Minimum: 8 GB VRAM (Q4 quantization, short context)
- Comfortable: 12-16 GB VRAM (Q4/Q5 with 8K+ context)
- Ideal: 16-24 GB VRAM (Q6/Q8, long context, prompt caching)
For 13-34B models:
- Minimum: 16 GB VRAM (Q4, short context)
- Comfortable: 24 GB VRAM (Q4, 4K+ context)
- Ideal: 48+ GB VRAM (Q6, full context)
For 70B+ models:
- Minimum: 48 GB VRAM (Q4, ~2K context) or dual 24 GB GPUs
- Comfortable: 80 GB VRAM (Q4, 4K+ context)
- Ideal: A100/H100 (80 GB) or Mac Studio (192 GB unified)
Summary
| Priority | Factor | Why It Matters |
|---|---|---|
| 1st | VRAM Capacity | Determines if you can run the model at all |
| 2nd | Memory Bandwidth | Determines how fast (dominant bottleneck for inference) |
| 3rd | Architecture Generation | Efficiency, new precision formats, cache improvements |
| 4th | Compute (CUDA/Tensor Cores) | Matters for prompt prefill, batched inference, and training — not single-token decode |
The rule of thumb: Buy the most VRAM your budget allows. Among GPUs with similar VRAM, pick the one with higher memory bandwidth. Architecture generation is the tiebreaker when VRAM and bandwidth are similar.
Learn more at r/LocalLLaMA GPU Guide and llama.cpp hardware discussion.