Does the speed of running local AI (LLM) models on a GPU depend more on the type of GPU used, or on how much VRAM the GPU has?

Question

Accepted Answer

## GPU Speed vs VRAM for Local LLM Inference

### The Short Answer

Both GPU architecture and VRAM matter, but for different reasons. **VRAM is the gatekeeper** — it determines which models you can run at all. **GPU architecture determines speed** — once the model fits, memory bandwidth, Tensor Cores, and cache hierarchy drive tokens/second.

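If a model doesn't fit in VRAM, it either fails to load or spills over to CPU and system RAM, which is typically an order of magnitude or more slower because system RAM bandwidth is far below VRAM bandwidth. If it fits, a GPU with the same VRAM but higher memory bandwidth can produce roughly 2-3x more tokens/second.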

### VRAM: What Models Can You Run?

| VRAM Consumer | Approximate Cost |
|-----------|---------------------------|
| Model weights (FP16) | ~2 GB per billion parameters |
| Model weights (Q4_K_M) | ~0.5-0.6 GB per billion parameters |
| KV cache | ~0.5-1.5 GB per 8K tokens (model-dependent) |
| Overhead (activations, buffers) | ~1-2 GB |

```python
# Estimate VRAM needed for local LLM inference (rough planning numbers)
def estimate_vram(model_params_B, quant_bits, context_len, batch_size=1,
                  kv_mb_per_token=0.125):
    weight_gb = model_params_B * (quant_bits / 8)
    # KV cache grows with context length and batch size. 0.125 MB/token is
    # ~1 GB per 8K tokens (an 8B GQA model); larger models need more.
    kv_cache_gb = context_len * batch_size * kv_mb_per_token / 1024
    overhead_gb = 2.0  # activations, CUDA context, buffers
    return weight_gb + kv_cache_gb + overhead_gb

# Llama 3.1 8B at Q4 with 8K context
print(f"8B Q4 8K: {estimate_vram(8, 4, 8192):.1f} GB")  # ~7 GB
print(f"8B FP16 8K: {estimate_vram(8, 16, 8192):.1f} GB")  # ~19 GB
# 70B-class models cache more per token (~0.3 MB with GQA)
print(f"70B Q4 4K: {estimate_vram(70, 4, 4096, kv_mb_per_token=0.3125):.1f} GB")  # ~38 GB
```
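
The KV cache figure can also be derived from a model's attention layout instead of being taken as a flat per-token constant. The following is a minimal sketch, assuming a standard decoder-only transformer that caches one key and one value vector per KV head in every layer. The function name `kv_cache_gib` is illustrative, and the example configuration (32 layers, 8 KV heads, head dimension 128, FP16 cache) is an assumption that should be checked against the config of the model you actually run.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len,
                 bytes_per_elem=2, batch_size=1):
    """Approximate KV cache size in GiB for a decoder-only transformer."""
    # Factor of 2: both a key tensor and a value tensor are cached per layer
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return bytes_per_token * context_len * batch_size / 1024**3

# Assumed config: 32 layers, 8 KV heads (grouped-query attention), head_dim 128
gqa = kv_cache_gib(32, 8, 128, context_len=8192)
# Same shape if every one of 32 attention heads kept its own K/V (no GQA)
mha = kv_cache_gib(32, 32, 128, context_len=8192)
print(f"GQA: {gqa:.2f} GiB, full multi-head: {mha:.2f} GiB")
```

Under these assumptions the grouped-query layout needs a quarter of the cache of the full multi-head layout, which is why the per-token cost is so model-dependent.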

| GPU | VRAM | What Fits (Q4 quantization) |
|-----|------|---------------------------|
| RTX 3060 | 12 GB | 7-8B params (Llama 3.1 8B, Mistral 7B) |
| RTX 4070 | 12 GB | 7-8B params + moderate context |
| RTX 4080 | 16 GB | 13B params or 7B with long context |
| RTX 4090 | 24 GB | 34B params; Mixtral 8x7B MoE only at ~3-bit or with partial CPU offload |
| RTX 6000 Ada | 48 GB | 70B params (Llama 3 70B Q4) |
| A100 | 40/80 GB | 34B params (40 GB); 70B params with long context (80 GB) |
| Mac M2 Ultra | 192 GB (unified) | ~180B params (slow but fits); 405B Q4 needs 200+ GB |
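
For a card that is not in the table, the same budget can be inverted to get a rough ceiling on model size. This is only a sketch: the function name `max_params_b` and the default reserves (1 GB of KV cache, 2 GB of overhead) are assumptions, and the result is an optimistic upper bound because real 4-bit formats use slightly more than 4 bits per weight.

```python
def max_params_b(vram_gb, quant_bits=4, kv_cache_gb=1.0, overhead_gb=2.0):
    """Largest model (billions of parameters) whose weights fit the budget."""
    usable_gb = vram_gb - kv_cache_gb - overhead_gb
    gb_per_billion = quant_bits / 8  # e.g. 4 bits -> 0.5 GB per billion params
    return max(usable_gb, 0.0) / gb_per_billion

for name, vram in [("RTX 3060", 12), ("RTX 4080", 16),
                   ("RTX 4090", 24), ("RTX 6000 Ada", 48)]:
    print(f"{name}: at most ~{max_params_b(vram):.0f}B parameters at 4-bit")
```

The table is deliberately more conservative than this ceiling, since it leaves headroom for longer contexts and is aligned to commonly released model sizes.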

### GPU Architecture: What Determines Speed

For a model that fits in VRAM, these factors determine tokens/second:

| Architectural Factor | What It Controls | Impact on Speed |
|---------------------|-----------------|----------------|
| **Memory Bandwidth** | How fast data moves VRAM → GPU | **Dominant factor** — most inference is memory-bound |
| **Tensor Cores** | Matrix multiplication throughput | Affects prompt processing (prefill) and batched inference |
| **L2 Cache Size** | Reduces expensive VRAM round-trips | Lower latency per token |
| **Clock Speed** | Overall compute rate | Modest effect (5-10%) |
| **Arch Generation** | New instruction support (FP8, FP4) | Efficiency gains on newer architectures |
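
One way to see why bandwidth dominates token generation while Tensor Cores matter mostly for prefill is a roofline-style estimate: a forward pass takes at least as long as the slower of "read all the weights" and "do all the multiply-adds". The sketch below assumes about 2 FLOPs per parameter per token and uses a hypothetical GPU (1000 GB/s, 100 usable TFLOPS). Neither number describes a specific card, and `step_time_ms` is an illustrative name.

```python
def step_time_ms(params_b, bytes_per_param, n_tokens, bandwidth_gb_s, peak_tflops):
    """Roofline-style lower bound for one forward pass over n_tokens tokens."""
    weight_bytes = params_b * 1e9 * bytes_per_param
    flops = 2 * params_b * 1e9 * n_tokens  # ~2 FLOPs per parameter per token
    memory_ms = weight_bytes / (bandwidth_gb_s * 1e9) * 1e3
    compute_ms = flops / (peak_tflops * 1e12) * 1e3
    bound = "memory" if memory_ms >= compute_ms else "compute"
    return max(memory_ms, compute_ms), bound

# Hypothetical GPU; 8B model at ~0.5 bytes per parameter (4-bit)
decode = step_time_ms(8, 0.5, n_tokens=1, bandwidth_gb_s=1000, peak_tflops=100)
prefill = step_time_ms(8, 0.5, n_tokens=2048, bandwidth_gb_s=1000, peak_tflops=100)
print("decode  (1 token):    ", decode)
print("prefill (2048 tokens):", prefill)
```

With these inputs, generating a single token is limited by memory traffic, while processing a 2048-token prompt in one pass is limited by compute, because the weights are read once but used for every prompt token.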

### The Memory Bandwidth Formula

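Memory bandwidth is the primary bottleneck for single-user local inference, because generating each token requires streaming essentially all of the model's weights from VRAM once: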

$$\text{Tokens/second} \approx \frac{\text{Memory Bandwidth (GB/s)}}{\text{Model Size in VRAM (GB)} \times 1.2}$$

The 1.2 factor is a rough allowance for KV cache reads and writes during autoregressive generation. Treat the result as a ceiling: measured throughput is usually somewhat lower.

```python
# Token speed estimation (bandwidth in GB/s, model size in GB)
def estimate_tps(bandwidth_gb_s, model_gb):
    return bandwidth_gb_s / (model_gb * 1.2)

# RTX 4090 (1008 GB/s) with Llama 8B Q4 (~5 GB)
print(f"4090: {estimate_tps(1008, 5):.0f} tok/s")   # ~168 tok/s

# RTX 3070 (448 GB/s) with Llama 8B Q4 (~5 GB)
print(f"3070: {estimate_tps(448, 5):.0f} tok/s")    #  ~75 tok/s

# RTX 4070 (504 GB/s) with Llama 8B Q4 (~5 GB)
print(f"4070: {estimate_tps(504, 5):.0f} tok/s")    #  ~84 tok/s
```
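
The same relationship can be turned around to ask how much bandwidth a target speed implies, which is often the more useful question when shopping. This is a sketch under the same memory-bound assumption and the same 1.2 allowance as above; `required_bandwidth_gb_s` is an illustrative helper, not part of any library.

```python
def required_bandwidth_gb_s(target_tps, model_gb, overhead=1.2):
    """Bandwidth needed to reach target_tps if decoding is purely memory-bound."""
    return target_tps * model_gb * overhead

for target in (30, 60, 120):
    needed = required_bandwidth_gb_s(target, model_gb=5)
    print(f"{target} tok/s on a 5 GB model needs ~{needed:.0f} GB/s")
```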

### Real-World Performance Comparison

| GPU | VRAM | Bandwidth | Tensor Cores | L2 Cache | Llama 3.1 8B Q4 (tok/s) | Gen |
|-----|------|-----------|-------------|----------|--------------------------|-----|
| RTX 3070 | 8 GB | 448 GB/s | 184 (3rd gen) | 4 MB | ~75 | Ampere |
| RTX 3080 | 10 GB | 760 GB/s | 272 (3rd gen) | 5 MB | ~110 | Ampere |
| RTX 4070 | 12 GB | 504 GB/s | 184 (4th gen) | 36 MB | ~85 | Ada |
| RTX 4080 | 16 GB | 717 GB/s | 304 (4th gen) | 64 MB | ~115 | Ada |
| RTX 4090 | 24 GB | 1008 GB/s | 512 (4th gen) | 72 MB | ~165 | Ada |
| Mac M2 Ultra | 192 GB | 800 GB/s | N/A (ANE) | — | ~25 | Apple |

> **Key insight:** The RTX 4070 has more VRAM (12 GB) than the RTX 3080 (10 GB), but the RTX 3080 is faster (110 vs 85 tok/s) because of its higher memory bandwidth (760 vs 504 GB/s). VRAM doesn't make you fast — it makes you *capable*.

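A quick way to check this against the table is to divide each card's tok/s by its bandwidth and by its VRAM. The sketch below copies the NVIDIA rows from the table as given (they are not new measurements) and compares how much each ratio varies across cards.

```python
# (name, vram_gb, bandwidth_gb_s, tok_s) copied from the NVIDIA rows of the table
gpus = [
    ("RTX 3070", 8, 448, 75),
    ("RTX 3080", 10, 760, 110),
    ("RTX 4070", 12, 504, 85),
    ("RTX 4080", 16, 717, 115),
    ("RTX 4090", 24, 1008, 165),
]

for name, vram, bw, tps in gpus:
    print(f"{name}: {tps / bw:.3f} tok/s per GB/s, {tps / vram:.1f} tok/s per GB of VRAM")

def spread(values):
    """Ratio of largest to smallest value; 1.0 means perfectly proportional."""
    return max(values) / min(values)

bw_spread = spread([tps / bw for _, _, bw, tps in gpus])
vram_spread = spread([tps / vram for _, vram, _, tps in gpus])
print(f"spread vs bandwidth: {bw_spread:.2f}x, spread vs VRAM: {vram_spread:.2f}x")
```

Throughput per unit of bandwidth stays within a narrow band across all five cards, while throughput per GB of VRAM varies much more, which is what you would expect if bandwidth rather than capacity sets the speed.
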
### When VRAM Becomes the Speed Bottleneck

VRAM indirectly affects speed in these scenarios:

| Scenario | Problem | Impact |
|----------|---------|--------|
| Model **doesn't fit** | Falls back to CPU/RAM or fails | Typically an order of magnitude or more slower |
| VRAM is **nearly full** | Fragmentation causes extra allocations | 10-30% slower |
| **Long context** (worse without GQA) | KV cache competes with weights | Limits max context length |
| **Multi-GPU split** | Layer distribution overhead | 5-15% slower than single GPU |
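
The cost of a model that only partly fits can be estimated with the same bandwidth reasoning: each token still needs every weight read once, but part of that traffic now goes through much slower system RAM. The sketch below is a simplification that ignores PCIe transfers and compute time. The 60 GB/s system RAM figure and the 40 GB model are assumptions for illustration, and `offload_tps` is a hypothetical helper.

```python
def offload_tps(model_gb, gpu_fraction, gpu_bw_gb_s, cpu_bw_gb_s):
    """Tokens/s when gpu_fraction of the weights sit in VRAM, the rest in system RAM."""
    gpu_time = model_gb * gpu_fraction / gpu_bw_gb_s
    cpu_time = model_gb * (1 - gpu_fraction) / cpu_bw_gb_s
    return 1 / (gpu_time + cpu_time)

# Assumed: 40 GB model, 1008 GB/s GPU, 60 GB/s system RAM (hypothetical desktop)
for frac in (1.0, 0.9, 0.5, 0.0):
    print(f"{frac:.0%} of weights on GPU: {offload_tps(40, frac, 1008, 60):.1f} tok/s")
```

Even a small share of weights left in system RAM dominates the time per token, so a model that almost fits runs far closer to CPU speed than to GPU speed.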

### Decision Framework

```mermaid
graph TD
    Q[Choosing a GPU for Local LLMs] --> VRAM{Does your target model fit?}
    VRAM -->|No| A[More VRAM or lower-bit quantization]
    VRAM -->|Yes| BW{Memory bandwidth}
    BW --> B[Higher BW = more tok/s]
    B --> Arch[Architecture generation as tiebreaker]
    Arch --> C[FP8 support, larger cache, better Tensor Cores]
    A --> Final[Same VRAM, pick higher bandwidth]
    C --> Final
```
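
The first two steps of the flowchart can be written as a small filter-then-rank function. This is a sketch only: the `Gpu` dataclass and `pick_gpu` are illustrative names, the candidate specs are taken from the comparison table, and the architecture tiebreaker is left out for brevity.

```python
from dataclasses import dataclass

@dataclass
class Gpu:
    name: str
    vram_gb: float
    bandwidth_gb_s: float

def pick_gpu(candidates, required_vram_gb):
    """Keep GPUs where the model fits, then prefer the highest memory bandwidth."""
    fitting = [g for g in candidates if g.vram_gb >= required_vram_gb]
    if not fitting:
        return None  # need more VRAM or a lower-bit quantization
    return max(fitting, key=lambda g: g.bandwidth_gb_s)

candidates = [
    Gpu("RTX 3080", 10, 760),
    Gpu("RTX 4070", 12, 504),
    Gpu("RTX 4080", 16, 717),
]
print(pick_gpu(candidates, required_vram_gb=7))   # fits on all three cards
print(pick_gpu(candidates, required_vram_gb=11))  # rules out the 10 GB card
print(pick_gpu(candidates, required_vram_gb=20))  # nothing fits
```

Note how the answer changes with the VRAM requirement: bandwidth only decides between cards that have already passed the capacity check.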

### Practical Recommendations

**For 7-8B models (Mistral, Llama 3.1, Qwen 2.5):**

* **Minimum:** 8 GB VRAM (Q4 quantization, short context)
* **Comfortable:** 12-16 GB VRAM (Q4/Q5 with 8K+ context)
* **Ideal:** 16-24 GB VRAM (Q6/Q8, long context, prompt caching)

**For 13-34B models:**

* **Minimum:** 16 GB VRAM (Q4, short context)
* **Comfortable:** 24 GB VRAM (Q4, 4K+ context)
* **Ideal:** 48+ GB VRAM (Q6, full context)

**For 70B+ models:**

* **Minimum:** 48 GB VRAM (Q4, ~2K context) or dual 24 GB GPUs
* **Comfortable:** 80 GB VRAM (Q4, 4K+ context)
* **Ideal:** A100/H100 (80 GB) or Mac Studio (192 GB unified)

### Summary

| Priority | Factor | Why It Matters |
|----------|--------|---------------|
| **1st** | VRAM Capacity | Determines *if* you can run the model at all |
| **2nd** | Memory Bandwidth | Determines *how fast* (dominant bottleneck for inference) |
| **3rd** | Architecture Generation | Efficiency, new precision formats, cache improvements |
| **4th** | Compute (CUDA/Tensor Cores) | Matters for prompt prefill, batched inference, and training — not single-token decode |

> **The rule of thumb:** Buy the most VRAM your budget allows. Among GPUs with similar VRAM, pick the one with higher memory bandwidth. Architecture generation is the tiebreaker when VRAM and bandwidth are similar.

Learn more at [r/LocalLLaMA GPU Guide](https://www.reddit.com/r/LocalLLaMA/wiki/index/) and [llama.cpp hardware discussion](https://github.com/ggerganov/llama.cpp/discussions).
