Concept #207 · Medium · production-mlops · important

Does the speed of running local AI (LLM) models on a GPU depend more on the type of GPU used, or on how much VRAM the GPU has?

#gen-ai #gpu #vram #local-llm #hardware #inference #memory-bandwidth

Answer

GPU Speed vs VRAM for Local LLM Inference

The Short Answer

Both GPU architecture and VRAM matter, but for different reasons. VRAM is the gatekeeper — it determines which models you can run at all. GPU architecture determines speed — once the model fits, memory bandwidth, Tensor Cores, and cache hierarchy drive tokens/second.

If a model doesn't fit in VRAM, inference either fails with an out-of-memory error or falls back to CPU/RAM (100-1000x slower). If it fits, a faster GPU with the same VRAM can produce 2-3x more tokens/second.

VRAM: What Models Can You Run?

| VRAM Role | Approximate Cost |
| --- | --- |
| Model weights (FP16) | ~2 GB per billion parameters |
| Model weights (Q4_K_M) | ~0.5 GB per billion parameters |
| KV cache | ~0.5-1.5 GB per 8K tokens (model-dependent) |
| Overhead (activations, buffers) | ~1-2 GB |

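python
# Estimate VRAM needed for local LLM inference (rough rule of thumb)
def estimate_vram(model_params_B, quant_bits, context_len, batch_size=1,
                  kv_mb_per_token=0.125):
    """Return an approximate VRAM requirement in GB."""
    # Weights: bits per parameter / 8 gives GB per billion parameters
    weight_gb = model_params_B * (quant_bits / 8)
    # KV cache: ~0.125 MB per token matches ~1 GB per 8K tokens for an
    # 8B model with grouped-query attention; larger models need more
    kv_cache_gb = context_len * batch_size * kv_mb_per_token / 1024
    overhead_gb = 2.0  # activations, CUDA context, buffers
    return weight_gb + kv_cache_gb + overhead_gb

# Llama 3.1 8B at Q4 with 8K context
print(f"8B Q4 8K: {estimate_vram(8, 4, 8192):.1f} GB")  # ~7 GB
print(f"8B FP16 8K: {estimate_vram(8, 16, 8192):.1f} GB")  # ~19 GB
print(f"70B Q4 4K: {estimate_vram(70, 4, 4096):.1f} GB")  # ~37.5 GB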
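
| GPU | VRAM | What Fits (Q4 quantization) |
| --- | --- | --- |
| RTX 3060 | 12 GB | 7-8B params (Llama 3.1 8B, Mistral 7B) |
| RTX 4070 | 12 GB | 7-8B params + moderate context |
| RTX 4080 | 16 GB | 13B params or 7B with long context |
| RTX 4090 | 24 GB | 34B params (Mixtral 8x7B needs tighter quantization or partial CPU offload) |
| RTX 6000 Ada | 48 GB | 70B params (Llama 3 70B Q4) |
| A100 | 40/80 GB | 70B Q4 (tight on 40 GB); 70B Q8 on 80 GB |
| Mac M2 Ultra | 192 GB (unified) | 70B FP16, 180B Q4 (slow but fits) |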
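
The KV cache figure in the cost table can be derived from a model's shape rather than guessed. The sketch below is a rough illustration, not a definitive calculator: it assumes an 8B-class model with 32 layers, 8 KV heads, and a head dimension of 128 (the function names and the fixed 2 GB overhead are illustrative choices), and it treats 1 GB as 10^9 bytes.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def total_vram_gb(params_b, quant_bits, context_len, kv_bytes_per_token,
                  overhead_gb=2.0):
    """Weights + KV cache + fixed overhead, in GB (batch size 1)."""
    weights_gb = params_b * quant_bits / 8  # GB per billion parameters
    kv_gb = context_len * kv_bytes_per_token / 1e9
    return weights_gb + kv_gb + overhead_gb


# Assumed shape for an 8B-class model with grouped-query attention:
# 32 layers, 8 KV heads, head dimension 128, FP16 cache values.
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
need = total_vram_gb(8, 4, 8192, per_token)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"8B Q4 at 8K context: {need:.1f} GB")
for vram in (8, 12, 16, 24):
    verdict = "fits" if need <= vram else "does not fit"
    print(f"{vram} GB card: {verdict}")
```

Under these assumptions the cache costs about 1 GB per 8K tokens, which sits inside the 0.5-1.5 GB range quoted in the cost table. Models with more layers or more KV heads scale the per-token cost proportionally.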

GPU Architecture: What Determines Speed

For a model that fits in VRAM, these factors determine tokens/second:

| Architectural Factor | What It Controls | Impact on Speed |
| --- | --- | --- |
| Memory Bandwidth | How fast data moves VRAM → GPU | Dominant factor: most inference is memory-bound |
| Tensor Cores | Matrix multiplication throughput | Affects prompt processing (prefill) and batched inference |
| L2 Cache Size | Reduces expensive VRAM round-trips | Lower latency per token |
| Clock Speed | Overall compute rate | Modest effect (5-10%) |
| Arch Generation | New instruction support (FP8, FP4) | Efficiency gains on newer architectures |
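
One way to see why bandwidth dominates single-user decoding while Tensor Cores matter for batched work is a simple roofline estimate. The sketch below is only a model under stated assumptions: roughly 2 FLOPs per parameter per generated token, weights read once per decode step regardless of batch size, and a hypothetical 100 TFLOPS of usable compute (that number is illustrative, not a measured spec). KV cache traffic and dequantization cost are ignored.

```python
def decode_step_time_s(weight_gb, params_b, batch_size,
                       bandwidth_gb_s, compute_tflops):
    """Lower-bound time for one decode step under a simple roofline model."""
    # Every weight is read once per step, regardless of batch size.
    memory_time = weight_gb / bandwidth_gb_s
    # Roughly 2 FLOPs (multiply + add) per parameter per generated token.
    flops = 2 * params_b * 1e9 * batch_size
    compute_time = flops / (compute_tflops * 1e12)
    return max(memory_time, compute_time), memory_time, compute_time


# Illustrative inputs: 8B model at ~5 GB, 1008 GB/s of bandwidth, and a
# HYPOTHETICAL 100 TFLOPS of usable compute.
for batch in (1, 8, 64, 512):
    total, mem_t, comp_t = decode_step_time_s(5, 8, batch, 1008, 100)
    bound = "memory" if mem_t >= comp_t else "compute"
    print(f"batch={batch:>3}: {batch / total:8.0f} tok/s total, {bound}-bound")
```

With these inputs the step stays memory-bound at small batch sizes and only becomes compute-bound once dozens of sequences share each weight read, which is the regime where matrix throughput starts to matter.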

The Memory Bandwidth Formula

Memory bandwidth is the primary bottleneck for single-user local inference:

$$\text{Tokens/second} \approx \frac{\text{Memory Bandwidth (GB/s)}}{\text{Model Size in VRAM (GB)} \times 1.2}$$

The factor of 1.2 is a rough allowance for KV cache reads and writes during autoregressive generation; the real overhead grows with context length.

python
# Token speed estimation (approximate ceiling for single-user decoding)
def estimate_tps(bandwidth_gb_s, model_gb, overhead=1.2):
    """Tokens/second if each token requires reading all weights once."""
    return bandwidth_gb_s / (model_gb * overhead)

# RTX 4090 (1008 GB/s) with Llama 8B Q4 (~5 GB)
print(f"4090: {estimate_tps(1008, 5):.0f} tok/s")   # ~168 tok/s

# RTX 3070 (448 GB/s) with Llama 8B Q4 (~5 GB)
print(f"3070: {estimate_tps(448, 5):.0f} tok/s")    #  ~75 tok/s

# RTX 4070 (504 GB/s) with Llama 8B Q4 (~5 GB)
print(f"4070: {estimate_tps(504, 5):.0f} tok/s")    #  ~84 tok/s

Real-World Performance Comparison

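| GPU | VRAM | Bandwidth | Tensor Cores | L2 Cache | Llama 3.1 8B Q4 (tok/s) | Generation |
| --- | --- | --- | --- | --- | --- | --- |
| RTX 3070 | 8 GB | 448 GB/s | 184 (3rd gen) | 4 MB | ~75 | Ampere |
| RTX 3080 | 10 GB | 760 GB/s | 272 (3rd gen) | 5 MB | ~110 | Ampere |
| RTX 4070 | 12 GB | 504 GB/s | 184 (4th gen) | 36 MB | ~85 | Ada |
| RTX 4080 | 16 GB | 717 GB/s | 304 (4th gen) | 64 MB | ~115 | Ada |
| RTX 4090 | 24 GB | 1008 GB/s | 512 (4th gen) | 72 MB | ~165 | Ada |
| Mac M2 Ultra | 192 GB | 800 GB/s | N/A (ANE) | | ~25 | Apple |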

Key insight: The RTX 4070 has more VRAM (12 GB) than the RTX 3080 (10 GB), but the RTX 3080 is faster (110 vs 85 tok/s) because of its higher memory bandwidth (760 vs 504 GB/s). VRAM doesn't make you fast — it makes you capable.
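
As a sanity check, the bandwidth formula can be compared against the NVIDIA rows of the comparison table. This is a minimal sketch: the bandwidth and tok/s values are copied from the table, the ~5 GB model size is the same assumption used in the estimates earlier, and the `TABLE` dictionary and `errors` name are just illustrative.

```python
# (bandwidth in GB/s, tok/s from the comparison table) for Llama 3.1 8B Q4
TABLE = {
    "RTX 3070": (448, 75),
    "RTX 3080": (760, 110),
    "RTX 4070": (504, 85),
    "RTX 4080": (717, 115),
    "RTX 4090": (1008, 165),
}
MODEL_GB = 5  # assumed size of the Q4 weights in VRAM


def estimate_tps(bandwidth_gb_s, model_gb, overhead=1.2):
    """Bandwidth-bound estimate: all weights are read once per token."""
    return bandwidth_gb_s / (model_gb * overhead)


errors = {}
for gpu, (bandwidth, table_tps) in TABLE.items():
    est = estimate_tps(bandwidth, MODEL_GB)
    errors[gpu] = (est - table_tps) / table_tps
    print(f"{gpu}: estimate {est:5.0f} tok/s, table ~{table_tps}, "
          f"diff {errors[gpu]:+.0%}")
```

The estimate tracks the table closely for most cards, with the RTX 3080 as the largest outlier, so the formula is best read as an approximate ceiling rather than a prediction.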

When VRAM Becomes the Speed Bottleneck

VRAM indirectly affects speed in these scenarios:

| Scenario | Problem | Impact |
| --- | --- | --- |
| Model doesn't fit | Falls back to CPU/RAM or fails | 100-1000x slower |
| VRAM is nearly full | Fragmentation causes extra allocations | 10-30% slower |
| GQA/Long context | KV cache competes with weights | Limits max context length |
| Multi-GPU split | Layer distribution overhead | 5-15% slower than single GPU |
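
The long-context row can be made concrete by asking how many tokens of KV cache fit once the weights are loaded. The sketch below assumes ~5 GB of Q4 weights for an 8B model, 128 KiB of FP16 KV cache per token (an assumed value for an 8B-class model with grouped-query attention), a fixed 2 GB overhead, and 1 GB = 10^9 bytes; `max_context_tokens` is an illustrative helper, not part of any library.

```python
def max_context_tokens(vram_gb, weight_gb, kv_kib_per_token, overhead_gb=2.0):
    """Tokens of KV cache that fit after weights and overhead (batch size 1)."""
    free_gb = vram_gb - weight_gb - overhead_gb
    if free_gb <= 0:
        return 0  # weights plus overhead already exceed VRAM
    return int(free_gb * 1e9 // (kv_kib_per_token * 1024))


# Assumptions: ~5 GB of weights, 128 KiB of KV cache per token.
for vram in (6, 8, 12, 24):
    tokens = max_context_tokens(vram, 5, 128)
    print(f"{vram:>2} GB VRAM -> up to {tokens:,} tokens of context")
```

Under these assumptions an 8 GB card leaves room for only a few thousand tokens, while a 12 GB card has headroom for tens of thousands, which is why extra VRAM shows up as usable context rather than extra speed.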

Practical Recommendations

For 7-8B models (Mistral, Llama 3.1, Qwen 2.5):

  • Minimum: 8 GB VRAM (Q4 quantization, short context)
  • Comfortable: 12-16 GB VRAM (Q4/Q5 with 8K+ context)
  • Ideal: 16-24 GB VRAM (Q6/Q8, long context, prompt caching)

For 13-34B models:

  • Minimum: 16 GB VRAM (Q4, short context)
  • Comfortable: 24 GB VRAM (Q4, 4K+ context)
  • Ideal: 48+ GB VRAM (Q6, full context)

For 70B+ models:

  • Minimum: 48 GB VRAM (Q4, ~2K context) or dual 24 GB GPUs
  • Comfortable: 80 GB VRAM (Q4, 4K+ context)
  • Ideal: A100/H100 (80 GB) or Mac Studio (192 GB unified)
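
The tiers above can be captured as a small lookup. This is a sketch that simply transcribes the recommendations; the `TIERS` layout and `recommend_vram` name are illustrative, and it assumes that sizes falling between the listed ranges (for example 9-12B or 35-69B) round up to the next tier.

```python
# VRAM tiers in GB, transcribed from the recommendations above.
# Each entry: (largest model size in billions, minimum, comfortable, ideal)
TIERS = [
    (8, "8", "12-16", "16-24"),
    (34, "16", "24", "48+"),
    (float("inf"), "48", "80", "80 (A100/H100) or 192 unified"),
]


def recommend_vram(params_b):
    """Return (minimum, comfortable, ideal) VRAM strings for a model size."""
    for max_params_b, minimum, comfortable, ideal in TIERS:
        if params_b <= max_params_b:
            return minimum, comfortable, ideal
    raise ValueError("unreachable: the last tier has no upper bound")


for size in (7, 13, 70):
    minimum, comfortable, ideal = recommend_vram(size)
    print(f"{size}B: min {minimum} GB, comfortable {comfortable} GB, "
          f"ideal {ideal} GB")
```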

Summary

| Priority | Factor | Why It Matters |
| --- | --- | --- |
| 1st | VRAM Capacity | Determines if you can run the model at all |
| 2nd | Memory Bandwidth | Determines how fast (dominant bottleneck for inference) |
| 3rd | Architecture Generation | Efficiency, new precision formats, cache improvements |
| 4th | Compute (CUDA/Tensor Cores) | Matters for prompt prefill, batched inference, and training, not single-token decode |

The rule of thumb: Buy the most VRAM your budget allows. Among GPUs with similar VRAM, pick the one with higher memory bandwidth. Architecture generation is the tiebreaker when VRAM and bandwidth are similar.
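
The rule of thumb maps naturally onto a sort key. The sketch below is one possible encoding, not a definitive buying tool: the `Gpu` class and `pick_gpu` function are illustrative, the generation numbers are a made-up ordinal (Ampere = 1, Ada = 2), and budget filtering is left out because the article gives no prices. It also treats only exactly equal VRAM as a tie, which is stricter than "similar VRAM".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    vram_gb: int
    bandwidth_gb_s: int
    generation: int  # hypothetical ordinal: higher means newer architecture


def pick_gpu(candidates):
    """Prefer VRAM, then memory bandwidth, then architecture generation."""
    return max(candidates,
               key=lambda g: (g.vram_gb, g.bandwidth_gb_s, g.generation))


# VRAM and bandwidth figures come from the comparison table; the generation
# numbers are an illustrative encoding, not an official designation.
shortlist = [
    Gpu("RTX 3080", 10, 760, 1),
    Gpu("RTX 4070", 12, 504, 2),
]
print(pick_gpu(shortlist).name)
```

Because tuples compare element by element, bandwidth only comes into play when VRAM is equal, and generation only when both VRAM and bandwidth match.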

Learn more at r/LocalLLaMA GPU Guide and llama.cpp hardware discussion.