What is CAG vs RAG?

Accepted Answer

## CAG vs RAG

**CAG (Cache-Augmented Generation)** pre-loads the entire knowledge base into the model's context and caches it so that later queries reuse it. **RAG (Retrieval-Augmented Generation)** retrieves only the relevant chunks at query time. The two are complementary approaches with different tradeoffs.

### Core Difference

| | RAG | CAG |
|--|-----|-----|
| **Approach** | Retrieve relevant chunks per query | Pre-cache entire knowledge base |
| **Retrieval step** | Yes — vector search | No — context already loaded |
| **Accuracy** | Depends on retrieval quality | Model sees all context |
| **Retrieval miss** | Possible | No retrieval step to miss, though the model can still overlook details in a long context |
| **Scale** | Grows with the index, not the context window | Limited by context window |
| **Latency** | Higher (retrieval adds time) | Lower after cache warm-up |

### RAG

```python
def rag_answer(question: str) -> str:
    # Dynamic retrieval at query time.
    # `vector_db` and `llm` are placeholders for your vector store and model client.
    chunks = vector_db.search(question, top_k=5)
    context = "\n".join(chunks)
    return llm.invoke(f"Context: {context}\n\nQ: {question}")
```
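The snippet above leaves `vector_db` undefined. As a rough illustration of what the retrieval step does, here is a minimal in-memory sketch that ranks chunks by cosine similarity over bag-of-words counts. `embed`, `cosine`, `TinyVectorStore`, and the sample chunks are hypothetical stand-ins invented for this example; a real system would typically use a learned embedding model and a vector database.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (an assumption, not a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyVectorStore:
    """Hypothetical in-memory stand-in for `vector_db`."""

    def __init__(self, chunks: list[str]):
        self.chunks = chunks
        self.vectors = [embed(c) for c in chunks]

    def search(self, query: str, top_k: int = 5) -> list[str]:
        # Score every chunk against the query and keep the best top_k.
        q = embed(query)
        ranked = sorted(
            zip(self.chunks, self.vectors),
            key=lambda pair: cosine(q, pair[1]),
            reverse=True,
        )
        return [chunk for chunk, _ in ranked[:top_k]]


# Made-up sample chunks for illustration only.
vector_db = TinyVectorStore([
    "Refunds are processed within 14 days of the return request.",
    "The office is closed on public holidays.",
    "Shipping to Europe takes 5 to 7 business days.",
])
top = vector_db.search("How long do refunds take?", top_k=1)
```

Only the refund chunk shares a word with the query here, so it ranks first. The same mechanism is also where RAG can miss: a relevant chunk that shares no signal with the query never reaches the model.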

### CAG with Anthropic Prompt Caching

```python
import anthropic

client = anthropic.Anthropic()

# Load entire knowledge base once (the context manager closes the file handle)
with open("docs/all_company_docs.txt", encoding="utf-8") as f:
    knowledge_base = f.read()

def cag_answer(question: str) -> str:
    return client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": knowledge_base,
                # Default cache lifetime is 5 min, refreshed each time the cache is read
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[{"role": "user", "content": question}]
    ).content[0].text
```

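**Prompt caching** means the knowledge base is processed in full only on the first request, which is billed as a cache write at a premium over normal input tokens (about 25% for the 5-minute cache). Subsequent queries that arrive before the cache expires read the cached prefix at roughly 10% of the normal input-token price. Once the cache expires, the next request pays the write cost again.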
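One way to check that caching is actually taking effect is to look at the `usage` object on each response, which reports `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens`. The sketch below summarizes those fields. The helper name `summarize_cache_usage`, the pricing multipliers (1.25 for writes, 0.10 for reads), and the sample numbers are assumptions for illustration; check current pricing before relying on them.

```python
from types import SimpleNamespace


def summarize_cache_usage(usage, write_mult: float = 1.25, read_mult: float = 0.10) -> dict:
    """Summarize cache behavior from a Messages API `usage` object.

    The multipliers are assumed pricing ratios relative to base input tokens.
    """
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    uncached = getattr(usage, "input_tokens", 0) or 0
    return {
        "cache_hit": read > 0,
        "cache_written": written > 0,
        # Input cost expressed in units of one base-priced input token.
        "input_cost_in_base_tokens": uncached + written * write_mult + read * read_mult,
    }


# Stand-in usage objects shaped like two consecutive responses (made-up numbers).
first = SimpleNamespace(
    input_tokens=20, cache_creation_input_tokens=50_000, cache_read_input_tokens=0
)
second = SimpleNamespace(
    input_tokens=20, cache_creation_input_tokens=0, cache_read_input_tokens=50_000
)

first_summary = summarize_cache_usage(first)
second_summary = summarize_cache_usage(second)
```

In real use you would pass `response.usage` from `client.messages.create(...)` instead of the stand-in objects. If `cache_read_input_tokens` stays at zero across repeated calls, the prefix is probably changing between requests or the cache is expiring before the next query arrives.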

### When to Choose Each

| Use RAG | Use CAG |
|---------|---------|
| Large knowledge base (millions of docs) | Small-to-medium KB fits in context |
| Infrequent queries | Frequent queries (cache amortizes cost) |
| Dynamic / frequently updated content | Relatively stable content |
| Need precise retrieval | High accuracy required, no retrieval misses |
| Multi-source knowledge | Single consolidated knowledge base |
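The table can be read as a rough decision rule. The sketch below encodes it as a heuristic. The `Workload` fields, the function name `choose_strategy`, and every threshold (a 200K-token context window, 80% headroom, at least two queries per cache lifetime, at most 24 updates per day) are assumptions chosen for illustration, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    kb_tokens: int                # size of the knowledge base in tokens
    queries_per_cache_ttl: float  # expected queries within one cache lifetime
    updates_per_day: float        # how often the knowledge base changes


def choose_strategy(
    w: Workload,
    context_window: int = 200_000,       # assumed model context size
    headroom: float = 0.8,               # leave room for the question and answer
    min_queries_per_ttl: float = 2.0,    # below this the cache write is not amortized
    max_updates_per_day: float = 24.0,   # above this the cached prefix keeps changing
) -> str:
    """Return 'CAG' or 'RAG' using illustrative thresholds (all assumptions)."""
    if w.kb_tokens > context_window * headroom:
        return "RAG"  # knowledge base does not fit in context
    if w.queries_per_cache_ttl < min_queries_per_ttl:
        return "RAG"  # too few queries to amortize the cache write
    if w.updates_per_day > max_updates_per_day:
        return "RAG"  # frequent edits keep invalidating the cached prefix
    return "CAG"


small_busy = choose_strategy(Workload(kb_tokens=50_000, queries_per_cache_ttl=10, updates_per_day=1))
huge = choose_strategy(Workload(kb_tokens=5_000_000, queries_per_cache_ttl=10, updates_per_day=1))
```

A real decision would also weigh latency targets and how much answer quality degrades as the context gets longer, which this sketch ignores.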

### Cost Comparison (Approximate)

| Scenario | RAG | CAG |
|---------|-----|-----|
| 50K token KB, 100 queries | Embedding + small prompts × 100 | Large prompt × 1 + cached × 99 |
| Large KB, few queries | Economical | Expensive (context too large) |
| Small KB, many queries | Retrieval overhead × N | One-time cache, cheap thereafter |
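To put rough numbers on these rows, the sketch below compares input-token cost only. It assumes RAG sends 5 retrieved chunks of 500 tokens per query, that cache writes cost 1.25× and cache reads 0.10× the base input price, and that every query lands before the cache expires. All of these figures and both function names are illustrative assumptions; embedding, vector-search, and output-token costs are left out.

```python
def rag_input_tokens(n_queries: int, top_k: int = 5, chunk_tokens: int = 500) -> float:
    """Prompt tokens RAG sends across all queries (retrieved chunks only)."""
    return float(n_queries * top_k * chunk_tokens)


def cag_input_tokens(
    kb_tokens: int,
    n_queries: int,
    write_mult: float = 1.25,
    read_mult: float = 0.10,
) -> float:
    """Base-price-equivalent tokens for CAG: one cache write, then cache reads."""
    if n_queries <= 0:
        return 0.0
    return kb_tokens * (write_mult + read_mult * (n_queries - 1))


for kb_tokens in (10_000, 50_000):
    rag = rag_input_tokens(100)
    cag = cag_input_tokens(kb_tokens, 100)
    print(f"KB={kb_tokens:>6}: RAG={rag:,.0f}  CAG={cag:,.0f}")
```

Under these assumptions, 100 queries cost about 250,000 token-equivalents for RAG, about 111,500 for CAG over a 10K-token knowledge base, and about 557,500 over a 50K-token one. Per query, a cache read is cheaper than a RAG prompt only while 10% of the knowledge base is smaller than the retrieved context, which here means below roughly 25K tokens. Above that size, CAG is buying the absence of retrieval misses rather than a lower bill.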

### Hybrid Approach

Many production systems combine both:
- **CAG** for core knowledge always needed (company policies, product specs)
- **RAG** for large supplemental document stores (support tickets, blogs)
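One possible shape for such a hybrid request is sketched below: the stable core knowledge sits in a cached system block, and the per-query retrieved chunks go into the user message so they can change without invalidating the cached prefix. `build_hybrid_request` and the sample strings are hypothetical; the function only builds the keyword arguments and makes no network call.

```python
def build_hybrid_request(
    core_knowledge: str,
    retrieved_chunks: list[str],
    question: str,
    model: str = "claude-opus-4-6",
    max_tokens: int = 1024,
) -> dict:
    """Build keyword arguments for `client.messages.create` (no network call here)."""
    supplemental = "\n\n".join(retrieved_chunks) or "(no supplemental documents found)"
    return {
        "model": model,
        "max_tokens": max_tokens,
        # CAG part: stable core knowledge comes first and is marked cacheable.
        "system": [
            {
                "type": "text",
                "text": core_knowledge,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # RAG part: per-query chunks come after the cached prefix.
        "messages": [
            {
                "role": "user",
                "content": f"Supplemental context:\n{supplemental}\n\nQ: {question}",
            }
        ],
    }


# Made-up sample inputs; in practice the chunks would come from a retriever.
request = build_hybrid_request(
    core_knowledge="Refund policy: 30 days with receipt.",
    retrieved_chunks=["Ticket 1: customer asked about a late refund."],
    question="Can I get a refund after 3 weeks?",
)
```

The result can be passed as `client.messages.create(**request)`. Note that very short prefixes may fall below the provider's minimum cacheable length, so the tiny sample above would likely not be cached in practice.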

### Key Insight

> CAG trades context window space for retrieval accuracy. It works best when your knowledge base is small enough to fit in context AND you have enough queries to amortize the initial cache cost.
