Answer
CAG vs RAG
CAG (Cache-Augmented Generation) pre-loads the entire knowledge base into the model's cached context. RAG dynamically retrieves relevant chunks at query time. They're complementary approaches with different tradeoffs.
Core Difference
| | RAG | CAG |
|---|---|---|
| Approach | Retrieve relevant chunks per query | Pre-cache entire knowledge base |
| Retrieval step | Yes (vector search) | No (context already loaded) |
| Accuracy | Depends on retrieval quality | Model sees all context |
| Retrieval miss | Possible | Impossible (all context loaded) |
| Scale | Not bound by the context window | Limited by context window |
| Latency | Higher (retrieval adds time) | Lower after cache warm-up |
RAG
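```python
# `vector_db` and `llm` are placeholders for your vector store and LLM client.
def rag_answer(question: str) -> str:
    # Dynamic retrieval at query time: fetch the 5 most relevant chunks
    chunks = vector_db.search(question, top_k=5)
    context = "\n".join(chunks)
    return llm.invoke(f"Context: {context}\n\nQ: {question}")
```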
CAG with Anthropic Prompt Caching
```python
import anthropic

client = anthropic.Anthropic()

# Load entire knowledge base once
with open("docs/all_company_docs.txt", encoding="utf-8") as f:
    knowledge_base = f.read()


def cag_answer(question: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": knowledge_base,
                "cache_control": {"type": "ephemeral"},  # Cache for 5 min
            }
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```
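Prompt caching means the knowledge base is processed in full only on the first request. While the cache stays warm (the default 5-minute lifetime is refreshed on each hit), subsequent queries reuse the cached KV state at roughly 10% of the normal input-token price. The first request pays a premium over the normal price to write the cache.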
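To confirm that later requests actually hit the cache, you can inspect the `usage` block on each response. The sketch below assumes the usage object exposes `cache_creation_input_tokens` and `cache_read_input_tokens`, as the Messages API does when prompt caching is active. `describe_cache_usage` is a hypothetical helper name, and the `SimpleNamespace` objects are stand-ins for real `response.usage` values.

```python
from types import SimpleNamespace


def describe_cache_usage(usage) -> str:
    """Classify a usage block as a cache hit, a cache write, or neither.

    `usage` is expected to look like `response.usage`.
    Missing or None fields are treated as 0.
    """
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if read > 0:
        return f"cache hit: {read} tokens read from cache"
    if written > 0:
        return f"cache write: {written} tokens stored"
    return "no caching: prompt was processed normally"


# Stand-in objects for illustration; real values come from response.usage.
first_call = SimpleNamespace(
    input_tokens=12, cache_creation_input_tokens=50_000, cache_read_input_tokens=0
)
later_call = SimpleNamespace(
    input_tokens=12, cache_creation_input_tokens=0, cache_read_input_tokens=50_000
)

print(describe_cache_usage(first_call))
print(describe_cache_usage(later_call))
```

If every call reports a cache write and never a hit, the prefix is probably changing between requests or the queries are arriving after the cache has expired.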
When to Choose Each
| Use RAG | Use CAG |
|---|---|
| Large knowledge base (millions of docs) | Small-to-medium KB fits in context |
| Infrequent queries | Frequent queries (cache amortizes cost) |
| Dynamic / frequently updated content | Relatively stable content |
| Only a few passages are relevant per query | Answers may need any part of the KB; retrieval misses are unacceptable |
| Multi-source knowledge | Single consolidated knowledge base |
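The table can be read as a simple decision rule. The following is a rough sketch of that rule, not a definitive policy: `suggest_strategy` is a hypothetical function, and both default thresholds (half the context window as headroom, about one query per 5-minute cache lifetime) are assumptions to tune for your own workload.

```python
def suggest_strategy(
    kb_tokens: int,
    context_window_tokens: int,
    queries_per_hour: float,
    content_changes_often: bool,
    headroom_fraction: float = 0.5,      # assumption: leave half the window free
    min_queries_per_hour: float = 12.0,  # assumption: ~1 query per 5-minute cache lifetime
) -> str:
    """Return "CAG" or "RAG" using the rules of thumb from the table."""
    fits_in_context = kb_tokens <= headroom_fraction * context_window_tokens
    if not fits_in_context:
        return "RAG"  # KB too large to pre-load
    if content_changes_often:
        return "RAG"  # each change would invalidate the cached prefix
    if queries_per_hour < min_queries_per_hour:
        return "RAG"  # cache would expire between queries, nothing is amortized
    return "CAG"


print(suggest_strategy(50_000, 200_000, queries_per_hour=60, content_changes_often=False))
print(suggest_strategy(5_000_000, 200_000, queries_per_hour=60, content_changes_often=False))
```

A real decision would also weigh latency targets and how many distinct sources feed the knowledge base, which this sketch ignores.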
Cost Comparison (Approximate)
| Scenario | RAG | CAG |
|---|---|---|
| 50K token KB, 100 queries | Embedding + small prompts × 100 | Large prompt × 1 + cached × 99 |
| Large KB, few queries | Economical | Expensive (context too large) |
| Small KB, many queries | Retrieval overhead × N | One-time cache, cheap thereafter |
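The first row can be made concrete with some back-of-the-envelope arithmetic. This sketch counts input cost in "base-token equivalents" (one unit is one uncached input token). The multipliers are assumptions: 1.25× for a cache write and 0.10× for a cache read. The RAG figures (5 chunks of 500 tokens each) are hypothetical, and embedding cost, question tokens, and output tokens are ignored.

```python
def cag_input_cost(kb_tokens: int, n_queries: int,
                   write_multiplier: float = 1.25,
                   read_multiplier: float = 0.10) -> float:
    """One cache write, then cache reads, assuming every later query hits the cache."""
    if n_queries <= 0:
        return 0.0
    first = kb_tokens * write_multiplier
    rest = kb_tokens * read_multiplier * (n_queries - 1)
    return first + rest


def uncached_input_cost(kb_tokens: int, n_queries: int) -> float:
    """Sending the full knowledge base on every query without caching."""
    return float(kb_tokens * n_queries)


def rag_input_cost(chunk_tokens: int, top_k: int, n_queries: int) -> float:
    """Only the retrieved chunks are sent on each query."""
    return float(chunk_tokens * top_k * n_queries)


kb_tokens, n_queries = 50_000, 100
print("CAG (cached):  ", cag_input_cost(kb_tokens, n_queries))
print("Full, uncached:", uncached_input_cost(kb_tokens, n_queries))
print("RAG (5 x 500): ", rag_input_cost(500, 5, n_queries))
```

Under these assumptions, caching cuts the full-context cost from 5,000,000 to about 557,500 token equivalents, while the RAG prompts total 250,000. The comparison shifts if the cache expires between queries or if retrieval returns larger chunks.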
Hybrid Approach
Many production systems combine both:
- CAG for core knowledge always needed (company policies, product specs)
- RAG for large supplemental document stores (support tickets, blogs)
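One way to combine the two is to keep the core knowledge in a cached system block and place the retrieved chunks in the user message, so the cached prefix stays identical across queries. This is a minimal sketch: `build_hybrid_request` is a hypothetical helper that only assembles the request arguments, and the retrieval step that produces `retrieved_chunks` is assumed to exist elsewhere.

```python
def build_hybrid_request(
    core_knowledge: str,
    retrieved_chunks: list[str],
    question: str,
    model: str = "claude-opus-4-6",
    max_tokens: int = 1024,
) -> dict:
    """Assemble keyword arguments for a Messages API call.

    Core knowledge goes in a cached system block (CAG); per-query
    chunks go in the user turn (RAG) so they do not alter the cached prefix.
    """
    supplemental = "\n\n".join(retrieved_chunks)
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": [
            {
                "type": "text",
                "text": core_knowledge,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            {
                "role": "user",
                "content": f"Supplemental context:\n{supplemental}\n\nQ: {question}",
            }
        ],
    }


request = build_hybrid_request(
    core_knowledge="Refund policy: 30 days.",
    retrieved_chunks=["Ticket 1: refund issued.", "Ticket 2: refund denied."],
    question="Can I get a refund after 45 days?",
)
print(request["messages"][0]["content"])
```

The result can be passed to the client as `client.messages.create(**request)`.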
Key Insight
CAG trades context window space for retrieval accuracy. It works best when your knowledge base is small enough to fit in context AND you have enough queries to amortize the initial cache cost.