Concept #86 · Medium · extended-ai-concepts

What is CAG vs RAG?

#gen-ai #rag

Answer

CAG vs RAG

CAG (Cache-Augmented Generation) pre-loads the entire knowledge base into the model's context and caches that processed prefix so later queries can reuse it. RAG (Retrieval-Augmented Generation) retrieves only the relevant chunks at query time. The two approaches are complementary and come with different tradeoffs.

Core Difference

| | RAG | CAG |
|---|---|---|
| Approach | Retrieve relevant chunks per query | Pre-cache entire knowledge base |
| Retrieval step | Yes (vector search) | No (context already loaded) |
| Accuracy | Depends on retrieval quality | Model sees all context |
| Retrieval miss | Possible | Impossible (all context loaded) |
| Scale | Unlimited | Limited by context window |
| Latency | Higher (retrieval adds time) | Lower after cache warm-up |

RAG

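```python
def rag_answer(question: str) -> str:
    # vector_db and llm are placeholders for your vector store and LLM client;
    # search() is assumed to return a list of text chunks.
    # Dynamic retrieval at query time
    chunks = vector_db.search(question, top_k=5)
    context = "\n".join(chunks)
    return llm.invoke(f"Context: {context}\n\nQ: {question}")
```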
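
The snippet above leaves `vector_db` and `llm` undefined. As a minimal, self-contained sketch of the retrieval step only, the example below swaps the embedding model for a plain bag-of-words cosine similarity. `ToyVectorDB`, `tokenize`, `cosine`, and `build_rag_prompt` are hypothetical names introduced for illustration; a production system would use a real embedding model and vector store.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words; a crude stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorDB:
    """Hypothetical in-memory store exposing search(question, top_k)."""

    def __init__(self, chunks: list[str]) -> None:
        self.chunks = chunks
        self.vectors = [tokenize(c) for c in chunks]

    def search(self, question: str, top_k: int = 5) -> list[str]:
        query = tokenize(question)
        ranked = sorted(
            zip(self.chunks, self.vectors),
            key=lambda pair: cosine(query, pair[1]),
            reverse=True,
        )
        return [chunk for chunk, _ in ranked[:top_k]]


def build_rag_prompt(question: str, db: ToyVectorDB, top_k: int = 2) -> str:
    """Retrieve at query time, then assemble the prompt an LLM would receive."""
    context = "\n".join(db.search(question, top_k=top_k))
    return f"Context: {context}\n\nQ: {question}"


db = ToyVectorDB([
    "Refunds are issued within 14 days of purchase.",
    "The office is closed on public holidays.",
    "Support is available by email on weekdays.",
])
prompt = build_rag_prompt("How many days do refunds take?", db, top_k=1)
print(prompt)
```

Only the top-ranked chunk reaches the prompt, which is where a retrieval miss can occur: if the ranking is wrong, the model never sees the right passage.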

CAG with Anthropic Prompt Caching

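```python
from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Load the entire knowledge base once at startup
knowledge_base = Path("docs/all_company_docs.txt").read_text(encoding="utf-8")

def cag_answer(question: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": knowledge_base,
                # Marks the end of the cacheable prefix; the default TTL is
                # 5 minutes and is refreshed each time the cache is read
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": question}],
    )
    # The first content block holds the generated text
    return response.content[0].text
```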

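Prompt caching means the knowledge base is processed in full only on the first request. Subsequent queries that arrive before the cache expires reuse the cached KV state, and the cached tokens are billed at roughly 10% of the normal input-token price. The request that writes the cache costs somewhat more than an uncached request, and each cache hit resets the 5-minute expiry.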
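
One way to confirm that caching is actually taking effect is to inspect the usage block returned with each response. The sketch below assumes the response's `usage` object exposes `cache_creation_input_tokens` and `cache_read_input_tokens`, as the Anthropic Messages API documents at the time of writing; verify the field names against the current API reference. `describe_cache_usage` is a hypothetical helper, and `SimpleNamespace` objects stand in for real responses so the example runs without an API key.

```python
from types import SimpleNamespace


def describe_cache_usage(usage) -> str:
    """Classify a usage block as a cache read, a cache write, or neither.

    Assumes `usage` carries `cache_creation_input_tokens` and
    `cache_read_input_tokens`; missing or None values are treated as 0.
    """
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if read > 0:
        return f"cache hit: {read} tokens read from cache"
    if written > 0:
        return f"cache write: {written} tokens stored"
    return "no caching: the prefix may be below the minimum cacheable length"


# Stand-ins for the usage of a first (cold) and a second (warm) request
first = SimpleNamespace(cache_creation_input_tokens=50_000, cache_read_input_tokens=0)
second = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=50_000)

print(describe_cache_usage(first))
print(describe_cache_usage(second))
```

With a real client, the same helper would be called as `describe_cache_usage(response.usage)` after `client.messages.create(...)`.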

When to Choose Each

| Use RAG | Use CAG |
|---|---|
| Large knowledge base (millions of docs) | Small-to-medium KB that fits in context |
| Infrequent queries | Frequent queries (cache amortizes cost) |
| Dynamic / frequently updated content | Relatively stable content |
| Need precise retrieval | High accuracy required, no retrieval misses |
| Multi-source knowledge | Single consolidated knowledge base |
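
The rules of thumb in this table can be encoded as a small decision helper. This is a rough sketch, not a definitive policy: `Workload`, `choose_strategy`, and every threshold below are hypothetical. The 12 queries-per-hour cutoff only reflects the 5-minute cache lifetime mentioned in the CAG example (about one query per cache window), and the other defaults are placeholders to tune for a real workload.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    kb_tokens: int             # estimated size of the whole knowledge base
    context_window: int        # model context limit, in tokens
    queries_per_hour: float    # expected query rate
    kb_updates_per_day: float  # how often the content changes


def choose_strategy(
    w: Workload,
    reserve_tokens: int = 8_000,       # hypothetical room for question + answer
    min_queries_per_hour: float = 12,  # about one query per 5-minute cache window
    max_updates_per_day: float = 24,   # hypothetical "too dynamic" cutoff
) -> tuple[str, str]:
    """Return (strategy, reason) following the table's rules of thumb."""
    if w.kb_tokens + reserve_tokens > w.context_window:
        return "RAG", "knowledge base does not fit in the context window"
    if w.kb_updates_per_day > max_updates_per_day:
        return "RAG", "content changes too often for a cached prefix to stay valid"
    if w.queries_per_hour < min_queries_per_hour:
        return "RAG", "queries are too sparse to keep the cache warm"
    return "CAG", "small, stable knowledge base with steady traffic"


print(choose_strategy(Workload(50_000, 200_000, 100, 1)))
print(choose_strategy(Workload(5_000_000, 200_000, 100, 1)))
```

The checks run in order of how hard the constraint is: a knowledge base that does not fit rules out CAG outright, while sparse traffic only makes it less economical.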

Cost Comparison (Approximate)

| Scenario | RAG | CAG |
|---|---|---|
| 50K token KB, 100 queries | Embedding + small prompts × 100 | Large prompt × 1 + cached × 99 |
| Large KB, few queries | Economical | Expensive (context too large) |
| Small KB, many queries | Retrieval overhead × N | One-time cache, cheap thereafter |
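
The first row can be turned into back-of-the-envelope arithmetic. The sketch below counts input cost in "base input-token equivalents" and rests on several assumptions: cache reads are billed at about 10% of the base input price (as stated above), cache writes at about 125% (an assumed premium to check against current pricing), every query arrives before the cache expires, and the question length, chunk size, and `top_k` are illustrative. Output tokens, embedding costs, and vector-search costs are left out. All function names are hypothetical.

```python
def cag_input_cost(
    kb_tokens: int,
    question_tokens: int,
    n_queries: int,
    write_mult: float = 1.25,  # assumed cache-write premium
    read_mult: float = 0.10,   # assumed cache-read discount
) -> float:
    """Input cost of CAG with a warm cache, in base input-token equivalents."""
    if n_queries <= 0:
        return 0.0
    first = kb_tokens * write_mult + question_tokens
    later = (n_queries - 1) * (kb_tokens * read_mult + question_tokens)
    return first + later


def uncached_input_cost(kb_tokens: int, question_tokens: int, n_queries: int) -> float:
    """Input cost of sending the full knowledge base every time, without caching."""
    return float(n_queries * (kb_tokens + question_tokens))


def rag_input_cost(chunk_tokens: int, top_k: int, question_tokens: int, n_queries: int) -> float:
    """Input cost of RAG prompts only; retrieval infrastructure is not included."""
    return float(n_queries * (chunk_tokens * top_k + question_tokens))


cached = cag_input_cost(kb_tokens=50_000, question_tokens=50, n_queries=100)
uncached = uncached_input_cost(kb_tokens=50_000, question_tokens=50, n_queries=100)
rag = rag_input_cost(chunk_tokens=400, top_k=5, question_tokens=50, n_queries=100)

print(f"CAG cached:   {cached:,.0f}")
print(f"CAG uncached: {uncached:,.0f}")
print(f"RAG prompts:  {rag:,.0f}")
```

With these illustrative inputs, the cached total comes to about 11% of the uncached one. The RAG figure depends entirely on chunk size and `top_k` and excludes retrieval infrastructure, so it is a starting point for comparison rather than a verdict.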

Hybrid Approach

Many production systems combine both:

  • CAG for core knowledge always needed (company policies, product specs)
  • RAG for large supplemental document stores (support tickets, blogs)
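
A hybrid request can be sketched by placing the stable core knowledge in a cached system block and the per-query retrieved chunks in the user message, so that changing chunks do not alter the cached prefix. `build_hybrid_request` is a hypothetical helper; the request shape mirrors the CAG example above, and the retrieved chunks are assumed to come from whatever retriever the system already uses. The function only builds the payload, so it runs without an API key.

```python
def build_hybrid_request(
    core_knowledge: str,
    retrieved_chunks: list[str],
    question: str,
    model: str,
    max_tokens: int = 1024,
) -> dict:
    """Assemble keyword arguments for a messages.create call (hypothetical helper)."""
    supplemental = "\n".join(retrieved_chunks)
    return {
        "model": model,
        "max_tokens": max_tokens,
        # Stable prefix: identical on every request, so it can be cached
        "system": [
            {
                "type": "text",
                "text": core_knowledge,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Variable suffix: retrieved per query, placed after the cached prefix
        "messages": [
            {
                "role": "user",
                "content": f"Supplemental context:\n{supplemental}\n\nQ: {question}",
            }
        ],
    }


request = build_hybrid_request(
    core_knowledge="Policy: refunds are issued within 14 days.",
    retrieved_chunks=["Ticket 1: customer asked about a late refund."],
    question="Is a refund after 10 days allowed?",
    model="claude-opus-4-6",
)
print(request["messages"][0]["content"])
```

With a real client, the payload would be sent as `client.messages.create(**request)`.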

Key Insight

CAG trades context window space for retrieval accuracy. It works best when your knowledge base is small enough to fit in context AND you have enough queries to amortize the initial cache cost.