Concept #85 · Medium · extended-ai-concepts

What is RAG?

#gen-ai #rag

Answer

RAG (Retrieval-Augmented Generation) is an AI architecture that, at query time, retrieves relevant information from an external knowledge base and adds it to the prompt so the LLM's response is grounded in that material. This reduces hallucination and lets the model answer from up-to-date or proprietary data it was never trained on.

The Two Problems RAG Solves

  1. Hallucination — LLMs invent plausible-sounding but false facts
  2. Knowledge cutoff — LLMs don't know recent or proprietary information

RAG Pipeline

text
User Question
[1] Embed question → vector
[2] Search vector DB → find similar document chunks
[3] Retrieve top-k chunks
[4] Augment prompt: question + retrieved context
[5] LLM generates grounded answer
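The steps above can be sketched without any external service. This is a minimal illustration, not a production retriever: `embed` is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector DB, and `retrieve` and `build_prompt` are hypothetical helper names. Step [5] is left out, since the resulting prompt would simply be sent to whichever LLM you use.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Step 1 (toy version): bag-of-words counts stand in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Steps 2-3: score every chunk against the question and keep the top k."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Step 4: augment the question with the retrieved context."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

chunks = [
    "Claude was created by Anthropic, founded in 2021.",
    "RAG stands for Retrieval-Augmented Generation.",
    "Vector databases enable semantic similarity search.",
]
question = "What does RAG stand for?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

A real system would swap `embed` for a learned embedding model, which matches on meaning rather than on shared words.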

Complete RAG Implementation

python
from anthropic import Anthropic
import chromadb

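client = Anthropic()        # reads ANTHROPIC_API_KEY from the environment
chroma = chromadb.Client()  # in-memory (ephemeral) instance; data is lost on exit
collection = chroma.get_or_create_collection("kb")  # safe to re-run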

# Index documents once (Chroma embeds them with its default embedding function)
docs = [
    "Claude was created by Anthropic, founded in 2021.",
    "RAG stands for Retrieval-Augmented Generation.",
    "Vector databases enable semantic similarity search."
]
collection.add(documents=docs, ids=["d1", "d2", "d3"])

def rag(question: str) -> str:
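    # Retrieve: embed the question and fetch the 2 most similar chunks.
    # results['documents'] holds one list per query text, hence the [0].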
    results = collection.query(query_texts=[question], n_results=2)
    context = "\n".join(results['documents'][0])

    # Generate with context
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content":
            f"Answer using only this context:\n{context}\n\nQuestion: {question}"}]
    )
    return response.content[0].text

print(rag("Who made Claude?"))
# → something like "Claude was created by Anthropic." (exact wording varies per run)

RAG vs Pure LLM

Aspect         Pure LLM                  RAG
Knowledge      Training data only        External documents
Freshness      Fixed at the cutoff date  As current as the indexed documents
Hallucination  Higher risk               Lower (grounded), but not eliminated
Controllable   No                        Yes, you own the docs
Cost           Lower                     Higher (embedding + retrieval)
Customizable   Fine-tune needed          Swap docs anytime

Key Design Decisions

Decision         Options
Chunking         Fixed-size (512 tokens), semantic, sentence-level
Embedding model  OpenAI ada-002, sentence-transformers, Cohere
Vector store     Chroma, Pinecone, pgvector
Retrieval        Semantic, keyword (BM25), hybrid
Reranking        Cohere Rerank, cross-encoder
top-k            3-5 chunks typically
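As an illustration of the fixed-size chunking option, here is a minimal sketch. It makes two assumptions that are not in the table above: whitespace-separated words approximate tokens (a real pipeline would count with the embedding model's tokenizer), and consecutive chunks share an `overlap` so that sentences cut at a boundary still appear intact in one chunk. The name `chunk_fixed` is hypothetical.

```python
def chunk_fixed(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into chunks of at most `size` words, each sharing
    `overlap` words with the previous chunk."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()  # assumption: words approximate tokens
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_fixed(sample, size=4, overlap=1)
for piece in pieces:
    print(piece)
```

Smaller chunks tend to retrieve more precisely but carry less surrounding context, which is the trade-off that the parent-child approach tries to soften.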

Common RAG Improvements

  • Hybrid search — semantic + keyword (BM25) retrieval
  • Reranking — use a cross-encoder to reorder results
  • Query expansion — rewrite query before retrieval
  • Parent-child chunks — retrieve small chunks, return larger parent context
  • HyDE — generate a hypothetical answer first, use it to retrieve
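For hybrid search, the two result lists have to be merged somehow. One commonly used option is reciprocal rank fusion (RRF), sketched below. The source does not prescribe a fusion method, so treat RRF as one possible choice. The document IDs and both input rankings are made-up illustration data, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs.

    Each document scores 1 / (k + rank) for every list it appears in,
    so items ranked well by several retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Hypothetical outputs of a semantic retriever and a BM25 keyword retriever
semantic = ["d2", "d3", "d1"]
keyword = ["d1", "d2", "d4"]
fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)
```

RRF uses only rank positions, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales.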