Answer
What is RAG?
RAG (Retrieval-Augmented Generation) is an AI architecture that retrieves relevant information from an external knowledge base and uses it to ground the LLM's response — reducing hallucination and enabling up-to-date answers.
The Two Problems RAG Solves
- Hallucination — LLMs invent plausible-sounding but false facts
- Knowledge cutoff — LLMs don't know recent or proprietary information
RAG Pipeline
```text
User Question
    ↓
[1] Embed question → vector
    ↓
[2] Search vector DB → find similar document chunks
    ↓
[3] Retrieve top-k chunks
    ↓
[4] Augment prompt: question + retrieved context
    ↓
[5] LLM generates grounded answer
```
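Steps 1 to 4 of this pipeline can be sketched without any external services. This is a toy illustration, not a production approach: the `embed` function below is a hypothetical stand-in that uses plain word counts instead of a learned embedding model, and the names `retrieve` and `build_prompt` are made up for this example.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding' (assumption): lowercase word counts, not a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Steps 1-3: embed the question, score every chunk, keep the top k."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Step 4: augment the question with the retrieved context."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

chunks = [
    "Claude was created by Anthropic, founded in 2021.",
    "RAG stands for Retrieval-Augmented Generation.",
    "Vector databases enable semantic similarity search.",
]
question = "What does RAG stand for?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

Step 5 would send `prompt` to an LLM. A vector database does the same scoring as `retrieve`, but over dense vectors and with an index so it does not have to scan every chunk.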
Complete RAG Implementation
```python
from anthropic import Anthropic
import chromadb

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
chroma = chromadb.Client()  # in-memory Chroma instance
collection = chroma.create_collection("kb")

# Index documents once
docs = [
    "Claude was created by Anthropic, founded in 2021.",
    "RAG stands for Retrieval-Augmented Generation.",
    "Vector databases enable semantic similarity search.",
]
collection.add(documents=docs, ids=["d1", "d2", "d3"])

def rag(question: str) -> str:
    # Retrieve the 2 most similar chunks
    results = collection.query(query_texts=[question], n_results=2)
    context = "\n".join(results["documents"][0])

    # Generate with context
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}"}],
    )
    return response.content[0].text

print(rag("Who made Claude?"))
# Example output (exact wording varies by model):
# "Claude was created by Anthropic, founded in 2021."
```
RAG vs Pure LLM
| Aspect | Pure LLM | RAG |
|---|---|---|
| Knowledge | Training data only | Training data plus external documents |
| Freshness | Cutoff date | As current as the indexed documents |
| Hallucination | High risk | Lower (grounded) |
| Controllable | No | Yes — you own the docs |
| Cost | Lower | Higher (embedding + retrieval) |
| Customizable | Fine-tune needed | Swap docs anytime |
Key Design Decisions
| Decision | Options |
|---|---|
| Chunking | Fixed-size (512 tokens), semantic, sentence-level |
| Embedding model | OpenAI text-embedding-ada-002, sentence-transformers, Cohere |
| Vector store | Chroma, Pinecone, pgvector |
| Retrieval | Semantic, keyword (BM25), hybrid |
| Reranking | Cohere Rerank, cross-encoder |
| top-k | 3-5 chunks typically |
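As an illustration of the fixed-size chunking option in the table, here is a minimal sketch. It makes two assumptions that are not in the table: whitespace-separated words stand in for model tokens (a real pipeline would count tokens with the embedding model's tokenizer), and consecutive chunks share an `overlap` so that a sentence cut at a boundary still appears whole in one chunk. The function name `chunk_fixed` is hypothetical.

```python
def chunk_fixed(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into windows of `size` whitespace tokens that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    tokens = text.split()  # assumption: words approximate model tokens
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

words = " ".join(f"w{i}" for i in range(10))
chunks = chunk_fixed(words, size=4, overlap=1)
print(chunks)
```

Smaller chunks tend to match queries more precisely but carry less surrounding context, which is one reason chunk size and top-k are usually tuned together.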
Common RAG Improvements
- Hybrid search — semantic + keyword (BM25) retrieval
- Reranking — use a cross-encoder to reorder results
- Query expansion — rewrite query before retrieval
- Parent-child chunks — retrieve small chunks, return larger parent context
- HyDE — generate a hypothetical answer first, use it to retrieve
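Two of these improvements are easy to sketch in isolation. The list above does not say how hybrid results should be merged; the example below uses reciprocal rank fusion (RRF), one common choice, with the conventional constant `k = 60`. It also shows parent-child expansion as a plain dictionary lookup. All IDs, the `parent_of` mapping, and both function names are hypothetical.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; each list adds 1 / (k + rank) to an ID's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

def expand_to_parents(child_ids: list[str], parent_of: dict[str, str]) -> list[str]:
    """Map retrieved child chunks to their parents, deduplicated in rank order."""
    parents: list[str] = []
    for cid in child_ids:
        parent = parent_of[cid]
        if parent not in parents:
            parents.append(parent)
    return parents

# Hypothetical rankings from a semantic retriever and a BM25 retriever
semantic = ["c2", "c1", "c3"]
keyword = ["c3", "c2", "c4"]
fused = reciprocal_rank_fusion([semantic, keyword])

# Hypothetical mapping from small child chunks to larger parent sections
parent_of = {"c1": "p1", "c2": "p1", "c3": "p2", "c4": "p2"}
parents = expand_to_parents(fused[:2], parent_of)
print(fused, parents)
```

RRF only needs rank positions, so it avoids having to normalize cosine scores against BM25 scores. Chunks that rank well in both lists rise to the top.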