What is RAG?

Question

What is RAG?

Accepted Answer

## What is RAG?

**RAG (Retrieval-Augmented Generation)** is an AI architecture that retrieves relevant information from an external knowledge base and uses it to ground the LLM's response — reducing hallucination and enabling up-to-date answers.

### The Two Problems RAG Solves

1. **Hallucination** — LLMs invent plausible-sounding but false facts
2. **Knowledge cutoff** — LLMs don't know recent or proprietary information

### RAG Pipeline

```
User Question
      ↓
[1] Embed question → vector
      ↓
[2] Search vector DB → find similar document chunks
      ↓
[3] Retrieve top-k chunks
      ↓
[4] Augment prompt: question + retrieved context
      ↓
[5] LLM generates grounded answer
```
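The five steps can be sketched end to end without any external services. This is a toy illustration under stated assumptions, not a production retriever: `embed` uses word counts as a stand-in for a real embedding model, a plain list stands in for the vector DB, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Steps 1-3: embed the question, score every chunk, keep the top k."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def augment(question: str, context: list[str]) -> str:
    """Step 4: combine the retrieved context and the question into one prompt."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {question}"

chunks = [
    "RAG stands for Retrieval-Augmented Generation.",
    "Vector databases enable semantic similarity search.",
    "Bananas are rich in potassium.",
]
top = retrieve("What does RAG stand for?", chunks, k=1)
prompt = augment("What does RAG stand for?", top)  # step 5 would send this prompt to an LLM
print(prompt)
```

A real system would replace `embed` with a learned embedding model and the linear scan with an approximate nearest-neighbour index, but the control flow stays the same.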

### Complete RAG Implementation

```python
from anthropic import Anthropic
import chromadb

client = Anthropic()
chroma = chromadb.Client()
collection = chroma.create_collection("kb")

# Index documents once
docs = [
    "Claude was created by Anthropic, founded in 2021.",
    "RAG stands for Retrieval-Augmented Generation.",
    "Vector databases enable semantic similarity search."
]
collection.add(documents=docs, ids=["d1", "d2", "d3"])

def rag(question: str) -> str:
    # Retrieve the two most similar chunks (Chroma embeds the query text itself)
    results = collection.query(query_texts=[question], n_results=2)
    context = "\n".join(results["documents"][0])

    # Generate an answer constrained to the retrieved context
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": (
                f"Answer using only this context:\n{context}\n\n"
                f"Question: {question}"
            ),
        }],
    )
    return response.content[0].text

print(rag("Who made Claude?"))
# Example output (exact wording varies): "Claude was created by Anthropic, founded in 2021."
```
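The prompt above leaves it to the model to decide what to do when the retrieved chunks do not contain the answer. One common refinement, sketched here as an assumption rather than part of the implementation above, is to number the chunks and ask for an explicit fallback. `build_prompt` is a hypothetical helper; its return value could be passed as the `content` of the user message.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Number each retrieved chunk and add an explicit fallback instruction."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the numbered context below. "
        "If the context does not contain the answer, reply \"I don't know.\"\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "Who made Claude?",
    [
        "Claude was created by Anthropic, founded in 2021.",
        "RAG stands for Retrieval-Augmented Generation.",
    ],
)
print(prompt)
```

Numbering the chunks also makes it possible to ask the model to cite which chunk supports each claim.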

### RAG vs Pure LLM

| | Pure LLM | RAG |
|--|---------|-----|
| **Knowledge** | Training data only | Training data plus external documents |
| **Freshness** | Fixed at the training cutoff | As current as the indexed documents |
| **Hallucination** | Higher risk | Lower (answers grounded in retrieved text) |
| **Controllable** | No | Yes, you own the docs |
| **Cost** | Lower | Higher (embedding + retrieval) |
| **Customizable** | Fine-tune needed | Swap docs anytime |

### Key Design Decisions

| Decision | Options |
|---------|---------|
| **Chunking** | Fixed-size (512 tokens), semantic, sentence-level |
| **Embedding model** | OpenAI text-embedding-ada-002, sentence-transformers, Cohere |
| **Vector store** | Chroma, Pinecone, pgvector |
| **Retrieval** | Semantic, keyword (BM25), hybrid |
| **Reranking** | Cohere Rerank, cross-encoder |
| **top-k** | 3-5 chunks typically |
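
Of these decisions, chunking is the easiest to illustrate in isolation. Below is a minimal sketch of fixed-size chunking with overlap. It splits on whitespace as a rough proxy for tokens (a real pipeline would count tokens with the embedding model's tokenizer), and `chunk_text`, `size`, and `overlap` are illustrative names, not part of any library.

```python
def chunk_text(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into chunks of at most `size` words; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_text(sample, size=4, overlap=1)
print(pieces)
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice.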

### Common RAG Improvements

* **Hybrid search** — semantic + keyword (BM25) retrieval
* **Reranking** — use a cross-encoder to reorder results
* **Query expansion** — rewrite query before retrieval
* **Parent-child chunks** — retrieve small chunks, return larger parent context
* **HyDE** — generate a hypothetical answer first, use it to retrieve
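
Hybrid search needs a way to merge the semantic and keyword result lists. One widely used option, shown here as an illustration since the list above does not prescribe a fusion method, is reciprocal rank fusion (RRF): each document scores 1/(k + rank) in every list it appears in, and the scores are summed. The function name and document IDs are hypothetical; k = 60 is a conventional default.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

semantic = ["d2", "d1", "d3"]  # e.g. IDs returned by a vector search
keyword = ["d1", "d4", "d2"]   # e.g. IDs returned by BM25
fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)
```

Because RRF uses only ranks, it does not require the semantic and keyword scores to be on the same scale.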
