Types of RAG (Retrieval-Augmented Generation)
RAG has evolved significantly from a simple retrieve-and-generate pattern into a family of specialized architectures.
Core RAG Types
| Type | Description | Key Innovation |
|---|---|---|
| Basic RAG | Simple retrieve → generate pipeline | Foundation |
| Advanced RAG | Improved retrieval (pre/post processing) | Better retrieval |
| Modular RAG | Plug-and-play components | Flexibility |
| Agentic RAG | LLM controls retrieval decisions | Autonomy |
| Graph RAG | Knowledge graph + retrieval | Relationships |
| Self-RAG | Model decides when to retrieve | Efficiency |
| Corrective RAG | Evaluates and corrects retrieval | Quality |
| CAG (Cache-Augmented Generation) | Preloads and caches the full context instead of retrieving | Speed |
| Speculative RAG | A smaller model drafts candidate answers, a larger model verifies them | Accuracy |
1. Basic RAG
```python
def basic_rag(query: str) -> str:
    # vector_db and llm are assumed to be initialized elsewhere
    chunks = vector_db.search(query, k=3)
    context = "\n".join(chunks)
    return llm.invoke(f"Context: {context}\n\nQ: {query}")
```
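The function above assumes that `vector_db` and `llm` objects already exist. As a minimal sketch of what they could look like, the following uses hypothetical in-memory stand-ins (`ToyVectorDB` and `EchoLLM`, neither of which is a real library) so the pipeline can be run end to end. The bag-of-words cosine similarity is only a placeholder for a real embedding model.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words counts; a stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorDB:
    """Hypothetical in-memory store exposing search(query, k)."""

    def __init__(self, docs: list[str]):
        self.docs = docs
        self.vectors = [tokenize(d) for d in docs]

    def search(self, query: str, k: int = 3) -> list[str]:
        q = tokenize(query)
        ranked = sorted(
            zip(self.docs, self.vectors),
            key=lambda pair: cosine(q, pair[1]),
            reverse=True,
        )
        return [doc for doc, _ in ranked[:k]]


class EchoLLM:
    """Hypothetical LLM stub: returns the prompt so the flow is inspectable."""

    def invoke(self, prompt: str) -> str:
        return f"[LLM saw]\n{prompt}"


vector_db = ToyVectorDB([
    "RAG retrieves documents and passes them to a language model.",
    "BM25 is a keyword-based ranking function.",
    "Bananas are rich in potassium.",
])
llm = EchoLLM()


def basic_rag(query: str) -> str:
    chunks = vector_db.search(query, k=2)
    context = "\n".join(chunks)
    return llm.invoke(f"Context: {context}\n\nQ: {query}")


answer = basic_rag("How does RAG use a language model?")
print(answer)
```

Swapping the stubs for a real vector store and model client should leave `basic_rag` itself unchanged, provided they expose the same `search` and `invoke` methods.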
One fixed pipeline: it always retrieves, then always generates.
2. Advanced RAG
Adds pre-retrieval and post-retrieval improvements:
```python
def advanced_rag(query: str) -> str:
    # PRE-RETRIEVAL: Query transformation
    # (the hypothetical-answer variant is covered under HyDE)
    expanded_query = llm.invoke(f"Rewrite for better retrieval: {query}")

    # RETRIEVAL: Hybrid search (semantic + keyword)
    semantic_results = vector_db.search(expanded_query, k=5)
    keyword_results = bm25_index.search(expanded_query, k=5)
    combined = merge_results(semantic_results, keyword_results)

    # POST-RETRIEVAL: Reranking against the original query
    reranked = cohere_reranker.rerank(query, combined, top_n=3)

    context = "\n".join(reranked)
    return llm.invoke(f"Context: {context}\n\nQ: {query}")
```
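`merge_results` is not defined above. One common way to combine a semantic ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below is an assumption about what such a helper could look like, not necessarily the merge strategy intended here. It assumes both result lists contain plain strings ordered best-first, and it uses the conventional smoothing constant `k = 60`.

```python
from collections import defaultdict


def merge_results(*rankings: list[str], k: int = 60) -> list[str]:
    """Fuse several best-first rankings with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. A higher total score ranks earlier.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Sort by fused score, highest first
    return sorted(scores, key=scores.get, reverse=True)


semantic_results = ["doc_a", "doc_b", "doc_c"]
keyword_results = ["doc_b", "doc_d", "doc_a"]
combined = merge_results(semantic_results, keyword_results)
print(combined)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that appear in both lists rise to the top, which is the point of hybrid search: agreement between the semantic and keyword retrievers counts as stronger evidence than a high rank in only one of them.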
3. Self-RAG
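In the original Self-RAG method, the model is trained to emit special reflection tokens that decide whether to retrieve and that critique both the retrieved passages and its own answer. The code below approximates those decisions with ordinary prompts: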
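```python
def self_rag(query: str) -> str:
    # Step 1: Does this need retrieval?
    decision = llm.invoke(f"Does answering '{query}' require looking up documents? YES/NO")
    if "YES" not in decision.upper():
        return llm.invoke(query)  # Direct answer from training

    chunks = vector_db.search(query, k=4)

    # Step 2: Is the retrieved content relevant? Keep only chunks not judged "NO"
    relevant = []
    for chunk in chunks:
        relevance = llm.invoke(f"Is this relevant to '{query}'? {chunk[:200]} YES/NO/PARTIALLY")
        if not relevance.strip().upper().startswith("NO"):
            relevant.append(chunk)
    context = "\n".join(relevant)

    # Step 3: Is the answer supported by retrieved content?
    response = llm.invoke(f"Context: {context}\n\nQ: {query}")
    supported = llm.invoke(
        f"Is this answer fully supported by the context? YES/NO\n"
        f"Context: {context}\nAnswer: {response}"
    )
    if supported.strip().upper().startswith("NO"):
        # Unsupported: retry once, restricting the answer to the context
        response = llm.invoke(f"Answer using only this context: {context}\n\nQ: {query}")
    return response
```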
4. Corrective RAG (CRAG)
Evaluates retrieval quality and corrects if needed:
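```python
import re

def corrective_rag(query: str) -> str:
    # 1. Retrieve
    chunks = vector_db.search(query, k=4)

    # 2. Evaluate quality
    quality = llm.invoke(
        f"Rate relevance 1-5 of these chunks for '{query}'. "
        f"Reply with a single number.\n{chunks}"
    )
    # The reply may contain extra words, so extract the first number
    match = re.search(r"\d+(?:\.\d+)?", quality)
    score = float(match.group()) if match else 0.0  # unparseable reply counts as poor

    if score < 3:
        # 3. If poor quality, use web search as a fallback
        web_results = web_search(query)
        chunks = chunks + web_results  # Combine

    # 4. Generate with (corrected) context
    context = "\n".join(chunks)
    return llm.invoke(f"Context: {context}\n\nQ: {query}")
```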
5. HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer first, embed it, and use that embedding (instead of the query's) for retrieval:
```python
def hyde_rag(query: str) -> str:
    # Generate hypothetical answer
    hypothetical_answer = llm.invoke(
        f"Write a hypothetical document that would answer: {query}"
    )

    # Embed the hypothetical answer (not the query)
    hyp_embedding = embedder.encode(hypothetical_answer)

    # Search with the hypothetical embedding to find real docs that match
    chunks = vector_db.search_by_embedding(hyp_embedding, k=4)

    # Generate the real answer from the retrieved documents
    context = "\n".join(chunks)
    return llm.invoke(f"Context: {context}\n\nQ: {query}")
```
Choosing the Right RAG Type
| Use Case | Recommended |
|---|---|
| Simple Q&A | Basic RAG |
| High accuracy needed | Advanced RAG (hybrid + reranking) |
| Cost optimization | Self-RAG or CAG |
| Complex entity queries | Graph RAG |
| Unreliable retrieval | Corrective RAG |
| Autonomous research | Agentic RAG |
| Dense technical docs | HyDE |
| Stable small knowledge base | CAG |
RAG Evaluation Metrics
| Metric | Measures |
|---|---|
| Faithfulness | Is answer grounded in retrieved context? |
| Answer relevance | Does answer address the question? |
| Context precision | Are retrieved chunks actually useful? |
| Context recall | Were all relevant chunks retrieved? |
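
As a rough illustration of the two retrieval metrics in the table, the sketch below computes simplified, set-based versions of context precision and context recall. It assumes ground-truth relevance labels are available as chunk IDs (`relevant_ids` is a hypothetical input). Evaluation frameworks often estimate these with an LLM judge and may use rank-aware variants instead. Faithfulness and answer relevance need a judge model or human labels, so they are not covered here.

```python
def context_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for cid in retrieved_ids if cid in relevant_ids)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of relevant chunks that were retrieved."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids) & relevant_ids)
    return hits / len(relevant_ids)


retrieved = ["c1", "c2", "c3", "c4"]
relevant = {"c1", "c3", "c7"}

precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant were retrieved
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Raising `k` in retrieval typically trades precision for recall, which is one reason to track both rather than either one alone.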