Answer
RAG Failure Modes & Mitigations
RAG systems fail in predictable ways. Knowing these failure modes — and their fixes — is essential for production deployments.
The 5 Core RAG Failure Modes
1. Retrieval Failures — Wrong chunks retrieved
Symptoms: The model says "I don't know" even though the answer is in the documents, or it gives a wrong answer drawn from an irrelevant chunk.
Causes & Fixes:
| Cause | Fix |
|---|---|
| Chunk too large (dilutes signal) | Reduce chunk size to 256–512 tokens |
| Poor embedding model | Switch to a better model (BGE-Large, E5-Large) |
| Query-document vocabulary mismatch | Add BM25 hybrid search |
| Single-step retrieval misses context | Use HyDE (hypothetical document embeddings) |
```python
# HyDE: generate a hypothetical answer, embed it, retrieve with that
from openai import OpenAI

client = OpenAI()

def hyde_retrieve(question, vectorstore, k=5):
    # Generate a hypothetical document
    hyp_doc = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Write a paragraph that would answer: {question}"}]
    ).choices[0].message.content
    # Retrieve using the hypothetical document
    return vectorstore.similarity_search(hyp_doc, k=k)
```
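The hybrid-search fix from the table can be sketched without committing to a particular BM25 or vector library. The snippet below assumes you already have two best-first lists of document IDs, one from a keyword index and one from a vector index, and merges them with reciprocal rank fusion. The function name, the `k=60` constant, and the sample IDs are illustrative assumptions, not part of any specific framework.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several best-first lists of document IDs into one ranking.

    rankings: list of lists of document IDs, each ordered best-first.
    k: smoothing constant (60 is a commonly used value; treat it as tunable).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document scores higher the nearer it is to the top of each list
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in fused[:top_n]]

# Hypothetical results from a BM25 index and a vector index
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_c", "doc_a", "doc_d"]
fused_hits = reciprocal_rank_fusion([bm25_hits, dense_hits], top_n=3)
```

Rank-based fusion avoids having to normalize BM25 scores against cosine similarities, which live on different scales.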
2. Context Window Overflow — Too many chunks exceed LLM context
Fix: Score the retrieved chunks with a cross-encoder reranker and pass only the top 3 to the LLM.
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, chunks, top_k=3):
    pairs = [(query, chunk.page_content) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, chunks), key=lambda x: x[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```
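A fixed top-3 cut still overflows if the chunks themselves are long, so it can help to enforce an explicit token budget after reranking. The sketch below is a hypothetical helper, not part of any library. It assumes the texts arrive best-first, and it falls back to a whitespace word count as a rough stand-in for a real tokenizer.

```python
def fit_to_budget(ranked_texts, max_tokens, count_tokens=None):
    """Keep chunks in ranked order until the token budget is used up."""
    if count_tokens is None:
        # Assumption: whitespace word count approximates the token count.
        # Swap in your model's tokenizer for accurate budgeting.
        count_tokens = lambda text: len(text.split())
    kept, used = [], 0
    for text in ranked_texts:
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit, preserving rank order
        kept.append(text)
        used += cost
    return kept
```

Stopping at the first chunk that does not fit, rather than skipping it and trying lower-ranked ones, keeps the context strictly in relevance order. Whether that trade-off is right depends on how uneven your chunk lengths are.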
3. Hallucination on Retrieved Content — LLM ignores or distorts context
Cause: The model favors its parametric knowledge over the supplied context, or the sampling temperature is too high.
Fix:
- Set `temperature=0` for factual Q&A
- Add explicit grounding instruction: "Answer ONLY using the context below. If not in context, say 'I don't know'."
- Evaluate with RAGAS `faithfulness` metric
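One way to apply the grounding instruction is to build the chat messages in a single helper so the rule always precedes the context. This is a minimal sketch: the helper name, the numbered-context layout, and the sample question are assumptions for illustration.

```python
GROUNDING_INSTRUCTION = (
    "Answer ONLY using the context below. "
    "If not in context, say 'I don't know'."
)

def build_grounded_messages(question, chunks):
    """Return chat messages that place the grounding rule ahead of the context."""
    # Number each chunk so the model (and a human reviewer) can refer to it
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return [
        {"role": "system", "content": GROUNDING_INSTRUCTION},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

# Hypothetical question and retrieved chunk
messages = build_grounded_messages(
    "What is the refund window?",
    ["Refunds are accepted within 30 days."],
)
```

The resulting list could be passed as `messages` to the same `client.chat.completions.create` call used in the HyDE example, together with `temperature=0`.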
4. Chunking Breaks Semantic Units
Cause: Fixed-size chunking splits a table or code block mid-way.
Fix: Use semantic/document-aware chunking.
```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

# Split on markdown headers — keeps sections intact
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[
    ("#", "H1"), ("##", "H2"), ("###", "H3")
])
chunks = splitter.split_text(markdown_document)
```
5. Stale Index — Documents updated but vector store not refreshed
Fix: Implement incremental indexing with document hashing: store a content hash per document and re-embed only the documents whose hash has changed.
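A minimal sketch of the hashing step, assuming documents are available as a mapping from ID to text and that the hashes from the previous run are stored somewhere you control. The function names and sample data are hypothetical. The actual upsert and delete calls depend on your vector store and are left out.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_index_updates(documents, stored_hashes):
    """Decide which documents need re-embedding and which should be removed.

    documents: dict of doc_id -> current text
    stored_hashes: dict of doc_id -> hash recorded at the last indexing run
    """
    new_hashes = {doc_id: content_hash(text) for doc_id, text in documents.items()}
    # New or changed documents: hash is missing or differs from the stored one
    to_upsert = sorted(d for d, h in new_hashes.items() if stored_hashes.get(d) != h)
    # Documents that no longer exist in the source
    to_delete = sorted(d for d in stored_hashes if d not in new_hashes)
    return to_upsert, to_delete, new_hashes

# Hypothetical state: "a" changed, "b" unchanged, "c" removed, "d" added
stored = {"a": content_hash("old"), "b": content_hash("same"), "c": content_hash("gone")}
current = {"a": "new", "b": "same", "d": "fresh"}
to_upsert, to_delete, new_hashes = plan_index_updates(current, stored)
```

Persist `new_hashes` only after the vector store update succeeds. Otherwise a failed run would mark documents as indexed when they are not.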
Failure Mode Summary
| Failure Mode | Detection | Fix |
|---|---|---|
| Wrong chunks retrieved | Low context recall (RAGAS) | Smaller chunks, hybrid search, HyDE |
| Irrelevant chunks in top-K | Low context precision | Reranker, MMR retrieval |
| LLM ignores context | Low faithfulness (RAGAS) | Prompt grounding, temperature=0 |
| Chunks split mid-thought | Manual inspection | Semantic chunking |
| Stale answers | User reports | Incremental indexing pipeline |
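
The MMR (maximal marginal relevance) retrieval listed in the table trades relevance against redundancy, so near-duplicate chunks do not crowd out the top-K. The following is a pure-Python sketch over plain lists of floats. The function names and the `lambda_mult=0.5` default are assumptions, and production vector stores often ship their own MMR search.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return indices of up to k documents, balancing relevance and diversity."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lambda_mult=1.0` this reduces to plain similarity ranking. Lower values penalize chunks that resemble ones already chosen.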
Most common root cause in practice: The retrieved chunks are technically relevant but don't contain the specific sentence needed to answer the question. Fix: smaller chunks with more overlap, or a reranker.
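A sliding-window chunker is one simple way to get smaller chunks with overlap. The sketch below works on any pre-tokenized list. The function name and the default sizes are illustrative assumptions, and the overlap that works best depends on your documents.

```python
def chunk_with_overlap(tokens, chunk_size=256, overlap=64):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks
```

Because each window repeats the tail of the previous one, a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.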