What are common failure modes in RAG systems?

Question

Accepted Answer

## RAG Failure Modes & Mitigations

RAG systems fail in predictable ways. Knowing these failure modes — and their fixes — is essential for production deployments.

### The 5 Core RAG Failure Modes

#### 1. Retrieval Failures — Wrong chunks retrieved

**Symptoms:** The model says "I don't know" even though the answer is in the documents, or it gives a wrong answer drawn from an irrelevant chunk.

**Causes & Fixes:**

| Cause | Fix |
|-------|-----|
| Chunk too large (dilutes signal) | Reduce chunk size to 256–512 tokens |
| Poor embedding model | Switch to a better model (BGE-Large, E5-Large) |
| Query-document vocabulary mismatch | Add BM25 hybrid search |
| Single-step retrieval misses context | Use HyDE (hypothetical document embeddings) |

```python
# HyDE: generate a hypothetical answer, embed it, retrieve with that
from openai import OpenAI
client = OpenAI()

def hyde_retrieve(question, vectorstore, k=5):
    # Generate a hypothetical document
    hyp_doc = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Write a paragraph that would answer: {question}"}]
    ).choices[0].message.content

    # Retrieve using the hypothetical document
    return vectorstore.similarity_search(hyp_doc, k=k)
```
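
The "Add BM25 hybrid search" fix in the table above needs some way to merge the keyword ranking with the vector ranking. One common option is reciprocal rank fusion (RRF). The following is a minimal sketch, assuming you already have two best-first lists of document IDs; the function name `rrf_fuse`, the sample IDs, and the constant `k=60` (a commonly used default) are illustrative assumptions, not part of the original answer.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60, top_n=5):
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion.

    rankings: list of lists, each ordered best-first (e.g. BM25, vector search).
    k: smoothing constant; 60 is a commonly used default (assumption).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, best first; ties are broken by doc ID for determinism
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:top_n]

bm25_hits = ["doc3", "doc1", "doc7"]    # hypothetical keyword results
vector_hits = ["doc1", "doc4", "doc3"]  # hypothetical embedding results
print(rrf_fuse([bm25_hits, vector_hits]))
```

Documents that rank well in both lists rise to the top, so a chunk found only through exact keyword overlap can still make it into the final context.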

#### 2. Context Window Overflow — Too many chunks exceed LLM context

**Fix:** Retrieve a wider candidate set, rerank it with a cross-encoder reranker model, and keep only the top 3 chunks.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, chunks, top_k=3):
    pairs = [(query, chunk.page_content) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, chunks), key=lambda x: x[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```
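
A fixed `top_k` ignores how long each chunk is, so three long chunks can still overflow the context. One possible complement, sketched below, is to pack the reranked chunks under an explicit token budget. The helper name `pack_chunks`, the `max_tokens` default, and the whitespace-based token estimate are all assumptions for illustration; in practice you would pass in your model's real tokenizer.

```python
def pack_chunks(texts, max_tokens=1500, count_tokens=None):
    """Keep chunks, in ranked order, until the token budget is used up."""
    # Rough whitespace token estimate (assumption); swap in a real tokenizer
    count_tokens = count_tokens or (lambda t: len(t.split()))
    kept, used = [], 0
    for text in texts:
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit, preserving rank order
        kept.append(text)
        used += cost
    return kept
```

For example, `pack_chunks([c.page_content for c in rerank(query, chunks, top_k=10)])` would keep as many of the best-ranked chunks as fit in the budget.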

#### 3. Hallucination on Retrieved Content — LLM ignores or distorts context

**Cause:** The model "prefers" its parametric knowledge over the supplied context, or the sampling temperature is too high.

**Fix:**
- Set `temperature=0` for factual Q&A
- Add explicit grounding instruction: *"Answer ONLY using the context below. If not in context, say 'I don't know'."*
- Evaluate with RAGAS `faithfulness` metric
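
The grounding instruction above can be wired into the prompt roughly as follows. This is a sketch under assumptions: the helper name `build_grounded_messages` and the numbered-context layout are illustrative choices, not something prescribed by the original answer.

```python
GROUNDING_INSTRUCTION = (
    "Answer ONLY using the context below. "
    "If not in context, say 'I don't know'."
)

def build_grounded_messages(question, chunks):
    """Build chat messages that pin the model to the retrieved context."""
    # Number the chunks so an answer can be traced back to its source
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return [
        {"role": "system", "content": GROUNDING_INSTRUCTION},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```

The returned list can be passed as `messages` to the same `client.chat.completions.create` call used in the HyDE example, together with `temperature=0`.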

#### 4. Chunking Breaks Semantic Units

**Cause:** Fixed-size chunking splits a table or code block mid-way.

**Fix:** Use semantic/document-aware chunking.

```python
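# Recent LangChain releases ship splitters in the `langchain-text-splitters` package;
# older releases expose the same class as `langchain.text_splitter.MarkdownHeaderTextSplitter`
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Split on Markdown headers so each section stays intact
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[
    ("#", "H1"), ("##", "H2"), ("###", "H3")
])
# markdown_document is the raw Markdown string to split
chunks = splitter.split_text(markdown_document)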
```

#### 5. Stale Index — Documents updated but vector store not refreshed

**Fix:** Implement incremental indexing with document hashing: store a content hash per document and re-embed only the documents whose hash has changed.
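
A minimal sketch of hash-based change detection, assuming documents are available as a `{doc_id: text}` mapping and that the hashes from the previous indexing run were saved somewhere. The names `content_hash` and `plan_index_update` are hypothetical; the actual upsert and delete calls depend on your vector store and are left out.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_index_update(documents, stored_hashes):
    """Compare current documents against hashes saved at the last indexing run.

    documents: {doc_id: text} as they exist now.
    stored_hashes: {doc_id: hash} recorded when the index was last built.
    Returns (to_upsert, to_delete, new_hashes).
    """
    new_hashes = {doc_id: content_hash(text) for doc_id, text in documents.items()}
    # New or changed documents need re-chunking and re-embedding
    to_upsert = sorted(d for d, h in new_hashes.items() if stored_hashes.get(d) != h)
    # Documents that disappeared must have their vectors removed
    to_delete = sorted(d for d in stored_hashes if d not in new_hashes)
    return to_upsert, to_delete, new_hashes
```

Unchanged documents are skipped entirely, which keeps a scheduled refresh cheap enough to run often.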

### Failure Mode Summary

| Failure Mode | Detection | Fix |
|-------------|-----------|-----|
| Wrong chunks retrieved | Low context recall (RAGAS) | Smaller chunks, hybrid search, HyDE |
| Irrelevant chunks in top-K | Low context precision | Reranker, MMR retrieval |
| LLM ignores context | Low faithfulness (RAGAS) | Prompt grounding, temperature=0 |
| Chunks split mid-thought | Manual inspection | Semantic chunking |
| Stale answers | User reports | Incremental indexing pipeline |
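
The "MMR retrieval" fix in the table refers to maximal marginal relevance, which trades relevance to the query against similarity to chunks already selected. Below is a dependency-free sketch over raw embedding vectors; the function names and the `lambda_mult=0.5` default are assumptions for illustration, and many vector stores provide their own MMR search that you would normally use instead.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return indices of up to k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalize similarity to anything already selected
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lambda_mult=1.0` this reduces to plain relevance ranking; lower values push near-duplicate chunks out of the top-K.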

> **Most common root cause in practice:** The retrieved chunks are technically relevant but don't contain the *specific sentence* needed to answer the question. Fix: smaller chunks with more overlap, or a reranker.
