Concept #10 · Hard · gen-ai-fundamentals

What are common failure modes in RAG systems?

#gen-ai #rag

Answer

RAG Failure Modes & Mitigations

RAG systems fail in predictable ways. Knowing these failure modes — and their fixes — is essential for production deployments.

The 5 Core RAG Failure Modes

1. Retrieval Failures — Wrong chunks retrieved

Symptoms: The model says "I don't know" even though the answer is in the documents, or it gives a wrong answer drawn from an irrelevant chunk.

Causes & Fixes:

| Cause | Fix |
|---|---|
| Chunk too large (dilutes signal) | Reduce chunk size to 256–512 tokens |
| Poor embedding model | Switch to a better model (e.g. BGE-Large, E5-Large) |
| Query-document vocabulary mismatch | Add BM25 hybrid search |
| Single-step retrieval misses context | Use HyDE (hypothetical document embeddings) |
```python
# HyDE: generate a hypothetical answer, embed it, retrieve with that
from openai import OpenAI
client = OpenAI()

def hyde_retrieve(question, vectorstore, k=5):
    # Generate a hypothetical document
    hyp_doc = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Write a paragraph that would answer: {question}"}]
    ).choices[0].message.content

    # Retrieve using the hypothetical document
    return vectorstore.similarity_search(hyp_doc, k=k)
```
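The "Add BM25 hybrid search" fix needs some way to merge the keyword ranking with the vector ranking. One common option is reciprocal rank fusion (RRF). The sketch below assumes each retriever has already returned an ordered list of document IDs. The function name `rrf_fuse`, the sample IDs, and the constant `k=60` (a commonly used default) are illustrative choices, not part of any specific library.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60, top_n=5):
    """Merge several ranked lists of doc IDs with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more credit the higher it ranks in each list
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:top_n]

# Hypothetical rankings from a BM25 index and a vector store
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([bm25_ranking, vector_ranking])
```

Documents that appear near the top of both lists rise to the front, which is the behavior hybrid search is meant to provide when vocabulary mismatch hurts one retriever.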

2. Context Window Overflow — Too many chunks exceed LLM context

Fix: Score the retrieved chunks with a reranker model (such as a cross-encoder) and pass only the top few, for example the top 3, to the LLM.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, chunks, top_k=3):
    pairs = [(query, chunk.page_content) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, chunks), key=lambda x: x[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```
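A fixed top-k does not by itself guarantee the prompt fits, since chunk lengths vary. A token budget can be applied after reranking. The sketch below is a minimal illustration: it assumes the chunks are plain strings already sorted best-first, and `count_tokens` is a crude whitespace-based stand-in that should be swapped for the target model's real tokenizer. Both function names are hypothetical.

```python
def count_tokens(text):
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word
    return len(text.split())

def pack_context(chunks, max_tokens, count=count_tokens):
    """Keep best-first chunks until the token budget is used up."""
    selected, used = [], 0
    for chunk in chunks:
        cost = count(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit, preserving rank order
        selected.append(chunk)
        used += cost
    return selected, used

# Placeholder chunks, assumed to be sorted by reranker score
ranked_chunks = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
context, tokens_used = pack_context(ranked_chunks, max_tokens=6)
```

Stopping at the first chunk that does not fit keeps the selection a strict prefix of the ranking. Skipping oversized chunks and continuing is an alternative if filling the budget matters more than rank order.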

3. Hallucination on Retrieved Content — LLM ignores or distorts context

Cause: The model favors its parametric (pre-trained) knowledge over the retrieved context, or a high sampling temperature lets its output drift away from the source text.

Fix:

  • Set `temperature=0` for factual Q&A
  • Add explicit grounding instruction: "Answer ONLY using the context below. If not in context, say 'I don't know'."
  • Evaluate with the RAGAS `faithfulness` metric
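The grounding instruction above can be wired into a prompt template along these lines. This is only a sketch: the function name `build_grounded_prompt`, the numbered-chunk layout, and the sample question and chunks are made up for illustration, and the chunks are assumed to be plain strings.

```python
GROUNDING_INSTRUCTION = (
    "Answer ONLY using the context below. "
    "If not in context, say 'I don't know'."
)

def build_grounded_prompt(question, chunks):
    """Assemble a prompt that numbers each chunk and states the grounding rule."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        f"{GROUNDING_INSTRUCTION}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Invented example inputs
prompt = build_grounded_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days.", "Shipping takes 5 business days."],
)
```

The resulting string would be sent as the user message with `temperature=0`. Numbering the chunks also makes it easier to ask the model to cite which chunk supports each claim.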

4. Chunking Breaks Semantic Units

Cause: Fixed-size chunking splits a table or code block mid-way.

Fix: Use semantic/document-aware chunking.

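```python
# Recent LangChain releases ship splitters in the langchain_text_splitters package
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Small sample input for illustration; substitute your own markdown
markdown_document = "# Guide\n\nIntro text.\n\n## Setup\n\nInstall steps."

# Split on markdown headers so each section stays intact
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[
    ("#", "H1"), ("##", "H2"), ("###", "H3")
])
# Each chunk is a Document whose metadata records the headers above it
chunks = splitter.split_text(markdown_document)
```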

5. Stale Index — Documents updated but vector store not refreshed

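Fix: Implement incremental indexing with document hashing: store a content hash per document, re-embed only documents whose hash has changed, and remove vectors for documents that no longer exist.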
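A minimal sketch of hash-based incremental indexing follows. It assumes documents are available as a mapping from ID to text and that the previous run's hashes were persisted somewhere. The function names and sample file names are hypothetical, and the actual upsert and delete calls depend on the vector store in use.

```python
import hashlib

def content_hash(text):
    # SHA-256 of the document text; any stable hash would work
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_index_updates(documents, stored_hashes):
    """Compare current docs to stored hashes.

    Returns (to_upsert, to_delete, new_hashes).
    """
    new_hashes = {doc_id: content_hash(text) for doc_id, text in documents.items()}
    # New or changed documents need re-embedding
    to_upsert = sorted(
        doc_id for doc_id, h in new_hashes.items() if stored_hashes.get(doc_id) != h
    )
    # Documents that disappeared should have their vectors removed
    to_delete = sorted(doc_id for doc_id in stored_hashes if doc_id not in new_hashes)
    return to_upsert, to_delete, new_hashes

# Hashes saved by a previous indexing run (illustrative data)
stored = {
    "a.md": content_hash("old text"),
    "b.md": content_hash("same"),
    "c.md": content_hash("gone"),
}
current = {"a.md": "new text", "b.md": "same", "d.md": "brand new"}
to_upsert, to_delete, stored = plan_index_updates(current, stored)
```

Only the documents in `to_upsert` are re-chunked and re-embedded, so an unchanged corpus costs nothing beyond hashing.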

Failure Mode Summary

| Failure Mode | Detection | Fix |
|---|---|---|
| Wrong chunks retrieved | Low context recall (RAGAS) | Smaller chunks, hybrid search, HyDE |
| Irrelevant chunks in top-K | Low context precision | Reranker, MMR retrieval |
| LLM ignores context | Low faithfulness (RAGAS) | Prompt grounding, temperature=0 |
| Chunks split mid-thought | Manual inspection | Semantic chunking |
| Stale answers | User reports | Incremental indexing pipeline |
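The RAGAS metrics in the table rely on an LLM as judge. When a small labeled set is available (each question paired with the IDs of chunks known to contain the answer), simple ID-based recall and precision at k can act as a cheap first check. The sketch below is not the RAGAS implementation. The function names and chunk IDs are made up for illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = set(retrieved_ids[:k]) & relevant
    return len(hits) / len(relevant)

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for cid in top_k if cid in relevant) / len(top_k)

# One labeled example: retriever output vs. chunks known to hold the answer
retrieved = ["c7", "c2", "c9", "c4"]
relevant = ["c2", "c5"]
recall = recall_at_k(retrieved, relevant, k=3)
precision = precision_at_k(retrieved, relevant, k=3)
```

Low recall across the labeled set points at the "wrong chunks retrieved" row, while low precision points at the "irrelevant chunks in top-K" row.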

Most common root cause in practice: The retrieved chunks are technically relevant but don't contain the specific sentence needed to answer the question. Fix: smaller chunks with more overlap, or a reranker.
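"Smaller chunks with more overlap" can be sketched as a sliding window over tokens. The example below assumes the text has already been tokenized (here by whitespace, which is only an approximation of model tokens). The function name and the sample sizes are illustrative, not recommendations.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks

words = "one two three four five six seven eight nine ten".split()
chunks = chunk_with_overlap(words, chunk_size=4, overlap=2)
```

Because consecutive windows share tokens, a sentence that straddles a boundary in one chunk appears whole in its neighbor, which raises the chance that the specific sentence needed is retrieved intact.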