Concept #9Hardgen-ai-fundamentals

How would you evaluate RAG system performance?

#gen-ai#rag#evaluation

Answer

Evaluating RAG Systems

Evaluating RAG is harder than evaluating classification — you need to assess both the retrieval quality and the generation quality.

The Two Evaluation Dimensions

Retrieval quality — did the retriever find the right chunks? Generation quality — did the LLM produce a correct, grounded answer from the chunks?

RAGAS Framework

RAGAS is the standard evaluation framework for RAG. It computes 4 key metrics using an LLM-as-judge approach (no human labels needed for the metrics themselves):

python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)
from datasets import Dataset

# Your RAG outputs
data = {
    "question": ["What is RAG?", "What is fine-tuning?"],
    "answer": ["RAG retrieves external docs to ground LLM answers.", "Fine-tuning updates model weights."],
    "contexts": [
        ["RAG stands for Retrieval-Augmented Generation..."],
        ["Fine-tuning is the process of training..."]
    ],
    "ground_truth": ["RAG retrieves context to reduce hallucinations.", "Fine-tuning adapts model weights."]
}

dataset = Dataset.from_dict(data)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall, context_precision])
print(results)

RAGAS Metrics Explained

MetricMeasuresFormulaGood Score
FaithfulnessIs the answer supported by the retrieved context?% of answer claims in context> 0.9
Answer RelevancyIs the answer relevant to the question?Cosine sim of generated questions to original> 0.8
Context RecallWere all ground-truth facts retrieved?% of GT facts in retrieved context> 0.8
Context PrecisionAre retrieved chunks actually useful?% of retrieved chunks that contain GT info> 0.7

Additional Retrieval Metrics

python
# Retrieval-specific metrics
# Hit Rate @ K: was the correct chunk in top-K?
hit_rate = sum(1 for q in queries if correct_chunk in retrieve(q, k=5)) / len(queries)

# MRR: how highly was the correct chunk ranked?
mrr = sum(1 / rank for rank, doc in enumerate(retrieved_docs, 1) if doc == correct_doc) / len(queries)

Building a Evaluation Dataset

python
# Golden dataset approach — the most reliable method
golden_qa_pairs = [
    {
        "question": "What is the refund policy?",
        "ground_truth": "Customers can request refunds within 30 days of purchase.",
        "source_chunk_id": "handbook_page_12"
    },
    # ... 50-100 more pairs
]

Production approach: Build a golden eval set of 50–100 question/answer pairs with known correct chunks. Run this suite on every code change — treat it like unit tests for your RAG system.