How would you evaluate RAG system performance?
#gen-ai#rag#evaluation
Answer
Evaluating RAG Systems
Evaluating RAG is harder than evaluating classification — you need to assess both the retrieval quality and the generation quality.
The Two Evaluation Dimensions
Retrieval quality — did the retriever find the right chunks? Generation quality — did the LLM produce a correct, grounded answer from the chunks?
RAGAS Framework
RAGAS is the standard evaluation framework for RAG. It computes 4 key metrics using an LLM-as-judge approach (no human labels needed for the metrics themselves):
pythonfrom ragas import evaluate from ragas.metrics import ( faithfulness, answer_relevancy, context_recall, context_precision, ) from datasets import Dataset # Your RAG outputs data = { "question": ["What is RAG?", "What is fine-tuning?"], "answer": ["RAG retrieves external docs to ground LLM answers.", "Fine-tuning updates model weights."], "contexts": [ ["RAG stands for Retrieval-Augmented Generation..."], ["Fine-tuning is the process of training..."] ], "ground_truth": ["RAG retrieves context to reduce hallucinations.", "Fine-tuning adapts model weights."] } dataset = Dataset.from_dict(data) results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall, context_precision]) print(results)
RAGAS Metrics Explained
| Metric | Measures | Formula | Good Score |
|---|---|---|---|
| Faithfulness | Is the answer supported by the retrieved context? | % of answer claims in context | > 0.9 |
| Answer Relevancy | Is the answer relevant to the question? | Cosine sim of generated questions to original | > 0.8 |
| Context Recall | Were all ground-truth facts retrieved? | % of GT facts in retrieved context | > 0.8 |
| Context Precision | Are retrieved chunks actually useful? | % of retrieved chunks that contain GT info | > 0.7 |
Additional Retrieval Metrics
python# Retrieval-specific metrics # Hit Rate @ K: was the correct chunk in top-K? hit_rate = sum(1 for q in queries if correct_chunk in retrieve(q, k=5)) / len(queries) # MRR: how highly was the correct chunk ranked? mrr = sum(1 / rank for rank, doc in enumerate(retrieved_docs, 1) if doc == correct_doc) / len(queries)
Building a Evaluation Dataset
python# Golden dataset approach — the most reliable method golden_qa_pairs = [ { "question": "What is the refund policy?", "ground_truth": "Customers can request refunds within 30 days of purchase.", "source_chunk_id": "handbook_page_12" }, # ... 50-100 more pairs ]
Production approach: Build a golden eval set of 50–100 question/answer pairs with known correct chunks. Run this suite on every code change — treat it like unit tests for your RAG system.