How would you evaluate RAG system performance?

Accepted Answer

## Evaluating RAG Systems

Evaluating a RAG system is harder than evaluating a classifier: there is no single label to compare against, so you need to assess retrieval quality and generation quality separately.

### The Two Evaluation Dimensions

- **Retrieval quality**: did the retriever find the right chunks?
- **Generation quality**: did the LLM produce a correct, grounded answer from the chunks?

### RAGAS Framework

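[RAGAS](https://docs.ragas.io) is a widely used evaluation framework for RAG. It computes four key metrics using an LLM-as-judge approach, so no human scoring of individual outputs is needed. The context metrics in this example still compare against a `ground_truth` answer, which you have to supply: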

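```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)

# Note: evaluate() calls a judge LLM and an embedding model, so credentials
# for the configured provider must be available before running this.
# The column names below follow the schema used in this example; other ragas
# releases may expect different names, so check the docs for your version.

# Your RAG outputs: one entry per question, aligned by index across columns
data = {
    "question": ["What is RAG?", "What is fine-tuning?"],
    "answer": ["RAG retrieves external docs to ground LLM answers.", "Fine-tuning updates model weights."],
    # Each question maps to the list of chunks the retriever returned for it
    "contexts": [
        ["RAG stands for Retrieval-Augmented Generation..."],
        ["Fine-tuning is the process of training..."]
    ],
    # Reference answers, used by context_recall and context_precision
    "ground_truth": ["RAG retrieves context to reduce hallucinations.", "Fine-tuning adapts model weights."]
}

dataset = Dataset.from_dict(data)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall, context_precision])
print(results)
```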
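
In practice the four columns come from running your own pipeline over a question set rather than being typed by hand. A minimal sketch of that step, assuming hypothetical `retrieve` and `generate` callables that stand in for your retriever and LLM call (neither is part of RAGAS):

```python
from typing import Callable

def build_eval_rows(
    qa_pairs: list[dict],
    retrieve: Callable[[str], list[str]],       # hypothetical: question -> chunk texts
    generate: Callable[[str, list[str]], str],  # hypothetical: (question, chunks) -> answer
) -> dict[str, list]:
    """Run the pipeline once per question and collect column-oriented rows."""
    rows = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
    for pair in qa_pairs:
        chunks = retrieve(pair["question"])
        rows["question"].append(pair["question"])
        rows["answer"].append(generate(pair["question"], chunks))
        rows["contexts"].append(chunks)
        rows["ground_truth"].append(pair["ground_truth"])
    return rows

# Stub components so the sketch runs without a vector store or an LLM
def fake_retrieve(question: str) -> list[str]:
    return ["RAG stands for Retrieval-Augmented Generation..."]

def fake_generate(question: str, chunks: list[str]) -> str:
    return f"Based on {len(chunks)} chunk(s): RAG retrieves external docs."

rows = build_eval_rows(
    [{"question": "What is RAG?", "ground_truth": "RAG retrieves context to reduce hallucinations."}],
    fake_retrieve,
    fake_generate,
)
```

The resulting dict has the same shape as `data` above, so it can be passed straight to `Dataset.from_dict`.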

### RAGAS Metrics Explained

| Metric | Measures | Computed as | Good Score (rule of thumb) |
|--------|----------|-------------|----------------------------|
| **Faithfulness** | Is the answer supported by the retrieved context? | Fraction of claims in the answer that the context supports | > 0.9 |
| **Answer Relevancy** | Is the answer relevant to the question? | Mean cosine similarity between the original question and questions regenerated from the answer | > 0.8 |
| **Context Recall** | Were all ground-truth facts retrieved? | Fraction of ground-truth facts found in the retrieved context | > 0.8 |
| **Context Precision** | Are retrieved chunks actually useful? | Share of retrieved chunks that contain ground-truth info, with relevant chunks ranked higher counting more | > 0.7 |
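
To make the first two rows concrete, here is a rough sketch of the arithmetic, with the LLM judge and the embedding model replaced by hard-coded values. This is not RAGAS's implementation; the claim verdicts and 2-d vectors are made up for illustration.

```python
import math

def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims the judge marked as supported by the context."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def answer_relevancy_score(question_vec: list[float], regenerated_vecs: list[list[float]]) -> float:
    """Mean cosine similarity between the original question embedding and
    the embeddings of questions regenerated from the answer."""
    return sum(cosine(question_vec, v) for v in regenerated_vecs) / len(regenerated_vecs)

# Illustrative inputs: 3 of 4 claims supported; toy 2-d "embeddings"
faith = faithfulness_score([True, True, True, False])
relevancy = answer_relevancy_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(faith, relevancy)  # 0.75 0.5
```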

### Additional Retrieval Metrics

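```python
# Retrieval-specific metrics.
# Assumes `retrieve(question, k)` returns a ranked list of chunk IDs and
# `eval_set` is a list of (question, correct_chunk_id) pairs.
K = 5
ranked = [(retrieve(q, k=K), gold) for q, gold in eval_set]

# Hit Rate @ K: fraction of queries whose correct chunk appears in the top K
hit_rate = sum(gold in docs for docs, gold in ranked) / len(ranked)

# MRR: mean over queries of 1/rank of the correct chunk (a miss contributes 0)
mrr = sum(1 / (docs.index(gold) + 1) for docs, gold in ranked if gold in docs) / len(ranked)
```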
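
A self-contained version with toy data can help sanity-check the two definitions. The chunk IDs and rankings below are invented; each entry pairs a ranked retrieval result with the single chunk known to be correct.

```python
def hit_rate_at_k(results: list[tuple[list[str], str]], k: int) -> float:
    """Fraction of queries whose correct chunk appears in the top-k results."""
    return sum(gold in ranked[:k] for ranked, gold in results) / len(results)

def mean_reciprocal_rank(results: list[tuple[list[str], str]]) -> float:
    """Mean of 1/rank of the correct chunk; a miss contributes 0."""
    total = 0.0
    for ranked, gold in results:
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(results)

results = [
    (["c1", "c7", "c3"], "c1"),  # correct chunk at rank 1
    (["c4", "c2", "c9"], "c2"),  # correct chunk at rank 2
    (["c5", "c6", "c8"], "c0"),  # miss
]

print(hit_rate_at_k(results, k=1))    # 1 of 3 queries hit at k=1
print(mean_reciprocal_rank(results))  # (1 + 0.5 + 0) / 3 = 0.5
```

Hit rate only asks whether the correct chunk made the cut, while MRR also rewards ranking it near the top, so the two can diverge when the retriever finds the right chunk but buries it.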

### Building an Evaluation Dataset

```python
# Golden dataset approach — the most reliable method
golden_qa_pairs = [
    {
        "question": "What is the refund policy?",
        "ground_truth": "Customers can request refunds within 30 days of purchase.",
        "source_chunk_id": "handbook_page_12"
    },
    # ... 50-100 more pairs
]
```

> **Production approach:** Build a golden eval set of 50–100 question/answer pairs with known correct chunks. Run this suite on every code change — treat it like unit tests for your RAG system.
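
One way to wire this into CI is a plain test function that fails when retrieval quality drops below a threshold. A sketch, assuming a hypothetical `retrieve_ids` function that returns ranked chunk IDs, and an arbitrary 0.9 threshold that you would tune for your own data:

```python
GOLDEN_QA_PAIRS = [
    {
        "question": "What is the refund policy?",
        "ground_truth": "Customers can request refunds within 30 days of purchase.",
        "source_chunk_id": "handbook_page_12",
    },
]

def retrieve_ids(question: str, k: int = 5) -> list[str]:
    """Hypothetical stand-in for the real retriever; returns ranked chunk IDs."""
    return ["handbook_page_12", "handbook_page_3"][:k]

def golden_hit_rate(pairs: list[dict], k: int = 5) -> float:
    """Fraction of golden questions whose source chunk is in the top-k results."""
    hits = sum(p["source_chunk_id"] in retrieve_ids(p["question"], k) for p in pairs)
    return hits / len(pairs)

def test_retrieval_hit_rate():
    # Threshold is illustrative, not a recommendation
    assert golden_hit_rate(GOLDEN_QA_PAIRS) >= 0.9

test_retrieval_hit_rate()
```

Under a test runner such as pytest the final call is unnecessary; it is there so the snippet also runs as a plain script.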
