How would you evaluate whether your RAG system is better than fine-tuning?
#gen-ai #system-design #rag #fine-tuning
Answer
Evaluating RAG vs Fine-tuning
This is a structured A/B evaluation, not just a qualitative comparison. You need metrics, a test set, and a clear decision framework.
Build a Golden Evaluation Set First
```python
# 50-100 hand-crafted question-answer pairs specific to your domain
golden_set = [
    {
        "question": "What is the return policy for electronics?",
        "expected_answer": "Electronics can be returned within 30 days with original packaging.",
        "answer_contains": ["30 days", "original packaging"],
        "source_doc": "returns_policy.pdf",
    },
    # ... more pairs
]
```
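The `answer_contains` field lends itself to a cheap deterministic check that can run before any LLM-based scoring. The following is a minimal sketch, not a definitive implementation: `contains_all` and `keyword_pass_rate` are hypothetical helper names, and `system_fn` is assumed to take a question string and return a dict with an `"answer"` key.

```python
def contains_all(answer: str, required: list[str]) -> bool:
    """Return True if every required phrase appears in the answer (case-insensitive)."""
    normalized = answer.lower()
    return all(phrase.lower() in normalized for phrase in required)


def keyword_pass_rate(system_fn, golden_set: list[dict]) -> float:
    """Fraction of golden-set items whose answer contains all required phrases.

    Assumption: system_fn(question) returns a dict with an "answer" string.
    """
    if not golden_set:
        return 0.0
    passed = sum(
        contains_all(system_fn(item["question"])["answer"], item["answer_contains"])
        for item in golden_set
    )
    return passed / len(golden_set)
```

Substring matching is strict about wording ("thirty days" would fail a "30 days" check), so it works best as a smoke test alongside semantic metrics rather than as the only score.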
Automated Evaluation with RAGAS
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_recall, faithfulness


def evaluate_system(system_fn, golden_set: list[dict]) -> dict:
    """Run every golden-set question through system_fn and score it with RAGAS.

    Column names follow the ragas 0.1.x schema; newer releases use different
    names, so check the version you have installed.
    """
    results = []
    for item in golden_set:
        output = system_fn(item["question"])
        results.append({
            "question": item["question"],
            "answer": output["answer"],
            # A fine-tuned model returns no contexts, so context-based metrics
            # (faithfulness, context_recall) may not be meaningful for it.
            "contexts": output.get("contexts", []),
            "ground_truth": item["expected_answer"],
        })

    dataset = Dataset.from_list(results)
    return evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall])


# rag_pipeline and finetuned_model are assumed to be defined elsewhere.
# Evaluate RAG
rag_scores = evaluate_system(rag_pipeline.query, golden_set)

# Evaluate fine-tuned model (no retrieval)
ft_scores = evaluate_system(finetuned_model.query, golden_set)

print(f"RAG: faithfulness={rag_scores['faithfulness']:.3f}, relevancy={rag_scores['answer_relevancy']:.3f}")
print(f"Fine-tuned: faithfulness={ft_scores['faithfulness']:.3f}, relevancy={ft_scores['answer_relevancy']:.3f}")
```
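With only 50-100 questions, a small gap between two averages can be noise. One way to check is a paired bootstrap over per-question scores. The sketch below assumes you already have one score per question for each system, in the same question order; how you obtain per-question scores depends on your evaluation tooling. `paired_bootstrap` is a hypothetical helper, not part of any library.

```python
import random


def paired_bootstrap(
    scores_a: list[float],
    scores_b: list[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Mean difference (A - B) with an approximate 95% bootstrap interval.

    scores_a[i] and scores_b[i] must be the two systems' scores on the
    same golden-set question.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)

    # Resample the per-question differences with replacement and
    # record the mean of each resample.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))

    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, lower, upper
```

If the interval excludes zero, the difference is less likely to be an artifact of which questions happened to land in the golden set. If it straddles zero, consider the two systems tied on that metric and let latency, cost, or maintenance decide.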
Business Metrics to Compare
```python
import time


def benchmark_systems(queries: list[str], n_runs: int = 10) -> dict:
    """Compare latency, token usage, and cost for both systems.

    Assumes rag_fn, ft_fn, measure_avg_tokens, and estimate_cost are
    defined elsewhere.
    """
    metrics = {"rag": {}, "fine_tuned": {}}

    for system_name, system_fn in [("rag", rag_fn), ("fine_tuned", ft_fn)]:
        latencies = []
        for query in queries:
            # Repeat each query n_runs times to smooth out noise.
            for _ in range(n_runs):
                start = time.perf_counter()
                system_fn(query)
                latencies.append(time.perf_counter() - start)

        latencies.sort()  # sort once, then index for percentiles
        metrics[system_name] = {
            "p50_latency_ms": latencies[len(latencies) // 2] * 1000,
            "p95_latency_ms": latencies[int(len(latencies) * 0.95)] * 1000,
            "avg_tokens": measure_avg_tokens(queries, system_fn),
            "cost_per_1k_queries": estimate_cost(queries, system_fn),
        }

    return metrics
```
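The cost figure comes down to simple arithmetic on average token counts. Here is a minimal sketch under stated assumptions: `cost_per_1k_queries` is a hypothetical helper, the token counts and per-token prices are illustrative placeholders rather than real vendor pricing, and embedding and vector-store costs for RAG are not included.

```python
def cost_per_1k_queries(
    avg_prompt_tokens: float,
    avg_completion_tokens: float,
    usd_per_1k_prompt_tokens: float,
    usd_per_1k_completion_tokens: float,
) -> float:
    """Estimated USD cost of 1,000 queries from average token counts."""
    per_query = (
        avg_prompt_tokens / 1000 * usd_per_1k_prompt_tokens
        + avg_completion_tokens / 1000 * usd_per_1k_completion_tokens
    )
    return per_query * 1000


# Illustrative numbers only: RAG prompts carry retrieved chunks, so they
# are assumed here to be about 10x longer than the fine-tuned model's prompts.
rag_cost = cost_per_1k_queries(2000, 150, 0.001, 0.002)
ft_cost = cost_per_1k_queries(200, 150, 0.001, 0.002)
print(f"RAG: ${rag_cost:.2f} per 1k queries, fine-tuned: ${ft_cost:.2f} per 1k queries")
```

Prices are passed per system because a hosted fine-tuned model may be billed at a different per-token rate than the base model, which can narrow or reverse the gap.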
Decision Framework
| Metric | RAG Advantage | Fine-tuning Advantage |
|---|---|---|
| Freshness | ✅ Real-time updates | ❌ Stale until retrained |
| Factual accuracy | ✅ Grounded in docs | ❌ Can hallucinate |
| Source traceability | ✅ Shows source chunks | ❌ Black box |
| Latency | ❌ Retrieval can add roughly 200-500 ms | ✅ Direct generation |
| Style/format consistency | ❌ Depends on prompt | ✅ Baked into weights |
| Domain jargon | ❌ May misinterpret | ✅ Learned from domain data |
| Cost per query | ❌ Embedding + retrieval | ✅ Shorter prompts |
| Maintenance | ❌ Index management | ❌ Periodic retraining |
When RAG Wins
- Knowledge changes frequently (product specs, policies, pricing)
- Factual accuracy and source attribution are required
- You don't have 1000+ labelled training examples
When Fine-tuning Wins
- Consistent output format/style is critical
- Domain jargon the base model doesn't know
- Latency is the bottleneck and retrieval adds too much
Best outcome: RAG + Fine-tuning together. Fine-tune for style/format, use RAG for factual grounding. This combination often outperforms either approach alone, but confirm it by running the combined system through the same golden set and benchmarks.
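The checklist above can be sketched as a small rule-based function. This is a starting heuristic, not a definitive policy: the `Requirements` fields, the `recommend` function, and the default threshold of 1000 examples are illustrative names and values derived from the bullets, and measured results on your golden set should override it.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    knowledge_changes_often: bool
    needs_source_attribution: bool
    labelled_examples: int
    needs_strict_format: bool
    unfamiliar_domain_jargon: bool
    latency_critical: bool


def recommend(req: Requirements, min_examples: int = 1000) -> str:
    """Map the checklist to 'rag', 'fine-tuning', or 'rag + fine-tuning'."""
    wants_rag = req.knowledge_changes_often or req.needs_source_attribution
    wants_ft = (
        req.needs_strict_format
        or req.unfamiliar_domain_jargon
        or req.latency_critical
    )
    can_fine_tune = req.labelled_examples >= min_examples

    # Without enough labelled data, or without any fine-tuning signal,
    # RAG is the default choice.
    if not (wants_ft and can_fine_tune):
        return "rag"
    # Signals on both sides point to the hybrid approach.
    if wants_rag:
        return "rag + fine-tuning"
    return "fine-tuning"
```

For example, a support bot with frequently changing policies, a strict response format, and 5,000 labelled examples maps to `"rag + fine-tuning"`, while the same bot with only 200 examples falls back to `"rag"`.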