Concept #37Hardsystem-design

How would you evaluate if your RAG system is better than fine-tuning?

#gen-ai#system-design#rag#fine-tuning

Answer

Evaluating RAG vs Fine-tuning

This is a structured A/B evaluation, not just a qualitative comparison. You need metrics, a test set, and a clear decision framework.

Build a Golden Evaluation Set First

python
# 50-100 hand-crafted question-answer pairs specific to your domain
golden_set = [
    {
        "question": "What is the return policy for electronics?",
        "expected_answer": "Electronics can be returned within 30 days with original packaging.",
        "answer_contains": ["30 days", "original packaging"],
        "source_doc": "returns_policy.pdf",
    },
    # ... more pairs
]

Automated Evaluation with RAGAS

python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
from datasets import Dataset

def evaluate_system(system_fn, golden_set: list[dict]) -> dict:
    results = []
    for item in golden_set:
        output = system_fn(item["question"])
        results.append({
            "question": item["question"],
            "answer": output["answer"],
            "contexts": output.get("contexts", []),
            "ground_truth": item["expected_answer"],
        })

    dataset = Dataset.from_list(results)
    scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall])
    return scores

# Evaluate RAG
rag_scores = evaluate_system(rag_pipeline.query, golden_set)

# Evaluate Fine-tuned model (no retrieval)
ft_scores = evaluate_system(finetuned_model.query, golden_set)

print(f"RAG:       faithfulness={rag_scores['faithfulness']:.3f}, relevancy={rag_scores['answer_relevancy']:.3f}")
print(f"Fine-tuned: faithfulness={ft_scores['faithfulness']:.3f}, relevancy={ft_scores['answer_relevancy']:.3f}")

Business Metrics to Compare

python
import time

def benchmark_systems(queries: list[str], n_runs: int = 10) -> dict:
    metrics = {"rag": {}, "fine_tuned": {}}

    for system_name, system_fn in [("rag", rag_fn), ("fine_tuned", ft_fn)]:
        latencies = []
        for query in queries:
            start = time.perf_counter()
            result = system_fn(query)
            latencies.append(time.perf_counter() - start)

        metrics[system_name] = {
            "p50_latency_ms": sorted(latencies)[len(latencies) // 2] * 1000,
            "p95_latency_ms": sorted(latencies)[int(len(latencies) * 0.95)] * 1000,
            "avg_tokens": measure_avg_tokens(queries, system_fn),
            "cost_per_1k_queries": estimate_cost(queries, system_fn),
        }
    return metrics

Decision Framework

MetricRAG AdvantageFine-tuning Advantage
Freshness✅ Real-time updates❌ Stale until retrained
Factual accuracy✅ Grounded in docs❌ Can hallucinate
Source traceability✅ Shows source chunks❌ Black box
Latency❌ Retrieval adds 200-500ms✅ Direct generation
Style/format consistency❌ Depends on prompt✅ Baked into weights
Domain jargon❌ May misinterpret✅ Learned from domain data
Cost per query❌ Embedding + retrieval✅ Shorter prompts
Maintenance❌ Index management❌ Periodic retraining

When RAG Wins

  • Knowledge changes frequently (product specs, policies, pricing)
  • Factual accuracy and source attribution are required
  • You don't have 1000+ labelled training examples

When Fine-tuning Wins

  • Consistent output format/style is critical
  • Domain jargon the base model doesn't know
  • Latency is the bottleneck and retrieval adds too much

Best outcome: RAG + Fine-tuning together. Fine-tune for style/format, use RAG for factual grounding. This combination consistently outperforms either alone in production.