Explain RAG. When would you use it over fine-tuning?

Question

Accepted Answer

## RAG vs Fine-tuning

RAG and fine-tuning are both ways to customise an LLM, but they solve different problems. Understanding the distinction is one of the most common Gen AI learning topics.

### Core Difference

* **RAG** — gives the model access to *external knowledge* at inference time (retrieval)
* **Fine-tuning** — changes the model's *weights* to alter its behaviour, style, or embedded knowledge

### Comparison Table

| Feature | RAG | Fine-tuning |
|---------|-----|-------------|
| **Knowledge source** | External documents retrieved at runtime | Baked into model weights |
| **Knowledge freshness** | Real-time (update the vector store) | Stale until retrained |
| **Training required** | No | Yes (GPU compute) |
| **Cost** | Embedding + retrieval costs | Training cost (one-time) |
| **Traceability** | Can show source documents | Black box |
| **Hallucination risk** | Lower (grounded in retrieved docs) | Higher for factual claims |
| **Domain adaptation** | Good for factual Q&A | Good for style, format, task behaviour |
| **Data needed** | Documents to index | Labelled training examples |
| **Best for** | "What does our policy say about X?" | "Write code in our internal style" |

### RAG Example

```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA

# Index your documents once
vectorstore = Chroma.from_documents(documents, OpenAIEmbeddings())

# Retrieve + generate at query time
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5})
)

answer = qa_chain.invoke("What is our refund policy?")
print(answer["result"])
```

### When to Use What

**Use RAG when:**
- Knowledge changes frequently (product docs, policies, news)
- You need citations/traceability
- You want to avoid hallucinations on factual questions
- Your data is confidential and can't be used for training

**Use fine-tuning when:**
- You need a specific output format/style consistently
- You want to teach the model a proprietary task (e.g. domain-specific code patterns)
- You want to distil a long system prompt into weights (saves tokens per request)

**Use both (RAG + Fine-tuning):**
- Fine-tune for style and format
- RAG for up-to-date factual grounding
- This combination outperforms either approach alone in production

> **Best practice in production:** Combine both — fine-tune for style and format, RAG for dynamic knowledge. This is called **RAG + Fine-tuning** and outperforms either approach alone.

Explain RAG. When would you use it over fine-tuning?

Answer

RAG vs Fine-tuning

Core Difference

Comparison Table

RAG Example

When to Use What

Related Concepts

Explain the Transformer architecture. What are attention mechanisms and why are they important?

What's the difference between a Large Language Model (LLM) and other ML models?

Explain these LLM concepts: Tokens, Context window, Temperature & Top-p sampling, Beam search.

What's the difference between encoder-only, decoder-only, and encoder-decoder models?

Explain quantization in LLMs. Why is it important?

Feature	RAG	Fine-tuning
Knowledge source	External documents retrieved at runtime	Baked into model weights
Knowledge freshness	Real-time (update the vector store)	Stale until retrained
Training required	No	Yes (GPU compute)
Cost	Embedding + retrieval costs	Training cost (one-time)
Traceability	Can show source documents	Black box
Hallucination risk	Lower (grounded in retrieved docs)	Higher for factual claims
Domain adaptation	Good for factual Q&A	Good for style, format, task behaviour
Data needed	Documents to index	Labelled training examples
Best for	"What does our policy say about X?"	"Write code in our internal style"