Walk me through a RAG pipeline architecture.

Question

Accepted Answer

## Building a RAG Pipeline

A RAG (Retrieval-Augmented Generation) pipeline has two phases: **indexing** (offline) and **retrieval + generation** (online).

### Architecture Overview

```mermaid
graph LR
    D[Documents] --> C[Chunking]
    C --> E[Embedding Model]
    E --> V[(Vector Store)]
    Q[User Query] --> QE[Query Embedder]
    QE --> V
    V --> R[Top-K Chunks]
    R --> P[Prompt Builder]
    Q --> P
    P --> L[LLM]
    L --> A[Answer]
```
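
To make the diagram concrete, here is a minimal, dependency-free sketch of both phases. It is an illustration only: `embed` is a toy bag-of-words stand-in for a real embedding model, the three sample chunks are invented, and the names `index`, `retrieve`, and `build_prompt` are this sketch's own rather than part of any library.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Indexing phase (offline): embed each chunk once and keep (vector, text) pairs.
toy_chunks = [  # invented sample data
    "Employees receive 16 weeks of parental leave.",
    "The office is closed on public holidays.",
    "Expense reports are due at the end of each month.",
]
index = [(embed(c), c) for c in toy_chunks]

# Retrieval + generation phase (online): embed the query, rank chunks, build a prompt.
def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(query: str, context: list[str]) -> str:
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\n\nAnswer:"

top = retrieve("How long is parental leave?")
print(build_prompt("How long is parental leave?", top))  # this string would be sent to the LLM
```

The steps below swap each toy piece for a real component: a text splitter, an embedding model, a vector store, and an LLM call.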

### Step 1: Document Loading & Chunking

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()

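splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # measured in characters by default (~80-100 English words), not tokens
    chunk_overlap=64,     # overlap to avoid losing context at boundaries
    separators=["\n\n", "\n", ". ", " ", ""]  # try paragraph breaks first, then lines, sentences, words
)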
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
```
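
To show what `chunk_overlap` does, the following sketch implements a plain sliding window over characters. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on the listed separators); `chunk_text` is a hypothetical helper written for this answer.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each window advances
    # Stop once the remaining tail is already covered by the previous window's overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so a sentence that straddles a boundary is still fully present in at least part of two chunks.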

### Step 2: Embedding & Indexing

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Embed and store all chunks (run once)
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
```
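
The "run once" comment is the important part: indexing is the expensive step, so the vectors are persisted and reloaded at query time. The sketch below shows that idea with a hypothetical `TinyVectorStore` that saves to a JSON file. It is not Chroma's storage format or API, and the two-dimensional vectors are made up for illustration.

```python
import json
import math
import os
import tempfile

class TinyVectorStore:
    """Hypothetical in-memory store: a list of {vector, text} rows with JSON persistence."""

    def __init__(self, rows=None):
        self.rows = rows or []

    def add(self, vector, text):
        self.rows.append({"vector": list(vector), "text": text})

    def save(self, path):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.rows, f)

    @classmethod
    def load(cls, path):
        with open(path, encoding="utf-8") as f:
            return cls(json.load(f))

    def search(self, query, k=1):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.hypot(*a) * math.hypot(*b)
            return dot / norm if norm else 0.0
        ranked = sorted(self.rows, key=lambda r: cosine(query, r["vector"]), reverse=True)
        return [r["text"] for r in ranked[:k]]

# Index once (offline) and write the result to disk.
store = TinyVectorStore()
store.add([1.0, 0.0], "leave policy chunk")
store.add([0.0, 1.0], "expense policy chunk")
path = os.path.join(tempfile.mkdtemp(), "index.json")
store.save(path)

# Reload at query time without re-embedding anything.
reloaded = TinyVectorStore.load(path)
print(reloaded.search([0.9, 0.1], k=1))  # ['leave policy chunk']
```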

### Step 3: Retrieval

```python
retriever = vectorstore.as_retriever(
    search_type="mmr",        # Maximal Marginal Relevance — avoids redundant chunks
    search_kwargs={"k": 5, "fetch_k": 20}
)

docs = retriever.invoke("What is the parental leave policy?")
for doc in docs:
    print(doc.page_content[:200])
```
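
For intuition about what MMR does with the candidates it fetches, here is a small pure-Python sketch of the greedy selection rule. It is an assumption-laden illustration, not LangChain's implementation: `mmr` and its `lambda_mult` trade-off parameter are defined here, the vectors are toy values, and the sketch skips the initial `fetch_k` similarity prefilter.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalize similarity to anything already selected.
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [2.0, 1.0]
docs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # docs 0 and 1 are exact duplicates
print(mmr(query, docs, k=2))                   # [0, 2]
print(mmr(query, docs, k=2, lambda_mult=1.0))  # [0, 1], i.e. plain similarity ranking
```

With the default trade-off, the duplicate chunk is skipped in favor of a less similar but new one, which is the behavior the `search_type="mmr"` setting is chosen for.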

### Step 4: Generation with Retrieved Context

```python
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template='''Use ONLY the context below to answer the question.
If the answer is not in the context, say "I don't know."

Context:
{context}

Question: {question}

Answer:'''
)

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o", temperature=0),
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt_template},
    return_source_documents=True
)

result = qa_chain.invoke({"query": "What is the parental leave policy?"})  # RetrievalQA's input key is "query"
print(result["result"])
print("Sources:", [d.metadata["source"] for d in result["source_documents"]])
```
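
Under the hood, this kind of chain concatenates the retrieved chunks and substitutes them into the `{context}` slot of the template. The sketch below reproduces that step by hand so the final prompt can be inspected. `Doc`, `stuff_prompt`, and the `max_chars` budget are hypothetical names introduced here, and joining chunks with a blank line is an assumption about formatting, not a guarantee of what the library does.

```python
from dataclasses import dataclass, field

TEMPLATE = """Use ONLY the context below to answer the question.
If the answer is not in the context, say "I don't know."

Context:
{context}

Question: {question}

Answer:"""

@dataclass
class Doc:
    """Minimal stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def stuff_prompt(question: str, docs: list, max_chars: int = 4000) -> str:
    """Label each chunk with its source and stop adding chunks once the budget is used up."""
    parts, used = [], 0
    for doc in docs:
        piece = f"[{doc.metadata.get('source', 'unknown')}] {doc.page_content}"
        if used + len(piece) > max_chars:
            break
        parts.append(piece)
        used += len(piece)
    return TEMPLATE.format(context="\n\n".join(parts), question=question)

docs = [
    Doc("A" * 10, {"source": "a.pdf"}),
    Doc("B" * 10, {"source": "b.pdf"}),
]
print(stuff_prompt("What is the parental leave policy?", docs))
```

Prefixing each chunk with its source also lets the model cite where an answer came from.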

### Key Design Decisions

| Decision | Options | Recommendation |
|----------|---------|----------------|
| **Chunk size** | 128–2048 tokens | 512 tokens (balance context vs precision) |
| **Overlap** | 0–20% | 10–15% (avoids boundary loss) |
| **Embedding model** | `text-embedding-3-small`, BGE, E5 | `text-embedding-3-small` (cost-effective) |
| **Vector DB** | Chroma, Pinecone, Weaviate, FAISS | Chroma (local), Pinecone (production) |
| **Top-K** | 3–10 | 5 (balance context length vs noise) |
| **Retrieval strategy** | Cosine similarity, MMR, hybrid | MMR (reduces redundancy) |
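
As a rough sanity check on these numbers, the following sketch works out what the recommended settings imply. It assumes fixed-stride windows measured in tokens, which real splitters only approximate; `chunking_plan` and the 10,000-token document are hypothetical.

```python
import math

def chunking_plan(total_tokens: int, chunk_size: int = 512,
                  overlap_ratio: float = 0.125, top_k: int = 5) -> dict:
    """Estimate overlap, stride, chunk count, and the upper bound on retrieved context."""
    overlap = round(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # new tokens contributed by each additional chunk
    if total_tokens <= 0:
        chunks = 0
    elif total_tokens <= chunk_size:
        chunks = 1
    else:
        chunks = 1 + math.ceil((total_tokens - chunk_size) / step)
    return {
        "overlap": overlap,
        "step": step,
        "chunks": chunks,
        "context_tokens": top_k * chunk_size,  # upper bound on tokens passed to the LLM
    }

print(chunking_plan(10_000))
# {'overlap': 64, 'step': 448, 'chunks': 23, 'context_tokens': 2560}
```

A 12.5% overlap on 512-token chunks is 64 tokens, and a top-K of 5 caps the retrieved context at 2,560 tokens before the prompt template and question are added.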

> **Production tip:** Always return source documents and display them to users. Traceability is one of the main advantages of RAG over fine-tuning, because each answer can be checked against the passages it was built from.
