Concept #8 | Hard | gen-ai-fundamentals

Walk me through a RAG pipeline architecture.

#gen-ai #rag #system-design

Answer

Building a RAG Pipeline

A RAG (Retrieval-Augmented Generation) pipeline has two phases: indexing (offline) and retrieval + generation (online).

Architecture Overview
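As a rough, dependency-free sketch of the two phases, the flow can be written in a few plain functions. Everything here is a stand-in chosen for illustration: `embed` is a toy bag-of-words counter rather than a real embedding model, the three sample sentences are invented, and `build_index`, `retrieve`, and `build_prompt` are hypothetical helper names, not library APIs.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Phase 1 (offline): embed every chunk once and store (vector, text) pairs.
def build_index(chunks):
    return [(embed(chunk), chunk) for chunk in chunks]

# Phase 2 (online): embed the query, rank stored chunks, keep the top k.
def retrieve(index, query, k=2):
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

# Phase 2 (online): place the retrieved chunks in the prompt sent to the LLM.
def build_prompt(context_chunks, question):
    context = "\n\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"

# Invented sample chunks, for illustration only.
index = build_index([
    "Parental leave is 16 weeks for all employees.",
    "The office is closed on public holidays.",
    "Expense reports are due by the fifth of each month.",
])
question = "How long is parental leave?"
top = retrieve(index, question, k=1)
prompt = build_prompt(top, question)
```

The steps below replace each stand-in with a real component: a document loader and splitter, an embedding model with a vector store, a retriever, and an LLM call.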

Step 1: Document Loading & Chunking

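```python
from langchain_community.document_loaders import PyPDFLoader
# In older LangChain releases this import lives at langchain.text_splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the PDF; PyPDFLoader returns one Document per page
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()

# chunk_size and chunk_overlap are measured in characters by default, not tokens.
# For token-based sizes, RecursiveCharacterTextSplitter.from_tiktoken_encoder(...) can be used.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # 512 characters, roughly 80-100 English words per chunk
    chunk_overlap=64,     # overlap to avoid losing context at boundaries
    separators=["\n\n", "\n", ". ", " ", ""]  # try paragraph, line, sentence, word, then character splits
)
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
```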
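To make the effect of overlap concrete, here is a minimal sliding-window chunker. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on the listed separators); `chunk_text` is a hypothetical helper written only to show the arithmetic: each chunk starts `chunk_size - overlap` characters after the previous one.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, overlap=3)
print(pieces)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The last three characters of each piece reappear at the start of the next one, so a sentence cut at a boundary is still fully visible in at least one chunk when the overlap is longer than the sentence fragment.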

Step 2: Embedding & Indexing

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Embed and store all chunks (run once)
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
```

Step 3: Retrieval

```python
retriever = vectorstore.as_retriever(
    search_type="mmr",        # Maximal Marginal Relevance: avoids redundant chunks
    search_kwargs={"k": 5, "fetch_k": 20}  # fetch 20 candidates by similarity, return the 5 chosen by MMR
)

docs = retriever.invoke("What is the parental leave policy?")
for doc in docs:
    print(doc.page_content[:200])  # preview the first 200 characters of each chunk
```
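For intuition about what `k` and `fetch_k` do under MMR, the selection can be sketched in plain Python. This follows the standard MMR formulation (relevance to the query minus similarity to already-selected results, weighted by a `lambda_mult` parameter); it is an illustrative sketch, and the vector store's actual implementation may differ in detail. The 2-D vectors are made up for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, k=5, fetch_k=20, lambda_mult=0.5):
    """Return indices of k documents chosen by Maximal Marginal Relevance."""
    # Stage 1: keep the fetch_k candidates most similar to the query.
    candidates = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )[:fetch_k]
    selected = []
    # Stage 2: greedily add the candidate with the best relevance-minus-redundancy score.
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [1.0, 0.0]
docs_vecs = [
    [1.0, 0.2],    # most relevant
    [1.0, 0.25],   # near-duplicate of the first vector
    [1.0, -0.6],   # less relevant, but covers a different direction
]
picked = mmr(query, docs_vecs, k=2, fetch_k=3)
print(picked)  # [0, 2]
```

Plain similarity search would return indices 0 and 1 here, which are near-duplicates. MMR keeps index 0 and then prefers index 2 because it adds information the first result does not already cover.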

Step 4: Generation with Retrieved Context

```python
from langchain_openai import ChatOpenAI
# RetrievalQA is a legacy chain; newer LangChain releases steer users toward
# create_retrieval_chain or LCEL, so this import may need adjusting by version.
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate

# Prompt that restricts the model to the retrieved context
prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template='''Use ONLY the context below to answer the question.
If the answer is not in the context, say "I don't know."

Context:
{context}

Question: {question}

Answer:'''
)

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o", temperature=0),  # temperature=0 for more deterministic answers
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt_template},
    return_source_documents=True
)

# RetrievalQA expects its input under the "query" key
result = qa_chain.invoke({"query": "What is the parental leave policy?"})
print(result["result"])
print("Sources:", [d.metadata["source"] for d in result["source_documents"]])
```

Key Design Decisions

| Decision | Options | Recommendation |
|---|---|---|
| Chunk size | 128–2048 tokens | 512 tokens (balance context vs precision) |
| Overlap | 0–20% | 10–15% (avoids boundary loss) |
| Embedding model | `text-embedding-3-small`, BGE, E5 | `text-embedding-3-small` (cost-effective) |
| Vector DB | Chroma, Pinecone, Weaviate, FAISS | Chroma (local), Pinecone (production) |
| Top-K | 3–10 | 5 (balance context length vs noise) |
| Retrieval strategy | Cosine similarity, MMR, hybrid | MMR (reduces redundancy) |

Production tip: Always return source documents and display them to users. Traceability is one of RAG's main advantages over fine-tuning: every answer can be checked against the passages it was generated from.
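One way to turn the returned documents into a short citation list is sketched below. `format_sources` is a hypothetical helper, and it assumes each document's metadata carries a `source` key and, optionally, a 0-based `page` key (the shape PDF loaders commonly produce); other loaders may use different keys.

```python
def format_sources(metadatas):
    """Build de-duplicated 'file (page N)' labels, keeping first-seen order."""
    seen = set()
    labels = []
    for meta in metadatas:
        label = meta.get("source", "unknown source")
        if "page" in meta:
            label += f" (page {meta['page'] + 1})"  # assumes 0-based page numbers
        if label not in seen:
            seen.add(label)
            labels.append(label)
    return labels

# Stand-in metadata dicts; with a real chain these would come from the retrieved documents.
sample_metadata = [
    {"source": "company_handbook.pdf", "page": 11},
    {"source": "company_handbook.pdf", "page": 11},  # second chunk from the same page
    {"source": "company_handbook.pdf", "page": 3},
]
citations = format_sources(sample_metadata)
print(citations)  # ['company_handbook.pdf (page 12)', 'company_handbook.pdf (page 4)']
```

With the chain from Step 4, the call would look like `format_sources([d.metadata for d in result["source_documents"]])`.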