Walk me through a RAG pipeline architecture.
#gen-ai #rag #system-design
Answer
Building a RAG Pipeline
A RAG (Retrieval-Augmented Generation) pipeline has two phases: indexing (offline) and retrieval + generation (online). Steps 1 and 2 below cover indexing; Steps 3 and 4 cover retrieval and generation.
Architecture Overview
Step 1: Document Loading & Chunking
```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the PDF; PyPDFLoader returns one Document per page
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # counted in characters by default (roughly 80-100 words), not tokens
    chunk_overlap=64,  # overlap to avoid losing context at boundaries
    separators=["\n\n", "\n", ". ", " ", ""],  # tried in order, coarsest first
)
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
```
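To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a minimal stdlib-only sketch of a sliding-window splitter. It is a simplification, not what `RecursiveCharacterTextSplitter` does internally: the real splitter first tries to break on the listed separators, whereas this hypothetical `chunk_text` helper cuts at fixed character offsets.

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 64) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "x" * 1000  # placeholder text, 1000 characters long
pieces = chunk_text(sample)
print([len(p) for p in pieces])  # [512, 512, 104]
```

Each window starts 448 characters after the previous one, so the last 64 characters of one chunk reappear at the start of the next.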
Step 2: Embedding & Indexing
```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Embed and store all chunks (run once)
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)
```
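Conceptually, a vector store keeps each chunk next to its embedding and ranks chunks by similarity to the query embedding. The sketch below illustrates that idea with the standard library only. The bag-of-words `embed` function and the `ToyVectorStore` class are hypothetical stand-ins for a real embedding model and for Chroma, and the sample sentences are made up.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory index: stores (vector, text) pairs, brute-force search."""

    def __init__(self) -> None:
        self.rows: list[tuple[Counter, str]] = []

    def add(self, texts: list[str]) -> None:
        # Indexing phase: embed every chunk once and keep the vector with its text.
        self.rows.extend((embed(t), t) for t in texts)

    def search(self, query: str, k: int = 2) -> list[str]:
        # Query phase: embed the query the same way, then rank by similarity.
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add([  # made-up sample chunks
    "parental leave is sixteen weeks",
    "expense reports are due monthly",
    "office hours are nine to five",
])
print(store.search("parental leave policy", k=1))
```

The key constraint carries over to real systems: the query must be embedded with the same model that was used at indexing time, otherwise the similarity scores are meaningless.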
Step 3: Retrieval
```python
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximal Marginal Relevance: avoids redundant chunks
    search_kwargs={"k": 5, "fetch_k": 20},  # fetch 20 candidates, keep 5
)

docs = retriever.invoke("What is the parental leave policy?")
for doc in docs:
    print(doc.page_content[:200])
```
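For intuition about what MMR does with the `fetch_k` candidates, here is a minimal sketch of the greedy selection loop. It assumes the similarity scores are already computed. The `mmr` function, its `lambda_mult` trade-off parameter, and the example scores are illustrative and not taken from any library's internals.

```python
def mmr(query_sims, doc_sims, k=5, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by Maximal Marginal Relevance.

    query_sims[i]  : similarity of candidate i to the query
    doc_sims[i][j] : similarity between candidates i and j
    lambda_mult    : 1.0 means pure relevance, 0.0 means pure diversity
    """
    selected: list[int] = []
    remaining = list(range(len(query_sims)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            # Penalize a candidate by its closest match among already-picked chunks.
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sims[i] - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up scores: candidates 0 and 1 are near-duplicates, candidate 2 is distinct.
query_sims = [0.90, 0.85, 0.60]
doc_sims = [
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.10],
    [0.10, 0.10, 1.00],
]
print(mmr(query_sims, doc_sims, k=2))                   # [0, 2]
print(mmr(query_sims, doc_sims, k=2, lambda_mult=1.0))  # [0, 1]
```

With the redundancy penalty active, the near-duplicate candidate 1 loses to the less relevant but distinct candidate 2. Setting `lambda_mult=1.0` reduces the procedure to plain top-k ranking.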
Step 4: Generation with Retrieved Context
```python
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template='''Use ONLY the context below to answer the question.
If the answer is not in the context, say "I don't know."

Context: {context}

Question: {question}
Answer:''',
)

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o", temperature=0),
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt_template},
    return_source_documents=True,  # keep the retrieved chunks for citation
)

result = qa_chain.invoke("What is the parental leave policy?")
print(result["result"])
print("Sources:", [d.metadata["source"] for d in result["source_documents"]])
```
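Under the hood, the generation step amounts to pasting the retrieved chunks into the prompt before calling the model. The sketch below shows that "stuffing" step in isolation. The `build_prompt` helper is hypothetical, the template mirrors the one above, and the chunk text is a placeholder rather than real handbook content.

```python
PROMPT = (
    "Use ONLY the context below to answer the question.\n"
    'If the answer is not in the context, say "I don\'t know."\n\n'
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(chunks: list[str], question: str) -> str:
    """Join the retrieved chunks into one context string and fill the template."""
    context = "\n\n".join(chunks)  # blank line between chunks keeps them distinguishable
    return PROMPT.format(context=context, question=question)


# Placeholder chunks standing in for retriever output.
prompt = build_prompt(
    ["<chunk 1 text>", "<chunk 2 text>"],
    "What is the parental leave policy?",
)
print(prompt)
```

Because every retrieved chunk ends up in a single prompt, the number and size of chunks directly determine how much of the model's context window the request consumes.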
Key Design Decisions
| Decision | Options | Recommendation |
|---|---|---|
| Chunk size | 128–2048 tokens | 512 tokens (balance context vs precision) |
| Overlap | 0–20% | 10–15% (avoids boundary loss) |
| Embedding model | text | text |
| Vector DB | Chroma, Pinecone, Weaviate, FAISS | Chroma (local), Pinecone (production) |
| Top-K | 3–10 | 5 (balance context length vs noise) |
| Retrieval strategy | Cosine similarity, MMR, hybrid | MMR (reduces redundancy) |
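As a quick sanity check on the numbers in the table, the sketch below computes the overlap ratio and an upper bound on retrieved context size. The helper names are hypothetical, and the bound ignores the prompt template and the question itself.

```python
def overlap_ratio(chunk_size: int, chunk_overlap: int) -> float:
    """Fraction of each chunk that is shared with its neighbor."""
    return chunk_overlap / chunk_size


def max_context_tokens(top_k: int, chunk_tokens: int) -> int:
    """Upper bound on tokens contributed by retrieved chunks."""
    return top_k * chunk_tokens


print(overlap_ratio(512, 64))      # 0.125, i.e. 12.5%, inside the 10-15% band
print(max_context_tokens(5, 512))  # 2560
```

The ratio is the same whether sizes are counted in characters or tokens, but the context bound only holds if chunks are actually sized in tokens.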
Production tip: Always return source documents and display them to users. Traceability, meaning the ability to check an answer against the passages it was drawn from, is one of the main advantages of RAG over fine-tuning.
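One way to surface sources is to turn chunk metadata into short, de-duplicated citations. The sketch below assumes each chunk carries a `source` key and, optionally, a 0-indexed `page` key. The `format_sources` helper and the sample metadata are hypothetical.

```python
def format_sources(metadatas: list[dict]) -> list[str]:
    """Turn chunk metadata into de-duplicated, human-readable citations."""
    seen: set[str] = set()
    citations: list[str] = []
    for meta in metadatas:
        label = str(meta.get("source", "unknown source"))
        if "page" in meta:
            # Assumption: page numbers are 0-indexed, so add 1 for display.
            label += f", p. {meta['page'] + 1}"
        if label not in seen:  # several chunks often come from the same page
            seen.add(label)
            citations.append(label)
    return citations


# Hypothetical metadata, shaped like what a PDF loader might attach to each chunk.
retrieved = [
    {"source": "company_handbook.pdf", "page": 11},
    {"source": "company_handbook.pdf", "page": 11},
    {"source": "company_handbook.pdf", "page": 12},
]
print(format_sources(retrieved))
# ['company_handbook.pdf, p. 12', 'company_handbook.pdf, p. 13']
```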