Design a RAG system for a customer support chatbot. Walk through your architecture.
#gen-ai#system-design#rag
Answer
RAG System for Customer Support Chatbot
High-Level Architecture
Component Breakdown
1. Document Ingestion Pipeline (Offline)
pythonfrom langchain_community.document_loaders import DirectoryLoader, PyPDFLoader from langchain.text_splitter import RecursiveCharacterTextSplitter from langchain_openai import OpenAIEmbeddings from langchain_community.vectorstores import Pinecone def build_index(docs_path: str): # Load all support docs, FAQs, policy PDFs loader = DirectoryLoader(docs_path, glob="**/*.pdf", loader_cls=PyPDFLoader) documents = loader.load() splitter = RecursiveCharacterTextSplitter( chunk_size=512, chunk_overlap=64, separators=["\n\n", "\n", ". "] ) chunks = splitter.split_documents(documents) # Add metadata for filtering for chunk in chunks: chunk.metadata["doc_type"] = classify_doc(chunk.metadata["source"]) embeddings = OpenAIEmbeddings(model="text-embedding-3-small") vectorstore = Pinecone.from_documents(chunks, embeddings, index_name="support-kb") return vectorstore
2. Query Pipeline (Online)
pythonfrom langchain_openai import ChatOpenAI from langchain.chains import RetrievalQA from langchain.prompts import ChatPromptTemplate SYSTEM_PROMPT = '''You are a helpful customer support agent for AcmeCorp. Answer questions using ONLY the provided context. If the answer is not in the context, say: "I don't have that information. Let me connect you with a human agent." Always be polite and concise. Context: {context}''' def build_rag_chain(vectorstore): llm = ChatOpenAI(model="gpt-4o", temperature=0) retriever = vectorstore.as_retriever( search_type="mmr", search_kwargs={"k": 5, "fetch_k": 20, "filter": {"doc_type": "support"}} ) prompt = ChatPromptTemplate.from_messages([ ("system", SYSTEM_PROMPT), ("human", "{question}"), ]) return RetrievalQA.from_chain_type( llm=llm, retriever=retriever, chain_type_kwargs={"prompt": prompt}, return_source_documents=True, )
Key Design Decisions
| Decision | Choice | Reason |
|---|---|---|
| Chunk size | 512 tokens | Balance context vs precision |
| Retrieval | MMR (Maximal Marginal Relevance) | Avoid redundant chunks |
| Embedding model | text-embedding-3-small | Best cost-quality for support docs |
| LLM temperature | 0.0 | Deterministic, factual answers |
| Metadata filtering | doc_type field | Separate support vs marketing docs |
| Fallback | Human agent handoff | For unanswerable queries |
Scalability Considerations
- Vector DB: Pinecone (managed, scales automatically)
- Caching: Redis for exact-match query cache (reduces API costs ~40%)
- Rate limiting: Token bucket per user (prevent abuse)
- Async API: FastAPI + async OpenAI client for concurrency
- Monitoring: LangSmith for tracing + custom metrics (RAGAS scores)