Concept #35 · Hard · system-design

Design a RAG system for a customer support chatbot. Walk through your architecture.

#gen-ai #system-design #rag

Answer

RAG System for Customer Support Chatbot

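High-Level Architecture

The system has two paths: an offline ingestion pipeline that loads, chunks, embeds, and indexes the support documents, and an online query pipeline that retrieves the most relevant chunks and passes them to the LLM to generate a grounded answer.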

Component Breakdown

1. Document Ingestion Pipeline (Offline)

python
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Pinecone

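def classify_doc(source: str) -> str:
    # Placeholder (assumption): the original does not define classify_doc.
    # Here the label is derived from the file path; swap in real logic as needed.
    return "marketing" if "marketing" in source.lower() else "support"


def build_index(docs_path: str):
    # Load all support docs, FAQs, policy PDFs
    loader = DirectoryLoader(docs_path, glob="**/*.pdf", loader_cls=PyPDFLoader)
    documents = loader.load()

    # Measure chunk_size in tokens (requires tiktoken). The default length
    # function counts characters, which would give 512-character chunks.
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",
        chunk_size=512,
        chunk_overlap=64,
        # The trailing " " and "" let the splitter fall back to words and
        # characters when a passage has no paragraph or sentence breaks.
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    chunks = splitter.split_documents(documents)

    # Add metadata for filtering
    for chunk in chunks:
        chunk.metadata["doc_type"] = classify_doc(chunk.metadata["source"])

    # Embed the chunks and write them to the "support-kb" Pinecone index
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vectorstore = Pinecone.from_documents(chunks, embeddings, index_name="support-kb")
    return vectorstore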
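
To make the chunk_size and chunk_overlap settings concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on the listed separators). The helper name `sliding_chunks` is hypothetical, and a plain list of strings stands in for real tokens.

```python
from typing import List


def sliding_chunks(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks
```

With chunk_size=512 and chunk_overlap=64, each window advances by 448 tokens, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.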

2. Query Pipeline (Online)

python
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import ChatPromptTemplate

SYSTEM_PROMPT = '''You are a helpful customer support agent for AcmeCorp.
Answer questions using ONLY the provided context.
If the answer is not in the context, say: "I don't have that information. Let me connect you with a human agent."
Always be polite and concise.

Context:
{context}'''

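def build_rag_chain(vectorstore):
    # temperature=0 keeps answers as consistent as possible across calls
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    # MMR: fetch the 20 nearest chunks, then keep 5 that are relevant but
    # not redundant; the filter limits retrieval to doc_type == "support"
    retriever = vectorstore.as_retriever(
        search_type="mmr",
        search_kwargs={"k": 5, "fetch_k": 20, "filter": {"doc_type": "support"}}
    )
    prompt = ChatPromptTemplate.from_messages([
        ("system", SYSTEM_PROMPT),
        ("human", "{question}"),
    ])
    # The default "stuff" chain fills {context} with the retrieved chunks and
    # {question} with the user's query; call it with chain.invoke({"query": ...})
    return RetrievalQA.from_chain_type(
        llm=llm, retriever=retriever,
        chain_type_kwargs={"prompt": prompt},
        return_source_documents=True,
    )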
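
With return_source_documents=True, the chain returns a dict with "result" and "source_documents" keys. The sketch below shows one way the API layer might turn that dict into a chat reply and decide on human handoff. `FALLBACK_MARKER`, `Doc`, and `to_reply` are hypothetical names, and `Doc` is only a stand-in for a LangChain Document so the example runs without any API keys.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Phrase the system prompt tells the model to use when the context has no answer.
FALLBACK_MARKER = "I don't have that information"


@dataclass
class Doc:
    """Stand-in for a LangChain Document (page_content plus metadata)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def to_reply(chain_output: Dict[str, Any]) -> Dict[str, Any]:
    """Turn a RetrievalQA-style output dict into a reply for the chat frontend."""
    answer = chain_output["result"].strip()
    docs: List[Doc] = chain_output.get("source_documents", [])
    # De-duplicate source paths while keeping retrieval order.
    sources = list(dict.fromkeys(d.metadata.get("source", "unknown") for d in docs))
    return {
        "answer": answer,
        "sources": sources,
        # Route to a human when the model used the fallback sentence.
        "handoff": FALLBACK_MARKER.lower() in answer.lower(),
    }
```

String matching on the fallback sentence is the simplest option. A production system might instead ask the model for a structured "answerable" flag, which this sketch does not cover.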

Key Design Decisions

| Decision | Choice | Reason |
| --- | --- | --- |
| Chunk size | 512 tokens | Balance context vs. precision |
| Retrieval | MMR (Maximal Marginal Relevance) | Avoid redundant chunks |
| Embedding model | text-embedding-3-small | Good cost-quality trade-off for support docs |
| LLM temperature | 0.0 | Near-deterministic, factual answers |
| Metadata filtering | doc_type field | Separate support vs. marketing docs |
| Fallback | Human agent handoff | For unanswerable queries |
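
To show why MMR avoids redundant chunks, here is a minimal pure-Python sketch of the selection step. It is an illustration of the general technique, not the vector store's actual implementation. The function names are hypothetical; `lambda_mult` mirrors the name LangChain uses for the relevance/diversity weight, `candidates` plays the role of the fetch_k pool, and `k` is the number of chunks kept.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query: Sequence[float], candidates: List[Sequence[float]],
        k: int, lambda_mult: float = 0.5) -> List[int]:
    """Return indices of up to k candidates chosen by Maximal Marginal Relevance."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Similarity to the closest already-selected chunk (0 for the first pick).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With lambda_mult=1.0 this reduces to plain top-k by similarity. Lower values penalize candidates that are close to something already selected, so near-duplicate FAQ entries are less likely to fill the context window.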

Scalability Considerations

  • Vector DB: Pinecone (managed, scales automatically)
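  • Caching: Redis exact-match query cache, so repeated questions skip retrieval and the LLM call (the ~40% API cost reduction is an estimate and depends on the cache hit rate)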
  • Rate limiting: Token bucket per user (prevent abuse)
  • Async API: FastAPI + async OpenAI client for concurrency
  • Monitoring: LangSmith for tracing + custom metrics (RAGAS scores)
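
Two of the bullets above can be sketched in a few lines: a token bucket rate limiter and an exact-match answer cache. This is a rough sketch under stated assumptions, not a production design. All class and function names are hypothetical, a dict stands in for the Redis client, and the cache key normalizes case and whitespace before hashing (a small extension of strict exact matching).

```python
import hashlib
import time
from typing import Callable, Dict, Optional


class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def cache_key(question: str) -> str:
    """Normalize case and whitespace so trivially different inputs share a key."""
    normalized = " ".join(question.lower().split())
    return "rag:answer:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class AnswerCache:
    """In-memory stand-in for a Redis-backed cache (a dict replaces the client)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get(self, question: str) -> Optional[str]:
        return self._store.get(cache_key(question))

    def put(self, question: str, answer: str) -> None:
        self._store[cache_key(question)] = answer
```

For per-user limiting, one bucket per user ID (for example in a dict keyed by user) would be checked before the cache lookup. A real Redis-backed cache would also set an expiry on each entry so answers do not outlive the documents they came from.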