Design a RAG system for a customer support chatbot. Walk through your architecture.

Question

Accepted Answer

## RAG System for Customer Support Chatbot

### High-Level Architecture

```mermaid
graph TD
    U[User Message] --> G[API Gateway]
    G --> SC[Safety Check]
    SC --> QR[Query Router]
    QR --> |In-scope| RET[Retriever]
    QR --> |Out-of-scope| FB[Fallback Handler]
    RET --> RR[Reranker]
    RR --> LLM[LLM Generator]
    LLM --> RES[Response]
    RES --> LOG[Logger / Analytics]
```
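The flow in the diagram can be sketched as plain function composition. This is a minimal illustration, not the answer's implementation: `handle_message`, `PipelineResult`, and the injected callables are hypothetical names, and the "blocked" branch for a failed safety check is an assumption, since the diagram does not show what happens in that case.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PipelineResult:
    answer: str
    route: str  # "blocked", "fallback", or "rag"
    sources: List[str] = field(default_factory=list)


def handle_message(
    message: str,
    is_safe: Callable[[str], bool],
    in_scope: Callable[[str], bool],
    retrieve: Callable[[str], List[str]],
    rerank: Callable[[str, List[str]], List[str]],
    generate: Callable[[str, List[str]], str],
    log: Callable[[PipelineResult], None] = lambda result: None,
) -> PipelineResult:
    """Run one user message through the stages shown in the diagram."""
    if not is_safe(message):
        # Assumption: unsafe input is refused before routing.
        result = PipelineResult("Sorry, I can't help with that request.", "blocked")
    elif not in_scope(message):
        # Out-of-scope branch of the query router: hand off instead of retrieving.
        result = PipelineResult("Let me connect you with a human agent.", "fallback")
    else:
        # In-scope branch: retrieve, rerank, then generate from the top chunks.
        chunks = rerank(message, retrieve(message))
        result = PipelineResult(generate(message, chunks), "rag", chunks)
    log(result)  # every outcome is logged, including refusals and handoffs
    return result
```

Passing each stage in as a callable keeps the orchestration testable without a vector store or an LLM; in production the callables would wrap the retriever, reranker, and chain described below.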

### Component Breakdown

**1. Document Ingestion Pipeline (Offline)**

```python
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Pinecone


def classify_doc(source: str) -> str:
    # Placeholder: this helper is not defined in the answer. Here the label is
    # inferred from the file path; swap in your own rule or classifier.
    return "marketing" if "marketing" in source.lower() else "support"


def build_index(docs_path: str):
    # Load all support docs, FAQs, policy PDFs
    loader = DirectoryLoader(docs_path, glob="**/*.pdf", loader_cls=PyPDFLoader)
    documents = loader.load()

    # Measure chunk_size in tokens; the plain constructor counts characters
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",
        chunk_size=512,
        chunk_overlap=64,
        separators=["\n\n", "\n", ". "],
    )
    chunks = splitter.split_documents(documents)

    # Add metadata for filtering
    for chunk in chunks:
        chunk.metadata["doc_type"] = classify_doc(chunk.metadata["source"])

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vectorstore = Pinecone.from_documents(chunks, embeddings, index_name="support-kb")
    return vectorstore
```

**2. Query Pipeline (Online)**

```python
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import ChatPromptTemplate

SYSTEM_PROMPT = '''You are a helpful customer support agent for AcmeCorp.
Answer questions using ONLY the provided context.
If the answer is not in the context, say: "I don't have that information. Let me connect you with a human agent."
Always be polite and concise.

Context:
{context}'''

def build_rag_chain(vectorstore):
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    retriever = vectorstore.as_retriever(
        search_type="mmr",
        search_kwargs={"k": 5, "fetch_k": 20, "filter": {"doc_type": "support"}}
    )
    prompt = ChatPromptTemplate.from_messages([
        ("system", SYSTEM_PROMPT),
        ("human", "{question}"),
    ])
    return RetrievalQA.from_chain_type(
        llm=llm, retriever=retriever,
        chain_type_kwargs={"prompt": prompt},
        return_source_documents=True,
    )
```
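The system prompt tells the model to answer with a fixed sentence when the context lacks the answer, so the caller can detect that sentence and trigger the human handoff. The sketch below assumes the chain returns a dict with `result` and `source_documents` keys (the shape `RetrievalQA` produces when `return_source_documents=True`); `postprocess` and `FALLBACK_MARKER` are hypothetical names introduced for illustration.

```python
from typing import Any, Dict

# Must match the refusal sentence used in SYSTEM_PROMPT.
FALLBACK_MARKER = "I don't have that information"


def postprocess(result: Dict[str, Any]) -> Dict[str, Any]:
    """Turn a raw chain result into an answer, a handoff flag, and citations."""
    answer = result["result"].strip()
    needs_handoff = FALLBACK_MARKER.lower() in answer.lower()
    # Deduplicate source paths so the UI can show each document once.
    sources = sorted({
        doc.metadata.get("source", "unknown")
        for doc in result.get("source_documents", [])
    })
    return {
        "answer": answer,
        "handoff": needs_handoff,
        # Retrieved chunks did not answer the question, so do not cite them.
        "sources": [] if needs_handoff else sources,
    }
```

A typical call would be `postprocess(chain.invoke({"query": user_message}))`. String matching on the refusal sentence is brittle if the prompt wording changes; a structured output field would be a sturdier signal.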

### Key Design Decisions

| Decision | Choice | Reason |
|----------|--------|--------|
| **Chunk size** | 512 tokens | Balance context vs precision |
| **Retrieval** | MMR (Maximal Marginal Relevance) | Avoid redundant chunks |
| **Embedding model** | text-embedding-3-small | Best cost-quality for support docs |
| **LLM temperature** | 0.0 | Deterministic, factual answers |
| **Metadata filtering** | doc_type field | Separate support vs marketing docs |
| **Fallback** | Human agent handoff | For unanswerable queries |
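To make the MMR row concrete, here is a small pure-Python sketch of greedy Maximal Marginal Relevance selection. It is an illustration of the technique, not the vector store's internal code; `mmr`, `cosine`, and the toy vectors are made up for the example, and `lambda_mult` plays the role of the relevance/diversity trade-off.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec: Sequence[float], doc_vecs: List[Sequence[float]],
        k: int, lambda_mult: float = 0.5) -> List[int]:
    """Greedily pick k document indices, trading relevance against redundancy."""
    selected: List[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected chunk (0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Two near-duplicate chunks (0 and 1) and one different chunk (2).
docs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]
print(mmr([1.0, 1.0], docs, k=2))                   # [0, 2]
print(mmr([1.0, 1.0], docs, k=2, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=1.0` the selection is pure similarity ranking and returns both near-duplicates; at 0.5 the second pick is the less similar but non-redundant chunk, which is the behavior the table row is after.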

### Scalability Considerations

- **Vector DB**: Pinecone (managed, scales automatically)
- **Caching**: Redis for exact-match query cache (estimated ~40% lower API costs; the real saving depends on how often users repeat the same question)
- **Rate limiting**: Token bucket per user (prevent abuse)
- **Async API**: FastAPI + async OpenAI client for concurrency
- **Monitoring**: LangSmith for tracing + custom metrics (RAGAS scores)
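The caching and rate-limiting bullets can be sketched without any external services. In this minimal example a plain dict stands in for Redis, and all names (`cache_key`, `answer_with_cache`, `TokenBucket`) are hypothetical; lowercasing and whitespace collapsing before hashing is an assumption about what "exact-match" should tolerate.

```python
import hashlib
import time
from typing import Callable, Dict


def cache_key(query: str) -> str:
    """Key for the exact-match cache: normalized query text, hashed."""
    normalized = " ".join(query.lower().split())
    return "rag:answer:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def answer_with_cache(query: str, cache: Dict[str, str],
                      generate: Callable[[str], str]) -> str:
    """Return a cached answer if present; otherwise generate and store it."""
    key = cache_key(query)
    if key in cache:
        return cache[key]
    answer = generate(query)
    cache[key] = answer  # with Redis this would be a SET with an expiry
    return answer


class TokenBucket:
    """Per-user token bucket: bursts up to `capacity`, refilled over time."""

    def __init__(self, capacity: int, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Add tokens for the elapsed time, never exceeding capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In a service, one `TokenBucket` would be kept per user ID and checked before the cache lookup, so abusive clients are rejected before they cost anything. Cached answers should also expire or be invalidated when the knowledge base is re-indexed.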
