What is stateless and stateful in API in LLM?
Answer
Stateless vs Stateful APIs in LLMs
A fundamental architectural decision when building LLM applications is whether the API is stateless (the server keeps no memory between requests) or stateful (the server maintains the conversation history).
Stateless API
A stateless LLM API treats every request as an independent, isolated transaction. The server retains zero memory of previous interactions. You must send the full conversation history with every request.
How Stateless Works
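Because the server keeps nothing between calls, the flow can be illustrated without any provider SDK. The sketch below is only an illustration: `fake_llm_server` is a hypothetical stand-in for a real endpoint, and it simply reports how many messages arrived so the resend-everything pattern is visible.

```python
from typing import Dict, List

Message = Dict[str, str]


def fake_llm_server(messages: List[Message]) -> str:
    """Hypothetical stand-in for a stateless endpoint.

    It sees only what arrives in this request and keeps nothing afterwards.
    """
    last_user = messages[-1]["content"]
    return f"(saw {len(messages)} messages) reply to: {last_user}"


# The client owns the history and resends all of it on every call.
history: List[Message] = []


def send(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    reply = fake_llm_server(history)  # full history goes over the wire
    history.append({"role": "assistant", "content": reply})
    return reply


first = send("What is RAG?")
second = send("How does it differ from fine-tuning?")
print(first)
print(second)
```

The second call carries three messages (the first question, the first reply, and the new question), even though the user typed only one new line.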
Stateless API Examples
| Provider | Stateless API | Endpoint |
|---|---|---|
| OpenAI | Chat Completions API | `POST /v1/chat/completions` |
| Anthropic | Messages API | `POST /v1/messages` |
| Google | Gemini API | `POST /v1beta/models/{model}:generateContent` |
| Open Source | vLLM, Ollama, TGI | Various |
Stateless Code Example — OpenAI
```python
from openai import OpenAI

client = OpenAI()

# YOU manage conversation history
conversation_history = [
    {"role": "system", "content": "You are a helpful assistant."}
]


def chat(user_message: str) -> str:
    conversation_history.append({"role": "user", "content": user_message})

    # Full history sent EVERY time
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=conversation_history,
    )

    reply = response.choices[0].message.content
    conversation_history.append({"role": "assistant", "content": reply})
    return reply


# Turn 1: sends system + 1 message
print(chat("What is RAG?"))

# Turn 2: sends system + turn 1 Q&A + new message
print(chat("How does it differ from fine-tuning?"))

# Turn 3: sends system + turns 1-2 + new message (growing!)
print(chat("Which is cheaper?"))
```
Stateless Code Example — Anthropic
```python
import anthropic

client = anthropic.Anthropic()

messages = []


def chat(user_message: str) -> str:
    messages.append({"role": "user", "content": user_message})

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system="You are a helpful assistant.",
        messages=messages,  # Full history sent every time
    )

    reply = response.content[0].text
    messages.append({"role": "assistant", "content": reply})
    return reply
```
Stateful API
A stateful LLM API maintains conversation history on the server side. The client sends only the new message plus a session identifier — the server manages the full conversation thread internally.
How Stateful Works
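The same exchange looks different when the server keeps the thread. The sketch below is a toy model, not any provider's implementation: `FakeStatefulServer` and its `respond` method are hypothetical names, and the session ID is a random hex string chosen for illustration.

```python
import uuid
from typing import Dict, List, Optional, Tuple

Message = Dict[str, str]


class FakeStatefulServer:
    """Hypothetical stand-in for a stateful endpoint that stores threads by ID."""

    def __init__(self) -> None:
        self._threads: Dict[str, List[Message]] = {}

    def respond(
        self, user_message: str, session_id: Optional[str] = None
    ) -> Tuple[str, str]:
        # Start a new thread when the client does not supply an ID.
        if session_id is None:
            session_id = uuid.uuid4().hex
        thread = self._threads.setdefault(session_id, [])
        thread.append({"role": "user", "content": user_message})
        reply = f"(thread holds {len(thread)} messages) reply to: {user_message}"
        thread.append({"role": "assistant", "content": reply})
        return session_id, reply


server = FakeStatefulServer()
sid, reply1 = server.respond("What is RAG?")
_, reply2 = server.respond("How does it differ from fine-tuning?", session_id=sid)
print(reply1)
print(reply2)
```

The client sends one message and one ID per turn; the growing thread lives entirely inside the server object.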
Stateful API Examples
| Provider | Stateful API/Feature | State Mechanism |
|---|---|---|
| OpenAI | Responses API | `previous_response_id`, `store=True` |
| OpenAI (legacy) | Assistants API (deprecated; shutdown scheduled for Aug 2026) | Thread IDs |
| ChatGPT, Claude.ai | Web UI | Managed by provider |
| Google | Vertex AI sessions, Context Caching | Cached content with TTL |
| Anthropic | No stateful API (stateless by design) | Client manages state |
Stateful Code Example — OpenAI Responses API
```python
from openai import OpenAI

client = OpenAI()

# Turn 1
response1 = client.responses.create(
    model="gpt-4o",
    input="What is retrieval-augmented generation?",
    store=True,  # Enable server-side state
)
print(response1.output_text)

# Turn 2: only the new message; the server has the history
response2 = client.responses.create(
    model="gpt-4o",
    input="How does it compare to fine-tuning?",
    previous_response_id=response1.id,  # Chain to previous turn
)
print(response2.output_text)

# Turn 3: continues the chain
response3 = client.responses.create(
    model="gpt-4o",
    input="Which approach is cheaper?",
    previous_response_id=response2.id,
)
print(response3.output_text)
```
Stateful Code Example — OpenAI Conversations API
```python
from openai import OpenAI

client = OpenAI()

# Create a persistent conversation
conversation = client.conversations.create()

# Turn 1
response1 = client.responses.create(
    model="gpt-4o",
    input="Explain vector databases.",
    conversation=conversation.id,  # the Responses API takes `conversation`, not `conversation_id`
    store=True,
)

# Turn 2: context is managed automatically
response2 = client.responses.create(
    model="gpt-4o",
    input="Which one should I use for production RAG?",
    conversation=conversation.id,
    store=True,
)
```
Stateless vs Stateful — Complete Comparison
| Dimension | Stateless API | Stateful API |
|---|---|---|
| State location | Client-side (you manage) | Server-side (provider manages) |
| Each request contains | Full conversation history | Only new message + session ID |
| Memory between requests | None | Server maintains history |
| Scalability | Excellent — no session affinity needed | More complex — requires state store |
| Token billing | Billed for the full history each turn | Provider-dependent; the full context is still processed and may still be billed |
| Debugging | Easy — requests are self-contained | Harder — hidden server state |
| Privacy | Client controls all data | Data stored on provider servers |
| Portability | Easy to switch providers | Vendor lock-in risk |
| Client complexity | Higher — must manage history | Lower — provider handles it |
| Context window management | Your responsibility | Server handles automatically |
| Best for | Simple apps, RAG, one-shot tasks | Long conversations, agents, multi-step reasoning |
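The "Context window management" row means that, with a stateless API, the client decides what to drop once the history outgrows the model's context window. A minimal sketch of one common approach follows: keep the system prompt and as many of the newest messages as fit a budget. The helper names `estimate_tokens` and `trim_history` are hypothetical, and the four-characters-per-token estimate is an assumption; a real tokenizer would give accurate counts.

```python
from typing import Dict, List

Message = Dict[str, str]


def estimate_tokens(message: Message) -> int:
    # Assumption: roughly 4 characters per token; use a real tokenizer in practice.
    return max(1, len(message["content"]) // 4)


def trim_history(messages: List[Message], max_tokens: int) -> List[Message]:
    """Keep system messages plus the most recent messages that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    budget = max_tokens - sum(estimate_tokens(m) for m in system)
    kept: List[Message] = []
    # Walk backwards so the newest messages are considered first.
    for message in reversed(rest):
        cost = estimate_tokens(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))


example = [
    {"role": "system", "content": "a" * 40},
    {"role": "user", "content": "b" * 400},
    {"role": "assistant", "content": "c" * 40},
    {"role": "user", "content": "d" * 40},
]
trimmed = trim_history(example, max_tokens=35)
print([m["role"] for m in trimmed])
```

Some APIs expect the first non-system message to come from the user, so production code may also need to drop a leading assistant message after trimming.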
Token Cost Comparison
Stateless APIs resend the full history every turn, increasing cost:
```text
Stateless: 10-turn conversation (500 tokens per turn)
  Turn 1:     500 tokens input
  Turn 2:   1,000 tokens input (resends turn 1)
  Turn 3:   1,500 tokens input (resends turns 1-2)
  ...
  Turn 10:  5,000 tokens input (resends turns 1-9)
  ─────────────────────────────────────
  Total input billed: ~27,500 tokens

Stateful: 10-turn conversation
  Turn 1:     500 tokens input
  Turn 2:     500 tokens input (new message only)
  Turn 3:     500 tokens input (new message only)
  ...
  Turn 10:    500 tokens input
  ─────────────────────────────────────
  Total input sent by the client: ~5,000 tokens
  (The server still processes the full context internally, so billing may differ)
```

The stateful figure counts what the client sends, not necessarily what is billed. For example, OpenAI documents that earlier input tokens in a `previous_response_id` chain are still billed as input tokens.
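The totals above follow from a simple arithmetic series. As a quick check, the sketch below recomputes them under the same simplifying assumption used in the example (every turn adds a fixed number of tokens); the function names are illustrative only.

```python
def stateless_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens sent when every turn resends all earlier turns."""
    # Turn k carries k * tokens_per_turn tokens, so the total is
    # tokens_per_turn * (1 + 2 + ... + turns).
    return tokens_per_turn * turns * (turns + 1) // 2


def stateful_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens sent when each turn carries only the new message."""
    return tokens_per_turn * turns


print(stateless_input_tokens(10, 500))  # 27500
print(stateful_input_tokens(10, 500))   # 5000
```

The stateless total grows quadratically with the number of turns, while the amount sent to a stateful API grows linearly.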
How Major Providers Handle State
| Provider | Stateless API | Stateful Option | Notes |
|---|---|---|---|
| OpenAI | Chat Completions | Responses API + Conversations | Most advanced stateful options |
| Anthropic | Messages API | None (stateless by design) | Philosophy: statelessness = deterministic, debuggable |
| Google | Gemini API | Context Caching (TTL-based) | Cached tokens billed at 10% cost |
| AWS Bedrock | Converse API | Agents for Bedrock | Session management via agents |
| Open Source | vLLM, Ollama | None natively | Must build state externally |
Adding State to Stateless APIs
Since most LLM APIs are stateless, here are patterns to add state:
Pattern 1: In-Memory (Simple)
```python
from openai import OpenAI


class ChatSession:
    def __init__(self, system_prompt: str):
        self.client = OpenAI()
        # History lives in this object and is lost when the process exits
        self.messages = [{"role": "system", "content": system_prompt}]

    def chat(self, user_message: str) -> str:
        self.messages.append({"role": "user", "content": user_message})

        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=self.messages,
        )

        reply = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```
Pattern 2: Redis-Backed (Production)
```python
import json

import redis
from openai import OpenAI


class RedisConversationStore:
    def __init__(self):
        self.redis = redis.from_url("redis://localhost:6379")
        self.client = OpenAI()

    def chat(self, session_id: str, user_message: str) -> str:
        # Retrieve existing history
        data = self.redis.get(f"chat:{session_id}")
        messages = json.loads(data) if data else [
            {"role": "system", "content": "You are a helpful assistant."}
        ]

        messages.append({"role": "user", "content": user_message})

        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
        )

        reply = response.choices[0].message.content
        messages.append({"role": "assistant", "content": reply})

        # Save updated history (1 hour TTL)
        self.redis.setex(f"chat:{session_id}", 3600, json.dumps(messages))
        return reply


# Supports multiple concurrent users
store = RedisConversationStore()
store.chat("user-123", "What is RAG?")          # User 123's conversation
store.chat("user-456", "Explain attention.")    # User 456's conversation
store.chat("user-123", "Give me an example.")   # Continues user 123's chat
```
Pattern 3: LangChain Memory Types
```python
# Note: ConversationChain and these memory classes are legacy LangChain APIs
# and are deprecated in recent releases; check your installed version.
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import (
    ConversationBufferMemory,
    ConversationSummaryMemory,
    ConversationSummaryBufferMemory,
)

llm = ChatOpenAI(model="gpt-4o")

# Option 1: Buffer, stores everything verbatim
conversation = ConversationChain(
    llm=llm,
    memory=ConversationBufferMemory(),
)

# Option 2: Summary, compresses old history into a summary
conversation = ConversationChain(
    llm=llm,
    memory=ConversationSummaryMemory(llm=llm),
)

# Option 3: Summary Buffer, keeps recent turns verbatim and summarizes older ones
conversation = ConversationChain(
    llm=llm,
    memory=ConversationSummaryBufferMemory(llm=llm, max_token_limit=500),
)
```
| Memory Type | Token Growth | Best For |
|---|---|---|
| ConversationBufferMemory | Linear O(n) | Short conversations, full accuracy |
| ConversationSummaryMemory | Roughly constant (size of the summary) | Long conversations |
| ConversationSummaryBufferMemory | Bounded by `max_token_limit` | Production chatbots |
| ConversationBufferWindowMemory | Fixed (last k turns) | Simple truncation |
| VectorStoreRetrieverMemory | Semantic retrieval | Long-term agent memory |
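The fixed-window strategy in the table is simple enough to build without a framework. The sketch below is a plain-Python illustration, not LangChain's implementation; `SlidingWindowMemory` is a hypothetical class name, and it keeps only the last `k` user/assistant exchanges.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class SlidingWindowMemory:
    """Keeps only the last k user/assistant exchanges."""

    def __init__(self, k: int) -> None:
        # A deque with maxlen discards the oldest entry automatically.
        self._turns: Deque[Tuple[str, str]] = deque(maxlen=k)

    def add_turn(self, user: str, assistant: str) -> None:
        self._turns.append((user, assistant))

    def as_messages(self) -> List[Dict[str, str]]:
        """Flatten the stored turns into the chat-message format."""
        messages: List[Dict[str, str]] = []
        for user, assistant in self._turns:
            messages.append({"role": "user", "content": user})
            messages.append({"role": "assistant", "content": assistant})
        return messages


memory = SlidingWindowMemory(k=2)
memory.add_turn("q1", "a1")
memory.add_turn("q2", "a2")
memory.add_turn("q3", "a3")  # pushes the first turn out of the window
print(memory.as_messages())
```

Token usage stays bounded by the window size, at the cost of forgetting anything older than `k` turns.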
When to Use Which?
| Scenario | Recommendation | Why |
|---|---|---|
| Simple Q&A / one-shot tasks | Stateless | No history needed |
| RAG applications | Stateless | Full control over context injection |
| Short chatbots (<10 turns) | Stateless | Simple, low token waste |
| Long conversations (50+ turns) | Stateful | Avoid resending growing history |
| Multi-step agents with tools | Stateful | Tool state persists across steps |
| Reasoning models (o1, o3) | Stateful | Reasoning state preserved across turns |
| Privacy-sensitive apps | Stateless | No data stored on provider servers |
| Multi-provider support | Stateless | No vendor lock-in |
| High-volume production | Stateful + caching | Reduced token cost and latency |
Key takeaway: Most LLM APIs are stateless by default. Stateful behavior is built on top of stateless APIs — either by the provider (OpenAI Responses API) or by you (Redis, LangChain memory). Choose based on your conversation length, privacy needs, and cost sensitivity.