What do stateless and stateful mean for LLM APIs?

#gen-ai #llm #api #stateless #stateful #architecture #langchain #conversation-management

Answer

Stateless vs Stateful APIs in LLMs

A fundamental architectural decision when building LLM applications is whether the API you call is stateless (no memory between requests) or stateful (the server maintains conversation history).


Stateless API

A stateless LLM API treats every request as an independent, isolated transaction. The server retains zero memory of previous interactions. You must send the full conversation history with every request.

How Stateless Works
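
The flow can be sketched without any network calls. In this minimal sketch, `stateless_server` is a hypothetical stand-in for a provider endpoint (it is not a real API): it sees only what arrives in the request, so the client has to keep the history and resend all of it on every turn.

```python
from typing import Dict, List

Message = Dict[str, str]


def stateless_server(messages: List[Message]) -> str:
    """Hypothetical stand-in for an LLM endpoint: it only sees this one request."""
    user_turns = [m for m in messages if m["role"] == "user"]
    return f"reply to {user_turns[-1]['content']!r}"


def run_conversation(user_inputs: List[str]) -> List[int]:
    """Return the number of messages sent to the server on each turn."""
    history: List[Message] = []  # state lives on the client
    sizes = []
    for text in user_inputs:
        history.append({"role": "user", "content": text})
        sizes.append(len(history))          # size of this request's payload
        reply = stateless_server(history)   # the full history goes over the wire
        history.append({"role": "assistant", "content": reply})
    return sizes


payload_sizes = run_conversation(
    ["What is RAG?", "How does it differ from fine-tuning?", "Which is cheaper?"]
)
print(payload_sizes)  # [1, 3, 5]
```

Each turn adds one user message and one assistant reply, so the request grows by two messages per turn.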

Stateless API Examples

| Provider | Stateless API | Endpoint |
| --- | --- | --- |
| OpenAI | Chat Completions API | /v1/chat/completions |
| Anthropic | Messages API | /v1/messages |
| Google | Gemini API | generateContent |
| Open Source | vLLM, Ollama, TGI | Various |

Stateless Code Example — OpenAI

python
from openai import OpenAI
client = OpenAI()

# YOU manage conversation history
conversation_history = [
    {"role": "system", "content": "You are a helpful assistant."}
]

def chat(user_message: str) -> str:
    conversation_history.append({"role": "user", "content": user_message})

    # Full history sent EVERY time
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=conversation_history
    )

    reply = response.choices[0].message.content
    conversation_history.append({"role": "assistant", "content": reply})
    return reply

# Turn 1: sends system + 1 message
print(chat("What is RAG?"))
# Turn 2: sends system + turn 1 Q&A + new message
print(chat("How does it differ from fine-tuning?"))
# Turn 3: sends system + turns 1-2 + new message (growing!)
print(chat("Which is cheaper?"))

Stateless Code Example — Anthropic

python
import anthropic
client = anthropic.Anthropic()

messages = []

def chat(user_message: str) -> str:
    messages.append({"role": "user", "content": user_message})

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system="You are a helpful assistant.",
        messages=messages  # Full history sent every time
    )

    reply = response.content[0].text
    messages.append({"role": "assistant", "content": reply})
    return reply

Stateful API

A stateful LLM API maintains conversation history on the server side. The client sends only the new message plus a session identifier — the server manages the full conversation thread internally.

How Stateful Works
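
As a rough illustration, the server side can be modelled as a dictionary of histories keyed by session ID. `ToyStatefulServer` and its methods are hypothetical names for this sketch, not any provider's API; a real service would call the model with the stored history where the placeholder reply is built.

```python
import uuid
from typing import Dict, List, Optional, Tuple

Message = Dict[str, str]


class ToyStatefulServer:
    """Hypothetical server that keeps each conversation keyed by session ID."""

    def __init__(self) -> None:
        self._sessions: Dict[str, List[Message]] = {}

    def send(self, text: str, session_id: Optional[str] = None) -> Tuple[str, str]:
        """Accept only the new message; return (session_id, reply)."""
        if session_id is None:
            session_id = uuid.uuid4().hex  # start a new conversation
        history = self._sessions.setdefault(session_id, [])
        history.append({"role": "user", "content": text})
        # A real server would call the model with the full stored history here.
        reply = f"reply #{len(history) // 2 + 1}"
        history.append({"role": "assistant", "content": reply})
        return session_id, reply

    def history_length(self, session_id: str) -> int:
        """Number of messages the server holds for this session."""
        return len(self._sessions[session_id])


server = ToyStatefulServer()
sid, first = server.send("What is RAG?")
_, second = server.send("How does it differ from fine-tuning?", session_id=sid)
print(first, second, server.history_length(sid))
```

The client only ever holds `sid`; the four stored messages never leave the server.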

Stateful API Examples

| Provider | Stateful API/Feature | State Mechanism |
| --- | --- | --- |
| OpenAI | Responses API | previous_response_id, conversation |
| OpenAI (legacy) | Assistants API (deprecated; shutdown scheduled for Aug 2026) | Thread IDs |
| ChatGPT, Claude.ai | Web UI | Managed by provider |
| Google | Vertex AI sessions, Context Caching | Cached content with TTL |
| Anthropic | No stateful API (stateless by design) | Client manages state |

Stateful Code Example — OpenAI Responses API

python
from openai import OpenAI
client = OpenAI()

# Turn 1
response1 = client.responses.create(
    model="gpt-4o",
    input="What is retrieval-augmented generation?",
    store=True  # Enable server-side state
)
print(response1.output_text)

# Turn 2 — only new message; server has the history
response2 = client.responses.create(
    model="gpt-4o",
    input="How does it compare to fine-tuning?",
    previous_response_id=response1.id  # Chain to previous turn
)
print(response2.output_text)

# Turn 3 — continues the chain
response3 = client.responses.create(
    model="gpt-4o",
    input="Which approach is cheaper?",
    previous_response_id=response2.id
)
print(response3.output_text)
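
The three turns above repeat one pattern: pass the previous response's ID into the next request. A minimal sketch of that loop follows. `chain_turns` is a hypothetical helper, and `fake_create` is an offline stand-in used so the example runs without an API key; only the `model`, `input`, and `previous_response_id` parameters and the `id` and `output_text` fields shown earlier are assumed.

```python
from types import SimpleNamespace
from typing import Callable, List


def chain_turns(create: Callable, model: str, prompts: List[str]) -> List[str]:
    """Send prompts in order, linking each request to the previous response."""
    previous_id = None
    outputs = []
    for prompt in prompts:
        kwargs = {"model": model, "input": prompt}
        if previous_id is not None:
            kwargs["previous_response_id"] = previous_id
        response = create(**kwargs)
        outputs.append(response.output_text)
        previous_id = response.id  # the next turn chains to this one
    return outputs


# Offline stand-in for client.responses.create (hypothetical, illustration only).
calls = []


def fake_create(**kwargs):
    calls.append(kwargs)
    n = len(calls)
    return SimpleNamespace(id=f"resp_{n}", output_text=f"answer {n}")


print(chain_turns(fake_create, "gpt-4o", ["What is RAG?", "Compare it to fine-tuning."]))
```

With a real client, the same helper would be called as `chain_turns(client.responses.create, "gpt-4o", prompts)`.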

Stateful Code Example — OpenAI Conversations API

python
from openai import OpenAI
client = OpenAI()

# Create a persistent conversation
conversation = client.conversations.create()

# Turn 1
response1 = client.responses.create(
    model="gpt-4o",
    input="Explain vector databases.",
    conversation=conversation.id,  # the parameter is named `conversation`
    store=True
)

# Turn 2: context is managed by the conversation object
response2 = client.responses.create(
    model="gpt-4o",
    input="Which one should I use for production RAG?",
    conversation=conversation.id,
    store=True
)

Stateless vs Stateful — Complete Comparison

| Dimension | Stateless API | Stateful API |
| --- | --- | --- |
| State location | Client-side (you manage) | Server-side (provider manages) |
| Each request contains | Full conversation history | Only new message + session ID |
| Memory between requests | None | Server maintains history |
| Scalability | Excellent: no session affinity needed | More complex: requires a state store |
| Token billing | Pays for full history each turn | Provider-dependent; the full context is often still billed |
| Debugging | Easy: requests are self-contained | Harder: hidden server state |
| Privacy | Client controls all data | Data stored on provider servers |
| Portability | Easy to switch providers | Vendor lock-in risk |
| Client complexity | Higher: must manage history | Lower: provider handles it |
| Context window management | Your responsibility | Handled by the server |
| Best for | Simple apps, RAG, one-shot tasks | Long conversations, agents, multi-step reasoning |
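
With a stateless API, keeping the history inside the context window falls to the client. One simple approach, sketched here under stated assumptions, is to keep the system prompt and drop the oldest messages until the rest fits a token budget. `trim_history` and `estimate_tokens` are hypothetical helpers, and the 4-characters-per-token estimate is only a rough heuristic; a real application would use the provider's tokenizer.

```python
from typing import Dict, List

Message = Dict[str, str]


def estimate_tokens(message: Message) -> int:
    """Rough heuristic (assumption): about 4 characters per token."""
    return max(1, len(message["content"]) // 4)


def trim_history(messages: List[Message], max_tokens: int) -> List[Message]:
    """Keep system messages plus the newest messages that fit in max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(estimate_tokens(m) for m in system)

    kept: List[Message] = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "c" * 40},
]
trimmed = trim_history(history, max_tokens=50)
print([m["role"] for m in trimmed])
```

Here only the system prompt and the latest user message survive. Truncation loses older context outright, which is why summarizing older turns is a common alternative.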

Token Cost Comparison

Stateless APIs resend the full history every turn, so cumulative input tokens grow roughly quadratically with the number of turns:

text
Stateless: 10-turn conversation (500 tokens per turn):
  Turn 1:  500 tokens input
  Turn 2:  1,000 tokens input (resends turn 1)
  Turn 3:  1,500 tokens input (resends turns 1-2)
  ...
  Turn 10: 5,000 tokens input (resends turns 1-9)
  ─────────────────────────────────────
  Total input billed: 27,500 tokens

Stateful: 10-turn conversation:
  Turn 1:  500 tokens sent
  Turn 2:  500 tokens sent (new message only)
  Turn 3:  500 tokens sent (new message only)
  ...
  Turn 10: 500 tokens sent
  ─────────────────────────────────────
  Total sent by client: 5,000 tokens
  (The server still feeds the full context to the model each turn, so billed
  input can match the stateless total. OpenAI, for example, bills earlier
  tokens in a previous_response_id chain as input tokens.)
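
The totals above follow from simple arithmetic. If every turn adds t tokens, turn k of a stateless conversation sends k·t tokens, so n turns send t·n(n+1)/2 in total, while sending only the new message costs n·t on the wire. A small sketch (the function names are illustrative, and a uniform turn size is assumed):

```python
def stateless_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens when turn k resends all k turns so far."""
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


def new_message_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total tokens the client sends when only the new message is transmitted."""
    return turns * tokens_per_turn


print(stateless_input_tokens(10, 500))  # 27500
print(new_message_tokens(10, 500))      # 5000
```

Doubling the conversation to 20 turns roughly quadruples the stateless total (105,000 tokens), which is why long conversations are where history management matters most.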

How Major Providers Handle State

| Provider | Stateless API | Stateful Option | Notes |
| --- | --- | --- | --- |
| OpenAI | Chat Completions | Responses API + Conversations | Most advanced stateful options |
| Anthropic | Messages API | None (stateless by design) | Requests are self-contained and easier to debug |
| Google | Gemini API | Context Caching (TTL-based) | Cached tokens billed at a discounted rate |
| AWS Bedrock | Converse API | Agents for Bedrock | Session management via agents |
| Open Source | vLLM, Ollama | None natively | Must build state externally |

Adding State to Stateless APIs

Since most LLM APIs are stateless, here are patterns to add state:

Pattern 1: In-Memory (Simple)

python
from openai import OpenAI

class ChatSession:
    def __init__(self, system_prompt: str):
        self.client = OpenAI()
        self.messages = [{"role": "system", "content": system_prompt}]

    def chat(self, user_message: str) -> str:
        self.messages.append({"role": "user", "content": user_message})
        response = self.client.chat.completions.create(
            model="gpt-4o", messages=self.messages
        )
        reply = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": reply})
        return reply

Pattern 2: Redis-Backed (Production)

python
import json, redis
from openai import OpenAI

class RedisConversationStore:
    def __init__(self):
        self.redis = redis.from_url("redis://localhost:6379")
        self.client = OpenAI()

    def chat(self, session_id: str, user_message: str) -> str:
        # Retrieve existing history
        data = self.redis.get(f"chat:{session_id}")
        messages = json.loads(data) if data else [
            {"role": "system", "content": "You are a helpful assistant."}
        ]

        messages.append({"role": "user", "content": user_message})

        response = self.client.chat.completions.create(
            model="gpt-4o", messages=messages
        )

        reply = response.choices[0].message.content
        messages.append({"role": "assistant", "content": reply})

        # Save updated history (1 hour TTL)
        self.redis.setex(f"chat:{session_id}", 3600, json.dumps(messages))
        return reply

# Supports multiple concurrent users
store = RedisConversationStore()
store.chat("user-123", "What is RAG?")       # User 123's conversation
store.chat("user-456", "Explain attention.")  # User 456's conversation
store.chat("user-123", "Give me an example.") # Continues user 123's chat
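
Pattern 2 needs a running Redis server, which is inconvenient in unit tests. One option, sketched below, is an in-memory stand-in that exposes only the two calls the pattern relies on, `get` and `setex`. `FakeRedis` and `append_turn` are hypothetical names for this illustration and do not belong to the redis package.

```python
import json
import time
from typing import Dict, Optional, Tuple


class FakeRedis:
    """Hypothetical in-memory stand-in exposing only get() and setex()."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:  # expired: behave like a missing key
            del self._data[key]
            return None
        return value

    def setex(self, key: str, ttl_seconds: int, value: str) -> None:
        self._data[key] = (time.monotonic() + ttl_seconds, value)


def append_turn(store, session_id: str, role: str, content: str, ttl: int = 3600) -> list:
    """Load a session's history, append one message, and save it back."""
    key = f"chat:{session_id}"
    raw = store.get(key)
    messages = json.loads(raw) if raw else []
    messages.append({"role": role, "content": content})
    store.setex(key, ttl, json.dumps(messages))
    return messages


fake = FakeRedis()
append_turn(fake, "user-123", "user", "What is RAG?")
append_turn(fake, "user-456", "user", "Explain attention.")
print(len(append_turn(fake, "user-123", "user", "Give me an example.")))  # 2
```

The load, append, save sequence is not atomic: two concurrent requests for the same session can overwrite each other's update, so a production store would need a lock or an atomic append.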

Pattern 3: LangChain Memory Types

python
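# Note: ConversationChain and these memory classes are legacy LangChain APIs
# (deprecated in newer releases); this example assumes a version that still ships them.
from langchain_openai import ChatOpenAI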
from langchain.chains import ConversationChain
from langchain.memory import (
    ConversationBufferMemory,
    ConversationSummaryMemory,
    ConversationSummaryBufferMemory,
)

llm = ChatOpenAI(model="gpt-4o")

# Option 1: Buffer — stores everything verbatim
conversation = ConversationChain(
    llm=llm,
    memory=ConversationBufferMemory()
)

# Option 2: Summary — compresses old history into a summary
conversation = ConversationChain(
    llm=llm,
    memory=ConversationSummaryMemory(llm=llm)
)

# Option 3: Summary Buffer — keeps recent verbatim, summarizes older
conversation = ConversationChain(
    llm=llm,
    memory=ConversationSummaryBufferMemory(llm=llm, max_token_limit=500)
)
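
| Memory Type | Token Growth | Best For |
| --- | --- | --- |
| ConversationBufferMemory | Linear O(n) | Short conversations, full accuracy |
| ConversationSummaryMemory | Roughly constant | Long conversations |
| ConversationSummaryBufferMemory | Bounded | Production chatbots |
| ConversationBufferWindowMemory | Fixed (last k turns) | Simple truncation |
| VectorStoreRetrieverMemory | Semantic retrieval | Long-term agent memory |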
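
The summary-buffer idea can also be shown in plain Python, independent of LangChain. In this minimal sketch the cutoff is a message count rather than a token limit, and `naive_summarize` is a placeholder; both are simplifying assumptions, since a real summarizer would call an LLM to compress the older turns.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]


def summary_buffer(
    messages: List[Message],
    summary: str,
    keep_last: int,
    summarize: Callable[[str, List[Message]], str],
) -> Tuple[str, List[Message]]:
    """Fold everything except the last keep_last messages into the summary."""
    if len(messages) <= keep_last:
        return summary, messages
    cut = len(messages) - keep_last
    older, recent = messages[:cut], messages[cut:]
    return summarize(summary, older), recent


def naive_summarize(summary: str, older: List[Message]) -> str:
    """Placeholder summarizer (assumption): a real one would call an LLM."""
    snippets = [f"{m['role']}: {m['content'][:20]}" for m in older]
    return " | ".join(filter(None, [summary] + snippets))


history = [
    {"role": "user", "content": "What is RAG?"},
    {"role": "assistant", "content": "Retrieval-augmented generation."},
    {"role": "user", "content": "Which is cheaper?"},
]
summary, recent = summary_buffer(history, "", keep_last=1, summarize=naive_summarize)
print(summary)
print(recent)
```

The next request would then carry the summary (for example inside the system prompt) plus the `recent` messages, which keeps the prompt size bounded as the conversation grows.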

Full Architecture — Stateless vs Stateful Flow


When to Use Which?

| Scenario | Recommendation | Why |
| --- | --- | --- |
| Simple Q&A / one-shot tasks | Stateless | No history needed |
| RAG applications | Stateless | Full control over context injection |
| Short chatbots (<10 turns) | Stateless | Simple, low token waste |
| Long conversations (50+ turns) | Stateful | Avoid resending growing history |
| Multi-step agents with tools | Stateful | Tool state persists across steps |
| Reasoning models (o1, o3) | Stateful | Reasoning state preserved across turns |
| Privacy-sensitive apps | Stateless | No data stored on provider servers |
| Multi-provider support | Stateless | No vendor lock-in |
| High-volume production | Stateful + caching | Reduced token cost and latency |

Key takeaway: Most LLM APIs are stateless by default. Stateful behavior is built on top of stateless APIs — either by the provider (OpenAI Responses API) or by you (Redis, LangChain memory). Choose based on your conversation length, privacy needs, and cost sensitivity.