What is stateless and stateful in API in LLM?

Question

Accepted Answer

## Stateless vs Stateful APIs in LLMs

A fundamental architectural decision when building LLM applications is whether the API is **stateless** (the server keeps no memory between requests, so the client resends the history) or **stateful** (the server stores the conversation history).

---

## Stateless API

A **stateless** LLM API treats every request as an independent, isolated transaction. The server retains **zero memory** of previous interactions. You must send the **full conversation history** with every request.

### How Stateless Works

```mermaid
sequenceDiagram
    participant U as User/Client
    participant S as LLM API Server
    
    Note over U: Turn 1
    U->>S: System + "What is RAG?" (500 tokens)
    S-->>U: "RAG stands for..." (200 tokens)
    Note over S: Server forgets everything
    
    Note over U: Turn 2 — must resend Turn 1
    U->>S: System + Turn 1 Q&A + "How is it different from fine-tuning?" (1,200 tokens)
    S-->>U: "The key differences are..." (300 tokens)
    Note over S: Server forgets everything again
    
    Note over U: Turn 3 — must resend Turns 1 & 2
    U->>S: System + Turn 1 + Turn 2 + "Which is cheaper?" (2,000 tokens)
    S-->>U: "RAG is generally cheaper..." (250 tokens)
    Note over S: Server forgets everything again
```

### Stateless API Examples

| Provider | Stateless API | Endpoint |
|----------|--------------|----------|
| **OpenAI** | Chat Completions API | `/v1/chat/completions` |
| **Anthropic** | Messages API | `/v1/messages` |
| **Google** | Gemini API | `generateContent` |
| **Open Source** | vLLM, Ollama, TGI | Various |

### Stateless Code Example — OpenAI

```python
from openai import OpenAI
client = OpenAI()

# YOU manage conversation history
conversation_history = [
    {"role": "system", "content": "You are a helpful assistant."}
]

def chat(user_message: str) -> str:
    conversation_history.append({"role": "user", "content": user_message})

    # Full history sent EVERY time
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=conversation_history
    )

    reply = response.choices[0].message.content
    conversation_history.append({"role": "assistant", "content": reply})
    return reply

# Turn 1: sends system + 1 message
print(chat("What is RAG?"))
# Turn 2: sends system + turn 1 Q&A + new message
print(chat("How does it differ from fine-tuning?"))
# Turn 3: sends system + turns 1-2 + new message (growing!)
print(chat("Which is cheaper?"))
```

### Stateless Code Example — Anthropic

```python
import anthropic
client = anthropic.Anthropic()

messages = []

def chat(user_message: str) -> str:
    messages.append({"role": "user", "content": user_message})

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system="You are a helpful assistant.",
        messages=messages  # Full history sent every time
    )

    reply = response.content[0].text
    messages.append({"role": "assistant", "content": reply})
    return reply
```
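
The two examples above differ mainly in where the system prompt lives: the OpenAI example puts it first in `messages`, while the Anthropic example passes it as a separate `system` argument. Because the client owns the history, one provider-neutral list can be shaped for either call. The sketch below only builds the keyword arguments and makes no network call. `to_openai_kwargs` and `to_anthropic_kwargs` are hypothetical helper names, not part of either SDK, and the model names are simply the ones used above.

```python
from typing import Dict, List

Message = Dict[str, str]  # {"role": "user" | "assistant", "content": "..."}


def to_openai_kwargs(system_prompt: str, history: List[Message]) -> dict:
    """Shape the history like the Chat Completions example: system prompt first."""
    return {
        "model": "gpt-4o",
        "messages": [{"role": "system", "content": system_prompt}, *history],
    }


def to_anthropic_kwargs(system_prompt: str, history: List[Message]) -> dict:
    """Shape the history like the Messages example: system prompt passed separately."""
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,
        "system": system_prompt,
        "messages": list(history),
    }


history = [
    {"role": "user", "content": "What is RAG?"},
    {"role": "assistant", "content": "RAG stands for..."},
    {"role": "user", "content": "How does it differ from fine-tuning?"},
]
openai_kwargs = to_openai_kwargs("You are a helpful assistant.", history)
anthropic_kwargs = to_anthropic_kwargs("You are a helpful assistant.", history)

print(len(openai_kwargs["messages"]))     # 4: system message + 3 history entries
print(len(anthropic_kwargs["messages"]))  # 3: system prompt travels separately
```

Each dictionary could then be unpacked into the corresponding call shown above, for example `client.chat.completions.create(**openai_kwargs)`.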

---

## Stateful API

A **stateful** LLM API maintains conversation history **on the server side**. The client sends only the new message plus a session identifier — the server manages the full conversation thread internally.

### How Stateful Works

```mermaid
sequenceDiagram
    participant U as User/Client
    participant S as LLM API Server
    participant DB as Server State Store
    
    Note over U: Turn 1
    U->>S: "What is RAG?"
    S->>DB: Store conversation
    S-->>U: response_id: "resp_001" + answer
    
    Note over U: Turn 2 — only new message + reference
    U->>S: previous_response_id: "resp_001" + "How is it different?"
    S->>DB: Retrieve history, append new message
    S-->>U: response_id: "resp_002" + answer
    
    Note over U: Turn 3 — only new message + reference
    U->>S: previous_response_id: "resp_002" + "Which is cheaper?"
    S->>DB: Retrieve history, append new message
    S-->>U: response_id: "resp_003" + answer
    
    Note over S: Server maintains full conversation chain
```
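
The flow in the diagram can be imitated in a few lines to show what the server does with a response ID. This is a toy sketch, not how any provider implements it: `ResponseStore` and `fake_model` are hypothetical names, the store is a plain dictionary, and the "model" is a stub that reports how many messages it was given.

```python
import uuid
from typing import Callable, Dict, List, Optional, Tuple

Message = Dict[str, str]


class ResponseStore:
    """Toy server-side store that chains turns by response ID."""

    def __init__(self, model: Callable[[List[Message]], str]):
        self._model = model  # stand-in for the real LLM call
        self._histories: Dict[str, List[Message]] = {}

    def create(
        self, user_input: str, previous_response_id: Optional[str] = None
    ) -> Tuple[str, str]:
        # Retrieve the stored history for the referenced turn, if any.
        history: List[Message] = []
        if previous_response_id is not None:
            history = list(self._histories[previous_response_id])

        history.append({"role": "user", "content": user_input})
        answer = self._model(history)  # the model still sees the full history
        history.append({"role": "assistant", "content": answer})

        # Store the extended history under a fresh ID and hand the ID back.
        response_id = f"resp_{uuid.uuid4().hex[:8]}"
        self._histories[response_id] = history
        return response_id, answer


def fake_model(history: List[Message]) -> str:
    return f"(reply after seeing {len(history)} messages)"


store = ResponseStore(fake_model)
id1, answer1 = store.create("What is RAG?")
id2, answer2 = store.create("How is it different?", previous_response_id=id1)
print(answer2)  # (reply after seeing 3 messages)
```

The client sends only the new message and an ID, but the stub still receives all three messages on the second turn, which is the point of the diagram.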

### Stateful API Examples

| Provider | Stateful API/Feature | State Mechanism |
|----------|---------------------|----------------|
| **OpenAI** | Responses API | `previous_response_id`, or a `conversation` ID from the Conversations API |
| **OpenAI (legacy)** | Assistants API (deprecated; shutdown scheduled for Aug 2026) | Thread IDs |
| **ChatGPT, Claude.ai** | Web UI | Managed by provider |
| **Google** | Vertex AI sessions, Context Caching | Cached content with TTL |
| **Anthropic** | No stateful API (stateless by design) | Client manages state |

### Stateful Code Example — OpenAI Responses API

```python
from openai import OpenAI
client = OpenAI()

# Turn 1
response1 = client.responses.create(
    model="gpt-4o",
    input="What is retrieval-augmented generation?",
    store=True  # Enable server-side state
)
print(response1.output_text)

# Turn 2 — only new message; server has the history
response2 = client.responses.create(
    model="gpt-4o",
    input="How does it compare to fine-tuning?",
    previous_response_id=response1.id  # Chain to previous turn
)
print(response2.output_text)

# Turn 3 — continues the chain
response3 = client.responses.create(
    model="gpt-4o",
    input="Which approach is cheaper?",
    previous_response_id=response2.id
)
print(response3.output_text)
```

### Stateful Code Example — OpenAI Conversations API

```python
from openai import OpenAI
client = OpenAI()

# Create a persistent conversation
conversation = client.conversations.create()

# Turn 1
response1 = client.responses.create(
    model="gpt-4o",
    input="Explain vector databases.",
    conversation=conversation.id,  # the parameter is `conversation`, not `conversation_id`
    store=True
)

# Turn 2: context is managed by the conversation
response2 = client.responses.create(
    model="gpt-4o",
    input="Which one should I use for production RAG?",
    conversation=conversation.id,
    store=True
)
```

---

## Stateless vs Stateful — Complete Comparison

| Dimension | Stateless API | Stateful API |
|-----------|--------------|-------------|
| **State location** | Client-side (you manage) | Server-side (provider manages) |
| **Each request contains** | Full conversation history | Only new message + session ID |
| **Memory between requests** | None | Server maintains history |
| **Scalability** | Excellent — no session affinity needed | More complex — requires state store |
| **Token billing** | Pays for full history each turn | Provider-dependent; the full context is often still billed, with savings mainly from caching |
| **Debugging** | Easy — requests are self-contained | Harder — hidden server state |
| **Privacy** | Client controls all data | Data stored on provider servers |
| **Portability** | Easy to switch providers | Vendor lock-in risk |
| **Client complexity** | Higher — must manage history | Lower — provider handles it |
| **Context window management** | Your responsibility | Server handles automatically |
| **Best for** | Simple apps, RAG, one-shot tasks | Long conversations, agents, multi-step reasoning |
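
With a stateless API, the "Context window management" row means the client has to decide what to drop once the history outgrows the model's window. A minimal sketch of one common approach follows: keep the system message and as many of the newest messages as fit a budget. The names `estimate_tokens` and `trim_history` are hypothetical, and the four-characters-per-token estimate is a rough assumption; a real application would count with the provider's tokenizer.

```python
from typing import Dict, List

Message = Dict[str, str]


def estimate_tokens(message: Message) -> int:
    # Crude assumption: about 4 characters per token, minimum 1.
    return max(1, len(message["content"]) // 4)


def trim_history(messages: List[Message], max_tokens: int) -> List[Message]:
    """Keep system messages plus the newest messages that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(estimate_tokens(m) for m in system)

    kept: List[Message] = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > budget:
            break  # stop at the first message that does not fit
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "c" * 40},
]
trimmed = trim_history(messages, max_tokens=120)
print([m["role"] for m in trimmed])  # ['system', 'assistant', 'user']
```

Here the oldest user message is dropped because the system prompt and the two newest messages already use most of the 120-token budget.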

---

## Token Cost Comparison

Stateless APIs resend the full history every turn, so per-turn input grows linearly and cumulative input grows quadratically with the number of turns:

```
Stateless: 10-turn conversation (each turn adds 500 tokens):
  Turn 1:  500 tokens input
  Turn 2:  1,000 tokens input (resends turn 1)
  Turn 3:  1,500 tokens input (resends turns 1-2)
  ...
  Turn 10: 5,000 tokens input (resends turns 1-9)
  ─────────────────────────────────────
  Total input billed: 27,500 tokens

Stateful: 10-turn conversation:
  Turn 1:  500 tokens sent by client
  Turn 2:  500 tokens sent by client (new message only)
  Turn 3:  500 tokens sent by client (new message only)
  ...
  Turn 10: 500 tokens sent by client
  ─────────────────────────────────────
  Total sent by client: 5,000 tokens
  (The model still processes the full context each turn, and providers
   generally bill for it. With OpenAI's previous_response_id, earlier
   tokens in the chain are still billed as input tokens. Savings come
   mainly from prompt caching and smaller request payloads.)
```
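
The arithmetic behind these totals is easy to check. The sketch below assumes, as the figures above do, that every turn adds a fixed 500 tokens and ignores output tokens; the function names are hypothetical. It models how many tokens are sent, not any particular provider's bill.

```python
def stateless_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Cumulative input when turn k resends all k turns so far."""
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


def new_message_tokens(turns: int, tokens_per_turn: int) -> int:
    """Tokens transmitted when the client sends only the new message each turn."""
    return turns * tokens_per_turn


print(stateless_input_tokens(10, 500))  # 27500
print(new_message_tokens(10, 500))      # 5000

# Closed form of the stateless sum: tokens_per_turn * n * (n + 1) / 2
print(500 * 10 * 11 // 2)               # 27500
```

Because the stateless total grows with the square of the turn count, a 50-turn conversation under the same assumption resends 637,500 tokens, which is why long conversations are where history management matters most.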

---

## How Major Providers Handle State

| Provider | Stateless API | Stateful Option | Notes |
|----------|--------------|-----------------|-------|
| **OpenAI** | Chat Completions | Responses API + Conversations | Most advanced stateful options |
| **Anthropic** | Messages API | None (stateless by design) | Philosophy: statelessness = deterministic, debuggable |
| **Google** | Gemini API | Context Caching (TTL-based) | Cached tokens billed at a discounted rate that varies by model |
| **AWS Bedrock** | Converse API | Agents for Bedrock | Session management via agents |
| **Open Source** | vLLM, Ollama | None natively | Must build state externally |

---

## Adding State to Stateless APIs

Since most LLM APIs are stateless, here are patterns to add state:

### Pattern 1: In-Memory (Simple)

```python
from openai import OpenAI


class ChatSession:
    """Keeps one conversation's history in process memory."""

    def __init__(self, system_prompt: str):
        self.client = OpenAI()
        self.messages = [{"role": "system", "content": system_prompt}]

    def chat(self, user_message: str) -> str:
        self.messages.append({"role": "user", "content": user_message})
        response = self.client.chat.completions.create(
            model="gpt-4o", messages=self.messages
        )
        reply = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

### Pattern 2: Redis-Backed (Production)

```python
import json, redis
from openai import OpenAI

class RedisConversationStore:
    def __init__(self):
        self.redis = redis.from_url("redis://localhost:6379")
        self.client = OpenAI()

    def chat(self, session_id: str, user_message: str) -> str:
        # Retrieve existing history
        data = self.redis.get(f"chat:{session_id}")
        messages = json.loads(data) if data else [
            {"role": "system", "content": "You are a helpful assistant."}
        ]

        messages.append({"role": "user", "content": user_message})

        response = self.client.chat.completions.create(
            model="gpt-4o", messages=messages
        )

        reply = response.choices[0].message.content
        messages.append({"role": "assistant", "content": reply})

        # Save updated history (1 hour TTL)
        self.redis.setex(f"chat:{session_id}", 3600, json.dumps(messages))
        return reply

# Supports multiple concurrent users
store = RedisConversationStore()
store.chat("user-123", "What is RAG?")       # User 123's conversation
store.chat("user-456", "Explain attention.")  # User 456's conversation
store.chat("user-123", "Give me an example.") # Continues user 123's chat
```

### Pattern 3: LangChain Memory Types

```python
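# Note: ConversationChain and the memory classes below are legacy LangChain
# APIs; check that your installed version still provides them.
from langchain_openai import ChatOpenAI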
from langchain.chains import ConversationChain
from langchain.memory import (
    ConversationBufferMemory,
    ConversationSummaryMemory,
    ConversationSummaryBufferMemory,
)

llm = ChatOpenAI(model="gpt-4o")

# Option 1: Buffer — stores everything verbatim
conversation = ConversationChain(
    llm=llm,
    memory=ConversationBufferMemory()
)

# Option 2: Summary — compresses old history into a summary
conversation = ConversationChain(
    llm=llm,
    memory=ConversationSummaryMemory(llm=llm)
)

# Option 3: Summary Buffer — keeps recent verbatim, summarizes older
conversation = ConversationChain(
    llm=llm,
    memory=ConversationSummaryBufferMemory(llm=llm, max_token_limit=500)
)
```

| Memory Type | Token Growth | Best For |
|-------------|-------------|----------|
| **ConversationBufferMemory** | Linear O(n) | Short conversations, full accuracy |
| **ConversationSummaryMemory** | Roughly constant (one running summary) | Long conversations |
| **ConversationSummaryBufferMemory** | Bounded by `max_token_limit` | Production chatbots |
| **ConversationBufferWindowMemory** | Fixed (last k turns) | Simple truncation |
| **VectorStoreRetrieverMemory** | Depends on how much is retrieved per query | Long-term agent memory |
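
To make the summary-buffer idea concrete without depending on LangChain, here is a minimal sketch of the same strategy: recent messages stay verbatim and older ones are folded into a running summary. It is not LangChain's implementation. `SummaryBuffer` and `toy_summarize` are hypothetical names, the limit is counted in messages rather than tokens for simplicity, and the summarizer is a stub where a real system would call an LLM.

```python
from typing import Callable, List


class SummaryBuffer:
    """Keeps the newest messages verbatim and folds older ones into a summary."""

    def __init__(self, summarize: Callable[[str, List[str]], str], keep_recent: int = 4):
        self._summarize = summarize
        self._keep_recent = keep_recent
        self.summary = ""
        self.recent: List[str] = []

    def add(self, message: str) -> None:
        self.recent.append(message)
        overflow = len(self.recent) - self._keep_recent
        if overflow > 0:
            # Evict the oldest messages and merge them into the running summary.
            evicted, self.recent = self.recent[:overflow], self.recent[overflow:]
            self.summary = self._summarize(self.summary, evicted)

    def context(self) -> List[str]:
        """What would be sent to the model: summary (if any) plus recent messages."""
        prefix = [f"Summary so far: {self.summary}"] if self.summary else []
        return prefix + self.recent


def toy_summarize(previous: str, evicted: List[str]) -> str:
    # Placeholder: clip each evicted message; a real version would ask an LLM.
    clipped = [text[:12] for text in evicted]
    return "; ".join(part for part in [previous, *clipped] if part)


memory = SummaryBuffer(toy_summarize, keep_recent=2)
for turn in ["What is RAG?", "RAG stands for...", "Which is cheaper?"]:
    memory.add(turn)
print(memory.context())
```

After the third message, the first one has moved into the summary line and the last two are still present word for word, so the context stays bounded however long the conversation runs.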

---

## Full Architecture — Stateless vs Stateful Flow

*[Diagram: flowchart contrasting the Stateless and Stateful request flows]*

---

## When to Use Which?

| Scenario | Recommendation | Why |
|----------|---------------|-----|
| Simple Q&A / one-shot tasks | Stateless | No history needed |
| RAG applications | Stateless | Full control over context injection |
| Short chatbots (<10 turns) | Stateless | Simple, low token waste |
| Long conversations (50+ turns) | Stateful | Avoid resending growing history |
| Multi-step agents with tools | Stateful | Tool state persists across steps |
| Reasoning models (o1, o3) | Stateful | Reasoning state preserved across turns |
| Privacy-sensitive apps | Stateless | No data stored on provider servers |
| Multi-provider support | Stateless | No vendor lock-in |
| High-volume production | Stateful + caching | Reduced token cost and latency |

> **Key takeaway:** Most LLM APIs are stateless by default. Stateful behavior is built **on top** of stateless APIs — either by the provider (OpenAI Responses API) or by you (Redis, LangChain memory). Choose based on your conversation length, privacy needs, and cost sensitivity.
