What is a context window and how to manage the context window in LLM?

Question

Accepted Answer

## What Is a Context Window?

A **context window** is the maximum number of tokens an LLM can process in a single request, including both **input tokens** (your prompt, system instructions, conversation history) and **output tokens** (the model's response). Everything outside this window is invisible to the model.

$$\text{Input Tokens} + \text{Output Tokens} \le \text{Context Window}$$

```
You type: "Explain RAG in detail"   (5 tokens)
+ System prompt                     (500 tokens)
+ Conversation history              (2,000 tokens)
+ Model response                    (1,000 tokens)
= Total: 3,505 tokens used from context window
```

> **Rule of thumb:** 1 token ≈ 0.75 English words. So 128K tokens ≈ 96,000 words ≈ a 192-page book (at about 500 words per page).

---

## Per-Prompt or Per-Session?

This is one of the most commonly misunderstood aspects of context windows.

### API Usage: Per-Request (Stateless)

Every API call is **completely independent**. The model has **zero memory** between requests. You must include the entire conversation history in each request:

```python
from openai import OpenAI

client = OpenAI()

# Each API call must include ALL prior messages
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is RAG?"},
    {"role": "assistant", "content": "RAG stands for..."},  # previous response
    {"role": "user", "content": "How does it differ from fine-tuning?"},  # new question
]

# ALL of the above counts against the context window
response = client.chat.completions.create(model="gpt-4o", messages=messages)
```

### Chat Interfaces: Per-Conversation (Managed by Provider)

Services like ChatGPT, Claude.ai, and Gemini manage history automatically. The exact behavior is not always documented and changes over time; the table shows what is typically observed:

| Interface | Behavior When Limit Approached |
|-----------|-------------------------------|
| **ChatGPT** | Silently truncates older messages or summarizes |
| **Claude.ai** | Warns user, suggests starting a new conversation |
| **Gemini** | May summarize older context or return an error |
| **Local (Ollama)** | Typically truncates from the beginning |
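To make the stateless API behavior and the rule-of-thumb token estimate concrete, here is a minimal sketch that makes no network calls. The `estimate_tokens` helper and the `Conversation` class are hypothetical names introduced only for illustration, and the estimate uses the 0.75 words-per-token heuristic rather than a real tokenizer, so the counts are approximate.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate from the rule of thumb: 1 token is about 0.75 English words."""
    words = len(text.split())
    return round(words / 0.75)


class Conversation:
    """Holds the full message history that a stateless API needs on every call."""

    def __init__(self, system_prompt: str) -> None:
        self.messages = [{"role": "system", "content": system_prompt}]

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    def input_tokens(self) -> int:
        """Estimated input tokens if the whole history were sent right now."""
        return sum(estimate_tokens(m["content"]) for m in self.messages)


convo = Conversation("You are a helpful assistant.")
convo.add("user", "What is RAG?")
first_call = convo.input_tokens()

convo.add("assistant", "RAG stands for retrieval augmented generation.")
convo.add("user", "How does it differ from fine-tuning?")
second_call = convo.input_tokens()

# The second request is larger because it resends everything from the first.
print(first_call, second_call)
```

Because each request carries the whole history, input size (and cost) grows with every turn even when the new question is short.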
### Key Distinction

| Aspect | API (Stateless) | Chat Interface (Managed) |
|--------|----------------|-------------------------|
| **Scope** | Per individual request | Per conversation session |
| **Memory** | None; you manage history | Provider manages history |
| **When exceeded** | Returns error immediately | Truncates/summarizes silently |
| **Developer control** | Full control over what's included | No control |
| **Billing** | Pay for ALL tokens in every request | Usually subscription-based |

---

## Context Window Limits: All Major LLMs (2025–2026)

Limits change between model versions and sometimes between API tiers, so confirm against the provider's documentation before relying on a figure. Values marked `~` are approximate.

### OpenAI

| Model | Context Window | Max Output | Tokenizer |
|-------|---------------|------------|----------|
| GPT-4o | 128K | 16,384 | tiktoken (`o200k_base`) |
| GPT-4o-mini | 128K | 16,384 | tiktoken (`o200k_base`) |
| GPT-4.1 | ~1M | 32,768 | tiktoken (`o200k_base`) |
| GPT-4.1-mini | ~1M | 32,768 | tiktoken (`o200k_base`) |
| GPT-4.1-nano | ~1M | 32,768 | tiktoken (`o200k_base`) |
| o1 | 200K | 100,000 | tiktoken (`o200k_base`) |
| o3 | 200K | 100,000 | tiktoken (`o200k_base`) |
| o3-mini | 200K | 100,000 | tiktoken (`o200k_base`) |
| o4-mini | 200K | 100,000 | tiktoken (`o200k_base`) |

### Anthropic

| Model | Context Window | Max Output | Tokenizer |
|-------|---------------|------------|----------|
| Claude 3.5 Sonnet | 200K | 8,192 | Claude BPE |
| Claude 3.5 Haiku | 200K | 8,192 | Claude BPE |
| Claude Opus 4 | 200K | 32,000 | Claude BPE |
| Claude Sonnet 4 | 200K | 64,000 | Claude BPE |

### Google

| Model | Context Window | Max Output | Tokenizer |
|-------|---------------|------------|----------|
| Gemini 1.5 Pro | 2M | 8,192 | SentencePiece |
| Gemini 1.5 Flash | 1M | 8,192 | SentencePiece |
| Gemini 2.0 Flash | 1M | 8,192 | SentencePiece |
| Gemini 2.5 Pro | 1M | 65,536 | SentencePiece |
| Gemini 2.5 Flash | 1M | 65,536 | SentencePiece |

### Meta (Open Weights)

| Model | Context Window | Max Output | Tokenizer |
|-------|---------------|------------|----------|
| Llama 3.1 (8B/70B/405B) | 128K | Configurable | BPE (128K vocab) |
| Llama 3.3 70B | 128K | 32,768 | BPE (128K vocab) |
| Llama 4 Scout (109B MoE) | 10M | ~32,768 | BPE (200K vocab) |
| Llama 4 Maverick (400B MoE) | 1M | ~32,768 | BPE (200K vocab) |

### Mistral

| Model | Context Window | Max Output | Tokenizer |
|-------|---------------|------------|----------|
| Mistral Large 2 | 128K | ~8,192 | SentencePiece BPE |
| Mistral Small 3.2 | 128K | ~8,192 | SentencePiece BPE |
| Mixtral 8x22B | 64K | ~4,096 | SentencePiece BPE |

### DeepSeek

| Model | Context Window | Max Output | Tokenizer |
|-------|---------------|------------|----------|
| DeepSeek-V3 | 128K | 8,000 | Custom BBPE (128K vocab) |
| DeepSeek-R1 | 128K | 32,768 | Custom BBPE (128K vocab) |

### Others

| Model | Context Window | Max Output | Tokenizer |
|-------|---------------|------------|----------|
| Grok-2 | 131K | 131K | Custom BPE |
| Grok-3 | 131K | ~32,768 | Custom BPE |
| Command R+ | 128K | 4,000 | Custom BPE |
| Command A | 256K | 8,000 | Custom BPE |
| Qwen 2.5 Turbo | 1M | ~8,192 | Custom BPE (152K vocab) |
| Qwen 3 | 32K–131K | ~8,192 | Custom BPE |

---

## Why Context Windows Are Limited

### 1. Quadratic Attention Complexity: O(n²)

Self-attention computes an `n × n` attention matrix, so doubling the context quadruples the attention compute:

| Context Length | Attention Matrix Size | Relative Cost (vs. 4K) |
|---------------|----------------------|---------------|
| 4K | 16M entries | 1x |
| 32K | 1B entries | 64x |
| 128K | 16B entries | 1,024x |
| 1M | 1T entries | ~62,500x |

### 2. KV Cache Memory

During generation, Key/Value vectors for every token are cached. Actual sizes depend on layer count, number of KV heads, and precision, so treat these as order-of-magnitude figures:

| Model Size | Context | Approximate KV Cache (FP16) |
|-----------|---------|----------------------------|
| 7B | 4K | ~1 GB |
| 7B | 128K | ~32 GB |
| 70B | 128K | ~160 GB |
| 70B | 1M | ~1.2 TB |

### 3. Positional Encoding Limits

Models trained with RoPE (Rotary Position Embeddings) degrade beyond their training length. Extending that length requires techniques like YaRN or LongRoPE.
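The arithmetic behind the attention and KV cache tables above can be sketched in a few lines. This is a back-of-the-envelope estimate, not a measurement: the function names are hypothetical, and the layer count, KV head count, and head dimension below are assumed values for a generic 7B-class model. Real architectures differ, which is one reason published KV cache figures (including the table above) vary.

```python
GIB = 1024 ** 3


def attention_entries(context_len: int) -> int:
    """Number of entries in one n x n attention score matrix."""
    return context_len * context_len


def kv_cache_bytes(
    context_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # FP16
) -> int:
    """KV cache size for one sequence.

    The leading factor 2 accounts for storing one key and one value
    vector per token, per layer, per KV head.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_len


# Quadratic growth: 8x the context means 64x the attention entries.
ratio = attention_entries(32 * 1024) // attention_entries(4 * 1024)

# Assumed 7B-class config: 32 layers, head_dim 128, FP16.
full_mha = kv_cache_bytes(4096, n_layers=32, n_kv_heads=32, head_dim=128)
with_gqa = kv_cache_bytes(4096, n_layers=32, n_kv_heads=8, head_dim=128)

print(f"attention cost ratio 32K vs 4K: {ratio}x")
print(f"KV cache, 32 KV heads: {full_mha / GIB:.2f} GiB")
print(f"KV cache, 8 KV heads (GQA): {with_gqa / GIB:.2f} GiB")
```

Under these assumptions, sharing K/V heads across query heads shrinks the cache by the ratio of head counts, while the cache still grows linearly with context length.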
### Solutions Used by Modern Models

| Technique | How It Works | Used By |
|-----------|-------------|--------|
| **Flash Attention** | Memory-efficient exact attention via tiling, O(n) memory | Nearly all modern LLMs |
| **Grouped Query Attention (GQA)** | Multiple query heads share K/V heads, reducing KV cache 4–8x | Llama 3, Mistral, Gemini |
| **Multi-Head Latent Attention (MLA)** | Compresses K/V into low-rank latent space | DeepSeek-V3 |
| **Sliding Window Attention** | Each layer attends to only a fixed window | Mistral, Mixtral |
| **Ring Attention** | Distributes sequence across GPUs in a ring | Long-context research systems (e.g. Large World Model) |
| **YaRN / LongRoPE** | Extends RoPE to 16–512x original training length | Llama, Qwen, Phi |

---

## Context Window Management Strategies

### Strategy 1: Token Counting Before Sending

Always count tokens before making API calls to avoid errors.

```python
import tiktoken


def check_fits_context(
    messages: list[dict],
    model: str = "gpt-4o",
    max_context: int = 128_000,
    reserve_for_output: int = 4_096,
) -> bool:
    """Return True if the messages fit once output space is reserved."""
    enc = tiktoken.encoding_for_model(model)
    # Counts message content only; the API adds a few tokens of
    # per-message overhead, so leave some headroom.
    total = sum(len(enc.encode(m["content"])) for m in messages)
    available = max_context - reserve_for_output
    print(f"Using {total}/{available} tokens ({total/available*100:.1f}%)")
    return total <= available
```

### Strategy 2: Sliding Window Truncation

Keep only the most recent messages that fit.
```python
import tiktoken


def sliding_window_truncate(messages, max_tokens=120000):
    """Keep system prompt + most recent messages within budget."""
    enc = tiktoken.encoding_for_model("gpt-4o")
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]

    # System messages are always kept, so subtract them from the budget first
    budget = max_tokens - sum(len(enc.encode(m["content"])) for m in system)

    kept = []
    # Walk from newest to oldest and stop at the first message that does not fit
    for msg in reversed(others):
        msg_tokens = len(enc.encode(msg["content"]))
        if budget - msg_tokens < 0:
            break
        kept.insert(0, msg)  # preserve chronological order
        budget -= msg_tokens

    return system + kept
```

### Strategy 3: Summarize Older Messages

Compress old history using a smaller/cheaper model.

```python
async def summarize_and_compact(messages, client, keep_recent=10):
    """Summarize old messages, keep recent ones verbatim.

    `client` must be an async client (e.g. openai.AsyncOpenAI),
    because the completion call is awaited.
    """
    if len(messages) <= keep_recent:
        return messages  # nothing old enough to summarize

    old_messages = messages[:-keep_recent]  # older history (may include the system prompt)
    recent = messages[-keep_recent:]        # keep the last N messages verbatim

    summary_prompt = "Summarize this conversation concisely:\n\n"
    for m in old_messages:
        summary_prompt += f"{m['role']}: {m['content']}\n"

    summary = await client.chat.completions.create(
        model="gpt-4o-mini",  # use cheap model for summarization
        messages=[{"role": "user", "content": summary_prompt}],
        max_tokens=500,
    )

    return [
        {"role": "system", "content": f"Previous conversation summary: {summary.choices[0].message.content}"},
        *recent,
    ]
```

### Strategy 4: RAG (Retrieval Augmented Generation)

Store information externally, retrieve only what's relevant.
```python
# Import paths for recent LangChain releases; older versions used
# langchain.vectorstores and langchain.embeddings.
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# `documents` (a list of LangChain Document objects) and `user_query` (a str)
# are assumed to be defined elsewhere.

# Store documents in vector DB (once)
vectorstore = Chroma.from_documents(documents, OpenAIEmbeddings())

# At query time, retrieve only relevant chunks
relevant_docs = vectorstore.similarity_search(user_query, k=5)
context = "\n\n".join([doc.page_content for doc in relevant_docs])

# Inject only relevant context into prompt
messages = [
    {"role": "system", "content": f"Answer using this context:\n{context}"},
    {"role": "user", "content": user_query},
]
```

### Strategy 5: Hierarchical Context Management

Layer different types of context at different granularities:

```
┌─────────────────────────────────────────────┐
│ Layer 1: System Prompt (always present)     │  ~500 tokens
│ Role, rules, formatting instructions        │
├─────────────────────────────────────────────┤
│ Layer 2: Long-term Memory (compact)         │  ~500 tokens
│ User preferences, key facts                 │
├─────────────────────────────────────────────┤
│ Layer 3: Session Summary (compressed)       │  ~1,000 tokens
│ Condensed history of current conversation   │
├─────────────────────────────────────────────┤
│ Layer 4: Recent Messages (full detail)      │  ~5,000 tokens
│ Last 5-10 turns verbatim                    │
├─────────────────────────────────────────────┤
│ Layer 5: Retrieved Context (dynamic)        │  ~3,000 tokens
│ RAG results relevant to current query       │
├─────────────────────────────────────────────┤
│ Layer 6: Reserved for Output                │  ~4,096 tokens
│ Space for model's response                  │
└─────────────────────────────────────────────┘
```

### Strategy 6: Prompt Caching

The major providers offer caching to reduce cost and latency for repeated context. Discounts vary by model and change over time, so treat the figures as approximate:

| Provider | Caching Type | Savings |
|----------|-------------|--------|
| **OpenAI** | Automatic (hashes prompt prefix) | ~50% on cached input tokens |
| **Anthropic** | Explicit (`cache_control` markers) | Up to 90% on cached tokens |
| **Google** | Automatic + explicit | Up to 75% on cached tokens |

> **Best practice:** Place static content (system prompts, reference documents) at the **start** of the prompt and dynamic content at the **end** to maximize cache hits.

---

## The "Lost in the Middle" Problem

Research shows LLMs have a **U-shaped performance curve**: they recall information best from the **beginning** and **end** of the context, with degradation of **30% or more** reported in some benchmarks for content in the middle.

```
Recall Performance by Position:

High  ██████████░░░░░░░░░░░░██████████
Low
      ▲
      Beginning      Middle        End
                       ↑
               Worst recall here
```

**Mitigation:** Place the most critical information at the start or end of your prompt, never buried in the middle.

---

## Effective vs. Advertised Context Length

The ranges below are rough heuristics rather than measured guarantees; reliable length depends on the model and the task.

| Advertised | Effective (Reliable) | Notes |
|-----------|---------------------|-------|
| 128K | ~80–100K | Most models handle well |
| 200K | ~130–160K | Performance degrades toward limit |
| 1M | ~600–800K | Significant quality drop at extremes |
| 10M | ~2–5M (estimated) | Very new, limited benchmarks |

> **Key insight:** Advertised context ≠ effective context. Plan for **60–80%** of the advertised maximum for reliable performance.

---

## Context Management Decision Tree

```mermaid
graph TD
    A["How much context do you need?"] --> B{"< 50% of model's limit?"}
    B -->|Yes| C["Send everything, no management needed"]
    B -->|No| D{"Is all context equally important?"}
    D -->|No| E["Use RAG, retrieve only relevant parts"]
    D -->|Yes| F{"Is it a multi-turn conversation?"}
    F -->|Yes| G["Summarize older turns + keep recent verbatim"]
    F -->|No| H{"Is context static across requests?"}
    H -->|Yes| I["Use prompt caching to reduce cost"]
    H -->|No| J["Use sliding window truncation"]

    style A fill:#dbeafe,stroke:#2563eb
    style C fill:#d1fae5,stroke:#059669
    style E fill:#fef3c7,stroke:#d97706
    style G fill:#f3e8ff,stroke:#9333ea
    style I fill:#fce7f3,stroke:#db2777
    style J fill:#fee2e2,stroke:#dc2626
```
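The decision tree above, together with the 60–80% planning rule, can also be written as a small function. This is only a sketch of the same branching logic: `effective_budget` and `choose_strategy` are hypothetical names, and the 70% default is an assumed midpoint of the suggested range, not a measured value.

```python
def effective_budget(advertised_tokens: int, reliable_percent: int = 70) -> int:
    """Tokens to plan for, as a percentage of the advertised window."""
    return advertised_tokens * reliable_percent // 100


def choose_strategy(
    needed_tokens: int,
    model_limit: int,
    equally_important: bool,
    multi_turn: bool,
    static_context: bool,
) -> str:
    """Follow the decision tree branches in order and return the suggested strategy."""
    if needed_tokens < model_limit // 2:
        return "send everything"
    if not equally_important:
        return "rag"
    if multi_turn:
        return "summarize older turns"
    if static_context:
        return "prompt caching"
    return "sliding window truncation"


budget = effective_budget(128_000)
strategy = choose_strategy(
    needed_tokens=100_000,
    model_limit=128_000,
    equally_important=True,
    multi_turn=True,
    static_context=False,
)
print(budget, strategy)
```

In practice the strategies combine: a long-running assistant might cache its system prompt, summarize old turns, and still use retrieval for reference material.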
