What is a context window, and how do you manage it in an LLM?
Answer
What Is a Context Window?
A context window is the maximum number of tokens an LLM can process in a single request, including both input tokens (your prompt, system instructions, conversation history) and output tokens (the model's response). Everything outside this window is invisible to the model.
```text
You type: "Explain RAG in detail"   (5 tokens)
+ System prompt                     (500 tokens)
+ Conversation history              (2,000 tokens)
+ Model response                    (1,000 tokens)
= Total: 3,505 tokens used from the context window
```
Rule of thumb: 1 token ≈ 0.75 English words. So 128K tokens ≈ 96,000 words ≈ a 192-page book.
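The rule of thumb can be turned into a quick estimator. This is a minimal sketch, assuming the 0.75 words-per-token ratio above and roughly 500 words per page (the page size implied by the 192-page figure). The function names are illustrative, and real ratios vary by tokenizer and language.

```python
WORDS_PER_TOKEN = 0.75   # rule-of-thumb ratio for English text
WORDS_PER_PAGE = 500     # assumed page size, implied by 96,000 words ~ 192 pages


def tokens_to_words(tokens: int) -> int:
    """Estimate how many English words fit in a given number of tokens."""
    return round(tokens * WORDS_PER_TOKEN)


def tokens_to_pages(tokens: int) -> int:
    """Estimate the equivalent page count for a given number of tokens."""
    return round(tokens_to_words(tokens) / WORDS_PER_PAGE)


if __name__ == "__main__":
    print(tokens_to_words(128_000))  # 96000
    print(tokens_to_pages(128_000))  # 192
```

For an exact count, use the tokenizer of the model you are calling. This estimate is only useful for rough capacity planning.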
Per-Prompt or Per-Session?
This is one of the most commonly misunderstood aspects of context windows.
API Usage: Per-Request (Stateless)
Every API call is completely independent. The model has zero memory between requests. You must include the entire conversation history in each request:
```python
from openai import OpenAI

client = OpenAI()

# Each API call must include ALL prior messages
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is RAG?"},
    {"role": "assistant", "content": "RAG stands for..."},  # previous response
    {"role": "user", "content": "How does it differ from fine-tuning?"},  # new question
]

# ALL of the above counts against the context window
response = client.chat.completions.create(model="gpt-4o", messages=messages)
```
Chat Interfaces: Per-Conversation (Managed by Provider)
Services like ChatGPT, Claude.ai, and Gemini manage history automatically:
| Interface | Behavior When Limit Approached |
|---|---|
| ChatGPT | Silently truncates older messages or summarizes |
| Claude.ai | Warns user, suggests starting a new conversation |
| Gemini | May summarize older context or return an error |
| Local (Ollama) | Typically truncates from the beginning |
Key Distinction
| Aspect | API (Stateless) | Chat Interface (Managed) |
|---|---|---|
| Scope | Per individual request | Per conversation session |
| Memory | None; you manage history | Provider manages history |
| When exceeded | Returns error immediately | Truncates/summarizes silently |
| Developer control | Full control over what's included | No control |
| Billing | Pay for ALL tokens in every request | Usually subscription-based |
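Because a stateless API resends the whole history on every call, billed input tokens grow much faster than the conversation itself. The following is a rough sketch of that effect, assuming a fixed, hypothetical token count per user message and per reply (real turns vary in size).

```python
def cumulative_input_tokens(
    turns: int,
    system_tokens: int = 500,
    user_tokens: int = 50,
    reply_tokens: int = 300,
) -> int:
    """Total input tokens billed across `turns` stateless requests.

    Each request resends the system prompt, every earlier user message and
    reply, plus the new user message. All sizes are illustrative assumptions.
    """
    total = 0
    history = 0  # tokens of earlier user messages and replies
    for _ in range(turns):
        request_input = system_tokens + history + user_tokens
        total += request_input
        history += user_tokens + reply_tokens  # this turn joins the history
    return total


if __name__ == "__main__":
    print(cumulative_input_tokens(1))   # 550
    print(cumulative_input_tokens(10))  # 21250
```

Under these assumptions, ten turns bill 21,250 input tokens in total, even though the final request alone is only 3,700 tokens long.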
Context Window Limits: All Major LLMs (2025-2026)
OpenAI
| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| GPT-4o | 128K | 16,384 | tiktoken |
| GPT-4o-mini | 128K | 16,384 | tiktoken |
| GPT-4.1 | ~1M | 32,768 | tiktoken |
| GPT-4.1-mini | ~1M | 32,768 | tiktoken |
| GPT-4.1-nano | ~1M | 32,768 | tiktoken |
| o1 | 200K | 100,000 | tiktoken |
| o3 | 200K | 100,000 | tiktoken |
| o3-mini | 200K | 100,000 | tiktoken |
| o4-mini | 200K | 100,000 | tiktoken |
Anthropic
| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| Claude 3.5 Sonnet | 200K | 8,192 | Claude BPE |
| Claude 3.5 Haiku | 200K | 8,192 | Claude BPE |
| Claude Opus 4 | 200K | 32,000 | Claude BPE |
| Claude Sonnet 4 | 200K | 64,000 | Claude BPE |
Google
| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| Gemini 1.5 Pro | 2M | 8,192 | SentencePiece |
| Gemini 1.5 Flash | 1M | 8,192 | SentencePiece |
| Gemini 2.0 Flash | 1M | 8,192 | SentencePiece |
| Gemini 2.5 Pro | 1M | 65,536 | SentencePiece |
| Gemini 2.5 Flash | 1M | 65,536 | SentencePiece |
Meta (Open Source)
| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| Llama 3.1 (8B/70B/405B) | 128K | Configurable | BPE (128K vocab) |
| Llama 3.3 70B | 128K | 32,768 | BPE (128K vocab) |
| Llama 4 Scout (109B MoE) | 10M | ~32,768 | BPE (200K vocab) |
| Llama 4 Maverick (400B MoE) | 1M | ~32,768 | BPE (200K vocab) |
Mistral
| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| Mistral Large 2 | 128K | ~8,192 | SentencePiece BPE |
| Mistral Small 3.2 | 128K | ~8,192 | SentencePiece BPE |
| Mixtral 8x22B | 64K | ~4,096 | SentencePiece BPE |
DeepSeek
| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| DeepSeek-V3 | 128K | 8,000 | Custom BBPE (100K vocab) |
| DeepSeek-R1 | 128K | 32,768 | Custom BBPE (100K vocab) |
Others
| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| Grok-2 | 131K | 131K | Custom BPE |
| Grok-3 | 131K | ~32,768 | Custom BPE |
| Command R+ | 128K | 4,000 | Custom BPE |
| Command A | 256K | 8,000 | Custom BPE |
| Qwen 2.5 Turbo | 1M | ~8,192 | Custom BPE (152K vocab) |
| Qwen 3 | 32K-131K | ~8,192 | Custom BPE |
Why Context Windows Are Limited
1. Quadratic Attention Complexity: O(n²)
Self-attention computes an n × n attention matrix for a sequence of n tokens, so its cost grows quadratically with context length:
| Context Length | Attention Matrix Size | Relative Cost |
|---|---|---|
| 4K | 16M entries | 1x |
| 32K | 1B entries | 64x |
| 128K | 16B entries | 1,024x |
| 1M | 1T entries | 62,500x |
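The figures in this table follow directly from squaring the context length. Here is a small sketch that reproduces them, treating K as 1,000 and M as 1,000,000 to match the table's rounding. It counts entries for a single attention matrix. Real models compute one per head per layer, and memory-efficient kernels avoid storing the full matrix.

```python
def attention_entries(context_length: int) -> int:
    """Number of entries in one n x n attention score matrix."""
    return context_length ** 2


def relative_cost(context_length: int, baseline: int = 4_000) -> float:
    """Attention cost relative to a baseline context length."""
    return attention_entries(context_length) / attention_entries(baseline)


if __name__ == "__main__":
    for n in (4_000, 32_000, 128_000, 1_000_000):
        print(f"{n:>9,} tokens: {attention_entries(n):.1e} entries, "
              f"{relative_cost(n):,.0f}x")
```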
2. KV Cache Memory
During generation, Key/Value vectors for every token are cached:
| Model Size | Context | Approximate KV Cache (FP16) |
|---|---|---|
| 7B | 4K | ~1 GB |
| 7B | 128K | ~32 GB |
| 70B | 128K | ~160 GB |
| 70B | 1M | ~1.2 TB |
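KV cache size can be estimated from a model's shape. The sketch below uses the standard formula (two tensors, K and V, per layer, each holding one vector per KV head per token). The layer, head, and dimension values are hypothetical examples, not the configuration of any specific model in the table. Actual numbers depend heavily on architecture and precision, which is why the table gives only approximations.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,  # FP16
    batch_size: int = 1,
) -> int:
    """Approximate KV cache size: one K and one V vector per token, per layer."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


GIB = 1024 ** 3

if __name__ == "__main__":
    # Hypothetical 7B-class configuration: 32 layers, 32 KV heads of size 128.
    full = kv_cache_bytes(32, 32, 128, seq_len=4_096)
    # Same shape with grouped-query attention sharing 8 KV heads.
    gqa = kv_cache_bytes(32, 8, 128, seq_len=4_096)
    print(full / GIB, gqa / GIB)  # 2.0 0.5
```

The cache grows linearly with sequence length, so the same hypothetical model at 32 times the context needs 32 times the memory.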
3. Positional Encoding Limits
Models trained with RoPE (Rotary Position Embeddings) degrade beyond their training length. Extending requires techniques like YaRN or LongRoPE.
Solutions Used by Modern Models
| Technique | How It Works | Used By |
|---|---|---|
| Flash Attention | Memory-efficient exact attention via tiling; O(n) memory | Nearly all modern LLMs |
| Grouped Query Attention (GQA) | Multiple query heads share K/V heads, reducing KV cache 4-8x | Llama 3, Mistral, Gemini |
| Multi-Head Latent Attention (MLA) | Compresses K/V into low-rank latent space | DeepSeek-V3 |
| Sliding Window Attention | Each layer attends to only a fixed window | Mistral, Mixtral |
| Ring Attention | Distributes sequence across GPUs in a ring | Llama 4 Scout (10M context) |
| YaRN / LongRoPE | Extends RoPE to 16-512x original training length | Llama, Qwen, Phi |
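To make one of these techniques concrete, here is a minimal sketch of the mask behind sliding window attention, written in plain Python for readability. It is an illustration only: production implementations build the mask with tensor operations or fuse it into the attention kernel, and the function name is an assumption.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Causal: j <= i. Windowed: only the most recent `window` positions.
    """
    return [
        [i - window < j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    for row in sliding_window_mask(seq_len=5, window=3):
        print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `window` allowed positions, so the number of attended pairs grows as O(n · w) instead of O(n²).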
Context Window Management Strategies
Strategy 1: Token Counting Before Sending
Always count tokens before making API calls to avoid errors.
```python
import tiktoken


def check_fits_context(
    messages: list[dict],
    model: str = "gpt-4o",
    max_context: int = 128_000,
    reserve_for_output: int = 4_096,
) -> bool:
    """Return True if the messages fit, leaving room for the response."""
    enc = tiktoken.encoding_for_model(model)
    # Counts message content only; chat formatting adds a few tokens per message.
    total = sum(len(enc.encode(m["content"])) for m in messages)
    available = max_context - reserve_for_output
    print(f"Using {total}/{available} tokens ({total / available * 100:.1f}%)")
    return total <= available
```
Strategy 2: Sliding Window Truncation
Keep only the most recent messages that fit.
```python
import tiktoken


def sliding_window_truncate(messages, max_tokens=120_000):
    """Keep system prompt + most recent messages within budget."""
    enc = tiktoken.encoding_for_model("gpt-4o")
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]

    # The system prompt is always kept, so subtract it from the budget first.
    budget = max_tokens - sum(len(enc.encode(m["content"])) for m in system)

    kept = []
    for msg in reversed(others):  # walk from newest to oldest
        msg_tokens = len(enc.encode(msg["content"]))
        if budget - msg_tokens < 0:
            break  # stop at the first message that no longer fits
        kept.append(msg)
        budget -= msg_tokens
    kept.reverse()  # restore chronological order
    return system + kept
```
Strategy 3: Summarize Older Messages
Compress old history using a smaller/cheaper model.
```python
from openai import AsyncOpenAI


async def summarize_and_compact(messages, client: AsyncOpenAI, keep_recent=10):
    """Summarize old messages, keep recent ones verbatim."""
    if len(messages) <= keep_recent:
        return messages  # nothing old enough to summarize

    old_messages = messages[:-keep_recent]  # older history
    recent = messages[-keep_recent:]        # keep the last 10 messages

    summary_prompt = "Summarize this conversation concisely:\n"
    for m in old_messages:
        summary_prompt += f"{m['role']}: {m['content']}\n"

    summary = await client.chat.completions.create(
        model="gpt-4o-mini",  # use cheap model for summarization
        messages=[{"role": "user", "content": summary_prompt}],
        max_tokens=500,
    )
    return [
        {
            "role": "system",
            "content": f"Previous conversation summary: {summary.choices[0].message.content}",
        },
        *recent,
    ]
```
Strategy 4: RAG (Retrieval Augmented Generation)
Store information externally, retrieve only what's relevant.
```python
# Import paths assume LangChain's split packages (langchain-community, langchain-openai).
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# `documents` (a list of LangChain Document objects) and `user_query` (a string)
# are assumed to be defined by the caller.

# Store documents in vector DB (once)
vectorstore = Chroma.from_documents(documents, OpenAIEmbeddings())

# At query time, retrieve only relevant chunks
relevant_docs = vectorstore.similarity_search(user_query, k=5)
context = "\n".join(doc.page_content for doc in relevant_docs)

# Inject only relevant context into prompt
messages = [
    {"role": "system", "content": f"Answer using this context:\n{context}"},
    {"role": "user", "content": user_query},
]
```
Strategy 5: Hierarchical Context Management
Layer different types of context at different granularities:
```text
┌─────────────────────────────────────────────┐
│ Layer 1: System Prompt (always present)     │  ~500 tokens
│ Role, rules, formatting instructions        │
├─────────────────────────────────────────────┤
│ Layer 2: Long-term Memory (compact)         │  ~500 tokens
│ User preferences, key facts                 │
├─────────────────────────────────────────────┤
│ Layer 3: Session Summary (compressed)       │  ~1,000 tokens
│ Condensed history of current conversation   │
├─────────────────────────────────────────────┤
│ Layer 4: Recent Messages (full detail)      │  ~5,000 tokens
│ Last 5-10 turns verbatim                    │
├─────────────────────────────────────────────┤
│ Layer 5: Retrieved Context (dynamic)        │  ~3,000 tokens
│ RAG results relevant to current query       │
├─────────────────────────────────────────────┤
│ Layer 6: Reserved for Output                │  ~4,096 tokens
│ Space for model's response                  │
└─────────────────────────────────────────────┘
```
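The layers in the diagram above can be tracked as explicit per-layer budgets. This is a minimal sketch, assuming the token figures from the diagram. The `Layer` class, layer names, and helper functions are hypothetical, and token counts would come from your tokenizer of choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    budget: int  # maximum tokens this layer may use


# Budgets mirror the diagram; tune them per application.
LAYERS = [
    Layer("system_prompt", 500),
    Layer("long_term_memory", 500),
    Layer("session_summary", 1_000),
    Layer("recent_messages", 5_000),
    Layer("retrieved_context", 3_000),
    Layer("reserved_output", 4_096),
]


def total_budget(layers: list[Layer]) -> int:
    """Sum of all layer budgets, including the space reserved for output."""
    return sum(layer.budget for layer in layers)


def over_budget(usage: dict[str, int], layers: list[Layer]) -> list[str]:
    """Names of layers whose actual token usage exceeds their budget."""
    return [layer.name for layer in layers if usage.get(layer.name, 0) > layer.budget]


if __name__ == "__main__":
    print(total_budget(LAYERS))  # 14096
    print(over_budget({"recent_messages": 6_000}, LAYERS))  # ['recent_messages']
```

Checking each layer separately shows which part of the prompt to compress (for example, summarizing more history) instead of truncating the whole request.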
Strategy 6: Prompt Caching
All major providers now offer caching to reduce cost and latency for repeated context:
| Provider | Caching Type | Savings |
|---|---|---|
| OpenAI | Automatic (hashes prompt prefix) | 50% on input tokens |
| Anthropic | Explicit | 90% on cached tokens |
| Google | Automatic + explicit | 75% on cached tokens |
Best practice: Place static content (system prompts, reference documents) at the start of the prompt and dynamic content at the end to maximize cache hits.
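The effect of a cache discount on input cost can be estimated with simple arithmetic. The sketch below takes the discount as a fraction (for example, 0.5 for 50%). The price used in the example is a made-up placeholder, and the calculation ignores any cache-write surcharge or storage fee a provider may apply.

```python
def input_cost(
    prompt_tokens: int,
    cached_tokens: int,
    price_per_million: float,
    cache_discount: float,
) -> float:
    """Input cost when `cached_tokens` of the prompt are served from cache.

    `cache_discount` is the fractional saving on cached tokens (0.5 = 50%).
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    billable = uncached + cached_tokens * (1 - cache_discount)
    return billable * price_per_million / 1_000_000


if __name__ == "__main__":
    # Hypothetical price of $2.00 per million input tokens.
    no_cache = input_cost(10_000, 0, 2.0, 0.5)
    with_cache = input_cost(10_000, 8_000, 2.0, 0.5)
    print(f"${no_cache:.4f} vs ${with_cache:.4f}")
```

The saving scales with the share of the prompt that is a stable prefix, which is why static content belongs at the start.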
The "Lost in the Middle" Problem
Research shows LLMs have a U-shaped performance curve: they recall information best from the beginning and end of the context, with 30%+ degradation for content in the middle.
```text
Recall performance by position:

  Beginning      Middle        End
    High          Low          High
                   ^
            Worst recall here
```
Mitigation: Place the most critical information at the start or end of your prompt, never buried in the middle.
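One way to apply this mitigation to retrieved documents is to reorder them so the strongest matches sit at the edges of the prompt. The following is a minimal sketch, assuming the input list is already sorted from most to least relevant. The function name is illustrative.

```python
def reorder_for_long_context(ranked_docs: list[str]) -> list[str]:
    """Place the strongest items at the edges and the weakest in the middle.

    `ranked_docs` must be sorted from most to least relevant.
    """
    front = ranked_docs[0::2]        # ranks 1, 3, 5, ... fill the start
    back = ranked_docs[1::2][::-1]   # ranks 2, 4, 6, ... reversed to fill the end
    return front + back


if __name__ == "__main__":
    print(reorder_for_long_context(["d1", "d2", "d3", "d4", "d5"]))
    # ['d1', 'd3', 'd5', 'd4', 'd2']
```

The most relevant document lands first, the second most relevant lands last, and the least relevant ends up in the middle, where recall is weakest.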
Effective vs. Advertised Context Length
| Advertised | Effective (Reliable) | Notes |
|---|---|---|
| 128K | ~80-100K | Most models handle well |
| 200K | ~130-160K | Performance degrades toward limit |
| 1M | ~600-800K | Significant quality drop at extremes |
| 10M | ~2-5M (estimated) | Very new, limited benchmarks |
Key insight: Advertised context ≠ effective context. Plan for 60-80% of the advertised maximum for reliable performance.
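This planning rule is easy to encode as a helper. Here is a small sketch, where the default of 70% is simply an assumed midpoint of the 60-80% range. Integer arithmetic avoids floating-point rounding surprises.

```python
def planning_budget(advertised_tokens: int, percent: int = 70) -> int:
    """Token budget to plan for, as a percentage of the advertised window."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return advertised_tokens * percent // 100


if __name__ == "__main__":
    print(planning_budget(128_000, 60))  # 76800
    print(planning_budget(128_000, 80))  # 102400
```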