What is a context window, and how do you manage it in an LLM?

#gen-ai#llm#context-window#tokens#rag#prompt-engineering#memory-management

Answer

What Is a Context Window?

A context window is the maximum number of tokens an LLM can process in a single request, counting both input tokens (your prompt, system instructions, conversation history) and output tokens (the model's response). Everything outside this window is invisible to the model.

$$\text{Input Tokens} + \text{Output Tokens} \le \text{Context Window}$$

text
You type: "Explain RAG in detail" (5 tokens)
   +
System prompt (500 tokens)
   +
Conversation history (2,000 tokens)
   +
Model response (1,000 tokens)
   =
Total: 3,505 tokens used from context window

Rule of thumb: 1 token ≈ 0.75 English words. So 128K tokens ≈ 96,000 words ≈ a 192-page book.
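
The arithmetic behind this rule of thumb can be sketched as follows. The function names and the 500-words-per-page figure (implied by 96,000 words filling 192 pages) are assumptions for illustration; real ratios vary by tokenizer and language.

```python
def estimate_words(tokens: int, words_per_100_tokens: int = 75) -> int:
    """Estimate English words from a token count (about 0.75 words per token)."""
    return tokens * words_per_100_tokens // 100


def estimate_pages(tokens: int, words_per_page: int = 500) -> int:
    """Estimate book pages, assuming 500 words per page."""
    return estimate_words(tokens) // words_per_page


print(estimate_words(128_000), "words,", estimate_pages(128_000), "pages")
```

Integer math is used so the estimate does not pick up floating-point rounding noise.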


Per-Prompt or Per-Session?

This is one of the most commonly misunderstood aspects of context windows.

API Usage: Per-Request (Stateless)

Every API call is completely independent. The model has zero memory between requests. You must include the entire conversation history in each request:

python
from openai import OpenAI
client = OpenAI()

# Each API call must include ALL prior messages
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is RAG?"},
    {"role": "assistant", "content": "RAG stands for..."},     # previous response
    {"role": "user", "content": "How does it differ from fine-tuning?"},  # new question
]
# ALL of the above counts against the context window
response = client.chat.completions.create(model="gpt-4o", messages=messages)
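
Because the API is stateless, client code usually wraps the history in a small helper that resends it on every call. The sketch below is one possible shape, not an official client pattern: `Conversation`, `ask`, and `fake_send` are hypothetical names, and `fake_send` stands in for a real API call so the example runs offline.

```python
from typing import Callable

Message = dict[str, str]


class Conversation:
    """Keeps the full message history so every request can resend it."""

    def __init__(self, system_prompt: str, send: Callable[[list[Message]], str]):
        self.messages: list[Message] = [{"role": "system", "content": system_prompt}]
        self._send = send  # hypothetical callable wrapping the real API call

    def ask(self, user_text: str) -> str:
        self.messages.append({"role": "user", "content": user_text})
        # The whole history goes out with each request
        reply = self._send(list(self.messages))
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def fake_send(messages: list[Message]) -> str:
    """Stand-in for a real API call: reports how many messages it received."""
    return f"received {len(messages)} messages"


chat = Conversation("You are a helpful assistant.", fake_send)
first = chat.ask("What is RAG?")
second = chat.ask("How does it differ from fine-tuning?")
print(first)
print(second)
```

The second call sends four messages rather than two, which is why token usage and cost grow with every turn.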

Chat Interfaces: Per-Conversation (Managed by Provider)

Services like ChatGPT, Claude.ai, and Gemini manage history automatically:

| Interface | Behavior When Limit Approached |
| --- | --- |
| ChatGPT | Silently truncates older messages or summarizes |
| Claude.ai | Warns user, suggests starting a new conversation |
| Gemini | May summarize older context or return an error |
| Local (Ollama) | Typically truncates from the beginning |

Key Distinction

| Aspect | API (Stateless) | Chat Interface (Managed) |
| --- | --- | --- |
| Scope | Per individual request | Per conversation session |
| Memory | None; you manage history | Provider manages history |
| When exceeded | Returns error immediately | Truncates/summarizes silently |
| Developer control | Full control over what's included | No control |
| Billing | Pay for ALL tokens in every request | Usually subscription-based |

Context Window Limits: All Major LLMs (2025–2026)

OpenAI

| Model | Context Window | Max Output | Tokenizer |
| --- | --- | --- | --- |
| GPT-4o | 128K | 16,384 | tiktoken (`o200k_base`) |
| GPT-4o-mini | 128K | 16,384 | tiktoken (`o200k_base`) |
| GPT-4.1 | ~1M | 32,768 | tiktoken (`o200k_base`) |
| GPT-4.1-mini | ~1M | 32,768 | tiktoken (`o200k_base`) |
| GPT-4.1-nano | ~1M | 32,768 | tiktoken (`o200k_base`) |
| o1 | 200K | 100,000 | tiktoken (`o200k_base`) |
| o3 | 200K | 100,000 | tiktoken (`o200k_base`) |
| o3-mini | 200K | 100,000 | tiktoken (`o200k_base`) |
| o4-mini | 200K | 100,000 | tiktoken (`o200k_base`) |

Anthropic

| Model | Context Window | Max Output | Tokenizer |
| --- | --- | --- | --- |
| Claude 3.5 Sonnet | 200K | 8,192 | Claude BPE |
| Claude 3.5 Haiku | 200K | 8,192 | Claude BPE |
| Claude Opus 4 | 200K | 32,000 | Claude BPE |
| Claude Sonnet 4 | 200K | 64,000 | Claude BPE |

Google

| Model | Context Window | Max Output | Tokenizer |
| --- | --- | --- | --- |
| Gemini 1.5 Pro | 2M | 8,192 | SentencePiece |
| Gemini 1.5 Flash | 1M | 8,192 | SentencePiece |
| Gemini 2.0 Flash | 1M | 8,192 | SentencePiece |
| Gemini 2.5 Pro | 1M | 65,536 | SentencePiece |
| Gemini 2.5 Flash | 1M | 65,536 | SentencePiece |

Meta (Open Source)

| Model | Context Window | Max Output | Tokenizer |
| --- | --- | --- | --- |
| Llama 3.1 (8B/70B/405B) | 128K | Configurable | BPE (128K vocab) |
| Llama 3.3 70B | 128K | 32,768 | BPE (128K vocab) |
| Llama 4 Scout (109B MoE) | 10M | ~32,768 | BPE (200K vocab) |
| Llama 4 Maverick (400B MoE) | 1M | ~32,768 | BPE (200K vocab) |

Mistral

| Model | Context Window | Max Output | Tokenizer |
| --- | --- | --- | --- |
| Mistral Large 2 | 128K | ~8,192 | SentencePiece BPE |
| Mistral Small 3.2 | 128K | ~8,192 | SentencePiece BPE |
| Mixtral 8x22B | 64K | ~4,096 | SentencePiece BPE |
DeepSeek

| Model | Context Window | Max Output | Tokenizer |
| --- | --- | --- | --- |
| DeepSeek-V3 | 128K | 8,000 | Custom BBPE (100K vocab) |
| DeepSeek-R1 | 128K | 32,768 | Custom BBPE (100K vocab) |

Others

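| Model | Context Window | Max Output | Tokenizer |
| --- | --- | --- | --- |
| Grok-2 | 131K | 131K | Custom BPE |
| Grok-3 | 131K | ~32,768 | Custom BPE |
| Command R+ | 128K | 4,000 | Custom BPE |
| Command A | 256K | 8,000 | Custom BPE |
| Qwen 2.5 Turbo | 1M | ~8,192 | Custom BPE (152K vocab) |
| Qwen 3 | 32K–131K | ~8,192 | Custom BPE |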

Why Context Windows Are Limited

1. Quadratic Attention Complexity: O(n²)

Self-attention computes an n × n attention matrix, so doubling the context length quadruples the attention compute:

| Context Length | Attention Matrix Size | Relative Cost |
| --- | --- | --- |
| 4K | 16M entries | 1x |
| 32K | 1B entries | 64x |
| 128K | 16B entries | 1,024x |
| 1M | 1T entries | 62,500x |
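
The table's figures follow directly from squaring the context length. A minimal sketch, assuming decimal units (4K = 4,000 tokens, which is what yields 62,500x for 1M) and counting one attention matrix per head per layer; the function names are illustrative:

```python
def attention_entries(context_len: int) -> int:
    """Number of entries in one n x n attention matrix."""
    return context_len * context_len


def relative_cost(context_len: int, baseline_len: int = 4_000) -> float:
    """Attention cost relative to a 4K-token baseline."""
    return attention_entries(context_len) / attention_entries(baseline_len)


for n in (4_000, 32_000, 128_000, 1_000_000):
    print(f"{n:>9,} tokens: {attention_entries(n):.2e} entries, "
          f"{relative_cost(n):,.0f}x")
```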

2. KV Cache Memory

During generation, Key/Value vectors for every token are cached:

| Model Size | Context | Approximate KV Cache (FP16) |
| --- | --- | --- |
| 7B | 4K | ~1 GB |
| 7B | 128K | ~32 GB |
| 70B | 128K | ~160 GB |
| 70B | 1M | ~1.2 TB |
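
KV cache size is commonly estimated as 2 × layers × KV heads × head dimension × sequence length × bytes per value. The sketch below applies that formula to a hypothetical 7B-class configuration (32 layers, 32 heads of dimension 128). These architecture values are assumptions, and because the table above does not state the architectures behind its figures, the results need not match it exactly.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,  # FP16
    batch_size: int = 1,
) -> int:
    """Estimate KV cache size in bytes for one forward/generation pass."""
    # Factor 2: one Key and one Value vector per token, per layer, per KV head
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Hypothetical 7B-class model with full multi-head attention (32 KV heads)
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4_096)
# Same model with grouped-query attention (8 shared KV heads)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4_096)

print(f"MHA: {mha / 2**30:.2f} GiB, GQA: {gqa / 2**30:.2f} GiB")
```

The cache grows linearly with sequence length, and cutting the number of KV heads shrinks it proportionally.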

3. Positional Encoding Limits

Models trained with RoPE (Rotary Position Embeddings) degrade beyond their training length. Extending requires techniques like YaRN or LongRoPE.

Solutions Used by Modern Models

| Technique | How It Works | Used By |
| --- | --- | --- |
| Flash Attention | Memory-efficient exact attention via tiling, O(n) memory | Nearly all modern LLMs |
| Grouped Query Attention (GQA) | Multiple query heads share K/V heads, reducing KV cache 4–8x | Llama 3, Mistral, Gemini |
| Multi-Head Latent Attention (MLA) | Compresses K/V into low-rank latent space | DeepSeek-V3 |
| Sliding Window Attention | Each layer attends to only a fixed window | Mistral, Mixtral |
| Ring Attention | Distributes sequence across GPUs in a ring | Llama 4 Scout (10M context) |
| YaRN / LongRoPE | Extends RoPE to 16–512x original training length | Llama, Qwen, Phi |
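
To make the sliding-window idea concrete, here is a toy causal mask in plain Python. It is a sketch only: real implementations build this mask on tensors inside the attention kernel, and `sliding_window_mask` is a hypothetical helper name.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    # Causal (j <= i) and local (at most `window` positions back, including i)
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=6, window=3)
attended_per_row = [sum(row) for row in mask]

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row attends to at most `window` positions, so the per-layer cost grows as O(n · window) rather than O(n²).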

Context Window Management Strategies

Strategy 1: Token Counting Before Sending

Always count tokens before making API calls to avoid errors.

python
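import tiktoken

def check_fits_context(
    messages: list[dict],
    model: str = "gpt-4o",
    max_context: int = 128_000,
    reserve_for_output: int = 4_096,
) -> bool:
    """Return True if the messages fit, leaving room for the response."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: fall back to the GPT-4o encoding
        enc = tiktoken.get_encoding("o200k_base")
    # Counts message content only; the chat format adds a few tokens of
    # overhead per message, so treat the total as a slight underestimate.
    total = sum(len(enc.encode(m["content"])) for m in messages)
    available = max_context - reserve_for_output
    print(f"Using {total}/{available} tokens ({total / available:.1%})")
    return total <= available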

Strategy 2: Sliding Window Truncation

Keep only the most recent messages that fit.

python
import tiktoken

def sliding_window_truncate(messages, max_tokens=120_000, model="gpt-4o"):
    """Keep system prompt + most recent messages within budget."""
    enc = tiktoken.encoding_for_model(model)
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]

    # The system prompt is always kept, so charge it to the budget first
    budget = max_tokens - sum(len(enc.encode(m["content"])) for m in system)
    kept = []

    # Walk from newest to oldest; stop at the first message that does not fit
    for msg in reversed(others):
        msg_tokens = len(enc.encode(msg["content"]))
        if msg_tokens > budget:
            break
        kept.append(msg)
        budget -= msg_tokens

    kept.reverse()  # restore chronological order
    return system + kept

Strategy 3: Summarize Older Messages

Compress old history using a smaller/cheaper model.

python
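from openai import AsyncOpenAI

async def summarize_and_compact(messages, client: AsyncOpenAI, keep_recent=10):
    """Summarize old messages, keep recent ones verbatim."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    if len(others) <= keep_recent:
        return messages  # nothing old enough to summarize

    old_messages = others[:-keep_recent]   # older history
    recent = others[-keep_recent:]         # keep the last few messages verbatim

    summary_prompt = "Summarize this conversation concisely:\n" + "\n".join(
        f"{m['role']}: {m['content']}" for m in old_messages
    )

    summary = await client.chat.completions.create(
        model="gpt-4o-mini",  # use a cheap model for summarization
        messages=[{"role": "user", "content": summary_prompt}],
        max_tokens=500,
    )
    summary_text = summary.choices[0].message.content

    return [
        *system,  # keep the original system prompt
        {"role": "system", "content": f"Previous conversation summary: {summary_text}"},
        *recent,
    ]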

Strategy 4: RAG (Retrieval Augmented Generation)

Store information externally, retrieve only what's relevant.

python
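# Newer LangChain releases ship these integrations as separate packages:
#   pip install langchain-chroma langchain-openai
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Placeholder inputs for illustration
documents = [Document(page_content="RAG retrieves relevant text at query time.")]
user_query = "What is RAG?"

# Store documents in vector DB (once)
vectorstore = Chroma.from_documents(documents, OpenAIEmbeddings())

# At query time, retrieve only relevant chunks
relevant_docs = vectorstore.similarity_search(user_query, k=5)
context = "\n".join(doc.page_content for doc in relevant_docs)

# Inject only relevant context into prompt
messages = [
    {"role": "system", "content": f"Answer using this context:\n{context}"},
    {"role": "user", "content": user_query},
]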
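
Retrieved chunks still have to fit the context budget. One way to enforce that, sketched here without any vector store, is to keep chunks in ranked order until a token budget is exhausted. `pack_chunks` and `approx_tokens` are hypothetical helpers, and the default counter is a crude estimate based on the 0.75-words-per-token rule of thumb; a real tokenizer can be passed in via `count_tokens`.

```python
from typing import Callable, Optional


def approx_tokens(text: str) -> int:
    """Rough token estimate: words / 0.75, rounded up."""
    return (len(text.split()) * 4 + 2) // 3


def pack_chunks(
    chunks: list[str],
    max_tokens: int,
    count_tokens: Optional[Callable[[str], int]] = None,
) -> list[str]:
    """Keep chunks in ranked order until the token budget is used up."""
    counter = count_tokens or approx_tokens
    packed, used = [], 0
    for chunk in chunks:
        cost = counter(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit
        packed.append(chunk)
        used += cost
    return packed


ranked = ["a b c", "d e", "f g h i"]  # most relevant first
print(pack_chunks(ranked, max_tokens=5, count_tokens=lambda t: len(t.split())))
```

Stopping at the first chunk that does not fit preserves the ranking; skipping it and trying smaller chunks is an alternative that fills the budget more tightly.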

Strategy 5: Hierarchical Context Management

Layer different types of context at different granularities:

text
┌──────────────────────────────────────────────┐
│ Layer 1: System Prompt (always present)      │  ~500 tokens
│   Role, rules, formatting instructions       │
├──────────────────────────────────────────────┤
│ Layer 2: Long-term Memory (compact)          │  ~500 tokens
│   User preferences, key facts                │
├──────────────────────────────────────────────┤
│ Layer 3: Session Summary (compressed)        │  ~1,000 tokens
│   Condensed history of current conversation  │
├──────────────────────────────────────────────┤
│ Layer 4: Recent Messages (full detail)       │  ~5,000 tokens
│   Last 5-10 turns verbatim                   │
├──────────────────────────────────────────────┤
│ Layer 5: Retrieved Context (dynamic)         │  ~3,000 tokens
│   RAG results relevant to current query      │
├──────────────────────────────────────────────┤
│ Layer 6: Reserved for Output                 │  ~4,096 tokens
│   Space for model's response                 │
└──────────────────────────────────────────────┘
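
The layered layout can be sketched as a small assembler that enforces a budget per layer. The budgets come from the diagram; the layer keys, `build_context`, and the word-based `approx_tokens` estimate are assumptions for illustration, and a real tokenizer would replace the estimate in practice.

```python
LAYER_BUDGETS = {  # token budgets taken from the diagram
    "system_prompt": 500,
    "long_term_memory": 500,
    "session_summary": 1_000,
    "recent_messages": 5_000,
    "retrieved_context": 3_000,
}
RESERVED_FOR_OUTPUT = 4_096


def approx_tokens(text: str) -> int:
    """Rough token estimate: words / 0.75, rounded up."""
    return (len(text.split()) * 4 + 2) // 3


def build_context(layers: dict[str, str]) -> tuple[str, int]:
    """Join the layers in a fixed order; return (prompt, estimated tokens)."""
    parts, total = [], 0
    for name, budget in LAYER_BUDGETS.items():
        text = layers.get(name, "")
        cost = approx_tokens(text)
        if cost > budget:
            raise ValueError(f"{name} uses ~{cost} tokens, budget is {budget}")
        if text:
            parts.append(text)
            total += cost
    return "\n\n".join(parts), total


prompt, used = build_context({
    "system_prompt": "You are helpful.",
    "recent_messages": "user: hi",
})
print(used, "of", sum(LAYER_BUDGETS.values()), "input tokens used")
```

Raising on an over-budget layer makes the failure visible; trimming or summarizing that layer instead is an equally reasonable design.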

Strategy 6: Prompt Caching

All major providers now offer caching to reduce cost and latency for repeated context:

| Provider | Caching Type | Savings |
| --- | --- | --- |
| OpenAI | Automatic (hashes prompt prefix) | 50% on input tokens |
| Anthropic | Explicit (`cache_control` markers) | 90% on cached tokens |
| Google | Automatic + explicit | 75% on cached tokens |

Best practice: Place static content (system prompts, reference documents) at the start of the prompt and dynamic content at the end to maximize cache hits.
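
As an illustration of the static-first layout, the sketch below builds a request payload in the style of Anthropic's Messages API, where a `cache_control` marker on a system block asks the provider to cache the prefix up to that point. Only the payload is constructed, with no network call. `build_cacheable_request` is a hypothetical helper, and the model name and `max_tokens` that a real request also needs are omitted.

```python
def build_cacheable_request(static_reference: str, user_query: str) -> dict:
    """Put static content first (cacheable prefix) and dynamic content last."""
    return {
        "system": [
            {
                "type": "text",
                "text": static_reference,
                # Explicit marker: cache everything up to and including this block
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # The user query changes on every call, so it stays outside the prefix
        "messages": [{"role": "user", "content": user_query}],
    }


reference = "Long, unchanging reference document..."
req_a = build_cacheable_request(reference, "What does section 2 say?")
req_b = build_cacheable_request(reference, "Summarize section 5.")
print(req_a["system"] == req_b["system"])
```

Both requests share an identical prefix, which is the condition for a cache hit; any change to the static block invalidates it.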


The "Lost in the Middle" Problem

Research shows LLMs have a U-shaped recall curve: they retrieve information best from the beginning and end of the context, and accuracy can drop by 30% or more for content in the middle.

text
Recall Performance by Position:

  High  ██████████░░░░░░░░░░░░██████████
  Low               ▲
        Beginning  Middle        End
                    ↑
           Worst recall here

Mitigation: Place the most critical information at the start or end of your prompt, not buried in the middle.
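
For retrieved documents, one way to apply this mitigation is to reorder a relevance-ranked list so the strongest items sit at the two edges and the weakest land in the middle. The helper name `reorder_for_edges` is hypothetical, and the alternating scheme is one simple choice among several.

```python
def reorder_for_edges(docs_by_relevance: list[str]) -> list[str]:
    """Place the most relevant docs at the start and end, the least in the middle.

    Input must be sorted from most to least relevant.
    """
    front, back = [], []
    for rank, doc in enumerate(docs_by_relevance):
        # Alternate: ranks 0, 2, 4... fill the front; ranks 1, 3, 5... fill the back
        (front if rank % 2 == 0 else back).append(doc)
    return front + back[::-1]


ranked = ["doc1", "doc2", "doc3", "doc4", "doc5"]  # doc1 is most relevant
reordered = reorder_for_edges(ranked)
print(reordered)
```

With five documents, the top-ranked one opens the prompt, the second-ranked one closes it, and the lowest-ranked one ends up in the center.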


Effective vs. Advertised Context Length

| Advertised | Effective (Reliable) | Notes |
| --- | --- | --- |
| 128K | ~80–100K | Most models handle well |
| 200K | ~130–160K | Performance degrades toward limit |
| 1M | ~600–800K | Significant quality drop at extremes |
| 10M | ~2–5M (estimated) | Very new, limited benchmarks |

Key insight: Advertised context ≠ effective context. Plan for 60–80% of the advertised maximum for reliable performance.
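
The 60–80% guideline translates into a planning range with simple arithmetic. The function name is hypothetical and the percentages are the rule of thumb stated here, not measured values for any specific model.

```python
def effective_budget(
    advertised_tokens: int, low_pct: int = 60, high_pct: int = 80
) -> tuple[int, int]:
    """Return the (conservative, optimistic) token budget to plan for."""
    return (
        advertised_tokens * low_pct // 100,
        advertised_tokens * high_pct // 100,
    )


for advertised in (128_000, 200_000, 1_000_000):
    low, high = effective_budget(advertised)
    print(f"{advertised:>9,} advertised -> plan for {low:,} to {high:,}")
```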


Context Management Decision Tree