What is Cache Hit vs Cache Miss in AI input tokens?
#gen-ai#tokens#mlops
Answer
Cache Hit vs Cache Miss in AI Input Tokens
Prompt caching is a feature that allows AI providers to cache repeated input token sequences, so you only pay full price once and get a discount on subsequent uses. A cache hit means the cached content was reused; a cache miss means it had to be processed fresh.
Why Prompt Caching Exists
Many AI applications send the same content repeatedly:
- A long system prompt that never changes
- A large document being analyzed across many queries
- Library documentation used in every coding session
Without caching, you pay full token price every time. With caching, you pay once and subsequent calls are ~90% cheaper.
Anthropic Cache Pricing
| Token Type | Price (Claude 3.5 Sonnet) |
|---|---|
| Regular input | $3.00 / million tokens |
| Cache write | $3.75 / million tokens (25% more — stores in cache) |
| Cache read (hit) | $0.30 / million tokens (90% cheaper than regular!) |
Cache Hit vs Cache Miss
| Cache Miss | Cache Hit | |
|---|---|---|
| What happens | Content processed fresh | Stored KV cache is reused |
| Processing | Full transformer computation | Skip recomputation |
| Cost | Full input token price | ~10% of regular price |
| Latency | Higher (full processing) | Lower (cache lookup) |
| When | First request, cache expired | Subsequent requests within TTL |
Implementing Prompt Caching (Anthropic)
pythonfrom anthropic import Anthropic client = Anthropic() # Large document that stays the same across queries LARGE_DOCUMENT = open("company_policies.txt").read() # 50,000 tokens def query_with_cache(question: str) -> str: response = client.messages.create( model="claude-3-5-sonnet-20241022", max_tokens=1024, system=[ { "type": "text", "text": LARGE_DOCUMENT, "cache_control": {"type": "ephemeral"} # Mark for caching }, { "type": "text", "text": "Answer questions about company policies accurately." } ], messages=[{"role": "user", "content": question}] ) # Check cache status in response usage = response.usage print(f"Input tokens: {usage.input_tokens}") print(f"Cache read tokens: {usage.cache_read_input_tokens}") # Cache HIT print(f"Cache write tokens: {usage.cache_creation_input_tokens}") # Cache MISS (first time) return response.content[0].text # First call: cache MISS — writes to cache (costs slightly more) query_with_cache("What is the vacation policy?") # Second call: cache HIT — reads from cache (90% cheaper) query_with_cache("How do I file an expense report?")
Cache Lifetime
| Provider | Cache TTL |
|---|---|
| Anthropic | 5 minutes (ephemeral) |
| OpenAI | 1 hour (automatic) |
After TTL expires, the next request is a cache miss again and rebuilds the cache.
When Cache Hits Happen
textRequest 1: [SystemPrompt + DOCUMENT + Q1] → MISS (writes cache for DOCUMENT) Request 2: [SystemPrompt + DOCUMENT + Q2] → HIT (DOCUMENT reused from cache) Request 3: [SystemPrompt + DOCUMENT + Q3] → HIT ... 6 minutes later (cache expired) ... Request N: [SystemPrompt + DOCUMENT + QN] → MISS again
Cost Savings Example
python# 50,000 token document, queried 100 times per day tokens = 50_000 queries = 100 regular_cost = tokens * queries / 1_000_000 * 3.00 # $15/day # With caching cache_write = tokens / 1_000_000 * 3.75 # $0.1875 (once per 5 min) cache_reads = tokens * (queries - 12) / 1_000_000 * 0.30 # ~$1.32/day (rest are hits) cached_cost = cache_write * 12 + cache_reads # ~$3.54/day print(f"Savings: ${regular_cost - cached_cost:.2f}/day") # ~$11.46/day savings
OpenAI Automatic Caching
OpenAI applies caching automatically — no explicit markup needed:
python# OpenAI caches the longest prefix of input tokens automatically response = client.chat.completions.create( model="gpt-4o", messages=[long_system + long_context + short_question] ) # If the long prefix matches a cached request, it's a cache hit print(response.usage.prompt_tokens_details.cached_tokens) # Cache hit tokens