What is Cache Hit vs Cache Miss in AI input tokens?

Question

Accepted Answer

## Cache Hit vs Cache Miss in AI Input Tokens

**Prompt caching** is a feature that allows AI providers to cache repeated input token sequences, so you only pay full price once and get a discount on subsequent uses. A **cache hit** means the cached content was reused; a **cache miss** means it had to be processed fresh.

### Why Prompt Caching Exists

Many AI applications send the same content repeatedly:
- A long system prompt that never changes
- A large document being analyzed across many queries
- Library documentation used in every coding session

Without caching, you pay full token price every time. With caching, you pay once and subsequent calls are ~90% cheaper.

### Anthropic Cache Pricing

| Token Type | Price (Claude 3.5 Sonnet) |
|-----------|--------------------------|
| Regular input | $3.00 / million tokens |
| Cache write | $3.75 / million tokens (25% more — stores in cache) |
| Cache read (hit) | $0.30 / million tokens (90% cheaper than regular!) |

### Cache Hit vs Cache Miss

| | Cache Miss | Cache Hit |
|--|-----------|---------|
| **What happens** | Content processed fresh | Stored KV cache is reused |
| **Processing** | Full transformer computation | Skip recomputation |
| **Cost** | Full input token price | ~10% of regular price |
| **Latency** | Higher (full processing) | Lower (cache lookup) |
| **When** | First request, cache expired | Subsequent requests within TTL |

### Implementing Prompt Caching (Anthropic)

```python
from anthropic import Anthropic

client = Anthropic()

# Large document that stays the same across queries
LARGE_DOCUMENT = open("company_policies.txt").read()  # 50,000 tokens

def query_with_cache(question: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": LARGE_DOCUMENT,
                "cache_control": {"type": "ephemeral"}  # Mark for caching
            },
            {
                "type": "text",
                "text": "Answer questions about company policies accurately."
            }
        ],
        messages=[{"role": "user", "content": question}]
    )

# Check cache status in response
    usage = response.usage
    print(f"Input tokens: {usage.input_tokens}")
    print(f"Cache read tokens: {usage.cache_read_input_tokens}")    # Cache HIT
    print(f"Cache write tokens: {usage.cache_creation_input_tokens}") # Cache MISS (first time)

return response.content[0].text

# First call: cache MISS — writes to cache (costs slightly more)
query_with_cache("What is the vacation policy?")

# Second call: cache HIT — reads from cache (90% cheaper)
query_with_cache("How do I file an expense report?")
```

### Cache Lifetime

| Provider | Cache TTL |
|---------|----------|
| Anthropic | 5 minutes (ephemeral) |
| OpenAI | 1 hour (automatic) |

After TTL expires, the next request is a cache miss again and rebuilds the cache.

### When Cache Hits Happen

```
Request 1: [SystemPrompt + DOCUMENT + Q1] → MISS (writes cache for DOCUMENT)
Request 2: [SystemPrompt + DOCUMENT + Q2] → HIT (DOCUMENT reused from cache)
Request 3: [SystemPrompt + DOCUMENT + Q3] → HIT
... 6 minutes later (cache expired) ...
Request N: [SystemPrompt + DOCUMENT + QN] → MISS again
```

### Cost Savings Example

```python
# 50,000 token document, queried 100 times per day
tokens = 50_000
queries = 100
regular_cost = tokens * queries / 1_000_000 * 3.00   # $15/day

# With caching
cache_write = tokens / 1_000_000 * 3.75               # $0.1875 (once per 5 min)
cache_reads = tokens * (queries - 12) / 1_000_000 * 0.30  # ~$1.32/day (rest are hits)
cached_cost = cache_write * 12 + cache_reads           # ~$3.54/day

print(f"Savings: ${regular_cost - cached_cost:.2f}/day")  # ~$11.46/day savings
```

### OpenAI Automatic Caching

OpenAI applies caching automatically — no explicit markup needed:

```python
# OpenAI caches the longest prefix of input tokens automatically
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[long_system + long_context + short_question]
)
# If the long prefix matches a cached request, it's a cache hit
print(response.usage.prompt_tokens_details.cached_tokens)  # Cache hit tokens
```

What is Cache Hit vs Cache Miss in AI input tokens?

Answer

Cache Hit vs Cache Miss in AI Input Tokens

Why Prompt Caching Exists

Anthropic Cache Pricing

Cache Hit vs Cache Miss

Implementing Prompt Caching (Anthropic)

Cache Lifetime

When Cache Hits Happen

Cost Savings Example

OpenAI Automatic Caching

Related Concepts

What is AI?

What are all the current types of AI?

What is Machine Learning (ML)?

What is Deep Learning in AI?

What is an LLM?

Token Type	Price (Claude 3.5 Sonnet)
Regular input	$3.00 / million tokens
Cache write	$3.75 / million tokens (25% more — stores in cache)
Cache read (hit)	$0.30 / million tokens (90% cheaper than regular!)

	Cache Miss	Cache Hit
What happens	Content processed fresh	Stored KV cache is reused
Processing	Full transformer computation	Skip recomputation
Cost	Full input token price	~10% of regular price
Latency	Higher (full processing)	Lower (cache lookup)
When	First request, cache expired	Subsequent requests within TTL