Concept #118 | Medium | extended-ai-concepts

What is Cache Hit vs Cache Miss in AI input tokens?

#gen-ai #tokens #mlops

Answer

Cache Hit vs Cache Miss in AI Input Tokens

Prompt caching lets an AI provider store the processed form of a repeated input prefix, so later requests that begin with the same tokens are billed at a discount instead of full price. A cache hit means the cached prefix was reused; a cache miss means the input had to be processed from scratch.

Why Prompt Caching Exists

Many AI applications send the same content repeatedly:

  • A long system prompt that never changes
  • A large document being analyzed across many queries
  • Library documentation used in every coding session

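Without caching, you pay the full input price for that content on every request. With caching, you pay to process it once, and later requests that hit the cache pay roughly 90% less for the cached tokens (at the Anthropic rates below).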

Anthropic Cache Pricing

| Token Type | Price (Claude 3.5 Sonnet) |
|---|---|
| Regular input | $3.00 / million tokens |
| Cache write | $3.75 / million tokens (25% more; stores the prefix in the cache) |
| Cache read (hit) | $0.30 / million tokens (90% cheaper than regular) |
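
To make the price table concrete, here is a minimal sketch that turns token counts into an input-side cost per request. The prices are copied from the table above; the helper name `request_input_cost` and the 20-token question are illustrative assumptions, not part of any SDK.

```python
# Prices in USD per million tokens, copied from the table above (Claude 3.5 Sonnet).
PRICE_PER_MTOK = {
    "input": 3.00,        # regular (uncached) input
    "cache_write": 3.75,  # tokens written to the cache
    "cache_read": 0.30,   # tokens read from the cache
}


def request_input_cost(input_tokens: int, cache_write_tokens: int = 0,
                       cache_read_tokens: int = 0) -> float:
    """Return the input-side cost in USD for one request (hypothetical helper)."""
    return (
        input_tokens * PRICE_PER_MTOK["input"]
        + cache_write_tokens * PRICE_PER_MTOK["cache_write"]
        + cache_read_tokens * PRICE_PER_MTOK["cache_read"]
    ) / 1_000_000


# Assumed workload: a 50,000-token document plus a 20-token question.
miss_cost = request_input_cost(20, cache_write_tokens=50_000)  # first request
hit_cost = request_input_cost(20, cache_read_tokens=50_000)    # later request
print(f"miss: ${miss_cost:.5f}, hit: ${hit_cost:.5f}")
```

Under these assumptions the first request costs slightly more than an uncached one, and every later hit costs a small fraction of it.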

Cache Hit vs Cache Miss

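| | Cache Miss | Cache Hit |
|---|---|---|
| What happens | Content processed fresh | Stored KV cache is reused |
| Processing | Full transformer computation | Recomputation of the cached prefix is skipped |
| Cost | Full input token price (plus the 25% write premium on Anthropic) | ~10% of regular price for the cached tokens |
| Latency | Higher (full processing) | Lower (cache lookup) |
| When | First request, or after the cache expired | Subsequent requests within the TTL |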

Implementing Prompt Caching (Anthropic)

python
from pathlib import Path

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Large document that stays the same across queries
LARGE_DOCUMENT = Path("company_policies.txt").read_text(encoding="utf-8")  # ~50,000 tokens

def query_with_cache(question: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "Answer questions about company policies accurately."
            },
            {
                "type": "text",
                "text": LARGE_DOCUMENT,
                # Everything up to and including this block is cached
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[{"role": "user", "content": question}]
    )

    # Check cache status in response
    usage = response.usage
    print(f"Input tokens: {usage.input_tokens}")
    print(f"Cache read tokens: {usage.cache_read_input_tokens}")    # Cache HIT
    print(f"Cache write tokens: {usage.cache_creation_input_tokens}") # Cache MISS (first time)

    return response.content[0].text

# First call: cache MISS — writes to cache (costs slightly more)
query_with_cache("What is the vacation policy?")

# Second call: cache HIT — reads from cache (90% cheaper)
query_with_cache("How do I file an expense report?")
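
The two usage counters printed above are enough to tell a hit from a miss. The following is a minimal sketch of that check; `cache_status` is a hypothetical helper, and the "partial hit" case (some tokens read, some newly written) is an assumption about requests that extend an already cached prefix.

```python
def cache_status(cache_read_tokens: int, cache_write_tokens: int) -> str:
    """Classify one response from its usage counters (hypothetical helper)."""
    if cache_read_tokens > 0 and cache_write_tokens > 0:
        return "partial hit"   # part of the prefix reused, a new part written
    if cache_read_tokens > 0:
        return "hit"
    if cache_write_tokens > 0:
        return "miss"          # nothing reused; the prefix was written to cache
    return "not cached"        # e.g. prompt too short or no cache_control set


print(cache_status(cache_read_tokens=0, cache_write_tokens=50_000))
print(cache_status(cache_read_tokens=50_000, cache_write_tokens=0))
```

With a real response you would pass something like `usage.cache_read_input_tokens or 0` and `usage.cache_creation_input_tokens or 0`, since the counters may be absent on some responses.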

Cache Lifetime

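| Provider | Cache TTL |
|---|---|
| Anthropic | 5 minutes by default (ephemeral), refreshed each time the cache is read |
| OpenAI | Typically 5 to 10 minutes of inactivity, at most 1 hour (automatic) |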

After the TTL expires, the next request is a cache miss and rewrites the cache.

When Cache Hits Happen

text
Request 1: [SystemPrompt + DOCUMENT + Q1] → MISS (writes cache for DOCUMENT)
Request 2: [SystemPrompt + DOCUMENT + Q2] → HIT (DOCUMENT reused from cache)
Request 3: [SystemPrompt + DOCUMENT + Q3] → HIT
... 6 minutes pass with no requests (cache expired) ...
Request N: [SystemPrompt + DOCUMENT + QN] → MISS again
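
The timeline above can be reproduced with a small simulation. This is a sketch under stated assumptions, not provider code: it models a single cached prefix with a 5-minute TTL and assumes that both a write and a read restart the TTL window. The function name `simulate` is illustrative.

```python
TTL_SECONDS = 5 * 60  # the 5-minute ephemeral TTL from the table above


def simulate(request_times: list[float], ttl: float = TTL_SECONDS) -> list[str]:
    """Label each request time (in seconds) as a cache HIT or MISS."""
    results = []
    expires_at = None  # no cache entry yet
    for t in sorted(request_times):
        if expires_at is not None and t < expires_at:
            results.append("HIT")
        else:
            results.append("MISS")  # first request, or entry expired: rewrite cache
        # Assumption: both a write and a read restart the TTL window.
        expires_at = t + ttl
    return results


# Three quick requests, then one 6 minutes after the last hit.
print(simulate([0, 30, 60, 60 + 6 * 60]))
```

A steady stream of requests spaced less than one TTL apart keeps the entry alive indefinitely in this model; only a gap longer than the TTL forces a new write.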

Cost Savings Example

python
# 50,000-token document, queried 100 times per day
tokens = 50_000
queries = 100
regular_cost = tokens * queries / 1_000_000 * 3.00   # $15.00/day

# With caching. Assumption: 12 of the 100 queries arrive after the cache
# has expired and pay the write price; the other 88 are hits.
cache_writes = 12
cache_write = tokens / 1_000_000 * 3.75                             # $0.1875 per write
cache_reads = tokens * (queries - cache_writes) / 1_000_000 * 0.30  # $1.32/day
cached_cost = cache_write * cache_writes + cache_reads              # $3.57/day

print(f"Savings: ${regular_cost - cached_cost:.2f}/day")  # $11.43/day savings
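
Because a cache write costs more than a regular read, caching only pays off when the prefix is reused. The sketch below estimates the break-even point using the price ratios from the Anthropic table ($3.75 / $3.00 and $0.30 / $3.00); it assumes one write followed only by hits, and `relative_cost` is an illustrative name.

```python
WRITE_MULTIPLIER = 1.25  # cache write price relative to regular input
READ_MULTIPLIER = 0.10   # cache read price relative to regular input


def relative_cost(requests: int) -> float:
    """Cost of `requests` calls with caching, relative to no caching,
    assuming one cache write followed by hits only."""
    cached = WRITE_MULTIPLIER + READ_MULTIPLIER * (requests - 1)
    return cached / requests


for n in (1, 2, 10):
    print(n, round(relative_cost(n), 3))
```

A value above 1.0 means caching costs more than not caching. Under these assumptions a single request is 25% more expensive, while the second request within the TTL already brings the total below the uncached cost.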

OpenAI Automatic Caching

OpenAI applies caching automatically to prompts of 1,024 tokens or more; no explicit markup is needed:

python
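from openai import OpenAI

client = OpenAI()

# Static content goes first: OpenAI caches the longest matching prompt prefix.
# long_system, long_context and short_question are strings defined elsewhere.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": long_system + long_context},
        {"role": "user", "content": short_question},
    ],
)
# Prompt tokens served from cache (0 on a miss)
print(response.usage.prompt_tokens_details.cached_tokens)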
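
Since OpenAI reports cached tokens as a count, a hit ratio is often easier to monitor than the raw number. This is a minimal sketch; `cache_hit_ratio` is a hypothetical helper and the token counts in the example are made up for illustration.

```python
def cache_hit_ratio(prompt_tokens: int, cached_tokens: int) -> float:
    """Fraction of prompt tokens served from cache (0.0 on a full miss)."""
    if prompt_tokens <= 0:
        return 0.0
    return cached_tokens / prompt_tokens


# Illustrative numbers: a 2,048-token prompt with 1,920 tokens read from cache.
print(f"{cache_hit_ratio(2048, 1920):.1%}")
```

With a real response, the inputs would be `response.usage.prompt_tokens` and `response.usage.prompt_tokens_details.cached_tokens`.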