Issue #002

How prompt caching paid for itself in 4 days

A client's lead-qualification workflow was burning $40/day in Claude API calls. One config change took it to $6/day with no quality loss. Here's exactly what changed.

A D2C client (the skincare one — you'll see the case study on the site) was processing ~3,000 inbound leads a day through Claude. Each lead carried a ~4,000-token system prompt — brand voice, product catalog, response rules.

The math was simple and ugly:

3,000 leads/day × 4,000 input tokens × $3/M = $36/day

Plus output. Call it $40/day, $1,200/month. For a workflow that handled roughly a quarter of their daily volume.

What changed

One field. On the system prompt block:

{
  "system": [{
    "type": "text",
    "text": SYSTEM_PROMPT,
    "cache_control": { "type": "ephemeral" }
  }],
  "messages": [...]
}

That's it. The first request pays the cache-write premium (1.25× the base input price). Every later request that hits the cache within its 5-minute lifetime pays about 0.1× the base input price for the cached portion, and each hit resets the 5-minute clock.
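To get a feel for how many writes versus reads a given traffic pattern produces, here is a minimal sketch. It assumes every request shares one byte-identical prefix and that the 5-minute lifetime restarts on each write or hit. The function name and the sample timestamps are made up for illustration; they are not the client's data.

```python
def count_cache_events(request_times_s, ttl_s=300):
    """Count cache writes and reads for requests sharing one identical prefix.

    Assumption: the TTL restarts on every write and every hit.
    """
    writes = 0
    reads = 0
    expires_at = None  # no cache entry exists yet
    for t in sorted(request_times_s):
        if expires_at is not None and t < expires_at:
            reads += 1   # entry still alive: billed at the read rate
        else:
            writes += 1  # first request, or the entry expired: billed at the write rate
        expires_at = t + ttl_s
    return writes, reads


# Hypothetical arrival times in seconds: three close together, then two long gaps.
writes, reads = count_cache_events([0, 60, 120, 500, 1000])
print(writes, reads)
```

Steady traffic keeps the entry warm, so writes only happen after a gap longer than the lifetime. Bursty traffic with long idle stretches pays the write premium more often.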

For this client, with ~3,000 requests/day clustered into work hours, the math became:

~36 cache writes/day × 4,000 tokens × $3/M × 1.25 = ~$0.54
~2,964 cache reads/day × 4,000 tokens × $3/M × 0.1 = ~$3.56
≈ $4.10/day for the cached portion

Plus the uncached per-lead message + output. New total: about $6/day. An 85% drop from $40/day. No quality change. Four lines of JSON.
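If you want to run the same arithmetic on your own numbers, a small sketch like this works. The price and multipliers are the ones quoted in this issue; the function name and parameters are my own, and it only covers the shared prefix, not the per-lead message or output tokens.

```python
PRICE_PER_MTOK = 3.00  # base input price, USD per million tokens (figure used in this issue)
WRITE_MULT = 1.25      # cache-write multiplier
READ_MULT = 0.10       # cache-read multiplier


def prefix_cost(requests, prefix_tokens, cache_writes=None):
    """Daily USD cost of the shared prefix. cache_writes=None means caching is off."""
    per_request = prefix_tokens * PRICE_PER_MTOK / 1_000_000
    if cache_writes is None:
        return requests * per_request
    cache_reads = requests - cache_writes
    return per_request * (cache_writes * WRITE_MULT + cache_reads * READ_MULT)


uncached = prefix_cost(3000, 4000)
cached = prefix_cost(3000, 4000, cache_writes=36)
print(f"uncached ${uncached:.2f}/day, cached ${cached:.2f}/day")
```

With the inputs above it reproduces the $36/day and roughly $4.10/day figures for the prefix. Swap in your own request count, prefix size, and write count to see where you land.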

The catch

Prompt caching is a prefix match. Any byte change anywhere in the prefix invalidates the cache from that byte onward. The classic silent invalidators:

  • A timestamp interpolated into the system prompt header ("Current time: 2026-05-24T14:05:22Z"). New prefix every second.
  • JSON.stringify(config) without sorted keys. Key order follows insertion order, so the same config built along a different code path serializes differently.
  • A per-request UUID added early in the prompt.
  • The tool list quietly varying per user.

The fix: keep the early prefix stable. Put timestamps and per-user context after the last cache_control breakpoint, or in a separate user message.
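Here is one way that fix could look, sketched in Python (json.dumps with sort_keys=True standing in for a sorted JSON.stringify). The function names, the CONFIG layout, and the sample values are all hypothetical. The point is only the split: deterministic content in the cached prefix, volatile content in the user message.

```python
import json
from datetime import datetime, timezone


def render_system_prompt(brand_rules: str, config: dict) -> str:
    """Render the cached prefix deterministically: no clock, no IDs, sorted keys."""
    # sort_keys makes the output independent of how the dict was built.
    stable_config = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return f"{brand_rules}\n\nCONFIG:\n{stable_config}"


def render_user_message(lead_text: str, now: datetime) -> str:
    """Volatile values (timestamps, per-lead data) go here, after the cache breakpoint."""
    return f"Current time: {now.isoformat()}\n\nLead:\n{lead_text}"


# Same config, different insertion order: the rendered prefix is identical.
a = render_system_prompt("Be concise.", {"tone": "warm", "max_replies": 2})
b = render_system_prompt("Be concise.", {"max_replies": 2, "tone": "warm"})
print(a == b)

msg = render_user_message("Asked about pricing.", datetime.now(timezone.utc))
```

The system prompt text stays byte-identical across requests, while the timestamp changes freely in the user message without touching the cache.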

How to verify it's working

Read response.usage after each call:

Field                         What it means
cache_creation_input_tokens   Tokens written this request (1.25× cost)
cache_read_input_tokens       Tokens served from cache (0.1× cost)
input_tokens                  Tokens processed at full price

If cache_read_input_tokens is zero across repeated requests with what you believe is the same prompt, something is silently invalidating. Diff two rendered prompts byte-by-byte. The bug will be there.
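Both checks can be scripted. This sketch assumes you have already converted response.usage into a plain dict with the three fields above, and that you log the fully rendered prompt for each request. The helper names are mine, not part of any SDK.

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of one request's input tokens that were served from cache."""
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    fresh = usage.get("input_tokens") or 0
    total = read + written + fresh
    return read / total if total else 0.0


def first_diff(a: str, b: str):
    """Byte offset where two rendered prompts first differ, or None if identical."""
    ba, bb = a.encode("utf-8"), b.encode("utf-8")
    for i, (x, y) in enumerate(zip(ba, bb)):
        if x != y:
            return i
    # One prompt is a strict prefix of the other, or they are equal.
    return None if len(ba) == len(bb) else min(len(ba), len(bb))


# Illustrative values only.
ratio = cache_hit_ratio(
    {"cache_read_input_tokens": 4000, "cache_creation_input_tokens": 0, "input_tokens": 200}
)
offset = first_diff("Current time: 14:05:22", "Current time: 14:05:23")
print(round(ratio, 3), offset)
```

A ratio stuck at zero on repeat traffic means the prefix is changing. The offset from first_diff tells you where to look in the rendered prompt.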

That's the play. See you next Sunday.

— Pavan

The Automation Architect
