A D2C client (the skincare one — you'll see the case study on the site) was processing ~3,000 inbound leads a day through Claude. Each lead carried a ~4,000-token system prompt — brand voice, product catalog, response rules.
The math was simple and ugly:
3,000 leads/day × 4,000 input tokens × $3/M = $36/dayPlus output. Call it $40/day, $1,200/month. For a workflow that handled roughly a quarter of their daily volume.
What changed
One field. On the system prompt block:
{
"system": [{
"type": "text",
"text": SYSTEM_PROMPT,
"cache_control": { "type": "ephemeral" }
}],
"messages": [...]
}That's it. The first request of a 5-minute window pays the cache-write premium (1.25× input cost). Every subsequent request in that window pays about 0.1× of the cached portion.
For this client, with ~3,000 requests/day clustered into work hours, the math became:
~36 cache writes/day × 4,000 tokens × $3/M × 1.25 = ~$0.54
~2,964 cache reads/day × 4,000 tokens × $3/M × 0.1 = ~$3.56
≈ $4.10/day for the cached portionPlus the uncached per-lead message + output. New total: about $6/day. An 86% drop. No quality change. Four lines of JSON.
The catch
Prompt caching is a prefix match. Any byte change anywhere in the prefix invalidates the cache from that byte onward. The classic silent invalidators:
- A timestamp interpolated into the system prompt header (
"Current time: 2026-05-24T14:05:22Z"). New prefix every second. JSON.stringify(config)without sorted keys. Iteration order varies.- A per-request UUID added early in the prompt.
- The tool list quietly varying per user.
The fix: keep the early prefix stable. Put timestamps and per-user context
after the last cache_control breakpoint, or in a separate user message.
How to verify it's working
Read response.usage after each call:
| Field | What it means |
|---|---|
cache_creation_input_tokens | Tokens written this request (1.25× cost) |
cache_read_input_tokens | Tokens served from cache (0.1× cost) |
input_tokens | Tokens processed at full price |
If cache_read_input_tokens is zero across repeated requests with what you
believe is the same prompt, something is silently invalidating. Diff two
rendered prompts byte-by-byte. The bug will be there.
That's the play. See you next Sunday.
— Pavan