Skip to content

Free tool

Prompt Caching Savings Calculator

Prompt caching bills the repeated, static part of your prompt at a steep discount. See how much it cuts your monthly input-token cost — set your usage and your provider's cache-read discount to compare the bill with and without caching.

Input cost · no caching

$1200/mo

With caching

$444/mo

You save

$756/mo

63% off input

Caching only discounts the staticpart of a prompt that repeats across calls — the system prompt, few-shot examples, and any fixed long context. The dynamic part (the user's actual query) is billed in full, and output tokens are never cached. The bigger and more-reused your fixed prefix, the more you save — which is exactly why agents and RAG pipelines benefit most.

Sizing an agent's bill? Pair this with the AI Agent Cost Calculator.

Get the brief

Directional estimate. Real savings depend on cache hit-rate and TTL (caches expire, ~5 min on some providers) and a small one-time cache-WRITE surcharge on the first call — confirm the exact multipliers with your provider.

Frequently asked

How much does prompt caching save?

It depends on two things: how much of your prompt is static (repeated across calls) and how big your provider's cache-read discount is. Caching only discounts the reused prefix — the system prompt, few-shot examples, and any fixed long context — so if 70% of your prompt is static and reads are 90% cheaper, you cut roughly 63% off your input bill. Set your numbers above for an estimate. Output tokens are never cached and aren't affected.

What's the cache-read discount for Anthropic, OpenAI, and Gemini?

Ballparks (confirm current values with each provider, as they change): Anthropic cache reads are about 90% cheaper than base input (with a ~25% surcharge the first time the cache is written); OpenAI cached input is roughly 50% off (more on some models); Google Gemini offers context caching with its own per-model pricing. This tool lets you set the discount directly so it stays accurate for whichever provider and model you use.

What is prompt caching and when is it worth it?

Providers can store the processed form of a repeated prompt prefix so subsequent calls skip re-processing it, billing those tokens at a steep discount. It's worth it whenever you send the same large static context many times within the cache's lifetime — agents (which re-send a growing history every step), RAG systems with a fixed instruction block, chatbots with a big system prompt, or batch jobs over a shared document. If every prompt is unique, caching won't help.

Why doesn't caching discount the whole prompt?

Only the stable prefix can be cached. The model caches everything up to the first point your prompt changes, so the dynamic tail — the user's actual question, the current tool result — is always billed at full price, and you only benefit if the static part comes first and is reused. Structuring prompts with the fixed content at the top maximizes the cacheable share.

Are there downsides or limits?

Caches expire (a short TTL, around 5 minutes on some providers), so infrequent traffic may never hit a warm cache. There's also usually a small one-time write surcharge on the first call that populates the cache. Net, caching is a clear win for high-frequency, large-static-prefix workloads and roughly neutral for sparse, all-unique traffic — this calculator helps you tell which side you're on.

Newsletter

Liked the tool? Get the signal.

One weekly email on the AI + hardware that actually matters — from the people who build these calculators.

Free · unsubscribe anytime · no spam.