Let me ask this straight - model providers charge for
Input tokens - understandable, because LLMs have to process this information, pass it through their GPUs and so on
Output tokens - understandable, because generating them uses GPUs too
Cache write - why?
Cache read - why?
In the case of caching: why are we being charged for cache reads and writes when we have no control over them? That's a game between the model provider and the inference host, so I'm not sure why those costs are being passed down to the consumer.
Caching in every application until today has come built-in: your client asks you to make your product fast, so you use a cache. You don't pass that cost on to your client; a cache is just RAM being used on the servers. Are we really paying for writing and reading stuff from RAM? That does not make any sense.
What's next, charging us for bandwidth? Can someone explain this, please? I could not find anything online about why we are being charged for LLM caching when it's supposed to be part of the offering.
I'm not an expert, but I think when working directly with some models (not through a third party like Cursor), you can enable or disable caching, which, if enabled, incurs a charge. With Cursor, it seems like they automatically enable it or let the model decide when to cache. As for the general question of why caching costs anything: I assume because it uses additional resources and is optional.
Why caching exists as a separate cost: if there were no caching, you would pay the higher input fees on the full context every single time you submit a prompt. With caching enabled, multiple prompts within a timeframe don't need to reprocess the same large context (e.g. large text files) from scratch with each prompt. So the cost of writing and reading the cache is generally cheaper than paying the full input rate with each prompt.
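To see why that trade-off works out, here's a tiny sketch of the arithmetic, using hypothetical per-1M-token rates (illustration only, not any provider's actual price list):

```python
# Hypothetical rates in dollars per 1M tokens -- illustration only.
INPUT, CACHE_WRITE, CACHE_READ = 5.00, 6.25, 0.50

def reuse_cost(steps, context_m=1.0):
    """Cost of the reused context alone, with vs. without caching."""
    no_cache = steps * context_m * INPUT
    cached = context_m * CACHE_WRITE + (steps - 1) * context_m * CACHE_READ
    return no_cache, cached

# A one-off prompt actually loses: the cache write costs more than plain input.
print(reuse_cost(1))  # (5.0, 6.25)
# But by the second reuse, caching is already ahead.
print(reuse_cost(2))  # (10.0, 6.75)
```

In other words, caching only pays off when the same context really is reused, which is exactly why it shows up as its own line item.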
When sending prompts in Cursor, you can see the context grow. I believe this is similar to what gets cached, giving you an idea of how cache-intensive a chain of requests is.
The simple answer: model providers charge for cache. I think you have a misunderstanding of what caching means in this context.
A number of model providers give requests the option to utilize a cache when possible. You are not being "charged for cache" so much as being charged a reduced rate to process your request; Cursor is simply passing along the savings from these APIs. I don't know the exact pricing formula for cache reads and writes, but at the very least think about it in simpler terms: you have a text-based payload, and it gets tokenized and processed. If you send a number of requests within some time span that share the same context, the provider can skip redoing that work. I know there is more going on behind the scenes, but that might be a way to think about it.
I don’t think this needs the soapbox you have put yourself on.
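To make the "skip that step" idea concrete, here is a loose analogy in Python. All the names here are invented for illustration; real providers cache the model's internal attention state keyed on the exact token prefix, not word lists:

```python
import hashlib

_cache = {}

def tokenize(text):
    # Stand-in for a real tokenizer, which is far more sophisticated.
    return text.split()

def encode_with_cache(context, new_message):
    # Key the cache on the exact shared context, like a prefix match.
    key = hashlib.sha256(context.encode()).hexdigest()
    if key in _cache:
        prefix, hit = _cache[key], True         # "cache read": cheap reuse
    else:
        prefix, hit = tokenize(context), False  # full work, then "cache write"
        _cache[key] = prefix
    return prefix + tokenize(new_message), hit

ctx = "one large shared project context"
_, hit1 = encode_with_cache(ctx, "first question")
_, hit2 = encode_with_cache(ctx, "follow-up question")
print(hit1, hit2)  # False True
```

The first request pays to populate the cache; every later request with the identical context skips the expensive part, which is roughly what the write/read price split reflects.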
Good explanation. However, when you said "You are not being 'charged for cache' but rather you are being charged a reduced rate to process your request", that may not be true. The cache write is an extra charge up front, even if the cache is never used. But the cost of cache writes is usually offset by smaller input costs on later prompts, as you described.
OP is mainly misunderstanding what the cache is used for in this context, and that it is technically an optional feature, which is why it appears as a separate fee passed through from the model providers.
Each chat request, each tool call & each follow‑up message is its own API call to the model, and for every one of those calls we have to send the full chat context so the agent has everything it needs. We cannot just send “only the diff.”
Prompt caching does not change what we send, it changes how the reused part is billed:
The first time you send a big chunk of context, those tokens are billed as a cache write (slightly more than normal input).
When you reuse the exact same context on later calls, those tokens are billed as cache reads (much cheaper than normal input).
Anything new you add on each step (your latest message, new tool results, etc.) is still billed as normal input.
Output tokens are always billed as normal output, regardless of caching.
Using Opus 4.6 numbers per 1M tokens (standard context tier):
Input: $5.00
Output: $25.00
Cache write: $6.25
Cache read: $0.50
Simple example (including output)
Say you have a 1M‑token context that you reuse across 20 steps (tool calls + replies), and each response is about 100k (0.1M) of output:
Without caching: 20 × (1M × $5.00 input + 0.1M × $25.00 output) = 20 × $7.50 = $150.00. With caching: 1M × $6.25 for the one cache write, plus 19 × 1M × $0.50 in cache reads, plus 20 × 0.1M × $25.00 in output = $6.25 + $9.50 + $50.00 = $65.75.
So for these 20 steps you go from $150 without caching down to $65.75 with caching. Most of that saving comes from the 1M context tokens switching from $5.00 per 1M (input) to $0.50 per 1M (cache reads) on every step after the first. That is why cache reads and writes appear as separate, chargeable items: they are different price classes for the same context tokens, which still have to be sent on every request.
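Checking that arithmetic with the quoted per-1M rates (a sketch of the billing model, not the provider's exact formula):

```python
INPUT, OUTPUT = 5.00, 25.00        # $ per 1M tokens
CACHE_WRITE, CACHE_READ = 6.25, 0.50

steps, context_m, output_m = 20, 1.0, 0.1  # 20 steps, 1M context, 0.1M output each

# Without caching: the full context is billed as input on every step.
no_cache = steps * (context_m * INPUT + output_m * OUTPUT)

# With caching: one cache write, then 19 cache reads; output is unchanged.
with_cache = (context_m * CACHE_WRITE
              + (steps - 1) * context_m * CACHE_READ
              + steps * output_m * OUTPUT)

print(round(no_cache, 2), round(with_cache, 2))  # 150.0 65.75
```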
Overall, caching reduces processing cost by up to 90% while giving the same output quality and the same full context per request as processing all tokens as regular input would.
This is a feature we use on your behalf to reduce token cost within each chat thread, for all providers that support caching. Note that some providers fold cache writes into input tokens and only charge separately for cache reads, while others, like Anthropic, bill cache writes separately. In all cases we show the precise tokens consumed in the Usage report, as reported back by the AI provider with each response.