Why does Cursor consume an absurd amount of cache read tokens?

Looking only at the model output pricing is misleading. I believe Cursor actually makes money on cache reads. Even for a very small change in a single file, with the right context and explanation in the prompt, I’m seeing 178,304 cache read tokens. That’s insane.

Why does this happen?

For every new model available in Cursor, the first thing I look at in the pricing is the cache read cost, not the output.

5 Likes

Hello! Thanks for your contribution. This is an excellent question that many Cursor users are experiencing. Let me explain why this happens:

Why Does Cursor Consume So Many Cache Read Tokens?
The high consumption of cache read tokens in Cursor is normal system behavior, although it may seem excessive. The 178,304 cache read tokens mentioned by the user are typical even for small changes in a single file.

How the Cache System Works
When you work in Cursor, the process works like this:

Cache Write: On the first request, Cursor sends all the context (files, system rules, prompts) to the AI provider, which processes and stores it in cache

Cache Read: On each subsequent interaction, the complete context is reused from cache, but is counted as cache read tokens

Accumulation: With each tool called and each follow-up response, the context grows and cache read tokens accumulate

Why the Numbers Are So High
Cache read tokens can represent 84-99% of total token usage. This happens because:

Cursor resends the entire complete conversation (including previous outputs) on each interaction

Each tool call and each follow-up request sends the complete chat history as input

The model needs to process all the context to generate responses, even though it’s cached
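Those three points compound. As a rough sketch (assuming, purely for illustration, that each turn adds about 5k tokens of new content and the entire prior history is re-read from cache every turn), cumulative cache reads grow quadratically with the number of turns:

```python
# Sketch of how resending the full history each turn compounds.
# Assumes (simplistically) that each turn adds a fixed 5k tokens of new
# content and that the entire prior history is re-read from cache.

turn_growth = 5_000   # new tokens added per turn (hypothetical)
history = 0
total_cache_reads = 0

for turn in range(1, 21):
    total_cache_reads += history  # prior history re-read from cache
    history += turn_growth        # this turn's content joins the history

print(f"history after 20 turns:  {history:,} tokens")        # 100,000
print(f"cumulative cache reads:  {total_cache_reads:,} tokens")  # 950,000
```

Under even these modest assumptions, a 20-turn chat racks up nearly a million cache read tokens, which is why long threads look so expensive on the dashboard.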

The Silver Lining: Cost Savings
Although the numbers seem alarming, cache read tokens are 10 times cheaper than normal input tokens:

Anthropic charges 10% of the input token price for cache reads, and Gemini charges 25%. Without the cache, you would pay 4-10 times more for those same tokens.
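To put numbers on that discount (the prices here are illustrative, not a quote of any provider's current rate card), here is what the 178,304 cache read tokens from the original post would cost as cache reads versus as fresh input:

```python
# Illustrative arithmetic using the discount rates mentioned above
# (Anthropic-style: cache reads at ~10% of the fresh-input price).
# The $3.00/MTok input price is a hypothetical example.

def cost_usd(tokens: int, price_per_mtok: float) -> float:
    """Cost in USD for a token count at a given $/MTok price."""
    return tokens / 1_000_000 * price_per_mtok

input_price = 3.00                      # hypothetical $/MTok for fresh input
cache_read_price = input_price * 0.10   # 10% discount for cache reads
tokens = 178_304                        # the cache read count from the post

with_cache = cost_usd(tokens, cache_read_price)
without_cache = cost_usd(tokens, input_price)

print(f"as cache reads: ${with_cache:.4f}")     # $0.0535
print(f"as fresh input: ${without_cache:.4f}")  # $0.5349
print(f"savings factor: {without_cache / with_cache:.0f}x")  # 10x
```

So the scary-looking token count translates to pennies when billed at the cache read rate.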

How to Reduce Cache Read Consumption
To optimize your token usage:

- Keep each chat focused on a single task

- Start a new chat for each new task (avoid long threads)

- Attach only necessary files to context

- Use simpler models for simple tasks

- Keep your rules short and focused

- Disable MCP tools you don’t need

It’s important to understand that, although the absolute numbers are high, this cache system is designed to reduce costs, not increase them. Without it, you would be paying full price for every token on every interaction.

3 Likes

Nah, I don’t buy this explanation.

This is kind of recent behavior from Cursor. That’s why many users are complaining. Months ago, Cursor was able to grep only the relevant parts of the code to send to the AI providers. Now it sends a huge amount of useless information and burns through all the credits in a matter of days.

They also changed the “free usage” to Auto/Composer, and then released a new version of Composer that’s twice as expensive.

Huge disappointment.

10 Likes

I also don’t think that’s the reality. I’ve always done the listed things to keep a clean context window, so the LLM can understand the actual task it’s supposed to do. Basically, I select specific lines and refer to them in the chat; I don’t have rules, MCPs, or any additional context. Only the codebase references and the question I’m asking, but it’s still consuming 2-3 million tokens per question in Agent and Debug mode, whereas that wasn’t the case before. I’ve been using Cursor both at my job and for personal use, and there’s no way my use of the context window got so bad that my token consumption would skyrocket to millions of tokens, even with Haiku 4.5 in non-thinking mode. I really enjoyed using Cursor, but I’m also disappointed with the current change, which is making me reconsider using it.

3 Likes

You can “not buy” Andrés’s response, but it’s accurate regarding the mechanism. There could certainly have been changes to the way Cursor manages context that have ballooned the cache, but take refuge in the fact that at least it’s cached and not reprocessed from scratch on every request.

Once upon a time I felt like context was being consumed ridiculously fast and just so happened to come across praise for the pi agent and reading the writings of both the author and some of the author’s friends I realized so much about what is wrong with current agent development is that the users have no idea what the agents are doing behind the scenes and under the hood. We cannot modify Cursor’s system prompt and we cannot modify its behavior around skills, the handling of AGENTS.md or any of the rest of it… We’re beholden to trusting that the core agent software is working to our benefit even as so many more apparent features break on the regular.

I use Cursor as “super autocorrect” these days. It’s great at that (leaving aside that I need to reconfigure the UI every time it updates wtf). I trust it to make simple (if sometimes wide-ranging) changes and I leave the real development agent-ing to pi where I can easily have it spin up tmux to drive multi-agent coordination in full view of my terminal with a record of what happened and with the ability to stop or redirect sub-agents when they go off the rails. This is simply not possible in an “integrated” development environment because you are left to hope that the “integrators” are integrating the things you want and also exposing the things that you want.

I’m also experiencing HUGE increases in read cache usage since 2/3 days ago

3 Likes

I’m also experiencing this issue:

The main disconnect for me is that I don’t have MAX mode enabled, so I should be limited to the context window for the given model, which for Sonnet is 200k tokens. Yet here is the context window usage within the Cursor UI for that top chat:

(screenshot: context window usage shown in the Cursor UI for that chat)

So only 76.8k tokens used. I understand that Cache Write doesn’t need to count toward the chat’s context window; it’s being stored for potential future use. Input and Output clearly count toward the context window, and I guess some portion of Cache Read. However, the total Cache Read is significantly larger than the listed used context or even the maximum possible context window for that model. The chat is a single prompt, focused on one task, and it never summarizes to reset the context window usage during the course of the output. So where is all that Cache Read being used?

I completely understand your perception, as it’s the same one I’ve had for a long time. I even opened a thread here on the forum questioning Cursor consumption in relation to conventional models, especially Opus without MAX active.

The impression is that usage is somehow increasing, which doesn’t justify spending 1 million tokens for a simple request.

I hadn’t stopped to analyze this cache issue, but I was shocked by the consumption of a simple request on Cursor and how much cache it used. Now comes the big question: if the cache aims to reduce token usage by up to 10x, how is the actual usage so high?

If 4.7 million tokens were consumed in cache, how much was spent without the cache (actual usage)? It makes no sense to say that the cache reduces usage by up to 10x and a simple request consumes 4.7 million tokens.

Furthermore, in recent months I’ve noticed very rapid use of the plan’s “allowance” (Pro+). Basically, I’m using up all my data allowance in a week. Just yesterday, it was already at 55%, and my plan expired on the 14th, which leads me to believe that Cursor might be charging me incorrectly.

Unfortunately, I gave up discussing this with support, since, besides being slow and terrible, they simply said everything is normal.

But if it’s normal, how come so many people are complaining about practically the same things?

Paying $60 only to have my account limits exceeded within a week makes paying for API usage directly from the AI provider the easier option.

2 Likes

My $50 gone in just 2 days! Wow!! Thanks, I’m shifting from Cursor to another agent. This is pure robbery.

@deanrie @Colin

Could we get some answers from Cursor regarding this issue?

@dbdbdb @Cirano_Eusebi @Batuhan_Gunduz @Leland_Hepworth

Token Usage Appears Higher Than Expected (But Costs Less)

AI usage dashboards often display token counts that appear inflated compared to user expectations. This occurs because most users assume billing is limited to input tokens (what is sent to the model) and output tokens (what the model generates).

However, modern AI systems track all token types in usage statistics—including cache write tokens (storing context) and cache read tokens (retrieving cached context). When a report shows “243 million tokens used,” this figure represents the sum of all token types, not only the visible messages.

How Cache Tokens Reduce Costs

Without caching, each turn would require reprocessing the entire conversation history as fresh input tokens. Caching stores context so subsequent turns process only new information, retrieving cached content at a significantly reduced price.

Token Type Pricing Structure

Claude and similar models apply different rates to different token types. The official pricing structure is as follows:

Source: Claude Platform Pricing

Price Differentials Across Token Types

Cache hits cost $0.50/MTok compared to $5-15/MTok for base input tokens—a 10-30x price reduction. This differential explains why effective costs per token fall below the headline input/output prices. Large token counts in usage reports typically include substantial cache read operations, which lower the average cost.

Real-World Usage Data

The following table presents actual average costs per million tokens across various models, based on billing statements:

*user tokens = used tokens

Analysis: Claude Opus 4.6 at 243M Tokens

The Claude Opus 4.6 entry shows 243.2 million tokens billed at $299.00. A straightforward calculation using only the listed output price (243.2M × $25/MTok) would yield $6,080. The actual cost of $299 represents approximately 4.9% of that calculated amount.

This discrepancy exists because the 243.2 million tokens consist of multiple types with varying prices:

  • Input tokens ($5/MTok)

  • Output tokens ($25/MTok)

  • Cache writes ($6.25-10/MTok)

  • Cache reads ($0.50/MTok) — typically the largest component

Cache reads constitute the majority of tokens in extended conversations and cost 50× less than output tokens. The blended average therefore reduces to $1.23 per million tokens rather than $25.
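A quick back-of-the-envelope check of those figures (the exact split between token types isn’t published, so only the two totals below come from the billing statement):

```python
# Back-of-the-envelope check of the Claude Opus 4.6 figures above.
# Only total_tokens and actual_cost come from the billing statement;
# the per-type breakdown is not published.

total_tokens = 243_200_000
actual_cost = 299.00

# Naive worst case: everything billed at the $25/MTok output price.
naive_cost = total_tokens / 1e6 * 25
print(f"naive output-price cost: ${naive_cost:,.0f}")           # $6,080
print(f"actual as % of naive:    {actual_cost / naive_cost:.1%}")  # 4.9%

# Blended average actually paid per million tokens:
blended = actual_cost / (total_tokens / 1e6)
print(f"blended rate:            ${blended:.2f}/MTok")           # $1.23
```

The blended rate of $1.23/MTok sits far below the $5 input and $25 output headline prices, which is only possible if the bulk of those 243M tokens were billed at the $0.50/MTok cache read rate.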

Summary

  1. Large token counts include cache operations at substantially lower rates than input/output tokens.

  2. Effective rates fall below headline prices. Cache reads cost 10-50× less than standard tokens.

  3. Caching reduces total expenditure by eliminating the need to reprocess entire conversation histories on each turn.

  4. Large samples provide reliable cost indicators. Token volumes exceeding 100M produce stable average costs that reflect real-world billing patterns.

The 243 million token figure does not indicate 243 million output tokens at full price. It represents a weighted combination where the majority of operations occur at significantly reduced rates.


Data Limitations

The cost per million tokens presented in the usage table may be slightly understated for certain models. During the early implementation of sub-agents in the Cursor IDE, sub-agent costs were aggregated under the main orchestrator agent’s billing. Consequently, models such as Composer 1.5 and Claude Opus 4.6 include expenses from less expensive models within their total cost, which were then divided by the full token count—producing a lower average than would result from isolated billing.

Actual costs for these two models may be approximately 30-40% higher than the figures shown. This adjustment does not alter the fundamental conclusion: effective token costs remain substantially below the listed input/output prices due to the high proportion of cache operations.

Crazy cache read numbers, but crazier that there’s no way to see what’s actually being sent. Is it pulling in entire project context every time or just the conversation history? Without visibility into the payload you can’t even tell if something’s misconfigured on your end vs just how it works now.

1 Like

Nice explanation (for real), @gabriel-filincowsky .
but I believe you missed this: it’s worth noting that cache read tokens cost 10x less, but Cursor is spending 10x more on cache reads of unnecessary cached data.

I truly believe that, some months ago, Cursor started sending much more irrelevant data to the LLM providers instead of using a better grep strategy.

A simple task spends 1-2M cache read tokens. This is just unacceptable. That’s why I believe Cursor is sending irrelevant/non-related code to the LLM before anything else, for every **new** chat

1 Like

@gabriel-filincowsky

Did you read our responses?

These are new chats with a single prompt that we’re talking about, not some longstanding chat that has been prompted several times. So there is no conversation history in this case. Maybe some project files are being included in the Cache read. This would be a good thing; like you said, it’s cheaper than re-reading those files at the regular token rate. But the number of Cache read tokens being listed is still far too high to account for that.

We understand that Cache tokens are cheaper than regular tokens; that’s not what we’re asking about here. We’re saying that the number of Cache read tokens being used seems significantly larger than what it should be. There’s also still the disconnect of how millions of cache read tokens can fit inside a 200k context window size, and if they’re not part of the chat’s context window, where are they actually being used?

1 Like

Hey all!

This is effectively just going to rephrase @Andres_Cardona’s answer, but I hope it’s helpful. It might sound like it’s talking to beginners, but I want to make sure this is approachable for anybody reading!

When you hover over Tokens on your usage page, the number you see is the aggregate across every LLM call that contributed to that request, not a single call.

A single message in Cursor can (and typically does) trigger multiple LLM calls under the hood. The agent may read files, invoke tools, apply edits, or reason through a plan, and each of those steps constitutes a separate call. All of them are rolled up into a single row on the dashboard, and aggregation stops only when the request is complete (when you can type a new message).

To illustrate: suppose your first message sends 20k tokens of context, and overall it requires 10 LLM requests to finish. You’d see 20k input tokens and roughly 180k cached tokens, because each subsequent request reuses the same prefix the provider already has cached. Those cached tokens also carry forward to the next message within the same conversation.

This is also why you might see a total token count that exceeds the model’s context window. It isn’t one enormous call, but the sum of all calls made during that turn.
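That example can be reproduced with a toy model: the first call pays for the 20k-token prefix as fresh input, and each of the remaining nine calls re-reads the same prefix from cache (tool outputs and model output tokens are ignored here for clarity):

```python
# Toy model of the example above: one user message with a 20k-token
# context, and 10 LLM calls before the turn completes. Each call resends
# the full prefix; everything the provider has already seen is billed as
# a cache read. (Ignores tool outputs and output tokens for simplicity.)

context = 20_000
num_calls = 10

fresh_input = 0
cache_read = 0
for call in range(num_calls):
    if call == 0:
        fresh_input += context   # first call: nothing cached yet
    else:
        cache_read += context    # later calls: same prefix, now cached

print(f"input tokens:      {fresh_input:,}")  # 20,000
print(f"cache read tokens: {cache_read:,}")   # 180,000
```

The 200k total comfortably exceeds what a naive reader would expect from a single 20k prompt, yet no individual call ever came close to the context window limit.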

If you’re curious about what’s consuming your tokens, you can ask the agent directly: What’s in your context window right now? Be exhaustive.

We’re always looking to make the context window more efficient!

3 Likes

Just to call out another example, I have a repo open where the tool definitions, system prompt, and other information (rules, skills I’ve defined) take up ~47.5 k of context. No files from my repo are included in this starter context.

I just sent “hi”, nothing else. But because I’d been working in other chats in the same repo, the provider’s cache already had most of that prefix, so it shows up as 47,499 cache read tokens and only 171 input tokens. The cache is doing exactly what it should: avoiding re-processing tokens the provider has already seen.

Imagine I submit this prompt:

read files and then decide the next file to look at. Do this 10 times, and make sure you think in between.

No surprise, huge cache read on this session, which took 13 requests and eventually opened a file with ~17k tokens (which was added to the next requests as cached tokens)

One factor that may contribute to the perception of higher cache token usage is that models and our agent harness have improved at sustained, multi-step work. A single message now often triggers 10+ LLM calls autonomously, rather than 3-4. The total work (and tokens) is similar to what multiple shorter turns would have consumed, just rolled up into one line item.

1 Like

Hey Colin,

Thank you for taking the time to explain it in more detail.

I would like to suggest that, instead of showing the total aggregate by default, the reporting show only the aggregate of the input and output tokens. When hovering over the value, we would still see the cache read and write.

I understand this is another request increasing the pile, but it is a small thing that helps the company better align with the client’s expectations.

Thanks!

Thanks for your response, that really helps clear things up. On re-reading the response by @Andres_Cardona, there was evidence of this in the mention of “tool calls,” but I missed that under all the talk about replies to the same chat session and avoiding long threads, which I already knew about. Your response was more targeted at my actual misunderstanding.

Here’s an example of an issue that increases the number of command calls and therefore the amount of Cache read used:

Right when the agent receives output from a command, it starts writing that it’s still waiting on output from the command, even though its writing is blocked until the command finishes, so the command output should already be available. So it executes a sleep command, waiting for the output that it already has and can see once the sleep command finishes. This wastes time, but more importantly, it wastes cache read tokens with each erroneous sleep command executed. It’s an intermittent issue, but a costly one when it does happen.

Hey Leland,

Sorry, I didn’t notice that was the first message.

Keep in mind that a lot of stuff is sent in the first message: anywhere between 25k and 35k tokens, even if you just send “ping”. This extra stuff includes things like:

  • System Prompt
  • Tool Definitions
  • Rules
  • Agents
  • MCPs
  • etc.

Unfortunately, this is a current limitation on how LLMs work.




On the MCP side of things, users in 2.5 should notice a drop in tokens consumed by default due to Dynamic Context Discovery!

1 Like