Discussion on improving token consumption transparency and the "saving mode"

A few days ago, I posted that I was planning to leave (the post has since been hidden). After that, I tried other AI editing tools and found that they can also do a good job. However, as a long-time user who has been with Cursor since its inception, I don't want to see an excellent tool ruined. Below, I'd like to discuss a few issues that I think are important.

1. Fairness
We can see on the forums that many people claim to have quickly exhausted their quota, and some have even listed the usage amounts that Cursor reported. Here I noticed an interesting phenomenon: the amount at which each person hit the quota is different. One user claimed to have hit the limit after what Cursor reported as $86 worth of tokens; the development team once pointed out that another user had actually used up to $150 worth of tokens; and I hit my limit at what Cursor reported as $64. Why do these discrepancies exist?

2. Token Consumption Transparency
Currently, we can only see the number of tokens consumed in the console, but that number often differs greatly from the user's perception. I don't believe the development team intentionally inflated the count, and I'm willing to accept their statement that there are no bugs in the counting. So the question becomes: why is token consumption so unexpectedly high?

From forum posts, I can see that many users have experienced incomprehensibly large token consumption within a single conversation. At first, I was also confused; it seemed impossible for a single conversation to cause such a massive drain. That changed when danperks's post at Frustrated with Cursor's Token Hungriness - #197 by danperks explained how LLMs handle context, combined with what I learned from a prompt-injection extraction of Cursor's system prompt: System prompts that may cause token waste.
According to danperks, each tool call consumes tokens equivalent to the entire context. And yet the system prompt encourages the model to make multiple tool calls, encourages it to search as broadly as possible, and even forces it to search multiple times for the same question. I believe this does improve the model's accuracy, but it comes at the cost of a huge number of tokens.
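To make that cost structure concrete, here is a minimal sketch of a generic agent loop. This is not Cursor's actual code, and call_model / count_tokens are hypothetical placeholders; the only point is that the entire message history is re-sent on every tool call, so each call's input cost is roughly the size of the whole context so far.

```python
# Minimal sketch of a generic agent loop, NOT Cursor's implementation.
# call_model() and count_tokens() are hypothetical placeholders.
def run_agent(messages, call_model, count_tokens, max_calls=10):
    total_input_tokens = 0
    for _ in range(max_calls):
        # The full history (system prompt, user request, all prior tool
        # results) is sent again on every iteration.
        total_input_tokens += count_tokens(messages)
        reply = call_model(messages)
        messages.append({"role": "assistant", "content": reply["content"]})
        if reply.get("tool_result") is None:
            break
        # Tool output is appended to the context and will be re-sent
        # (and re-billed) on every subsequent call.
        messages.append({"role": "tool", "content": reply["tool_result"]})
    return total_input_tokens
```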

Furthermore, I believe there is a hidden catch in danperks's explanation: if the initial input is 20,000 tokens and the first tool search adds 10,000 tokens to the context, then the next 10 tool calls do not simply cost 10 × 30,000 = 300,000 tokens, because each call also adds new content to the context. Assuming each call adds just 5,000 more tokens, the ten calls process roughly 30,000, 35,000, …, 75,000 tokens of context, for a total of about 525,000 tokens, already more than 26 times the initial input. I'm afraid this is the real reason for the terrifying token consumption: a system prompt that frantically calls tools to complete a task, with each call ballooning the cost further.
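Spelled out as a quick back-of-the-envelope calculation (an approximation of how context re-reading compounds, not Cursor's actual billing logic):

```python
# Rough back-of-the-envelope for the numbers above (not billing code).
initial_input = 20_000
context = initial_input + 10_000     # context after the first search adds 10,000 tokens
total = 0
for _ in range(10):                  # ten further tool calls
    total += context                 # each call re-reads the full context
    context += 5_000                 # each result grows the context by 5,000 tokens
print(total, total / initial_input)  # -> 525000, 26.25x the initial input
```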

My suspicion is that Cursor has been using token-wasting methods to achieve a better understanding of the context, creating the impression that Cursor is superior to other tools. However, as model prices have risen, they can no longer afford such huge costs and have thus switched from a per-request fee to a per-token fee.

But this is making users pay for the Cursor team’s mistakes with their own wallets. Indiscriminately spending massive amounts of tokens to improve functionality, only to blame it all on “users not knowing how to use LLMs”—this is what the Cursor team is doing.

3. If my analysis above is correct, should a “Frugal Mode” be added?
Even in a complex project, a long conversation is often still unfinished when you need to slip in one or two simple tasks (this is very normal, because handing an LLM an overly complex task all at once is both hard to describe and unlikely to be completed well). At that point the context is already large and more than sufficient for the model to complete the simple task; there is no need to run as many searches and tool calls as possible. In this situation, would it be possible to send requests in a mode that reduces tool calls and search depth to save tokens?
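As a purely hypothetical illustration (none of these parameters exist in Cursor today), a "frugal mode" could be as simple as a per-request cap on tool calls and search depth that the user opts into when the existing context is already enough:

```python
# Hypothetical illustration only; these settings do not exist in Cursor.
FRUGAL = {"max_tool_calls": 2, "max_search_depth": 1}
NORMAL = {"max_tool_calls": 25, "max_search_depth": 5}

def pick_limits(task_is_simple: bool) -> dict:
    # For a quick follow-up inside an already-loaded conversation, a low cap
    # on tool calls keeps the per-request token cost bounded instead of
    # letting the agent search as widely as it likes.
    return FRUGAL if task_is_simple else NORMAL
```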

I look forward to a response from the development team to let me know if my speculation is correct.


Check this: URGENT: Pro+ Plan Token Consumption Bug - 65% of Monthly Allowance Used in 5 Days

The billing rules are not transparent. You guys messed up such a good product! Goodbye! I have cancelled my subscription and I am going to subscribe directly to Claude Code.