Why is a simple edit eating 100,000+ tokens? Let’s talk about this

Everything you said may be correct, but you people don't suffer or lose money, only the end user does. So instead of jumping on every new bandwagon as fast as possible, please make sure it's safe first, so the user won't suffer and get charged hundreds of dollars. And then you have the audacity to tell people, "we only provide the service of connecting you to Mars and Jupiter, and since the problem is on Jupiter, we aren't the ones to blame." THEN DON'T USE THAT COMPANY'S MODELS. Cursor has enough leverage to push Anthropic to fix issues, so use that leverage instead of telling users that what they reported is wrong, blah blah blah. You people sound like a three-month-old child who makes a mistake and, when someone asks who did it, says "MAMA." Grow up.

Thanks for your feedback, I appreciate it and will pass this on to the team as well!

Hello, and while you are here, can you tell us directly the secret behind legendary token consumption like 62M tokens in a single request, and why the Cursor team isn't working to reduce this by 80% or 90%? It would be easy for any developer to do so without affecting output quality, by adding layers of caching and summarization that reduce tokens and keep your customers loyal.

Hi @Mohamed_Khafagy, I will summarize what several team members have already posted in other forum threads on this topic.

  • AI API requests are logged with the token usage reported by the AI providers.
  • We have looked into the Request IDs posted by users and found that the usage is as reported.
  • Token usage increases when a request needs many Agent steps (e.g. tool calls, MCPs, follow-up questions), because each step is a separate API call to the AI provider and the whole context is sent each time.
  • Note that we already apply a 75-90% optimization with AI providers that offer it, by using the provider’s prompt cache. Tokens already processed in previous steps and requests of the same chat are not re-processed, and a Cache Read costs only 10-25% of a normal input token, depending on the provider (Anthropic charges 10%, Gemini 25%). Most of the large token amounts are such cache reads, which the model uses to produce the next output (see the cost sketch after this list).
  • Also, when a chat reaches a model’s token limit, we apply summarization, and we are rolling out improvements here too.
  • We keep looking at further optimizations, but no “simple or easy” one has been found so far. If you have such a solution, we would look into it.
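
To make the cache arithmetic concrete, here is a minimal sketch of how billing the already-processed prefix as cache reads lands in the quoted 75-90% range. The prices and step sizes are illustrative values chosen for the example, not Cursor’s actual billing logic:

```python
# Rough cost sketch: illustrative prices and step sizes, not Cursor's billing code.
def agent_run_cost(step_tokens, cache_read_ratio, input_price_per_mtok):
    """Compare input-token cost with and without prompt caching.

    step_tokens          -- new tokens added at each agent step
    cache_read_ratio     -- cache-read price as a fraction of the input price
                            (the post quotes ~0.10 for Anthropic, ~0.25 for Gemini)
    input_price_per_mtok -- assumed input price per million tokens
    """
    price = input_price_per_mtok / 1_000_000
    no_cache = with_cache = 0.0
    prefix = 0
    for new in step_tokens:
        no_cache += (prefix + new) * price                             # full context at full price
        with_cache += prefix * price * cache_read_ratio + new * price  # prefix billed as cache read
        prefix += new
    return no_cache, with_cache

# Example: 20 agent steps, each adding ~10k tokens of context.
full, cached = agent_run_cost([10_000] * 20, cache_read_ratio=0.10, input_price_per_mtok=3.0)
print(f"no cache: ${full:.2f}  with cache: ${cached:.2f}  saving: {1 - cached / full:.0%}")
```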

For specific requests, a Request ID is required so the team can look into it.

Additionally, I wrote a more detailed explanation:

Hi @condor, I’ve seen your reply and read the documentation you provided in great detail. Honestly, it confirms exactly what I suspected the moment I first saw how your token usage works.

What you explained in both your reply and the documentation reflects the basic way LLMs operate: every time a reply is generated, the model treats the entire context as new, analyzing it from A to Z to produce a response. Naturally, as the conversation grows, token usage increases and the same content is sent over and over. To mitigate this, you made a good deal with the LLM provider for a caching solution at a significantly lower cost, which helps reduce the overall token charges. You likely expected this would make users happy and that both sides would live happily ever after.
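
To illustrate the pattern you are describing, here is a minimal sketch of a hypothetical chat loop (the token estimate and the placeholder reply are stand-ins, not a real API): every call re-sends the full message history, so each turn pays again for everything that came before.

```python
# Hypothetical chat loop showing why the whole context is re-sent on every turn.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)   # rough heuristic, not a real tokenizer

messages = [{"role": "system", "content": "You are a coding assistant."}]
total_input_tokens = 0

for user_turn in ["Open foo.py", "Rename the helper", "Now add a test"]:
    messages.append({"role": "user", "content": user_turn})
    # The request payload is the entire history, so turn N re-sends turns 1..N-1.
    total_input_tokens += sum(estimate_tokens(m["content"]) for m in messages)
    reply = f"(model reply to: {user_turn})"   # placeholder for a real model call
    messages.append({"role": "assistant", "content": reply})

print("input tokens across 3 turns:", total_input_tokens)
```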

But surprisingly, when users saw this solution, many of them became frustrated, even angry, and instead of showing appreciation, they started looking for better alternatives and complaining across your discussion board. This is the pattern I have noticed in many Cursor team responses over the past month. Emotionally, it makes sense: you have put great effort into building such a powerful editor and keeping pricing reasonable, yet users don’t fully appreciate this and keep repeating the same complaints again and again.

Let me be honest with you: the issue you described is not solvable the way it’s currently framed. From your side, managing tokens is the user’s responsibility, not Cursor’s. But from the user’s side it is your problem, because we pay you. So ultimately, it is your responsibility.

And I believe we’re right to say that. If you continue sending full files and folders, with all their raw context, to expensive models like Claude, hoping to get a discount through caching while pushing the problem to the model provider, then respectfully, I think that’s the wrong approach.

Can you tell me what benefit you get from sending 500K lines of HTML files full of simple repetitive code (static layouts, boilerplate tags, and unused blocks) 20 times, consuming 1.5M tokens, just to edit something that is only 30 lines long? There is no real need for this.
Instead of sending 500K raw lines, I should be sending 50K or fewer: focused on the relevant files and code sections, organized with line numbers, marking where each function, component, or block starts and ends, and with short descriptions explaining what each part does.
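
As a rough illustration of the condensed context I mean, here is a sketch that builds an outline of a Python file with line ranges and one-line descriptions per top-level definition. The outline format and the example output are invented for this illustration, not anything Cursor actually sends:

```python
# Hypothetical "condensed context" builder: instead of the raw file, send an
# outline with line ranges and a one-line description of each definition.
import ast

def outline(path: str) -> str:
    source = open(path, encoding="utf-8").read()
    parts = [f"# {path} ({len(source.splitlines())} lines total)"]
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            doc = (ast.get_docstring(node) or "no description").splitlines()[0]
            parts.append(f"L{node.lineno}-L{node.end_lineno} {node.name}: {doc}")
    return "\n".join(parts)

# The model would then receive something like:
#   # app/views.py (1240 lines total)
#   L12-L48 render_sidebar: builds the static sidebar layout
#   L51-L300 PageBuilder: assembles the boilerplate page sections
# and could request full source only for the line ranges it actually needs to edit.
```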

Now let’s talk solutions. Here are just a few very basic ideas; many others could be applied here and there:

  • Split processing into two stages (see the sketch after this list):
    First stage: run the request through a lightweight model to summarize files, extract the needed sections, and reduce the token load
    Second stage: send the minimal, smartly structured context to Claude or any other premium model

  • Reduce functions to their key logic
    Skip repeated patterns
    Include line numbers and scope
    Highlight what each part of the file is doing
    Give instructions about how you expect the result to be returned

  • Use map files inside the project that describe:
    The content of each file
    The project path structure
    Dependencies between files

  • Check the MD5 hash of each file to make sure the latest edits were made inside Cursor.
    If the file hasn’t changed since the last request, you already have the summary from the previous run.
    In that case, don’t resend the full content or waste tokens reprocessing the same file.
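
Below is a minimal sketch of the two-stage flow combined with the MD5-gated summary cache from the points above. The helper names and the `cheap_summarize` / `premium_model` callables are hypothetical placeholders, not Cursor’s actual implementation:

```python
# Sketch of: stage 1 = cheap summarization gated by an MD5 check,
# stage 2 = the premium model receives only the condensed context.
import hashlib
from pathlib import Path

_summary_cache: dict[str, tuple[str, str]] = {}   # path -> (md5 digest, summary)

def condensed_context(path: Path, cheap_summarize) -> str:
    """Stage 1: summarize a file with a lightweight model, skipping files
    whose MD5 hash has not changed since the last request."""
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    cached = _summary_cache.get(str(path))
    if cached and cached[0] == digest:
        return cached[1]                  # unchanged file: reuse the stored summary
    summary = cheap_summarize(path.read_text(encoding="utf-8"))
    _summary_cache[str(path)] = (digest, summary)
    return summary

def run_task(task: str, files: list[Path], cheap_summarize, premium_model) -> str:
    """Stage 2: send only the minimal, structured context to the premium model."""
    context = "\n\n".join(condensed_context(f, cheap_summarize) for f in files)
    prompt = f"Project context (condensed):\n{context}\n\nTask:\n{task}"
    return premium_model(prompt)
```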

As a CTO and developer with 15 years of experience, I can confidently say there are many practical ways to solve this issue efficiently — and I’d be happy to join a call or meeting if you’d like to discuss it further.


is this chatgpt lmaoooo

It doesn’t work like that. gpt-4.1-Nano will not be able to figure out which code is important for Claude or Grok and which is not. This should be decided by the main model itself.

It doesn’t work like that. You’re literally asking them to send docstrings instead of code.

They are already indexing projects somehow. You can also try my Agent Docstrings, though I’m not sure whether it affects token usage. It was originally conceived as a tool that reduces tool calls, so it might help, since that is the root of the excessive-token-usage problem.

Cache Writes and Reads are literally about that.

Those who use more than just Claude may have noticed that Claude eats too much cache. I assume that this is a feature of Claude or the Anthropic API itself.

It’s just weird to say that everything is OK when Claude is really too gluttonous relative to other models on the same tasks. Its cost efficiency is lower than that of other models, even though it is genuinely good on its own.


And @danperks, who complained about the lack of well-designed bug reports, did not respond to mine. Apparently, I designed it too poorly :frowning:

Although I must admit that I got worse results in it than I had planned. :roll_eyes:

But anticipating that the tokens and time I spent would not be refunded, I did not want to repeat the test a couple more times.

No need to figure out which code is important; you summarize it all down to 20% of its original size.

I have tried this out many times, and one line of description can replace 10 lines of code and still return the same results without any problem.

Your tool is good, but as far as indexing goes, judging by the tokens used, Cursor is not indexing or summarizing anything.

Yes, that works too, but it costs a lot for every developer here.

And that’s why I’m switching to Claude Code… Max is expensive, but at least I know what I’m getting.

@Mohamed_Khafagy I agree with most of @Artemonim’s response. The solutions you described are either not practical, would cause issues, or are already used where possible.

Hi @condor, can you please clarify which part would cause problems? With a good understanding of the code logic, most LLMs return correct code without being sent the full code on every request; the trick is how you send that code and how you organize it so you receive a correct answer.

I can’t answer specifics as I do not work on that part, so I responded from my own understanding and experience.

The team is aware of feature requests on this and is looking into further possible optimizations and improvements. So I am not dismissing the general need for optimizations.

Each summarization or removal of code, by its nature, strips out context that may be required for inference. Agent already pulls in only the required parts, so removing more of that context would lead to worse results.

Summaries for each file would actually increase context and complexity unnecessarily.

Code indexing does a much better job at identifying relevant parts without loss of precision.
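
For readers unfamiliar with the term, code indexing here broadly means embedding-based retrieval over chunks of the codebase. A generic sketch, with a placeholder `embed` callable standing in for whatever embedding model an indexer might use (this is not Cursor’s actual pipeline):

```python
# Generic embedding-based code retrieval sketch; `embed` is a placeholder.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def top_chunks(query: str, chunks: list[str], embed, k: int = 5) -> list[str]:
    """Return the k code chunks most similar to the query by cosine similarity."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)[:k]
```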

Users have also requested that we not summarize or condense large files at all, though those features already exist.