This is getting out of hand

Hi,

Currently a 4-5 month Cursor user here, with a lot of productivity gained, I might add.

However….

It currently feels like Cursor is just one big money grab. I mean, 1-4 million tokens of usage per prompt, seriously?

I mean, if they feel they currently have the upper hand in the IDE war, they are absolutely right. But there will come a time when they won’t anymore, and that will be the downfall.

I must say I am not surprised, at least with thinking mode. However, it hugely depends on how you manage your prompt, and on one more thing: when you work on a feature, make sure you scope it. Write some form of implementationPlan.md and a taskList.md, keep them narrowly scoped, and refer to them as you work; that way you keep the context narrow instead of bashing through the entire repository. You need to work with it like a software architect/director, otherwise Cursor will run around like a headless chicken.
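To make that concrete, here is a minimal sketch of what such a task list could look like; the file name, feature, endpoints, and paths are purely illustrative, not any Cursor convention:

```markdown
<!-- taskList.md (illustrative): one narrowly scoped feature, nothing else -->
# Feature: password reset via email

Scope: only the reset flow. Do not touch login, signup, or session handling.

- [ ] 1. Add a `POST /auth/reset-request` endpoint that emails a reset token
- [ ] 2. Add a `POST /auth/reset-confirm` endpoint that validates the token and sets the new password
- [ ] 3. Unit tests for token expiry and token reuse
- [ ] 4. Update the API docs for the two new endpoints

Relevant files: `auth/routes.py`, `auth/tokens.py`, `tests/test_reset.py`
```

Point Cursor at this file (and the matching implementationPlan.md) instead of the whole repository, and tick items off as you go.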

1 Like

The first mistake is using a Claude model. Don’t use them. They don’t even have the highest intelligence, yet cost more than 10x compared to other models that are smarter. For what they charge you, they should be sending you cookies in the mail every time you make a query.

You could use o4-mini nearly endlessly and never reach your Cursor limit.

2 Likes

Is that really true? Benchmarks like SWE-bench show that o4-mini is nowhere near as capable at solving real-world problems as even earlier Sonnet models. I think you are probably better off running Sonnet 3.5 or 3.7 if 4 is too expensive.

It’s not like we are using Cursor to ask models to take exams, where o4-mini’s performance is significantly closer to (or better than) the Sonnet models. We are using it in the domain of coding, so coding benchmarks (good, realistic ones) are what matter here.

Yes, I believe it is absolutely true, without even getting into the problems around SWE-bench. On benchmarks that take more things into consideration, such as pricing, speed, and cost per task, instead of just raw and more arbitrary scores, Claude models consistently rank as some of the least desirable models, and it has been this way basically since the beginning. Here is an example (I’m not even going to put Opus on here, because it completely destroys any pricing/cost graph it is on by shoving everything to one side and making it unreadable due to its ridiculous pricing and cost to use): Comparison of AI Models across Intelligence, Performance, Price | Artificial Analysis

Also, I don’t know what you mean by “I think you are probably better off running Sonnet 3.5 or 3.7 if 4 is too expensive”, because they are all the same price, which is another huge point of discussion (and just using the SWE-bench argument could also mean there is no reason to use 3.5 or 3.7). While the industry has seen a very good downward trend in pricing, Claude/Anthropic has stagnated by keeping their ridiculous prices the same, and even regressed with their incredibly overpriced Opus model releases.

Claude is not worth using at all in an environment such as Cursor where your access can be blocked by how much cost you accrue. Whoever convinced the dev community that they are worth using at all deserves a Nobel Prize.

1 Like

You are right, my bad. I didn’t realise that Anthropic was charging the same for 3.5 and 4. That’s truly nuts. EDIT: I’ve probably noticed it’s “cheaper” because it makes fewer tool calls to check things; you have to spec things properly, but it doesn’t do 50 checks and test runs. It also doesn’t call other models to do things like search for code content in your codebase. These tool and model calls are expensive. For my use case, therefore, they are functionally cheaper, because the extra value of some of the Claude 3.7 and Claude 4 models is not worth the cost for me.

But I think you are missing a few points. Firstly, the value you put on quality depends on your sensitivity to price and the value of your time. Personally, I’ve found that Claude is much more reliable than Gemini 2.5 Pro for coding work.

In terms of the metrics used in your link, most of them are not coding-specific, and the ones that are coding-specific are more about problem solving than building code, i.e. they tend to be optimization/algorithm problems, not “refactor this code” problems. Many of the tests have nothing to do with code; they test reasoning etc., and Humanity’s Last Exam has literally nothing to do with coding.

Would love to hear your criticism of the SWE benchmarks, because in my research they looked like a pretty reasonable approximation of what developers are using these models for.

I don’t have a wagon to push for Anthropic/Claude; I’m not a shareholder or a shill. I don’t really care what model I use, but I want to get good results. I’ve done most of my serious testing with Gemini Pro, Claude 3-4, and DeepSeek R1.

Frankly, for coding work, rating Gemini 2.5 Pro or R1 higher than Claude does not match my real-world experience using those models. There’s a reason Claude has a good reputation with developers: it tends to perform well in real-world development environments, and its behaviour is pretty good. I’ve tried Opus, but I didn’t find it anywhere near worthwhile compared to Sonnet.

Mind you, I’m not vibe-coding, and I’m working with established code and interlinked systems. That’s the environment I’ve tested them in. Interested to hear about your experience.

3 Likes

For me the point is how much cost Cursor usage accrues (the topic of this thread) and how much one can actually use Cursor, which (as of the new pricing model) is determined by how much cost one accrues. With this in mind, Claude is a bad choice.

I too use them for non-vibe-coding tasks, and with the right context and instructions, all models above a baseline intelligence perform about the same IMO, with only the really hard tasks requiring a lot of reasoning. I see it as a user error/skill issue of not giving the right instructions to get what you want, with of course a little variation between models. But this is a rabbit hole I don’t really want to get into. Everyone and their dog has their own opinion on which model is best, based almost exclusively on anecdotal evidence. I use OpenAI models almost always; I was using Claude almost always when it had half usage a month or so ago, and even then found myself frequently going back to o4-mini or Gemini Pro.

I too have no wagon to push, other than wanting Cursor to work well and not lock me out after using Claude and making a fifth of the requests I could have made with o4-mini. This problem is exacerbated by Cursor focusing so much on making Claude work well within Cursor and neglecting other models, which in turn drives more Claude usage among users. The other models could certainly perform the same or better if they got the same attention and care.

As for benchmarks, there is a lot of criticism right now around SWE-bench drawing from a narrow set of repos that are mostly, if not all, Python and Django (which is ironic, because that is exactly what the projects I work on use). A few searches on X will get you some discussion around it. I would say Artificial Analysis is fairer: it has both general and coding-focused benchmarks, on which Claude does not rank among the top despite its high cost. The link I shared basically covers all the models that would be worth one’s time and money on Cursor.

I am mainly focused on cost and usage limits on Cursor, where Claude is, all things considered, a bad choice in my anecdotal experience. Even if Claude is a little bit better, I do not think its 5x (and 20x with Opus) cost over o4-mini makes it worth using on Cursor.

2 Likes

Thanks. I think that is totally fair; for me, the performance improvement I think I’ve seen with Claude is worth it. Personally, I find that when looking at a codebase, Gemini 2.5 more often gets the “wrong end of the stick” when interpreting what you are doing and why (and therefore how to fix something or implement a bug fix); perhaps the problem is that I’m not giving it enough direction. I will try o4-mini and get some experience with it, on your recommendation.

I would agree, good-quality data is hard to come by. What I would say is that the benchmarks you linked, while useful, are not what I would consider good coding tests; they are good tests of apparent intelligence, reasoning, and the ability to maintain a coherent understanding of a back-and-forth thread of messages. These are all useful things, but based on my experience with the models I’ve used, anything that puts Claude below Gemini 2.5 in coding performance for my use cases is probably wrong. Could Gemini 2.5 do better at reasoning or maths? Absolutely, in fact I’d bet on it. Could it therefore do better on the kind of optimisation problems that make up a lot of code challenges? Absolutely. But in my experience, those challenges don’t match the reality of what developers are often doing: making changes to an existing codebase, or doing something that is not well documented in the corpus. These are the areas where I’ve found Claude performs better. When you present a model with a largely undocumented API for an in-house service (with source code access), Claude will build things that integrate better than Gemini does, in my experience.

I’ve been using gemini-cli, which is free and uses 2.5 Pro, and I still much prefer dealing with Cursor + Claude.

2 Likes

Okay, I’ve been trying o4-mini. For the price it’s great, but I’m still finding Sonnet a better model for big tasks. I’m not sure the billing is quite right in Cursor; it seems to bill a lot for cache hits, with huge numbers of cache-hit tokens on Claude. Not sure if that’s a Cursor or a Claude issue.

So far I’ve found a few little stumbles with o4-mini: you have to take care to check everything very closely, including reading its output carefully. Generally I find Sonnet “understands” what you want, but at 5x the price, I’m going to save Claude for big, complex tasks and use o4-mini for smaller things. Thanks for the recommendation, @ColemanDunn.

I definitely prefer o4-mini over Gemini 2.5 Pro.

2 Likes

Same here; 3/4 of this is “excuse me, I will make a quick fix” (Claude 4 & Claude 4.1).

I’ve just been thinking/researching what model would be the best alternative, since I had the most success with Claude 4 and it has gotten way too costly with the new Cursor pricing. This thread gave me a lot of knowledge, so thank you @ColemanDunn and @TinBane.

It would be nice, for people complaining about pricing/Sonnet, to have a stickied guide here on the forums or in the official Cursor documentation about which models might be best based on pricing.

1 Like

That’s a great idea!

I wanted to add some context to some of the things I’ve said.

My experience is that Claude 3.5 and 3.7 are in fact cheaper, even though they are the same price per token, because Claude 4 seems to ingest a lot more tokens.

I just tried GPT-5; it seems very smart and capable, and it also doesn’t use as many tokens.

On the files I’m currently working on, Claude 4 was averaging 400k tokens per prompt. GPT-5 solved an issue Claude was struggling with and, if the Cursor stats are correct, used 58k tokens.

Hi everyone, I will address the main topic:

  • Usage listed is per request, which may contain multiple tool calls.
  • We show the token usage that the AI providers report directly via their APIs.
  • Longer chats, or chats with many tool calls, accumulate token usage over time (see the rough example below this list).
  • Context adds up as well, so remove any unnecessary attachments, rules, MCPs, …
  • Thinking models use more tokens than non-thinking models.
  • Heavier models like Opus cost 5x as much as Sonnet.
  • Use Auto where possible to reduce consumption.
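To make the accumulation point concrete, here is a rough back-of-the-envelope sketch; the numbers are invented for illustration and are not actual Cursor billing figures:

```
One “request” with many tool calls, assuming (illustratively):
  ~50k-80k tokens of context per model call (attached files, rules, MCP schemas, chat history)
  20 tool calls in the request, each re-sending that context

  20 calls × ~50k-80k tokens ≈ 1M-1.6M input tokens for a single request
  (cache reads are billed cheaper than fresh input, but still show up as usage)
```

That is roughly how a single prompt can reach the 1-4 million token figures mentioned at the top of the thread.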

More on token usage and how to optimize it to get more out of your plan