I think you should preserve the original names, especially when the provider already has another model with that name. Why leave all users confused and asking for the high model at first? Poor experience imo
I missed this thread or I would’ve posted here in the first place, but I compiled some coding benchmarks:
Aider:
o3 (high) is ~17x the cost of Gemini 2.5 Pro for <10% improvement over its score.
LiveBench:
SWE:
Codeforces (IMO):
Personally, I think it would be great to have a Cursor-specific benchmark, since none of these will factor in its custom scaffolding (the testing specifics could remain private to avoid gaming it).
It would be great to have a benchmark specific to Cursor.
So many models now…
Is o4-mini in Cursor just completely unusable for everyone else, or is it just me? When I use it, it usually results in excessive tool calls without making any file changes. I just tried to change a file, and it started grepping a bunch of files in my codebase and making a bunch of other tool calls. The few times I tried using it, it exceeded the 25-tool-call limit (where it asks you to resume) without recommending a single change.
I don’t care that o4-mini is slow. I will happily wait for changes if they’re accurate and not prone to hallucinations. But it seems that whatever Cursor is doing instruction-wise is causing the model to just continually use tooling. It’s tedious.
It’s not just this model. Yesterday I used Gemini 2.5 and Sonnet 3.7 until I had made more than 120 fast requests without receiving any results. After that point, all the tools started running in slow mode, and all changes were being applied to the files.
o3-pro will be replacing o1-pro soon…
Though soon here means a few weeks.
Thanks! But does it make sense they’re listed as supporting agent mode despite not being premium?
BTW, it’s impossible to read the comments. Update: thanks for fixing!
Also, is it not time to finally add a column of what only works in usage-based billing?
Yes. Mine literally has not generated anything yet. It just says generating, and then nothing happens. Then my prompt is put back at the bottom to submit again.
That’s what the premium column is for.
If it’s premium, it counts toward your premium requests, unless the price column says free, in which case it costs no premium requests and is completely free to use (even if only temporarily, like GPT-4.1 and o4-mini). And some models might cost 1/3 of a premium request per actual request.
If it’s not premium, each request charges the amount in the price column, barring the daily free usage some models get (usually 10 per day). If that column says free, it’s totally free to use.
If it’s Max, each request and each tool call charges the amount in the price column.
Thus, any non-premium model requires usage-based billing to use, unless it’s free.
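To make the rules above concrete, here is a rough sketch of the billing logic as I understand it. This is just my reading of the models table, not anything official from Cursor, and the function name and return values are made up for illustration:

```python
def cost_of_request(premium: bool, price, is_max: bool = False):
    """Return how a single request appears to be billed, per the rules above.

    premium: whether the model is marked as premium in the models table.
    price:   the value in the price column ("free" or a dollar amount).
    is_max:  whether the model is a Max variant.
    """
    if is_max:
        # Max models: every request *and* tool call is charged at the price column.
        return ("usage_based", price)
    if premium:
        # Premium models draw from the monthly premium-request allowance,
        # unless the price column says free.
        return ("free", 0) if price == "free" else ("premium_request", price)
    # Non-premium models: charged per request via usage-based billing, unless free.
    return ("free", 0) if price == "free" else ("usage_based", price)

# Examples matching the cases above:
print(cost_of_request(premium=True, price="free"))  # → ('free', 0), e.g. GPT-4.1 / o4-mini while temporarily free
print(cost_of_request(premium=False, price=0.30))   # hypothetical non-premium paid model
```

The last branch is why any non-premium, non-free model requires usage-based billing to be enabled at all.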
Yes, I agree it seems like:
- No Premium+Free => never uses usage-based billing
- No Premium+Price => always uses usage-based billing
But it seems like a bizarre, non-user-friendly way to present it.
Especially since the docs contradict all of this with “Non-premium models can be used by enabling usage-based pricing”, which completely ignores free non-premium models (I’ve already reported it here).
Yeah, it’s pretty complicated.
I’m still not able to use o4-mini despite being on the latest version:
There’s an old OpenAI key in my settings, but it’s already disabled. Attempting to remove it (clearing the text field) does nothing; it just comes back!
Any way to remove my API key or fix things on my end?
Unable to see o4-mini’s thoughts in Cursor’s Agent mode.
Why does my Cursor still show “o4-mini”? I asked the model, and it said it was released in 2024. There’s also no model like “o3-high” available.
What do you mean, “still o4-mini”? That’s what it’s supposed to be. As for asking the model, did you try asking the same question directly in ChatGPT to compare? Models often aren’t honest about who they are.
As for reasoning, Cursor always settles on one reasoning effort and doesn’t put it in the model name. The models doc clearly states o3 uses high.
Exactly, but if they did that, I believe everyone would flock to the model with the higher mark and overload the system.