Models comparison table

Hello, I have made a simple interactive table showing models available in Cursor. It was quite a nice learning experience with Alpine. I used many different models - V3 for small things, o3-mini-high for harder stuff, and sometimes Sonnet 3.7. Despite generally liking Sonnet a lot, for this particular project o3-mini was actually better at handling more difficult refactors, even compared to the thinking version of Sonnet.

https://monnef.gitlab.io/by-ai/2025/cursor_models_comparison

Edit: That award ribbon is automatic - it is given to the model with the best Aider score in its credit group. It is not necessarily the best choice (e.g. I personally wouldn’t recommend Sonnet 3.7 on Cursor because of the price of o3-mini-high, which I view as a better go-to model; for those trying to save credits I would recommend V3, then o3-mini-high, then Sonnet 3.7, and then the thinking variant).


What do the gold ribbons mean in your table? Does that mean you found DeepSeek Chat V3 useful on a level comparable to Claude 3.7 and o3-mini?

I’d like to know the benchmark’s criteria

Is it using external API key?

Where is Claude 3.5 Sonnet?

Does this imply that o3-mini (high) is better vs cost than 3.7 Sonnet?

Those mark the best performance (in the Aider Polyglot bench) within a credit group (among models which cost the same amount of credits in Cursor).

Not quite. Take V3, for example - it means it is the best among models which cost 0 credits; the only other model with a cost of 0 here is 4o mini. I tend to agree with the Aider Polyglot bench more than other benches (e.g. the coding part of LMArena), which is why I used that one.
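
For anyone curious, the ribbon assignment boils down to something like this - a minimal sketch in plain JavaScript with placeholder scores, not the actual page code:

```js
// Sketch of the automatic ribbon logic: group models by their Cursor credit cost,
// then award the ribbon to the best Aider score within each group.
// All scores here are placeholders, not real benchmark numbers.
const models = [
  { name: "GPT-4o mini",       credits: 0,     aider: 10 },
  { name: "DeepSeek Chat V3",  credits: 0,     aider: 20 },
  { name: "o3-mini-high",      credits: 1 / 3, aider: 30 },
  { name: "Claude 3.7 Sonnet", credits: 1,     aider: 40 },
];

const bestInGroup = {};
for (const m of models) {
  const current = bestInGroup[m.credits];
  if (!current || m.aider > current.aider) bestInGroup[m.credits] = m;
}

// Each winner gets the ribbon, e.g. V3 wins the 0-credit group
// because its only rival there is 4o mini.
console.log(Object.values(bestInGroup).map((m) => m.name));
```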

o3-mini-high is in my opinion currently the best quality for the price (a third of a credit per use), even overall. The other model at the one-third-credit price is Haiku, and even though it can use agentic tools better, it is just a bit weaker as a model.

If you need the best performance, then it looks like Sonnet 3.7 thinking is currently the best. But I am pretty sure I read on this forum that they plan to make it cost 3 times as much as the non-thinking version. If that happens, then Sonnet 3.7 thinking would be overpriced in my opinion, and the much better choice would be Sonnet 3.7 non-thinking, since it would cost 3 times less and its quality in benches is not three times worse, only about 5% lower. So the best go-to model would be Sonnet 3.7, and only rarely would you try the thinking variant. It would probably be best to try o3-mini-high before the overpriced thinking Sonnet.
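
To put that reasoning into rough numbers (purely hypothetical scores, only to show why a ~5% quality gain would not justify a 3x credit price):

```js
// Hypothetical bench scores and the rumored 3x price, not real data.
const sonnet         = { score: 60, credits: 1 };
const sonnetThinking = { score: 63, credits: 3 }; // ~5% better, 3x the credits

const scorePerCredit = (m) => m.score / m.credits;
console.log(scorePerCredit(sonnet));         // 60
console.log(scorePerCredit(sonnetThinking)); // 21 - much worse value despite the higher score
```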

It is the Aider bench (I think it covers a few different languages, not only Python like so many other benches, and I think the tasks come from Exercism, so somewhere between puzzles and real code, probably?). I have added much better explanations, including links to the mentioned benches.

Added it with the last update :slightly_smiling_face:.

Unless something drastically changed in the last day or two, yes.


I jinxed it… Of course it changed from yesterday :pensive_face:.

Sonnet 3.7 thinking now costs 2x; not sure if there are any other changes. Cursor – Models

Edit: Updated to the new price of Sonnet 3.7 thinking.


In a similar, but slightly different context, I was wondering if certain models work better or worse in various IDEs? For example, does anyone have an opinion on the best models to use in Xcode with Swift?

I used Grok2 on my first two projects and it worked well. However, on my current project, I selected a series of models. Not knowing if this was a good idea or not, I thought I’d give it a try. Without any objective criteria, it’s hard to determine any particular/specific differences. I also experienced a lot of crashes yesterday, although many others did as well, so the crashes were likely unrelated to the models I selected.

Anyway, I’d be interested in recommendations for the best models to use for specific kinds of development and IDEs.

Thanks,
Jeff

Another leaderboard that might be helpful: https://web.lmarena.ai/leaderboard (I have no idea how they managed to get such a high score for DeepSeek-R1. Maybe we should pay attention to projects that can effectively utilize DeepSeek-R1? I certainly can’t make good use of it.)

You can see the problems used by aider polyglot here: https://github.com/Aider-AI/polyglot-benchmark. I briefly looked at one problem, and it seemed similar to LeetCode-style algorithm questions. I didn’t look at many, so I’m not sure if it has problems related to real-world development.

Also, o3-mini-high achieved a high score, but I wouldn’t expect a model that, according to the official documentation, “Counts as 1/3 fast request” and has a “Pricing” of $0.01, to be the “high” version rather than “medium” or even “low”. You can see the discussion about the “low,” “medium,” and “high” versions here:

Yeah, LMArena is questionable; I view it as worthless hype-driven nonsense (both the overall and the coding parts). I remember the times when Sonnet 3.5 (and 3.6) was for me the best, yet on the arena the top spot went to 4o or even some Gemini Pro. No, from my testing, they were always a few levels below Sonnet. Nowhere near the intuitive beast Sonnet was and still is. I believe people on LMArena are trying toy questions, not even the more clever ones like those from LeetCode.

Though for front-end, that https://web.lmarena.ai/leaderboard seems to capture the quality better. I see o3-mini-high recommended more for back-end and planning, so I find it within the realm of possibility for R1 to be better there.

Edit: R1 was pretty good in my testing, but yeah, not on par with o3-mini-high nor Sonnet 3.6 or 3.7. Had it been priced reasonably, according to API pricing, it could have been a useful middle model - below o3-mini-high, Sonnet, maybe even 4o, but also cheaper.

Yeah, Exercism is similar to LeetCode, but still better (different languages, and I believe also different operations on code, not just writing from scratch) than the supposedly general programming benchmarks done virtually only in Python, creating (unrealistically) well-defined functions or scripts, and in the majority on ML tasks, usually without types. And this is coming from a dev who likes Haskell and its brevity in well-defined functions, puzzles, and code golf.

A Cursor dev confirmed it is the high version: O3-mini is LIVE! What version are we getting? - #70 by danperks.

And I agree. It is very strange, but because of how Cursor’s pricing works, o3-mini-high has the best quality-to-price ratio. In a normal world o3-mini-high would cost something like 2/3 of a credit or 1 credit, and R1 should cost something like 1/3 or 1/4. It doesn’t even make sense that o3-mini-high costs the same number of credits as Haiku, which is a much worse model; or that Sonnet costs the same as 4o, which in the API is about 1/3 cheaper and so should be around 2/3 of a credit… I suspect the Cursor team has some special contract with OpenAI for o3-mini-high and failed to find a US-hosted R1 at a normal cost. Another thing I recently came across is that there are some issues with the context window in Cursor, some new restrictions, and discussions about it are being censored (here and on Reddit), so that could be another reason for the weird credit pricing.
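
The “should cost” part is just proportional scaling from API prices - here with made-up prices, purely to illustrate the idea:

```js
// Made-up API prices, not official figures - only the ratio matters here.
const apiPrice = { sonnet: 3.0, gpt4o: 2.0 }; // 4o ~1/3 cheaper than Sonnet
const sonnetCredits = 1;

// If credits simply tracked API price, 4o would sit at roughly 2/3 of a credit.
const fair4oCredits = sonnetCredits * (apiPrice.gpt4o / apiPrice.sonnet);
console.log(fair4oCredits.toFixed(2)); // "0.67"
```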


Thank you for your clarification

This is a very hard thing to do.

Different “AIs” (LLMs) have different strengths across languages, paradigms, libraries, and coding styles. In the case of some AIs this can change drastically even between minor releases (e.g. GPT-4o has been guilty of this many times; I think Gemini too).

Then each platform (e.g. ChatGPT or ClaudeAI), program (e.g. Aider or Goose), IDE (Cursor, the one from Codeium that gets censored here, Trae) or AI plugin (Copilot, Continue.dev) has its own way of working with an AI - from the system/pre-prompt to RAG or agentic tools. This can quickly change in a minor release or even without a release (e.g. in the case of Cursor, when they change backend code).

You have so many moving parts which are hard to test automatically. Most likely, by the time you finished manually testing even the few most used languages (3? 5?), the most used libraries/frameworks for all of these (10? more?) and the IDEs/plugins/programs (6? 8?), at least one major change would already have occurred and you could go back to retesting…

Edit: I also forgot about an interesting phenomenon - many models are much better at modifying/extending code they wrote themselves, and much worse at doing this to human code or code from another AI, even from their own lineage - e.g. Sonnet 3.7 had a difficult time refactoring and expanding code from Sonnet 3.6, but one-shot a complete solution with the required new features when starting from scratch. :exploding_head:

Edit2: Remembered this bench - ProLLM Benchmarks; it allows filtering by language and has Swift. Too bad it looks like the bench is saturated for Swift, so it is not really apparent which model is better :confused:.

Hi, could you share the method or tools used to create this concise and visually optimized table? Thx

I am not sure it is worth following; there are most likely better ways (like grabbing an existing table component).

I started on Perplexity (Sonnet 3.7) with a search for API prices of the models, then in the same thread I gave it the copied table from the Aider bench results and the Cursor models (from the Cursor wiki). Then with AIlin (which adds HTML preview) I iterated over it on Perplexity (most of the color theme and design comes from Sonnet; I gave it only something vague like “dark theme, clean”), still as a static table. Then I thought it would be nice to add some metrics (credits vs API price, credits vs bench score) and sorting, and since I had recently come across AlpineJS (I am mostly used to React), a minimal front-end library, I thought I could use that and see and learn how Sonnet uses it.

I think at this point it got a bit too big for Perplexity to handle (the response limit is around 3.5k tokens, a couple hundred lines), so I moved to Cursor (for live preview I used the Live Server extension). There, with each feature or fix I would first try V3, with a few back-and-forths (just in chat mode, since it is a single-file application); if it wasn’t solved (surprisingly rare, maybe only 20% of the time) I tried o3-mini (or Sonnet 3.7), and if there was still no luck I would try the last resort (Sonnet 3.7 if the previous one was o3-mini, or the other way around). I only had to touch the code manually a few times, mostly for minor visual tweaks (after all, Alpine was rather unknown to me). AIs, even medium-sized ones like o3-mini or V3, can explain the basics of better-known languages/frameworks/libraries pretty well (I would use R1, but the credit cost is too high - 1 credit, especially compared to o3-mini at 1/3 of a credit, with o3-mini usually giving better responses).
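
For a rough idea of what the Alpine part looks like, here is a minimal sketch of a sortable table - made-up rows and scores, not the actual page code:

```html
<!-- Minimal Alpine.js sortable table sketch (placeholder data). -->
<script defer src="https://cdn.jsdelivr.net/npm/alpinejs@3.x.x/dist/cdn.min.js"></script>

<div x-data="{
  sortKey: 'aider',
  models: [
    { name: 'DeepSeek Chat V3', credits: 0,    aider: 20 },
    { name: 'o3-mini-high',     credits: 0.33, aider: 30 },
    { name: 'Sonnet 3.7',       credits: 1,    aider: 40 },
  ],
  get sorted() {
    // Sort a copy descending by the currently selected column.
    return [...this.models].sort((a, b) => b[this.sortKey] - a[this.sortKey]);
  },
}">
  <table>
    <thead>
      <tr>
        <th>Model</th>
        <th @click="sortKey = 'credits'">Credits</th>
        <th @click="sortKey = 'aider'">Aider score</th>
      </tr>
    </thead>
    <tbody>
      <template x-for="m in sorted" :key="m.name">
        <tr>
          <td x-text="m.name"></td>
          <td x-text="m.credits"></td>
          <td x-text="m.aider"></td>
        </tr>
      </template>
    </tbody>
  </table>
</div>
```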

Code is available at 2025/cursor_models_comparison/index.html · master · monnef / by-ai · GitLab (though you could probably see everything even in the browser using “View Page Source” or similar, since there is no build step).
