Gemini 2.5 vs Sonnet 3.7 vs Grok 3 vs GPT-4.1 vs GPT-o3

Can someone please tell me which model is the best for CURSOR?

I want to stop overthinking that I am not taking full advantage of Cursor with the models I am currently using, and just be sure I am using the absolute best.

Let’s try ending the discussion with one definite model, and that’s it!

In short: Gemini 2.5 Pro. It currently has tooling issues (roughly 2 in 10 requests fail), and the Cursor team is actively working with the Gemini team to solve them. Claude 3.7 Thinking is second: its tooling is excellent, though it has less context than Gemini. I'm currently using Claude to plan and Gemini to act; Claude gathers all the needed context with its great tooling, and Gemini just edits the code following all my rules.

2 Likes

Quick answer: use “Gemini 2.5”.

Gemini 2.5: It currently throws some communication errors with the Cursor server, but nothing that will leave you without responses for more than a few minutes. I recommend using a general rules prompt so it becomes more concise in its answers, as I find it quite verbose in that regard. I’m impressed by the model’s level of questioning — it doesn’t just do everything you ask like a mindless slave. Overall, I’d use it in contexts where you expect the conversation to be long.
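For anyone curious what a "general rules prompt" like this might look like: in Cursor you can put it in a project rules file under `.cursor/rules/`. The file name and the exact wording below are my own illustration, not an official template; adjust to taste.

```
---
description: Keep answers concise
alwaysApply: true
---

- Answer as briefly as possible; do not restate the question.
- Show only the changed code, not entire files.
- Skip summaries of what you just did unless I ask for one.
```

With a rule like this applied to every request, the verbosity drops noticeably without hurting the quality of the model's questions.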

Sonnet 3.7: I don’t use it. I saw people complaining that it acted like a “proactive idiot,” and to be an efficient AI it would need a lot of refinement in terms of context rules and prompts. I tested it a few times, and it only gave me absurd answers about my code. For those reasons, I didn’t even bother using it much.

Grok 3: I haven’t used it yet. Since it has a small context window and my project is large, I haven’t given it a chance.

GPT-4.1: I’ve used it a bit. If you’re looking for short answers, I’d recommend it — just use it with caution. It might be good for validating small ideas. I haven’t tested it in long conversations.

Take this into account: maybe I got these results because I already have specific project rules defined in my Cursor setup — that might have limited how the other models (besides “Gemini 2.5”) responded.

2 Likes

As someone who actively used Claude 3.7 Sonnet Max, I’ve completely stopped using it. Now, I only use GPT-4.1, because honestly, it performs even better than the MAX version.

Gemini still has ongoing tool issues, while GPT-4.1 is currently amazing at logical reasoning and code refactoring — it solves the required parts quickly without writing unnecessary lines of code. It’s also very effective at analyzing code.

With MAX, I often couldn’t get the answers I needed, and it would result in massive and unnecessary code changes. Since switching to GPT-4.1 (especially using Agent mode), that problem disappeared. Sometimes it solves what’s needed in just 10 lines of code, whereas Claude would bloat it into 200–300 lines for no reason.

So, my honest recommendation: use GPT-4.1 in Agent mode — it’s currently the best. It’s extremely useful for visual analysis, problem detection, writing minimal code, and handling complex algorithmic tasks.

GPT-4.1 is the BEST!

5 Likes

Do you guys run out of your fast requests every month? And do you use API keys (does that become costly)? I don't want to blow a lot of money on the API; is there a strategy to save money?


I've personally thrown in my lot with Gemini 2.5 Pro when working in SAS (Claude and GPT are not that great at SAS), whereas Gemini 2.5 Pro can often one-shot or two-shot it.

1 Like

My personal thoughts on the models I’ve tested (Gemini 2.5, Claude 3.7, Grok 3):

Gemini 2.5:

  • Generally a very smart model, structures responses in a very logical way, will create a framework for a project before filling in the details, has an awesome 1M token context which can be crucial for large projects that aren’t particularly modular. Has some issues with how it interfaces with Cursor, but it’s a great general model, probably my current #1.

Claude 3.7:

  • It’s very well integrated with Cursor, probably has the fewest rough edges in terms of its interactions with the IDE. Does a very nice job of coming up with a clean UI on the first pass. Tends to create huge amounts of code, and has a really, really bad habit of making changes you didn’t ask for that often break multiple things. If you use a ton of rules then it can work for larger projects, but without appropriate rules/modes, beware. A clear #3 currently, which just shows how quickly things are improving, as it was #1 not too long ago.

Grok 3:

  • It’s not well integrated with Cursor, sometimes stopping before it makes any progress, and sometimes giving you repeated responses. It also currently doesn’t have a reasoning mode in Cursor (even though within Twitter/Grok’s app it will sometimes think for minutes and many paragraphs before giving you an answer), and Cursor only gives it a 60k-token context window. Despite these shortcomings, it’s a very smart model, comparable to Gemini IMO. It tends to be minimally disruptive with the code changes it makes unless you specify otherwise, and it has actually solved multiple bugs in my work that neither Gemini nor Claude could fix. I’d probably rate it #2, just behind Gemini, among the models I’ve tested.

mhmtakecnnn’s response makes me want to test GPT-4.1 though, it sounds like it’s also a good option.

3 Likes

You saved my sanity with this comment, thank you!
GPT-4.1 is a dream to work with, like having an adult in the room, not the manic lunatic that is Sonnet. The fact that 4.1 voluntarily tells me what it wants to do before it does it, and regularly asks me for more information or to make decisions instead of zooming off causing chaos, is a breath of fresh air. And it's genuinely more supportive than Sonnet, which I have learned to hate with a passion. I just did 9 hours with 4.1; there were some tough problems, but at the end of it I felt good, I learned more, I solved more issues, we made progress.

2 Likes

I find Claude 3.7 Sonnet in “thinking” mode really excellent, especially with certain rules.
Gemini 2.5 has always given me problems, but I haven’t tried it for 5–6 days.

2 Likes
  • G2.5 is very smart, but sometimes it will sleep.
  • S3.7 is also very smart, but it is too proactive and may even do things that you haven’t assigned.
  • G4.1 is not so smart, but it is very obedient.
  • O3 is very smart, but it has a bad memory.
2 Likes