Gemini 2.5 vs Sonnet 3.7 vs Grok 3 vs GPT-4.1 vs GPT-o3

Can someone please tell me which model is the best for CURSOR?

I want to stop overthinking that I am not taking full advantage of Cursor with the models I am currently using, and just be sure I am using the absolute best.

Let’s try ending the discussion with one definite model, and that’s it!

In short: Gemini 2.5 Pro. It currently has tooling issues (roughly 2 in 10 requests fail), and the Cursor team is actively working with the Gemini team to solve them. Claude 3.7 Thinking is second: its tooling is excellent, though it has less context than Gemini. I'm currently using Claude to plan and Gemini to act; Claude gathers all the needed context with its great tooling, and Gemini just edits the code following all my rules.

2 Likes

Quick answer: use “Gemini 2.5”.

Gemini 2.5: It currently throws some communication errors with the Cursor server, but nothing that will leave you without responses for more than a few minutes. I recommend using a general rules prompt so it becomes more concise in its answers, as I find it quite verbose in that regard. I’m impressed by the model’s level of questioning — it doesn’t just do everything you ask like a mindless slave. Overall, I’d use it in contexts where you expect the conversation to be long.
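For anyone curious what a "general rules prompt" like this might look like: in Cursor you can put it in a project rules file under `.cursor/rules/`. The file name and the exact wording below are my own illustration, not an official template; adjust to taste.

```
---
description: Keep answers concise
alwaysApply: true
---

- Answer as briefly as possible; do not restate the question.
- Show only the changed code, not entire files.
- Skip summaries of what you just did unless I ask for one.
```

With a rule like this applied to every request, the verbosity drops noticeably without hurting the quality of the model's questions.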

Sonnet 3.7: I don’t use it. I saw people complaining that it acted like a “proactive idiot,” and to be an efficient AI it would need a lot of refinement in terms of context rules and prompts. I tested it a few times, and it only gave me absurd answers about my code. For those reasons, I didn’t even bother using it much.

Grok 3: I haven’t used it yet. Since it has a small context window and my project is large, I haven’t given it a chance.

GPT-4.1: I’ve used it a bit. If you’re looking for short answers, I’d recommend it — just use it with caution. It might be good for validating small ideas. I haven’t tested it in long conversations.

Take this into account: maybe I got these results because I already have specific project rules defined in my Cursor setup — that might have limited how the other models (besides “Gemini 2.5”) responded.

2 Likes

As someone who actively used Claude 3.7 Sonnet Max, I’ve completely stopped using it. Now, I only use GPT-4.1, because honestly, it performs even better than the MAX version.

Gemini still has ongoing tool issues, while GPT-4.1 is currently amazing at logical reasoning and code refactoring — it solves the required parts quickly without writing unnecessary lines of code. It’s also very effective at analyzing code.

With MAX, I often couldn’t get the answers I needed, and it would result in massive and unnecessary code changes. Since switching to GPT-4.1 (especially using Agent mode), that problem disappeared. Sometimes it solves what’s needed in just 10 lines of code, whereas Claude would bloat it into 200–300 lines for no reason.

So, my honest recommendation: use GPT-4.1 in Agent mode — it’s currently the best. It’s extremely useful for visual analysis, problem detection, writing minimal code, and handling complex algorithmic tasks.

GPT-4.1 is the BEST!

5 Likes

Do you guys run out of your fast requests every month? And do you use API keys (does that become costly)? I don't want to blow a lot of money on the API; is there a strategy to save money?


I've personally thrown in my lot with Gemini 2.5 Pro when working in SAS (Claude and GPT are not that great at SAS), whereas Gemini 2.5 Pro can often one-shot or two-shot it.

1 Like

My personal thoughts on the models I’ve tested (Gemini 2.5, Claude 3.7, Grok 3):

Gemini 2.5:

  • Generally a very smart model, structures responses in a very logical way, will create a framework for a project before filling in the details, has an awesome 1M token context which can be crucial for large projects that aren’t particularly modular. Has some issues with how it interfaces with Cursor, but it’s a great general model, probably my current #1.

Claude 3.7:

  • It’s very well integrated with Cursor, probably has the fewest rough edges in terms of its interactions with the IDE. Does a very nice job of coming up with a clean UI on the first pass. Tends to create huge amounts of code, and has a really, really bad habit of making changes you didn’t ask for that often break multiple things. If you use a ton of rules then it can work for larger projects, but without appropriate rules/modes, beware. A clear #3 currently, which just shows how quickly things are improving, as it was #1 not too long ago.

Grok 3:

  • It’s not well integrated with Cursor, sometimes stopping before it makes any progress, and sometimes giving you repeated responses. It also currently doesn’t have a reasoning mode in Cursor (even though within Twitter/Grok’s app it will sometimes think for minutes and many paragraphs before giving you an answer), and Cursor only gives it a 60k-token context window. Despite these shortcomings, it’s a very smart model, comparable to Gemini IMO. It tends to be minimally disruptive with the code changes it makes unless you specify otherwise, and it has actually solved multiple bugs in my work that neither Gemini nor Claude could fix. I’d probably rate it #2, just behind Gemini, among the models I’ve tested.

mhmtakecnnn’s response makes me want to test GPT-4.1 though, it sounds like it’s also a good option.

3 Likes

You saved my sanity with this comment, thank you!
GPT-4.1 is a dream to work with, like having an adult in the room, not the manic lunatic that is Sonnet. The fact that 4.1 voluntarily tells me what it wants to do before it does it, and regularly asks me for more information or to make decisions instead of zooming off causing chaos, is a breath of fresh air. And it's genuinely more supportive than Sonnet, which I have learned to hate with a passion. I just did 9 hours with 4.1; there were some tough problems, but at the end of it I felt good, I learned more, I solved more issues, we made progress.

2 Likes

I find Claude 3.7 Sonnet in “thinking” mode really excellent, especially with certain rules.
Gemini 2.5 has always given me problems, but I haven’t tried it for 5–6 days.

2 Likes
  • G2.5 is very smart, but sometimes it will sleep.
  • S3.7 is also very smart, but it is too proactive and may even do things that you haven’t assigned.
  • G4.1 is not so smart, but it is very obedient.
  • O3 is very smart, but it has a bad memory.
2 Likes