The Best AI Models to Use (and Why Claude Isn’t One of Them)

Context

I’m not entirely sure why everyone is so obsessed with Claude. Sure, it’s solid as an SWE, but it’s not the smartest model out there. And if you take a look at the benchmarks, the “Thinking” version isn’t that far ahead of the regular one. Certainly not twice as good.

From what I understand, based on reading its “Thinking” outputs, that mode acts more like a TODO list: it lets the model work a bit more efficiently, that’s all. Also, Claude has actually been the most expensive model for over a month now. Even on the old plan, it used twice as many requests after Anthropic’s discount promo ended. Did no one else notice this? It’s written right there in the UI! (well… it was, before yesterday)


o3: Smarter than Claude, cheaper than Claude-4-Thinking (4-Tnk), and it makes good use of tools. But yeah, it’s still pricey.

Auto: I don’t know how it works for you, but for me it’s always GPT-4.1 now. Great for simple tasks — super fast and very capable as an Agent. It’s as free as Gemini 2.5 Flash, but faster and smarter.

o4-mini: Not as good as o3, but has similar Agent capabilities. It’s weaker at handling context (in my case, it once removed important functionality when it was only supposed to fix a bug and keep the rest of the logic intact).

Gemini 2.5 Pro: Better than everything else. It did have major issues with edit_tool, but that seems mostly fixed now; it still fumbles occasionally, but rarely. For the tasks I give it, it consistently outperforms o3. And it’s cheaper than o3 (calculated in AI Studio using my Subscription Usage Summary) and much cheaper than Claude-4-Thinking (the Cursor team even acknowledged this in a public apology: 225 vs. 550).


So which models should we use?

  • If you’re on the Ultra plan, there’s really no reason to use anything but Gemini 2.5 Pro Max. Maybe try o3-Pro or Opus, but personally I don’t even touch them; I’m afraid they’ll burn a hole in my pocket. If Gemini can’t handle a task, switch to o3 or Claude to get a different take on the problem. But more likely, the issue is either poor prompt engineering, poor context engineering, or the task is simply too complex for current LLMs.
  • If you’re on Pro or Pro+ and willing to pay a little extra to get the most out of Cursor, I’d recommend sticking with Auto until it starts acting dumb; then switch to o4-mini; and if both fail, go with Gemini 2.5 Pro, even though it’s about 3× more expensive (see the sketch of this fallback order right after the list). Also, I highly recommend using Gemini for key architectural decisions or when planning major refactors.
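
To make that fallback order concrete, here’s a minimal sketch in Python. It’s purely illustrative: pick_model, the 1-5 “complexity” scale, and the model-name strings are my own shorthand, not any real Cursor API.

    # Purely illustrative sketch of the fallback order above; not a real
    # Cursor API. The function name, the 1-5 "complexity" scale, and the
    # model-name strings are all assumptions made for this example.
    def pick_model(complexity: int, architectural: bool = False) -> str:
        if architectural:
            return "gemini-2.5-pro"  # key design decisions and major refactors
        if complexity <= 2:
            return "auto"            # GPT-4.1: fast and cheap for simple tasks
        if complexity <= 4:
            return "o4-mini"         # middle step when Auto starts to struggle
        return "gemini-2.5-pro"      # ~3x pricier, but strongest of the three

    # A routine bug fix stays on Auto; a big refactor goes straight to Gemini.
    assert pick_model(1) == "auto"
    assert pick_model(5) == "gemini-2.5-pro"
    assert pick_model(2, architectural=True) == "gemini-2.5-pro"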

P.S. I don’t have a long professional track record to prove my word is trustworthy, but over the past couple of months I’ve built a few projects in Cursor that you can judge for yourself. I also have four more in private repos at the moment.


It’s surprising that with over 200k lines, you only needed 52 tabs.


I only make occasional edits; over 99% of the code in my current projects is written by AI. And as you might notice, only about 20% of the code actually gets accepted. The statistics also likely don’t include canceled lines.

I kept putting off the proper training to become a full-fledged developer, but thanks to the “Year of Agents,” the prompt engineering I started learning even before ChatGPT was released has turned out to be useful.


Thanks, I will try Gemini 2.5 Pro instead of Claude 4 :thinking:

Say bye to Claude 4; it’s so expensive now.



Actually, for a long time after Claude 3.7 came out, it wasn’t very exciting compared to the very enthusiastic 3.5 era. There was a period when 3.5 was eager to do things you hadn’t even asked for, and it was very comfortable to use back then: even though it easily introduced project errors, it gave you less to worry about. Right now only 4.0 Thinking behaves that way; 3.7 and 3.5 are unsatisfactory. If I have enough credits, I’ll definitely choose 4.0. I can submit issues to 3.7/3.5 multiple times to get a fix (and save money), but 4.0 currently has the best user experience.


On the contrary, I don’t like that Claude is overly proactive. During the “unlimited” period, I only used it when I had a task that truly required it to be a bloodhound.

TL;DR

OP prefers:

  • Auto (GPT-4.1) for daily use (cheap & fast)
  • Gemini 2.5 Pro for serious/complex tasks
  • o3/Claude only for alternative takes or when the others fail

Gemini 2.5 Pro is the one OP considers the best overall.


I decided to have Claude 4 write a changelog, since 4-Tnk was the one doing the task. Now I understand what Claude fans go through; I won’t do it anymore XD

UPD: And after reading the changelog, I saw that 4-Tnk had also demolished some useful functionality. :man_facepalming:

I don’t think you’re the right person to judge what is good and what isn’t. By your own admission, 99% of your code is AI-generated, which means you don’t understand 99% of your program.

When you start navigating codebases with thousands of files and millions of lines of code, the models you suggested completely collapse: they start hallucinating, get stuck in thought loops, etc. (FYI, this codebase is written by engineers with zero AI-generated content.)

My advice to you: stop trying to speak from a position of authority when you have little to no experience in the domain.

What’s been demolished? This post makes zero sense. If you’re referring to context, it gets compacted.

You don’t read compiled bytecode either, do you? I’m simply working at a higher level of abstraction. That doesn’t mean I don’t understand the program’s logic and data flow.

They can manage to mess things up even within a 40k context :grin:
And I’m already working on one possible solution to the problem.

This post is a recommendation based on the value-to-cost ratio I get from using neural networks in my work under Cursor’s current pricing policy.

I asked for a fix to the handling of a single variable. It went ahead and changed the handling of three, two of which were critical. In a 600-line script.