Just started using GPT-4.1 — curious what you think: is GPT-4.1 actually better than Claude 3.7?


2 Likes

It depends, tbh! If you’re going all-out, I think 3.7 is better in many cases. But for regular code, I feel safer using 4.1; I’ve been testing it for two days now. 4.1 is pretty good, and I especially like that it sticks to the rules!

Right now, 4.1 is the most competitive coding model when you consider the benefit-to-cost ratio.

When Cursor sets its official pricing, I think it’ll be around 3 requests per 1 fast-request credit, the same as o3-mini. That’s a huge benefit for Pro users, since they only get 500 credits.
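If that ratio pans out (pure speculation until Cursor publishes pricing), the math for Pro users looks like this:

```python
# Back-of-the-envelope only; the 3-per-credit ratio is speculation, not Cursor's pricing.
pro_fast_credits = 500
requests_per_credit = 3  # rumored, same ratio as o3-mini

print(pro_fast_credits * requests_per_credit)  # 1500 GPT-4.1 requests per cycle
```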

In conclusion, 4.1 is a game changer.

3 Likes

Here’s the thing.

What the API calls 4.1 is what you’ve had in ChatGPT since January. Anyone paying attention will have noticed how much better the results are in the chat app than via the API. In fact, when the other models were spinning in circles, I’ve often taken their output to ChatGPT as a kind of third party (not the in-app GPT, which was borderline awful).

With the release of “4.1”, you have the o4-turbo engine that everyone has loved since January available via the API.

While it’s not as magical as Claude 3.7 for ideation, I’ve found it’s much better at a few things:

  • Asking before making major changes
  • Explaining what it’s about to do before doing it
  • Having conversations about options without going back to manic coding
  • Integrating the advice of other models that you query (for analysis, not code)

For thoughtful programmers, the 4.1-via-API setup is IDEAL.

4 Likes

Yes, GPT-4.1 currently solves problems even better than Claude 3.7 Sonnet MAX. Most importantly, its ability to analyze code and take action is nearly 10 times faster. Based on my experience, it’s by far the best right now.

2 Likes

yep - 3.7 was chasing its own tail many times, plus losing all its context out of nowhere. 4.1 seems way more stable to me.

2 Likes

How do you use 4.1 in Cursor? Do you mean you use it separately?

Settings → choose 4.1 → Agent: select 4.1, like with any other model?

I have a datapoint comparing GPT-4.1 to Claude 3.7 Sonnet: Claude won by a lot. Claude’s work was perfect on the first try, while 4.1 needed a few debugging iterations and broke another file in the process, though it was able to fix that regression in one pass when shown the problem. Below is the prompt so you can get a feel for the task:

“From looking at lstm_torch.py write a new script inference_torch.py which generates a MIDI file. Command-line options are (1) an optional MIDI input file (which gets extended by the model), (2) the model weights file to use (an optional .pt file, which defaults to lstm_torch.pt), and (3) the number of MIDI events to generate (default 200).”
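For context, here’s a minimal sketch of the CLI skeleton that prompt describes; the argument parsing matches the prompt’s three options, while the generation body is a placeholder, not what Claude or GPT-4.1 actually wrote:

```python
# Sketch of the inference_torch.py CLI per the prompt above; the generation
# logic is a placeholder, not either model's actual output.
import argparse

def main():
    parser = argparse.ArgumentParser(
        description="Generate a MIDI file with the LSTM from lstm_torch.py")
    parser.add_argument("midi_in", nargs="?",
                        help="optional MIDI input file for the model to extend")
    parser.add_argument("--weights", default="lstm_torch.pt",
                        help="model weights (.pt file)")
    parser.add_argument("--events", type=int, default=200,
                        help="number of MIDI events to generate")
    args = parser.parse_args()
    # Placeholder: load weights via torch.load(args.weights), seed the LSTM
    # with args.midi_in if given, then sample args.events events into a MIDI file.
    ...

if __name__ == "__main__":
    main()
```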

2 Likes

A better question is how does it compare to Gemini Pro 2.5, o4-mini, and o3 (now that they’re released)? :thinking:

My results are mixed.

Gemini 2.5 is great but terrible at following rules. It makes assumptions all the time, even if you tell it 200 times not to. It’s great for prototyping and for solving problems that don’t involve moving files around, refactoring, and the like.

Can we play with temperature? Is it set too high?
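As far as I know, Cursor doesn’t expose a temperature setting, but if you call the model through the API directly you can experiment with it yourself. A minimal sketch with the OpenAI Python SDK (the 0.2 value is just an example):

```python
# Only applies to direct API calls; Cursor doesn't expose this knob.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4.1",
    temperature=0.2,  # lower = more deterministic, fewer creative "assumptions"
    messages=[{"role": "user",
               "content": "Refactor this function without changing behavior: ..."}],
)
print(resp.choices[0].message.content)
```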

DeepSeek (R1 and V3) offers some useful insights, but it can only handle simple tasks without the use of tools.

Claude 3.7 is more controllable than Gemini and better at following orders. It makes fewer assumptions and excels at refactoring and moving code around. It also excels at user interface design.

o4-mini: I’ve tried it a few times. It is very good at following orders but was painfully slow each time; it’s great for simple tasks.

GPT-4.1 - I haven’t used it yet.

I am pretty sure most of these problems will go away in 2025.

The current side project I’m using Cursor for is a chess application, and handling that much complexity was unthinkable a few months ago. So it’s crazy to whine when our productivity has been boosted 100x.

1 Like

Today I spent the day working with GPT-4.1.

Although the code I was working on wasn’t that complicated, working with 4.1 was very satisfying: refactoring with very few errors, blazing fast.

If this is repeated with much more complicated code, then this will be my go-to model, though it seems too good to be true that it would stay this consistent at much higher complexity.

The good thing is that it follows orders.

1 Like

Each model has its own advantages and disadvantages.

I have used GPT-4.1 hundreds of times and found that it is very obedient to commands.

But it also has some disadvantages. If you don’t give it very detailed instructions on what to do, how to do it, and what to look at first versus later, it will keep asking you. As an agent, it’s not as good as Claude.

In fact, I was impressed by the accuracy of 4.1. This model is much more careful with commands and code. It performs better on more complex requests, and it finds and edits the relevant code more accurately. It adheres to the rules and behaves according to instructions.

3.7 Sonnet, by contrast, performs poorly on complex requests; it often gets stuck in loops when reading files and hunting for code. Sonnet does perform better in UI design, but 4.1 is more accurate on complex tasks, and it produces problem-free results far more often.

The strength of 4.1 is that it asks the user questions to achieve a better result, and it asks for a final opinion before performing the task. If you provide more details, it will undoubtedly perform excellently; giving 4.1 more detail pays off much more than it does with 3.7. Many times, even when I give complete details to 3.7, I don’t get the desired result. In many tasks 4.1 even performs better than thinking models, and I’ve noticed it’s much stronger at coming up with new ideas.

GPT-4.1 is working awesome for me. If you follow the OpenAI GPT-4.1 Prompting Best Practices and the GPT-4.1 Prompting Guide, you will get consistently excellent results.
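For anyone curious, here’s a rough sketch of the explicit, structured style that guide encourages (the prompt wording and the parse_config() task are my own examples, not text from the guide):

```python
# Hedged illustration of the guide's advice (explicit rules, clear structure,
# persistence reminders); the prompt wording here is mine, not OpenAI's.
from openai import OpenAI

SYSTEM_PROMPT = """\
# Role
You are a careful coding assistant working in an existing repository.

# Rules
- Ask before making major or destructive changes.
- Explain your plan before editing any file.
- Keep going until the task is fully resolved; do not stop at a partial fix.

# Output format
Return only the changed files, each in its own fenced code block.
"""

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Add input validation to parse_config()."},
    ],
)
print(resp.choices[0].message.content)
```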

1 Like

gpt-4.1 is great. I experience much less incoherence. I just refactored a lot of code using it, as fast as I would have if I were using a frontier model like o3, as I did yesterday.

4.1 is included in the price, and not only that, the flow is more stable than claude-3.7-sonnet.

“Famous last words” :smiley:

So it seems that there is a need for custom rules per model too.

I can confirm that for pure backend work, for example in my Next.js stack, GPT did a better job than Claude when I needed to refactor several routes.
But for the frontend, Claude is much better.