No self-directed problem-solving iteration with GPT-5?

I’ve been using Sonnet for about 4-5 months. It has very good integration with Cursor, which I guess is to be expected. I am quite used to its ability to iterate on a problem with self-directed effort: resolving issues, retrying, and resolving more issues until the whole problem is solved.

I am noticing that GPT-5 does NOT do this. It will try something, recommend some actions to the user, and stop. Then the user has to instruct GPT-5 to take those actions itself, which it MAY do, and then it stops again. The user has to be constantly prompting GPT-5.

I have had a number of rogue unit and e2e test issues today, and trying to use GPT-5 to resolve them has been almost like trying to resolve the issues manually myself.

I’ve had other distractions, but I finally sat down to just iterate on this, and I realized I was still using GPT-5. I switched to Sonnet, and it’s still cranking away, solving problems entirely on its own, iterating through every issue in my test suites, without me having to interact with it at all. It’s been cranking for a while now…10-15 minutes?

Is there a reason GPT-5 is not capable of this kind of self-directed problem solving? Is it a design decision, an oversight, or…is it just that the integration between Cursor and GPT-5 just hasn’t reached the level that Cursor and Claude have?


Neither Sonnet nor Opus does any of what you said; in fact, they leave tasks unresolved until you tell them to validate against the specification. This is the norm every single time.

GPT-5, when clearly instructed using documents, has a much better tendency to broadly review and finish.

This may come down to prompting habits and supporting evidence (documents, tasks).

Claude Opus and Sonnet are an embarrassment at the moment; even GPT-5 mini outperforms them in breadth of investigative scope and intelligence.

I think you are reading too deeply into a single term. I am not stating that Sonnet itself has self-directed capabilities. What I AM saying is that the combination of Sonnet plus the Cursor Agent is RADICALLY more self-directed in its ability to keep iterating on a single problem, from a single prompt, than GPT-5.

GPT-5 with the Cursor Agent tries once; if it doesn’t solve the problem, it spits out a set of instructions and expects the user to take over. This is very annoying.

Sonnet (or Opus) will ITERATE on the problem, without user interaction. I would share the chat where Sonnet solved a whole series of testing issues, but it would be too long to share here: it just kept going and going, iterating over issues until they were solved, then moving on to the next, without me interacting with any of it until it was done. It actually left two issues unresolved, which it stated were due to:

  1. DOM instability
  2. Timing issues

Two problems it probably couldn’t CONCRETELY resolve given the nature of the tests.

I think the “self-directedness” comes from the combination of the agent AND the model, not directly from the model itself. Hence why I asked in my OP: Is this a design decision, an oversight, or IS IT JUST THAT THE INTEGRATION BETWEEN CURSOR AND GPT-5 JUST HASN’T REACHED THE LEVEL THAT CURSOR AND CLAUDE HAVE?

Because I know what I’ve experienced. For some reason, when it comes to complicated issues that require repetitive trial and error, iterated over and over to resolve, Sonnet NAILS it, and GPT-5 farts out some manual instructions after a half-hearted attempt fraught with a little bit of hallucination… :man_shrugging:

Further, that is HARDLY my experience. GPT-5 is not revolutionary in any respect. It’s been largely evolutionary, with incremental benefits over Claude’s offerings in most cases. GPT-5 has the capability of being more surgical and explicitly targeted, which has its uses for sure. And it is a good model; it’s capable.

However, it is not running circles around Claude’s offerings by any means. Incremental. That’s the term that comes to mind with GPT-5. I don’t think prompting is the key, really, as I’ve run the exact same prompts against both to see which performs better. Which model wins is largely task dependent. There is no clear, significant outlier in all respects for the most part. GPT-5 seems to have a deeper understanding of software architecture and design, but again…that provides incremental benefits.

I have thus far NOT seen any kind of revolutionary improvement with GPT-5. Certainly not across the board. The main area where I’ve found GPT-5 notably better than Claude is working in existing Next.js codebases, especially with complex UIs and components. GPT-5 is just more surgical…it can handle targeted refinements better than Sonnet. However, when it comes to backend work, I’ve found I often have to flip back and forth between the models depending on the exact nature of the task, as there just is no fundamentally clear winner between the two.

The issue is that Anthropic has regressed their model capability because they’re dealing with capacity constraints. I am on the $100 Claude Code plan, and I have noticed considerable downgrade after week 1 (which is now nearly 4 weeks ago), when their server issues started happening. What is now their Opus 4.1 performs like Sonnet 4 did a while ago.

Anthropic has serious problems to deal with, and I feel ripped off: their service quality has drastically reduced, and in terms of intelligence its capability is nowhere near GPT-5.

I never said GPT-5 was a revolution. I’m stating that OpenAI has much more capability than Anthropic and that Anthropic is in deep trouble. I will not be using Claude any longer, because this is not the first time they rip off their customers with model degradation and then do NOT tell their customers about it.

I have not noticed any degradation in the capabilities of Claude’s current models. The model is the model; they aren’t dynamically scalable. This is why some providers offer DIFFERENT models for different scales.

I don’t know why you are experiencing what you are experiencing. My experiences are quite different. Further, I am talking about Cursor, not Claude Code; I can’t speak to Claude Code’s capabilities or changes in them. For Cursor, the agent integration with Claude’s models seems far superior to the integration of Cursor with GPT-5. The latter is capable, but in terms of what the Agent can do with it, it seems nowhere near as self-sufficient as Claude.

:man_shrugging:
