GPT-5 paralyzed by excessive thinking?

Where does the bug appear (feature/product)?

Cursor IDE

Describe the Bug

Ok, I just had a rather rogue problem with GPT-5 that I think burnt up a whole bunch of tokens. This really doesn’t make me happy, quite the opposite actually. I had finished a bunch of work about 15 minutes ago, ran my test suite, and there were about 30 failing test cases (out of 1,500).

As is my normal course of action, I grabbed the error messages from the terminal, pasted them into a new agent chat, gave it a prompt with the project reference as context, since test errors were strewn throughout the project (some low level changes were made to data services and a few logic services), and let it rip. I then stepped away for about 15 minutes.

I just came back, and GPT-5 had not fixed ONE SINGLE TEST!! NOT ONE! The entire chat was a series of 23s–1m12s thinking cycles, separated by a few file reads. This went on, and on, and on, and on, and on…and NOT ONE IOTA of actual unit test fixing was done.

GPT-5’s thinking is WILDLY EXCESSIVE, totally wasteful, and seems to get it into trouble on relatively mundane issues like “In @project/ these tests are failing. Fix the tests, but be careful not to change any of the @Recent Changes as they are intentional.”

I reverted back to my prompt, switched to Claude 4 Sonnet, and the thing thought for a grand total of 21 seconds and fixed all the broken tests in about 2-3 minutes…

I…WAT!!!

I have run into this issue with GPT-5 more and more as time goes by. I am not sure if the model itself is being tweaked, or if there is something in the Cursor agent and how it uses the model that is causing this (I’ve upgraded several times since first starting to use GPT-5), but it’s really quite ridiculous. To have so much time spent thinking…I mean, I’d have to say half or more of that ~15m period was the model “thinking”, with little spurts of reading files in between? So 8, 10, 12 minutes of thinking?? To do NOTHING?

Whenever I switch back to claude-4-sonnet (thinking and non-thinking), the experience is so different. This model, and its integration with the Cursor agent, is…I guess I would have to say very refined. It just works. It doesn’t fuss or muss about anything. It isn’t super picky about how you craft your prompt…even if your prompt isn’t highly accurate, the model seems to get what needs to be done and does the job well.

I honestly don’t know if that is just a model thing? Perhaps this is purely a GPT-5 problem and there is nothing that the Cursor team can do to fix it. In that case, I guess I just have to deal. I am trying to figure out GPT-5 just in case anything happens with Anthropic here, but boy, it has not been an easy journey. GPT-5 is no Sonnet/Opus replacement, IMO.

If this is a matter of refining the Cursor agent and how it uses GPT-5, though, then I feel some effort does need to be made here. Because nearly 15 minutes of the agent and model chugging on a problem without one single edited character, burning tokens the entire time, is a very serious problem. That is not just waste at a personal level, using up my allotment of tokens from my plan on needless “thought” by the model; it wastes global resources as well, which is a growing problem with AI usage in general, and particularly with AI used for writing code. Numerous articles have covered the sheer amount of waste generated by vibe coding efforts and general agentic IDE usage overall.

Anyway, this was a particularly egregious case, in part, I guess, because I had to step away. Usually when I see GPT-5 going down a wasteful path I’m here to stop it; maybe they would all end up like this if I did not. In any case, I feel there is something very wrong here, it seems fundamental to the nature of the current agent→GPT-5 integration, and I am hoping that it’s just a matter of refining this integration. Because that’s the word that comes to mind every time I switch back to Sonnet: highly, optimally refined.

Steps to Reproduce

Not exactly sure how to replicate. It seems to be a somewhat arbitrary problem, but it occurs enough, and seems to be a deep enough problem, that I’m writing this.

Expected Behavior

For the agent to actually solve the problem at hand, without spending exorbitant amounts of time “thinking” (perhaps just introducing GPT-5 non-thinking options would solve this problem right off the bat, as it seems to be fundamentally related to thinking cycles with GPT-5.)

Operating System

MacOS

Current Cursor Version (Menu → About Cursor → Copy)

Version: 1.4.5 (Universal)
VSCode Version: 1.99.3
Commit: af58d92614edb1f72bdd756615d131bf8dfa5290
Date: 2025-08-13T02:08:56.371Z
Electron: 34.5.8
Chromium: 132.0.6834.210
Node.js: 20.19.1
V8: 13.2.152.41-electron.0
OS: Darwin arm64 24.5.0

Additional Information

I apologize. I was rather ticked off when I canceled the GPT-5 session, and straight up reverted to the prompt, chose Sonnet, and re-ran the prompt (which immediately started fixing the issues and was done just a few minutes later). I should have captured a screenshot of the issue, and I’m kicking myself now for not doing so, since I…well, DO NOT want to try and replicate it again, given how many tokens I think were just burned uselessly the first time around.

Does this stop you from using Cursor

No - Cursor works, but with this issue


I’ve stayed on Claude since switching for that test run. This has been a really wonderful experience. GPT-5 definitely feels primitive in comparison. I think two things need to be done to make GPT-5 a real contender for long-term, agent-heavy, daily use:

  1. Give us some non-thinking models! GPT-5 thinking is the most wasteful thing I have seen since I first started exploring LLMs about a year and a half ago.

  2. Refine the agent→model integration. GPT-5 seems to lack even the most rudimentary “common sense” more often than not. I suspect a lot of that has to do with how it’s driven by the agent.

Anyway….Sonnet is such a pleasant experience, in comparison.

Being a novice at vibe coding, I thought I was getting something wrong with GPT-5. I switched to it when it came around, having gotten dizzied by Sonnet’s heady enthusiasm. GPT-5 proved to be too reticent, not giving away much of what it was doing to my codebase. After a long while everything got tangled up. What used to work wasn’t working anymore. I switched back to Sonnet to clean up the workflow. I am now hesitant to try the new agent again, though I am still curious about what makes it a front-edge model.

I started using GPT-5 as soon as it was released. My initial experiences were quite terrible. I am not sure if there was just something going on with the model on initial release, but after a few days things settled a bit. I’ve had good and bad with GPT-5.

What I’ve learned is that it is a VERY different kind of LLM from Claude. OpenAI and Anthropic do not do things the same way. I’ve used both quite heavily now, on a wide range of tasks, some requiring thinking, most that just do not require thinking (especially if you have good rules).

What I’ve also learned is that Sonnet (and Opus) are rather forgiving models. They are not as strict in their expectations of prompts as GPT is. Sonnet is able to infer more from prompts, and even if your prompt is not perfect, so long as the key details are in place, nuances of human language generally do not trip it up. Combined with the depth of the integration between Cursor’s agent and Claude, the two make for a very effective, and also more self-sufficient, tool for implementing software.

GPT, on the other hand, is a VERY, VERY FUSSY model. It has lots of expectations, requirements, and demands that you MUST meet when writing your prompts. It seems to be swayed and unduly affected by nuances of human language usage that it, as a cold, emotionless, non-sentient computer algorithm, should be unfazed by. However, it is quite fazed by them. Nuances of human language…the softness, or harshness, or precision, or lack of precision, or aggression or lack thereof, your accuracy on varying levels, and a whole host of other nuances about exactly how you write your prompts…all have an effect on GPT-5 and how it reacts. GPT-5 even prefers that you use XML to structure your prompts! So not only do you need to write them at just the right level of human language neutrality to make sure it does not go flying off the handle…you can’t just write plain English either (not if you want the best results!)
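
For what it’s worth, here is a minimal sketch of the kind of XML-tagged, neutrally worded prompt I mean for the failing-test scenario above, written as a tiny Python snippet that does nothing but assemble the prompt string. The tag names and the sample test output are made up by me for illustration, not any official GPT-5 or Cursor format:

```python
# Hypothetical sketch of an XML-tagged, neutral-tone prompt for the failing-test
# scenario described earlier. The tag names (<context>, <task>, <constraints>,
# <test_output>) and the sample failures are invented for illustration only.
failing_tests = """\
FAILED tests/test_orders.py::test_totals - AssertionError
FAILED tests/test_users.py::test_lookup - KeyError: 'id'"""

prompt = f"""\
<context>
  Project: @project/
  Recent low-level changes to the data services and logic services are intentional.
</context>
<task>
  The unit tests below are failing. Update the tests so they pass.
</task>
<constraints>
  Do not modify any of the recent intentional changes.
  Keep edits limited to the test files.
</constraints>
<test_output>
{failing_tests}
</test_output>"""

print(prompt)
```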

There are thresholds, and if your specific wording for a prompt crosses one of them, it can send GPT-5 off in a WILDLY different direction. Too soft, and the model could totally misunderstand you and do something wildly wrong. Just a bit too direct and explicit, and it suddenly spends a minute at a time thinking and does a lot more research before it actually makes ANY kind of move. Too aggressive a “tone” but with just the right amount of detail, and it might go off on wild tangents despite the precision and accuracy of the INFORMATION in the prompt. And if you eventually get frustrated at the model’s FUSSINESS, and that frustration shows through in any way beyond “barely measurable”, the model will freak out and spend EXORBITANT amounts of time thinking, researching, planning, thinking more, planning more, reading files and checking git and…gone…it’s lost in the rabbit hole. (BTW: you WILL get frustrated. It’s part of the GPT design philosophy.)

The problem with GPT-5 seems to be finding EXACTLY the right way to do u, v, w, x, y AND z in every single prompt, to make sure that GPT-5 senses the right human language nuances, even when the actual INFORMATION and DIRECTIVES are all there in the prompt. This, IMHO, makes GPT-5 a very unpleasant model to work with.

I started out using Claude. I fiddled with ChatGPT a bit over the last year and a half or so, but I always went back to Claude. However Anthropic is designing their models, they are superior on many levels. I’ve never had trouble prompting Claude and getting a good answer. When I spend more time thinking about the INFORMATION and DIRECTIVES in my prompts, Claude gives me better answers, better results. GPT, though, and for that matter Gemini, and to a degree even Grok, are not the same. They seem to get tripped up by nuances of human language, which can make them miss the explicit information, queries, directives, and other details that ACTUALLY MATTER in your prompt, and change how they react and respond.

For an interactive model of human knowledge, being fussy about how it’s asked questions just makes the model hard to use and frustrating. IMO, OpenAI needs to figure this out. Sure, if you put in the effort, you can get better results…but that effort has to be invested every single time you craft a prompt.

If you are a vibe coder who does not have programming skill, I recommend two things:

  1. Develop some programming skill.
  2. Use Claude.

Thank me later.

Thanks for the heads up. I was laughing all the way through your post. I know my way around a few languages, certainly enough to be a manager of a coding AI. My decision now is to use GPT-5 for planning and thinking until I am comfortable handing over my stuff to it. Thanks again.

There are many times, after fighting with GPT, that I switch to Opus or Sonnet with a “Does this make sense to you, can you sort this ■■■■ out?”, and it comes back, summarizes what I actually want succinctly, and gets on with it. It definitely understands actual language far, far, far better.


Without question. The more I look into how to use GPT-5 effectively, the more it seems like it has its own expected language. It prefers XML structure, and it likes “neutral” language, because it will try to infer from how you use language (soft, hard, aggressive, frustrated, irritated, happy or not, etc.) and then change how it responds. Which, IMO, is a ridiculous way to build an LLM.

With Claude, how the language is used doesn’t matter as much. It matters a little, you can ALL CAPS something and that will provide some emphasis, but it works much more the way natural human language does.

Sure thing. I am hoping that the agent integration with GPT-5 improves; I just don’t know when that will happen, and until it does, it’s starting to seem it might be a bit more trouble than it’s worth, except in specific circumstances. It IS more surgical than Sonnet, and that is helpful when I need targeted edits. Sonnet, though, produces better code overall. I just used Sonnet throughout the last couple of days, and it’s been a pleasure compared to the GPT-5 stint. I ended up with some very nice, well-architected code that is well organized, follows SoC properly (i.e. domains don’t throw HTTP exceptions, stuff like that), and maintains SRP for all modules, classes, and functions. These are the first couple of days since GPT-5 came out that I’ve been really pleased with the results.
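
To make the SoC point concrete, here is a purely illustrative Python sketch of what “domains don’t throw HTTP exceptions” means in practice; the names and the status code mapping are invented for this example, not taken from my actual project:

```python
# Illustrative sketch only: the domain layer raises its own error types and
# knows nothing about HTTP; only the boundary layer maps them to status codes.

class InsufficientFundsError(Exception):
    """Domain-level error: no HTTP concepts here."""

def withdraw(balance: float, amount: float) -> float:
    """Pure business rule, no transport concerns."""
    if amount > balance:
        raise InsufficientFundsError(f"balance {balance} < requested {amount}")
    return balance - amount

# HTTP boundary: the only place where domain errors become status codes.
DOMAIN_ERROR_TO_STATUS = {InsufficientFundsError: 409}

def handle_withdraw_request(balance: float, amount: float) -> tuple[int, dict]:
    try:
        new_balance = withdraw(balance, amount)
    except tuple(DOMAIN_ERROR_TO_STATUS) as exc:
        return DOMAIN_ERROR_TO_STATUS[type(exc)], {"error": str(exc)}
    return 200, {"balance": new_balance}

print(handle_withdraw_request(100.0, 250.0))  # -> (409, {'error': '...'})
```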
