Thinking models...are they actually better? Or just wasteful?

A poor quality codebase definitely imposes a fair number of drawbacks and impediments, no question. A couple weeks ago I went through some fairly hefty refactorings of the codebases I am working on now. The code went from ok (mostly big monolithic files) to very good (not excellent yet, but we are a startup and have to move VERY fast here). I created rules to guide the process and enforce principles, policies, patterns, practices, etc., and things were well refactored. The code properly follows SoC & SRP now, is a bit more loosely coupled, etc. That has indeed helped.

The problem is that a thinking model thinks regardless, and those steps take extra preparation time (the agent has to create the prompt for the thinking steps each time, so you get extra, and often longer, "Generating…" or, I think it now says, "Preparing next steps" stages), which is on top of the thinking process itself. Even if the thinking isn't all that long (Sonnet usually has 5-8s thinking sessions, but it can have many of them throughout a single request; GPT has much longer thinking sessions front-loaded), each one that occurs takes more time, and the whole process just seems to generally lag.

When I use a non-thinking model, both the agent prep step and getting the result back from the LLM just happen. There isn't any lag; the model just does its thing normally and responds. I don't think I have a NEED for the thinking, but even on what I think are very mundane tasks, right now using gpt-5, the thinking sessions are often 10-15 seconds long each, sometimes longer than that… which confuses me. I mean, I'm doing things like updating a specific endpoint of a specific controller with a specific requirement. Spending 20-30 seconds waiting for thinking to complete for such a mundane task is very confusing and, I'd warrant, unnecessary. (I should actually flip to Sonnet non-thinking for a bit here and try a similar task to see how long it takes… I suspect it would be pretty much instantaneous.)

Anyway. I have a reasonably well architected and designed set of codebases right now. It's all loaded into Cursor from a single root folder I'm using as a workspace, where all my Cursor rules and MCP config are. So everything can be cross-referenced across repositories. I make EXTENSIVE use of documentation (@Docs). I am careful with the context I attach, and usually will @reference in the prompt as appropriate. I usually have a pre-defined plan that I'm working with (I used to have .md files with multi-phase plan descriptions; today I'm primarily using Linear stories that I planned out yesterday), so there is hardly a lack of information about exactly what I need done, how, dependencies, related code, etc.
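
For anyone curious, the workspace layout I'm describing looks roughly like this (a sketch only; the folder names are illustrative, not my actual repos, and the `.cursor/rules/*.mdc` and `.cursor/mcp.json` locations are just the current Cursor conventions as far as I know):

```text
workspace-root/
  .cursor/
    mcp.json             # MCP server config (e.g. Linear), shared across all repos
    rules/
      architecture.mdc   # SoC/SRP, layering, and pattern rules to enforce
      planning.mdc       # how the agent should follow plans/stories and scope
  service-api/           # repo 1 (illustrative name)
  web-app/               # repo 2 (illustrative name)
  shared-lib/            # repo 3 (illustrative name)
```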

Even the Linear stories are created by me prompting the agent, with it using the MCP to generate and manage the stories in Linear. So the detail level of the stories is plenty sufficient, since the agent has access to all the relevant context, cross-referenced across all the various projects involved, including references to any docs, etc.

I don’t think I have a problem with a poor quality codebase or anything like that. In fact, I think the problem is, I have already done all the thinking…and I really just need the model to ACT.

Hmm. Another quick example. I asked the agent (+gpt-5-fast) to check a specific story I'm working on and identify any discrepancies with the work that was actually implemented. I like to document differences between the original story and the work actually done, as it's often useful in the future (especially if someone else joins the project and takes over a certain part of the code!)

Very simple prompt, roughly: Examine story, identify discrepancies with actual implemented code, create comment on story documenting the differences (but only if there ARE differences.)

That’s it, not much.

Sitting here, probably 40-50 seconds in, and this thing is still thinking. Still thinking… Why? I mean, this is literally NOT rocket science. Heck, it's not science at all. It's a basic diff and writing some human-readable text content.

I feel this should have been done in 30 seconds or less. It's now finally getting the issue from Linear… now it's creating the comment… ok, now it's finally done. Over a minute and a half for that… The process of getting the issue and adding the comment took a small amount of time, but that's expected given they're MCP calls to an API. MOST of the time was spent thinking, which was over a minute.

:melting_face:

Thanks for the detailed explanation — it's very important. It's a good idea to include additional documentation beyond the code. When I have such content available, I do it, but from now on I'll make an even greater effort to remember this option.
Indeed, when I worked with different libraries, I made sure to download all their source code into the project folder so the agent could easily see how to work with them — and it truly was amazing.

Ah, thank you for your decision to make Sonnet non-thinking the default. That's how I accidentally learned that thinking mode provides less than I expected.

Too much of a text wall, so just an opinion:

  • gpt5 is basically o4 (not yet really "done")
  • o3 was good for planning, bug fixing, optimizations

Had GREAT results for fixing the performance logic of Sonnet-generated code!!

(originally coded with Claude Code (better results than Cursor)).

GPT-5 is free? Not as far as I know. What did you mean by that?

GPT-5 is free in Cursor for the launch week (up to a generous fair use limit!)

As for the general topic, I do think a lot of users instinctively use thinking models when they deem their task "difficult", but in the majority of cases they would've gotten a quicker and equally effective solution without thinking.

Additionally, for Pro users and those who want to get the most usage for their money, using a thinking model can cause the token usage to pile up very quickly, causing 2-3x or more usage per request, and therefore drastically reducing how far the usage stretches.

GPT-5 seems to be a somewhat divisive model across the community, but it is still a strong coder and is cheaper in API pricing than Claude, so it is not an unfeasible "default" for users to move to, with the option as always to switch to a different model if they aren't getting the responses they want.

Hey, do you think even high-reasoning GPT in Cursor is cheaper than Claude 4 sonnet?

Most, if not all, benchmarks show the superiority of reasoning and hybrid-reasoning models over classic ones. Most likely, your tasks are too simple and/or you are very good at prompt engineering.

I guess the thinking mode allows Claude to be more purposeful, rather than actually smarter. If we compare classic Claude and reasoning Gemini, Gemini builds a better code architecture.

I wrote myself a task manager in .NET in May. The only problem was the GUI, or rather trying to get the Agent to do exactly the design I wanted. I mostly used Gemini Pro Preview.

It depends on the model and on knowing what to ask and how to ask; o4-mini was excellent for GUI.

I wanted to make a (non-standard?) two-layer GUI that includes circular elements. And all the models that I tried broke the interface and did not do what I wanted.

That's a specific thing; it might have helped to split it into parts.
A circular component has always been a difficult problem; none of the frameworks I know, neither web frameworks nor regular C# frameworks, truly support a circular component, only rectangular ones.
Making a circular component basically comes down to various techniques that, behind the scenes, are still dealing with a rectangular component.
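
To make that concrete, here's a minimal sketch of one such technique in WinForms, just to illustrate the idea: the control is still a rectangle underneath, and we simply clip its `Region` to the inscribed ellipse so painting and click hit-testing follow the circle.

```csharp
using System;
using System.Drawing;
using System.Drawing.Drawing2D;
using System.Windows.Forms;

// Sketch only: a "circular" button that is really a rectangular control
// whose Region is clipped to an ellipse inscribed in its bounds.
public class CircularButton : Button
{
    protected override void OnResize(EventArgs e)
    {
        base.OnResize(e);
        using (var path = new GraphicsPath())
        {
            path.AddEllipse(0, 0, ClientSize.Width, ClientSize.Height);
            // Region affects both painting and mouse hit-testing;
            // clicks outside the ellipse fall through to the parent.
            Region = new Region(path);
        }
    }
}
```

Layout, anchoring, and focus still operate on the bounding rectangle, though, which is why these workarounds always stay a bit fiddly.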

What are these tests and benchmarks about though? Coding? Or other things? Scientific research? Medical research?

My tasks are quite often far from simple, but as I said, I spend time planning, and then the plan itself becomes my prompt. So perhaps it is prompt engineering. Planning has been key, IMO, to controlling and corralling the non-deterministic tendencies that seem to cause models to… "hallucinate", or as I like to describe it, meander (change bits of code here and there that have nothing to do with the prompt) or bulldoze (change massive amounts of code to fit the model's "own opinionated nature" when the changes have nothing to do with the prompt). I have experienced both, and my solution to these was to create plans, usually multi-phase plans for the whole body of work, then work through the plan one phase at a time by effectively using each phase as a prompt.

As for more purposeful… they certainly seem to simulate purposefulness; however, as a thinking model "thinks", if you watch what it is doing, you can see it often "ponders" incorrect paths of thought, and if you don't stop it there, such a model will often meander or bulldoze when it actually acts. I don't have that problem with non-thinking models…

So I would be very curious to know, what are the actual test cases for the benchmarks that show the superiority of reasoning type models? Are they primarily implementing code in those benchmarks? Or are they for other endeavors?

It may well be my prompt engineering that is affecting my results here, though. I don't know for sure. I call it planning, as the prompt is really "Implement phase X of the plan at @file|ticket ###". I guess it is probably more than that as well, though, as I have a reasonable number of rules now, growing day by day, that are also included and help manage how the agent and model follow plans, and what architectural and design paradigms I want them to follow, etc.

I have also been questioning the value of thinking versions, and getting a similar result from non-thinking versions of the same model. My conclusion is that it's me who has to do more "thinking" about whether the task at hand requires a thinking model version or not, and then choose the correct model version.

After significant GPT-5 usage the past 4 days (110,835,409 tokens) I switched back to "auto", and I can clearly recognize a gpt-5 response in auto now. It's good, especially at deep dives into bug fixing, but has its limitations, especially if the chat thread gets too long.

That being said, GPT-5 is a hell of a lot closer to my expectations of what a good LLM coder is supposed to be. I have gotten a lot of milestones checked off in the past few days for sure with gpt-5.

@jrista I use a similar workflow based on plan, refine, execute, review, complete (not a good acronym), but I also use the thinking/sequential-thinking MCP tools along with vibe check and a few Cursor rules to remind it to use the tools and stick to a one-task-at-a-time workflow, etc. I use an overall spec document, a task list document, and a task-specific refined "guidelines" document, and I get fantastic results on non-thinking: single-shot great code, working tests, etc.

When you introduce thinking, it grinds to a halt, often goes backwards on a task and breaks stuff, does things not on spec or asked for, and it often completely ignores the workflow, ignores the specs/guidelines, makes arbitrary decisions that are worse, and there is general scope drift, especially the dreaded "the goal is complex so I am going to decide on a new goal, which is to stub it and then claim it's done". The thinking models actually amplify all of its bad behaviour (cutting corners, misinterpreting or "badly reinterpreting because it knows better", skipping instructions).

I think it depends on your preferences. As someone who has been a developer for 20+ years and has a deep understanding of my particular subject matter, I can be very prescriptive and accurate about giving good upfront guidance. When this is the case, the thinking model is a hindrance, if not noticeably worse than doing it manually. For scenarios where one might have less experience or be less invested in spending the time upfront to break the task down properly, the thinking model will do some of that for you and give you better results. But both approaches at the same time clash heavily and work against each other.

For myself, I get way better results spending more time speccing up the task, building the task lists and such upfront and then executing than I do relying on the thinking model for figuring out what to do.

Before I delve into responding more deeply: you mention "thinking/sequential thinking MCP tools"… I've never heard of that, but you've piqued my curiosity! Can you supply an example?

These are the ones I use, but there might be other similar ones:
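
For reference, registering one of these in Cursor is just an `mcp.json` entry; here's a rough sketch, assuming the commonly published `@modelcontextprotocol/server-sequential-thinking` package (swap in whichever server you actually pick):

```json
{
  "mcpServers": {
    "sequential-thinking": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-sequential-thinking"]
    }
  }
}
```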

This is great prompt engineering. I usually just use Markdown markup and bullet lists. If the task is too large, I ask the Agent to create a ROADMAP.md; I edit it and the Agent follows the roadmap.
Also I use my Agent Compass.

I started out with the markdown and bulleted lists. But the agent actually started moving beyond that on its own (originally with claude-4-sonnet), and I kept letting it, then started including little nudges to get it to try making each plan better and better. Once I started putting everything into Linear with the MCP, the agent then, again on its own, started doing even more.

I am using gpt-5-fast right now, but the stories it creates automatically have additional sections added to them: a Scope section, which lays out which projects should be involved, which should NOT be involved, and possible caveats to the scope rules; and Verification sections that detail out, effectively, "acceptance criteria" for testing requirements. It will sometimes even drop a Mermaid diagram into them, etc.
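
As a rough illustration of the shape these stories end up with (section names and the example content are illustrative, not an exact template):

```markdown
## Summary
Add cursor-based pagination to the /orders endpoint (example task).

## Scope
- Involved: service-api
- NOT involved: web-app, shared-lib
- Caveat: if shared DTOs need changes, split that into a follow-up story.

## Verification
- Unit tests cover page-size limits and invalid cursor tokens.
- Integration test: paging through a seeded dataset returns stable ordering.
- All existing test suites still pass.
```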

So the planning has become very highly refined. A lot of the time now, once I have my epic and task stories, I simply need to tell the agent "Move story XYZ into in-progress, assign it to me, and implement it." Usually, the ENTIRE body of work will be done in a single request for that prompt.

However, I have started doing some follow-up checks to make sure that the implementation fully complies with the story requirements. I do find that there are little discrepancies here and there. Often with testing, so I need to figure out how to get the agent to write the Verification sections of the stories with more precision there, but there are also sometimes other discrepancies between the story and the implementation. Sometimes it just seems like little oversights (non-determinism of the model?); other times, there are good reasons (i.e. some other groundwork would have to first be applied to another project before every single requirement of this story COULD be completed). In those cases, I instruct the agent to add additional stories and associate them as necessary to track the new requirements, etc.

It's been an ongoing process though, working with the agent to build out plans for exactly how to implement, and then having it actually implement each story. Having stories, though, did seem to open up a new level of detail, and now it's quite often just a matter of instructing the agent to implement story X, and the entire implementation happens all at once, every aspect of it.

Once an entire epic is fully implemented, I then go through a "Termination" phase to pretty much verify everything really does work (i.e. run the full suite of tests, etc.), close out all my working files and agent tabs, and verify nothing is uncommitted that should be, before I start the next epic. (A lot of this is to make sure I don't end up with any rogue context that accidentally gets attached, or somehow use the wrong agent tab to do work that ends up with unrelated context from earlier in the chat, etc.)

So specificity and context have really begun to aggregate in the stories, and then the agent can just do the work and get it done quickly, wholly and completely, and I'm blazing through tasks here. (Only thing is, GPT-5 only has thinking models… I really want to get some non-thinking versions of them and see if I can get through everything even FASTER!)