Thinking models...are they actually better? Or just wasteful?

NOTE: LONG POST!

I first started using Cursor around four or five months ago. The first time you use it, it's kind of magical, seeing what it can do and how easy it seems to be to prompt the agent into creating complicated things. I started with creating Angular and Next.js web apps, and with Next.js in particular, it's still pretty amazing how the agent can tackle very complex prompts and do a wonderful job.

Over time, and not all that much time it seemed, the novelty wore off and I settled into the job. While what Cursor, its agent, and the LLM can do is still amazing, it's not quite as amazing as that initial impression. Especially with backend code, where architecture and design matter a lot, and when you have very specific needs, and particularly once your codebase gets past certain thresholds (even if it's all agent-generated code), the challenge of keeping the agent and LLM on the rails and on task, without them going “scatterbrained” and bulldozing your codebase all the time, definitely increases.

To combat the challenge, I started operating in a particular mode: Plan, Research & Refine, Act, Complete, Terminate. Or PRACT, for short. :wink:

Planning has become a much more significant part of the process, as has refining the plan, getting it added to my ticketing system (Linear) via MCP, and often writing out a plan .md document for local reference and use (there's a rough sketch of one below). Plans are usually multi-phase, or multi-task under a story in Linear.

Acting is generally just me guiding the agent through a plan, either by working through a Linear epic and its child stories, or by working through a multi-phase plan one (or maybe a couple of) steps at a time. I may involve multiple chat tabs while acting: one actively working through a phase or two, another where I'm preparing to act on the next, and a third often open for research or review tasks related to the work being done, or other sideband things (reviewing someone else's work, for example).

Completion often comes iteratively as well. After I act, I'll “complete” by tackling the mundane stuff like fixing linter errors and formatting, and running, verifying, and fixing tests. Then I'll commit before moving on to the next “Act” step for the next phase of the planned work.

Termination is where I clean things up and get ready for the next big body of work: closing out chat tabs, terminals, files, etc., so I start from a clean slate and junk from the last body of work isn't hanging around confusing me or interfering with the next one.
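For illustration, here's a skeleton of one of those local plan .md docs. The story, phases, ticket number, and names below are invented for the example, not from a real project:

```md
# Story: Add audit logging to the orders service (LIN-123)

## Phase 1 - Research & Refine
- Review the existing logging setup and find where order mutations happen
- Refine the approach, then update the Linear story via MCP

## Phase 2 - Act
- Add an AuditLog entity plus migration
- Emit audit entries from the order create/update/cancel handlers

## Phase 3 - Complete
- Fix linter/formatting issues, run and fix tests
- Commit, update Linear, then Terminate: close tabs, terminals, and files before the next story
```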

OK! So, I now have this process. It's becoming fairly ingrained, as keeping the agent and LLM on the rails and on task, without meandering or bulldozing all over my codebase, is a more challenging job when you actually have a decent-sized codebase. Anyone who's done agentic coding for a while should know: you can go from zero to quite a lot, rather quickly!

A couple of weeks ago (maybe not quite), I was sitting watching Claude 4 Sonnet “think” for rather long periods of time before it would actually do any work, then think some more, then do some more work, and so on. I was poking around Cursor's docs pages and came across a statement that “thinking models cost twice as many requests,” or something to that effect… which suddenly gave me REAL PAUSE!

I have just had my first REAL month (and a half, almost) of really HEAVY agent-centric work. I do EVERYTHING in the agent now. I burned through all the requests Cursor Pro offered, then started burning pay-as-you-go tokens, then upgraded to Cursor Pro+ when I discovered it was an option. I burned through ALL of that in what seemed like less time than Pro, and ended up racking up well over $100 in additional paygo costs. Then I discovered that I was effectively getting, for all intents and purposes, HALF the number of requests I thought I was, because all my work was using a thinking model! I ended up upgrading to the Cursor Ultra plan last week.

So I started wondering if I should switch to Claude 4 Sonnet NON-thinking. I figured, why have the model “think” when I am already doing a lot of thinking? Further, I'm a heck of a lot smarter than an LLM. The only advantage an LLM actually has is knowledge. Why would I rely on an LLM to “think” (an act it isn't really capable of; it's really just a feedback loop) when I have over 30 years of programming experience and 27 years of in-industry career experience?

So I switched to the non-thinking model. Whoa! Suddenly it seemed like things were happening so much faster. I'd write a prompt, and bam, I'd get an answer right off the bat. I might write a more complex prompt that required some grepping, other investigative file work, or trolling through git history, but it just HAPPENED, right off the bat. Beyond the initial “Generating…” lag (which I believe is the agent itself preparing the full prompt plus system prompt to send to the LLM, or things of that nature), when the model was actually invoked, it just responded. Boom. Bam. Work done! Next?

Another thing about the non-thinking C4 Sonnet model… it did NOT seem to do a worse job than the thinking model. On the contrary, it just seemed to do what I asked, with less deviation or meandering from the specified task than the thinking model. Given that I plan out in a fair amount of detail what I need done, having the model just do what I asked and get it done immediately was really nice!

I've been working this way with non-thinking models for about, maybe not quite, two weeks now. Well, at least I was, until the last two days and GPT-5. Once I started playing around with GPT-5 (which was an unpleasant experience for other reasons; I'm honestly not sure whether to blame the model or Cursor 1.4… they both seemed to coincide, and something seems… not quite right at the moment), the whole “thinking” process came back. GPT-5, when it acts, is a bit faster than C4 Sonnet. However, all the “thinking” periods counteract that, and I once again feel like I am being throttled by thinking. I am not really a fan of that anymore. The novelty of seeing your model “think” has COMPLETELY worn off, and it now just feels like wasted time and lag.

GPT-5 is “free” right now (or supposed to be; I've read a number of reports here that it seems to be billing some people), so the “thinking models cost twice as many requests” factor shouldn't actually be taking effect… BUT it is still, as far as I know, a factor! You pay one “request” for a thought cycle and then another “request” to get the actual output. At least, that's what the documentation seemed to be saying. I don't like that much at all! Especially since all the “thinking” doesn't really seem to enhance the results. At least, the difference between C4 Sonnet Thinking and C4 Sonnet seemed to be that the thinking model meandered more, and would go off task or not quite do EXACTLY what I needed more often than the non-thinking model.

So I'm now in a conundrum here. Is a thinking model actually, truly… better? Perhaps it has to do with how you work? My PRACT(ical) approach involves me doing the thinking, researching, planning, and refining of the plan of action I want the agent and model to enact. Perhaps that has negated the benefits of a thinking model?

I am curious whether any of you other heavy Cursor Agent users out there do anything similar to my self-directed planning process, and what your experiences with thinking vs. non-thinking models might be. It looks like ALL GPT-5 models are “thinking” models. I am not sure what I think of GPT-5 yet (maybe the issues are just Cursor's initial integration with it), but… it does give me pause.

My current experience is that the thinking models meander and bulldoze more than the non-thinking ones, they are clearly twice as expensive (you spend twice as many requests using one!), and the “thought” processes slow things down, which, when you are already doing your own planning, can really drag the whole process out. I think it would be sad if all new models were ONLY thinking models, as I find they may not be as effective or as fast as a non-thinking model, at least under certain use cases or circumstances…

1 Like

I've been working with the gpt-5-fast (thinking) model consistently since I first posted this a couple of hours ago. The thinking processes of the model are DEFINITELY slowing me down. I'm waiting a lot more than I was with C4 Sonnet. The LLM itself seems a bit faster than Sonnet, but with all the extra thinking steps, the overall process is definitely slower, and that is definitely frustrating.

I was looking at the OpenAI pages for the new GPT-5 models. It does not appear that “thinking tokens” are actually required by the model, so I am curious why Cursor only offers thinking models for GPT-5. It looks like EVERY GPT-5 model only has a thinking version, even mini and nano!

I would like to see how GPT-5 performs without any thinking tokens involved. Aside from the fact that Cursor consumes two requests for every request sent to a thinking model, and that it seems to meander more (or even go wildly off track… that was the case all day yesterday), the overall process of using one seems enough slower than a non-thinking model that I question the value of thinking models. I don't like spending double the requests just to sit and wait while the darn model simu-cogitates through thought processes more rudimentary than the ones I already went through myself when I crafted my prompts.

I like having the model just directly respond to my prompt, and respond immediately. I can't say that I've noticed all the thinking cycles with Claude 4 Sonnet actually producing better results than just asking the model directly and getting results right away…

1 Like

I don’t know what your use case is, which programming language you use, or what kind of things you do.
But there are things that just aren't realistic with the regular models, especially when you get to heavy operations, serious data gathering, programming languages that were less present in the training data, and so on.

1 Like

Does the thinking actually help in those areas?

My experience so far indicates that the thinking processes don't actually seem to provide much benefit. I primarily work in web technologies, but full stack (i.e., Node.js or NestJS on the backend; JavaScript and TypeScript mostly; maybe Go; possibly C#/.NET; etc.). So I guess it could be that these are WELL-modeled technologies and languages, which may be why I'm experiencing what I'm experiencing.

I could see less well-modeled things benefiting from some thinking, though.

For the majority of tasks, non-thinking models are more than sufficient and cost much less, which is why we made non-thinking Sonnet the default model of the two.

There are some tricky tasks, programming languages, etc., where thinking and even the most advanced models are necessary.

4 Likes

Will you guys be introducing non-thinking GPT-5 models at some point?

It's likely that all API-available models will eventually be available, though some are not yet in the API or not yet reviewed.

1 Like

I tested in C#: if the model doesn't think, it just gives me code that doesn't compile. GPT-5 High was the first time a model actually managed to write good C# code.
I think this is a good example for anyone who doesn't understand why thinking is necessary.

1 Like

You are telling me that NO model, not even Claude 4 Sonnet, can write compilable C# code? I've had the Claude web agent write C# code that compiled just fine in the past… This truly baffles me. C# is a very structured language with much stricter rules than TypeScript (which I use far more often these days). I started writing C# code in the late 1990s (before the language was even released as v1.0)… I have a hard time believing that no model except GPT-5 High can handle C# today…

I have been using Sonnet 4 for over a month writing only C#.

A thinking option for Auto mode, please!

1 Like

Absolutely. My boss is willing to pay a good amount for the model's cost, but Anthropic didn't get the job done. When you want a simple feature, Anthropic is nice too, but with complex things, like Blazor development and especially developing an add-in for Word/Office, no model had succeeded until now, not even Opus.
Since last Thursday, this is the first time I've had a model that does it without errors. Even o3 couldn't do it.

1 Like

Happy for you, but for which C# framework?

I am mostly using it for Unity game development, but I have also done some .NET applications. Maybe it is not good with certain frameworks, but it is certainly fine with C# as a programming language.

@MidnightOak We always need to remember that when you need a simple function, all models are good.
It's when you want something where the model has to read five files, find some distant variable, check how it works with the database, find some code online, and plan an entire implementation, or sometimes follow a function that calls another function, and so on through about ten files that all need to be considered, and then write a new feature based on all of that which actually works well, that we start measuring which model is successful and which is not.

What does not measure a model:
When you're writing a brand-new program, all models are good.

What does measure a model:
Changing an existing thing.

That's why online benchmarks are worthless: almost all models are good at creating a new React component in a single file.
But ask the models to take an existing React component and increase its font size, and oops, they get stuck in a ridiculous way.

All I am saying is that other models, like Sonnet 4, can obviously program in C#. I am not vouching for how complex a task it can handle. And I have had Sonnet 4 go through dozens of files looking for variables and methods and finding bugs with no problem.

To me, a good model (and this is something I've never had in C# until now) means more than just writing code.

I have a huge project that’s been developed over ten years — a massive piece of software.
The client sends me a bug or a feature request.
I myself don’t remember where the code is.
I send Cursor/GPT-5 everything the client wrote, including the screenshots they sent.
It identifies what the client is talking about, finds which file it is out of thousands, makes the required change, and then the code actually compiles and the client is happy.

I review the code — it matches the existing style and looks good to me.
I’ve never had that before.
Not even close.
Of course, that’s also related to Cursor.

2 Likes

Sounds like GPT-5 is good, especially for big tasks like that.

Thinking actually provides one huge benefit (I am not being sarcastic here at all): if the model is thinking a lot (especially if we're talking about Claude 4) or discussing a lot with itself, there are two options: either it is far outside the current context, or you would do your code a favor with some refactoring and figuring out what is wrong. I can elaborate more, of course, but I even use local LLMs to see whether they can do mundane things like writing unit tests, and if I see they can't, it's not them, it's me: time to fix things, and then test the fix by asking them to do the same thing again.

LLMs are not magic, and it becomes more and more important to understand how software engineering works as your codebase grows; otherwise you are setting yourself up for some surprises.

Hmm. I wonder if there may just be a lack of training material for things like Word/Office integration. Code like that is more often kept in private repositories, and I wonder how many of the more arcane use cases simply don't provide enough model training opportunities.

I am curious: do you use the documentation indexing and context features of Cursor? Any time I'm working with something less well modeled, and even when things are fairly well modeled, I will look for, index, and reference the necessary docs in my prompts. I wonder if that might help you.

Early on, I had a lot of trouble getting things implemented correctly. For example, ShadCN, which we are using in all of our web apps right now, is not well understood inherently by any model as far as I've seen. So the agent didn't do a good job, and it would often take a lot of effort to get the LLM to generate the right code.

Then I found the documentation feature of Cursor. I indexed all the ShadCN documentation, then the Tailwind documentation, then some of the tertiary docs. I referenced all that I felt was necessary in the next few prompts, and Cursor was able to resolve ALL the issues AND implement my original requests with ease.

Context IS KING here. If you are not supplying the agent and LLM with the right context, then yes, they can struggle. Docs is a critical feature, and one of the reasons I've stuck with Cursor despite some real struggles over the last week. Claude Code, for example, does not support indexing docs the way Cursor does. I've become so reliant on this feature now, because it REALLY DOES help a TON, that I can't just switch to Claude Code.

I would be willing to bet that you either haven't indexed enough documentation, or not quite the right documentation, or haven't referenced it properly in your prompts, to alleviate the challenges you are facing with C# code. If you haven't already, I'd look into that feature, pack in as much documentation for what you are doing as you can, and reference it often. Unless you are somehow context-limited, docs never hurt, even if a model is otherwise well trained on a given technology.
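To make that concrete, here's roughly what one of my docs-referencing prompts looks like once the docs are indexed in Cursor. The doc names, file path, and component below are made up for the example:

```
Using @ShadCN Docs and @Tailwind Docs as the source of truth:

Phase 2 of plan.md:
- Replace the hand-rolled dropdown in src/components/UserMenu.tsx with the
  ShadCN DropdownMenu component.
- Keep the existing Tailwind spacing and typography utilities intact.
- Do not modify any other components in this pass.
```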

1 Like