GPT 4.5 & Claude 3.7 Cost/Benefit in Cursor

Hey all,

I want to touch on something important and dig a bit deeper into the details. Let’s start with a $2 response.

I mean, $4.

Well geez, my mistake. Why was I asking 4.5 to help structure a complex logical setup with OKLCH color and style swapping in Tailwind? Yes, that’s right: because I knew it would trigger a $4 opportunity for me to complain about Cursor’s current handling of GPT-4.5.
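For the curious, the shape of the task was roughly this: point Tailwind utilities at CSS variables holding OKLCH channel values, so swapping a theme just means redefining the variables. A minimal sketch with hypothetical token names, not my actual config:

```ts
// tailwind.config.ts (minimal sketch, hypothetical token names)
import type { Config } from "tailwindcss";

export default {
  theme: {
    extend: {
      colors: {
        // Utilities read OKLCH channels from CSS variables, so a theme
        // swap is just redefining --brand / --accent under a
        // [data-theme] selector. <alpha-value> lets Tailwind's opacity
        // modifiers (e.g. bg-brand/50) keep working.
        brand: "oklch(var(--brand) / <alpha-value>)",
        accent: "oklch(var(--accent) / <alpha-value>)",
      },
    },
  },
} satisfies Config;
```

The variables themselves live in plain CSS, e.g. `--brand: 0.72 0.11 178;` on `:root` and a different triple under `[data-theme="dark"]`.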

First, let me segue over to my latest response to the Claude 3.7 row happening around Cursor on Reddit:

3.7 is a comprehensive thinker and Cursor is a miser when it comes to context management. Every single one of my problems this week has been due to Cursor cutting off context to 3.7, which results in 3.7 making faulty assumptions about an API version it would have recalled on its own under normal circumstances.

Cursor’s internal LLM just isn’t great at summarizing what is important.

All the proof you need exists in two places:

  1. Click the LLM summary when running a commit message
  2. Click the “summarize and start new chat” button below the composer

In both cases you’ll see how shockingly over-compressed the briefing is.

Your 3.7 session’s context is running through a version of that same engine, so it’s trying to make comprehensive changes from a lobotomized memory.

This is going to be a problem. The new models use expensive “thinking” tokens and manage their internal context very carefully to arrive at the best answer. Cursor, in order to make a profit (a normal motivation for a business), manages context so aggressively that what we end up with is a genius-level AI with the short-term memory of a goldfish. I can’t even get the AI to stay on Style Dictionary v4 for more than 5-6 responses without manually reminding it, because even when I add documents to “context” in Cursor, the AI doesn’t see them after the initial post.
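To make that drift concrete, here’s the kind of correction I keep issuing. A minimal sketch from memory, not my real build script; the point is the v3-shaped code the model regresses to versus the v4 API I keep asking for:

```ts
// build-tokens.ts (hypothetical minimal example)

// The v3 pattern the model keeps regressing to:
//   const StyleDictionary = require("style-dictionary");
//   StyleDictionary.extend(config).buildAllPlatforms();

// Style Dictionary v4: ESM import, class instantiation, async build.
import StyleDictionary from "style-dictionary";

const sd = new StyleDictionary({
  source: ["tokens/**/*.json"],
  platforms: {
    css: {
      transformGroup: "css",
      buildPath: "build/css/",
      files: [{ destination: "variables.css", format: "css/variables" }],
    },
  },
});

await sd.buildAllPlatforms();
```

Five or six responses in, the `require`/`extend` version reappears unless I manually re-supply the v4 docs.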

The result is that these comprehensive thinking models are capable of making deep, impactful code changes with very little insight into anything beyond the last 5 minutes of exchange, and, unlike a classic IDE, there’s no broad understanding of the codebase at play with a focus on logical and type safety.

Segue back to GPT-4.5 now.

v4.5 is an incredibly slick programmer. It has a strong understanding of syntax, context, and purpose. As always, it’s less creative than Claude and can take prompts extremely literally, so it’s not great for exploratory code, but it’s a wizard at production once you have a handle on how you want something written. v4.5, it turns out, is a power user’s dream model… but it costs “200 cents” per run in Cursor.

Fine.

Effectiveness isn’t always measured in speed; it’s measured in a system’s ability to do things correctly. With 4.5 at the helm, I know I’m going to save myself a TON of debugging time.

Except that in Cursor we have two major things working against the new model:

  1. Limited context means the model can’t manage its own context, which works against the grain of its purpose and training (everything it gets is effectively a first-time prompt with a summary attached, which is hard to think deeply about)
  2. The inability to think deeply over an ongoing context means GPT-4.5 has an extremely low bias for action in Cursor (ever told v4 or o3-mini to apply edits 4-6 times in a row, given up, and asked Claude to do it instead?). This means $2-$8 can be spent on GPT-4.5 with little more to show for it than what you saw in my opening examples.

So, @danperks & Cursor… I know I’m interpreting the model here, and I don’t actually have visibility into what you’re doing with the internal LLM… but all evidence points to the conclusion that managing a thinking model’s context the way you’ve managed the older models works against both your interests and ours. You’re going to need a new approach, even if it’s a more expensive one.

19 Likes

Is 4.5-preview not included in fast requests?

1 Like

That mirrors my experience with the 3.7-thinking model.

You would expect new models to offer an improved overall experience in Cursor. Yet after only a few responses, the system deviates from my initial instructions, making changes it shouldn’t. This makes it frustrating to use. It’s like taking two steps forward and one back…

If unlucky, it already fails right from the start, altering complex things it doesn’t understand (this might very well be due to a lack of context).

Hey, thanks for the detailed feedback here!

You are right that these newer, more creative models work differently from the traditional models that have been available in Cursor for a while.

We have a chunk of the team already working on “un-nerfing” Claude 3.7 so that Cursor allows it to work to its full potential. We will likely have more to announce on this very soon, with improvements coming to Cursor to align with this shift in behavior.

For now, Claude 3.7 is best for fast development in simpler projects, where the complexity is low and Claude 3.7 can move fast without breaking too much. For more precise edits, we still recommend Claude 3.5 as the best model for large and complex codebases.

Rest assured that we are working hard to unlock the full potential of Claude 3.7. Keep an eye here on the forums and on X for any upcoming changes we announce!

7 Likes

Hey Dan, we’re back to this:

[screenshot: another $2 charge]

where I get charged $2 for a non-agentic process.

This morning, 4.5 would have run the tokens script, evaluated the result, and worked for upwards of 5 minutes before hitting your 25-tool-call limit, and it likely would have finished before reaching it.