I am mad as hell. I was working on my code with Claude 4 Sonnet, and everything was going well. I blew past my quota in a week and was into usage time. It is what it is.
But then, almost like throwing a switch, things started going sideways. Code was being generated that was absolutely substandard. Context was being lost. Instructions were being ignored. False paths were being traversed (bad ideas, bad patterns, bad architecture), and when corrected, they were tried again a few requests later.
After a decent amount of wrestling with this, I got frustrated with the model and asked which model it was. It told me that it was Claude 3.5! I realize Claude 4 is trained on 3.5, but I specifically asked it again and pointed out that Cursor said it was calling Claude 4. It assured me that it was 3.5! It has assured me of this many times. See the screenshots.
This is being reported as Claude 4 Sonnet, but it’s clearly Claude 3.5
And trust me - I used 3.5 for quite some time before they released Claude 4 Sonnet, and I can tell the difference. I showed the code to an associate, and he, too, said, “no way this is Claude 4.”
Now maybe this is a fallback and you’re just not telling us, but you’re showing Claude 4 usage for code that is not only insufficient, but is WASTING MY TIME and BREAKING MY PROJECT.
This is just NOT ACCEPTABLE.
Financial aspects aside, this has to stop. I’d rather be late on my deliverable than waste my time and then have to spend the same amount tomorrow to clean this garbage up.
This is still happening! I have racked up a LOT of charges for “Claude 4” that are being served by 3.5, and my code is SUFFERING as a result. I’ve spent the whole day today wrestling with substandard code generation from standard prompts I have that work BEAUTIFULLY with Claude 4.
If this is genuinely happening and you basically know you’re getting Claude 3.5, then I would just leave Cursor. Or post a bunch of proof comparing responses straight from Anthropic and from Cursor, to help others understand what is going on. I wouldn’t put it past Cursor; who knows what is really going on. The people on Ultra should definitely know what they are getting, since they are paying much more than the Pro people.
As an ambassador, I’ve been closely following the discussion around the perceived performance of “Claude 4 Sonnet.” To better understand the issue, I ran a few tests and made an interesting discovery that could explain the community’s confusion.
When asked directly, the model labeled “Claude 4” sometimes identifies itself as “most likely Claude 3.5 Sonnet.” It seems to be aware of a discrepancy between its system prompt and its internal knowledge base.
This isn’t an accusation, but an important finding to share with the team. This kind of labeling discrepancy can understandably lead to incorrect expectations and frustration when performance varies.
My queries have never had the “most likely” disclaimer. I asked directly and was insistent, and even suggested that perhaps it thought it was 3.5 because it was trained starting from 3.5’s data set.
It assured me that this was taken into account, and is not the case. It assured me that Anthropic encodes the model name and version number, and that it was most certainly 3.5, and its data set had no knowledge of a version 4 (which one would expect).
And someone keeps marking this thread as solved, but it is not, and I remove that mark. Awfully annoying, that.
Hi @dogberry, about once a month someone sees an AI model hallucinate and uses that as evidence that Cursor is doing something bad behind the scenes.
This has been debunked by many team members at Cursor, and we do directly call the model you selected. On many occasions I have also personally asked team members to check a user’s request, and it was indeed the same model they selected.
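To illustrate why a model's self-report is weak evidence, here is a minimal sketch. The dict below only mimics the general shape of an Anthropic Messages API response, with invented values; the key point is that the `model` field is metadata stamped by the API for the model that was actually invoked, while the text under `content` is generated output, which can be a hallucination, so the two can disagree.

```python
# Hypothetical sketch: response shape loosely follows the Anthropic
# Messages API; all values here are invented for illustration.

def served_model(response: dict) -> str:
    """The model identifier from response metadata -- set by the API itself."""
    return response["model"]

def self_report(response: dict) -> str:
    """Whatever the model *says* about itself -- generated text, unreliable."""
    return response["content"][0]["text"]

example = {
    "id": "msg_example",
    "model": "claude-sonnet-4-20250514",
    "content": [{"type": "text", "text": "I am Claude 3.5 Sonnet."}],
}

# Only the metadata reflects which model was actually called;
# the self-description is just tokens the model generated.
print(served_model(example))  # -> claude-sonnet-4-20250514
print(self_report(example))   # -> I am Claude 3.5 Sonnet.
```

In other words, the two fields disagreeing is not proof of a swap; it is the expected behavior of a model asked a question its training data cannot answer.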
There is zero incentive for us to give you a much older model like Sonnet 3.5 and present it as Sonnet 4. I can assure you the model you selected is the one we pass to the AI provider’s API.
The only exception is Auto, which is routed to different providers and models based on availability, load, and the task the user assigns.