Has Anyone Noticed Model Output Quality Is Dropping as Usage-Based Billing Rises?

I’ve been using these tools daily for months, and I can’t ignore the pattern anymore. Since the shift toward usage-based billing, especially with newer models like Claude Sonnet 4 Max and similar “top-tier” options, the quality has noticeably declined in key areas:

What I’m Seeing:

  • More mistakes in logic and code that didn’t happen with earlier versions.
  • Bloated responses full of excessive comments or redundant lines.
  • Verbose or “safe” answers that sound smart but miss the core of the request.
  • Slower response times, despite the models being labeled “premium.”

Is This Intentional?

Under a token-based pricing model:

  • More words = more tokens = more money.
  • Mistakes mean re-runs or manual rewrites = more usage.
  • Auto mode picks higher-tier models = higher burn (1).
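To put rough numbers on the incentive above (the per-token price here is entirely made up for illustration, not Cursor's or any provider's actual rate):

```python
# Hypothetical back-of-the-envelope math: PRICE_PER_1K_TOKENS is a made-up rate,
# purely to show how padding and re-runs multiply the bill.
PRICE_PER_1K_TOKENS = 0.015  # assumed USD per 1,000 output tokens

def response_cost(output_tokens: int, reruns: int = 1) -> float:
    """Cost of one answer, counting re-runs caused by mistakes as extra usage."""
    return output_tokens / 1000 * PRICE_PER_1K_TOKENS * reruns

concise = response_cost(300)            # tight answer, accepted on the first try
padded = response_cost(900, reruns=2)   # verbose answer that needed one re-run

print(f"concise: ${concise:.4f}  padded: ${padded:.4f}  ratio: {padded / concise:.1f}x")
# concise: $0.0045  padded: $0.0270  ratio: 6.0x
```

Same task, six times the spend, which is exactly the incentive problem being raised here.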

So the question is fair:

Are these newer models optimized for accuracy or for token consumption?
It feels less like progress and more like a subtle nudge to use more = pay more.

We Need Answers:

  1. Is there internal data showing actual improvement in reasoning vs token cost?
  2. Why is output quality worse on models that are more expensive?
  3. Are there safeguards to prevent deliberate over-commenting or verbosity?

I’m not trying to stir drama; I just want clarity. If we’re being charged more for premium models, we deserve measurable improvements, not just rebranding and inflated output.

Would love to hear if others are seeing the same.


Reference:

1. Auto Mode Is "Free" – But Here's the Catch

Auto mode is free and doesn’t count against token usage. That’s been clarified.

But here’s the catch:

• Auto often picks lower-performing or slower models.
• When it does switch to a premium model, the output is longer and often less useful.
• The default experience still encourages usage of models that maximize token churn over efficiency.

So yes, Auto itself is “free,” but the broader shift in how models behave under this billing system raises real questions.

4 Likes

I see no difference in the quality of Sonnet 4. However, I am seeing insane token usage, and it doesn’t make any sense. I have spent thousands with Cursor, but I can’t spend $5k a month lol.

1 Like

Correct. Quality might feel the same, but token usage is insane now. Outputs are bloated, and costs stack up fast. It’s not sustainable, especially when the defaults push high-burn models. We need more control and transparency.

1 Like

Especially after the last update, it feels like the model is making more calls just to spend more tokens. I’ve been using it for a long time, and I always use Sonnet 4 Thinking; the difference is very obvious.

I can also feel the processing slowing down.

I think they are doing everything they can to get more money from people.

On top of that, it really feels lacking in the code it writes, in applying rules, and in many other things.

One or two weeks ago, the AI deleted a backend file that it had edited two commands earlier. It replied something like “we don’t use this folder” :slight_smile:

Something’s wrong. After finishing this project, I don’t plan to use Cursor anymore. It has started to make more sense to buy Claude Code for $200.

Sorry for my English; there might be some mistakes because I use a translator.

4 Likes

You’re spot on, and no worries, your message is clear. Honestly, does anyone from Cursor even read this? :joy: They act like we’re too dumb to notice slower outputs, bloated token usage, and broken logic. It feels like the whole system is rigged to squeeze maximum usage. You’re not alone.

I’ve often had the frustrating experience where the output was strangely unsatisfactory, even when I provided almost identical context and instructions for the same task. This usually happened when I had almost used up my quota.

1 Like

Exactly. It’s like the closer you get to burning your quota, the dumber the model gets. Feels less like coincidence and more like controlled throttling. You’re not imagining it, a lot of us are seeing the same pattern.

2 Likes

My usage-based token spend is through the roof compared to the included usage on the Ultra plan. I can’t imagine it’s supposed to work this way. One refactor prompt just cost me $3 (1,600 insertions, 900 deletions).

3 Likes

I have a Pro+ account that I’m going to cancel for more or less the reasons written in this thread: prices went through the roof while quality dropped drastically. I have another, company account that still has 500 requests with a 25-tool-calls-per-request limit, but it doesn’t really help when I’ve already spent five messages telling the model, again, not to do something. And even 500 is not really 500, considering some models are counted as multiple requests.
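To illustrate that last point (the 2x weight below is purely hypothetical, not any model’s real multiplier):

```python
# Hypothetical sketch of per-model request weighting; the weights are assumptions,
# not Cursor's actual billing table.
MONTHLY_QUOTA = 500  # included requests on the plan

model_request_weight = {
    "standard-model": 1.0,  # one prompt = one request
    "premium-model": 2.0,   # assumed: one prompt billed as two requests
}

def effective_prompts(quota: int, weight: float) -> int:
    """How many prompts the quota actually buys when each is billed at `weight` requests."""
    return int(quota / weight)

print(effective_prompts(MONTHLY_QUOTA, model_request_weight["premium-model"]))  # 250
```

So on paper you have 500 requests, but if you mostly live on a double-weighted model, it behaves like 250.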

So far I’m happy with Claude Code. It uses far fewer tokens, costs less, and even the same model that I use through Cursor somehow doesn’t go as crazy as it does in Cursor.

2 Likes

Generally speaking, it seems clear that they changed the system prompt about a day after Grok 4 was launched. It became noticeable not just because Grok 4 started working properly, but also because I saw a shift in how Gemini 2.5 Pro was behaving. And honestly, I was pretty happy with these changes. All this time, my personal user rules have stayed the same.

Regarding your point about the output becoming more verbose — my experience has been the complete opposite. If you look at how my Grok 4 works now, especially after that first update, its behavior changed significantly. Right now, my Grok 4 will first spend a very long time studying the context without saying a single unnecessary word. Then, it makes its adjustments. Only after the work is completely finished does it report back on what it did. So, it basically only “talks” at the very end of the process.

If we’re talking about verbosity, Gemini has always been wordy, and Claude’s behavior seems to be pretty much unchanged.

I haven’t really seen any behavior that suggests it’s intentionally padding its responses. In fact, I’m more often dealing with the opposite problem: the agent just stops its work earlier than it’s supposed to. And I’m not just talking about the typical Grok 4 errors where it stops after the first action, but a more general “laziness.”

As for the slowdown in response time, I do occasionally see the model taking a moment before it starts working. I’m not sure when that started, but it hasn’t really been a big issue for me.

1 Like

The same feeling: responses are very bloated, with all kinds of meaningless summaries that not only waste tokens but also increase the chances of the model making mistakes. In contrast, Claude Code is very concise and efficient, and it completes the task accurately.

1 Like

Congrats on being the chosen one. While the rest of us are getting throttled, token-farmed, and handed half-baked outputs, you’re out here writing love letters to Grok. Maybe Cursor should just ship your config to everyone, since clearly none of us matter. :grin:

Grok 4 works for me only in a new chat; it instantly gets disabled in an ongoing one. I also rarely use Claude, only when I’m too lazy to write a prompt and it doesn’t need to do anything complicated, or as a QA AI.

1 Like

Ah gotcha, that makes more sense now. Yeah, Grok 4 in new chat does behave differently. And fair enough on Claude, I use it the same way sometimes too, just to handle quick stuff. Appreciate the clarification!

1 Like

Also, I have already posted my rules on GitHub as Agent Compass. There are already 16 stars, but I still haven’t received any written feedback. It would be interesting to read someone’s review.

1 Like

By the way, the whole repo was created in one shot by Claude 4, based on a quote from my conversation with Gemini AI Studio and some clarifications I gave. I was basically just the executive editor, together with Auto, and editing actually took more time than creating the repo. I started making it after only three hours of sleep, so in that state, letting go of the reins felt amazing. Claude even wrote out the User Rules straight from its own context.

1 Like

I just gave a star. :+1:

1 Like

Offer premium AgentCompass rule packs tailored for enterprise compliance, legal writing, scientific code, or education, priced at $9/$29 per month for individuals or $99/$499 per month for teams. Devs are happy to pay for cleaner, safer, and more consistent AI output, so it’s a smart way to turn utility into real value. Good luck, man!

I don’t have the resources for marketing, and I’m a pretty bad salesman when it comes to direct or cold sales. So I’m going with the donationware route.

In another repo, I used Apache 2.0 with a note that embedding it into another product or using it in an organizational context requires contacting me for a paid license. But here, I decided to go with a plain MIT license — according to Gemini, that’s the most appealing choice for organic traffic.

1 Like

By the way, I just used up my Pro+, so pay attention to the “Supporting” section in the Readme, please :backhand_index_pointing_right::backhand_index_pointing_left::joy:

1 Like