Gemini 2.5 performance: Great yesterday, terrible today

Ok fine, one more round :wink:

Yes, I imagine you guys have worked out vendor agreements directly and this is neither here nor there, but when you say the pricing is “not as straightforward”, this tells me you likely have a sweetheart deal w/ Anthropic (which — smart. Can’t think of a better opportunity to cement mindshare w/ the dev community as a model provider). But reading between the lines — with a dash of speculation — you’re also implying that possibly the deal isn’t quite as sweet with Google. Again — neither here nor there, but noted … and tell Logan I said hook it up!*

Anyways — this still underscores a point from above:

… this will inevitably impact sentiment in community chatter ( … :waving_hand: hi … ) and your perceived value will be scrutinized so long as the product is underperforming relative to the baseline of the models your product is wrapped around, or relative to prior releases. (btw — just so we’re clear — I do not mean wrapper in a derogatory way).

But back to the core point of all of this, you touch on it here:

I think this confirms quite a bit, and unless you tell me otherwise, one takeaway I believe we all have from this is:

Cursor’s pre-processing is truncating provided context — including in MAX. Best efforts or otherwise, trimming context is a feature, not a bug.

When you say “they still perform better given only what they need and nothing more” — it’s framed in such a way that one seemingly couldn’t argue with it … but the core issue that began this discussion is: the product is simply doing a bad job of determining what the model “needs”.

And my point is: Different models have different needs.

… and the Gemini models happen to be the only ones that perform better given more context. Feed the geese!

Now … regarding MAX — when you say “criteria is much lower”, what does this mean?

  • The Top K threshold is lower? (semantic search returns more results)

  • Are chunks bigger?

  • Does it stack-rank context based on semantic results, then “automagically” decide how/when to provide full files vs. chunks vs. outlines? And if so, does MAX just have a larger bias towards exposing the larger portions (full files over chunks)?

The reason I’m asking these kinds of questions: the better I (we, the people) understand this, the better I can engineer how I provide context. If you can offer any useful “tricks” or steps we can take to ensure we’re getting the best, most coherent, output, I’m all ears.
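
For what it’s worth, here is the mental model I’m working from when I ask these questions. It’s purely my speculation, not Cursor’s actual implementation; every name, knob, and number below is invented:

```python
# Purely illustrative: my mental model of how context selection *might* work.
# None of these names, thresholds, or budgets are Cursor's actual values.
from dataclasses import dataclass


@dataclass
class Chunk:
    path: str
    text: str
    score: float        # semantic similarity to the query
    token_count: int


def select_context(chunks: list[Chunk], top_k: int, token_budget: int) -> list[Chunk]:
    """Stack-rank candidate chunks by similarity, then greedily pack a token budget."""
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)[:top_k]
    selected, used = [], 0
    for chunk in ranked:
        if used + chunk.token_count > token_budget:
            continue  # drop (or outline?) anything that would blow the budget
        selected.append(chunk)
        used += chunk.token_count
    return selected
```

If MAX mostly just raises `top_k` or the token budget, or biases toward full files over chunks, knowing that alone would change how I @-mention things.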


Just one final note to consider feeding up the chain ..

MAX is a road to nowhere.

… as it functions today …

Here’s why:

1. The Fundamental Disconnect

MAX suffers from a fundamental misalignment between promise and delivery. When you market a feature based on raw numerical capability (1M tokens) but then silently filter and manipulate the provided context (context window = instructions + cursor rules + tools + provided context + query), you’re creating cognitive dissonance for your users.
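
To put rough numbers on that decomposition (every figure here is invented for illustration; I have no visibility into the real overhead):

```python
# Back-of-the-envelope illustration of the decomposition above.
# Every number is invented to show the shape of the problem, not measured.
advertised_window = 1_000_000       # the headline MAX number

overhead = {
    "system_instructions": 5_000,
    "cursor_rules": 2_000,
    "tool_definitions": 8_000,
    "query": 500,
}
provided_context = 70_000           # what the pipeline hypothetically lets through

used = sum(overhead.values()) + provided_context
print(f"advertised: {advertised_window:,} tokens")
print(f"used:       {used:,} tokens ({used / advertised_window:.1%} of the promise)")
```

The exact numbers don’t matter; the gap between the headline and what actually reaches the model is the point.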

We’re sold on infinite headroom but find ourselves in a house where the ceiling keeps randomly dropping. It’s like selling a Ferrari with a governor that keeps it under 55mph – technically it’s still a Ferrari, but you’ve neutered the very thing that made it special.


2. Too Amorphous. Too Cerebral

Given Cursor is interfering with the provided context (different from the context window), the value proposition becomes too jumbled and too amorphous — too intangible for a paying user to be left with a “wow” when they use it. Selling an expanded context window as a core selling point is also too cerebral — to quote Wolf of Wall Street: “Fuçking digits — all very acidic, above-the-shoulders, mustard shît. It kind of wigs some people out.”

It’s a … supposedly … impressive number, but a confusing veneer on something of questionable intrinsic value.

Users who get excited about “1M tokens” will quickly discover they’re paying a premium for a black box that might be using 70k tokens or less – and feeling disappointed as they stare at a punishing trail of useless tool calls for linting errors and unnecessary “read” calls that they can point to and ask, “was that worth 5¢?” … you’re rubbing it in their face.

Re: the black box:
“Good design is honest.” — Dieter Rams


3. The Perception Problem

Each interaction with MAX creates multiple moments of friction:

  • Value Opacity: Users can’t perceive what they’re actually getting. Is my entire codebase being considered? Only parts? Which parts? The black box creates uncertainty.

  • Cognitive overhead: Unlike almost any other product, users see each transaction itemized in real-time. Imagine if Netflix showed you a running cost counter for each minute you watched.

  • Punishment Loop: Every failed interaction isn’t just disappointing—it’s a series of small paper cuts in the form of visible tool calls and charges that remind the user “this didn’t work AND I paid for it.” (Worth mentioning — this overhead is exclusive to MAX. I know I’m rolling the dice and not thinking twice with non-MAX because the pricing model is significantly better framed.)

The current implementation is the water torture of consumer pricing psychology. Drip.

Believe me — it’s not about the 5¢ — it’s about the perceived value, the resulting output, and effectively being faced with a trail of receipts each time you use it. If it doesn’t work, and it makes 10 tool calls and edits a bunch of files unnecessarily, it’s like staring at 10 receipts that are just there for the user to count. It’s annoying and makes users think (and anguish) too much about things they don’t actually need to care about.

Each time presents an opportunity for an end user to question the value of the product being provided, and you are ultimately staking your reputation on the reliability of the model. If the model f’s up 5 times, it looks bad 5 times. When a user can directly attribute that bad response to something they paid for, it’s an unnecessarily bad user experience.

At the end of the day, using an LLM is still like pulling a slot machine with lower than 50% odds – and while models are improving, the value proposition is in intelligence, not raw context size. Especially when Cursor is trying to be too clever by half with context filtering.

4. The Reputational Calculus

You’re currently setting up a dangerous equation:

User disappointment × Visible costs × Frequency of use = Rapidly deteriorating brand loyalty

When your product regularly shows users they’re paying for disappointing results, you’re training them to look elsewhere. This is basic behavioral economics—you’re creating a negative reinforcement loop with your own product.

The promise is too difficult to fulfill.

If Cursor manipulates provided context in ways users can’t predict, control, or even observe, it cannot honestly market MAX based on context window size. It’s a promise that – despite my determined testing – I can’t verify is being fulfilled.

Without transparency, you’re setting up users for the psychological equivalent of biting into what looks like a cream-filled donut and finding it hollow.

Short-term gains vs. reputational risk:

It’s totally a boost for you guys — I can run the numbers on my usage alone and extrapolate from there. But you’re burning through reputation capital at an alarming rate and it worries me for you.

Onward and Upward

Stop selling MAX as a context window size upgrade. Instead, position it as a comprehensive solution for working with large codebases—then actually build that solution with intentional features.

Some concrete ideas:

  • Context Visibility: Create an interface showing exactly what files and chunks are being included in context. Put them in order prior to each “thinking” call, and even give it a second before collapsing, to show its “value”. It should show each portion of the context provided: collapsible by default after a second, but expandable into a list of files (each of which can also be opened, showing full file, partial, or outline). A rough sketch of what I mean follows this list.

  • Context Control: Build tools letting users explicitly prioritize critical files/directories. Maybe drag/drop to sort in the input window.

  • Learn from users: Use ML to build context on a per-user basis.

  • Predictable Pricing: Move to a higher subscription tier (naturally, “Ultra”). Use fast credits, whatever. Blend costs — eliminate the cognitive overhead. Period.

  • Value Anchoring: Bundle additional premium features that increase perceived value beyond just context size.

  • More tool calls still make sense to me.
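
To make the Context Visibility idea above concrete, here is a rough sketch of the data a per-request breakdown panel could expose. All field names are mine and purely hypothetical; this is a product sketch, not Cursor’s schema:

```python
# Hypothetical shape of the data a per-request "context breakdown" panel could expose.
# Field names are invented for this sketch; this is not Cursor's actual schema.
from dataclasses import dataclass, field
from typing import Literal


@dataclass
class ContextItem:
    path: str
    inclusion: Literal["full", "partial", "outline"]
    token_count: int
    reason: str                      # e.g. "@-mentioned", "semantic match", "agent read"


@dataclass
class ContextBreakdown:
    request_id: str
    model: str
    items: list[ContextItem] = field(default_factory=list)

    @property
    def total_tokens(self) -> int:
        return sum(item.token_count for item in self.items)
```

Render it in order before each “thinking” call, collapse it after a beat, and let users expand each item to see exactly how much of a file actually made it in.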

Remember: great products feel inevitable, not experimental. Right now, MAX feels like a beta feature that escaped the lab too early, and users are paying to be … subjected to water torture.

Your reputation isn’t just on the line – it’s actively being spent down with each disappointment. This is fixable, but it requires rethinking — and fast — to avoid a complete depletion of reputational capital.

Strong opinions, weakly held.

  • Ash

Last question: How precisely can I make use of the full 1M (or 200k) context window in MAX?*

Every prior attempt I’ve made has failed — at best, I’ve barely been able to use 130k tokens due to filtering.
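
For anyone who wants to sanity-check this on their own repo, here is a rough way to estimate how many tokens a set of files would occupy if included in full. It uses tiktoken’s cl100k_base encoding as a proxy; Gemini’s tokenizer differs, so treat the result as a ballpark, not a ground-truth count:

```python
# Ballpark estimate of how many tokens a set of files would occupy as provided context.
# cl100k_base is a proxy encoding; Gemini tokenizes differently, so this is approximate.
import pathlib

import tiktoken


def estimate_tokens(paths: list[str]) -> int:
    enc = tiktoken.get_encoding("cl100k_base")
    return sum(
        len(enc.encode(pathlib.Path(p).read_text(errors="ignore"))) for p in paths
    )


files = ["src/app.py", "src/models.py"]   # whatever you @-mention in the prompt
print(f"~{estimate_tokens(files):,} tokens if included in full")
```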

I’d love for a straightforward answer to this.


PS I would strongly recommend reading the book “Predictably Irrational” by Dan Ariely—one of the better books for consumer behavior/pricing psychology. Everything I’ve described above is like textbook “what not to do.”

PPS Even if you didn’t read this I still consider this a win because I was able to tie a Dieter Rams quote in with a quote from Wolf of Wall Street and it still made sense.

PPPS: I don’t actually know Logan, but if I did, I’d tell him the same thing.


Weird. I felt Gemini 2.5 was AMAZING up until maybe 2 days ago. Then, all of a sudden, it was worse than Claude 3.5 at times. It would often tell me to do one of the small tasks I had told it to do, after completing only part or some of the tasks listed.

These rounds of candid discussion have been truly brilliant, straight to the point. Great take! Bravo!!! @asherhunt

To those interested in browsing these discussions…

Here’s an interview hosted by Rajan Anandan, Managing Director at Peak XV, featuring Aman Sanger, co-founder of Cursor. The full version runs about 30 minutes.

I’ve extracted a few key highlights:

1. Pivoting and Iteration

Initially, Cursor wasn’t focused on AI coding but aimed to assist CAD design. However, the team lacked domain expertise, prompting a pivot—a common journey for startups. Cursor prioritizes shipping usable (not perfect) products early to gather user feedback and iterate fast. While some dismissed it as a “model wrapper,” they’re now building their own models—unsurprising, given that data ownership enables model development.

2. Culture of Experimentation

Aman emphasized their experimental mindset: “Most of Cursor’s work is testing possibilities. For every feature shipped, 10 experiments failed.” He cited “Cursor Tab” (originally “Copilot++”) as a favorite example—it succeeded only after multiple failed attempts.

3. Bold Development Approach

The team rapidly releases functional prototypes and refines them through user feedback. This method accelerates innovation validation in real-world scenarios.

4. Hiring from User Base

Cursor unlocked a unique talent pipeline—its users. “A magical aspect of our product: users make great hires,” Aman noted. For instance, their second employee was identified by analyzing code quality among active users:

“We developed methods to assess top users’ code quality and proactively reached out. One responded—and became an exceptional engineer.”

5. $100M Revenue with Zero Sales Team

(More details in the full interview.)

This interview offers rich insights—worth a watch if you’re interested.

@asherhunt Last one, let’s do this!

  • I wasn’t trying to imply anything with my comments on pricing; I personally have no insight into that side of things, so I could not tell you either way how things land there. I just wanted to underline the fact that comparing $ to $ is not as clear-cut for us as it would be for general API usage.
  • I agree that performance in Cursor vs raw model performance will always be heavily scrutinised, but we build in the hope that users see past the semantics and find Cursor, as a whole package, to just be a generally more productive coding experience.
  • We don’t necessarily truncate context; we just don’t pour in context that isn’t relevant simply because there is a chunk of remaining context window we haven’t made use of. The ceiling is lower in non-Max mode, so there is a risk that relevant context gets cut off, but Max mode will always include everything you @, and anything the Agent decides to look at!
  • By “criteria is much lower”, I basically mean we won’t cut anything out, as mentioned above. For non-Max, we set a hard limit on context for reasons I’ve already touched on, so the Agent can only read smaller chunks at a time, to ensure the model is more critical about whether it needs more after each chunk it gathers (see the rough sketch after this list).
  • The summary of Max is that it unlocks the potential for higher-intelligence edits, but it is not a guarantee of them. We have already got a good volume of users who use Max regularly and find the cost worthwhile, but your points are valid and could be a blocker for Max adoption, so I’ll feed this back to the team.
  • All of your concrete ideas are being acted upon internally, so I’d hope to see improvements in all of these areas moving forward!
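
To paraphrase the Max vs non-Max difference as pseudocode (illustrative only; the hard limit and names here are invented for explanation, not our real values):

```python
# Illustrative paraphrase of the Max vs non-Max behaviour described above.
# The hard limit, names, and numbers are invented for explanation only.
from dataclasses import dataclass


@dataclass
class ContextPiece:
    path: str
    token_count: int


def build_context(explicit_mentions: list[ContextPiece],
                  agent_reads: list[ContextPiece],
                  max_mode: bool) -> list[ContextPiece]:
    if max_mode:
        # Max: everything you @-mention plus anything the Agent chooses to read
        # goes in; irrelevant filler still isn't added just to fill the window.
        return explicit_mentions + agent_reads

    # Non-Max: a hard ceiling applies, and the Agent reads smaller chunks,
    # deciding after each one whether it really needs more.
    HARD_LIMIT = 60_000  # hypothetical
    context, used = [], 0
    for piece in explicit_mentions + agent_reads:
        if used + piece.token_count > HARD_LIMIT:
            break
        context.append(piece)
        used += piece.token_count
    return context
```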
