Anyone else observing a massive gap in LLM capabilities within Cursor? For example, ChatGPT 5.2 is 1000x better when used on chatgpt.com than when used inside Cursor. It’s gotten so bad that I am now copying and pasting things from Cursor GPT 5.2 into ChatGPT 5.2 to fix all the problems the Cursor LLM is causing. I see this with Anthropic LLMs too. Why are the LLMs so lobotomized within Cursor? I feel like I am going backwards to the old copy-and-paste vibecoding route; these CLI agents are NOT performing anywhere close to what they should be. If I ask ChatGPT 5.2 to do something, I get results that are so much better than what the same supposed LLM inside Cursor produces.
Hey, thanks for the feedback.
Cursor uses the same exact models directly via the OpenAI and Anthropic APIs, so these aren’t separate “cut down” versions. The model call settings are standard too.
The main difference is usually that Cursor adds extra context:
- Code from your project
- Rules for AI
- System prompts for Agent and Chat modes
To figure this out, I’ll need a bit more info:
- Can you share a specific example of the same prompt in ChatGPT vs Cursor, and exactly where the answers differ?
- Which mode is this happening in?
- What Rules for AI do you have set up?
- The Request ID from one of the bad answers (chat context menu > Copy Request ID)
So far I haven’t seen other reports of this issue, so I want to understand what’s specific about your case.
I recommend MAX mode. Without it, the 250-line read limit can severely affect the output; you’ll notice the same issue with Copilot.
Agent in Cursor can write over a thousand lines at a time. What limit are you talking about?
What about Ask mode and Plan mode? Because at this point, I am getting much better results using GPT 5.2 on ChatGPT.com to build plans than calling the same LLM in Plan mode.
I hate to corroborate the OP’s experience, but there does appear to be something about Cursor that makes the models… well, dumber than their “raw” counterparts.
I had to use Claude Code alongside Cursor recently, because I was updated to 2.4.x, and it was a disaster. First the terminals had severe issues, then once I hit 2.4.21 the agent itself was completely and utterly hosed, non-responsive, and the only solution ended up being downgrading all the way to 2.3.34.
I switched to using Claude Code at first, then was using the two side by side after the 2.3.34 regression (there were still some quirky issues for a while). There DOES seem to be something about Cursor that makes the models behave worse than they do natively. They do extra things not asked for, step out of bounds despite rules, or are just plain dumb (i.e. just now, Cursor wrote a bunch of extra code I did not ask for, and added a number of unnecessary data object mapper functions that literally recreated the exact same data structure, just minus some properties here or there… but it’s TypeScript, which is structurally typed!).
Claude Code does not seem to have the same… “mental disabilities”… at least, I have not encountered the same problems. Claude Code doesn’t crap .md “documentation” files all over my codebases. Claude Code doesn’t usurp my control and commit arbitrarily just because I had it commit once (Cursor now does this CONSTANTLY, and I cannot seem to rein it in, despite rules, explicit instructions in the same chat NEVER to commit on its own without explicit authorization from me each time, etc.).
I would encourage you guys not to dismiss out of hand this notion that Cursor performs worse than the raw models… because I think it does. If necessary, I think this could be empirically demonstrated as well. Sadly, this seems to have become worse in recent weeks, or the last month and a half? There was a period of time in the fall last year when Cursor was performing really well; models were blazing fast, and they seemed very savvy and smart. More recently, that has all faded, and THE DUMB seems to be taking over. It’s rather disappointing, but the recent stark differences in my experience with Claude Code vs. Cursor really hit home and make me truly wonder why Cursor seems so… fraught with quirkiness and problems.
In theory the main difference is the system prompt that Cursor adds. Same goes for Claude Code.
It is possible the Cursor team keeps meddling with the system prompts. There may be a lot of junk in there that confuses the model.
That is my suspicion as well, but I have little to back it up; it’s just a suspicion. I have a fair amount of rules, however I recently (just a couple days ago) went through and marked most of them as “Apply Manually”, reducing the baseline rule load to about 6-9. It varies, depending on some “Apply Intelligently” rules and some that are glob based. I used to have over 20. I also spent some time optimizing the rules themselves, eliminating unnecessary replication of instructions, excessive examples, etc. to reduce the amount of context they consume. I had hoped that would help things. Sadly, this did not seem to improve the nature of the Cursor agent… it still seems to just be quirky, and continues to do annoying things like crap .md files all over the place (despite having a rule stating never to do that, and also despite explicit in-prompt instructions not to! One of the most baffling issues!).
It also does seem like past chats have an influence on future agent behavior. I used to do some things in the past that I no longer do, and the agent, again despite rules, kept doing those things, until I went through and deleted all the old chats that had anything to do with those behaviors I no longer want the agent to engage in. I don’t know how that integrates, but it seems to… maybe it’s a section of the system prompt, I don’t know. In any case, I wonder if I should just go through and remove the majority of my historical chats… The sad thing is, the agent engages in bad behavior, then it’s part of that history, and I wonder if somehow that might be part of the problem? In other words: PRECEDENT?
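For anyone unfamiliar with the rule types mentioned above, here's a minimal sketch of what a trimmed-down rule file can look like, assuming the standard `.mdc` format (the description text and rule content here are hypothetical, not taken from the poster's actual setup):

```markdown
---
description: Conventions the agent should follow when testing HTTP endpoints
globs:
alwaysApply: false
---

- Keep throwaway test scripts out of the repo root.
- Never create .md documentation files unless explicitly asked.
```

With `alwaysApply: false` and a `description` set, this behaves as an "Apply Intelligently" rule, i.e. the agent decides from the description whether to attach it. Filling in `globs` instead makes it attach based on which files are in context, and leaving both empty makes it "Apply Manually" (only used when @-mentioned).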
Here is an exemplar case right here. This used to work PERFECTLY! My rule is VERY clear about how the agent should run httpie. Despite the rule being Apply Intelligently, with all the necessary description details to have it apply whenever the agent tests, I still explicitly reference it as well, because Cursor seems highly inconsistent with its rule application for any rules not set to “Always Apply”.
The rule, as you can see, is EXPLICIT about using --session-read-only, however the agent just ignored that and used --session anyway. This always results in httpie trying to update the session config, which breaks it, breaking all subsequent testing attempts. Drives me crazy. So the rule is very clear: the agent MUST use --session-read-only.
The agent just decided not to apply that part of the rule, I guess? Why is it so quirky!?!? This used to work PERFECTLY. I never even had to reference the rule; it applied, and the agent used --session-read-only religiously. But not anymore…
Why?
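For context, this is the distinction the rule is enforcing. The flags themselves are real httpie options; the session file path and endpoint below are made up for illustration:

```shell
# --session loads AND saves the session file: httpie writes cookies and
# headers from the response back to ./api-session.json, which is what
# keeps corrupting the stored config in the scenario described above.
http --session=./api-session.json GET :8000/api/health

# --session-read-only loads the stored auth/headers but never writes
# the file back, so repeated agent test runs can't break it.
http --session-read-only=./api-session.json GET :8000/api/health
```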
It’s input, not output. The file_read tool is limited to 250 lines per call. Some models may call it multiple times to get more input.
So if you give it a 600-line file, it may only read the first 250 lines, although it could make a second call for the next 250 and a third for the rest. Max mode can read up to 750 lines in one tool call, so a 600-line file will always be read in full.
This is the same issue Copilot has, but at least Cursor has Max mode to get around it. You’ll see even Opus in Copilot will return garbage at times, yet via opencode or Claude Code it will handle the same task fine. I find this most obvious when asking it to document an old system, as this involves lots of reading of files over 250 lines long.
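To make the failure mode concrete, here is a quick sketch of why a per-call line cap matters. The 250-line cap comes from the posts above; the function name and return shape are my own illustration, not Cursor's actual tool:

```python
# Hypothetical sketch of a line-capped read tool that forces pagination.
MAX_LINES = 250

def read_file_chunk(lines, offset=0, limit=MAX_LINES):
    """Return at most `limit` lines starting at `offset`, plus a flag
    telling the caller whether another call is needed for the rest."""
    chunk = lines[offset:offset + limit]
    truncated = offset + limit < len(lines)
    return chunk, truncated

# A 600-line file needs three calls at the default cap:
lines = [f"line {i}" for i in range(600)]
chunk1, more1 = read_file_chunk(lines, 0)    # lines 0-249,   more1 is True
chunk2, more2 = read_file_chunk(lines, 250)  # lines 250-499, more2 is True
chunk3, more3 = read_file_chunk(lines, 500)  # lines 500-599, more3 is False
```

The bad outcome described above is simply a model that never issues the follow-up calls: it sees only the first chunk, assumes that is the whole file, and plans or edits accordingly.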
This probably is true for rules if you still use them (and skills too)
If they’ve changed this, they need to make it clear.
I can give an example. I’m trying to do a pretty straightforward task: reorganizing my repo and refactoring my code. I’m using ChatGPT 5.2 in Plan mode to create a plan for this. Separately, I am using ChatGPT 5.2 through the web app to review the Cursor-generated plan, and 100% of the time ChatGPT 5.2 through the web app will find plenty of holes in the plan and is so much more thoughtful about what to do. So now I am in the habit of “prompt laundering”, where I manually route Cursor responses through the web app to get better responses than the dumbed-down Cursor ChatGPT 5.2 produces.
I just opened up the forums to complain, and the models feel completely lobotomized lately. The Cursor team could be telling the truth about whatever changes they made, but if you look at it through a game-theory lens, there’s another plausible explanation: they intentionally nerfed behavior/quality to push more usage, upgrades, or retries and hit a revenue target.
Because from the user side, that’s exactly what it feels like: I’m burning way more prompts and time just to get the same changes I used to get in one pass. And whether that’s deliberate or just an accidental consequence of safety tuning, cost controls, or model routing, the outcome is the same: revenue goes up when I have to fight the model to do basic work.
This has been my feeling… it seems like I burn through more tokens doing less work now. I don’t know if it is intentional, or if it’s just a consequence of vibe coding without enough guard rails? Are we experiencing regressions or intentional neutering? Tough to tell; we’ll probably never get that kind of clarity about it. Regardless, it’s been rough lately…
This is why I really wish Cursor would fix @Docs. With that, the efficiency, effectiveness, correctness, accuracy, brevity, etc. of code implemented/changed by the agent was VASTLY superior to how Cursor works now. IMO it’s a HUGE selling point for Cursor, if they would fix that and TOUT IT! IMO there’s a much better way to make money. Well, two things:
A) Keep your CURRENT customers happy!
B) Create AWESOME features to bring in more users!
The @Docs feature was AWESOME. It’s a pitiful limp little thing now, a faint shadow of its former self, and the loss of it has had an unbelievably dramatic impact on the amount of tokens it takes to do anything. Thing is, having @Docs back wouldn’t really reduce total per-month token usage anyway!
I’ll still spend $400-600 a month regardless! It just means that I get more work done MORE EFFICIENTLY! I still work 8-12 hours a day regardless, using Cursor all day long. There is no need to neuter features to make users spend more tokens. We are going to spend ’em anyway! What we want is efficiency, correctness, and accuracy, so we can DO MORE with the tokens we spend, in LESS TIME. We’ll still burn the same amount of tokens every month regardless. In fact, I think we might well spend MORE if we can DO MORE. As it is, I am often so hampered by Cursor bugs and other shortcomings that I have to deal with them, and that actually slows me down; I am pretty sure it’s slowed my token burn as well…
So if Cursor really is purposely neutering features to try and crank token burn rates… they are shooting themselves in both feet and a hand, severely handicapping themselves and limiting revenue.
You can set up a /Verifier subagent with the same model and ask the built-in GPT-5.2 to call it for self-verification; it will also find a ton of holes running inside Cursor.
Why work for current users if new ones will calmly accept all bugs as features :D
+1 to Jon’s comment. I’ve been using Gemini CLI, and Gemini 2.5 Pro there gives world-class answers to my questions compared to Gemini 3 Pro’s answers in Cursor.
Just simple questions like “where is the source of this variable and where is it referenced. lets plan a refactor to use xyz instead” seem to send Gemini 3 Pro in Cursor haywire.
Gemini CLI with Gemini 2.5 Pro has no problem doing this. So I have to keep using Gemini CLI (which is slow af but reliable) alongside Cursor, for cost reasons as well as a measure of diversification.