Agents doing weird things since Mon-Tues 9-10th?

Where does the bug appear (feature/product)?

Cursor IDE

Describe the Bug

Agents are doing some weird things. I can’t even describe clearly what is wrong. Since about yesterday, agents hallucinate, don’t follow rules, lose track of what they are doing, make up anything and everything, and game tests in any possible way they can just to move on. Not only in tests; they are doing anything to just get to some weird goal: making up new goals, misinterpreting what I said, totally misunderstanding what I asked for, etc.
It’s just REALLY weird. It’s like they are still smart, but doing soooooo many things wrong that worked absolutely fine just a few days ago. Many models are doing it, hence I think it’s something in the Cursor harness layer. Example: I have a list in a file that it is working on; it suddenly just skips working on the list, makes its own list, forgets my list, totally forgetting or making up new tasks and suddenly working on something completely different. It has a VERY hard time just remembering what it’s working on. Maybe this is the biggest problem I’ve seen in the last 36 hours.
Does anyone else see the same trend?

Steps to Reproduce

Just do any complicated work: ask it to do some requirements, design, coding, and just work with it. It will start to hallucinate and do weird things, as if we’re back in May/June last year.

Expected Behavior

Being able to follow a specification and task list.

Operating System

Windows 10/11

Version Information

Version: 2.6.18 (system setup)
VSCode Version: 1.105.1
Commit: 68fbec5aed9da587d1c6a64172792f505bafa250
Date: 2026-03-10T02:01:17.430Z
Build Type: Stable
Release Track: Default
Electron: 39.6.0
Chromium: 142.0.7444.265
Node.js: 22.22.0
V8: 14.2.231.22-electron.0
OS: Windows_NT x64 10.0.26200

For AI issues: which model did you use?

GPT Codex 5.3, 5.4, Auto, Gemini 3.1 Pro, various Claude models

For AI issues: add Request ID with privacy disabled

c8f66985-4f9a-4b30-9d80-030e59248346

Does this stop you from using Cursor

Sometimes - I can sometimes use Cursor

Hey, thanks for the report.

What you’re describing (the agent skipping items in your task list, forgetting what it’s working on, and making up new goals) matches known summarization issues where important context gets lost in longer chats. We’ve seen similar reports from other users lately.

A few things that should help:

  • Start new chats more often before the context window fills up and summarization kicks in. Use @ mentions to quickly pull in relevant files.
  • Put key instructions in a .cursorrules or project rules file since those persist across summarization. That way the agent keeps critical context even when the conversation gets shortened (see the example after this list).
  • Make sure you’re on the latest build. You’re on 2.6.18, which should include recent improvements.
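
For illustration, a minimal rules file along those lines might look like this (a sketch only; the file paths and task names are placeholders, not from your project):

```
# .cursor/rules/task-discipline.mdc (or a legacy .cursorrules file)
# Hypothetical example; adapt the file names to your project.

- The canonical task list lives in docs/tasks.md. Never invent new tasks;
  work through that file strictly top to bottom.
- Before starting a task, restate the current item verbatim.
- Never weaken, skip, or game tests to make them pass; fix the code instead.
- If an instruction is ambiguous, ask before acting.
```

Because rules are attached to every request rather than living in the chat history, they survive even when the conversation itself gets summarized.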

To dig deeper into your specific case, it’d help to have a single concrete example: one conversation where things went off track, plus the request ID from that specific message. A broad “everything is weird” report is hard to act on, but a specific case lets us trace what happened.

The request ID you shared finished without any server errors, so this is likely about what context was sent to the model, not a service outage.

The team is aware of these summarization quality issues. No ETA yet, but your report helps us prioritize. Let me know if you can share a more targeted example.

Yes, I agree. But this didn’t happen a few days earlier. There’s a significant degradation of capability in the last few days. It’s very, very noticeable. (I’m a senior developer with 25+ years of experience, but I have vibe coded “only” for a year now.)
It seems like a context problem, and it might be, but I think it’s been having problems even in short sessions.
This example is from a session that has now probably run long and been summarized, but it’s a perfect example of what’s going on.

The wrapper we had in scope was the rule generator (new_scraper_rule_gen). We already added one test for it: test_rule_generator_field_extraction_records_event in test_ai_usage_real_flows.py — that’s the “one more test, one more wrapper” we did.

When you said “run it so they all are run at least once,” I should have taken that as: run the existing tests (so that wrapper, and the others we have, run at least once). I didn’t forget which wrapper to test; I misread “run them” as “build something that runs every call site” and added the big jog-all file.

So:

  • The one wrapper we were supposed to test: rule generator — and it’s already covered by that single test.

  • What I’ll do now: Remove the jog-all test file and leave the one rule-generator test as the only new addition. No other new tests, no “any one is ok” — that one wrapper was the one we targeted.

If you actually meant a different wrapper (e.g. sitemap_analyzer, embeddings, or another) as the one that still needs its first test, say which one and I’ll add exactly one test for that one only.

ID 761cf668-76ce-48ee-938b-935b58a72d8b
A few months ago, in the fall, I wouldn’t have written like this, but for the last few months this was OK; I didn’t have to be more precise than this. It understood me. The last few days, suddenly it doesn’t. There’s a huge degradation. Like the memory or summarization or something is broken.

The example above is just one of many, many occasions of similar problems in the last 48 hours.

In 144d82a6-f55e-4f2e-b3a0-4f08972e8674 I just had to give up. Normally I don’t run sessions this long, of course, but I couldn’t get to a point where I could summarize it, I thought. Eventually I just had to abandon it and start over… Wasted 36 hrs of work now on this ai-cost-calc thing… a few days ago, it would have been done in 2-3 hrs.
For the first time ever in my project, I felt I had to actually tell it exactly what to write and code :frowning: (a simple singleton wrapper function/class, and then using that instead of direct calls with an API key).
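To show the level of detail I mean, the wrapper I had to dictate was roughly this (a simplified sketch for this post; the real names in my project are different):

```python
# Simplified illustration only; class, function, and env-var names
# are made up for this post, not my actual code.
import os
from functools import lru_cache


class ApiClient:
    """One shared client that owns the API key, so call sites stop
    reading the key and hitting the API directly."""

    def __init__(self, api_key: str):
        self.api_key = api_key

    def call(self, endpoint: str, payload: dict) -> dict:
        # Placeholder body: a real version would send the HTTP request here.
        return {"endpoint": endpoint, "authorized": bool(self.api_key), **payload}


@lru_cache(maxsize=1)
def get_client() -> ApiClient:
    # lru_cache(maxsize=1) makes this behave as a singleton.
    return ApiClient(api_key=os.environ["API_KEY"])
```

Every call site then uses get_client().call(...) instead of touching the key itself. Trivial stuff, and I still had to spell it out line by line.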
I’m still working on it, and it’s still massively going off on tangents in various directions at random times. Currently on GPT 5.3 Codex low; used Auto a few hours ago, and I don’t think 5.3 made any difference.

Thanks for the extra request ID and the concrete example; that’s exactly what we need to investigate.

That rule generator example, where the agent basically reinvented the task instead of just following the list, is really telling.

I can see you’re working with very long sessions; 36 hours on a single task is a lot. This is where summarization behaves the worst right now: it drops task lists and specs, and the agent starts making up what to do next. The team knows about these summarization issues. I don’t have a specific timeline yet, but your report with the request ID helps us prioritize.

As a workaround for now, try splitting the work into shorter sessions: start a new chat every 30 to 50 messages, and keep critical instructions and task lists in a rules file rather than in the chat, so they survive summarization.

Let me know if the behavior is still noticeably worse with short sessions compared to last week; that would be an important signal that it’s not only summarization.


Yes, it was 36 hrs (I slept too, and worked on other sessions as well). But even after switching sessions, it quickly does the same (I tried). Being able to switch is why I use the task list: for it to follow and keep track of where we’re at, etc. But it didn’t even manage to do that… and this used to work just a few days ago (i.e. it could summarize and keep going, no problem). Before this problem occurred, long sessions were possible thanks to my separate task list for it to follow and stay focused.
This is now broken: it seems the task file is no longer able to keep it from hallucinating/losing focus. And on Auto I can’t keep track of the context window either, but if I choose a specific model, now I’m wasting money instead, since it can’t do long sessions and 50 messages is often not enough to close out an individual task.

I… ahem… I switched to Kiro just now and it cleaned up all the mess, built my thing, refactored everything, and ran the tests. It took me 30 min and I was done with what I had failed to do in 36 hrs… :frowning:
Let me know when you have a fix for this issue; for now, it’s impossible to work on even smaller tasks if it needs to read a bit of the codebase.
/Per

Got it, Per. It really sucks that you lost so much time.

Your feedback about task files is a really useful signal. The fact that even a separate task file didn’t help the agent stay focused is different from just a long session, and I shared that with the team along with your request IDs.

I’ll update this thread when there’s progress on summarization and keeping context. I can’t promise a specific timeline, but your report helps us prioritize.
