Delusions of testing

Where does the bug appear (feature/product)?

Cursor IDE

Describe the Bug

Cursor’s system prompts cause the AI to make statements like “Ok, I’ve successfully fixed the bug …” after making changes to UX-centric projects in response to UX-centric prompts, even though no testing or verification steps actually occur.

Steps to Reproduce

Build a UX-centric app. Notice a bug. Ask the AI to fix it. Observe the responses: the AI believes it is testing the app when, as far as I can tell, it is doing no such thing; it simply assumes the changes it made were successful. E.g.:

Now let me rebuild the app to test the changes:
cd /Users/…/Code/… && ./build-app.sh 2>&1 | tail -10
Copying executable…
Creating BuildInfo.plist…
Creating Info.plist…
App bundle created successfully!

To run the app:
open …app

To install to Applications:
cp -R …app /Applications/

Expected Behavior

By not engaging the user to actually test the change and provide feedback, the system wastes a lot of tokens on incorrect assumptions. It seems some behavioral instructions are missing from the system prompts.

Operating System

macOS

Version Information

Version: 2.4.21 (Universal)
VSCode Version: 1.105.1
Commit: dc8361355d709f306d5159635a677a571b277bc0
Date: 2026-01-22T16:57:59.675Z
Build Type: Stable
Release Track: Default
Electron: 39.2.7
Chromium: 142.0.7444.235
Node.js: 22.21.1
V8: 14.2.231.21-electron.0
OS: Darwin arm64 24.6.0

Does this stop you from using Cursor

No - Cursor works, but with this issue

  1. You didn’t provide your prompt
  2. You didn’t provide the model you were using

Hey, @Artemonim is right, we need a bit more info to figure this out:

  1. What prompt did you use? (share an example)
  2. Which model? (GPT, Claude Sonnet, etc.)
  3. Which mode are you using: Agent mode (Cmd+I) or normal Chat?

From your description, this sounds like a known behavior: the agent checks the command’s exit code (build-app.sh finished successfully = exit 0), not the actual result of running the app. For UX changes that need manual checking, the agent can’t verify automatically and needs your feedback.
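To illustrate the gap (a minimal sketch; `build_app` here is a hypothetical stand-in for the `./build-app.sh` script, which we haven't seen): a packaging script can exit 0, and that is all the agent can observe, even though nothing about the app's runtime behavior was checked.

```shell
# Hypothetical stand-in for ./build-app.sh: it only assembles the
# bundle and reports success, so "exit 0" means "build finished",
# not "the UX bug is actually fixed".
build_app() {
  echo "App bundle created successfully!"
  return 0
}

build_app
status=$?

# This exit status is the only signal the agent can check
# automatically; verifying the UX change still requires a human.
echo "exit status: $status"
```

In other words, the success message in your transcript comes from the build script itself, and the agent treats the zero exit status as confirmation.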

A workaround that might help:
State clearly in your prompt that you need to verify the fix before the agent concludes success:

"Fix [bug]. After making changes, wait for my confirmation that the fix works before concluding."

Or use a multi-step approach:

"Step 1: Make code changes to fix [bug]
Step 2: Build the app
Step 3: Ask me to test and confirm the fix works"

Let me know the model and the prompt. That will help us figure out whether the behavior can be improved in your specific case.