Let Composer Pass In Images In YOLO Mode

jakekinchen · February 1, 2025, 1:11am

in yolo mode, the only way cursor knows whether its edits are making progress is by writing tests to verify them, or by running the program and monitoring the terminal output for console logs or runtime errors. this is a great fit for backend development because it allows iterative, repeated verification, but frontend dev is a little trickier since we (the users) have to observe the output ourselves, either on localhost or in a window launched by our program.

the simplest way to bootstrap this currently on macos is to run a zsh script that runs the program (headless browser/OSA script), takes a screenshot, and saves it to a directory. we then either upload the file or copy it to the clipboard via pbcopy and paste it into the chat interface.
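a minimal sketch of the screenshot step, written in Python rather than zsh. it assumes the built-in macOS `screencapture` tool; the `screenshots` directory and the `shot-<timestamp>.png` naming are made up for illustration, not something cursor expects.

```python
from datetime import datetime
from pathlib import Path
import subprocess


def build_capture_command(out_dir: Path, now: datetime) -> list[str]:
    """Return the macOS `screencapture` command for a timestamped PNG."""
    # Hypothetical naming scheme: shot-YYYYMMDD-HHMMSS.png
    name = now.strftime("shot-%Y%m%d-%H%M%S.png")
    # -x suppresses the shutter sound; the last argument is the output file.
    return ["screencapture", "-x", str(out_dir / name)]


def capture(out_dir: Path) -> Path:
    """Take a full-screen screenshot (macOS only) and return its path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    cmd = build_capture_command(out_dir, datetime.now())
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])
```

calling `capture(Path("screenshots"))` after the app has rendered leaves one new PNG in the directory per run, so older screenshots stay around for before/after comparison.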

however, it would be so much easier (and allow quick iterative development cycles) if cursor could simply read base64-encoded image data from a directory into its API call (without the user needing to link it) when it self-assesses its edits, just like it can with text files. i would pay so much extra for this one feature.
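to make the ask concrete, here is roughly what "reading base64 info from a file" amounts to, using only the Python standard library. the `as_image_part` payload shape is purely hypothetical; it is not cursor's actual request format, which isn't public in this thread.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: Path) -> str:
    """Encode an image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image file: {path}")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def as_image_part(path: Path) -> dict:
    """Wrap the image in a message part (hypothetical shape, for illustration)."""
    return {
        "type": "image",
        "name": path.name,
        "data_url": image_to_data_url(path),
    }
```

the encoding itself is cheap; the real cost is the image tokens on the model side, since every screenshot attached this way is billed as image input.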

this would be great because it would essentially be byot (bring your own tool): you can use whatever bash method you want to screenshot the output into a directory, composer is instructed via cursor rules to trigger the command, and then it checks the screenshot directory in the codebase to assess how it did and iterate. this seems trivial to include, but perhaps the increased image usage would be costly on the API side and that’s why it isn’t supported. either way, i’d pay extra and others would too.
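the "trigger the command, then check the directory" loop could look something like this sketch. `run_and_collect` and `IMAGE_SUFFIXES` are hypothetical names, and the screenshot command is whatever tool you bring; nothing here is an existing cursor feature.

```python
import subprocess
from pathlib import Path

# Assumption: screenshots are saved with one of these extensions.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def list_images(shot_dir: Path) -> set[Path]:
    """Return the image files currently in shot_dir (empty if it is missing)."""
    if not shot_dir.is_dir():
        return set()
    return {p for p in shot_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES}


def run_and_collect(command: list[str], shot_dir: Path) -> list[Path]:
    """Run the user's screenshot command and return the images it added."""
    before = list_images(shot_dir)
    subprocess.run(command, check=True)
    new = list_images(shot_dir) - before
    # Oldest first, so the last element is the most recent screenshot.
    return sorted(new, key=lambda p: p.stat().st_mtime)
```

diffing the directory before and after the command means the agent only looks at screenshots produced by the current iteration, no matter how the tool names its files.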

raw.works · February 26, 2025, 1:29am

anyone figure this out? it would be really useful for my application! i really need the agent to be able to “see its work”.

here’s my post about this: "see what you've done" - multimodal yolo mode composer agent

amxv · February 26, 2025, 4:18pm

if we solve this, we basically have frontend AGI

raw.works · February 26, 2025, 5:03pm

i absolutely agree.

right now the closest flow i have is e2e testing and forcing the agent to run those tests, which can be llms doing ‘browser-use’ or something similar if you want the non-determinism instead of straight playwright scripts.

related:
https://x.com/raw_works/status/1894578679264543134

amxv · February 26, 2025, 8:42pm

i hacked together a server that uses Qwen-72B to review before/after UI edits as a workaround until cursor officially adds image support to mcp tools

lmk if you run into any issues

raw.works · February 26, 2025, 10:15pm

very cool!

me21 · March 13, 2025, 9:49pm

I agree. Once Cursor is able to see screen images, it will be possible to work in yolo mode on desktop GUI applications. And if we allow Cursor to use AutoHotKey or PyAutoGUI, it will be able to interact with GUI applications the way it does with command line tools now.

amxv · April 23, 2025, 2:22pm

just made a new MCP server that solves this, check it out: Let Cursor Agent take screenshots of your screen

@raw.works @jakekinchen
