Let Composer Pass In Images In YOLO Mode

in yolo mode, the only way cursor knows whether its edits are making progress is by writing tests to verify them, or by running the program and watching the terminal output for console logs or runtime errors. this is a great fit for backend development, since it allows iterative self-verification, but frontend dev is a little trickier since we (the user) have to observe the output ourselves, either on localhost or in a window launched by our program.

the simplest way to bootstrap this currently on macos is to run a zsh script that launches the program (headless browser/OSA script), takes a screenshot, and saves it to a directory, which we then upload or copy to the clipboard via pbcopy and paste into the chat interface.
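
for anyone who wants to script the screenshot step, here is a minimal sketch in python rather than zsh. it assumes chrome is installed at the default macos path and that the app is served on localhost. the `screenshots` directory and the function names are made up for illustration.

```python
import subprocess
import time
from pathlib import Path

# Assumed default install path for Chrome on macOS; adjust for your machine.
CHROME = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"


def screenshot_command(url: str, out_path: Path, width: int = 1280, height: int = 800) -> list:
    """Build a headless-Chrome command that renders `url` and writes a PNG."""
    return [
        CHROME,
        "--headless",
        f"--screenshot={out_path}",
        f"--window-size={width},{height}",
        url,
    ]


def capture(url: str, out_dir: Path = Path("screenshots")) -> Path:
    """Render `url` and save a timestamped screenshot into `out_dir` (hypothetical name)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = (out_dir / f"shot-{int(time.time())}.png").resolve()
    # check=True raises if Chrome exits with an error, so failures show up in the terminal.
    subprocess.run(screenshot_command(url, out_path), check=True, timeout=60)
    return out_path
```

calling `capture("http://localhost:3000")` should leave a png in `screenshots/` that you can then drag into the chat by hand, which is exactly the manual step this request wants to remove.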

however, it would be so much easier (and allow quick iterative development cycles) if cursor could simply read base64 image data from a directory into its API call (without the user needing to link it) when it self-assesses its edits, just like it can with text files. i would pay so much extra for this one feature.
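
to make the ask concrete, this is roughly the step i mean: pick the newest image in a directory and turn it into base64 that could ride along in a request. this is only a sketch of the encoding side. how cursor would actually attach it to its API call is unknown to me, and the helper names are hypothetical.

```python
import base64
import mimetypes
from pathlib import Path
from typing import Optional

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def latest_image(directory: Path) -> Optional[Path]:
    """Return the most recently modified image in `directory`, or None if there is none."""
    images = [p for p in directory.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES]
    return max(images, key=lambda p: p.stat().st_mtime, default=None)


def to_data_url(path: Path) -> str:
    """Encode an image file as a base64 data URL (data:<mime>;base64,<payload>)."""
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```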

this would essentially be byot (bring your own tool): you can use whatever bash method you want for screenshotting the output into a directory, and composer is instructed via cursor rules to trigger the command, then check the screenshot directory in the codebase to assess how it did and iterate. this seems fairly trivial to include, but perhaps the increased image usage would be costly on the API side and that's why it hasn't been added. still, i'd pay extra and others would too.
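
a rough sketch of the byot loop from the tool's side: run whatever screenshot command the user configured, then report which files appeared in the output directory so the agent knows what to look at. the command and directory are whatever you choose. `run_and_collect` is a hypothetical name, and nothing here is an existing cursor feature.

```python
import subprocess
from pathlib import Path


def run_and_collect(command: list, out_dir: Path) -> list:
    """Run a user-supplied screenshot command and return the files it newly created in `out_dir`."""
    out_dir.mkdir(parents=True, exist_ok=True)
    before = {p.name for p in out_dir.iterdir()}
    # Any tool works here as long as it writes its images into out_dir.
    subprocess.run(command, check=True, timeout=120)
    return sorted(p for p in out_dir.iterdir() if p.name not in before)
```

comparing directory listings before and after keeps the helper tool-agnostic: it does not care whether the images came from a headless browser, an OSA script, or anything else.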


anyone figure this out? it would be really useful for my application! i really need the agent to be able to "see its work".

here's my post about this: "see what you've done" - multimodal yolo mode composer agent


if we solve this, we basically have frontend AGI :stuck_out_tongue:

i absolutely agree.

right now the closest flow i have is e2e testing and forcing the agent to run those tests... which can be llms doing 'browser-use' or something similar if you want the non-determinism instead of straight playwright scripts.
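
for the straight playwright version, a minimal sketch of what the agent can run and read back from the terminal. it assumes playwright for python is installed (`pip install playwright`, then `playwright install`). the selectors, url, and screenshot path are placeholders for your own app.

```python
from pathlib import Path


def format_report(failures: list) -> str:
    """Turn a list of failure messages into plain text the agent can read from the terminal."""
    if not failures:
        return "E2E OK: all checks passed"
    lines = [f"E2E FAILED: {len(failures)} check(s)"]
    lines += [f"  - {failure}" for failure in failures]
    return "\n".join(lines)


def run_checks(url: str, selectors: dict, shot_path: str = "screenshots/e2e.png") -> list:
    """Load `url`, check that each named selector is visible, and save a screenshot."""
    # Imported lazily so the rest of the module works without Playwright installed.
    from playwright.sync_api import sync_playwright

    failures = []
    Path(shot_path).parent.mkdir(parents=True, exist_ok=True)
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        for name, selector in selectors.items():
            # .first avoids a strict-mode error when the selector matches several elements.
            if not page.locator(selector).first.is_visible():
                failures.append(f"{name}: '{selector}' not visible")
        page.screenshot(path=shot_path)
        browser.close()
    return failures
```

something like `print(format_report(run_checks("http://localhost:3000", {"header": "h1"})))` gives the agent text feedback today, while the saved screenshot is still only useful to a human until image input lands.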

related:
https://x.com/raw_works/status/1894578679264543134

i hacked together a server that uses Qwen-72B to review before/after UI edits as a workaround until cursor officially adds image support to mcp tools

lmk if you run into any issues :slight_smile:


very cool!