"see what you've done" - multimodal yolo mode composer agent

raw.works · February 26, 2025, 1:20am

anyone figured out a “multimodal yolo mode composer agent”?

simple example: a script that makes an image. i don’t just want the agent to be able to grep around and see that it correctly bounced an image file, i want it to look at the image on its own and “see it’s creation”.

Let Composer Pass In Images In YOLO Mode is the closest thing i can find…

raw.works · February 26, 2025, 1:41am

not very promising:

very easy to prompt the agent to export an image, but it seems to think that it is not allowed to add the image or see the image.

raw.works · February 26, 2025, 1:44am

it’s going to be a long night at this rate

raw.works · February 26, 2025, 1:56am

in case this is useful to anyone else…here is what composer agent thinks it can do:

The user is asking about all the tool possibilities that I have available. This is a good opportunity to provide a comprehensive overview of the tools I can use to assist them. Based on the information provided in my system prompt, I have access to the following tools:

codebase_search - For semantic search of code in the codebase

read_file - To read the contents of a file

run_terminal_cmd - To propose and run terminal commands

list_dir - To list the contents of a directory

grep_search - For text-based regex search within files

edit_file - To propose edits to existing files

file_search - For fuzzy file path searching

delete_file - To delete files

reapply - To reapply edits when the initial application didn’t work as expected

fetch_rules - To fetch rules provided by the user about the codebase

diff_history - To retrieve the history of recent changes to files

I should provide a clear and concise explanation of each tool’s purpose and capabilities to the user.

I have the following tools available to assist you:

Codebase Search - Performs semantic search to find relevant code snippets based on your queries

Read File - Reads the contents of specified files, either partially or completely

Terminal Command Runner - Proposes and executes terminal commands (with your approval)

Directory Listing - Lists the contents of directories to explore file structure

Grep Search - Performs text-based regex searches for exact pattern matching in files

File Editor - Proposes and applies edits to existing files

File Search - Finds files using fuzzy matching against file paths

File Deletion - Removes files from the workspace

Edit Reapplication - Reapplies edits when the initial application didn’t work as expected

Rules Fetcher - Retrieves custom rules you’ve provided about the codebase

Diff History - Shows the history of recent changes made to files

Each tool is designed to help with different aspects of coding tasks, from exploring and understanding your codebase to making changes and running commands. Is there a specific task you’d like to work on using these tools?

and it is pretty sure it cannot read images.

I need to clarify that I can't directly read binary files like PNG images using the read_file tool. The read_file tool is designed to read text files, and attempting to read a binary image file would result in unreadable output or potentially an error.
Looking at the file path provided, it seems to be a screenshot of what might be a hirth joint (which is a type of mechanical joint often used in engineering). Instead of trying to read the binary file directly, I should explain this limitation to the user and suggest alternatives if they want to work with the image.

I’m unable to read PNG image files directly with the read_file tool as it’s designed for text files, not binary files like images. The read_file tool would only return unreadable binary data if I attempted to read the PNG file.

If you’d like to work with this image file, I could suggest alternatives:

You could describe what’s in the image to me

We could use terminal commands to verify the file exists or get basic information about it

If you need to process or analyze the image, we could write code to handle image files

Would you like to pursue any of these alternatives, or is there a different file you’d like me to read?

gandan.dev · February 26, 2025, 7:57am

You can build a custom MCP server! That way LLMs can reference images.

raw.works · February 26, 2025, 3:14pm

how would the mcp server add the file to the agents context?

i can see how a MCP server could call another LLM and tell the agent what it sees, but how can a MCP server actually return a file that the agent can see?

amxv · February 26, 2025, 4:16pm

I’ve been thinking about this too. thanks for sharing your work!

i think exposing the screenshots through resources could work since cursor recently added support for it?

will try it and let you know if it works

raw.works · February 26, 2025, 4:18pm

interesting!

can someone from the cursor team confirm that the MCP client configuration is set up to make this possible?

amxv · February 26, 2025, 4:29pm

in the meantime, you could read files from disk and send them to any vision model using your own api keys using GitHub - catalystneuro/mcp_read_images

raw.works · February 26, 2025, 4:44pm

thanks for this workaround!

i might get this desperate, but i’m really worried about creating a game of “telephone” with “the blind leading the blind”. at least in the context of cursor, i think the only thing that would have a hope of being practical is a unified chat history where the same model can “see” the image.

maybe i’m wrong - maybe i should think of the visual inspection as an e2e test and just call the same model for consistency, with a pass/fail criteria that it sends back.

amxv · February 26, 2025, 4:48pm

totally agree, the current agent has all the context, its only logical that it should be the one making the judgement of whether to continue editing or stop

gandan.dev · February 27, 2025, 12:33am

… screenshots as base64 data directly to the terminal, rather than saving them as image files that Cursor can reference …

For now you can work around this by having the server save screenshots to a local path and return that path instead of the base64 data

kimmomant · May 28, 2025, 10:58am

I made an MCP tool to connect to any model hosted locally in LM Studio:

Basically for the same purpose. The read_file tool won’t read the PNG file. So instead I run a vision-enabled LLM such as Gemma or LLaVA in LM Studio. There’s an MCP tool for LM Studio but that one doesn’t support image capabilities so I vibe coded an MCP tool that does.

Topic		Replies	Views
Let Composer Pass In Images In YOLO Mode Feature Requests	7	117	April 23, 2025
Ability to inspect images in a target @folder in chat Feature Requests	0	85	May 14, 2024
Read images from filesystem Discussions	1	289	March 23, 2025
YOLO Mode is Amazing! Feedback	40	34147	March 27, 2025
Chat Mode questions Discussions	1	98	March 26, 2025

"see what you've done" - multimodal yolo mode composer agent

Related topics