"see what you've done" - multimodal yolo mode composer agent

anyone figured out a “multimodal yolo mode composer agent”?

simple example: a script that makes an image. i don’t just want the agent to be able to grep around and confirm that it wrote out an image file, i want it to look at the image on its own and “see its creation”.

“Let Composer Pass In Images In YOLO Mode” is the closest thing i can find…

not very promising:

very easy to prompt the agent to export an image, but it seems to think that it is not allowed to add the image or see the image.

it’s going to be a long night at this rate 😉

in case this is useful to anyone else…here is what composer agent thinks it can do:

The user is asking about all the tool possibilities that I have available. This is a good opportunity to provide a comprehensive overview of the tools I can use to assist them. Based on the information provided in my system prompt, I have access to the following tools:
  1. codebase_search - For semantic search of code in the codebase
  2. read_file - To read the contents of a file
  3. run_terminal_cmd - To propose and run terminal commands
  4. list_dir - To list the contents of a directory
  5. grep_search - For text-based regex search within files
  6. edit_file - To propose edits to existing files
  7. file_search - For fuzzy file path searching
  8. delete_file - To delete files
  9. reapply - To reapply edits when the initial application didn’t work as expected
  10. fetch_rules - To fetch rules provided by the user about the codebase
  11. diff_history - To retrieve the history of recent changes to files

I should provide a clear and concise explanation of each tool’s purpose and capabilities to the user.

I have the following tools available to assist you:

  1. Codebase Search - Performs semantic search to find relevant code snippets based on your queries
  2. Read File - Reads the contents of specified files, either partially or completely
  3. Terminal Command Runner - Proposes and executes terminal commands (with your approval)
  4. Directory Listing - Lists the contents of directories to explore file structure
  5. Grep Search - Performs text-based regex searches for exact pattern matching in files
  6. File Editor - Proposes and applies edits to existing files
  7. File Search - Finds files using fuzzy matching against file paths
  8. File Deletion - Removes files from the workspace
  9. Edit Reapplication - Reapplies edits when the initial application didn’t work as expected
  10. Rules Fetcher - Retrieves custom rules you’ve provided about the codebase
  11. Diff History - Shows the history of recent changes made to files

Each tool is designed to help with different aspects of coding tasks, from exploring and understanding your codebase to making changes and running commands. Is there a specific task you’d like to work on using these tools?

and it is pretty sure it cannot read images.

I need to clarify that I can't directly read binary files like PNG images using the read_file tool. The read_file tool is designed to read text files, and attempting to read a binary image file would result in unreadable output or potentially an error.

Looking at the file path provided, it seems to be a screenshot of what might be a hirth joint (which is a type of mechanical joint often used in engineering). Instead of trying to read the binary file directly, I should explain this limitation to the user and suggest alternatives if they want to work with the image.

I’m unable to read PNG image files directly with the read_file tool as it’s designed for text files, not binary files like images. The read_file tool would only return unreadable binary data if I attempted to read the PNG file.

If you’d like to work with this image file, I could suggest alternatives:

  1. You could describe what’s in the image to me
  2. We could use terminal commands to verify the file exists or get basic information about it
  3. If you need to process or analyze the image, we could write code to handle image files

Would you like to pursue any of these alternatives, or is there a different file you’d like me to read?
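
for what it's worth, alternative 2 in that list (getting basic information about the file without "seeing" it) is easy to do by hand. a minimal sketch, assuming the file is a PNG: it reads only the 24-byte header and pulls the width and height out of the IHDR chunk. `png_size` is a made-up helper name, not a cursor tool, and this only shows that an image of some size exists, not that it looks right.

```python
import struct
from pathlib import Path

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(path):
    """Return (width, height) from a PNG's IHDR chunk, or raise ValueError."""
    header = Path(path).read_bytes()[:24]
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type ("IHDR"),
    # then width and height as big-endian unsigned 32-bit integers.
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"{path} does not look like a PNG")
    width, height = struct.unpack(">II", header[16:24])
    return width, height
```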

You can build a custom MCP server! That way LLMs can reference images.

how would the mcp server add the file to the agent's context?

i can see how an MCP server could call another LLM and tell the agent what it sees, but how can an MCP server actually return a file that the agent can see?
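
on the protocol side, an MCP tool result is a list of content items, and an item can be an image (base64 data plus a MIME type) rather than text. a rough sketch of that payload, built with only the standard library; `image_tool_result` is a hypothetical helper and no MCP SDK is used here. whether cursor's client actually forwards an image item to the model is the open question in this thread, so treat this as the shape of the data, not a confirmed fix.

```python
import base64
import mimetypes
from pathlib import Path


def image_tool_result(path):
    """Build a dict shaped like an MCP tool result holding one image item.

    Hypothetical helper: it only constructs the payload. Serving it from a
    real MCP server, and the client's handling of it, are outside this sketch.
    """
    data = Path(path).read_bytes()
    # Guess the MIME type from the file extension, e.g. ".png" -> "image/png".
    mime_type = mimetypes.guess_type(str(path))[0] or "application/octet-stream"
    return {
        "content": [
            {
                "type": "image",
                "data": base64.b64encode(data).decode("ascii"),
                "mimeType": mime_type,
            }
        ],
        "isError": False,
    }
```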

I’ve been thinking about this too. thanks for sharing your work!

i think exposing the screenshots through resources could work since cursor recently added support for it?

will try it and let you know if it works

interesting!

can someone from the cursor team confirm that the MCP client configuration is set up to make this possible?

in the meantime, you could read files from disk and send them to any vision model with your own api keys, using catalystneuro/mcp_read_images (on GitHub)

thanks for this workaround!

i might get this desperate, but i’m really worried about creating a game of “telephone” with “the blind leading the blind”. at least in the context of cursor, i think the only thing that would have a hope of being practical is a unified chat history where the same model can “see” the image.

maybe i’m wrong - maybe i should think of the visual inspection as an e2e test and just call the same model for consistency, with a pass/fail criteria that it sends back.
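
a sketch of what that "visual inspection as an e2e test" could look like, assuming the vision model is told to answer PASS or FAIL on the first line. everything here is hypothetical: `ask_model` stands in for whatever function sends an image and a prompt to a vision model and returns its text reply, and the reply format is my own convention, not anything cursor defines.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    reason: str


def parse_verdict(reply: str) -> Verdict:
    """Read PASS/FAIL from the first line; treat anything else as a failure."""
    lines = reply.strip().splitlines()
    first = lines[0].strip().upper() if lines else ""
    reason = " ".join(line.strip() for line in lines[1:]).strip()
    if first == "PASS":
        return Verdict(True, reason)
    if first == "FAIL":
        return Verdict(False, reason)
    # Fail closed so a rambling or empty reply never counts as a pass.
    return Verdict(False, f"unparseable reply: {reply!r}")


def visual_check(image_path: str, criteria: str,
                 ask_model: Callable[[str, str], str]) -> Verdict:
    """Ask a vision model (hypothetical callable) whether the image meets criteria."""
    prompt = (
        "Look at the attached image and judge it against these criteria:\n"
        f"{criteria}\n"
        "Answer PASS or FAIL on the first line, then one sentence explaining why."
    )
    return parse_verdict(ask_model(image_path, prompt))
```

failing closed on an unparseable reply is deliberate: the agent should only move on when the model clearly says PASS.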

totally agree. the current agent has all the context, so it’s only logical that it should be the one judging whether to keep editing or stop.

… screenshots as base64 data directly to the terminal, rather than saving them as image files that Cursor can reference …

For now you can work around this by having the server save screenshots to a local path and return that path instead of the base64 data
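
A minimal sketch of that workaround, assuming the server already holds the screenshot as a base64 string: decode it, write it to disk, and hand back the absolute path instead of the data. `save_screenshot`, its parameters, and the default file name are made up for illustration and are not part of any particular MCP server.

```python
import base64
import tempfile
from pathlib import Path


def save_screenshot(b64_png, out_dir=None, name="screenshot.png"):
    """Decode a base64 screenshot, write it to disk, and return its absolute path.

    Hypothetical helper: falls back to the system temp directory when no
    output directory is given.
    """
    directory = Path(out_dir) if out_dir else Path(tempfile.gettempdir())
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / name
    path.write_bytes(base64.b64decode(b64_png))
    return str(path.resolve())
```

The tool would then return the path string as ordinary text, so the large base64 blob never has to pass through the chat.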