anyone figured out a “multimodal yolo mode composer agent”?
simple example: a script that makes an image. i don’t just want the agent to be able to grep around and see that it correctly bounced an image file, i want it to look at the image on its own and “see its creation”.
in case this is useful to anyone else…here is what composer agent thinks it can do:
The user is asking about all the tool possibilities that I have available. This is a good opportunity to provide a comprehensive overview of the tools I can use to assist them. Based on the information provided in my system prompt, I have access to the following tools:
codebase_search - For semantic search of code in the codebase
read_file - To read the contents of a file
run_terminal_cmd - To propose and run terminal commands
list_dir - To list the contents of a directory
grep_search - For text-based regex search within files
edit_file - To propose edits to existing files
file_search - For fuzzy file path searching
delete_file - To delete files
reapply - To reapply edits when the initial application didn’t work as expected
fetch_rules - To fetch rules provided by the user about the codebase
diff_history - To retrieve the history of recent changes to files
I should provide a clear and concise explanation of each tool’s purpose and capabilities to the user.
I have the following tools available to assist you:
Codebase Search - Performs semantic search to find relevant code snippets based on your queries
Read File - Reads the contents of specified files, either partially or completely
Terminal Command Runner - Proposes and executes terminal commands (with your approval)
Directory Listing - Lists the contents of directories to explore file structure
Grep Search - Performs text-based regex searches for exact pattern matching in files
File Editor - Proposes and applies edits to existing files
File Search - Finds files using fuzzy matching against file paths
File Deletion - Removes files from the workspace
Edit Reapplication - Reapplies edits when the initial application didn’t work as expected
Rules Fetcher - Retrieves custom rules you’ve provided about the codebase
Diff History - Shows the history of recent changes made to files
Each tool is designed to help with different aspects of coding tasks, from exploring and understanding your codebase to making changes and running commands. Is there a specific task you’d like to work on using these tools?
and it is pretty sure it cannot read images.
I need to clarify that I can't directly read binary files like PNG images using the read_file tool. The read_file tool is designed to read text files, and attempting to read a binary image file would result in unreadable output or potentially an error.
Looking at the file path provided, it seems to be a screenshot of what might be a hirth joint (which is a type of mechanical joint often used in engineering). Instead of trying to read the binary file directly, I should explain this limitation to the user and suggest alternatives if they want to work with the image.
I’m unable to read PNG image files directly with the read_file tool as it’s designed for text files, not binary files like images. The read_file tool would only return unreadable binary data if I attempted to read the PNG file.
If you’d like to work with this image file, I could suggest alternatives:
You could describe what’s in the image to me
We could use terminal commands to verify the file exists or get basic information about it
If you need to process or analyze the image, we could write code to handle image files
Would you like to pursue any of these alternatives, or is there a different file you’d like me to read?
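for what it's worth, the "get basic information about it" option can be done without the agent seeing any pixels. a minimal sketch, assuming the file is a PNG: the signature and IHDR layout below come from the PNG format itself, while `png_info` is just a hypothetical helper name. it only proves the file is a plausibly sized PNG, not that the picture is right.

```python
import struct
from pathlib import Path

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_info(path):
    """Return (width, height) read from a PNG's IHDR chunk.

    Hypothetical helper: raises ValueError if the file does not
    start with a PNG signature followed by an IHDR chunk.
    """
    header = Path(path).read_bytes()[:24]
    if (
        len(header) < 24
        or header[:8] != PNG_SIGNATURE
        or header[12:16] != b"IHDR"
    ):
        raise ValueError(f"{path} does not look like a PNG")
    # Width and height are two big-endian 32-bit unsigned integers.
    width, height = struct.unpack(">II", header[16:24])
    return width, height
```

the agent could run something like this through run_terminal_cmd and read the printed numbers, which is still text-only verification.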
how would the mcp server add the file to the agent’s context?
i can see how an MCP server could call another LLM and tell the agent what it sees, but how can an MCP server actually return a file that the agent can see?
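on the protocol side, an MCP tool result is a list of content items, and one of the item types is an image carried as base64 data with a MIME type. a minimal sketch of building that payload with only the standard library is below. `image_tool_result` is a hypothetical helper name, and whether composer agent actually forwards image content to the model (rather than dropping it) is an assumption i have not verified.

```python
import base64
from pathlib import Path


def image_tool_result(path, mime_type="image/png"):
    """Build an MCP-style tool result that carries an image.

    Hypothetical helper. The dict shape follows the MCP tool-result
    format: a "content" list whose image items hold base64 "data"
    and a "mimeType". The client decides what to do with it.
    """
    data = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return {
        "content": [
            {"type": "image", "data": data, "mimeType": mime_type},
        ],
        "isError": False,
    }
```

if the client does pass the image item through, the same model that wrote the script gets to look at the output, with no second LLM describing it in between.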
i might get this desperate, but i’m really worried about creating a game of “telephone” with “the blind leading the blind”. at least in the context of cursor, i think the only thing that would have a hope of being practical is a unified chat history where the same model can “see” the image.
maybe i’m wrong - maybe i should think of the visual inspection as an e2e test and just call the same model for consistency, with pass/fail criteria that it sends back.
totally agree, the current agent has all the context, it’s only logical that it should be the one making the judgement of whether to continue editing or stop
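a rough sketch of that "visual inspection as an e2e test" idea, under some loud assumptions: `ask_model` is a hypothetical callable that sends an image path plus a prompt to whatever vision-capable model you use and returns its text reply. it is not a real Cursor or MCP API. the only real logic here is reducing the reply to a strict pass/fail verdict.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    reason: str


def visual_check(
    image_path: str,
    criteria: str,
    ask_model: Callable[[str, str], str],
) -> Verdict:
    """Ask a vision model to judge an image against criteria.

    ask_model(image_path, prompt) is a hypothetical hook supplied by
    the caller. Anything other than PASS on the first line of the
    reply is treated as a failure.
    """
    prompt = (
        f"Inspect the image. Criteria: {criteria}\n"
        "Answer PASS or FAIL on the first line, "
        "then give your reasoning on the next line."
    )
    reply = ask_model(image_path, prompt).strip()
    first_line, _, rest = reply.partition("\n")
    passed = first_line.strip().upper() == "PASS"
    return Verdict(passed=passed, reason=rest.strip())
```

treating every unclear reply as a fail keeps the check conservative: the agent keeps editing unless the model explicitly says PASS.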