Agents already close a feedback loop for text: they run linters and read the resulting errors. A similar loop should be feasible for images. An agent should view whatever images, video, or audio it generates and, at minimum, add the images back into its own context.
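A minimal sketch of that loop: the agent wraps its own rendered image as an inline message so it can inspect the output, much as it reads linter errors. The message shape below follows the common OpenAI-style chat format; the file name and prompt text are illustrative assumptions, and the keys should be adapted to whatever API the agent actually uses.

```python
import base64
from pathlib import Path

def image_feedback_message(image_path: str) -> dict:
    """Wrap a generated image as a user message so the agent can review
    its own output, mirroring how linter errors feed back as text.

    Assumes an OpenAI-style multimodal chat format; adjust the keys for
    other APIs.
    """
    data = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": "Review the image you just generated."},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{data}"},
            },
        ],
    }

# Placeholder bytes stand in for a real render in this sketch.
Path("render.png").write_bytes(b"\x89PNG fake bytes")
msg = image_feedback_message("render.png")
```

Appending `msg` to the conversation before the next model call is what closes the loop: the generated artifact becomes input the agent can critique and iterate on.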