Agents should be able to view images in chat and add it to their own prompt

Agents already produce text for feedback through running linters and reading their errors. So a similar loop should be feasible for images. An agent should view whatever images/videos/audio it generates and add at least images to its own context.