Auto-add images to context, or otherwise let models load images for vision

If I add a PNG to the chat context by dragging it in or @-mentioning it, vision-capable models can read it and do wonderful things. But if I just ask a model in agent mode to read the PNG and report on it, it can’t, and instead starts trying to write OCR programs in Python and such… Is it possible to let models use their vision automatically in YOLO mode? That would be really amazing.

+1, this would be great.