One kind of learning/training task might be something like this:
Input: a verbal description. The task is to search an image database for the top k matches to the description, possibly cropping the images. Each of these k results is given a score for how well it achieves the desired match.
In some cases, a rough photo is to be edited to make a good match. The resulting pattern of edits and the final product can also be studied.
Methodology Point: In similarity matching tasks, a non-geometric measure often outperforms geometric ones: Measure of the Feature Intersection / Measure of the Feature Union. Intuitively, there must be significant penalties for adding distracting, unwanted features to the match. This ratio supplies them, because an extra feature enlarges the union without enlarging the intersection. It will save time to build the prediction of that penalty into the algorithms.
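
A minimal sketch of the intersection-over-union measure, applied to the top-k search task described earlier. It assumes each image has already been reduced to a set of feature tags and that the "measure" is a simple count (which makes the ratio the Jaccard index). The names jaccard and top_k_matches, the tags, and the file names are all hypothetical, chosen only for illustration.

```python
import heapq


def jaccard(query_features, image_features):
    """Size of the feature intersection divided by size of the feature union.

    Assumption: features are hashable tags and the measure is a plain count.
    """
    q, f = set(query_features), set(image_features)
    union = q | f
    if not union:
        return 0.0  # two empty feature sets: treated as no match here
    return len(q & f) / len(union)


def top_k_matches(query_features, database, k):
    """Return the k best (score, image_name) pairs, highest score first."""
    scored = (
        (jaccard(query_features, feats), name)
        for name, feats in database.items()
    )
    return heapq.nlargest(k, scored)


# Hypothetical toy database: image name -> set of feature tags.
database = {
    "beach_plain.jpg": {"beach", "sunset"},
    "beach_busy.jpg": {"beach", "sunset", "crowd", "umbrella"},
    "forest.jpg": {"trees", "path"},
}

query = {"beach", "sunset"}
results = top_k_matches(query, database, k=2)
print(results)
```

In this toy data both beach images contain every requested feature, but the busy one carries two unwanted features, so its union is twice as large and its score is halved. That is the penalty for distracting features that the note asks for, and it comes from the ratio itself rather than from a separate rule.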