This was one of the motivations for `clip-retrieval`, a Faiss index over the CLIP embeddings (CLIP ViT-L/14, to be precise) of all the captions and images in the LAION5B-Aesthetic dataset.
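For anyone wondering what "an index over embeddings" boils down to, here is a toy sketch of the lookup step. It is not how `clip-retrieval` or Faiss is actually implemented: the 3-dimensional vectors and item names are made up, the helper names are hypothetical, and brute-force cosine similarity stands in for a real approximate-nearest-neighbour index over CLIP vectors.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query, index, k=2):
    """Return the k (score, name) pairs most similar to the query (brute force)."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(vec))), name)
        for name, vec in index.items()
    ]
    scored.sort(reverse=True)
    return scored[:k]

# Made-up embeddings; real CLIP vectors have hundreds of dimensions.
toy_index = {
    "tank_photo": [0.9, 0.1, 0.0],
    "lego_scene": [0.7, 0.6, 0.1],
    "cat_photo": [0.0, 0.2, 0.9],
}

# A made-up query embedding that happens to point towards "tank_photo".
results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
for score, name in results:
    print(f"{name}: {score:.3f}")
```

The query text and the images get embedded into the same space, so retrieval is only as good as how close the embedding of your wording lands to the embedding of the image you have in mind.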
I tried "a man with shopping bags stopping a tank" hoping to get the Tiananmen Tank Man, but had no luck, even with variations.
EDIT: It does return a blurry picture of the Tank Man and some LEGO re-enactments when I query "tiananmen tank man", but I was hoping it would deduce the picture from the description more intelligently.
https://rom1504.github.io/clip-retrieval
Try the reverse image search - it can be shockingly effective.
You can pretty easily rehost the index or build a lookup over your own data if you check the GitHub repo.
If you don't have any data of your own, enter a query and hit that download icon to get a CSV of `URL,Caption,CLIP score`.
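If you want to poke at that CSV afterwards, something like the following should work. It is a minimal sketch that assumes the file has a header row with exactly the column names `URL`, `Caption` and `CLIP score` (I have not checked the real export format); the sample rows and `load_results` helper are placeholders of my own.

```python
import csv
import io

def load_results(csv_file, score_column="CLIP score"):
    """Read result rows and sort them by score, highest first.

    Assumes a header row containing `score_column`; adjust the name
    if the real export uses a different header.
    """
    rows = list(csv.DictReader(csv_file))
    rows.sort(key=lambda row: float(row[score_column]), reverse=True)
    return rows

# Placeholder data standing in for a downloaded file.
sample = io.StringIO(
    "URL,Caption,CLIP score\n"
    "https://example.com/a.jpg,first placeholder caption,0.21\n"
    "https://example.com/b.jpg,second placeholder caption,0.34\n"
)

ranked = load_results(sample)
for row in ranked:
    print(row["CLIP score"], row["URL"], row["Caption"])
```

For a real download, pass a file handle instead of the `StringIO`, e.g. `open("results.csv", newline="")`, where `results.csv` is whatever name your browser saved it under.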