| Model | Human Metric | Auto Metric | Identify (Binary Accuracy) |
|---|---|---|---|
| Humans | 95 | - | 92 |
| Ground-truth Caption → Llama-2-7b (Oracle) | - | 71 | - |
| Ground-truth Caption → GPT3 (Oracle) | 68 | 70 | 74 |
| Ground-truth Caption → Llama-2-13b (Oracle) | - | 70 | - |
| Ground-truth Caption → GPT4 (Oracle) | - | 69 | - |
| Predicted Caption → GPT3 | 33 | 36 | 59 |
| Predicted Caption → Llama-2-7b | - | 36 | - |
| Predicted Caption → Llama-2-13b | - | 36 | - |
| Predicted Caption → GPT4 | - | 36 | - |
| InstructBLIP | - | 31 | - |
| LLaVA | - | 31 | - |
| BLIP2 FlanT5-XXL (Fine-tuned) | 27 | 27 | 73 |
| mPLUG-Owl | - | 24 | - |
| BLIP2 FlanT5-XL (Fine-tuned) | 15 | 18 | 60 |
| BLIP2 FlanT5-XXL (Zero-shot) | 0 | 12 | 50 |