Multimodal language models (MLMs) show promise for clinical decision support and diagnostic reasoning, raising the prospect of end-to-end automated medical image interpretation. However, clinicians are highly selective in adopting AI tools; a model that makes errors on seemingly simple perception tasks, such as determining image orientation or identifying whether a CT scan is contrast-enhanced, is unlikely to be adopted for clinical tasks. We introduce MedBLINK, a benchmark designed to probe MLMs for exactly these perceptual abilities. MedBLINK spans eight clinically meaningful tasks across multiple imaging modalities and anatomical regions, totaling 1,429 multiple-choice questions over 1,605 images.
We evaluate 19 state-of-the-art MLMs, including general-purpose models (GPT-4o, Claude 3.5 Sonnet) and domain-specific models (Med-Flamingo, LLaVA-Med, RadFM). While human annotators achieve 96.4% accuracy, the best-performing model reaches only 65%. These results show that current MLMs frequently fail at routine perceptual checks, suggesting the need to strengthen their visual grounding to support clinical adoption.
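
To make the evaluation protocol concrete, here is a minimal sketch of how a multiple-choice perception benchmark like MedBLINK can be scored. The dataset schema, the file name `medblink_questions.json`, and the `model.answer` interface are illustrative assumptions, not the benchmark's actual API.

```python
import json

def evaluate(model, questions_path="medblink_questions.json"):
    """Score a model on multiple-choice questions; returns overall accuracy."""
    with open(questions_path) as f:
        # Assumed schema: list of {"image", "question", "options", "answer"} records.
        questions = json.load(f)

    correct = 0
    for q in questions:
        # The model wrapper is assumed to return one of the provided option strings.
        prediction = model.answer(image_path=q["image"],
                                  question=q["question"],
                                  options=q["options"])
        if prediction == q["answer"]:
            correct += 1
    return correct / len(questions)
```

In this setup, per-task accuracies can be obtained the same way by grouping questions by task before averaging, which is how gaps between human annotators and models are typically reported.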
      
      
@misc{bigverdi2025medblinkprobingbasicperception,
  title={MedBLINK: Probing Basic Perception in Multimodal Language Models for Medicine},
  author={Mahtab Bigverdi and Wisdom Ikezogwo and Kevin Zhang and Hyewon Jeong and Mingyu Lu and Sungjae Cho and Linda Shapiro and Ranjay Krishna},
  year={2025},
  eprint={2508.02951},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2508.02951},
}