MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage
MedObvious exposes the Medical Moravec's Paradox in VLMs via a 1,880-task benchmark for clinical triage.
Key Findings
Methodology
MedObvious is a benchmark focusing on input validation in medical imaging, comprising 1,880 tasks across five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues. The benchmark evaluates 17 vision-language models (VLMs) and tests robustness across interfaces using five evaluation formats.
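To make the benchmark's structure concrete, the sketch below shows one way a task record could be represented. This is a hypothetical schema for illustration only; the field names, tier encoding, and format labels are assumptions, not the paper's actual data layout.

```python
from dataclasses import dataclass

@dataclass
class MedObviousTask:
    """Hypothetical record for one MedObvious task (illustrative only)."""
    task_id: str
    tier: int                  # 1-5: basic orientation/modality mismatches
                               # up to triage-style cues
    image_paths: list[str]     # small multi-panel image set
    eval_format: str           # one of five formats, e.g. "mcq" or "open_ended"
    question: str
    gold_answer: str           # e.g. index of the inconsistent panel, or "none"
    is_negative_control: bool  # True when every panel is consistent/normal
```

A negative-control flag at the record level makes it straightforward to compute false-alarm rates separately from overall accuracy.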
Key Results
- Among the 17 evaluated VLMs, several hallucinate anomalies on normal (negative-control) inputs; performance degrades when scaling to larger image sets; and measured accuracy varies substantially between multiple-choice and open-ended settings.
- The best mean accuracy reaches 63.2%, yet negative-control accuracy spans a wide range, indicating that false alarms on normal inputs remain common.
- There are large gaps between multiple-choice and open-ended variants, indicating strong format sensitivity.
Significance
This study reveals that pre-diagnostic verification remains unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment. MedObvious gives researchers a tool to evaluate and improve the input-validation capabilities of VLMs in medical imaging, enhancing the safety and reliability of these models in clinical applications.
Technical Contribution
Unlike existing medical VLM benchmarks, which primarily assess the correctness of final answers, MedObvious tests input validation as a distinct capability, emphasizing the visually obvious sanity checks that multi-image and agentic workflows depend on.
Novelty
MedObvious is the first benchmark focusing on input validation in medical imaging, emphasizing set-level consistency over small multi-panel image sets. Its novelty lies in revealing that models can produce coherent diagnostic narratives while ignoring basic sanity checks.
Limitations
- One limitation of this study is the use of simplified grids for testing rather than full multi-series volumes and interactive viewers.
- Models exhibit high false-alarm rates on normal inputs, indicating that normal-case calibration is a distinct problem from diagnostic fluency.
Future Work
Future work should extend to full multi-series volumes and interactive viewer-based evaluation to better simulate real clinical environments. Additionally, research should continue to explore how to improve model calibration on normal inputs to reduce false alarms.
AI Executive Summary
Vision Language Models (VLMs) are increasingly used in medical imaging for tasks like medical report generation and visual question answering. However, fluent diagnostic text does not guarantee safe visual understanding. In clinical practice, interpretation begins with pre-diagnostic sanity checks: verifying that the input is valid to read (correct modality and anatomy, plausible viewpoint and orientation, and no obvious integrity violations). Existing benchmarks largely assume this step is solved, and therefore miss a critical failure mode: a model can produce plausible narratives even when the input is inconsistent or invalid. We introduce MedObvious, a 1,880-task benchmark that isolates input validation as a distinct capability. MedObvious spans five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues, and includes five evaluation formats to test robustness across interfaces. Evaluating 17 different VLMs, we find that sanity checking remains unreliable: several models hallucinate anomalies on normal (negative-control) inputs, performance degrades when scaling to larger image sets, and measured accuracy varies substantially between multiple-choice and open-ended settings. These results show that pre-diagnostic verification remains unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment.
The introduction of MedObvious fills a gap in existing medical VLM benchmarks, which primarily assess the correctness of final answers while ignoring the importance of input validation. By emphasizing set-level consistency over small multi-panel image sets, MedObvious reveals that models can produce coherent diagnostic narratives while ignoring basic sanity checks. The significance of this study lies in providing researchers with a tool to better evaluate and improve the input validation capabilities of VLMs in medical imaging, thereby enhancing the safety and reliability of these models in clinical applications.
However, this study also has limitations. Firstly, the use of simplified grids for testing rather than full multi-series volumes and interactive viewers may limit the applicability of the results. Additionally, models exhibit high false-alarm rates on normal inputs, indicating that normal-case calibration is a distinct problem from diagnostic fluency. Therefore, future work should extend to full multi-series volumes and interactive viewer-based evaluation to better simulate real clinical environments.
In conclusion, MedObvious provides a new benchmark for pre-diagnostic visual sanity checking in medical VLMs, emphasizing the importance of sanity checks. By revealing the shortcomings of models in input validation, this study offers new perspectives and directions for improving the safety and reliability of VLMs in medical imaging. Future research should continue to explore how to improve model calibration on normal inputs to reduce false alarms and extend to more complex clinical environments for evaluation.
Deep Analysis
Background
With the advancement of artificial intelligence, Vision Language Models (VLMs) are increasingly applied in medical imaging. These models can generate radiology-style descriptions, answer clinical questions, and perform multi-step reasoning over images and text. Recently, general-purpose models such as GPT-4o, Flamingo, and LLaVA, as well as medical adaptations like LLaVA-Med and RadFM, have been used for core perception in medical imaging. Yet despite producing fluent diagnostic narratives, these models still exhibit significant gaps in basic sanity checks. Moravec's Paradox suggests that perception and spatial reasoning, trivial for humans, can be disproportionately difficult for machines even when higher-level outputs appear plausible. In medical imaging, this gap is consequential because failures occur before diagnosis: when the input is invalid or inconsistent, downstream reports become clinically uninterpretable.
Core Problem
In clinical practice, interpretation begins with pre-diagnostic sanity checks: clinicians first verify body part, view, modality, laterality, orientation, and basic image integrity, and they do not proceed to diagnosis if these checks fail. This requirement is amplified in multi-view ultrasound, multi-slice CT/MRI, and multi-panel viewer-based agentic workflows. Existing medical VLM benchmarks such as VQA-RAD, PathVQA, PMC-VQA, VQA-Med, and SLAKE primarily assess the correctness of final answers while ignoring input validation. As a result, models can produce coherent diagnostic narratives while skipping basic sanity checks, making them brittle and potentially unsafe in multi-image or agentic workflows.
Innovation
MedObvious is the first benchmark focusing on input validation in medical imaging, emphasizing set-level consistency over small multi-panel image sets. Its innovations include: 1) testing input validation as a distinct capability rather than assuming this step is solved; 2) evaluating models across five progressive tiers and probing robustness across interfaces with five evaluation formats; 3) introducing negative controls to directly measure false alarm rates, revealing models' calibration deficiencies on normal inputs.
Methodology
The design of MedObvious includes the following key steps:
- Task Construction: Create 1,880 tasks across five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues.
- Evaluation Formats: Include five evaluation formats to test model robustness across interfaces.
- Dataset Selection: Use various medical imaging datasets, including chest radiographs, CT, MRI, and ultrasound, defined by metadata filtering.
- Template Generation: Create tasks by inserting images from different categories or via controlled integrity violations (e.g., an orientation change or a physically inconsistent composite); a minimal sketch follows this list.
- Negative Controls: Introduce explicit negative controls to directly measure false alarm rates.
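As a rough illustration of the template-generation step, the sketch below builds a small panel grid and, for positive tasks, injects one controlled integrity violation (an orientation flip). The function names, panel size, and single violation type are assumptions for illustration; the paper's actual pipeline covers more violation categories, such as modality mismatches and physically inconsistent composites.

```python
import random
from PIL import Image  # Pillow

def build_panel_grid(panels, cols=2):
    """Tile equally sized panels into a single grid image."""
    w, h = panels[0].size
    rows = (len(panels) + cols - 1) // cols
    grid = Image.new("L", (cols * w, rows * h))
    for i, panel in enumerate(panels):
        grid.paste(panel, ((i % cols) * w, (i // cols) * h))
    return grid

def make_task(image_paths, negative_control=False):
    """Create one grid task; positives get one flipped (mis-oriented) panel."""
    panels = [Image.open(p).convert("L").resize((256, 256)) for p in image_paths]
    target = None
    if not negative_control:
        target = random.randrange(len(panels))
        panels[target] = panels[target].transpose(Image.Transpose.FLIP_LEFT_RIGHT)
    return build_panel_grid(panels), target  # target is None for negative controls
```

Keeping negative controls in the same generation path (just without the injected violation) helps ensure that positives and controls differ only in the violation itself, not in image statistics.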
Experiments
The experimental design evaluates 17 VLMs using five evaluation formats to test their robustness across interfaces. Datasets span various medical imaging sources, including chest radiographs, CT, MRI, and ultrasound. Evaluated models include general-purpose systems such as GPT-4o, Flamingo, and LLaVA, as well as medical adaptations like LLaVA-Med and RadFM. Evaluation metrics include accuracy and false-alarm rates on negative controls; key experimental variables include image-set size, task tier, and evaluation format. Ablation studies analyze the impact of different task tiers and evaluation formats on model performance.
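To make the reported metrics concrete, here is a minimal scoring sketch. The record fields (`correct`, `is_negative_control`, `flagged_anomaly`) are assumed names for illustration; the paper's exact bookkeeping is not specified in this summary.

```python
def score(records):
    """Compute overall accuracy and the false-alarm rate on negative controls.

    Each record is assumed to be a dict with:
      correct (bool)             -- model answer matched the gold label
      is_negative_control (bool) -- all panels were consistent/normal
      flagged_anomaly (bool)     -- model reported a violation
    """
    accuracy = sum(r["correct"] for r in records) / len(records)
    negatives = [r for r in records if r["is_negative_control"]]
    # A false alarm is an anomaly report on a clean, consistent image set.
    false_alarm_rate = (
        sum(r["flagged_anomaly"] for r in negatives) / len(negatives)
        if negatives else float("nan")
    )
    return {"accuracy": accuracy,
            "negative_control_false_alarm_rate": false_alarm_rate}
```

Reporting negative-control performance separately, as the paper does, prevents a model that over-flags anomalies from looking well calibrated on aggregate accuracy alone.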
Results
Among the 17 evaluated VLMs, several hallucinate anomalies on normal (negative-control) inputs, performance degrades when scaling to larger image sets, and measured accuracy varies substantially between multiple-choice and open-ended settings. The best mean accuracy reaches 63.2%, yet negative-control accuracy spans a wide range, indicating that false alarms on normal inputs remain common. Large gaps between multiple-choice and open-ended variants indicate strong format sensitivity. Ablation studies show significant performance differences across task tiers and evaluation formats, particularly for set-level consistency over multi-image sets.
Applications
MedObvious provides a new benchmark for pre-diagnostic visual sanity checking in medical VLMs. Direct application scenarios include input validation in medical imaging and clinical triage. Reliable deployment in these settings presupposes model calibration on normal inputs and set-level consistency over multi-image sets. For industry, the benchmark supports enhancing the safety and reliability of medical VLMs in clinical applications, reducing false alarms, and improving diagnostic accuracy.
Limitations & Outlook
One limitation of this study is the use of simplified grids for testing rather than full multi-series volumes and interactive viewers, which may limit the applicability of the results. Additionally, models exhibit high false-alarm rates on normal inputs, indicating that normal-case calibration is a distinct problem from diagnostic fluency. Computational costs are also a factor to consider, especially when evaluating on large-scale datasets and complex models. Future work should extend to full multi-series volumes and interactive viewer-based evaluation to better simulate real clinical environments and explore how to improve model calibration on normal inputs to reduce false alarms.
Plain Language (Accessible to Non-experts)
Imagine you're cooking in a kitchen. You have a recipe that lists the ingredients you need and the steps to follow. Now, suppose you have a smart assistant that helps you check if the ingredients are correct, like whether the eggs are fresh or the milk is expired. This is similar to how Vision Language Models (VLMs) work in medical imaging. They help doctors check if the images are correct, like whether the orientation is right or the modality matches. However, sometimes these assistants might make mistakes, like mistaking a bad egg for a good one or expired milk for fresh. That's why we need a tool like MedObvious to test these assistants' abilities, ensuring they don't make mistakes when checking ingredients. Through this tool, we can identify the shortcomings of these assistants in ingredient checking and help them improve accuracy and reduce errors.
ELI14 (Explained Like You're 14)
Hey there, buddy! You know when doctors look at X-rays, they don't just look at the image, they also make sure the image is correct? It's like when you're playing a game, you first check if the controller is connected and the console is on. Doctors need to check the image's orientation, modality, and more. That's when Vision Language Models (VLMs) come in as little helpers for doctors, helping them check these details. But sometimes these helpers make mistakes, just like you might press the wrong button sometimes. To make sure these helpers don't mess up, we need a tool called MedObvious to test their skills. This tool is like a super tester, helping us find out where the helpers fall short and making them smarter and more reliable. That way, doctors can use these helpers with more confidence!
Glossary
Vision Language Models (VLMs)
VLMs are AI models that combine visual and language capabilities, jointly processing images and text and typically generating text conditioned on visual input.
In this paper, VLMs are used for input validation and diagnostic text generation in medical imaging.
Input Validation
Input validation refers to checking the validity and consistency of input data before processing to ensure accuracy.
In this paper, input validation is used to check the orientation, modality, etc., of medical images.
Sanity Check
A sanity check involves verifying the basic integrity and consistency of data before performing complex analysis.
In clinical practice, sanity checks are used to verify the basic information of medical images.
Negative Control
A negative control is a sample used in experiments to test the false alarm rate of a model, typically a normal sample without anomalies.
In this paper, negative controls are used to measure the false alarm rate of models on normal inputs.
Multiple Choice (MCQ)
Multiple choice is an evaluation format that requires the subject to select one correct answer from multiple options.
In this paper, multiple choice is used to evaluate model performance in different tasks.
Open-ended Setting
An open-ended setting is an evaluation format that requires the subject to freely answer questions rather than selecting from preset options.
In this paper, open-ended settings are used to evaluate model performance in different tasks.
Ablation Study
An ablation study is an experimental method that evaluates the impact of removing or varying components of a model or experimental setup on overall performance.
In this paper, ablation studies are used to analyze the impact of different task tiers and evaluation formats on model performance.
Template Generation
Template generation is a method of creating experimental tasks by generating different test samples from preset templates.
In this paper, template generation is used to create tasks for MedObvious.
Multi-image Set
A multi-image set is a collection of multiple related images, typically used to evaluate model consistency across multiple views or slices.
In this paper, multi-image sets are used to test the input validation capabilities of models.
Moravec's Paradox
Moravec's Paradox suggests that perception and spatial reasoning, trivial for humans, can be disproportionately difficult for machines even when higher-level outputs appear plausible.
In this paper, Moravec's Paradox is used to explain the difficulties VLMs face in input validation.
Open Questions (Unanswered Questions from This Research)
1. Although MedObvious provides a new benchmark for input validation in medical VLMs, its applicability to full multi-series volumes and interactive viewers remains to be validated.
2. Current models exhibit high false-alarm rates on normal inputs, indicating that normal-case calibration is a distinct problem from diagnostic fluency; future research should explore how to improve models in this area.
3. The task design of MedObvious is based on simplified grids, which may limit its applicability in real clinical environments; future research should extend evaluation to more complex clinical settings.
4. Although MedObvious reveals the shortcomings of models in input validation, its impact on models' ability to generate coherent diagnostic narratives remains unclear and merits exploration.
5. Current evaluation formats mainly cover multiple-choice and open-ended settings; future research should explore additional formats to evaluate model capabilities more comprehensively.
Applications
Immediate Applications
Medical Imaging Input Validation
MedObvious can be used for input validation in medical imaging, helping doctors check the orientation, modality, etc., of images to ensure input validity and consistency.
Clinical Triage
Through sanity checks, MedObvious can help doctors quickly identify abnormal images in clinical triage, improving diagnostic efficiency.
Medical Education
MedObvious can be used as a tool in medical education to help students learn how to perform sanity checks on medical images.
Long-term Vision
Automated Medical Diagnosis
By driving improvements in the input-validation capabilities of VLMs, MedObvious can lay a foundation for automated medical diagnosis, reducing human error and improving diagnostic accuracy.
Intelligent Medical Assistants
MedObvious can support the development of intelligent medical assistants, helping doctors perform sanity checks in complex clinical environments and improving the quality of medical services.
Abstract
Vision Language Models (VLMs) are increasingly used for tasks like medical report generation and visual question answering. However, fluent diagnostic text does not guarantee safe visual understanding. In clinical practice, interpretation begins with pre-diagnostic sanity checks: verifying that the input is valid to read (correct modality and anatomy, plausible viewpoint and orientation, and no obvious integrity violations). Existing benchmarks largely assume this step is solved, and therefore miss a critical failure mode: a model can produce plausible narratives even when the input is inconsistent or invalid. We introduce MedObvious, a 1,880-task benchmark that isolates input validation as a set-level consistency capability over small multi-panel image sets: the model must identify whether any panel violates expected coherence. MedObvious spans five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues, and includes five evaluation formats to test robustness across interfaces. Evaluating 17 different VLMs, we find that sanity checking remains unreliable: several models hallucinate anomalies on normal (negative-control) inputs, performance degrades when scaling to larger image sets, and measured accuracy varies substantially between multiple-choice and open-ended settings. These results show that pre-diagnostic verification remains unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment.
References (20)
MedRAX: Medical Reasoning Agent for Chest X-ray
Adibvafa Fallahpour, Jun Ma, Alif Munim et al.
PathVQA: 30000+ Questions for Medical Visual Question Answering
Xuehai He, Yichen Zhang, Luntian Mou et al.
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao et al.
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
Chunyuan Li, Cliff Wong, Sheng Zhang et al.
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
Lasa Team, Weiwen Xu, Hou Pong Chan et al.
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu, Weiyun Wang, Zhe Chen et al.
Detecting and Evaluating Medical Hallucinations in Large Vision Language Models
Jiawei Chen, Dingkang Yang, Tong Wu et al.
OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding
Songtao Jiang, Yuan Wang, Sibo Song et al.
Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data
Chaoyi Wu, Xiaoman Zhang, Ya Zhang et al.
Radiology Objects in COntext (ROCO): A Multimodal Image Dataset
Obioma Pelka, Sven Koitka, Johannes Rückert et al.
Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Qingyang Wu et al.
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan et al.
VQA-Med: Overview of the Medical Visual Question Answering Task at ImageCLEF 2019
Asma Ben Abacha, Sadid A. Hasan, Vivek Datla et al.
Pixtral 12B
Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna et al.
Slake: A Semantically-Labeled Knowledge-Enhanced Dataset For Medical Visual Question Answering
Bo Liu, Li-Ming Zhan, Li Xu et al.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
Zhe Chen, Weiyun Wang, Yue Cao et al.
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning
Jiazhen Pan, Che Liu, Junde Wu et al.
VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge
Vishwesh Nath, Wenqi Li, Dong Yang et al.
3D Slicer as an image computing platform for the Quantitative Imaging Network
Andrey Fedorov, R. Beichel, Jayashree Kalpathy-Cramer et al.
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc et al.