MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage

TL;DR

MedObvious exposes the Medical Moravec's Paradox in VLMs via a 1,880-task benchmark for clinical triage.

cs.CV · 2026-03-25
Ufaq Khan, Umair Nawaz, L D M S S Teja, Numaan Saeed, Muhammad Bilal, Yutong Xie, Mohammad Yaqub, Muhammad Haris Khan
medical imaging, vision language models, input validation, clinical triage, safety

Key Findings

Methodology

MedObvious is a benchmark focusing on input validation in medical imaging, comprising 1,880 tasks across five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues. The benchmark evaluates 17 different vision language models (VLMs) and tests robustness across interfaces using multiple evaluation formats.
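To make the benchmark's structure concrete, here is a minimal sketch of how a MedObvious-style task record and per-tier grouping might look. The field names and values are illustrative assumptions, not taken from the paper's actual data schema.

```python
from dataclasses import dataclass

# Hypothetical task record for a MedObvious-style benchmark.
# Field names are illustrative assumptions, not taken from the paper.
@dataclass
class ValidationTask:
    task_id: str
    tier: int                  # 1-5: basic mismatches up to triage-style cues
    image_paths: list          # small multi-panel image set
    eval_format: str           # e.g. "mcq" or "open_ended"
    is_negative_control: bool  # True if the panel set is fully consistent
    answer: str                # expected verdict (e.g. which panel is wrong)

def tasks_by_tier(tasks):
    """Group tasks by tier so per-tier accuracy can be reported."""
    groups = {}
    for t in tasks:
        groups.setdefault(t.tier, []).append(t)
    return groups

tasks = [
    ValidationTask("t1", 1, ["a.png", "b.png"], "mcq", False, "panel 2"),
    ValidationTask("t2", 1, ["c.png", "d.png"], "mcq", True, "consistent"),
    ValidationTask("t3", 4, ["e.png", "f.png"], "open_ended", False, "wrong anatomy"),
]
print(sorted(tasks_by_tier(tasks)))  # → [1, 4]
```

Grouping by tier is what allows the paper's tier-wise comparisons (e.g. basic orientation checks vs. triage-style cues) to be reported separately.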

Key Results

  • Among the 17 evaluated VLMs, many models hallucinate anomalies on normal (negative-control) inputs, performance degrades when scaling to larger image sets, and measured accuracy varies substantially between multiple-choice and open-ended settings.
  • The best mean accuracy reaches 63.2%, yet negative-control accuracy spans a wide range, indicating that false alarms on normal inputs remain common.
  • There are large gaps between multiple-choice and open-ended variants, indicating strong format sensitivity.

Significance

This study reveals that pre-diagnostic verification remains unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment. By introducing the MedObvious benchmark, researchers can better evaluate and improve the input validation capabilities of VLMs in medical imaging, thereby enhancing the safety and reliability of these models in clinical applications.

Technical Contribution

Unlike existing medical VLM benchmarks, which primarily assess the correctness of final answers, MedObvious tests input validation as a distinct capability. It fills this gap by emphasizing visually obvious sanity checks, which matter especially in multi-image and agentic workflows.

Novelty

MedObvious is the first benchmark focusing on input validation in medical imaging, emphasizing set-level consistency over small multi-panel image sets. Its novelty lies in revealing that models can produce coherent diagnostic narratives while failing basic sanity checks.

Limitations

  • One limitation of this study is the use of simplified grids for testing rather than full multi-series volumes and interactive viewers.
  • Models exhibit high false-alarm rates on normal inputs, indicating that normal-case calibration is a distinct problem from diagnostic fluency.

Future Work

Future work should extend to full multi-series volumes and interactive viewer-based evaluation to better simulate real clinical environments. Additionally, research should continue to explore how to improve model calibration on normal inputs to reduce false alarms.

AI Executive Summary

Vision Language Models (VLMs) are increasingly used in medical imaging for tasks like medical report generation and visual question answering. However, fluent diagnostic text does not guarantee safe visual understanding. In clinical practice, interpretation begins with pre-diagnostic sanity checks: verifying that the input is valid to read (correct modality and anatomy, plausible viewpoint and orientation, and no obvious integrity violations). Existing benchmarks largely assume this step is solved, and therefore miss a critical failure mode: a model can produce plausible narratives even when the input is inconsistent or invalid. We introduce MedObvious, a 1,880-task benchmark that isolates input validation as a distinct capability. MedObvious spans five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues, and includes five evaluation formats to test robustness across interfaces. Evaluating 17 different VLMs, we find that sanity checking remains unreliable: several models hallucinate anomalies on normal (negative-control) inputs, performance degrades when scaling to larger image sets, and measured accuracy varies substantially between multiple-choice and open-ended settings. These results show that pre-diagnostic verification remains unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment.

The introduction of MedObvious fills a gap in existing medical VLM benchmarks, which primarily assess the correctness of final answers while ignoring the importance of input validation. By emphasizing set-level consistency over small multi-panel image sets, MedObvious reveals that models can produce coherent diagnostic narratives while ignoring basic sanity checks. The significance of this study lies in providing researchers with a tool to better evaluate and improve the input validation capabilities of VLMs in medical imaging, thereby enhancing the safety and reliability of these models in clinical applications.

However, this study also has limitations. First, the use of simplified grids for testing, rather than full multi-series volumes and interactive viewers, may limit how well the results transfer to practice. Second, models exhibit high false-alarm rates on normal inputs, indicating that normal-case calibration is a distinct problem from diagnostic fluency. Future work should therefore extend to full multi-series volumes and interactive viewer-based evaluation to better simulate real clinical environments.

In conclusion, MedObvious provides a new benchmark for pre-diagnostic visual sanity checking in medical VLMs, emphasizing the importance of sanity checks. By revealing the shortcomings of models in input validation, this study offers new perspectives and directions for improving the safety and reliability of VLMs in medical imaging. Future research should continue to explore how to improve model calibration on normal inputs to reduce false alarms and extend to more complex clinical environments for evaluation.

Deep Analysis

Background

With the advancement of artificial intelligence technology, Vision Language Models (VLMs) are increasingly applied in medical imaging. These models can generate radiology-style descriptions, answer clinical questions, and perform multi-step reasoning over images and text. Recently, general-purpose models such as GPT-4o, Flamingo, and LLaVA, as well as medical adaptations like LLaVA-Med and RadFM, have been used for core perception in medical imaging. However, despite their excellent performance in generating coherent diagnostic narratives, they still exhibit significant gaps in basic sanity checks. Moravec's Paradox suggests that perception and spatial reasoning, trivial for humans, can be disproportionately difficult for machines even when higher-level outputs appear plausible. In medical imaging, this gap is consequential because failures occur before diagnosis: when the input is invalid or inconsistent, downstream reports become clinically uninterpretable.

Core Problem

In clinical practice, interpretation begins with pre-diagnostic sanity checks: clinicians first verify body part, view, modality, laterality, orientation, and basic image integrity, and they do not proceed to diagnosis if these checks fail. This requirement is amplified in multi-view ultrasound, multi-slice CT/MRI, and multi-panel viewer-based agentic workflows. Existing medical VLM benchmarks such as VQA-RAD, PathVQA, PMC-VQA, VQA-Med, and SLAKE primarily assess the correctness of final answers while ignoring input validation. As a result, models can produce coherent diagnostic narratives while skipping basic sanity checks, leaving them brittle and potentially unsafe in multi-image or agentic workflows.

Innovation

MedObvious is the first benchmark focusing on input validation in medical imaging, emphasizing set-level consistency over small multi-panel image sets. Its innovations include: 1) testing input validation as a distinct capability rather than assuming this step is solved; 2) comprehensively evaluating model robustness across interfaces using five progressive tiers and five evaluation formats; 3) introducing negative controls to directly measure false alarm rates, revealing models' calibration deficiencies on normal inputs.
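The negative-control idea can be sketched in a few lines: a model "false-alarms" whenever it flags an anomaly on an input set that is actually fully consistent. This is a toy illustration of the metric's definition, not the paper's evaluation code.

```python
def false_alarm_rate(predictions):
    """predictions: list of (is_negative_control, model_flagged_anomaly) pairs.

    The false-alarm rate is the fraction of negative-control (fully normal)
    inputs that the model incorrectly flags as anomalous."""
    flags = [flagged for is_nc, flagged in predictions if is_nc]
    return sum(flags) / len(flags) if flags else 0.0

# 3 negative controls; the model falsely flags 1 of them.
preds = [(True, True), (True, False), (True, False), (False, True)]
print(round(false_alarm_rate(preds), 2))  # → 0.33
```

Without explicit negative controls, a model that flags every input as anomalous would look good on violation-only test sets, which is exactly the calibration blind spot the benchmark targets.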

Methodology

The design of MedObvious includes the following key steps:


  • Task Construction: Create 1,880 tasks across five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues.
  • Evaluation Formats: Include five evaluation formats to test model robustness across interfaces.
  • Dataset Selection: Use various medical imaging datasets, including chest radiographs, CT, MRI, and ultrasound, defined by metadata filtering.
  • Template Generation: Create tasks by inserting images from different categories or via controlled integrity violations (e.g., an orientation change or a physically inconsistent composite).
  • Negative Controls: Introduce explicit negative controls to directly measure false-alarm rates.
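The template-generation step above can be sketched as sampling a consistent panel set and then injecting a controlled violation by swapping in an image from another category. The category names, file names, and sampling logic here are assumptions for illustration, not the paper's pipeline.

```python
import random

# Sketch of template-based task generation: sample a consistent panel set,
# then optionally inject a controlled violation by swapping in an image
# from a different category. Category and file names are made up.
def make_task(pool, category, n_panels, violate, rng):
    panels = rng.sample(pool[category], n_panels)
    answer = "consistent"
    if violate:
        other = rng.choice([c for c in pool if c != category])
        idx = rng.randrange(n_panels)          # which panel to corrupt
        panels[idx] = rng.choice(pool[other])  # the "odd one out"
        answer = f"panel {idx + 1}"
    return panels, answer

pool = {
    "chest_xray_pa": ["cxr_01", "cxr_02", "cxr_03", "cxr_04"],
    "brain_mri_axial": ["mri_01", "mri_02", "mri_03"],
}
rng = random.Random(0)
panels, answer = make_task(pool, "chest_xray_pa", 3, violate=True, rng=rng)
print(len(panels), answer.startswith("panel"))  # → 3 True
```

Generating tasks from templates like this keeps the ground-truth answer known by construction, which is what makes large-scale automatic scoring possible.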

Experiments

The experiments evaluate 17 different VLMs under five evaluation formats to test robustness across interfaces. Datasets span various medical imaging modalities, including chest radiographs, CT, MRI, and ultrasound. Evaluated models include general-purpose VLMs such as GPT-4o, Flamingo, and LLaVA, as well as medical adaptations like LLaVA-Med and RadFM. Evaluation metrics include accuracy and false-alarm rate. Ablation studies analyze how task tier and evaluation format affect model performance.

Results

Among the 17 evaluated VLMs, many models hallucinate anomalies on normal (negative-control) inputs, performance degrades when scaling to larger image sets, and measured accuracy varies substantially between multiple-choice and open-ended settings. The best mean accuracy reaches 63.2%, yet negative-control accuracy spans a wide range, indicating that false alarms on normal inputs remain common. There are large gaps between multiple-choice and open-ended variants, indicating strong format sensitivity. Ablation studies show significant differences in model performance across different task tiers and evaluation formats, particularly in set-level consistency over multi-image sets.
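The reported MCQ vs. open-ended gap is just a difference in per-format accuracy. A minimal sketch of that aggregation, with toy numbers rather than the paper's results:

```python
from collections import defaultdict

def accuracy_by_format(records):
    """records: list of (eval_format, correct) pairs.

    Returns per-format accuracy, from which the multiple-choice vs.
    open-ended gap can be read off."""
    totals, hits = defaultdict(int), defaultdict(int)
    for fmt, correct in records:
        totals[fmt] += 1
        hits[fmt] += int(correct)
    return {fmt: hits[fmt] / totals[fmt] for fmt in totals}

# Toy numbers: the same model looks far better under MCQ than open-ended.
records = [("mcq", True), ("mcq", True), ("mcq", False),
           ("open_ended", True), ("open_ended", False), ("open_ended", False)]
acc = accuracy_by_format(records)
print(round(acc["mcq"] - acc["open_ended"], 3))  # → 0.333
```

A large positive gap like this toy 33-point difference is the kind of format sensitivity the paper flags: the same underlying capability is measured very differently depending on the interface.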

Applications

MedObvious provides a benchmark for pre-diagnostic visual sanity checking in medical VLMs. Direct application scenarios include input validation in medical imaging and clinical triage. Safe deployment presupposes good calibration on normal inputs and set-level consistency over multi-image sets. For industry, the benchmark can help enhance the safety and reliability of medical VLMs in clinical applications by reducing false alarms.

Limitations & Outlook

One limitation of this study is the use of simplified grids for testing rather than full multi-series volumes and interactive viewers, which may limit the applicability of the results. Additionally, models exhibit high false-alarm rates on normal inputs, indicating that normal-case calibration is a distinct problem from diagnostic fluency. Computational costs are also a factor to consider, especially when evaluating on large-scale datasets and complex models. Future work should extend to full multi-series volumes and interactive viewer-based evaluation to better simulate real clinical environments and explore how to improve model calibration on normal inputs to reduce false alarms.

Plain Language (accessible to non-experts)

Imagine you're cooking in a kitchen. You have a recipe that lists the ingredients you need and the steps to follow. Now, suppose you have a smart assistant that helps you check if the ingredients are correct, like whether the eggs are fresh or the milk is expired. This is similar to how Vision Language Models (VLMs) work in medical imaging. They help doctors check if the images are correct, like whether the orientation is right or the modality matches. However, sometimes these assistants might make mistakes, like mistaking a bad egg for a good one or expired milk for fresh. That's why we need a tool like MedObvious to test these assistants' abilities, ensuring they don't make mistakes when checking ingredients. Through this tool, we can identify the shortcomings of these assistants in ingredient checking and help them improve accuracy and reduce errors.

ELI14 (explained like you're 14)

Hey there, buddy! You know when doctors look at X-rays, they don't just look at the image, they also make sure the image is correct? It's like when you're playing a game, you first check if the controller is connected and the console is on. Doctors need to check the image's orientation, modality, and more. That's when Vision Language Models (VLMs) come in as little helpers for doctors, helping them check these details. But sometimes these helpers make mistakes, just like you might press the wrong button sometimes. To make sure these helpers don't mess up, we need a tool called MedObvious to test their skills. This tool is like a super tester, helping us find out where the helpers fall short and making them smarter and more reliable. That way, doctors can use these helpers with more confidence!

Glossary

Vision Language Models (VLMs)

VLMs are AI models that combine visual and language capabilities, capable of understanding and generating images and text.

In this paper, VLMs are used for input validation and diagnostic text generation in medical imaging.

Input Validation

Input validation refers to checking the validity and consistency of input data before processing to ensure accuracy.

In this paper, input validation is used to check the orientation, modality, etc., of medical images.

Sanity Check

A sanity check involves verifying the basic integrity and consistency of data before performing complex analysis.

In clinical practice, sanity checks are used to verify the basic information of medical images.

Negative Control

A negative control is a normal sample without anomalies, included in an experiment to measure how often a model raises a false alarm.

In this paper, negative controls are used to measure the false alarm rate of models on normal inputs.

Multiple Choice (MCQ)

Multiple choice is an evaluation format that requires the subject to select one correct answer from multiple options.

In this paper, multiple choice is used to evaluate model performance in different tasks.

Open-ended Setting

An open-ended setting is an evaluation format that requires the subject to freely answer questions rather than selecting from preset options.

In this paper, open-ended settings are used to evaluate model performance in different tasks.
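One reason open-ended scoring diverges from MCQ is that free-form answers must be matched leniently. A toy sketch of such a matcher; the normalization and containment rule are assumptions for illustration, not the paper's scoring protocol.

```python
import re

def normalize(text):
    """Lowercase and strip punctuation so free-form answers such as
    'Panel 2.' and 'panel 2' compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def open_ended_correct(prediction, gold):
    # Lenient containment check: the gold answer must appear somewhere
    # in the normalized model response.
    return normalize(gold) in normalize(prediction)

print(open_ended_correct("The odd one out is Panel 2.", "panel 2"))  # → True
print(open_ended_correct("All panels look consistent.", "panel 2"))  # → False
```

Any such matching rule introduces its own judgment calls, which is part of why measured accuracy can shift between MCQ and open-ended settings.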

Ablation Study

An ablation study is an experimental method that evaluates how removing or varying individual components of a system affects overall performance.

In this paper, ablation studies are used to analyze the impact of different task tiers and evaluation formats on model performance.

Template Generation

Template generation is a method of creating experimental tasks by generating different test samples from preset templates.

In this paper, template generation is used to create tasks for MedObvious.

Multi-image Set

A multi-image set is a collection of multiple related images, typically used to evaluate model consistency across multiple views or slices.

In this paper, multi-image sets are used to test the input validation capabilities of models.

Moravec's Paradox

Moravec's Paradox suggests that perception and spatial reasoning, trivial for humans, can be disproportionately difficult for machines even when higher-level outputs appear plausible.

In this paper, Moravec's Paradox is used to explain the difficulties VLMs face in input validation.

Open Questions (unanswered questions from this research)

  1. Although MedObvious provides a new benchmark for input validation in medical VLMs, its applicability to full multi-series volumes and interactive viewers remains to be validated.
  2. Current models exhibit high false-alarm rates on normal inputs, indicating that normal-case calibration is a distinct problem from diagnostic fluency; future research should explore how to improve models in this area.
  3. The task design of MedObvious is based on simplified grids, which may limit its applicability in real clinical environments; future research should extend evaluation to more complex clinical settings.
  4. Although MedObvious reveals models' shortcomings in input validation, its impact on their ability to generate coherent diagnostic narratives remains unclear and deserves further study.
  5. Current evaluation formats mainly cover multiple-choice and open-ended settings; future research should explore additional formats to evaluate model capabilities more comprehensively.

Applications

Immediate Applications

Medical Imaging Input Validation

MedObvious can be used for input validation in medical imaging, helping doctors check the orientation, modality, etc., of images to ensure input validity and consistency.

Clinical Triage

Through sanity checks, MedObvious can help doctors quickly identify abnormal images in clinical triage, improving diagnostic efficiency.

Medical Education

MedObvious can be used as a tool in medical education to help students learn how to perform sanity checks on medical images.

Long-term Vision

Automated Medical Diagnosis

By improving the input validation capabilities of VLMs, MedObvious can provide a foundation for automated medical diagnosis, reducing human errors and improving diagnostic accuracy.

Intelligent Medical Assistants

MedObvious can support the development of intelligent medical assistants, helping doctors perform sanity checks in complex clinical environments and improving the quality of medical services.

Abstract

Vision Language Models (VLMs) are increasingly used for tasks like medical report generation and visual question answering. However, fluent diagnostic text does not guarantee safe visual understanding. In clinical practice, interpretation begins with pre-diagnostic sanity checks: verifying that the input is valid to read (correct modality and anatomy, plausible viewpoint and orientation, and no obvious integrity violations). Existing benchmarks largely assume this step is solved, and therefore miss a critical failure mode: a model can produce plausible narratives even when the input is inconsistent or invalid. We introduce MedObvious, a 1,880-task benchmark that isolates input validation as a set-level consistency capability over small multi-panel image sets: the model must identify whether any panel violates expected coherence. MedObvious spans five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues, and includes five evaluation formats to test robustness across interfaces. Evaluating 17 different VLMs, we find that sanity checking remains unreliable: several models hallucinate anomalies on normal (negative-control) inputs, performance degrades when scaling to larger image sets, and measured accuracy varies substantially between multiple-choice and open-ended settings. These results show that pre-diagnostic verification remains unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment.

cs.CV cs.AI cs.CL

References (20)

  • Adibvafa Fallahpour, Jun Ma, Alif Munim et al. (2025). MedRAX: Medical Reasoning Agent for Chest X-ray.
  • Xuehai He, Yichen Zhang, Luntian Mou et al. (2020). PathVQA: 30000+ Questions for Medical Visual Question Answering.
  • Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao et al. (2023). PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering.
  • Chunyuan Li, Cliff Wong, Sheng Zhang et al. (2023). LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day.
  • Lasa Team, Weiwen Xu, Hou Pong Chan et al. (2025). Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning.
  • Jinguo Zhu, Weiyun Wang, Zhe Chen et al. (2025). InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models.
  • Jiawei Chen, Dingkang Yang, Tong Wu et al. (2024). Detecting and Evaluating Medical Hallucinations in Large Vision Language Models.
  • Songtao Jiang, Yuan Wang, Sibo Song et al. (2025). OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding.
  • Chaoyi Wu, Xiaoman Zhang, Ya Zhang et al. (2025). Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data.
  • Obioma Pelka, Sven Koitka, Johannes Rückert et al. (2018). Radiology Objects in COntext (ROCO): A Multimodal Image Dataset.
  • Haotian Liu, Chunyuan Li, Qingyang Wu et al. (2023). Visual Instruction Tuning.
  • Peng Wang, Shuai Bai, Sinan Tan et al. (2024). Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution.
  • Asma Ben Abacha, Sadid A. Hasan, Vivek Datla et al. (2019). VQA-Med: Overview of the Medical Visual Question Answering Task at ImageCLEF 2019.
  • Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna et al. (2024). Pixtral 12B.
  • Bo Liu, Li-Ming Zhan, Li Xu et al. (2021). Slake: A Semantically-Labeled Knowledge-Enhanced Dataset For Medical Visual Question Answering.
  • Zhe Chen, Weiyun Wang, Yue Cao et al. (2024). Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling.
  • Jiazhen Pan, Che Liu, Junde Wu et al. (2025). MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning.
  • Vishwesh Nath, Wenqi Li, Dong Yang et al. (2024). VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge.
  • Andrey Fedorov, R. Beichel, Jayashree Kalpathy-Cramer et al. (2012). 3D Slicer as an image computing platform for the Quantitative Imaging Network.
  • Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc et al. (2022). Flamingo: a Visual Language Model for Few-Shot Learning.