A Vision-language Framework for Comparative Reasoning in Radiology

TL;DR

Proposes MedReCo, an entity-aware vision-language framework, trained on a dataset of over 690,000 images, for clinical case retrieval and change description.

cs.CV · Advanced · 2026-06-05

Tengfei Zhang Ziheng Zhao Lisong Dai Xiaoman Zhang Pengcheng Qiu Ya Zhang Yanfeng Wang Weidi Xie

Medical Imaging AI Cross-image Reasoning Vision-Language Models Clinical Applications Large-scale Dataset

Key Findings

Methodology

This paper introduces an entity-aware cross-image reasoning framework that combines multimodal imaging data with structured clinical entity annotations. Its core components are the MedReCo visual encoder and the MedReCo-VLM generative model. The visual encoder uses modality-aware contrastive learning with entity-conditioned attention to learn fine-grained visual features aligned with specific clinical entities such as anatomical structures, abnormal findings, and pathological conditions. The dataset, MedReCo-DB, contains over 690,000 images from eight institutions, with reports decomposed into 42 anatomical, 69 abnormal-finding, and 28 pathological labels that supervise entity-conditioned retrieval and question answering. Training is multi-task, balancing contrastive learning for retrieval with transformer-based generative modeling for description. The framework supports controllable reference-case retrieval and natural-language comparative interpretation, and is validated through internal and external evaluations.
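As a rough illustration of what "entity-conditioned attention" could mean, the sketch below pools patch features with weights driven by an entity query. The summary does not specify the actual architecture, so the function names, the scaled dot-product scoring, and the toy vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entity_conditioned_pool(patch_features, entity_query):
    """Pool patch features with attention weights driven by an entity query.

    patch_features: list of patch vectors (each a list of floats).
    entity_query:   one vector of the same dimension, standing in for a
                    learned embedding of an entity (hypothetical here).
    """
    dim = len(entity_query)
    # Scaled dot-product score between the entity query and every patch.
    scores = [
        sum(q * p for q, p in zip(entity_query, patch)) / math.sqrt(dim)
        for patch in patch_features
    ]
    weights = softmax(scores)
    # Weighted sum of patches: an entity-specific image representation.
    pooled = [
        sum(w * patch[d] for w, patch in zip(weights, patch_features))
        for d in range(dim)
    ]
    return pooled, weights


patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # toy 2-D patch features
query = [4.0, 0.0]                               # toy entity embedding
pooled, weights = entity_conditioned_pool(patches, query)
```

Patches that align with the entity query receive most of the weight, so the pooled vector reflects evidence for that entity rather than a global average of the image.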

Key Results

In internal validation, MedReCo achieved the highest Recall@1 in all 12 retrieval settings, ahead of baselines such as CT-CLIP and MedCLIP. External validation on unseen institutions showed a mean improvement of 6.0 percentage points, indicating robustness. In clinically confusable groups, the model outperformed baselines by 10.9 points in Recall@1, distinguishing subtle differences. For temporal comparison, MedReCo-VLM attained 87.1% accuracy on public VQA benchmarks and improved longitudinal follow-up accuracy by 14.5-46.5 percentage points on chest X-ray and 13.0-27.9 percentage points on CT. These results support the model's capacity for fine-grained entity-aware reasoning and change detection, which matters for clinical decision support.

Significance

This work advances medical imaging AI by integrating entity-aware mechanisms into cross-image reasoning, aligning model capabilities with clinical diagnostic workflows. The large-scale dataset and multi-task training enable the model to recognize subtle anatomical and pathological differences, facilitating accurate case retrieval and change description. Such capabilities address longstanding challenges in medical AI, bridging the gap between technical performance and clinical utility. The framework's robustness across institutions and modalities underscores its potential for real-world deployment, promising improvements in diagnostic accuracy, treatment monitoring, and medical education. It sets a new benchmark for clinically aligned AI systems in radiology, fostering trust and adoption in healthcare settings.

Technical Contribution

The key technical innovation lies in the design of an entity-aware visual encoder that incorporates clinical entity conditions into the contrastive learning process, enabling fine-grained visual feature extraction aligned with specific anatomical or pathological labels. The model employs modality-aware encoders to handle heterogeneity across X-ray, CT, MRI, and ultrasound, with entity-conditioned attention mechanisms to isolate relevant visual evidence. The integration with large language models through instruction tuning allows for natural language generation of comparative descriptions, bridging visual features with clinical narratives. The creation of MedReCo-DB, a large-scale, entity-annotated multimodal dataset, provides a rich supervision source for training and evaluation. The multi-task training strategy combines contrastive ranking with generative question answering, resulting in a unified framework capable of both controllable retrieval and detailed description generation, outperforming existing state-of-the-art models.
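One common way to write a multi-task objective that combines contrastive ranking with generative question answering is a weighted sum, shown below. The weights $\lambda_{\text{ret}}$ and $\lambda_{\text{gen}}$, the symbols, and the assumption that both terms are optimized jointly are illustrative only; the summary does not state the exact form used.

```latex
\mathcal{L}_{\text{total}}
  = \lambda_{\text{ret}} \, \mathcal{L}_{\text{contrastive}}
  + \lambda_{\text{gen}} \, \mathcal{L}_{\text{gen}},
\qquad
\mathcal{L}_{\text{gen}}
  = -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t},\, I_1,\, I_2,\, q\right)
```

Here $I_1, I_2$ denote the image pair, $q$ the question or instruction, and $y_1, \dots, y_T$ the tokens of the target description.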

Novelty

This research is pioneering in embedding entity-aware mechanisms into a comprehensive cross-image reasoning framework for medical imaging. Unlike prior models that rely on global features or lack explicit entity supervision, MedReCo leverages structured clinical report annotations to guide fine-grained visual alignment. The integration of a large, multi-institutional dataset with entity-level labels enables scalable supervision, addressing the scarcity of annotated data in medical AI. The combination of contrastive learning with entity-conditioned attention and the adaptation of vision-language models for detailed comparative description constitutes a novel approach, setting a new direction for clinically aligned AI systems.

Limitations

The model's performance heavily depends on the quality and completeness of structured reports; incomplete or inaccurate annotations can impair entity alignment and retrieval accuracy.
Handling extremely rare or novel diseases remains challenging due to limited training examples, necessitating further research into few-shot or zero-shot learning strategies.
High computational costs associated with training multimodal, multi-task models limit scalability and real-time deployment in resource-constrained clinical environments.
Robustness to heterogeneity in imaging protocols, equipment, and reporting standards across different institutions still requires improvement, especially for global applicability.
Future work should focus on integrating additional data modalities (e.g., genomic, clinical notes) and developing more efficient architectures to facilitate widespread clinical adoption.

Future Work

Future research directions include enhancing model robustness to rare diseases and diverse data sources, integrating multi-modal data such as genomics and electronic health records, and developing lightweight architectures for real-time deployment. Additionally, incorporating active learning and domain adaptation techniques can improve generalization across different clinical settings. Expanding the dataset with more structured annotations and longitudinal data will further refine entity-aware reasoning. The ultimate aim is an end-to-end clinical decision support system that integrates into radiology workflows and provides accurate, explainable, and personalized diagnostic insights.

AI Executive Summary

The landscape of medical imaging AI has seen rapid advancements over recent years, driven by deep learning models excelling at isolated image interpretation tasks such as classification, segmentation, and report generation. Despite these technical achievements, a significant gap persists between AI capabilities and actual clinical practice, where radiologists rely heavily on comparative reasoning across multiple images and reference cases. Traditional models lack the fine-grained entity-level understanding necessary to support such nuanced comparisons, limiting their clinical utility.

This paper addresses this critical gap by proposing MedReCo, a novel entity-aware vision-language framework designed explicitly for radiological comparative reasoning. The core innovation lies in integrating structured clinical entities—anatomical structures, abnormal findings, and pathological conditions—into the visual encoding process. By leveraging a large-scale, multi-institutional dataset, MedReCo-DB, comprising over 690,000 images paired with detailed, entity-level report annotations, the authors enable the model to learn fine-grained visual features conditioned on specific clinical concepts.

The framework consists of two main components: the MedReCo visual encoder and the MedReCo-VLM generative model. The visual encoder employs modality-aware contrastive learning, augmented with entity-conditioned attention mechanisms, to produce visual representations aligned with clinical entities. This allows for controllable retrieval of similar cases based on specific entities, supporting differential diagnosis and case comparison. The MedReCo-VLM extends this visual foundation by connecting it with a large language model, enabling the generation of natural language descriptions of similarities, differences, and temporal changes between image pairs.
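Controllable retrieval can be pictured as ranking reference cases by the similarity of one entity-specific embedding at a time. The sketch below assumes each case already has one embedding per entity; the dictionary layout, the `retrieve` function, and the toy vectors are hypothetical and not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embeddings, gallery, entity, top_k=2):
    """Rank gallery cases by similarity of the embedding for one entity."""
    query_vec = query_embeddings[entity]
    scored = [
        (cosine(query_vec, case["embeddings"][entity]), case["case_id"])
        for case in gallery
        if entity in case["embeddings"]
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [case_id for _, case_id in scored[:top_k]]


# Toy gallery: one embedding per (case, entity) pair.
gallery = [
    {"case_id": "A", "embeddings": {"lung": [1.0, 0.0], "heart": [0.0, 1.0]}},
    {"case_id": "B", "embeddings": {"lung": [0.0, 1.0], "heart": [1.0, 0.0]}},
    {"case_id": "C", "embeddings": {"lung": [0.7, 0.7], "heart": [0.7, 0.7]}},
]
query = {"lung": [1.0, 0.1], "heart": [1.0, 0.1]}

lung_hits = retrieve(query, gallery, "lung")
heart_hits = retrieve(query, gallery, "heart")
```

The same query returns different reference cases depending on the entity chosen, which is the sense in which retrieval is "controllable".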

Extensive evaluations demonstrate the effectiveness of the approach. In internal validation, MedReCo outperformed baseline models across 12 retrieval tasks, achieving the highest Recall@1 and robustness in cross-center settings. External validation confirmed its generalization capability, with a mean improvement of 6.0 percentage points. For temporal comparison, MedReCo-VLM achieved 87.1% accuracy on public VQA benchmarks and significantly improved longitudinal follow-up accuracy on chest X-ray and CT datasets, with gains up to 46.5 percentage points.

These results highlight the potential of entity-aware cross-image reasoning to transform clinical workflows. The ability to retrieve clinically relevant cases and generate detailed, entity-specific descriptions aligns AI tools more closely with radiological reasoning processes, fostering trust and adoption. The large-scale dataset and multi-task training strategy set new standards for scalable, clinically grounded AI development. Looking ahead, future work will focus on expanding the model’s applicability to rare diseases, integrating additional data modalities, and optimizing computational efficiency, ultimately aiming to embed such systems seamlessly into routine clinical practice and improve patient outcomes.

Deep Analysis

Background

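In recent years, medical imaging AI has made notable progress: radiomics, deep convolutional networks (such as ResNet and DenseNet), and vision-language models (such as VisualBERT and GIT) have advanced automated diagnosis, report generation, and case retrieval. However, existing models mostly rely on global features and struggle with fine-grained, entity-level matching and difference description. In clinical practice, physicians often compare local structures across time points or across cases to identify subtle changes or similar cases, drawing on rich entity information and structured reports. A few earlier studies tried to introduce entity-aware mechanisms, but they lacked large-scale, multimodal, structured clinical data and did not systematically combine cross-image reasoning with generation tasks. With the spread of electronic medical records and reports, building large-scale entity-annotated datasets has become feasible, providing a foundation for entity-aware deep learning.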

Core Problem

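The core bottleneck for conventional medical imaging AI models in clinical use is the lack of fine-grained entity understanding and comparison. Matching on global features alone cannot capture local differences, so these models perform poorly when distinguishing subtly different or similar cases. The absence of entity-conditioned, controllable retrieval and description generation further limits their usefulness in clinical decision-making. Physicians need to make precise comparisons based on specific anatomical structures or pathological findings, and existing models cannot meet this need. Addressing the problem requires entity-aware mechanisms, combined with multimodal data and structured reports, to improve fine-grained understanding and cross-image reasoning.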

Innovation

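The main innovations of this study are: 1) an entity-aware visual encoding mechanism that uses anatomical structures, abnormal findings, and pathological conditions as conditions to guide visual feature learning, strengthening fine-grained matching; 2) the MedReCo-DB database, which decomposes structured reports to provide large-scale entity-level supervision and supports multimodal, multi-institutional clinical reasoning tasks; 3) a multi-task training strategy that combines contrastive learning with generative modeling for accurate entity-conditioned retrieval and description generation; 4) the coupling of the visual encoder with a large language model to support natural-language cross-image description and improve clinical interpretability.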

Methodology

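Data preparation: structured entity information is extracted from multi-institutional, multimodal clinical imaging reports to build MedReCo-DB. Reports are decomposed into 42 anatomical structures, 69 abnormal findings, and 28 pathological conditions, forming a multi-level label system.
Visual encoding: a modality-aware contrastive learning mechanism (for example, a modality-aware contrastive loss) trains the visual encoder to capture entity-conditioned fine-grained features. Entity-conditioned attention strengthens the model's focus on specific structures and abnormalities.
Training strategy: a multi-task learning framework that includes entity-conditioned contrastive loss optimization (such as InfoNCE) and generative question answering (a Transformer-based generative model), so that the model performs well on both retrieval and description.
Cross-modal fusion: visual features are combined with a large pretrained language model (such as GPT-3 or a similar architecture) and instruction-tuned for entity-aware description generation.
Evaluation design: validation across several scenarios, including internal validation, external validation, cross-center retrieval, and clinically confusable differential groups, using Recall@k, accuracy, and description-consistency metrics.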
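The contrastive objective named in the training strategy can be sketched as a standard image-to-text InfoNCE loss. This is a generic formulation under assumptions: the temperature of 0.07, the function name, and the use of already L2-normalized toy vectors are illustrative, and the entity conditioning is represented only by assuming both vectors in a pair were produced for the same entity.

```python
import math


def info_nce(image_vecs, text_vecs, temperature=0.07):
    """Average image-to-text InfoNCE loss over a batch of matched pairs.

    image_vecs[i] and text_vecs[i] form the positive pair; every other
    text vector in the batch acts as a negative. Vectors are assumed to
    be L2-normalized, so the dot product is a cosine similarity.
    """
    losses = []
    for i, img in enumerate(image_vecs):
        logits = [
            sum(a * b for a, b in zip(img, txt)) / temperature
            for txt in text_vecs
        ]
        # Stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the positive pair.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Matched pairs give a loss near zero; mismatched pairs are penalized.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss pulls each image embedding toward its paired report embedding and pushes it away from the other reports in the batch.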

Experiments

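Dataset: MedReCo-DB covers more than 690,000 images from 8 institutions and 7 modalities, including chest X-ray, CT, MRI, and ultrasound. The data are split into training, validation, and test sets, with multiple rounds of cross-validation.
Evaluation metrics: retrieval uses Recall@1, Recall@3, and Recall@5; description generation uses BLEU, BERTScore, METEOR, RaTEScore, and RadGraph F1.
Baselines: multimodal retrieval models including CT-CLIP, MedCLIP, PMC-CLIP, Biomed-CLIP, and HLIP, as well as vision-language models such as VisualBERT and GIT.
Hyperparameters: the Adam optimizer with learning-rate scheduling and a batch size of 128; the number of training epochs is adjusted according to validation performance.
Ablation study: the entity-conditioning mechanism, the modality-aware module, and multi-task training are removed one at a time to measure each component's contribution.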
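For readers unfamiliar with the retrieval metric, Recall@k can be computed as below. The sketch uses the common "hit rate" definition (a query counts as a success if any relevant case appears in its top k); whether the paper uses exactly this definition is an assumption, and the case IDs are made up.

```python
def recall_at_k(ranked_ids_per_query, relevant_ids_per_query, k):
    """Fraction of queries with at least one relevant case in the top k."""
    hits = 0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        if any(case_id in relevant for case_id in ranked[:k]):
            hits += 1
    return hits / len(ranked_ids_per_query)


# Three toy queries: ranked retrieval results and their relevant cases.
rankings = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant = [{"a"}, {"f"}, {"z"}]

r1 = recall_at_k(rankings, relevant, 1)  # only the first query hits at rank 1
r3 = recall_at_k(rankings, relevant, 3)  # the second query hits at rank 3
```

A gain reported in percentage points is the difference between two such values multiplied by 100, not a relative improvement.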

Results

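In internal validation, MedReCo achieved the highest Recall@1 in all 12 retrieval settings, ahead of baselines such as CT-CLIP and MedCLIP; on external data the mean improvement was 6.0 percentage points. In cross-center validation, Recall@1 on data from new institutions still improved by a reported 6.8-7.2 percentage points, indicating good generalization.
In clinically confusable differential groups, the model was strong at distinguishing subtle differences such as pulmonary artery dilation versus lymph node enlargement, improving Recall@1 by 10.9 percentage points.
In generation tasks, MedReCo-VLM reached 87.1% accuracy on public VQA benchmarks. In longitudinal follow-up, accuracy improved by 14.5-46.5 percentage points on chest radiographs and 13.0-27.9 percentage points on CT relative to baselines. These results support the model's strength in fine-grained entity awareness and change description.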

Applications

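Clinical case retrieval support: physicians can quickly retrieve historical cases similar to the current one to assist diagnosis and differential diagnosis and to work more efficiently.
Longitudinal follow-up analysis: automatically generated descriptions of disease change help physicians assess treatment response or disease progression.
Medical education and training: standardized descriptions based on entity-level differences help medical students and junior physicians understand fine-grained clinical distinctions and improve their diagnostic skills.
Telemedicine: support for remote image interpretation and case comparison can raise the level of care in remote areas.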

Limitations & Outlook

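The model depends on the quality of structured reports; missing or incorrect entity information in reports degrades performance.
Performance is limited on very rare or emerging diseases because training data are scarce.
Computational cost is high: training and inference demand substantial hardware, which limits rapid clinical deployment. Future work needs to optimize the model architecture, reduce complexity, and improve real-time performance.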

Plain Language Accessible to non-experts

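Imagine you work in a large factory that makes all kinds of products every day. Each product has many details, such as color, size, and material. The factory's job is to find products that match a customer's order, or to tell the customer how two products differ or how a product has changed. The traditional approach is like looking at each product's overall appearance through a magnifying glass, but sometimes you need to focus on one specific detail, such as color or material. The new method is like fitting the factory with "smart glasses" that can focus on a single detail, such as "has this product's color changed?" or "is this part the same as before?" By learning from a large volume of orders and product data, the factory learns to find matching products quickly and accurately, and to explain in natural language how two products differ. The factory's work becomes smarter, more efficient, and closer to what customers actually need.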

ELI14 Explained like you're 14

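Imagine you are playing a very complicated jigsaw game in which you need to find what two puzzle pieces have in common and how they differ. Sometimes you only care about one color on the piece, such as the blue part; sometimes you want to know about one detail, such as one piece having a small hole and the other not. Earlier puzzle helpers could only tell you whether two pieces were the same, not how they differed in a particular detail. The new helper is like having "special eyes": it can focus on the detail you care about, such as "has this hole become bigger?" or "has the color become darker?" It can also tell you in simple language, "this piece is brighter than before" or "this part has become larger." That lets you finish the puzzle faster and more accurately, and learn a lot about its details along the way. The technology is like giving doctors "super eyes" so that, when they look at images, they can find changes in a lesion or similar cases in a more detailed and intelligent way.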

Abstract

Medical imaging artificial intelligence has achieved strong performance in isolated image interpretation, but remains poorly aligned with radiological practice, where diagnosis and follow-up rely on comparison across prior studies and analogous reference cases. Here we formulate radiological comparison as an entity-aware cross-image reasoning problem and introduce a framework that supports both reference-case retrieval and temporal comparative interpretation. We construct MedReCo-DB, a large-scale comparative imaging resource derived from routine image-report pairs, comprising more than 690,000 images from over 160,000 patients across eight institutions, four countries and seven imaging modalities. Reports are decomposed into anatomical structures, abnormal findings and pathological conditions to provide supervision for entity-conditioned retrieval and comparative visual question answering. Using this resource, we develop MedReCo, an entity-aware visual encoder for controllable retrieval of clinically analogous cases, and MedReCo-VLM, a vision-language extension for generative interpretation of interval change. Across internal, external and cross-center evaluations, MedReCo achieved the highest Recall@1 in all 12 internal retrieval settings and improved external retrieval by a mean of 6.0 percentage points. In clinically confusable differential groups, it consistently outperformed the strongest baselines. MedReCo-VLM achieved the best performance across all comparative generation evaluations and improved longitudinal follow-up accuracy by 14.5-46.5 percentage points on chest radiographs and 13.0-27.9 percentage points on CT. These findings suggest that entity-aware comparative reasoning can be learned from routine clinical data at scale and may provide a more clinically aligned foundation for medical imaging AI.

cs.CV cs.IR cs.LG eess.IV
