Normal Guidance is what Attention Needs

TL;DR

Normal Guidance, a regularizer that pushes MIL attention weights toward a bell-shaped curve, improves slice-level localization across three CT datasets totaling over 4 million slices and outperforms existing MIL baselines.

cs.LG, 2026-05-27

Ethan Harvey, Dennis Johan Loevlie, Michael C. Hughes

Multiple Instance Learning, Attention Mechanism, Medical Imaging, Weak Supervision, Regularization

Key Findings

Methodology

This paper addresses weakly supervised classification of 3D medical images by proposing Normal Guidance, a regularization technique that encourages the learned attention weights in multiple instance learning (MIL) to follow a bell-shaped (normal) distribution. Slice-level embeddings come from a frozen Vision Transformer (ViT) encoder. For each scan, the method computes the empirical mean and variance of the attention weights and uses them to construct a discrete normal distribution as a reference. The model minimizes a divergence (KL divergence or squared error) between the learned attention distribution and this reference. Because the reference is built from the attention weights themselves, the location and width of the bell curve stay data-driven rather than fixed at the center slice. The approach is compatible with attention-based MIL (ABMIL) and transformer-based MIL (TransMIL), and extends to multi-head attention by applying separate normal guidance per head, effectively modeling multiple spatial foci. Experiments span three large-scale CT datasets (head, chest, abdomen) with over 4 million 2D slices, training under strict weak supervision (only scan-level binary labels) and evaluating slice-level localization and scan-level classification.

Key Results

Normal Guidance achieves slice-level localization AUROC of 0.871 (Head CT), 0.866 (Chest CT), and 0.663 (Abdomen CT), significantly outperforming ABMIL, TransMIL, and Smooth Operator baselines, as well as surpassing a simple center-focused Gaussian baseline (e.g., Chest CT baseline 0.78 vs. NG 0.866).
At scan-level classification, Normal Guidance maintains competitive performance, e.g., 0.925 AUROC on Head CT, close to the instance-label upper bound of 0.927, demonstrating that localization regularization does not compromise overall classification.
Multi-Head Normal Guidance further improves localization, particularly in scenarios with multiple spatially disjoint lesions, achieving 0.706 AUROC on a semi-synthetic dataset against a best-in-class ceiling of 0.884.

Significance

This work addresses a critical challenge in weakly supervised 3D medical image analysis: accurate slice-level lesion localization with only volume-level labels. It reveals that existing attention-based MIL methods lack sufficient spatial inductive bias, as a simple center-focused Gaussian baseline outperforms them. By integrating a normal distribution prior into attention regularization, the proposed Normal Guidance method effectively combines clinical spatial priors with data-driven learning, substantially improving slice-level localization while preserving scan-level classification. This advances interpretability and trustworthiness in clinical AI, reduces reliance on costly slice-level annotations, and lays groundwork for more precise and explainable weakly supervised medical imaging models.

Technical Contribution

The paper introduces a novel regularization framework that guides learned attention weights toward a discrete normal distribution parameterized by empirical mean and variance, a fundamental departure from prior uniform or entropy-based attention regularizations. It further extends this to multi-head attention, enabling modeling of multiple spatial foci akin to a Gaussian mixture model. The approach is implemented atop frozen ViT embeddings with linear classifiers, balancing computational efficiency and performance. Extensive experiments on large-scale, diverse CT datasets validate the method’s superiority in localization and competitiveness in classification, establishing new state-of-the-art results and practical upper bounds for weakly supervised MIL in 3D medical imaging.

Novelty

This study is the first to systematically incorporate a normal distribution prior as a guiding reference for attention weights in MIL for 3D medical image localization. Unlike prior works focusing on uniform or entropy-based regularization, Normal Guidance explicitly encodes spatial inductive bias reflecting clinical intuition that lesions cluster near central slices. The multi-head extension uniquely addresses multi-lesion scenarios by enabling multiple bell-shaped attention modes. This fundamentally shifts attention regularization from generic smoothing to clinically informed spatial guidance, filling a critical gap in weakly supervised medical imaging literature.

Limitations

The method does not explicitly model attention distributions for negative bags, leaving open how to regularize attention when no lesion is present, which may affect interpretability and generalization.
Attention weights, although better aligned with expert annotations, do not guarantee causal explanations for model decisions, limiting the trustworthiness of attention as an explanation.
Using a frozen ViT encoder and linear classifier simplifies training but restricts model expressiveness; end-to-end fine-tuning could improve performance but at significant computational cost.

Future Work

Future directions include developing attention regularization strategies tailored for negative bags to improve interpretability, conditioning the reference distribution on bag labels for dynamic guidance, exploring parameter-efficient fine-tuning to enhance encoder adaptability, and extending Normal Guidance to more complex spatial dependencies in whole-slide pathology images and multi-modal medical imaging, thereby broadening clinical applicability.

AI Executive Summary

Automated analysis of 3D medical images is pivotal for modern clinical diagnostics, yet accurately localizing lesions at the slice level under weak supervision remains a formidable challenge. Traditional multiple instance learning (MIL) approaches employ attention mechanisms to assign weights to individual slices, aiming to identify disease-relevant regions. However, recent findings reveal that a simple center-focused Gaussian baseline, which ignores image content, can outperform sophisticated attention and transformer-based MIL models in brain CT localization tasks, exposing a critical limitation in current methods’ ability to leverage spatial priors.

Motivated by this insight, the authors propose Normal Guidance, a novel regularization technique that encourages the learned attention weights to conform to a bell-shaped distribution over slice position. This approach combines the spatial prior behind the center-focused baseline with data-driven learning: attention is pushed toward a single contiguous region of slices while retaining flexibility to adapt to image-specific features. The method is compatible with both attention-based MIL (ABMIL) and transformer-based MIL (TransMIL), and extends to multi-head attention, allowing simultaneous focus on multiple spatially distinct lesions.

Technically, Normal Guidance computes the empirical mean and variance of attention weights per scan to construct a discrete normal distribution as a reference. During training, the model minimizes the divergence between the learned attention distribution and this reference via KL divergence or squared error, effectively regularizing attention towards clinically meaningful spatial patterns. This strategy enhances interpretability and localization accuracy without compromising whole-scan classification performance.
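The paragraph above can be written compactly as follows. This is a sketch under stated assumptions, not the authors' notation: a_i is the attention weight on slice i of an N-slice scan (weights sum to 1), μ and σ² are read as the attention-weighted mean and variance of the slice index, r is the normalized reference, D is the chosen divergence (KL or squared error), sg(·) denotes stop-gradient, and λ is a regularization weight.

```latex
\begin{aligned}
\mu &= \sum_{i=1}^{N} i\, a_i, \qquad
\sigma^2 = \sum_{i=1}^{N} a_i\,(i-\mu)^2, \\
r_i &= \frac{\exp\!\left(-\tfrac{(i-\mu)^2}{2\sigma^2}\right)}
            {\sum_{j=1}^{N} \exp\!\left(-\tfrac{(j-\mu)^2}{2\sigma^2}\right)}, \\
\mathcal{L} &= \mathrm{BCE}(\hat{y}, y) + \lambda\, D\big(a,\ \mathrm{sg}(r)\big).
\end{aligned}
```

The Gaussian normalizing constant is omitted from r_i because it cancels when the reference is renormalized over the N slices.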

Extensive experiments on three large-scale CT datasets encompassing head, chest, and abdomen scans with over 4 million 2D slices demonstrate that Normal Guidance significantly outperforms existing MIL baselines and the center-focused Gaussian baseline in slice-level localization AUROC (e.g., 0.871 on Head CT). The method maintains competitive scan-level classification accuracy close to upper bounds derived from instance-level labels. Multi-head Normal Guidance further improves performance in multi-lesion scenarios, underscoring its practical utility.

This work advances the state-of-the-art in weakly supervised 3D medical image analysis by introducing a principled spatial inductive bias into attention mechanisms. It enhances the clinical interpretability and trustworthiness of AI models, reduces dependence on costly slice-level annotations, and sets a new benchmark for slice-level localization. Future research will explore attention modeling for negative samples, dynamic label-conditioned regularization, and integration with end-to-end fine-tuning and multi-modal imaging to broaden clinical impact.

Deep Analysis

Background

Deep learning has revolutionized medical image analysis, particularly for 3D imaging modalities such as CT and MRI. Accurate detection and localization of lesions within volumetric scans are critical for diagnosis and treatment planning. However, acquiring detailed slice-level annotations is prohibitively expensive and time-consuming, motivating the use of weakly supervised learning approaches that rely solely on volume-level labels. Multiple Instance Learning (MIL) has emerged as a prominent framework in this context, modeling each 3D scan as a bag of 2D slice instances with only bag-level labels available during training. Attention-based MIL methods, such as ABMIL, assign weights to slices to highlight regions likely contributing to the diagnosis, offering interpretability and localization capabilities. Transformer-based MIL models, like TransMIL, further incorporate dependencies among slices via self-attention mechanisms. Despite these advances, recent studies have shown that a simple center-focused Gaussian baseline, which ignores image content and assigns attention weights based solely on slice position, can outperform complex attention and transformer-based MIL models in brain CT localization tasks. This counterintuitive finding highlights a critical gap: existing MIL attention mechanisms lack sufficient spatial inductive bias to effectively leverage anatomical priors inherent in medical imaging. Addressing this limitation is essential for improving slice-level lesion localization under weak supervision.

Core Problem

The core problem addressed is how to accurately localize lesion-bearing slices within 3D medical images using only coarse, volume-level binary labels during training. Existing attention-based MIL methods generate attention weights per slice but often fail to outperform naive spatial baselines, indicating insufficient incorporation of spatial prior knowledge. This deficiency hampers clinical applicability, as precise slice-level localization is crucial for interpretability, trust, and diagnostic efficiency. The challenge lies in designing mechanisms that effectively integrate spatial inductive biases reflecting clinical knowledge—such as lesions tending to cluster near central slices—while preserving the flexibility to adapt to image-specific features. Moreover, achieving this without sacrificing overall scan-level classification accuracy and under computational constraints remains difficult. The problem is compounded by the presence of multiple, spatially disjoint lesions and varying scan lengths, necessitating robust and generalizable solutions.

Innovation

The paper introduces several key innovations:

1. Normal Guidance Regularization: A novel framework that guides learned attention weights toward a discrete normal distribution parameterized by the empirical mean and variance of attention per scan. This explicitly encodes spatial prior knowledge that lesions are more likely near the scan center, addressing the lack of inductive bias in prior MIL attention mechanisms.

2. Multi-Head Normal Guidance: An extension for transformer-based MIL that applies separate normal guidance to each attention head, enabling the model to attend to multiple spatially disjoint regions simultaneously. This overcomes the unimodal limitation of single normal guidance and better models complex lesion distributions.

3. Comprehensive Empirical Validation: Extensive experiments on three large-scale, diverse CT datasets totaling over 4 million slices demonstrate the method’s superiority in slice-level localization and competitive scan-level classification, establishing new state-of-the-art results and practical performance ceilings.

These innovations collectively advance weakly supervised 3D medical image analysis by integrating clinically meaningful spatial priors into attention mechanisms, enhancing interpretability and localization accuracy.

Methodology

Method details:

Input Preparation: Each 3D CT scan is decomposed into a variable number of 2D axial slices. Each slice is encoded into a fixed-length embedding vector using a frozen Vision Transformer (ViT) pretrained on ImageNet, producing instance-level features.

MIL Framework: The set of slice embeddings forms a bag input to MIL models. Two architectures are considered: ABMIL, which uses a learned attention pooling mechanism to aggregate slice embeddings, and TransMIL, which employs multi-head self-attention to model inter-slice dependencies.
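To make the attention pooling step concrete, here is a minimal sketch of ABMIL-style pooling in plain Python. It assumes the non-gated tanh scoring form from the original ABMIL formulation; the summary does not say which variant the paper uses, and the names `abmil_pool`, `V`, and `w` are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def abmil_pool(embeddings, V, w):
    """Attention pooling over one bag of slice embeddings.

    embeddings: list of N vectors of length D (one per slice)
    V: hidden projection, H rows of length D (assumed parameterization)
    w: scoring vector of length H
    Returns (attention weights over slices, pooled bag embedding).
    """
    scores = []
    for h in embeddings:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * zi for wi, zi in zip(w, hidden)))
    attn = softmax(scores)  # one weight per slice, sums to 1
    dim = len(embeddings[0])
    pooled = [sum(a * h[d] for a, h in zip(attn, embeddings)) for d in range(dim)]
    return attn, pooled


# Toy bag: three slices with 2-dimensional embeddings.
bag = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn, pooled = abmil_pool(bag, V=[[0.5, -0.5], [1.0, 1.0]], w=[1.0, 2.0])
print(attn, pooled)
```

The per-slice weights in `attn` are the quantities that Normal Guidance regularizes and that are later scored for slice-level localization.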

Normal Guidance Construction:
For each scan, compute the empirical mean (μ) and variance (σ²) of the learned attention weights across slices.
Construct a discrete normal probability density function (PDF) over slice indices using μ and σ².
Normalize this PDF to form the reference attention distribution.
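The three construction steps can be sketched as below. The sketch assumes μ and σ² are the attention-weighted mean and variance of the slice index, which is the reading under which a PDF "over slice indices" makes sense; the summary does not spell this out. The variance floor and the name `normal_reference` are illustrative additions.

```python
import math


def normal_reference(attn, var_floor=1e-6):
    """Build a discrete normal reference over slice indices.

    attn: non-negative attention weights for one scan, summing to 1.
    var_floor: assumed safeguard against zero variance (one-hot attention).
    Returns (reference distribution, mu, var).
    """
    mu = sum(i * a for i, a in enumerate(attn))
    var = sum(a * (i - mu) ** 2 for i, a in enumerate(attn))
    var = max(var, var_floor)
    # Unnormalized Gaussian density at each slice index; the usual
    # 1 / sqrt(2 * pi * var) constant cancels in the normalization below.
    dens = [math.exp(-0.5 * (i - mu) ** 2 / var) for i in range(len(attn))]
    total = sum(dens)
    return [d / total for d in dens], mu, var


attn = [0.05, 0.10, 0.50, 0.30, 0.05]
ref, mu, var = normal_reference(attn)
print(mu, var, ref)
```

Because μ and σ² come from the attention weights themselves, the reference moves with the model's current focus instead of being pinned to a fixed slice.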

Regularization Objective:
Define a divergence metric (forward KL divergence, reverse KL divergence, or squared error) between the learned attention distribution and the reference normal distribution.
Incorporate this divergence as a regularization term weighted by a hyperparameter λ into the standard binary cross-entropy loss for scan-level classification.
During backpropagation, apply stop-gradient to the reference distribution so that gradients do not flow through μ and σ²; the reference acts as a fixed target at each step.
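A minimal sketch of the combined objective follows, in plain Python without autograd. The summary does not say which argument order the paper calls "forward" KL, so the sketch exposes a generic `kl(p, q)` and leaves the order to the caller; `guided_loss` and its default are illustrative. In an autograd framework the reference would be detached before this call, which is the role stop-gradient plays here.

```python
import math

EPS = 1e-12  # assumed numerical guard, not from the source


def kl(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + EPS) / (qi + EPS))
               for pi, qi in zip(p, q) if pi > 0)


def squared_error(p, q):
    """Sum of squared differences between two distributions."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def bce(prob, label):
    """Binary cross-entropy for one scan-level prediction."""
    prob = min(max(prob, EPS), 1.0 - EPS)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def guided_loss(prob, label, attn, ref, lam, divergence=kl):
    """Scan-level BCE plus lam times the attention-to-reference divergence.

    ref is treated as a constant target (the stop-gradient in the text).
    """
    return bce(prob, label) + lam * divergence(attn, ref)


attn = [0.1, 0.6, 0.3]
ref = [0.2, 0.6, 0.2]
print(guided_loss(0.8, 1, attn, ref, lam=1.0))
print(guided_loss(0.8, 1, attn, ref, lam=1.0, divergence=squared_error))
```

Swapping the arguments, `kl(ref, attn)`, gives the other KL direction; setting `lam=0` recovers the unregularized classification loss.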

Multi-Head Extension:
For TransMIL’s multi-head attention, compute separate normal reference distributions per head.
Average the divergence terms across heads to form the total regularization loss.
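The per-head averaging can be sketched as follows, again assuming the reference is built from the attention-weighted mean and variance of the slice index and using KL(attention || reference) as the divergence. Both choices, and the helper names, are assumptions for illustration.

```python
import math


def normal_reference(attn, var_floor=1e-6):
    """Discrete normal over slice indices, fitted to one head's attention."""
    mu = sum(i * a for i, a in enumerate(attn))
    var = max(sum(a * (i - mu) ** 2 for i, a in enumerate(attn)), var_floor)
    dens = [math.exp(-0.5 * (i - mu) ** 2 / var) for i in range(len(attn))]
    total = sum(dens)
    return [d / total for d in dens]


def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; zero-mass terms are skipped."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def multi_head_guidance(head_attns):
    """Average the per-head divergence to each head's own reference."""
    terms = [kl(attn, normal_reference(attn)) for attn in head_attns]
    return sum(terms) / len(terms)


# Two lesions far apart: one head per lesion versus one head covering both.
head_a = [0.1, 0.8, 0.1, 0.0, 0.0, 0.0, 0.0]
head_b = [0.0, 0.0, 0.0, 0.0, 0.1, 0.8, 0.1]
merged = [(a + b) / 2 for a, b in zip(head_a, head_b)]

print(multi_head_guidance([head_a, head_b]))  # two unimodal heads
print(multi_head_guidance([merged]))          # one bimodal head
```

In this toy case the two unimodal heads incur a much smaller penalty than the single bimodal head, which illustrates why separate per-head references suit spatially disjoint lesions.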

Training Setup:
Use only scan-level binary labels for training, validation, and hyperparameter tuning.
Evaluate slice-level localization using expert-annotated slice labels only at test time.
Optimize using stochastic gradient descent with momentum, early stopping based on validation AUROC, and grid search over learning rates and λ.
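The model-selection loop implied by this setup can be sketched as below. The point it illustrates is that selection uses only scan-level validation AUROC, so slice labels never enter training or tuning. The callable `train_and_validate` is hypothetical: it stands for a full training run (with early stopping) that returns the validation AUROC for one setting.

```python
from itertools import product


def grid_search(train_and_validate, learning_rates, lambdas):
    """Pick the (learning rate, lambda) pair with the best validation AUROC.

    train_and_validate: hypothetical callable (lr, lam) -> scan-level
    validation AUROC; slice-level labels are never consulted.
    """
    best = None
    for lr, lam in product(learning_rates, lambdas):
        val_auroc = train_and_validate(lr, lam)
        if best is None or val_auroc > best["val_auroc"]:
            best = {"val_auroc": val_auroc, "lr": lr, "lam": lam}
    return best


# Stand-in for a real training run, used only to exercise the loop.
def fake_run(lr, lam):
    return 0.9 - abs(lr - 0.01) - abs(lam - 1.0)


print(grid_search(fake_run, learning_rates=[0.1, 0.01], lambdas=[0.1, 1.0]))
```

The grid values shown are placeholders; the summary does not list the ranges the authors searched.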

This methodology effectively biases attention weights toward clinically plausible spatial patterns while preserving data-driven adaptability.

Experiments

Experimental design:

Datasets: Three large-scale public CT datasets covering different anatomical regions and pathologies:
Head CT: 21,744 scans, 752,803 slices, intracranial hemorrhage labels.
Chest CT: 7,279 scans, 1,790,594 slices, pulmonary embolism labels.
Abdomen CT: 4,711 scans, 1,500,653 slices, abdominal trauma labels.

Baselines: ABMIL, TransMIL, Smooth Operator (a smoothing-based MIL method), and a center-focused Gaussian baseline ignoring image content.
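For intuition, a content-agnostic center-focused baseline might look like the sketch below: every slice is scored purely by its distance from the middle of the scan. The width rule (`width_frac` times the number of slices) is a hypothetical choice; the summary does not state how the baseline's width is set.

```python
import math


def center_gaussian_scores(num_slices, width_frac=0.25):
    """Score slices by position only, peaking at the scan's middle slice.

    width_frac: assumed standard deviation as a fraction of scan length.
    Returns a distribution over slice indices (sums to 1).
    """
    if num_slices <= 0:
        raise ValueError("num_slices must be positive")
    center = (num_slices - 1) / 2
    sigma = max(width_frac * num_slices, 1e-6)
    dens = [math.exp(-0.5 * ((i - center) / sigma) ** 2)
            for i in range(num_slices)]
    total = sum(dens)
    return [d / total for d in dens]


scores = center_gaussian_scores(5)
print(scores)
```

Because the scores depend only on `num_slices`, every scan of the same length receives identical slice scores, which is what makes it notable that such a baseline can beat learned attention.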

Metrics: Slice-level localization evaluated by AUROC on positive bags only; scan-level classification evaluated by AUROC; AUPRC results provided in appendix.
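As a reminder of what the metric measures, AUROC equals the probability that a randomly chosen positive slice is scored above a randomly chosen negative one. The sketch below computes it by direct pair counting on a flat list of slices; how the paper pools slices across positive scans before computing AUROC is not stated in this summary.

```python
def auroc(scores, labels):
    """AUROC by pairwise comparison; ties count as half a win.

    scores: one score per slice (for localization, the attention weight)
    labels: 1 for lesion-bearing slices, 0 otherwise
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ranked correctly.
print(auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

Pair counting is quadratic in the number of slices; a rank-based implementation would be preferable at the scale of the datasets described here.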

Upper Bounds: Constructed best-in-class ceilings for localization and classification using instance-level labels and oracle pooling to contextualize results.

Training Details: Frozen ViT encoder with embedding size 768; linear classifier head; batch size 64; 1000 epochs; early stopping on validation AUROC; hyperparameter grid search.

Ablations: Compared divergence types (forward KL, reverse KL, squared error), single vs. multi-head guidance, and regularization strengths.

Reproducibility: Three random train/validation/test splits stratified by patient and class; results averaged with standard deviations reported.

This rigorous setup ensures robust evaluation of Normal Guidance’s effectiveness and generalizability.

Results

Key results:

Slice-Level Localization: Normal Guidance achieves AUROC of 0.871 (Head CT), 0.866 (Chest CT), and 0.663 (Abdomen CT), outperforming ABMIL, TransMIL, Smooth Operator, and the center-focused Gaussian baseline (e.g., Chest CT baseline 0.78 vs. NG 0.866).

Scan-Level Classification: Maintains competitive AUROC (e.g., 0.925 on Head CT), close to upper bounds derived from instance-level labels, indicating that the localization regularizer costs little classification accuracy.

Multi-Head Normal Guidance: Further improves localization, especially in multi-lesion scenarios, reaching 0.706 AUROC on semi-synthetic data against a best-in-class ceiling of 0.884.

Divergence Choice: Forward KL divergence yields slightly better localization than reverse KL or squared error.

Sensitivity Analysis: Moderate regularization strength works best; overly strong or weak regularization degrades results.

Qualitative Analysis: Attention maps guided by Normal Guidance are more focused and aligned with expert annotations, enhancing interpretability.

Transformer MIL vs. ABMIL: Transformer-based models marginally improve classification but not localization without Normal Guidance, underscoring the importance of spatial prior regularization.

Applications

Applications:

Weakly Supervised Lesion Localization: Enables accurate slice-level lesion identification in 3D medical images using only volume-level labels, reducing annotation burden and facilitating clinical workflows.

Clinical Decision Support: Integrates into radiology pipelines to highlight suspicious slices, aiding radiologists in diagnosis, improving efficiency and trust.

Multi-Organ CT Analysis: Applicable across brain, chest, and abdomen CT scans, supporting diverse pathologies such as hemorrhage, embolism, and trauma.

Extension to Other Modalities: Potential adaptation to MRI, PET, and whole-slide pathology images for broader medical imaging tasks.

Research Tool: Provides a benchmark and framework for developing spatially informed weakly supervised learning methods in medical imaging.

Limitations & Outlook

Limitations:

Negative Bag Attention: The method does not explicitly regularize attention distributions for negative bags, leaving ambiguity in attention behavior when no lesion is present, which may affect model interpretability and robustness.

Causal Interpretability: Although attention aligns better with expert annotations, it does not guarantee causal attribution, limiting the reliability of attention as an explanation for model decisions.

Model Expressiveness: Using a frozen ViT encoder and linear classifier restricts representational capacity; end-to-end fine-tuning could improve results but at high computational cost.

Computational Resources: Training transformer-based MIL models with Normal Guidance is resource-intensive, potentially limiting scalability.

Dataset Variability: Performance gaps on abdomen CT may reflect organ coverage variability and incomplete scans, suggesting dataset-specific challenges.

Abstract

We consider training classifiers for 3D medical images using only one binary label for the entire volume rather than a label for each 2D slice. In such weakly supervised settings, can we learn accurate classifiers for slice-level predictions? Attention-based multiple instance learning (MIL) can produce an attention score for every slice. Yet recent work demonstrates that a simple center-focused baseline that ignores image content can outperform attention-based and transformer-based MIL at slice-level classification of 3D brain scans. We show this baseline also outperforms existing MIL at slice-level classification of thoracic and abdominal CT scans. Motivated by this baseline, we propose Normal Guidance, a regularization technique that encourages the learned attention distribution to follow a bell-shaped curve. Across three medical imaging datasets totaling over 4 million 2D slices, we show our Normal Guidance enables attention-based and transformer-based MIL methods to deliver significantly better slice-level localization than the state-of-the-art while remaining competitive at whole-scan classification.
