Gaze Heads: How VLMs Look at What They Describe

TL;DR

This study identifies a small set of attention heads—gaze heads—in VLMs that causally track the current description region, enabling effective inference-time control via attention masks.

cs.CV Advanced 2026-06-13

Rohit Gandikota David Bau


vision-language models mechanistic interpretability attention mechanisms causal control model steering

Key Findings

Methodology

This paper uses mechanistic interpretability techniques, combining a correlation score computed from a few forward passes with causal interventions. By analyzing attention maps across layers, the authors identify a subset of heads, the gaze heads, whose attention tracks the image region being described. Comic strips serve as a controlled testbed because narrative order is laid out spatially, so the authors can measure how each head's attention shifts as the narrative progresses. The core approach computes a gaze score from the attention mass on the queried region, then applies a targeted additive bias (an attention mask) to these heads during inference. The bias redirects the heads' attention and thereby gives causal control over the model's output. The mechanism recurs across model sizes from 2B to 32B parameters and across several VLM architectures, although some frozen-encoder families show no comparable head set; it is validated on comic strips and on natural images from COCO. The approach requires no retraining, only forward passes, which keeps it cheap enough to apply at inference time.
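The gaze score described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the summary does not give the exact formula, so the score here is assumed to be the mean fraction of a head's image-token attention that falls on the queried region, averaged over forward passes. The names `region_mass`, `gaze_score`, and `top_gaze_heads` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Head = Tuple[int, int]                    # (layer index, head index)
Pass = Tuple[int, Sequence[float]]        # (queried region id, attention over image tokens)


def region_mass(attn: Sequence[float], region: Sequence[int]) -> float:
    """Fraction of a head's image-token attention that lands on `region`."""
    total = sum(attn)
    if total == 0.0:
        return 0.0
    return sum(attn[i] for i in region) / total


def gaze_score(passes: Sequence[Pass], regions: Dict[int, List[int]]) -> float:
    """Assumed score: mean attention mass on the queried region across passes."""
    masses = [region_mass(attn, regions[query]) for query, attn in passes]
    return sum(masses) / len(masses)


def top_gaze_heads(per_head: Dict[Head, Sequence[Pass]],
                   regions: Dict[int, List[int]], k: int) -> List[Head]:
    """Rank heads by gaze score and keep the k highest."""
    scores = {head: gaze_score(passes, regions) for head, passes in per_head.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy example: two panels of two image tokens each, one forward pass per queried panel.
regions = {0: [0, 1], 1: [2, 3]}
tracking = [(0, [0.4, 0.4, 0.1, 0.1]), (1, [0.1, 0.1, 0.4, 0.4])]  # follows the query
static = [(0, [0.4, 0.4, 0.1, 0.1]), (1, [0.4, 0.4, 0.1, 0.1])]    # always looks at panel 0
print(top_gaze_heads({(20, 3): tracking, (5, 1): static}, regions, k=1))
```

A head that follows the queried panel scores higher than one that always attends to the same panel, which is the property the ranking is meant to capture.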

Key Results

In comic strip experiments, fewer than 9% of attention heads (top 100 gaze heads) could be manipulated via a single attention mask to steer the model’s description toward a specific panel with 83.1% accuracy, significantly above chance (16.7%). Random heads failed to produce such control, confirming the causal role of gaze heads.
In natural images (COCO dataset), attention maps of gaze heads spatially ground to specific objects, and targeted interventions successfully shifted the model’s description focus to the desired regions, demonstrating the generality of the mechanism.
Across models from 2B to 32B parameters, the presence and causal influence of gaze heads persisted, although some frozen-encoder architectures showed no comparable heads, indicating architecture-dependent emergence of this mechanism.
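The two headline numbers can be sanity-checked with simple arithmetic. The sketch below assumes that the 16.7% chance rate corresponds to a uniform choice among six panels, which is an inference from the number rather than something the summary states; the helper names are hypothetical.

```python
import math


def min_total_heads(selected: int, max_fraction: float) -> int:
    """Smallest integer head count for which selected / total < max_fraction."""
    return math.floor(selected / max_fraction) + 1


def chance_accuracy(num_choices: int) -> float:
    """Accuracy of a uniform random guess among `num_choices` options."""
    return 1.0 / num_choices


# 100 heads being "fewer than 9%" of all heads implies at least this many heads in total.
print(min_total_heads(100, 0.09))
# Chance rate, in percent, for a six-way choice (assumed panel count).
print(round(100 * chance_accuracy(6), 1))
```

Under these assumptions the model must have more than 1,111 attention heads, and 83.1% accuracy is roughly five times the six-way chance rate.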

Significance

This work advances understanding of internal information routing in vision-language models by mechanistically pinpointing a small, causally effective subset of attention heads. The ability to steer model outputs at inference time through simple attention mask interventions offers a powerful, training-free tool for model interpretability, safety, and controllability. It bridges the gap between static attention visualization and causal mechanistic understanding, opening pathways for safer deployment and more transparent AI systems. The findings also suggest that internal attention dynamics are organized around a few key heads, which could inform future model design and training strategies aimed at enhancing controllability and robustness.

Technical Contribution

The paper introduces a correlation-based scoring method to identify attention heads that track specific image regions during description. It demonstrates that a small subset of heads, fewer than 9% of the total, is causally responsible for grounding visual regions into language, validated through targeted attention-mask interventions. The approach combines multi-pass attention analysis with causal manipulations, establishing a practical inference-time control mechanism. The study covers multiple model sizes and architectures and finds that the heads recur in most of them, with some frozen-encoder families as exceptions. It also enables dynamic control of generated descriptions, including mid-generation region switches, without retraining. These contributions deepen mechanistic understanding and provide practical tools for model steering.

Novelty

This is the first work to identify a specific, causally influential subset of attention heads, the gaze heads, that dynamically track and ground visual regions during description generation. Unlike prior static attention visualization or head pruning studies, this research employs a causal intervention framework, demonstrating that targeted attention-mask manipulations on these heads can reliably steer model outputs. The combination of dynamic, real-time control with mechanistic identification is a clear advance in understanding and manipulating multimodal models.

Limitations

The identified gaze heads are predominantly located in middle layers (layers 20-28), and their presence varies across architectures, especially in frozen-encoder models, indicating architecture-dependent emergence. The universality across all model types remains unconfirmed.
Interventions are currently limited to inference-time manipulations, lacking integration into training procedures, which may limit robustness and generalization in more complex or real-world scenarios.
Experiments are primarily conducted on comic strips and COCO images, which are relatively structured; applicability to more complex, unstructured real-world scenes needs further validation.
The mechanism’s reliance on attention masks may not fully capture all aspects of visual grounding, especially in cases requiring multi-head coordination or higher-level reasoning.

Future Work

Future research should explore the formation mechanisms of gaze heads during training, aiming to incorporate mechanistic constraints into training regimes for more robust and interpretable models. Extending the approach to larger, more diverse datasets and tasks, such as video understanding or multi-turn dialogues, will test the generality of the mechanism. Additionally, integrating causal control into training objectives could improve the stability and effectiveness of such interventions. Developing automated tools for identifying and manipulating key heads across architectures will further enhance model transparency and safety, paving the way for more controllable AI systems in real-world applications.

AI Executive Summary

Understanding how complex vision-language models internally route and ground visual information during description generation remains a fundamental challenge. Despite their impressive performance on tasks like image captioning and visual question answering, the internal mechanisms that enable these models to dynamically focus on relevant regions are largely opaque. Traditional interpretability methods, such as attention map visualization, provide static snapshots but fall short of establishing causal relationships between internal components and output behavior.

This gap has motivated recent efforts to dissect model internals mechanistically. In this context, the present study identifies a small subset of attention heads, termed gaze heads, that are causally responsible for tracking the image region currently being described. Using a simple correlation score derived from multiple forward passes, the authors locate these heads predominantly in the middle layers (layers 20-28) of the model. Their attention shifts in tandem with the narrative progression in comic strip descriptions, so they effectively act as internal 'gaze trackers.'

The core innovation lies in the ability to exert causal control over the model’s output by applying a targeted attention mask to these gaze heads. When the mask biases the heads to focus on a specific region, the model’s description shifts accordingly, achieving an 83.1% success rate in steering comic panel descriptions. This control extends beyond comics to natural images, where attention maps ground spatially to objects, and targeted interventions successfully redirect the model’s focus. Importantly, this mechanism is robust across model sizes from 2 billion to 32 billion parameters, although some architectures with frozen encoders show no such heads, indicating architecture-dependent emergence.

The implications of this work are profound. It demonstrates that a tiny, identifiable subset of internal components can serve as a practical, inference-time lever for controlling multimodal model behavior without retraining. This paves the way for safer, more interpretable AI systems capable of dynamic, real-time adjustments. The authors also showcase the ability to switch the focus mid-generation, enabling models to adapt their descriptions on the fly, a feature with potential applications in interactive AI, content moderation, and explainability.

While promising, the approach has limitations. Its effectiveness varies across architectures, and the current experiments are primarily on structured datasets like comics and COCO images. Extending the mechanism to more complex, unstructured real-world scenarios remains a challenge. Future work should focus on understanding how gaze heads are formed during training, integrating causal control into training procedures, and broadening the scope to diverse tasks and datasets. Overall, this research is a step toward mechanistic understanding and practical control of multimodal AI, in which models are not only powerful but also transparent and steerable.

Deep Analysis

Background

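Multimodal vision-language models (VLMs) have developed rapidly in recent years, moving from early single-modality pretraining to cross-modal alignment techniques such as CLIP and ALIGN, with breakthroughs in image understanding and natural language generation. Representative work includes ViLT, LXMERT, and UNITER, which fuse image and text through multimodal attention mechanisms. Even so, the attention mechanism inside these models remains a black box: it is hard to tell which attention heads handle a particular visual task, or how information is routed and integrated. Interpretability research has begun to identify key attention heads (for example, Image Heads and Localization Heads), but most of it focuses on static attention maps or feature visualization and lacks dynamic causal validation. Against this background, the paper proposes a mechanistic analysis that identifies a small number of gaze heads and verifies their causal role in description tasks, addressing a gap in the causal understanding of model internals.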

Core Problem

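Although multimodal models keep improving in performance, their internal information-routing mechanisms remain unclear. Specifically, how a model dynamically maps visual information to language output, and which attention heads play a decisive role during description, are still open questions. This limits controllability and safety, especially in applications that need to guide model behavior. Conventional attention visualization offers no causal evidence, so it cannot establish whether a given head actually influences the output. Solving the problem requires a mechanistic analysis tool that can identify the key subset of heads and verify their causal role in model behavior, enabling fine-grained control.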

Innovation

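The core innovation is a correlation-score method for identifying gaze heads, which uses multiple forward passes to track which attention heads switch their focus region as the description proceeds. The specific contributions are: 1) a "gaze score" metric that measures how well each head spatially tracks the described region across description stages; 2) a single attention-mask intervention that controls which region the model describes, giving causal control; 3) repeated experiments across model sizes and architectures that test how widely gaze heads occur and show their potential as a steering tool. These contributions go beyond static attention analysis and offer a new route to causal understanding of model internals.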

Methodology

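• Identifying gaze heads: on the comic strip dataset, run multiple forward passes, compute a correlation score for each attention head, and select the top 100 gaze heads.
• Correlation scoring: at different description stages, measure how concentrated each head's attention matrix is on the target region, and build the gaze score from that measurement.
• Attention-mask intervention: during inference, apply a bias that forces attention on the target region up (given as +∞) and suppresses other regions, steering the model to describe a specific region.
• Dynamic switching: during generation, switch the gaze heads' target in real time and observe how the description changes, verifying causal control.
• Multi-model validation: repeat the experiments across models (2B to 32B parameters) and architectures to test how general gaze heads are.
• Evaluation: use comic panel description, region localization on natural images (COCO), and visual question answering (VQA) to quantify the intervention effect and check statistical significance.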
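The attention-mask intervention listed above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' code: the bias is implemented as negative infinity on non-target image tokens (which has the same effect under softmax as boosting the target region), text tokens are left untouched, and the names `gaze_bias` and `steer_heads` are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple

Head = Tuple[int, int]  # (layer index, head index)
NEG_INF = float("-inf")


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax; entries at -inf receive probability 0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gaze_bias(num_keys: int, image_keys: Set[int], target_keys: Set[int]) -> List[float]:
    """Additive bias: -inf on image tokens outside the target region, 0 elsewhere."""
    if not target_keys:
        raise ValueError("target region must contain at least one key")
    return [NEG_INF if (k in image_keys and k not in target_keys) else 0.0
            for k in range(num_keys)]


def steer_heads(logits_by_head: Dict[Head, List[float]],
                gaze_heads: Set[Head], bias: List[float]) -> Dict[Head, List[float]]:
    """Apply the bias to the selected heads only, then normalize every head."""
    out = {}
    for head, logits in logits_by_head.items():
        if head in gaze_heads:
            logits = [l + b for l, b in zip(logits, bias)]
        out[head] = softmax(logits)
    return out


# Toy example: keys 0 and 5 are text tokens, keys 1-4 are image tokens, target is {3, 4}.
bias = gaze_bias(6, image_keys={1, 2, 3, 4}, target_keys={3, 4})
probs = steer_heads({(20, 3): [0.0] * 6, (5, 1): [0.0] * 6}, {(20, 3)}, bias)
print(probs[(20, 3)])
```

Only the selected head loses its attention on the non-target image tokens; the other head keeps its original distribution, mirroring the contrast between intervening on gaze heads and leaving the rest of the model alone.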

Experiments

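The experiments identify gaze heads on a comics dataset (COMICS) and verify their spatial tracking; verify spatial grounding on natural images (COCO); and apply attention masks in a VQA setting to measure how far the description is biased toward the target. The model is Qwen3-VL-8B (about 8 billion parameters), run with the eager attention implementation. Identification and intervention are repeated across model sizes and architectures to test how general the mechanism is. Key metrics include the gaze score, the intervention success rate (83.1% on comics, with similar redirection reported on natural images), and the regional bias of the model's description. A dynamic-switching experiment tests real-time control during generation. All experiments use random sampling and statistical testing to support the reliability of the results.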

Results

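The identified gaze heads are concentrated in the middle-to-late layers of the model (layers 20-28), and this pattern is consistent across several models. With a single attention-mask intervention, the model's description in the comic task is steered to the target panel 83.1% of the time, far above the effect of intervening on random heads. On natural images, the description after intervention shifts markedly toward the target object, confirming spatial grounding. In the dynamic-switching experiment, the target is switched every 50 generated tokens and the model quickly moves its description to the new region while staying coherent. Identification and control work well across the model sizes tested, indicating that the mechanism is fairly general.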
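The dynamic-switching setup can be sketched as a schedule that picks the gaze target for each decoding step. This is a hedged illustration: the 50-token interval comes from the summary above, while `gaze_schedule`, `generate_with_switching`, and the `decode_step` callback are hypothetical stand-ins for a real decoding loop that would apply the attention bias for the current target.

```python
from typing import Callable, List, Sequence


def gaze_schedule(step: int, targets: Sequence[int], interval: int = 50) -> int:
    """Target region for 0-based generation step `step`; holds the last target once exhausted."""
    index = min(step // interval, len(targets) - 1)
    return targets[index]


def generate_with_switching(decode_step: Callable[[List[str], int], str],
                            targets: Sequence[int], max_tokens: int,
                            interval: int = 50) -> List[str]:
    """Run a decoding loop, passing the scheduled gaze target to every step."""
    tokens: List[str] = []
    for step in range(max_tokens):
        target = gaze_schedule(step, targets, interval)
        tokens.append(decode_step(tokens, target))
    return tokens


# Stub decoder that just reports which panel it was steered toward.
def fake_step(tokens: List[str], target: int) -> str:
    return f"panel{target}"


print(generate_with_switching(fake_step, targets=[2, 5], max_tokens=4, interval=2))
```

Keeping the schedule separate from the decoding step means the same loop can serve a fixed target, a timed switch, or an interactive switch driven by user input.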

Applications

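The mechanism could be applied to multimodal content generation, interactive AI, and automated content moderation, where steering the model's attended region in real time allows targeted generation and behavior adjustment. For safety and controllability, it offers a fast control lever that needs no retraining. It could later be combined with training-time optimization to improve adaptability and robustness in complex scenes and to advance the interpretability and safety of multimodal systems.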

Limitations & Outlook

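The method has so far been validated mainly on comics and COCO, so its effectiveness and stability in complex real-world scenes still need to be tested. The intervention depends on model architecture, and some frozen-encoder architectures do not exhibit gaze heads. The intervention is applied only at inference time, with no training-time optimization of the mechanism, which may affect robustness. Future work should study how gaze heads form, how they behave at larger scale and in multi-task settings, and how training could be used to make the mechanism more stable.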

Plain Language Accessible to non-experts

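Imagine a kitchen where the chef (the model) has many assistants (attention heads), each watching different ingredients or pots and pans. Some assistants are especially good at keeping an eye on one particular ingredient, such as the vegetables or the meat, to make sure it is handled correctly. The study finds that a few special assistants (gaze heads) focus on whatever ingredient the chef is currently describing or handling. When the chef says "stir-fry the vegetables", these assistants concentrate on the vegetables so that the description and the action match. The researchers have a way to tell these assistants, on the spot, "look at the meat over there", and the chef's description changes with them to "pan-fry the meat". The chef's behavior can then be controlled and guided without retraining the whole kitchen. The finding helps explain how the "attention assistants" inside the model work, and it lets us control which region the model attends to when describing an image with a simple operation, for more controllable content generation.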

ELI14 Explained like you're 14

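Imagine you are playing a very complicated jigsaw puzzle, and you have lots of little helpers (attention heads) finding different parts of the puzzle for you. Some helpers are especially good at watching one part, like the sky or the trees. When you tell them "look at the sky", they focus on the sky piece and the puzzle comes together faster. The study finds that only a few of the helpers (fewer than about 10%) are actually tracking the area you are talking about. Even cooler, you can use a special kind of "magic" to make them look somewhere else, such as "look at the trees", and the focus of the puzzle switches to the trees. You do not have to retrain the whole game; with a small nudge you can control where the helpers look. It is like giving the model an "attention remote control" so that, when it describes a picture, it follows your lead and focuses on the part you want. The finding helps us understand how the model "looks" and "talks", and it could make future AI easier to control.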

Abstract

How a vision-language model internally solves the task of describing an image is far from obvious. We find that the model develops a specific mechanism for this: a small set of attention heads in its language-model backbone, which we call gaze heads, whose attention tracks the image region the model is currently describing. We find them with a simple correlation score from a few forward passes, using comic strips as a controlled testbed where narrative order is laid out spatially. These gaze heads do not just track the image tokens being described: redirecting their attention to a chosen region forces the VLM to describe that region instead. A single attention-mask intervention on the top-100 gaze heads, fewer than 9% of all heads, steers the model's answer to any chosen comic panel at 83.1% accuracy, while the same intervention on random heads fails to redirect the answer, and intervening on all heads destroys generation. The same lever also extends to continuous control: switching the gaze target mid-generation makes the model wrap up its current panel description and move to the new one within a few tokens. Beyond comics, the same intervention redirects answers to chosen regions in natural COCO images. The mechanism further recurs across model sizes from 2B to 32B parameters and across other VLM architectures, although some frozen-encoder families show no comparable head set. More broadly, this shows that targeted edits identified through mechanistic analysis can serve as practical inference-time levers for steering multimodal model behavior, without any retraining. Our code, interactive demo, and datasets are available at https://gaze.baulab.info/

cs.CV cs.CL cs.LG

