How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study
The study shows that LLMs and VLMs struggle to understand viewpoint rotation without vision, proposes the VRUBench dataset to measure this ability, and improves performance via selective fine-tuning of key attention heads.
Key Findings
Methodology
The study constructs the VRUBench dataset to probe the ability of LLMs and VLMs to understand viewpoint rotation without visual information. Layer-wise probing analysis and head-wise causal intervention reveal that models encode viewpoint information in their hidden states but struggle to bind viewpoint positions to the corresponding observations. Selective fine-tuning of key attention heads improves viewpoint rotation understanding (VRU) performance.
Key Results
- Result 1: On the VRUBench dataset, LLMs and VLMs perform poorly, with the highest accuracy reaching only 77.5%, while humans easily achieve 100%. This indicates a significant gap in spatial intelligence capabilities.
- Result 2: Selective fine-tuning of key attention heads significantly improves VRU performance while avoiding catastrophic forgetting of general abilities.
- Result 3: Experiments show VLMs outperform LLMs even without visual input, suggesting visual data training benefits text-based spatial tasks.
Significance
The study highlights the limitations of current large models in understanding spatial relationships without visual information, emphasizing the importance of visual data training for enhancing spatial intelligence. By employing selective fine-tuning, the research provides a method to improve task-specific performance without compromising general abilities, offering new insights for future spatial intelligence model development.
Technical Contribution
The research introduces the VRUBench dataset, focused on text-based viewpoint rotation understanding tasks, and uses layer-wise probing and head-wise causal intervention to reveal limitations in how models encode and bind viewpoint information. By selectively fine-tuning key attention heads, it demonstrates an effective method to enhance task-specific performance while avoiding catastrophic forgetting.
Novelty
This study is the first to systematically explore LLMs and VLMs' understanding of viewpoint rotation without visual information. It further shows that selective fine-tuning can significantly improve task-specific performance without compromising general abilities.
Limitations
- Limitation 1: Despite improvements in VRU performance, models still struggle with more complex spatial tasks, indicating a need for further enhancement.
- Limitation 2: The study focuses on text-based viewpoint rotation understanding, lacking comprehensive analysis of multimodal inputs.
- Limitation 3: The scalability and generalizability of the current method on large-scale datasets remain to be further validated.
Future Work
Future research could explore more complex spatial tasks, integrating multimodal inputs to enhance spatial intelligence. Additionally, further optimization of selective fine-tuning strategies is needed to improve scalability and generalizability on large-scale datasets.
AI Executive Summary
In recent years, spatial intelligence has become an active research topic in artificial intelligence, particularly with the development of large language models (LLMs) and vision-language models (VLMs). However, existing research largely focuses on visual-spatial intelligence, leaving open the question of whether linguistic intelligence alone can endow models with spatial intelligence in the absence of visual information. This study addresses that gap by exploring the ability of LLMs and VLMs to understand viewpoint rotation without visual input.
The research constructs the VRUBench dataset to systematically evaluate how well LLMs and VLMs understand viewpoint rotation from text-only inputs. Results show that while humans easily achieve 100% accuracy on this task, current models fall far short, with the best reaching only 77.5%, indicating a significant gap in spatial intelligence capabilities.
To uncover the underlying mechanisms of models' viewpoint rotation understanding, the study employs layer-wise probing analysis and head-wise causal intervention. Findings reveal that although models encode viewpoint information in hidden states, they struggle to bind viewpoint positions with corresponding observations, leading to hallucinations in the final layers.
To address this issue, the study selectively fine-tunes key attention heads, significantly improving models' viewpoint rotation understanding performance while avoiding catastrophic forgetting of general abilities. Experimental results demonstrate that selective fine-tuning not only enhances task-specific performance but also preserves models' general capabilities.
This study not only highlights the limitations of current models in spatial intelligence but also provides new directions for future model development. By employing selective fine-tuning, the research offers an effective method to improve task-specific performance without compromising general abilities, providing new insights for future spatial intelligence model development.
Deep Analysis
Background
Spatial intelligence involves the ability to perceive and mentally manipulate spatial relationships. With the development of large language models (LLMs) and vision-language models (VLMs), research on spatial intelligence has gained increasing attention. Traditionally, this research has focused on visual-spatial intelligence, where models acquire spatial information through visual inputs. However, spatial intelligence is not limited to visual perception; even blind individuals can perceive space through other senses (Gardner, 1983). Studying spatial intelligence without visual information is therefore worthwhile. Existing work primarily benchmarks and improves spatial intelligence using visual data, while viewpoint rotation understanding without visual information remains underexplored.
Core Problem
The core problem of this study is to explore whether LLMs and VLMs can understand viewpoint rotation without visual information, specifically viewpoint rotation understanding (VRU). Models need to infer their final viewpoint position and predict the corresponding observation after receiving textual descriptions of multi-step viewpoint rotations and observations. While humans can easily achieve 100% accuracy on this task, current models perform far below expectations, indicating a significant gap in spatial intelligence capabilities.
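To make the task concrete, below is a minimal Python sketch of the state tracking a VRU item demands; the four-direction room, 90-degree turns, and object names are invented for illustration and are not VRUBench's actual prompt format.

```python
# Purely illustrative: a four-direction room with 90-degree turns.
# VRUBench's actual prompts, action space, and environments may differ.
DIRECTIONS = ["north", "east", "south", "west"]

def final_viewpoint(start: str, turns: list[str]) -> str:
    """Apply a sequence of 'left'/'right' turns to a starting viewpoint."""
    idx = DIRECTIONS.index(start)
    for turn in turns:
        idx = (idx + (1 if turn == "right" else -1)) % 4
    return DIRECTIONS[idx]

# A model must track this state implicitly from text, then report the
# observation bound to the final viewpoint rather than any other one.
observations = {"north": "a door", "east": "a window",
                "south": "a shelf", "west": "a painting"}
final = final_viewpoint("north", ["right", "right", "left"])
print(final, "->", observations[final])  # east -> a window
```

The failure mode the paper describes corresponds to a model computing something like `final` correctly in its hidden states but then reporting an observation bound to a different viewpoint.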
Innovation
The core innovations of this study include:
1. Proposing the VRUBench dataset, focused on text-based viewpoint rotation understanding tasks, providing a new benchmark for evaluating models' spatial intelligence without visual information.
2. Employing layer-wise probing analysis and head-wise causal intervention to reveal models' limitations in encoding viewpoint information, particularly in binding viewpoint positions with corresponding observations.
3. Significantly improving models' viewpoint rotation understanding performance through selective fine-tuning of key attention heads, while avoiding catastrophic forgetting of general abilities.
Methodology
The methodology of this study includes the following key steps:
- Dataset Construction: Design the VRUBench dataset, providing textual descriptions of multi-step viewpoint rotations and observations and requiring models to predict the final observation.
- Layer-wise Probing Analysis: Evaluate how well viewpoint information can be decoded from hidden states at each layer, revealing where the encoding breaks down (a minimal probing sketch follows this list).
- Head-wise Causal Intervention: Use path patching to identify the attention heads with the greatest causal impact on viewpoint rotation understanding.
- Selective Fine-tuning: Fine-tune the identified key attention heads to improve performance on viewpoint rotation understanding tasks.
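To illustrate what layer-wise probing looks like in practice, the sketch below fits one linear probe per layer and compares decodability across depth. The hidden states here are random placeholders standing in for real cached activations, and this is not the authors' exact protocol.

```python
# Layer-wise probing sketch: one logistic-regression probe per layer.
# Random features stand in for cached model activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_layers, n_samples, d_model = 8, 400, 64
hidden = rng.normal(size=(n_layers, n_samples, d_model))  # placeholder activations
labels = rng.integers(0, 4, size=n_samples)               # 4 viewpoint classes

for layer in range(n_layers):
    X_tr, X_te, y_tr, y_te = train_test_split(
        hidden[layer], labels, test_size=0.25, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"layer {layer}: probe accuracy = {probe.score(X_te, y_te):.2f}")
```

With real activations, probe accuracy that rises through middle layers yet fails to translate into correct final answers is the signature the paper reports: viewpoint information is present but not correctly bound to observations.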
Experiments
The experimental design includes the following aspects:
- Dataset: Use the VRUBench dataset, containing textual descriptions of multi-step viewpoint rotations and observations, to evaluate models' viewpoint rotation understanding (a sketch of how such items could be generated follows this list).
- Baselines: Select multiple LLMs and VLMs as baseline models, including LLaMA2-7B-chat and the Qwen2.5-VL series.
- Evaluation Metrics: Use the accuracy of observation predictions as the evaluation metric, comparing models' performance on VRUBench.
- Ablation Studies: Evaluate the impact of selectively fine-tuning key attention heads on model performance.
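For concreteness, a VRUBench-style item could be generated as follows; the templates, action space, and field names are assumptions for illustration, not the released dataset's schema.

```python
# Hypothetical generator of a VRUBench-style item (illustrative schema).
import random

DIRS = ["north", "east", "south", "west"]
OBJECTS = ["a door", "a window", "a shelf", "a painting"]

def make_item(n_steps: int, seed: int) -> dict:
    rng = random.Random(seed)
    obs = dict(zip(DIRS, rng.sample(OBJECTS, k=4)))  # bind one object per direction
    turns = [rng.choice(["left", "right"]) for _ in range(n_steps)]
    idx = 0  # start facing north
    for t in turns:
        idx = (idx + (1 if t == "right" else -1)) % 4
    prompt = ("You face north. "
              + " ".join(f"To the {d} is {obs[d]}." for d in DIRS) + " "
              + " ".join(f"You turn {t}." for t in turns)
              + " What do you see now?")
    return {"prompt": prompt, "answer": obs[DIRS[idx]]}

print(make_item(n_steps=3, seed=0))
```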
Results
The results show that while humans easily achieve 100% accuracy on VRUBench, current LLMs and VLMs fall far short, with the best model reaching only 77.5%, indicating a significant gap in spatial intelligence capabilities. Selective fine-tuning of key attention heads significantly improves viewpoint rotation understanding while avoiding catastrophic forgetting of general abilities. Experiments also show that VLMs outperform LLMs even without visual input, suggesting that training on visual data benefits text-based spatial tasks.
Applications
The application scenarios of this study include:
- AI Assistants: Enhance AI assistants' spatial understanding capabilities without visual information, improving their performance in navigation, description, and other tasks.
- Education: Provide better spatial intelligence support for AI applications in education, helping students understand complex spatial relationships.
- Robot Navigation: Improve robots' navigation capabilities without visual information, enhancing their adaptability in complex environments.
Limitations & Outlook
Despite the progress made in improving models' viewpoint rotation understanding performance, there are still some limitations. First, while selective fine-tuning improves VRU performance, models still struggle with more complex spatial tasks, indicating a need for further enhancement. Second, the study focuses on text-based viewpoint rotation understanding, lacking comprehensive analysis of multimodal inputs. Additionally, the scalability and generalizability of the current method on large-scale datasets remain to be further validated. Future research could explore more complex spatial tasks, integrating multimodal inputs to enhance spatial intelligence.
Plain Language (Accessible to Non-experts)
Imagine you're in a completely dark room, holding a compass. You can't see anything around you and must rely only on the compass to judge your direction. Now, you turn around a few times and then have to tell someone which direction you're facing. This is the core of the viewpoint rotation understanding task: judging your direction from textual descriptions alone, without visual information.
In this study, scientists wanted to know if language models like ChatGPT could understand direction changes through text as humans do. They designed a series of tasks where these models had to determine their direction without visual information.
The results showed that these models performed much worse than humans in this aspect. To improve this, researchers conducted special training focused on the parts of the model responsible for direction judgment. After such training, the models' performance improved, but they still lagged behind humans.
This study tells us that while language models excel in many tasks, they still need further improvement in tasks requiring spatial perception. In the future, scientists might combine more sensory information to enhance these models' spatial intelligence.
ELI14 (Explained Like You're 14)
Hey there! Have you ever thought about what it would be like to be in a dark room where you couldn't see anything and had to rely on your sense of direction? That's the skill scientists are studying here: viewpoint rotation understanding.
Imagine you're playing a game where your character is in a maze, and you have to turn them based on text clues to find the exit. Scientists want to know if smart assistants like ChatGPT can find directions through text like we do.
They gave these assistants tasks to try and figure out their direction without any visual help. The results showed that these assistants didn't do as well as us humans. To make them smarter, scientists gave them special training, focusing on the parts of their brains responsible for direction. After this training, they got better, but still not as good as us.
This shows that while these assistants are smart, they still need to keep learning and improving in some tasks. In the future, scientists might make these assistants even smarter by combining more types of information!
Glossary
Viewpoint Rotation Understanding
The ability to judge and understand changes in one's viewpoint position through textual descriptions without visual information.
Used in the study to evaluate models' spatial intelligence without visual information.
Large Language Model (LLM)
An AI model trained on large amounts of text data, capable of generating and understanding natural language.
Used in the study to evaluate viewpoint rotation understanding without visual information.
Vision-Language Model (VLM)
An AI model trained on both visual and language data, capable of handling multimodal tasks.
Used in the study to compare viewpoint rotation understanding without visual information.
Layer-wise Probing Analysis
An analysis method that evaluates a model's ability to encode specific information by examining hidden states at different layers.
Used to reveal limitations in encoding viewpoint information.
Head-wise Causal Intervention
A method that assesses the causal impact of specific attention heads on model outputs by intervening in their activations.
Used to identify key attention heads affecting viewpoint rotation understanding.
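The paper's analysis uses path patching, which traces effects along specific computational paths; the sketch below shows the simpler, closely related idea of head-wise activation patching on a toy single-layer attention module with random weights: cache each head's output on a clean run, splice it into a corrupted run, and measure how much the output shifts.

```python
# Toy head-wise activation patching (a simplification of path patching).
import torch

torch.manual_seed(0)
d_model, n_heads, seq = 32, 4, 6
d_head = d_model // n_heads
Wq, Wk, Wv, Wo = (torch.randn(d_model, d_model) / d_model ** 0.5
                  for _ in range(4))

def split_heads(x, W):
    # (seq, d_model) -> (n_heads, seq, d_head)
    return (x @ W).reshape(seq, n_heads, d_head).transpose(0, 1)

def attention(x, patch=None):
    q, k, v = split_heads(x, Wq), split_heads(x, Wk), split_heads(x, Wv)
    weights = torch.softmax(q @ k.transpose(-1, -2) / d_head ** 0.5, dim=-1)
    head_out = weights @ v                  # (n_heads, seq, d_head)
    if patch is not None:                   # overwrite one head's output
        h, value = patch
        head_out = head_out.clone()
        head_out[h] = value
    return head_out.transpose(0, 1).reshape(seq, d_model) @ Wo, head_out

clean_x, corrupt_x = torch.randn(seq, d_model), torch.randn(seq, d_model)
_, clean_heads = attention(clean_x)         # cache clean per-head outputs
base_out, _ = attention(corrupt_x)
for h in range(n_heads):
    patched_out, _ = attention(corrupt_x, patch=(h, clean_heads[h]))
    print(f"head {h}: patch effect = {(patched_out - base_out).norm().item():.3f}")
```

Heads whose patched output moves the model furthest back toward the clean behavior are the "key heads" in this sense.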
Selective Fine-tuning
A strategy of fine-tuning only specific parts of a model to enhance task-specific performance while preserving general abilities.
Used to improve performance on viewpoint rotation understanding tasks.
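A minimal sketch of the freezing logic behind selective fine-tuning, using a generic PyTorch transformer; the flagged layer index and the per-layer (rather than per-head) granularity are illustrative, since the paper's actual selection comes from its causal intervention.

```python
# Freeze everything, then re-enable gradients only for the attention
# projections the analysis flagged. True head-level selection would
# further mask the rows of the fused in_proj_weight for chosen heads.
import torch.nn as nn

model = nn.Transformer(d_model=64, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2)

key_layers = {0}  # pretend the causal intervention flagged layer 0

for p in model.parameters():
    p.requires_grad = False
for name, p in model.named_parameters():
    if "self_attn" in name and any(f"layers.{i}." in name for i in key_layers):
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable} / {total}")
```

Because the optimizer only ever updates this small subset, the rest of the network, and hence most general abilities, stays untouched.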
VRUBench Dataset
A dataset focused on text-based viewpoint rotation understanding tasks, used to evaluate models' spatial intelligence without visual information.
Used in the study to evaluate models' viewpoint rotation understanding capabilities.
Hallucination
A phenomenon where a model generates outputs that are not faithfully grounded in the input information, producing incorrect or fabricated results.
Describes performance issues in viewpoint rotation understanding tasks.
Self-attention Mechanism
A mechanism in neural networks that dynamically adjusts the weights between different parts of an input sequence.
Key mechanism for models encoding viewpoint information.
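For reference, the standard scaled dot-product formulation behind this mechanism is $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(QK^{\top}/\sqrt{d_k}\right)V$, where $Q$, $K$, and $V$ are the query, key, and value projections of the input and $d_k$ is the per-head key dimension.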
Catastrophic Forgetting
A phenomenon where a model's performance on previously learned tasks significantly declines when trained on new tasks.
An issue avoided through selective fine-tuning in the study.
Open Questions (Unanswered Questions from This Research)
1. Current models still have limited spatial intelligence without visual information, especially in complex multi-step viewpoint rotation tasks. This indicates existing methods fall short in handling complex spatial relationships, requiring further research to enhance models' spatial perception capabilities.
2. Although selective fine-tuning improves task-specific performance, its scalability and generalizability on large-scale datasets have not been fully validated. Future research needs to explore more efficient fine-tuning strategies to enhance adaptability across different tasks.
3. The study focuses on text-based viewpoint rotation understanding, lacking comprehensive analysis of multimodal inputs. Integrating visual, auditory, and other multimodal information could significantly enhance models' spatial intelligence.
4. The experiments are primarily based on simulated environments and have yet to be validated in real-world scenarios. Future work needs to test model performance in more realistic settings to ensure reliability in practical applications.
5. While selective fine-tuning avoids catastrophic forgetting, it has limited effect on enhancing models' general abilities. Future research should explore how to further enhance general abilities while improving task-specific performance.
Applications
Immediate Applications
AI Assistants
Enhance AI assistants' spatial understanding capabilities without visual information, improving their performance in navigation, description, and other tasks.
Education
Provide better spatial intelligence support for AI applications in education, helping students understand complex spatial relationships.
Robot Navigation
Improve robots' navigation capabilities without visual information, enhancing their adaptability in complex environments.
Long-term Vision
Smart Cities
Enhance AI systems' spatial intelligence to achieve more efficient city management and resource allocation, promoting smart city development.
Human-Computer Interaction
Integrate multimodal information to enhance the naturalness and intelligence of human-computer interaction, achieving more efficient collaboration and communication.
Abstract
Over the past year, spatial intelligence has drawn increasing attention. Many prior works study it from the perspective of visual-spatial intelligence, where models have access to visuospatial information from visual inputs. However, in the absence of visual information, whether linguistic intelligence alone is sufficient to endow models with spatial intelligence, and how models perform relevant tasks with text-only inputs, remain unexplored. Therefore, in this paper, we focus on a fundamental and critical capability in spatial intelligence from a linguistic perspective: viewpoint rotation understanding (VRU). Specifically, LLMs and VLMs are asked to infer their final viewpoint and predict the corresponding observation in an environment given textual descriptions of viewpoint rotations and observations over multiple steps. We find that both LLMs and VLMs perform poorly on our proposed dataset while humans can easily achieve 100% accuracy, indicating a substantial gap between current model capabilities and the requirements of spatial intelligence. To uncover the underlying mechanisms, we conduct a layer-wise probing analysis and head-wise causal intervention. Our findings reveal that although models encode viewpoint information in the hidden states, they appear to struggle to bind the viewpoint position with the corresponding observation, resulting in hallucination in the final layers. Finally, we selectively fine-tune the key attention heads identified by causal intervention to improve VRU performance. Experimental results demonstrate that such selective fine-tuning achieves improved VRU performance while avoiding catastrophic forgetting of generic abilities. Our dataset and code will be released at https://github.com/Young-Zhen/VRU_Interpret.