Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

TL;DR

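Proposes Astra, a framework that couples an RL-trained VLM policy (Astra-VL) with a Bagel-based world simulator (Astra-WM) for imagination-driven spatial reasoning. Astra-WM raises simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1% to 49.5%, and Astra-VL raises the Qwen3-VL backbone from 29.8% to 38.8%.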

cs.CV 🔴 Advanced 2026-06-05 196 views

Chenming Zhu, Jingli Lin, Yilin Long, Peizhou Cao, Tai Wang, Jiangmiao Pang, Xihui Liu


visual reasoning, spatial understanding, reinforcement learning, world simulation, multi-view reasoning

Key Findings

Methodology

This paper introduces the Astra framework, which couples Astra-VL, an RL-trained VLM policy, with Astra-WM, a Bagel-based world simulator trained with view consistency tuning. Astra-VL decides when to invoke Astra-WM, which generates novel views conditioned on the context images and a natural-language camera motion. Training uses a two-phase RL curriculum with the simulator in the loop: the first phase stabilizes tool-use exploration so that the model learns when and how to query the simulator; the second encourages it to invoke the simulator only when the imagined observations improve on direct answering. The world simulator is trained on large-scale multi-view data such as IsaacSim, ScanNet, and Matterport3D, with explicit view consistency objectives to keep generated views spatially coherent. The system is evaluated on MMSI-Bench and MindCube, and the results indicate that both a consistent simulator and a learned invocation policy are needed for imagined views to help.
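The interaction described above can be pictured as a short control loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `Imagine`, `Answer`, `Episode`, `run_episode`, `toy_policy`, and `toy_simulator` are hypothetical, images are stood in for by strings, and the call budget `max_calls` is an illustrative parameter that the summary does not specify.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Imagine:
    """Policy action: request a novel view via a natural-language camera motion."""
    camera_motion: str


@dataclass
class Answer:
    """Policy action: stop and commit to a final answer."""
    text: str


Action = Union[Imagine, Answer]


@dataclass
class Episode:
    question: str
    views: List[str]  # observed and imagined views (strings stand in for images)
    trace: List[Action] = field(default_factory=list)


def run_episode(policy, simulator, question, observed_views, max_calls=3):
    """Multi-turn loop: the policy either answers or asks the simulator for a view."""
    ep = Episode(question, list(observed_views))
    for calls_left in range(max_calls, -1, -1):
        # When the budget is exhausted, the policy is told it must answer.
        action = policy(ep, must_answer=(calls_left == 0))
        ep.trace.append(action)
        if isinstance(action, Answer):
            return action.text, ep
        # Imagined view is appended to the context for the next turn.
        ep.views.append(simulator(ep.views, action.camera_motion))
    raise RuntimeError("policy ignored must_answer")


def toy_simulator(context_views, camera_motion):
    """Hypothetical stand-in for a world simulator such as Astra-WM."""
    return f"imagined[{camera_motion}]"


def toy_policy(ep, must_answer):
    """Hypothetical stand-in for a learned policy such as Astra-VL."""
    if must_answer or len(ep.views) >= 3:
        return Answer("left")
    return Imagine("turn left 90 degrees")


answer, episode = run_episode(
    toy_policy, toy_simulator, "Where is the sofa?", ["view_a", "view_b"]
)
print(answer, episode.views)
```

In this toy run the policy requests one imagined view and then answers. In the framework itself, the decision to imagine is what the RL curriculum trains, rather than a fixed rule like the one in `toy_policy`.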

Key Results

Astra-WM, after view consistency tuning, improves pose and content consistency metrics from unoptimized levels (pose consistency 9.0/3.0) to 72.5/70.5, and raises the accuracy of simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1% to 49.5%.
Astra-VL, trained via RL, boosts the accuracy of the Qwen3-VL backbone from 29.8% to 38.8% on MMSI-Bench and from 36.8% to 42.7% on MindCube, outperforming baseline models that do not utilize active imagination.
The two-phase RL curriculum first stabilizes tool-use exploration and then teaches the model to invoke the simulator selectively, only when imagined observations improve on direct answering.
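To put the reported accuracies on a common footing, the short script below computes absolute gains (percentage points) and relative gains from the before/after figures quoted above. The labels and the helper `gains` are illustrative; only the six accuracy values come from the results.

```python
# Before/after accuracies (%) as reported in the results above.
results = {
    "Astra-WM, simulator-augmented Gemini-3-Flash, MMSI-Bench": (45.1, 49.5),
    "Astra-VL, Qwen3-VL backbone, MMSI-Bench": (29.8, 38.8),
    "Astra-VL, Qwen3-VL backbone, MindCube": (36.8, 42.7),
}


def gains(before, after):
    """Return (absolute gain in points, relative gain in percent), rounded to 1 decimal."""
    absolute = round(after - before, 1)
    relative = round(100 * (after - before) / before, 1)
    return absolute, relative


for name, (before, after) in results.items():
    abs_pts, rel_pct = gains(before, after)
    print(f"{name}: +{abs_pts} points ({rel_pct}% relative)")
```

The absolute gains are 4.4, 9.0, and 5.9 points respectively. The largest gain, both absolute and relative, is the Qwen3-VL improvement on MMSI-Bench.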

Significance

This work advances the field of visual spatial reasoning by enabling models to actively acquire unobserved scene information through virtual imagination. Moving beyond passive recognition, the proposed framework allows AI systems to simulate alternative viewpoints, akin to human mental spatial manipulation. Such capability is crucial for applications like autonomous navigation, virtual environment understanding, and robotic manipulation, especially under limited observational data. The integration of learned decision policies with high-quality world simulators addresses longstanding challenges in multi-view consistency and reasoning robustness, paving the way for more autonomous, flexible, and context-aware AI agents.

Technical Contribution

The paper's key technical innovations include:
• Developing Astra-WM, a view-consistent world simulator fine-tuned with view consistency objectives, transforming generic generative models into reliable spatial simulators;
• Designing a two-phase RL curriculum that trains the VLM policy to learn when and how to invoke the simulator, optimizing tool use based on reasoning context;
• Implementing a multi-turn, iterative reasoning process where the model dynamically integrates imagined views to reduce spatial uncertainty.

These contributions collectively enable active, strategic virtual exploration within a reinforcement learning paradigm, setting a new standard for agentic visual reasoning.

Novelty

This research is the first to systematically incorporate action-conditioned virtual view generation into an active spatial reasoning framework, trained via reinforcement learning. Unlike prior static or purely recognition-based models, Astra learns to decide when to imagine, which viewpoints to request, and how to ground virtual observations in the reasoning process. This active, policy-driven approach to virtual scene exploration represents a significant departure from existing methods that treat view synthesis as a passive or isolated task, thus pioneering a new paradigm of interactive, agentic spatial reasoning.

Limitations

Despite the improvements in view consistency, the quality of generated virtual views may still degrade in highly complex or dynamic scenes, potentially misleading reasoning if the simulated views are inaccurate.
The training process relies heavily on large-scale multi-view datasets and view consistency tuning, which incur substantial data collection and computational costs, limiting scalability.
Current experiments focus on static indoor scenes; extending the approach to dynamic, real-world environments with temporal coherence remains a challenge.
The approach assumes accurate pose estimation and camera motion commands, which may not hold in real-world robotic applications, affecting robustness.

Future Work

Future research will explore enhancing the realism and temporal coherence of virtual views, integrating multi-modal cues such as depth and semantics for richer scene understanding, and reducing reliance on large annotated datasets through self-supervised learning. Additionally, extending the framework to dynamic scenes and real-world robotic systems, where pose estimation and sensor noise are prevalent, will be crucial. Investigating more efficient RL algorithms and scalable view synthesis models will further improve practicality, aiming toward autonomous agents capable of complex spatial reasoning in real-time, unstructured environments.

AI Executive Summary

Understanding the spatial layout of a scene from limited viewpoints remains a fundamental challenge in artificial intelligence. Traditional vision-language models (VLMs) excel at recognizing objects and answering questions based on observed images, but they struggle to infer unobserved spatial relationships and scene configurations when only partial information is available. Humans, however, effortlessly fill in missing details by mentally rotating, moving, and imagining unseen parts of the environment, forming a coherent mental map that guides their actions and reasoning.

Inspired by this human ability, the present work introduces Astra, a novel framework that empowers VLMs with active imagination capabilities through interaction with a world simulator. Unlike conventional models that passively process static images, Astra enables the model to decide when to invoke a virtual environment to generate alternative viewpoints, thereby acquiring additional spatial evidence. This active approach transforms spatial reasoning into an interactive process, where the model learns to strategically seek out the most informative views.

Astra comprises two main components: Astra-VL, a reinforcement learning-trained policy that governs the reasoning process, and Astra-WM, a Bagel-based world simulator fine-tuned with view consistency objectives. The simulator is designed to produce spatially coherent virtual views conditioned on natural language camera motions, ensuring that the generated images faithfully reflect the scene’s structure. During reasoning, Astra-VL evaluates the current state, determines whether additional virtual views are needed, and issues camera-motion queries to Astra-WM. The simulator responds with imagined observations, which are then integrated into the ongoing reasoning process.

To train this system, the authors employ a two-phase reinforcement learning curriculum. The first phase stabilizes the model’s ability to invoke the simulator appropriately, avoiding over-reliance or under-utilization. The second phase encourages the model to invoke the simulator only when the imagined views are expected to improve reasoning accuracy over direct answers. This training strategy effectively teaches the model when, where, and how to imagine, leading to more accurate and robust spatial reasoning.
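One way to make the two phases concrete is as two reward functions that are switched at some training step. The sketch below is an assumption throughout: the summary does not give the actual reward terms, so the bonus and penalty values, the switch point `phase1_steps`, and the names `Rollout`, `phase1_reward`, `phase2_reward`, and `curriculum_reward` are all hypothetical. It illustrates only the stated intent: first encourage well-formed simulator use, then credit simulator use only when it beats direct answering.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    correct: bool            # final answer matches the ground truth
    num_sim_calls: int       # number of simulator invocations in the rollout
    calls_well_formed: bool  # every query parsed as a valid camera motion
    direct_correct: bool     # a no-simulator answer to the same question was correct


def phase1_reward(r, tool_bonus=0.2):
    """Phase 1 (assumed form): correctness plus a small bonus for well-formed tool use."""
    reward = 1.0 if r.correct else 0.0
    if r.num_sim_calls > 0 and r.calls_well_formed:
        reward += tool_bonus
    return reward


def phase2_reward(r, tool_bonus=0.2, tool_penalty=0.1):
    """Phase 2 (assumed form): tool use is credited only when it beats direct answering."""
    reward = 1.0 if r.correct else 0.0
    if r.num_sim_calls > 0:
        if r.correct and not r.direct_correct:
            reward += tool_bonus    # imagination changed a wrong answer into a right one
        else:
            reward -= tool_penalty  # simulator call was unnecessary or unhelpful
    return reward


def curriculum_reward(r, step, phase1_steps=100):
    """Use the phase-1 reward early in training and the phase-2 reward afterwards."""
    return phase1_reward(r) if step < phase1_steps else phase2_reward(r)


# The same rollout is rewarded differently in the two phases.
needless_call = Rollout(correct=True, num_sim_calls=1,
                        calls_well_formed=True, direct_correct=True)
print(curriculum_reward(needless_call, step=10))
print(curriculum_reward(needless_call, step=500))
```

Under these assumed values, a correct rollout that called the simulator on a question it could already answer directly earns the bonus in phase 1 but a penalty in phase 2, which is the selective behavior the curriculum is described as targeting.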

Experimental results support both components. After view consistency tuning, Astra-WM improves pose and content consistency markedly, and it raises the accuracy of simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1% to 49.5%. Separately, RL training with Astra-VL raises the Qwen3-VL backbone from 29.8% to 38.8% on MMSI-Bench and from 36.8% to 42.7% on MindCube. These improvements indicate that both high-quality virtual views and a learned decision policy matter. The results also show that active imagination, when properly trained, can improve a model's multi-view spatial reasoning under limited observational data.

This research opens new avenues for autonomous agents capable of strategic virtual exploration, with potential applications in robotics, virtual reality, and intelligent navigation. By enabling models to actively seek out and utilize unobserved spatial information, it moves AI closer to human-like spatial cognition. Despite current limitations related to scene complexity and data requirements, the framework establishes a foundation for future work aimed at real-time, dynamic environment understanding and reasoning. Overall, Astra represents a significant step toward more intelligent, adaptable, and perceptive AI systems capable of active spatial exploration and reasoning.

Deep Dive

Abstract

While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning abilities remain largely constrained to the observed images and text-oriented chain-of-thought. They often struggle to infer unobserved layouts, maintain cross-view consistency, and reason from alternative viewpoints when only limited egocentric observations are available. In this work, we study this problem as thinking with imagination, where a VLM actively acquires imagined visual evidence by interacting with a world simulator during reasoning. We propose Astra, an agentic spatial reasoning framework that empowers VLMs with action-conditioned visual imagination. Specifically, Astra couples Astra-VL, an RL-trained VLM policy, with Astra-WM, a Bagel-based world simulator that generates novel-view observations from context images and natural-language camera motions. To provide reliable imagined evidence, Astra-WM is trained with view consistency tuning to improve pose and content consistency across views. In the RL stage, we propose a world-simulator-in-the-loop two-phase RL curriculum to stabilize tool-use exploration and advance the model's ability to invoke the simulator only when imagined observations improve over direct answering. Experiments demonstrate that both the world simulator and the agentic policy are necessary: Astra-WM improves simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5, while Astra-VL improves the Qwen3-VL backbone from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube. These results show that imagined observations can provide useful spatial evidence, but effective world-model-augmented reasoning requires learning when, where, and how to imagine.


