Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators
Proposes Astra framework combining RL-trained VLM policy with Bagel-based world simulator for imagination-driven spatial reasoning, improving MMSI-Bench accuracy from 45.1% to 49.5%.
Key Findings
Methodology
This paper introduces the Astra framework, integrating an RL-trained VLM policy (Astra-VL) with Astra-WM, a Bagel-based world simulator trained with view consistency tuning. Astra-VL actively decides when to invoke Astra-WM, which generates action-conditioned novel views based on natural language camera motions. The training employs a two-phase RL curriculum: the first stabilizes tool use exploration, ensuring the model learns when and how to query the simulator; the second encourages the model to invoke the simulator only when the imagined observations improve reasoning outcomes. The world simulator is trained on large-scale multi-view datasets like IsaacSim, ScanNet, and Matterport3D, with explicit view consistency objectives to ensure spatial coherence. The entire system is evaluated on MMSI-Bench and MindCube, demonstrating that virtual views significantly enhance spatial reasoning performance when both the simulator quality and policy learning are optimized.
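The interaction described above can be pictured as a short multi-turn loop. The following is a minimal sketch, not the authors' implementation: the names `Imagine`, `Answer`, `Episode`, `run_episode`, `toy_policy`, and `toy_simulator` are hypothetical, images are stood in for by strings, and the turn budget `max_turns` is an assumption rather than a value from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple, Union


@dataclass
class Imagine:
    """Policy action: ask the simulator for a novel view (hypothetical type)."""
    camera_motion: str  # natural-language camera motion, e.g. "turn left 90 degrees"


@dataclass
class Answer:
    """Policy action: stop reasoning and commit to a final answer (hypothetical type)."""
    text: str


Action = Union[Imagine, Answer]


@dataclass
class Episode:
    question: str
    views: List[str]  # string placeholders for observed and imagined images
    trace: List[Action] = field(default_factory=list)


def run_episode(
    question: str,
    context_views: List[str],
    policy: Callable[[Episode], Action],
    simulator: Callable[[List[str], str], str],
    max_turns: int = 4,  # assumed budget, not taken from the paper
) -> Tuple[Optional[str], Episode]:
    """Let the policy alternate between imagining views and answering."""
    ep = Episode(question, list(context_views))
    for _ in range(max_turns):
        action = policy(ep)
        ep.trace.append(action)
        if isinstance(action, Answer):
            return action.text, ep
        # The simulator returns an imagined view conditioned on the
        # current views and the requested camera motion.
        ep.views.append(simulator(ep.views, action.camera_motion))
    return None, ep  # turn budget exhausted without a final answer


def toy_policy(ep: Episode) -> Action:
    """Stand-in for Astra-VL: imagine once, then answer."""
    if len(ep.views) < 2:
        return Imagine("turn left 90 degrees")
    return Answer("B")


def toy_simulator(views: List[str], camera_motion: str) -> str:
    """Stand-in for Astra-WM: returns a label instead of an image."""
    return f"imagined({camera_motion})"


answer, episode = run_episode("Which object is behind the sofa?", ["img0"],
                              toy_policy, toy_simulator)
print(answer, episode.views)
```

In this toy run the policy requests one imagined view and then answers; a trained policy would instead base both decisions on the question and the views gathered so far.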
Key Results
- After view consistency tuning, Astra-WM's pose and content consistency scores rise from 9.0/3.0 before tuning to 72.5/70.5, and simulator-augmented Gemini-3-Flash improves on MMSI-Bench from 45.1% to 49.5%.
- Astra-VL, trained via RL, boosts the accuracy of the Qwen3-VL backbone from 29.8% to 38.8% on MMSI-Bench and from 36.8% to 42.7% on MindCube, outperforming baseline models that do not utilize active imagination.
- The two-phase RL curriculum effectively teaches the model to balance exploration and exploitation, enabling it to invoke the simulator selectively, which results in better spatial reasoning and reduced uncertainty in multi-view tasks.
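To put the reported accuracies side by side, the short calculation below derives absolute and relative gains from the before/after figures quoted above. It only rearranges the reported numbers; the helper name `gains` and the dictionary labels are illustrative.

```python
# Before/after accuracies (percent) as reported in the summary above.
reported = {
    "Simulator-augmented Gemini-3-Flash, MMSI-Bench": (45.1, 49.5),
    "Qwen3-VL backbone to Astra-VL, MMSI-Bench": (29.8, 38.8),
    "Qwen3-VL backbone to Astra-VL, MindCube": (36.8, 42.7),
}


def gains(before: float, after: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = round(after - before, 1)
    relative = round(100.0 * (after - before) / before, 1)
    return absolute, relative


for name, (before, after) in reported.items():
    abs_gain, rel_gain = gains(before, after)
    print(f"{name}: +{abs_gain} points ({rel_gain}% relative)")
```

The absolute gains are 4.4, 9.0, and 5.9 points respectively, so the largest reported improvement comes from the learned policy on MMSI-Bench.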
Significance
This work advances the field of visual spatial reasoning by enabling models to actively acquire unobserved scene information through virtual imagination. Moving beyond passive recognition, the proposed framework allows AI systems to simulate alternative viewpoints, akin to human mental spatial manipulation. Such capability is crucial for applications like autonomous navigation, virtual environment understanding, and robotic manipulation, especially under limited observational data. The integration of learned decision policies with high-quality world simulators addresses longstanding challenges in multi-view consistency and reasoning robustness, paving the way for more autonomous, flexible, and context-aware AI agents.
Technical Contribution
The paper's key technical innovations include:
- Developing Astra-WM, a view-consistent world simulator fine-tuned with view consistency objectives, transforming generic generative models into reliable spatial simulators;
- Designing a two-phase RL curriculum that trains the VLM policy to learn when and how to invoke the simulator, optimizing tool use based on reasoning context;
- Implementing a multi-turn, iterative reasoning process where the model dynamically integrates imagined views to reduce spatial uncertainty.
These contributions collectively enable active, strategic virtual exploration within a reinforcement learning paradigm, setting a new standard for agentic visual reasoning.
Novelty
This research is the first to systematically incorporate action-conditioned virtual view generation into an active spatial reasoning framework, trained via reinforcement learning. Unlike prior static or purely recognition-based models, Astra learns to decide when to imagine, which viewpoints to request, and how to ground virtual observations in the reasoning process. This active, policy-driven approach to virtual scene exploration represents a significant departure from existing methods that treat view synthesis as a passive or isolated task, thus pioneering a new paradigm of interactive, agentic spatial reasoning.
Limitations
- Despite the improvements in view consistency, the quality of generated virtual views may still degrade in highly complex or dynamic scenes, potentially misleading reasoning if the simulated views are inaccurate.
- The training process relies heavily on large-scale multi-view datasets and view consistency tuning, which incur substantial data collection and computational costs, limiting scalability.
- Current experiments focus on static indoor scenes; extending the approach to dynamic, real-world environments with temporal coherence remains a challenge.
- The approach assumes accurate pose estimation and camera motion commands, which may not hold in real-world robotic applications, affecting robustness.
Future Work
Future research will explore enhancing the realism and temporal coherence of virtual views, integrating multi-modal cues such as depth and semantics for richer scene understanding, and reducing reliance on large annotated datasets through self-supervised learning. Additionally, extending the framework to dynamic scenes and real-world robotic systems, where pose estimation and sensor noise are prevalent, will be crucial. Investigating more efficient RL algorithms and scalable view synthesis models will further improve practicality, aiming toward autonomous agents capable of complex spatial reasoning in real-time, unstructured environments.
AI Executive Summary
Understanding the spatial layout of a scene from limited viewpoints remains a fundamental challenge in artificial intelligence. Traditional vision-language models (VLMs) excel at recognizing objects and answering questions based on observed images, but they struggle to infer unobserved spatial relationships and scene configurations when only partial information is available. Humans, however, effortlessly fill in missing details by mentally rotating, moving, and imagining unseen parts of the environment, forming a coherent mental map that guides their actions and reasoning.
Inspired by this human ability, the present work introduces Astra, a novel framework that empowers VLMs with active imagination capabilities through interaction with a world simulator. Unlike conventional models that passively process static images, Astra enables the model to decide when to invoke a virtual environment to generate alternative viewpoints, thereby acquiring additional spatial evidence. This active approach transforms spatial reasoning into an interactive process, where the model learns to strategically seek out the most informative views.
Astra comprises two main components: Astra-VL, a reinforcement learning-trained policy that governs the reasoning process, and Astra-WM, a Bagel-based world simulator fine-tuned with view consistency objectives. The simulator is designed to produce spatially coherent virtual views conditioned on natural language camera motions, ensuring that the generated images faithfully reflect the scene's structure. During reasoning, Astra-VL evaluates the current state, determines whether additional virtual views are needed, and issues camera-motion queries to Astra-WM. The simulator responds with imagined observations, which are then integrated into the ongoing reasoning process.
To train this system, the authors employ a two-phase reinforcement learning curriculum. The first phase stabilizes the model's ability to invoke the simulator appropriately, avoiding over-reliance or under-utilization. The second phase encourages the model to invoke the simulator only when the imagined views are expected to improve reasoning accuracy over direct answers. This training strategy effectively teaches the model when, where, and how to imagine, leading to more accurate and robust spatial reasoning.
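One way to picture the difference between the two phases is through the shape of the reward. The sketch below is purely illustrative: the summary does not give the actual reward functions, so the function names, the bonus and penalty values, and the idea of comparing against a direct-answer rollout are all assumptions made for this example.

```python
def phase1_reward(correct: bool, tool_called: bool, call_well_formed: bool) -> float:
    """Phase 1 (assumed form): reward correctness, and nudge the policy toward
    well-formed simulator calls so tool use is explored rather than abandoned."""
    reward = 1.0 if correct else 0.0
    if tool_called and call_well_formed:
        reward += 0.2  # assumed bonus for a valid simulator query
    elif tool_called:
        reward -= 0.2  # assumed penalty for a malformed query
    return reward


def phase2_reward(
    correct_with_tool: bool,
    correct_direct: bool,
    tool_called: bool,
    call_cost: float = 0.1,  # assumed cost of an unnecessary simulator call
) -> float:
    """Phase 2 (assumed form): a simulator call pays off only when the imagined
    views lead to a correct answer that direct answering would have missed."""
    if not tool_called:
        return 1.0 if correct_direct else 0.0
    reward = 1.0 if correct_with_tool else 0.0
    if correct_with_tool and not correct_direct:
        reward += 0.5  # imagination improved over direct answering
    else:
        reward -= call_cost  # the call brought no improvement
    return reward


# Imagination helped: rewarded above a plain correct answer.
print(phase2_reward(correct_with_tool=True, correct_direct=False, tool_called=True))
# Imagination was unnecessary: slightly below a plain correct answer.
print(phase2_reward(correct_with_tool=True, correct_direct=True, tool_called=True))
```

Under these assumed values, a policy that calls the simulator on questions it could already answer directly earns less than one that answers directly, which is the selective behavior the second phase is described as encouraging.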
Experimental results demonstrate the effectiveness of Astra. After view consistency tuning, Astra-WM achieves substantially higher pose and content consistency, and using it as the simulator raises simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1% to 49.5%. Astra-VL, in turn, improves the Qwen3-VL backbone from 29.8% to 38.8% on MMSI-Bench and from 36.8% to 42.7% on MindCube. These improvements highlight the importance of both high-quality virtual views and learned decision policies. The results also show that active imagination, when properly trained, can substantially enhance a model's ability to perform multi-view spatial reasoning under limited observational data.
This research opens new avenues for autonomous agents capable of strategic virtual exploration, with potential applications in robotics, virtual reality, and intelligent navigation. By enabling models to actively seek out and utilize unobserved spatial information, it moves AI closer to human-like spatial cognition. Despite current limitations related to scene complexity and data requirements, the framework establishes a foundation for future work aimed at real-time, dynamic environment understanding and reasoning. Overall, Astra represents a significant step toward more intelligent, adaptable, and perceptive AI systems capable of active spatial exploration and reasoning.
Deep Dive
Abstract
While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning abilities remain largely constrained to the observed images and text-oriented chain-of-thought. They often struggle to infer unobserved layouts, maintain cross-view consistency, and reason from alternative viewpoints when only limited egocentric observations are available. In this work, we study this problem as thinking with imagination, where a VLM actively acquires imagined visual evidence by interacting with a world simulator during reasoning. We propose Astra, an agentic spatial reasoning framework that empowers VLMs with action-conditioned visual imagination. Specifically, Astra couples Astra-VL, an RL-trained VLM policy, with Astra-WM, a Bagel-based world simulator that generates novel-view observations from context images and natural-language camera motions. To provide reliable imagined evidence, Astra-WM is trained with view consistency tuning to improve pose and content consistency across views. In the RL stage, we propose a world-simulator-in-the-loop two-phase RL curriculum to stabilize tool-use exploration and advance the model's ability to invoke the simulator only when imagined observations improve over direct answering. Experiments demonstrate that both the world simulator and the agentic policy are necessary: Astra-WM improves simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5, while Astra-VL improves the Qwen3-VL backbone from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube. These results show that imagined observations can provide useful spatial evidence, but effective world-model-augmented reasoning requires learning when, where, and how to imagine.