SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning
SpatialClaw uses code as the action interface, achieving 59.9% average accuracy across 20 spatial reasoning benchmarks and outperforming the recent spatial agent by 11.2 points.
Key Findings
Methodology
SpatialClaw is a spatial reasoning framework built on a persistent Python kernel, with code as the action interface. The kernel is preloaded with the input frames, perception primitives, geometric functions, and scientific libraries (e.g., NumPy, SciPy), so the model can generate, execute, and revise code iteratively. The core process involves:
- Using a system prompt to guide code generation for spatial analysis;
- Maintaining a persistent environment that stores intermediate variables and results;
- Running a multi-turn planning-generation-execution-feedback loop to refine the reasoning.
This approach lets the model flexibly compose perception outputs, inspect intermediate states, and adapt its analysis as evidence accumulates, without any additional training. Evaluation across 20 diverse static and dynamic 3D/4D spatial tasks shows gains over single-pass and structured tool-call baselines, supporting code as a versatile action interface.
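The persistent-environment idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: `PersistentKernel`, `run_cell`, and the synthetic `frames` are hypothetical names chosen for illustration, and plain `exec` on a shared dictionary stands in for a real kernel.

```python
import contextlib
import io
import traceback


class PersistentKernel:
    """Toy stand-in for a stateful kernel: one namespace shared by all cells."""

    def __init__(self, preloaded):
        # Everything preloaded here (frames, helpers, libraries) is visible to every cell.
        self.namespace = dict(preloaded)

    def run_cell(self, source):
        """Execute one cell and return (ok, printed output or traceback)."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(source, self.namespace)
            return True, buffer.getvalue()
        except Exception:
            # The traceback becomes feedback for the next turn instead of ending the run.
            return False, buffer.getvalue() + traceback.format_exc()


# Hypothetical preloaded state: two synthetic 4x4 grayscale "frames".
frames = [
    [[0.0] * 4 for _ in range(4)],
    [[1.0] * 4 for _ in range(4)],
]
kernel = PersistentKernel({"frames": frames})

# Cell 1 stores an intermediate result; later cells reuse it.
ok1, out1 = kernel.run_cell(
    "diff = [b - a for ra, rb in zip(frames[0], frames[1]) for a, b in zip(ra, rb)]\n"
    "print(len(diff))"
)
ok2, out2 = kernel.run_cell("print(sum(diff) / len(diff))")
# Cell 3 fails, but the namespace survives the error.
ok3, out3 = kernel.run_cell("print(undefined_name)")
ok4, out4 = kernel.run_cell("print(int(sum(diff)))")
```

The point of the sketch is that `diff`, created in the first cell, is still available after the failing third cell, which is what allows a later turn to correct a mistake without recomputing earlier steps.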
Key Results
- On 20 spatial reasoning benchmarks, SpatialClaw achieves an average accuracy of 59.9%, surpassing the recent spatial agent by +11.2 points. The method excels particularly in dynamic 4D video reasoning and multi-view inference tasks, where geometric chaining across frames and viewpoints is essential. The performance gains are consistent across six different backbone models, including Qwen and Gemma series, with parameters ranging from 27B to 397B, without any task-specific tuning, highlighting its strong generalization capacity.
- Compared to single-pass code execution and structured tool calls, the multi-turn iterative approach enables continuous correction of intermediate results, significantly improving accuracy and robustness. Ablation studies confirm that the expressive power of code as an action interface allows for flexible combination of perception primitives like segmentation, reconstruction, and distance measurement, leading to superior performance even when external utility wrappers are removed.
- The experiments demonstrate that the ability to inspect, revise, and recompose analysis steps in a persistent environment is crucial for complex spatial reasoning. The approach effectively handles tasks involving depth estimation, object relations, motion tracking, and multi-view geometry, outperforming baselines by large margins across various scenarios.
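As a hedged illustration of how perception outputs such as depth and segmentation masks might be combined into a distance measurement, the sketch below back-projects masked pixels with a pinhole camera model and compares 3D centroids. The function names, the synthetic depth map, and the intrinsics (`fx`, `fy`, `cx`, `cy`) are assumptions for illustration; the source does not specify SpatialClaw's actual primitives.

```python
import numpy as np


def backproject(depth, mask, fx, fy, cx, cy):
    """Lift masked pixels to 3D camera coordinates with a pinhole model."""
    v, u = np.nonzero(mask)  # row (v) and column (u) indices of masked pixels
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)


def centroid_distance(depth, mask_a, mask_b, fx, fy, cx, cy):
    """Euclidean distance between the 3D centroids of two masked regions."""
    pa = backproject(depth, mask_a, fx, fy, cx, cy).mean(axis=0)
    pb = backproject(depth, mask_b, fx, fy, cx, cy).mean(axis=0)
    return float(np.linalg.norm(pa - pb))


# Synthetic scene: a flat surface 2 m from the camera, with made-up intrinsics.
depth = np.full((5, 5), 2.0)
mask_a = np.zeros((5, 5), dtype=bool)
mask_a[2, 1] = True
mask_b = np.zeros((5, 5), dtype=bool)
mask_b[2, 3] = True

dist = centroid_distance(depth, mask_a, mask_b, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
print(dist)
```

In a code-based interface, a chain like this (mask, depth lookup, back-projection, norm) can be assembled per question rather than requiring a dedicated tool for each combination.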
Significance
This work fundamentally rethinks the action interface for spatial reasoning agents, shifting from fixed API calls or one-shot programs to a flexible, iterative code-based approach. It addresses longstanding challenges in modeling complex 3D/4D spatial relationships, enabling models to dynamically adapt their analysis based on intermediate evidence. The method's model-agnostic design and strong transferability across diverse benchmarks and backbone architectures demonstrate its broad applicability. Its success paves the way for more intelligent, adaptable spatial reasoning systems in robotics, augmented reality, and scene understanding, where precise geometric reasoning is critical. By removing the need for task-specific training, it offers a scalable, plug-and-play solution for advancing multimodal AI capabilities.
Technical Contribution
The primary technical innovation lies in formalizing code as the action interface within a persistent Python environment, allowing multi-step, iterative spatial reasoning. Unlike prior approaches limited to single-pass programs or rigid API calls, this framework supports dynamic composition, inspection, and correction of perception outputs. Key components include:
- A system prompt that encodes general spatial reasoning principles;
- A persistent kernel that retains all intermediate variables and results;
- Integration with scientific libraries (NumPy, SciPy) for complex geometric and numerical operations;
- An outer control loop orchestrating planning, code generation, execution, feedback, and answer submission.
This design enhances the expressiveness and flexibility of spatial reasoning, enabling models to adaptively build complex geometric chains across multiple steps, leading to superior performance on challenging benchmarks.
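The outer control loop listed among the components can be sketched as follows. This is a toy under stated assumptions: `scripted_policy` is a hard-coded stand-in for the VLM, `submit` is a hypothetical answer-submission helper, and the box extents are invented data. The sketch only shows the shape of a generate-execute-feedback-submit cycle, not the system's actual prompts or API.

```python
import contextlib
import io
import traceback


def run_cell(namespace, source):
    """Run one cell in the shared namespace; return printed output or the traceback."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, namespace)
        return buffer.getvalue()
    except Exception:
        return buffer.getvalue() + traceback.format_exc()


def scripted_policy(history):
    """Stand-in for the VLM: choose the next cell from the feedback seen so far."""
    if not history:
        # The first attempt contains a bug: `box_b` is misspelled.
        return "gap = box_bb[0] - box_a[1]\nprint(gap)"
    last_feedback = history[-1][1]
    if "NameError" in last_feedback:
        # Revise after reading the error message.
        return "gap = box_b[0] - box_a[1]\nprint(gap)"
    # Enough evidence: submit the answer through the hypothetical `submit` helper.
    return "submit('B is right of A' if gap > 0 else 'B overlaps or is left of A')"


def solve(policy, preloaded, max_turns=5):
    """Loop until the policy submits an answer or the turn budget runs out."""
    answer = {}
    namespace = dict(preloaded)
    namespace["submit"] = lambda value: answer.setdefault("value", value)
    history = []
    for _ in range(max_turns):
        cell = policy(history)
        feedback = run_cell(namespace, cell)
        history.append((cell, feedback))
        if "value" in answer:
            break
    return answer.get("value"), history


# Hypothetical preloaded state: horizontal extents (x_min, x_max) of two boxes.
result, history = solve(scripted_policy, {"box_a": (10, 40), "box_b": (55, 90)})
print(result, len(history))
```

Here the first cell fails, the second corrects it using the error text, and the third submits, so the run takes three turns. A real agent would replace `scripted_policy` with a model call conditioned on the full history.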
Novelty
This work systematically studies code as the action interface for spatial reasoning in multimodal models. Unlike existing methods that rely on one-shot programs or structured API calls, SpatialClaw emphasizes multi-turn, iterative code generation within a persistent environment, facilitating flexible composition and correction. Its core novelty is the integration of a persistent Python kernel with a disciplined reasoning prompt, enabling models to dynamically assemble perception primitives and geometric computations tailored to each problem. This approach supports open-ended, complex spatial analysis and offers a reference design for agents in vision-language reasoning tasks.
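To make the contrast with structured API calls concrete, the sketch below compares a fixed tool-call dispatch with a free-form code cell operating on the same primitive. All names (`CENTERS`, `distance`, `call_tool`) and the object coordinates are hypothetical; this is an illustration of the interface difference, not a reproduction of any baseline.

```python
import json
import math
import statistics

# Hypothetical perception output: 3D object centers in metres.
CENTERS = {"chair": (0.0, 0.0, 2.0), "table": (3.0, 0.0, 2.0), "lamp": (0.0, 4.0, 2.0)}


def distance(a, b):
    """Distance between two named objects."""
    return math.dist(CENTERS[a], CENTERS[b])


# Structured interface: each call names one registered tool with fixed arguments.
TOOLS = {"distance": distance}


def call_tool(request_json):
    request = json.loads(request_json)
    return TOOLS[request["tool"]](**request["args"])


single = call_tool('{"tool": "distance", "args": {"a": "chair", "b": "table"}}')

# Code interface: one cell can loop, filter, and aggregate over the same primitive.
cell = """
names = sorted(CENTERS)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
dists = {pair: distance(*pair) for pair in pairs}
closest = min(dists, key=dists.get)
median_gap = statistics.median(dists.values())
"""
namespace = {"CENTERS": CENTERS, "distance": distance, "statistics": statistics}
exec(cell, namespace)
print(single, namespace["closest"], namespace["median_gap"])
```

With the structured interface, finding the closest pair or a median gap would need either several round trips or a purpose-built tool; the code cell expresses both in one step using ordinary Python.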
Limitations
- Despite its strengths, SpatialClaw depends heavily on the availability of preloaded perception tools and scientific libraries, which may limit its deployment in resource-constrained environments. The code generation process can introduce latency, especially for highly complex tasks requiring extensive iteration. Additionally, the current framework assumes access to large VLM backbones such as the Qwen and Gemma series, which may not be feasible in all settings, potentially affecting scalability.
- The approach may encounter difficulties in scenarios with ambiguous or noisy visual inputs, where intermediate code corrections become more challenging. Moreover, the reliance on large language models raises concerns about interpretability and controllability, especially in safety-critical applications. Future work should focus on optimizing code synthesis efficiency, incorporating verification mechanisms, and extending the toolset for broader geometric reasoning.
- While the multi-step iterative process improves robustness, it can still fail in highly complex or ambiguous cases, which calls for further improvements in reasoning discipline and error correction strategies. Addressing these limitations will be essential for real-world deployment in robotics, autonomous navigation, and AR systems.
Future Work
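In future work, the authors plan to combine reinforcement learning and self-supervised learning to improve the efficiency and accuracy of code generation, particularly for dynamic scenes and complex geometric relationships. They also plan to extend the perception toolset, integrating more advanced geometric reasoning algorithms and scene understanding models to improve performance in real-world robotics and augmented reality applications. In addition, they will explore multimodal data fusion strategies to strengthen the model's spatial understanding of complex environments. In the long term, the goal is a highly autonomous spatial reasoning system capable of multi-turn revision, advancing agents' autonomous navigation, scene understanding, and interaction in complex environments.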
AI Executive Summary
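Spatial reasoning, the core ability to understand the positions, relationships, and motion of objects in 3D space, has long been a major challenge for vision-language models (VLMs). Existing methods mostly rely on single-pass program execution or structured API calls, which limits the flexibility and complexity of reasoning and makes it hard to handle dynamic, multi-view, multi-timestep scenes. The SpatialClaw framework proposed in this paper adopts code as the action interface to enable multi-turn, iterative spatial reasoning. It uses a persistent Python kernel that lets the model write, execute, observe, and revise intermediate results step by step, substantially improving the expressiveness and robustness of reasoning.
Across 20 spatial reasoning benchmarks, SpatialClaw reaches an average accuracy of 59.9%, 11.2 percentage points above the recent spatial agent, with particularly strong results on dynamic video and multi-view reasoning tasks. The gain is attributed to flexible code manipulation, which lets the model autonomously combine perception tools (such as depth estimation, scene reconstruction, and distance measurement) based on intermediate evidence, building and correcting complex geometric relationships step by step. Experiments also show that the method transfers well across different VLM backbones (the Qwen and Gemma series), delivering consistent gains without tuning.
The significance of this work is that it moves past the traditional limits of the action interface in spatial reasoning, giving multimodal systems greater expressive and operational capability. With multi-turn revision, the model can understand complex scenes more accurately and shows considerable potential in practical applications such as robot navigation, augmented reality, and scene understanding. In future work, the authors plan to combine reinforcement learning and self-supervised techniques to further improve code generation efficiency and extend the perception toolset.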
Deep Dive
Abstract
Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface that often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, letting a VLM-backed agent write one executable cell per step conditioned on all prior outputs, enabling the agent to flexibly compose and manipulate perception results and adapt its analysis to both intermediate text and visual observations and the demands of each problem. Evaluated across 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by +11.2 points, with consistent gains across six VLM backbones from two model families without any benchmark- or model-specific adaptation.