SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

TL;DR

SpatialClaw employs code as an action interface, achieving 59.9% average accuracy across 20 spatial reasoning benchmarks and outperforming the recent spatial agent by 11.2 points.

cs.CV 2026-06-12

Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su, Byung-Kwan Lee, Chan Hee Song, Sifei Liu, Subhashree Radhakrishnan, Seungryong Kim, Yu-Chiang Frank Wang, Min-Hung Chen


spatial reasoning, vision-language models, tool augmentation, code interface, multimodal understanding

Key Findings

Methodology

SpatialClaw is a spatial reasoning framework built on a persistent Python kernel, with code as the action interface. The kernel is preloaded with the input frames, perception primitives, geometric functions, and scientific libraries (e.g., NumPy, SciPy), so the model can generate, execute, and revise code iteratively. The core process involves:
• Using a system prompt to guide code generation for spatial analysis;
• Maintaining a persistent environment that stores intermediate variables and results;
• Running a multi-turn planning-generation-execution-feedback loop to refine reasoning.
This lets the model compose perception outputs, inspect intermediate states, and adapt its analysis as evidence accumulates, without additional training. Evaluation on 20 static and dynamic 3D/4D spatial tasks shows gains over single-pass and structured tool-call baselines.
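The loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: `PersistentKernel`, `run_agent`, and `scripted_policy` are hypothetical names, the kernel is a plain `exec` namespace rather than a real Jupyter-style kernel, and the policy replays fixed cells where the real system would query a VLM. The second cell fails on purpose so that the feedback-and-revise step is visible.

```python
import contextlib
import io
import traceback


class PersistentKernel:
    """Toy stand-in for a stateful Python kernel: one namespace shared by all cells."""

    def __init__(self, preloaded=None):
        # Preloaded names play the role of input frames and perception primitives.
        self.namespace = dict(preloaded or {})

    def run_cell(self, code):
        """Execute one cell and return its printed output, plus the traceback on error."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception:
            return buffer.getvalue() + traceback.format_exc()
        return buffer.getvalue()


def run_agent(policy, kernel, max_steps=8):
    """Alternate between asking the policy for a cell and executing it."""
    history = []  # (code, output) pairs fed back to the policy on every turn
    for _ in range(max_steps):
        action = policy(history)
        if action["type"] == "answer":
            return action["value"], history
        output = kernel.run_cell(action["code"])
        history.append((action["code"], output))
    return None, history


def scripted_policy(history):
    """Hypothetical policy: replays fixed cells instead of querying a VLM."""
    cells = [
        "heights = [frame['height_m'] for frame in frames]\nprint(heights)",
        "print(tallest)",  # deliberate bug: the name is not defined yet
        "tallest = max(heights)\nprint(tallest)",  # revision after seeing the error
    ]
    if len(history) < len(cells):
        return {"type": "code", "code": cells[len(history)]}
    return {"type": "answer", "value": history[-1][1].strip()}


frames = [{"height_m": 1.2}, {"height_m": 1.8}, {"height_m": 0.9}]
kernel = PersistentKernel(preloaded={"frames": frames})
answer, history = run_agent(scripted_policy, kernel)
print(answer)
```

The third cell reuses `heights` from the first cell without recomputing it, which is the role the persistent environment plays in the description above; the `NameError` traceback from the second cell is returned as ordinary feedback rather than ending the run.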

Key Results

On 20 spatial reasoning benchmarks, SpatialClaw achieves an average accuracy of 59.9%, surpassing the recent spatial agent by +11.2 points. The method excels particularly in dynamic 4D video reasoning and multi-view inference tasks, where geometric chaining across frames and viewpoints is essential. The performance gains are consistent across six different backbone models, including Qwen and Gemma series, with parameters ranging from 27B to 397B, without any task-specific tuning, highlighting its strong generalization capacity.
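Because the gain is stated in points rather than percent, a short calculation helps separate the two readings. The baseline figure below is derived from the two reported numbers, on the assumption that both averages cover the same 20 benchmarks; it is not a value quoted in the text.

```python
spatialclaw_avg = 59.9  # average accuracy (%) reported in the text
gain_points = 11.2      # reported gain in percentage points

# Derived, not reported: the baseline average implied by the two numbers above.
baseline_avg = spatialclaw_avg - gain_points
# Relative improvement over that implied baseline, in percent.
relative_gain = gain_points / baseline_avg * 100

print(f"implied baseline: {baseline_avg:.1f}%")
print(f"relative gain: {relative_gain:.1f}%")
```

Under that assumption, an 11.2-point gain corresponds to a relative improvement of roughly 23%, so the two units should not be used interchangeably.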
Compared with single-pass code execution and structured tool calls, the multi-turn approach lets the agent correct intermediate results as it goes, which improves accuracy and robustness. Ablation studies indicate that code as an action interface supports flexible combination of perception primitives such as segmentation, reconstruction, and distance measurement, and that the gains persist even when external utility wrappers are removed.
The experiments indicate that the ability to inspect, revise, and recompose analysis steps in a persistent environment is crucial for complex spatial reasoning. The approach handles tasks involving depth estimation, object relations, motion tracking, and multi-view geometry, and outperforms the baselines across these scenarios.
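As an illustration of what combining perception primitives can look like in code, the sketch below chains a segmentation mask and a depth map into a 3D distance using NumPy. The helpers `backproject` and `masked_centroid`, the pinhole intrinsics, and the synthetic arrays are assumptions made for this example; the source does not specify the actual primitives or their signatures.

```python
import numpy as np


def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map (H, W) in metres to camera-frame 3D points (H, W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row indices
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)


def masked_centroid(points, mask):
    """Mean 3D position of the pixels selected by a boolean mask."""
    return points[mask].mean(axis=0)


# Synthetic stand-ins for perception outputs (a real system would call models here).
depth = np.full((4, 6), 2.0)           # left half of the image is 2 m away
depth[:, 3:] = 4.0                     # right half is 4 m away
mask_a = np.zeros((4, 6), dtype=bool)
mask_a[1:3, 0:2] = True                # "object A" on the near surface
mask_b = np.zeros((4, 6), dtype=bool)
mask_b[1:3, 4:6] = True                # "object B" on the far surface

points = backproject(depth, fx=2.0, fy=2.0, cx=2.5, cy=1.5)
centroid_a = masked_centroid(points, mask_a)
centroid_b = masked_centroid(points, mask_b)
distance = float(np.linalg.norm(centroid_a - centroid_b))
print(centroid_a, centroid_b, round(distance, 3))
```

Each intermediate (`points`, `centroid_a`, `centroid_b`) stays available for inspection, so an agent could check a centroid for plausibility before committing to the distance.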

Significance

This work rethinks the action interface for spatial reasoning agents, shifting from fixed API calls or one-shot programs to a flexible, iterative code-based approach. It targets the difficulty of modeling complex 3D/4D spatial relationships by letting models adapt their analysis to intermediate evidence. The method is model-agnostic and transfers across diverse benchmarks and backbone architectures, which suggests broad applicability. It points toward more adaptable spatial reasoning systems in robotics, augmented reality, and scene understanding, where precise geometric reasoning is critical. Because it needs no task-specific training, it can be applied to a new backbone as a plug-and-play component.

Technical Contribution

The primary technical innovation lies in formalizing code as the action interface within a persistent Python environment, allowing multi-step, iterative spatial reasoning. Unlike prior approaches limited to single-pass programs or rigid API calls, this framework supports dynamic composition, inspection, and correction of perception outputs. Key components include:
• A system prompt that encodes general spatial reasoning principles;
• A persistent kernel that retains all intermediate variables and results;
• Integration with scientific libraries (NumPy, SciPy) for complex geometric and numerical operations;
• An outer control loop orchestrating planning, code generation, execution, feedback, and answer submission.
This design significantly enhances the expressiveness and flexibility of spatial reasoning, enabling models to adaptively build complex geometric chains across multiple steps, leading to superior performance on challenging benchmarks.

Novelty

This work positions multi-turn code execution as the action interface for spatial reasoning in multimodal models. Existing spatial agents either run a one-shot program or issue structured API calls; SpatialClaw instead generates code iteratively within a persistent environment, which allows flexible composition and correction. Its core novelty is the pairing of a persistent Python kernel with a disciplined reasoning prompt, so that the model can assemble perception primitives and geometric computations tailored to each problem. This opens the door to more open-ended, complex spatial analysis in vision-language reasoning tasks.
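The contrast between a structured tool-call interface and a code interface can be made concrete with a toy example. Everything here is hypothetical: `detect` and `box_center` return canned values, and the JSON request shape is an assumption, not the format used by any particular agent.

```python
import json


# Hypothetical perception primitives returning canned values for illustration.
def detect(label):
    boxes = {"chair": (10, 40, 50, 120), "table": (80, 30, 200, 110)}
    return boxes[label]  # (x_min, y_min, x_max, y_max) in pixels


def box_center(box):
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


TOOLS = {"detect": detect, "box_center": box_center}


def structured_call(request_json):
    """Structured interface: each request names one tool with fixed arguments."""
    request = json.loads(request_json)
    return TOOLS[request["tool"]](*request["args"])


# Structured route: one round trip per tool invocation.
chair_box = structured_call('{"tool": "detect", "args": ["chair"]}')
chair_center = structured_call(json.dumps({"tool": "box_center", "args": [chair_box]}))

# Code route: a single cell chains the same primitives and adds ad hoc logic.
cell = """
centers = {name: box_center(detect(name)) for name in ("chair", "table")}
left_of = centers["chair"][0] < centers["table"][0]
"""
namespace = dict(TOOLS)
exec(cell, namespace)
print(chair_center, namespace["left_of"])
```

In the structured route, the left-of comparison would need either a dedicated tool or reasoning done by the model in text; in the code route it is one extra line in the same cell.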

Limitations

Despite its strengths, SpatialClaw depends heavily on the availability of preloaded perception tools and scientific libraries, which may limit its deployment in resource-constrained environments. The code generation process can introduce latency, especially for complex tasks that require many iterations. Additionally, the current framework assumes access to large VLM backbones such as the Qwen and Gemma series, which may not be feasible in all settings and could limit scalability.
The approach may struggle with ambiguous or noisy visual inputs, where correcting intermediate code becomes harder. Moreover, the reliance on large language models raises concerns about interpretability and controllability, especially in safety-critical applications. Future work should focus on optimizing code synthesis efficiency, incorporating verification mechanisms, and extending the toolset for broader geometric reasoning.
While the multi-step iterative process improves robustness, it can still fail in highly complex or ambiguous cases, which calls for stronger reasoning discipline and error correction strategies. Addressing these limitations will be essential for real-world deployment in robotics, autonomous navigation, and AR systems.

Future Work

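In the future, the authors plan to combine reinforcement learning and self-supervised learning to improve the efficiency and accuracy of code generation, particularly for dynamic scenes and complex geometric relationships. They also plan to extend the perception toolset, integrating more advanced geometric reasoning algorithms and scene understanding models to improve performance in real-world robotics and augmented reality applications. In addition, they will explore multimodal data fusion strategies to strengthen the model's spatial understanding of complex environments. In the long term, the goal is a highly autonomous spatial reasoning system capable of multi-turn self-correction, advancing agents' autonomous navigation, scene understanding, and interaction in complex environments.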

AI Executive Summary

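Spatial reasoning, the core ability to understand the positions, relationships, and motion of objects in 3D space, has long been a major challenge for vision-language models (VLMs). Existing methods mostly rely on single-pass program execution or structured API calls, which limits the flexibility and complexity of their reasoning and makes it hard to handle dynamic, multi-view, multi-timestep scenes. The SpatialClaw framework proposed in this paper adopts code as the action interface to enable multi-turn, iterative spatial reasoning. Using a persistent Python kernel, it lets the model write, execute, observe, and revise intermediate results step by step, substantially improving the expressiveness and robustness of its reasoning.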

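Across 20 spatial reasoning benchmarks, SpatialClaw reaches an average accuracy of 59.9%, 11.2 percentage points above the recent spatial agent, with particularly strong results on dynamic video and multi-view reasoning tasks. The gain is attributed to its flexible use of code, which lets the model compose perception tools (such as depth estimation, scene reconstruction, and distance measurement) according to intermediate evidence, and build and correct complex geometric relationships step by step. The experiments also show that the method transfers well across different VLM backbones (the Qwen and Gemma series), yielding consistent gains without any tuning.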

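The significance of this work is that it moves past the conventional limits on the action interface in spatial reasoning, giving multimodal systems greater expressive and operational capability. Through its multi-turn correction mechanism, the model can understand complex scenes more accurately and shows strong potential for practical applications such as robot navigation, augmented reality, and scene understanding. In the future, the authors plan to combine reinforcement learning and self-supervised techniques to further improve code generation efficiency and extend the perception toolset, moving spatial reasoning toward more capable systems.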

Deep Dive

Abstract

Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface that often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, letting a VLM-backed agent write one executable cell per step conditioned on all prior outputs, enabling the agent to flexibly compose and manipulate perception results and adapt its analysis to both intermediate text and visual observations and the demands of each problem. Evaluated across 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by +11.2 points, with consistent gains across six VLM backbones from two model families without any benchmark- or model-specific adaptation.

cs.CV cs.AI

