Multi-Turn Multi-Agent Dialogue for Collaborative Reconstruction Improves VLM Performance on Spatial Reasoning, But Only Barely

TL;DR

This study introduces a multi-turn multi-agent dialogue framework for evaluating VLMs on spatial reasoning. Dialogue yields only limited improvements, mainly because of visual grounding challenges.

cs.CL 2026-05-29
Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen
Vision-Language Models Spatial Reasoning Multi-turn Dialogue Robotics Interaction Structure Reconstruction

Key Findings

Methodology

This paper develops a multi-turn multi-agent dialogue setup in which two VLMs collaborate to reconstruct a target structure on a grid. The task uses the dataset from Kranti et al., which provides goal structures both as rendered images and as Python code. One VLM acts as the instruction generator: it receives the target structure in textual and/or visual form and produces instructions for object placement. The other acts as the instruction follower: it executes commands based on the dialogue and the current grid state. A Game Master orchestrates the interaction by validating responses, executing code, and managing turn-taking. The models are evaluated across input modalities (text-only, image-only, text+image), image representations (top-down, layered), and structural complexities. Performance is measured by exact-match success rate, complemented by a detailed analysis of dialogue behaviors and error propagation.
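The turn-taking described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `GameMaster` class, the `put <color> <row> <col>` action format, the `done` signal, and the two scripted stand-in agents are all assumptions made for the example, and real VLM calls would replace the stub functions. The 15-turn cap follows the limit mentioned in the experiments.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical grid encoding: (row, col) -> stack of colors, bottom first.
Grid = Dict[Tuple[int, int], List[str]]


@dataclass
class GameMaster:
    """Toy orchestrator: relays turns, validates actions, checks the result."""
    target: Grid
    max_turns: int = 15
    state: Grid = field(default_factory=dict)
    transcript: List[Tuple[str, str]] = field(default_factory=list)

    def apply(self, action: str) -> bool:
        """Execute 'put <color> <row> <col>'; return False if malformed."""
        parts = action.split()
        if len(parts) != 4 or parts[0] != "put":
            return False
        try:
            row, col = int(parts[2]), int(parts[3])
        except ValueError:
            return False
        self.state.setdefault((row, col), []).append(parts[1])
        return True

    def play(self, generator, follower) -> bool:
        """Alternate generator and follower turns; True on exact match."""
        for _ in range(self.max_turns):
            instruction = generator(self.target, self.state, self.transcript)
            self.transcript.append(("generator", instruction))
            if instruction == "done":
                break
            action = follower(instruction, self.state, self.transcript)
            self.transcript.append(("follower", action))
            if not self.apply(action):
                return False  # abort the episode on an invalid response
        return self.state == self.target


def scripted_generator(target, state, transcript):
    """Stand-in for the instruction-generating VLM: name the next missing piece."""
    for cell, stack in sorted(target.items()):
        placed = state.get(cell, [])
        if len(placed) < len(stack):
            return f"put {stack[len(placed)]} {cell[0]} {cell[1]}"
    return "done"


def echo_follower(instruction, state, transcript):
    """Stand-in for the instruction-following VLM: obey literally."""
    return instruction


if __name__ == "__main__":
    goal = {(0, 0): ["red", "blue"], (1, 2): ["green"]}
    print(GameMaster(target=goal).play(scripted_generator, echo_follower))
```

Keeping validation and state updates inside the orchestrator, rather than in either agent, means a malformed response ends the episode in one place and the transcript records every turn for later analysis.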

Key Results

  • Success rates are very low overall: in the multi-turn multi-agent setting, GPT-5.2-Chat reaches 21.8% and Qwen3-VL-30B only 3.6%. Multi-turn interaction improves performance slightly, but spatial understanding remains a persistent challenge.
  • Introducing layered target images significantly boosts success rates; GPT-5.2-Chat reaches 49.3%, Qwen3-VL-30B reaches 18%. Text-based target representations outperform image-only inputs, with success rates of 91.7% versus 16-23%.
  • Role division (instruction generator vs. executor) provides marginal gains; however, performance remains constrained by visual occlusion and complex spatial relations. Clarification questions are frequent but do not always lead to recovery, highlighting the difficulty of spatial grounding.

Significance

This research underscores the fundamental limitations of current VLMs in spatial reasoning, especially in multi-turn collaborative contexts. It highlights the importance of better visual spatial grounding and multimodal fusion strategies to advance autonomous robotics and human-robot interaction. The findings inform future directions in model architecture, dataset design, and training paradigms aimed at robust spatial understanding, which is crucial for practical deployment in real-world scenarios such as assembly, navigation, and collaborative tasks.

Technical Contribution

The paper introduces a multi-turn multi-agent dialogue framework that separates the stages of spatial interpretation, instruction generation, and execution. It uses role separation and layered visual representations to analyze how each affects spatial reasoning. The experimental setup, based on the dataset from Kranti et al., enables systematic evaluation of model performance across modalities and complexities. The study also provides a detailed analysis of dialogue behavior that identifies key error sources and how errors propagate, offering insight into the limitations of current VLMs in spatial grounding tasks.

Novelty

This is the first comprehensive study to incorporate multi-turn multi-agent dialogue for spatial structure reconstruction, emphasizing role division and layered visual inputs. Unlike prior work focused on static or single-turn tasks, this research explores the dynamics of collaborative reasoning, providing new insights into how dialogue and multimodal representations influence spatial understanding. Its systematic analysis of failure modes and performance bottlenecks marks a significant step forward in the field of multimodal spatial reasoning.

Limitations

  • The experiments are limited to simple 2.5D structures with up to five components, which do not fully reflect the complexity of real-world spatial perception and manipulation tasks.
  • Models rely heavily on visual spatial grounding, which is inherently limited by occlusion and 2D representations and lacks true 3D understanding.
  • Despite multi-turn interactions, the models still frequently fail in spatial reasoning, indicating that current architectures lack robust mechanisms for deep spatial relationship modeling, especially in complex scenarios.

Future Work

Future research should focus on integrating richer 3D spatial representations, such as point clouds and depth maps, to improve spatial grounding. Developing advanced multimodal fusion techniques and reasoning modules could significantly enhance model robustness. Extending the framework to real robot platforms and more complex structures will be crucial for practical applications. Additionally, exploring learning-based approaches for dialogue management and error correction could further improve collaborative spatial reasoning capabilities.

AI Executive Summary

In the rapidly evolving field of robotics and artificial intelligence, enabling autonomous agents to understand and manipulate complex spatial structures remains a fundamental challenge. Vision-language models (VLMs) have shown promise in tasks like image captioning and visual question answering, but their capacity for spatial reasoning, especially in collaborative multi-turn dialogues, is still limited. This study addresses this gap by proposing a multi-turn multi-agent dialogue framework designed to evaluate and enhance the spatial reasoning abilities of VLMs.

The core idea is to simulate a collaborative environment where two models—one generating instructions based on a target structure, and the other executing these instructions—work together to reconstruct a given spatial configuration. The dataset used is derived from Kranti et al., featuring goal structures rendered as images and Python code, which allows precise evaluation of the models’ understanding. The interaction involves multiple rounds of dialogue, with the instruction generator receiving the target in various modalities (text, images, layered images), and the executor translating instructions into actions.

Experimental results reveal that current models perform poorly in this task, with success rates below 25% even in multi-turn settings. Incorporating layered images improves performance significantly, especially for GPT-5.2-Chat, which reaches nearly 50%. Text descriptions outperform image inputs, highlighting the importance of explicit, structured information. Despite these improvements, models still struggle with complex structures, occlusion, and stacking relations, often requiring multiple clarifications that do not always lead to successful reconstruction.

These findings underscore the persistent difficulty of visual spatial grounding in AI models. They suggest that current architectures lack the necessary mechanisms for deep spatial relationship understanding, particularly in multi-step, multi-modal interactions. The research points toward future directions involving richer 3D representations, advanced multimodal fusion, and learning-based dialogue management to overcome these limitations.

Overall, this work provides a systematic framework and comprehensive analysis for evaluating and improving VLMs in spatial reasoning tasks. It highlights critical bottlenecks and offers insights that could guide the development of autonomous systems able to understand and manipulate the physical world with human-like spatial awareness. The implications extend across robotics, manufacturing, and human-robot collaboration.

Deep Analysis

Background

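In recent years, advances in deep learning and multimodal techniques have brought vision-language models (VLMs) notable progress on tasks such as image captioning, visual question answering, and cross-modal retrieval. Representative work such as the VQA and image captioning approaches of Feng et al. (2019) laid a foundation for models to understand visual content. Krishna et al. (2020) built a large-scale multimodal dataset that advanced multimodal fusion techniques. Li et al. (2022) combined deep neural networks with symbolic reasoning in an attempt to improve the understanding of spatial relations. Even so, models remain limited in deeper reasoning about spatial relations, stacking relations, and occluded scenes. In multi-turn dialogue and collaborative tasks in particular, insufficient spatial grounding is a bottleneck. Existing research mostly focuses on static scenes or single-turn tasks and lacks a systematic analysis of spatial reasoning in multi-turn interaction.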

Core Problem

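The main difficulty for current VLMs in spatial reasoning is understanding spatial relations accurately, compounded by insufficient spatial grounding. Models struggle to maintain a consistent spatial picture across a multi-turn dialogue; with occlusion, stacking, and complex structures they are prone to stacking errors, position offsets, and omitted colors. Without an effective multimodal fusion strategy, models have too little information when interpreting the target structure and the instructions, which leads to very low reconstruction success rates. Solving these problems matters for autonomous robot operation, spatial navigation, and human-robot collaboration, yet existing models still perform poorly on spatial relation reasoning, instruction understanding, and multi-turn interaction, so systematic study and improvement are needed.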

Innovation

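The core innovation of this study is a multi-turn multi-agent dialogue framework for systematically analyzing bottlenecks in spatial reasoning. Specifically:

  • Role division: instruction generation and execution are separated, mirroring task allocation in real robot systems, with the aim of improving task understanding.
  • Multimodal input design: an overall view is combined with layered stacking views, which eases the difficulty of interpreting occlusion and stacking relations.
  • Structure success rate is used as the evaluation metric and combined with dialogue behavior analysis to reveal how models behave in spatial reasoning and where errors originate.
  • The experiments include multi-turn interaction and a clarification mechanism to assess spatial reasoning on complex structures and multimodal information.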
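The difference between an overall (top-down) view and layered views can be illustrated with a small sketch. The dictionary encoding of a structure and the function names `top_down_view` and `layered_views` are assumptions made for this example; the paper works with rendered images rather than dictionaries, so this only shows why a single view from above loses information that per-layer views keep.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: (row, col) -> stack of colors, bottom first.
Structure = Dict[Tuple[int, int], List[str]]


def top_down_view(structure: Structure) -> Dict[Tuple[int, int], str]:
    """Only the topmost piece of each stack is visible from above."""
    return {cell: stack[-1] for cell, stack in structure.items() if stack}


def layered_views(structure: Structure) -> List[Dict[Tuple[int, int], str]]:
    """One view per stacking level, so pieces hidden from above become visible."""
    height = max((len(stack) for stack in structure.values()), default=0)
    return [
        {cell: stack[level] for cell, stack in structure.items() if len(stack) > level}
        for level in range(height)
    ]


if __name__ == "__main__":
    structure = {(0, 0): ["red", "blue"], (0, 1): ["green"]}
    print(top_down_view(structure))   # the red piece under the blue one is hidden
    print(layered_views(structure))   # level 0 shows red and green, level 1 shows blue
```

In the example, the red piece at (0, 0) is absent from the top-down view but appears in the first layer, which is the kind of occluded information a decomposed representation makes explicit.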

Methodology

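  • Dataset: the 2.5D structure dataset proposed by Kranti et al., which contains Python code and rendered images of the target structures and simulates realistic spatial layouts.
  • Role design: the instruction generator (Programmer) produces instructions for building the target, and the executor (Robot) carries out spatial operations according to those instructions.
  • Multimodal input: the instruction generator receives a text description and layered images of the target structure; the executor receives the instruction text and an image of the current spatial state.
  • Dialogue flow: over multiple turns, the instruction generator provides instructions step by step, and the executor carries them out and reports the state, until the structure is reconstructed or the maximum number of turns is reached.
  • Evaluation metric: structure success rate (Exact Match) measures reconstruction accuracy, and the effect of dialogue behaviors (clarification, correction) on performance is analyzed.
  • Models: Qwen3-VL-30B (open-weight) and GPT-5.2-Chat (closed) are used to compare the effects of different modalities and role division.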
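The exact-match metric mentioned in the list above could be computed along these lines. This is a sketch under assumptions: the dictionary encoding of a structure and the helper names `exact_match` and `success_rate` are illustrative, and the paper's own comparison may normalize structures differently.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical encoding: (row, col) -> stack of colors, bottom first.
Structure = Dict[Tuple[int, int], List[str]]


def exact_match(built: Structure, target: Structure) -> bool:
    """True only if every cell holds exactly the same stack, in the same order."""
    def normalize(s: Structure) -> Structure:
        # Treat cells with an empty stack the same as absent cells.
        return {cell: list(stack) for cell, stack in s.items() if stack}
    return normalize(built) == normalize(target)


def success_rate(episodes: Sequence[Tuple[Structure, Structure]]) -> float:
    """Percentage of (built, target) pairs that match exactly."""
    if not episodes:
        return 0.0
    hits = sum(exact_match(built, target) for built, target in episodes)
    return 100.0 * hits / len(episodes)


if __name__ == "__main__":
    target = {(0, 0): ["red", "blue"]}
    episodes = [
        ({(0, 0): ["red", "blue"]}, target),   # correct
        ({(0, 0): ["blue", "red"]}, target),   # stacking order wrong
        ({(0, 1): ["red", "blue"]}, target),   # position wrong
        ({(0, 0): ["red"]}, target),           # piece missing
    ]
    print(success_rate(episodes))
```

Because the metric is all-or-nothing, a single stacking, position, or color error anywhere in the structure counts the whole episode as a failure, which is one reason early mistakes in a dialogue are costly.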

Experiments

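  • Single-turn and multi-turn interaction scenarios were designed to evaluate the models on structures of varying complexity (2-5 elements).
  • Different input modalities (text, overall image, layered image) were compared to analyze how the richness of information affects reasoning.
  • Success rates were computed after at most 15 dialogue turns and combined with dialogue behavior analysis to examine bottlenecks in spatial understanding.
  • Role swapping (exchanging which model acts as instruction generator and which as executor) was used to test how well the models generalize.
  • Ablation experiments were also run to verify the contribution of layered images and text descriptions to performance.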
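The experimental factors listed above can be laid out as a condition grid. The sketch below is only an illustration of how such a grid might be enumerated: the `Condition` tuple and the label strings are assumptions, and a full crossing of all factors is itself an assumption, since the paper may not run every combination.

```python
from itertools import product
from typing import List, NamedTuple, Optional


class Condition(NamedTuple):
    interaction: str            # "single-turn" or "multi-turn"
    modality: str               # how the target is given to the generator
    image_repr: Optional[str]   # None when no image is involved
    n_elements: int             # structural complexity


INTERACTIONS = ("single-turn", "multi-turn")
MODALITIES = ("text-only", "image-only", "text+image")
IMAGE_REPRS = ("top-down", "layered")
COMPLEXITIES = range(2, 6)  # 2 to 5 elements


def enumerate_conditions() -> List[Condition]:
    """Cross all factors, skipping image representations for text-only input."""
    conditions: List[Condition] = []
    for interaction, modality, n in product(INTERACTIONS, MODALITIES, COMPLEXITIES):
        reprs = (None,) if modality == "text-only" else IMAGE_REPRS
        conditions.extend(Condition(interaction, modality, r, n) for r in reprs)
    return conditions


if __name__ == "__main__":
    grid = enumerate_conditions()
    print(len(grid))
    print(grid[0])
```

Treating the image representation as undefined for text-only input keeps the grid free of meaningless cells and makes an ablation over image representations a simple filter on the list.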

Results

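  • Success rates in spatial reasoning are very low: in the multi-turn multi-agent setting, GPT-5.2-Chat reaches 21.8% and Qwen3-VL-30B only 3.6%.
  • Introducing layered images improves performance markedly: GPT-5.2-Chat rises to 49.3% and Qwen3-VL-30B to 18%.
  • Text descriptions of the target structure perform best across modalities, with a success rate of 91.7%, far above image-only input (about 16-23%).
  • Multi-turn interaction brings some improvement, but on complex structures and over many turns the models still frequently make stacking errors, omit colors, and make similar mistakes.
  • Clarification occurs frequently (in 97-100% of episodes) but does not fundamentally improve the success rate, which reflects the underlying difficulty of spatial grounding.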

Applications

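  • Autonomous robotic assembly: using a model's understanding of spatial relations for intelligent control of automated assembly lines.
  • Human-robot collaboration: strengthening robots' spatial reasoning in complex environments to make human-robot interaction more efficient.
  • Smart manufacturing: in industrial settings, models could assist with spatial layout optimization and defect detection.
  • Education and training: as a teaching aid that helps students understand spatial relations and structural design.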

Limitations & Outlook

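  • Validation is limited to simple 2.5D structures of bounded complexity, which does not fully reflect the complexity of spatial perception in real robot environments.
  • Performance on spatial relation reasoning is limited by visual spatial grounding, with no depth information or 3D spatial understanding.
  • Clarification and correction in multi-turn dialogue do not significantly improve performance, indicating that the models still have fundamental deficiencies in modeling spatial relations.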

Plain Language Accessible to non-experts

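Imagine you work in a factory full of different machines and parts. Your job is to put the parts in the right places according to certain rules, for example stacking them or arranging them into a particular shape. You have an assistant (like a robot) who cannot see everything and can only act on your instructions. You tell him, "Put the red part in the first row on the left," and he tries, but sometimes he gets it wrong, for instance by stacking it in the wrong place or using the wrong color. The two of you go back and forth, confirming and correcting, until the structure is right. This is like the multi-turn dialogue in the paper: the model has to understand spatial relations and stacking levels, and keep adjusting based on your descriptions. The study finds that although the assistant tries hard, understanding spatial relations and stacking order is still very difficult for it, especially for complex structures. Just as factory workers need more training, the model needs a better grasp of space before it can really be of help.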

ELI14 Explained like you're 14

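Imagine you are playing with building blocks and want to stack blocks of different colors and shapes into a beautiful castle. You have a friend (like a robot) who cannot see your pile of blocks, but you can tell him in words: "Put the red square on the first level and the blue round one on the second level." Your friend tries to place them, but sometimes gets it wrong, for example by putting the blue round block on the first level or forgetting the color. You keep talking, checking each step, until the castle looks the way you imagined it. This is like the multi-turn dialogue in the paper: the model has to understand spatial relations and stacking order, and keep adjusting based on your descriptions. The study finds that although this process can help, the model still struggles to fully understand complex spatial relations, especially when the stack is tall or the structure is complicated. Just as you and your friend need to talk many times to get the castle perfect, the model needs a better sense of space before it can help you build the best castle.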

Abstract

Robots operating in diverse environments rely on visual input to interpret objects and spatial layouts. In human-collaborative tasks, they are expected to communicate this understanding through language. Vision-language models (VLMs) support robotic tasks involving visual interpretation, question answering, and instruction following, but their capabilities in collaborative dialogue tasks requiring spatial reasoning remain underexplored. We study this gap through a collaborative structure-building task that combines visual interpretation, grounding, language-guided interaction, and action generation. We develop a framework in which VLMs use dialogue to reconstruct a target structure from visual and textual inputs. We evaluate open-weight and closed VLMs across interaction settings, input modalities, and image representations. Results show that spatial reasoning over visual representations remains difficult for the evaluated VLMs. Detailed text representations of the target yield higher reconstruction success across modality conditions, while decomposed image representations improve performance. These findings reveal limits in visual spatial grounding and grounded instruction generation for collaborative VLM agents.

cs.CL cs.RO

References (20)

clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations

Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen

2025 3 citations

Towards No-Code Programming of Cobots: Experiments with Code Synthesis by Large Code Models for Conversational Programming

Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen

2024 3 citations

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani et al.

2024 806 citations

BRAVE: Broadening the visual encoding of vision-language models

Oğuzhan Fatih Kar, A. Tonioni, Petra Poklukar et al.

2024 75 citations

Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions

Cunxin Fan, Xiaosong Jia, Yihang Sun et al.

2025 49 citations

clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents

Chalamalasetti Kranti, Jana Götze, Sherzod Hakimov et al.

2023 58 citations

ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

Mohit Shridhar, Jesse Thomason, Daniel Gordon et al.

2019 1080 citations

CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era

Kanzhi Cheng, Wenpo Song, Jiaxin Fan et al.

2025 34 citations

Natural Language Communication with Robots

Yonatan Bisk, Deniz Yuret, D. Marcu

2016 127 citations

iVISPAR - An Interactive Visual-Spatial Reasoning Benchmark for VLMs

Julius Mayer, Mohamad Ballout, Serwan Jassim et al.

2025 23 citations

GuessWhat?! Visual Object Discovery through Multi-modal Dialogue

H. D. Vries, Florian Strub, A. Chandar et al.

2016 442 citations

CoPa: General Robotic Manipulation through Spatial Constraints of Parts with Foundation Models

Haoxu Huang, Fanqi Lin, Yingdong Hu et al.

2024 138 citations

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Ranjay Krishna, Yuke Zhu, O. Groth et al.

2016 6512 citations

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

Andy Zeng, Adrian S. Wong, Stefan Welker et al.

2022 725 citations

A Natural Language Corpus of Common Grounding under Continuous and Partially-Observable Context

Takuma Udagawa, Akiko Aizawa

2019 52 citations

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

Peng Jin, Ryuichi Takanobu, Caiwan Zhang et al.

2023 433 citations

Learning to execute instructions in a Minecraft dialogue

Prashant Jayannavar, Anjali Narayan-Chen, J. Hockenmaier

2020 48 citations

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

Wenlong Huang, P. Abbeel, Deepak Pathak et al.

2022 1572 citations

Code as Policies: Language Model Programs for Embodied Control

Jacky Liang, Wenlong Huang, F. Xia et al.

2022 1547 citations

Language Models are Few-Shot Learners

Tom B. Brown, Benjamin Mann, Nick Ryder et al.

2020 58691 citations