Evaluating VLMs' Spatial Reasoning Over Robot Motion: A Step Towards Robot Planning with Motion Preferences

TL;DR

Qwen2.5-VL excels in spatial reasoning for robot motion with 71.4% zero-shot accuracy.

cs.RO · Advanced · 2026-03-13
Wenxi Wu Jingjing Zhang Martim Brandão
Vision-Language Models Spatial Reasoning Robot Planning Motion Preferences Zero-Shot Learning

Key Findings

Methodology

This paper employs a methodology combining Vision-Language Models (VLMs) and sampling-based motion planning algorithms to evaluate VLMs' capability in spatial reasoning over robot motion. Specifically, the Bidirectional Rapidly-exploring Random Trees (BiRRT) and Probabilistic RoadMaps (PRM) algorithms are used to generate diverse path candidates. The K-means clustering algorithm is then applied to group the paths, and VLMs are used to score the paths, selecting the one that best matches the user's description.

Key Results

  • Qwen2.5-VL achieved 71.4% zero-shot accuracy with the single-query method, and a smaller fine-tuned model reached 75%. GPT-4o performed worse.
  • On the 126 navigation problems, Qwen2.5-VL achieved 74.4% accuracy on object-proximity preferences and 63.9% on path-style preferences.
  • On the 432 manipulation problems, Qwen2.5-VL achieved 66.3% accuracy on object-proximity preferences, while GPT-4o achieved 69.5% on path-style preferences.

Significance

This study demonstrates the potential of integrating Vision-Language Models (VLMs) into robot motion planning pipelines, particularly in handling user preferences and motion constraints. This approach allows robots to better understand and execute complex user instructions, enhancing their generalization capabilities on new tasks, objects, and motion specifications. This has significant implications for human-robot interaction and automation, advancing the development of intelligent robotic systems.

Technical Contribution

The technical contribution of this paper lies in a method that applies Vision-Language Models (VLMs) to robot motion planning in order to satisfy motion preferences and constraints. Compared to existing methods, this approach better handles complex spatial relationships and user instructions. The paper also analyzes the trade-off between accuracy and computation cost, offering insights for future research.

Novelty

This study is the first to apply Vision-Language Models (VLMs) to spatial reasoning tasks in robot motion planning, particularly in handling motion preferences and constraints. Compared to previous work, this approach better understands and executes complex user instructions, showcasing the potential of VLMs in this field.

Limitations

  • In some scenarios, VLMs may fail to accurately recognize the length or complexity of paths, which is precisely the type of problem classical optimal planners (e.g., RRT*, PRM*) can efficiently solve.
  • VLMs may experience 'hallucination' in handling certain complex spatial relationships, selecting a candidate path that does not exist.
  • Although fine-tuning can improve model accuracy, it requires more data and computational resources.

Future Work

Future research directions include further improving VLMs' accuracy in complex spatial reasoning tasks and developing more efficient user interaction interfaces. Additionally, exploring the integration of VLMs with other advanced robot motion planning technologies could enhance their robustness and efficiency in practical applications.

AI Executive Summary

In modern robotics, understanding user instructions and spatial relations of objects in the environment is crucial for robotic systems to assist humans in various tasks. However, existing foundational models applied in task planning still face limitations, especially in enforcing user preferences or motion constraints. To address this, this paper proposes a methodology combining Vision-Language Models (VLMs) and sampling-based motion planning algorithms to evaluate VLMs' capability in spatial reasoning over robot motion.

Specifically, the researchers evaluated four state-of-the-art VLMs using four different querying methods. The results show that Qwen2.5-VL achieves 71.4% zero-shot accuracy using the single-query method and 75% on a smaller model after fine-tuning. In contrast, GPT-4o showed lower performance. The study also evaluated two types of motion preferences (object-proximity and path-style) and analyzed the trade-off between accuracy and computation cost.

The findings indicate that VLMs have potential in handling complex spatial relationships and user instructions, particularly excelling in object-proximity issues compared to path-style issues. This provides a theoretical foundation and practical guidance for integrating VLMs into robot motion planning pipelines.

However, the study also identified some limitations. For instance, VLMs may fail to accurately recognize the length or complexity of paths in certain scenarios. Additionally, although fine-tuning can improve model accuracy, it requires more data and computational resources.

Future research directions include further improving VLMs' accuracy in complex spatial reasoning tasks and developing more efficient user interaction interfaces. Additionally, exploring the integration of VLMs with other advanced robot motion planning technologies could enhance their robustness and efficiency in practical applications. Through these efforts, intelligent robotic systems will be better equipped to understand and execute complex user instructions, advancing the fields of human-robot interaction and automation.

Deep Analysis

Background

With the rapid advancement of artificial intelligence, intelligent robotic systems are playing an increasingly important role in daily life and industrial production. To better assist humans in completing various tasks, robots need to have the ability to understand user instructions and spatial relations of objects in the environment. In recent years, Vision-Language Models (VLMs) have gained widespread attention for their potential in natural language understanding and visual reasoning. VLMs provide an intuitive interface for users to give instructions to robots by acquiring rich semantic knowledge from large-scale internet data. However, despite the application of foundational models in task planning, their capability in enforcing user preferences or motion constraints remains unclear. To address this, this paper proposes a methodology combining VLMs and sampling-based motion planning algorithms to evaluate VLMs' capability in spatial reasoning over robot motion.

Core Problem

In robot motion planning, understanding and executing user motion preferences and constraints is a key issue. Users may have specific preferences for motion paths, such as wanting the path to be straight, curved, or zigzag, or wanting the robot to move close to or away from a particular object. Existing foundational models face limitations in handling these complex spatial relationships and user instructions, making it difficult to meet user expectations. Therefore, there is an urgent need for a method that can effectively address these issues to enhance the robot's generalization capabilities on new tasks, objects, and motion specifications.

Innovation

The core innovation of this paper lies in applying Vision-Language Models (VLMs) to spatial reasoning tasks in robot motion planning, particularly in handling motion preferences and constraints. Specifically, the researchers propose a methodology combining VLMs and sampling-based motion planning algorithms to generate diverse path candidates and use VLMs to score the paths, selecting the one that best matches the user's description. Compared to previous work, this approach better understands and executes complex user instructions, showcasing the potential of VLMs in this field.

Methodology

The methodology of this paper includes the following key steps:


  • Use the Bidirectional Rapidly-exploring Random Trees (BiRRT) and Probabilistic RoadMaps (PRM) algorithms to generate diverse path candidates.
  • Apply K-means clustering to group the paths and select the path closest to each cluster center for visualization.
  • Use Vision-Language Models (VLMs) to score the candidate paths, selecting the one that best matches the user's description.
  • Evaluate four different querying methods to determine which performs best at path selection.
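The four steps above can be sketched end-to-end. Everything below is a stand-in rather than the authors' implementation: the BiRRT/PRM planner is replaced by noisy polylines, the clustering step is a tiny hand-rolled Lloyd's k-means, and `vlm_score` is a placeholder that simply favours shorter paths in place of a real VLM query.

```python
import numpy as np

def sample_candidate_paths(start, goal, n_paths=60, n_waypoints=8, rng=None):
    """Stand-in for BiRRT/PRM: noisy polylines from start to goal."""
    rng = rng or np.random.default_rng(0)
    t = np.linspace(0.0, 1.0, n_waypoints)[:, None]
    base = (1 - t) * start + t * goal           # straight-line interpolation
    noise = rng.normal(scale=0.5, size=(n_paths, n_waypoints, 2))
    noise[:, [0, -1], :] = 0.0                  # keep endpoints fixed
    return base[None] + noise

def kmeans(X, k, iters=25, rng=None):
    """Tiny k-means (Lloyd's algorithm) on the row vectors of X."""
    rng = rng or np.random.default_rng(0)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers

def representatives(paths, k=5):
    """One path per cluster: the member closest to its cluster centre."""
    flat = paths.reshape(len(paths), -1)
    labels, centers = kmeans(flat, k)
    reps = []
    for c in range(k):
        members = np.where(labels == c)[0]
        if len(members) == 0:
            continue
        d = np.linalg.norm(flat[members] - centers[c], axis=1)
        reps.append(members[np.argmin(d)])
    return paths[reps]

def vlm_score(path, instruction):
    """Placeholder for the VLM query; here it just favours shorter paths."""
    return -np.linalg.norm(np.diff(path, axis=0), axis=1).sum()

start, goal = np.array([0.0, 0.0]), np.array([5.0, 0.0])
candidates = sample_candidate_paths(start, goal)
reps = representatives(candidates, k=5)
best = max(reps, key=lambda p: vlm_score(p, "take the straightest path"))
```

In the paper's pipeline, the representative paths would be rendered onto the scene image and the scoring step would be a query to the VLM with the user's instruction; the structure of the loop is the same.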

Experiments

The experiments are based on a dataset of 558 language-constrained robot motion planning problems: 126 navigation problems and 432 manipulation problems. Each problem consists of a virtual scene, a start and goal location, and a text description of the desired properties of the motion. The researchers manually selected start and goal locations so that diverse ways of traveling between the two exist. The scenes come from the iGibson simulation environment, and the evaluated VLMs include Qwen2.5-VL, GPT-4o, and LLaVA-1.5.

Results

The experimental results show that Qwen2.5-VL achieved 71.4% zero-shot accuracy with the single-query method, and a smaller fine-tuned model reached 75%. On the 126 navigation problems, Qwen2.5-VL achieved 74.4% accuracy on object-proximity preferences and 63.9% on path-style preferences. On the 432 manipulation problems, Qwen2.5-VL achieved 66.3% accuracy on object-proximity preferences, while GPT-4o achieved 69.5% on path-style preferences.

Applications

The methodology presented in this paper can be directly applied to motion planning tasks in intelligent robotic systems, particularly when dealing with complex user instructions and motion preferences. By integrating Vision-Language Models (VLMs) into robot motion planning pipelines, robots can better understand and execute complex user instructions, enhancing their generalization capabilities on new tasks, objects, and motion specifications. This has significant implications for human-robot interaction and automation, advancing the development of intelligent robotic systems.

Limitations & Outlook

Despite the promising results of this methodology in handling complex spatial relationships and user instructions, there are still some limitations. For instance, VLMs may fail to accurately recognize the length or complexity of paths in certain scenarios. Additionally, although fine-tuning can improve model accuracy, it requires more data and computational resources. Future research should focus on further improving VLMs' accuracy in complex spatial reasoning tasks and developing more efficient user interaction interfaces.

Plain Language (Accessible to non-experts)

Imagine you're in a kitchen cooking, and you need an assistant to help you fetch things. You tell the assistant, 'Please place the salt shaker away from the pot.' The assistant needs to understand your instruction and decide how to move the salt shaker based on the kitchen's layout. Now, suppose this assistant is a robot. It needs to understand your language instruction and find a suitable path in the kitchen to complete the task. This is the problem discussed in this paper: how to enable robots to understand and execute complex user instructions, especially involving spatial relationships and motion preferences.

The researchers used a technology called Vision-Language Models (VLMs), which helps robots understand natural language instructions and make decisions by combining visual information. Through this method, robots can choose the most suitable path in different scenarios to complete the task. Just like in the kitchen, the robot can choose a path away from the pot to place the salt shaker based on your instruction.

To achieve this, the researchers used algorithms called Bidirectional Rapidly-exploring Random Trees (BiRRT) and Probabilistic RoadMaps (PRM) to generate various possible paths, then used VLMs to score these paths and select the one that best matches the user's description. This way, robots can better understand and execute complex user instructions, enhancing their generalization capabilities on new tasks, objects, and motion specifications.

This study demonstrates the potential of integrating VLMs into robot motion planning, particularly in handling user preferences and motion constraints. This has significant implications for human-robot interaction and automation, advancing the development of intelligent robotic systems.

ELI14 (Explained like you're 14)

Hey there! Have you ever thought about how cool it would be if robots could understand our instructions just like humans? For example, you want a robot to help you put a toy in a corner of the room, but you want it to avoid the table. The robot needs to know how to move without bumping into the table, right?

That's what scientists are working on! They're using a technology called Vision-Language Models (VLMs), which helps robots understand our language instructions and make decisions based on what they see. It's like when you're playing a game and have to decide your next move based on the map.

To make robots smarter, scientists also use some cool algorithms like Bidirectional Rapidly-exploring Random Trees (BiRRT) and Probabilistic RoadMaps (PRM) to generate different possible paths for the robot. Then, they let the robot choose the path that best matches our instructions. This way, the robot can complete tasks more effectively!

This research brings us one step closer to having smarter robots! In the future, robots might play an even bigger role in our lives, helping us with all sorts of tasks. Isn't that exciting?

Glossary

Vision-Language Models (VLMs)

Vision-Language Models are models that combine visual information and natural language processing to understand and generate natural language descriptions related to visual content.

In this paper, VLMs are used to understand user language instructions and visual information in the environment to select appropriate robot motion paths.

Bidirectional Rapidly-exploring Random Trees (BiRRT)

BiRRT is an algorithm used for path planning that grows two trees simultaneously from the start and goal to find a path. This method efficiently explores complex spaces.

BiRRT is used in this paper to generate diverse path candidates for VLMs to score and select.
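A minimal sketch of the bidirectional idea, assuming an empty 2D workspace: one tree grows from the start, one from the goal, and they alternately extend toward random samples until they can be joined. A real BiRRT also performs collision checking against obstacles, which is omitted here.

```python
import math
import random

def birrt(start, goal, step=0.5, iters=2000, seed=0):
    """Grow one tree from start and one from goal, alternately, until they meet."""
    rng = random.Random(seed)
    trees = [{start: None}, {goal: None}]          # node -> parent
    for i in range(iters):
        q = (rng.uniform(-1, 6), rng.uniform(-1, 6))
        a, b = trees[i % 2], trees[(i + 1) % 2]    # alternate which tree extends
        near = min(a, key=lambda n: math.dist(n, q))
        d = math.dist(near, q)
        if d < 1e-9:
            continue
        # steer from `near` toward the sample, moving at most `step`
        new = q if d <= step else tuple(n + step * (x - n) / d
                                        for n, x in zip(near, q))
        a[new] = near
        other = min(b, key=lambda n: math.dist(n, new))
        if math.dist(other, new) <= step:          # the two trees can be joined
            def trace(tree, node):
                out = []
                while node is not None:
                    out.append(node)
                    node = tree[node]
                return out
            path = trace(a, new)[::-1] + trace(b, other)
            return path if path[0] == start else path[::-1]
    return None

path = birrt((0.0, 0.0), (5.0, 5.0))
```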

Probabilistic RoadMaps (PRM)

PRM is a path planning algorithm that generates nodes by random sampling in the configuration space and connects these nodes to form paths.

In this paper, PRM is used to generate diverse path candidates for VLMs to score and select.
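A minimal PRM sketch, again assuming an empty 2D square so that collision checks can be omitted: sample nodes, connect each to its k nearest neighbours, then run Dijkstra over the resulting roadmap. The graph construction and search below are illustrative, not the paper's implementation.

```python
import heapq
import math
import random

def prm(start, goal, n_samples=100, k=8, seed=0):
    """Sample a roadmap, connect k-nearest neighbours, run Dijkstra start->goal."""
    rng = random.Random(seed)
    nodes = [start, goal] + [(rng.uniform(0, 5), rng.uniform(0, 5))
                             for _ in range(n_samples)]
    adj = {i: [] for i in range(len(nodes))}
    for i, p in enumerate(nodes):
        nearest = sorted(range(len(nodes)),
                         key=lambda j: math.dist(p, nodes[j]))[1:k + 1]
        for j in nearest:                          # undirected roadmap edges
            w = math.dist(p, nodes[j])
            adj[i].append((j, w))
            adj[j].append((i, w))
    dist, prev = {0: 0.0}, {}                      # Dijkstra; start=0, goal=1
    pq = [(0.0, 0)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == 1:
            break
        if d > dist.get(u, math.inf):
            continue                               # stale queue entry
        for v, w in adj[u]:
            if d + w < dist.get(v, math.inf):
                dist[v], prev[v] = d + w, u
                heapq.heappush(pq, (d + w, v))
    if 1 not in dist:
        return None                                # roadmap disconnected
    path, u = [], 1
    while True:
        path.append(nodes[u])
        if u == 0:
            break
        u = prev[u]
    return path[::-1]

path = prm((0.0, 0.0), (5.0, 5.0))
```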

K-means Clustering

K-means clustering is an unsupervised learning algorithm that partitions data points into K clusters, with each data point belonging to the cluster with the nearest centroid.

K-means clustering is used in this paper to group generated paths and select representative paths for visualization.

Zero-Shot Learning

Zero-shot learning is a machine learning method that aims to enable models to make predictions on unseen classes.

This paper evaluates the ability of VLMs to select appropriate paths under zero-shot conditions.

Motion Preferences

Motion preferences refer to users' specific requirements for robot motion paths, such as the shape of the path or the distance from objects.

This paper investigates VLMs' ability to handle user motion preferences.

Path Style

Path style refers to the geometric shape of the path, such as straight, curved, or zigzag.

This paper evaluates VLMs' performance in selecting paths that match user path style descriptions.
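One simple, hypothetical way to quantify path style (not a metric taken from the paper) is a straightness ratio: the straight-line distance between the endpoints divided by the path's arc length. A value of 1.0 means perfectly straight; zigzag or curved paths score lower.

```python
import math

def straightness(path):
    """Ratio of endpoint distance to arc length; 1.0 = perfectly straight."""
    arc = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
    return math.dist(path[0], path[-1]) / arc if arc else 1.0

straight = [(0, 0), (1, 0), (2, 0)]
zigzag = [(0, 0), (1, 1), (2, 0)]
```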

Object Proximity

Object proximity refers to the distance relationship between the robot and objects in the environment during motion.

This paper studies VLMs' accuracy in handling object proximity issues.
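A hypothetical object-proximity measure (for illustration, not the paper's definition) is the minimum distance from any path waypoint to the object's position; a denser waypoint sampling gives a more accurate estimate.

```python
import math

def min_clearance(path, obj):
    """Smallest distance from any waypoint on the path to the object."""
    return min(math.dist(p, obj) for p in path)

near = [(0, 0), (1, 0.5), (2, 0)]   # passes close to the object
far = [(0, 0), (1, 3), (2, 0)]      # detours around it
obj = (1.0, 0.0)
```

Ranking candidate paths by this value (ascending or descending) would correspond to "move close to" or "stay away from" an object.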

iGibson

iGibson is a 3D interactive simulation environment for robot learning, containing scenes reconstructed from real homes.

iGibson is used in this paper to generate datasets for robot motion planning problems.

Fine-Tuning

Fine-tuning is a machine learning technique that involves further training a model on a specific task to improve its performance on that task.

Fine-tuning is used in this paper to improve VLMs' accuracy on specific motion planning tasks.

Open Questions (Unanswered questions from this research)

  1. Although this paper demonstrates the potential of VLMs for handling user motion preferences, their performance on certain complex spatial relationships remains suboptimal. Future research should explore more capable model architectures for such scenarios.
  2. VLMs struggle to assess path length and complexity, which may limit their use in some tasks; new methods are needed to address this shortcoming.
  3. Although fine-tuning improves accuracy, it requires more data and computational resources. More efficient fine-tuning methods are needed to reduce this cost.
  4. VLMs can 'hallucinate', selecting a candidate path that does not exist. This issue needs further investigation to improve model robustness.
  5. Effectively integrating user feedback in practical applications remains an open question; more efficient user interaction interfaces are needed to make such systems practical.

Applications

Immediate Applications

Home Service Robots

By integrating VLMs, home service robots can better understand user instructions and execute complex household tasks such as cleaning and item transportation.

Industrial Automation

In industrial settings, robots can select optimal paths based on worker instructions to execute complex assembly and transportation tasks, improving production efficiency.

Medical Assistance Robots

In medical environments, robots can select appropriate paths based on doctor instructions to perform complex medical operations such as drug delivery and surgical assistance.

Long-term Vision

Smart Cities

In smart cities, robots can execute complex urban service tasks such as waste collection and facility maintenance based on citizen instructions, improving city management efficiency.

Space Exploration

In space exploration, robots can select optimal paths based on scientist instructions to perform complex space missions such as sample collection and equipment maintenance.

Abstract

Understanding user instructions and object spatial relations in surrounding environments is crucial for intelligent robot systems to assist humans in various tasks. The natural language and spatial reasoning capabilities of Vision-Language Models (VLMs) have the potential to enhance the generalization of robot planners on new tasks, objects, and motion specifications. While foundation models have been applied to task planning, it is still unclear the degree to which they have the capability of spatial reasoning required to enforce user preferences or constraints on motion, such as desired distances from objects, topological properties, or motion style preferences. In this paper, we evaluate the capability of four state-of-the-art VLMs at spatial reasoning over robot motion, using four different querying methods. Our results show that, with the highest-performing querying method, Qwen2.5-VL achieves 71.4% accuracy zero-shot and 75% on a smaller model after fine-tuning, and GPT-4o leads to lower performance. We evaluate two types of motion preferences (object-proximity and path-style), and we also analyze the trade-off between accuracy and computation cost in number of tokens. This work shows some promise in the potential of VLM integration with robot motion planning pipelines.

cs.RO cs.AI
