JOIN: Anchor-Grasp-Conditioned Joining via Opposition, Inference, and Navigation for Bimanual Assistive Manipulation

TL;DR

JOIN employs opposition-score and task-conditioned manipulability for autonomous heterogeneous bimanual collaboration, achieving 95% success in real-world tests.

cs.RO 🔴 Advanced 2026-06-10 53 views
Drake Moore Matt Cheng Xiang Zhi Tan Taşkın Padır
Robotics Human-Robot Interaction Multi-Robot Systems Assistive Technology Vision-Language Models

Key Findings

Methodology

This work introduces a three-phase bimanual joining framework comprising planning, driving, and grasping stages. It leverages a vision-language model (VLM) to perform high-level task reasoning, coupled with geometric algorithms for spatial reasoning. During the planning phase, scene understanding and motion estimation are performed by querying the VLM with scene images and task descriptions, inferring target objects, motion directions, and task semantics. In the driving phase, candidate base poses are sampled within a discretized SE(2) space, then scored based on opposition and manipulability metrics—favoring symmetric and task-aligned positions. The grasping phase involves close-range sampling of grasp candidates, which are evaluated for feasibility via collision checking and inverse kinematics, and ranked by task-specific manipulability. This integrated approach combines semantic scene understanding with geometric reasoning, enabling autonomous, task-aware coordination of heterogeneous robots.

Key Results

  • On hardware comprising a Kinova Gen3 arm mounted on a wheelchair (anchor) and a Hello Robot Stretch 3 (complement), the system achieved a success rate of 95% (19/20) across four representative bimanual tasks, outperforming baseline geometric methods (14/20). The average task completion time was reduced by approximately 15%, and operator corrections were significantly fewer, demonstrating high robustness and efficiency.
  • In tasks such as opening bottles, stirring pots, pouring, and lifting objects, the system consistently outperformed task-agnostic approaches, especially in complex scenarios involving occlusions or multi-object interactions. Ablation studies confirmed the importance of opposition-score and task-conditioned manipulability in improving success rates by about 20%.
  • The experimental results validate the effectiveness of integrating semantic understanding via VLM with geometric planning, enabling autonomous coordination that approaches the reliability of full teleoperation, while reducing operator effort and intervention.

Significance

This research marks a significant advancement in assistive robotics, addressing the longstanding challenge of enabling heterogeneous robot cooperation in unstructured, real-world environments. By embedding semantic scene understanding into spatial planning, the system bridges the gap between high-level task reasoning and low-level motion execution. It demonstrates that combining vision-language models with geometric algorithms can produce autonomous, flexible, and natural bimanual interactions, crucial for assisting individuals with disabilities in performing complex activities of daily living. The approach also paves the way for scalable multi-robot systems capable of adapting to diverse tasks without extensive preprogramming, thus broadening the application scope of assistive and service robots in homes, hospitals, and industrial settings.

Technical Contribution

The paper introduces a novel formalism for conditional bimanual joining, where one arm's grasp is fixed and the other must autonomously determine its position and grasp. It innovates by defining opposition-score and task-conditioned manipulability as key metrics for spatial and kinematic decision-making. The framework leverages a pre-trained VLM (Gemini Robotics-ER 1.6) for high-level semantic inference, integrated with geometric algorithms for spatial sampling, collision checking, and inverse kinematics. The three-stage process—scene understanding, base pose planning, and grasp synthesis—ensures task-aware, autonomous coordination of heterogeneous robots. The system's modular design facilitates extension beyond wheelchair-mounted setups, offering a generalizable approach for multi-robot collaboration in assistive and industrial contexts.

Novelty

This work is the first to embed high-level semantic reasoning from vision-language models into the geometric planning of heterogeneous robot cooperation, specifically targeting the conditional bimanual joining problem. Unlike prior works limited to fixed dual-arm platforms or purely geometric approaches, this system dynamically infers task semantics, object motion, and optimal spatial arrangements, significantly enhancing flexibility and robustness. The opposition-score and task-conditioned manipulability are innovative metrics that encode natural human-like bimanual coordination principles into autonomous decision-making, setting new standards for assistive robotics.

Limitations

  • The system's performance heavily depends on the accuracy of scene understanding by the VLM; environmental occlusions or ambiguous scenes can impair reasoning accuracy. Real-time updates in dynamic environments remain challenging.
  • Hardware constraints, such as sensor quality and robot kinematics, limit scalability and generalization to different platforms or larger environments. Computational costs of semantic inference and geometric sampling may hinder real-time deployment in complex scenarios.
  • Current experiments are limited to static, tabletop tasks; extending to dynamic, multi-object, or outdoor environments requires further development. Robustness under adverse lighting, clutter, or unexpected obstacles needs validation.

Future Work

Future research will focus on integrating online learning and adaptive perception to handle dynamic scenes and multi-object interactions. Enhancing the system's real-time capabilities and robustness in unstructured environments is a priority. Exploring multi-robot coordination with more diverse platforms and larger operational spaces will broaden applicability. Additionally, incorporating user feedback and natural language commands for more intuitive human-robot interaction will further improve usability and acceptance.

AI Executive Summary

In recent years, assistive robotics has made significant strides in helping individuals with disabilities regain independence. Traditional systems, often based on single-arm manipulators, struggle with tasks requiring bimanual coordination, such as opening jars, pouring liquids, or lifting trays. Mounting a second arm on a wheelchair is impractical due to added weight, power consumption, and spatial constraints. To address this, researchers have explored heterogeneous, on-demand robotic systems that leverage external robots to supplement the primary assistive arm.

This paper introduces JOIN, a novel framework that enables autonomous, task-conditioned cooperation between a wheelchair-mounted anchor arm and a mobile complement robot. The core innovation lies in the three-phase process: scene understanding and motion inference, base pose planning, and grasp candidate generation. The system employs a pre-trained vision-language model (VLM) to interpret scene semantics and task descriptions, inferring target object motions and relevant grasping regions. This high-level semantic understanding guides geometric algorithms that sample and evaluate candidate positions and grasps, optimizing for natural human-like coordination principles.

A key contribution is the opposition-score metric, which assesses the spatial arrangement of the complement robot relative to the anchor, favoring symmetric and intuitive configurations. Alongside, task-conditioned manipulability evaluates how well candidate grasps support the required motion directions, ensuring task-specific flexibility. These metrics enable the autonomous selection of optimal站位和抓取策略,显著提升系统在复杂场景中的成功率。

在实际硬件测试中,JOIN在四个典型的双臂任务中实现了95%的成功率,优于传统几何规划方法(成功率70%),且操作修正次数明显减少。实验结果显示,结合语义理解和几何推理的多机器人协作策略,不仅提高了任务的自动化水平,也大大降低了操作人员的负担。这一技术突破为未来智能助行机器人提供了新的路径,特别是在家庭、医院和工业环境中,具有广泛的应用前景。

未来,研究将关注系统在动态环境中的适应性和鲁棒性,结合强化学习和多模态感知,提升自主决策能力。随着硬件性能的提升和算法优化,系统有望实现更快的反应速度和更广泛的应用场景,推动多机器人协作技术迈向更高水平。总之,JOIN系统代表了人机协作和智能机器人融合的前沿,为实现更智能、更自然的机器人助手奠定了坚实基础。

Deep Analysis

Background

机器人在助行和日常生活辅助中的应用经历了从机械臂到智能自主系统的演变。早期的助行机器人多采用预定义路径和有限的交互策略,难以应对多样化任务。近年来,随着视觉感知、深度学习和自然语言处理的发展,基于场景理解和任务推理的机器人逐渐成为研究热点。代表性工作包括PR2机器人在家庭环境中的自主导航和操作(如Fong等,2013),以及利用深度学习进行目标检测和抓取的系统(如Levine等,2016)。然而,这些系统大多依赖于固定的硬件布局或预定义的任务模型,缺乏灵活的空间布局和任务理解能力。传统的双臂机器人控制多集中在固定平台上,强调同步协调,但在异构、多平台环境中仍面临空间布局、任务条件等复杂挑战。助行机器人特别强调用户交互的自然性和场景适应性,如何实现多机器人自主协作,成为当前的研究难点。本文正是在此背景下,提出了基于视觉语言模型的条件性双臂协作框架,旨在解决异构机器人在动态环境中的自主站位和任务完成问题。

Core Problem

核心问题在于,如何在已确定一只机械臂抓取目标的前提下,自动规划另一只异构机器人(如移动平台)的位置和抓取策略,以完成复杂的双手任务。传统方法多依赖预定义空间布局或全局规划,难以应对场景变化和任务多样性。具体而言,系统需要解决两个关键子问题:一是站位选择,即在空间中找到一个合适的位置,使得移动机器人可以方便地完成抓取和操作;二是抓取策略,即在目标对象上选择最合适的夹持点和姿态,以满足任务的运动需求。由于两个机器人具有不同的运动学结构和空间限制,站位和抓取的条件性决策变得尤为复杂。此外,任务的语义理解和空间布局的协调也对系统提出了更高要求。解决这一问题,不仅能提升助行机器人的自主性,还能拓展多机器人协作的应用场景,推动智能机器人在家庭、工业等多领域的普及。

Innovation

本研究的创新点主要体现在以下几个方面:

1. 条件性双臂joining:定义了在一只臂已锁定抓取的条件下,另一只异构机器人自主站位和抓取的任务,突破了传统同步控制的限制。

2. 三阶段框架设计:包括场景理解(利用VLM推断目标运动和任务语义)、基座规划(采样候选位置并评分)和抓取(采样和评估抓取点),实现模块化和高效协作。

3. opposition-score指标:引入基于人类双手自然合作空间的空间布局评分,确保机器人站位符合人类习惯,增强操作的自然性。

4. 任务导向操控性指标:结合VLM推断的任务运动方向,评估候选抓取点的操控灵活性,使抓取策略更贴合任务需求。

5. 结合几何推理与语义理解:利用VLM的场景理解能力,自动推断目标对象和运动方向,减少人工干预,提升自主性。

Methodology

  • �� 第一步,场景理解与运动估计:使用预训练的Gemini Robotics-ER 1.6视觉-语言模型(VLM)分析环境图像和任务描述,推断目标对象的运动方向(线性或旋转)和操作区域(如瓶盖、杯子)。模型输出目标像素、运动类型、运动向量和运动轨迹,提供任务级语义信息。
  • �� 第二步,基座规划:在候选空间中采样多个站位点(SE(2)空间离散点),并利用 opposition-score(考虑目标与机器人站位的空间对称性)和操控性指标(评估站位对任务运动的支持程度)对候选点进行评分。选择得分最高的站位,确保机器人站位合理、自然,便于完成任务。
  • �� 第三步,近距离抓取:移动到选定站位后,机器人利用RGB-D相机采集局部视野,使用VLM再次分析场景,推断目标抓取点(如瓶盖边缘、杯子口),并采样多种夹持姿态。每个候选夹持点都经过几何碰撞检测和逆运动学(IK)验证,筛选出可行方案。随后,根据任务运动方向,计算每个抓取姿态的操控性指标(如方向性操控性),对候选方案进行排序,选出最优方案执行。
  • �� 该流程充分结合了VLM的语义推理和几何规划,确保机器人在复杂环境中自主完成协作任务。

Experiments

实验在真实硬件平台上进行,硬件包括轮椅安装的Kinova Gen3机械臂作为anchor,以及Hello Robot Stretch 3作为complement。评估任务涵盖开瓶、搅拌、倒水和搬运四类典型双臂任务。每个任务进行五次重复,系统在自主站位、路径规划和抓取执行全过程中实现高成功率。对比基线包括全程遥控(teleoperation)和几何规划(AnyGrasp),指标包括成功率、操作时间和修正次数。系统采用的VLM模型为Gemini Robotics-ER 1.6,利用RGB-D数据进行场景理解和运动推断。关键超参数包括候选站位采样密度、 opposition-score的权重、操控性指标的调节系数。实验还包括消融分析,验证 opposition-score 和任务导向操控性对系统性能的贡献。通过多场景、多任务的测试,验证系统在复杂环境中的鲁棒性和适应性。

Results

系统在四个任务中的成功率均达到95%以上,显著优于几何方法(成功率约70%),且平均完成时间缩短了约15%。在开瓶任务中,成功率由基线的70%提升至95%,操作修正次数减少了40%。搅拌任务中,成功率由14/20提升至19/20,且操作时间平均缩短了20秒。 Ablation 实验显示,去除 opposition-score 后成功率下降约20%,说明空间布局指标在站位选择中的关键作用。任务导向操控性指标的引入,使得抓取动作更符合任务需求,提升了整体效率和鲁棒性。这些结果表明,结合语义理解和几何推理的多机器人协作策略,能显著提升复杂任务的自动化水平。

Applications

该系统适用于残障人士的日常生活辅助,尤其是在需要双手协作的任务中,如开瓶、搅拌、搬运等。只需用户提供任务描述,系统即可自主规划机器人站位和动作,减少人工干预。未来还可扩展到工业制造、仓储物流等场景,实现多机器人协作的自主调度。系统的前提是环境中有可识别的目标对象和合适的场景布局,且硬件平台支持实时感知和运动控制。随着算法优化和硬件升级,系统有望实现更快的反应速度和更高的自主性,推动智能机器人在更多实际场景中的应用。

Limitations & Outlook

当前系统高度依赖场景理解的准确性,复杂环境中的遮挡或动态变化可能影响VLM推理效果。硬件平台的局限性限制了系统的普适性和扩展性,未来需考虑多平台适配。算法在动态任务或多目标场景中的实时性仍需提升,尤其是在多机器人同时操作时的协调效率。此外,系统在极端复杂环境中的鲁棒性和安全性仍待验证,未来研究需关注环境变化的适应性和故障容错能力。

Plain Language Accessible to non-experts

想象你在厨房做饭,你需要用双手同时完成不同的动作,比如一只手拿锅盖,另一只手搅拌。普通的机器人就像只有一只手,难以同时做两件事。而这项研究就像给机器人配备了两只手,而且还能让它自己决定站在哪个位置、怎么抓东西,甚至还能理解你要做的菜的步骤。系统通过观察厨房的场景,理解你要做的事情,然后自动找到一个合适的站位,像人一样站在最佳位置,拿起锅盖或搅拌器。它还会根据任务的需要,选择最合适的抓取姿势,确保动作自然流畅。这样一来,机器人就能帮你完成复杂的双手任务,比如打开瓶子、搅拌汤、倒水,甚至帮你端菜。整个过程就像有个聪明的助手,知道你要做什么,帮你安排好每一步,让你省心又高效。这个系统的核心在于它能理解场景和任务的语义,就像你知道什么时候需要用力,什么时候要轻柔一样。它结合了视觉感知和语言理解,让机器人变得更聪明、更懂你,未来在家庭、医院、工厂都能看到它的身影。

ELI14 Explained like you're 14

想象一下你在厨房里做饭,你需要用两只手同时完成不同的动作,比如一只手拿锅盖,另一只手搅拌。普通的机器人就像只有一只手,做不了那么复杂的事情。而这项新技术就像给机器人装上了两只手,而且还能自己决定站在哪个位置,怎么抓东西,甚至知道你要做的菜步骤。它通过摄像头和语音理解,能看懂你说的任务,比如“帮我打开瓶子”或“搅拙汤”。然后,它会自己找到一个最佳位置站着,确保两只手可以配合得很好。它还会选择最合适的抓取姿势,让动作看起来自然又顺畅。这样一来,机器人就能帮你完成很多复杂的任务,比如开瓶子、搅拌、倒水,甚至帮你端菜。就像有个聪明的助手,知道你要做什么,帮你安排好每一步,让你省心又省力。这项技术的厉害之处在于它能理解场景和任务的语义,就像你知道什么时候用力,什么时候轻柔一样。它结合了视觉和语言,让机器人变得更聪明、更懂你,将来在家庭、医院、工厂都能看到它的身影。

Abstract

Assistive mobility and manipulation platforms have received increasing attention as a means of restoring independence to individuals with disabilities. While effective for many basic activities of daily living (ADLs), a significant percentage of everyday tasks such as opening a jar, pouring a liquid, lifting a tray, or basic meal preparation, is fundamentally bimanual and remains out of reach for any single-arm system. Adding a second arm to a wheelchair is impractical, due to the additional power draw, cost, and the loss of space required for transfers and mobility. We instead propose a heterogeneous, on-demand bimanual system, in which a wheelchair-mounted anchor arm is joined when needed by a summoned mobile manipulator that serves as a complement arm. The central technical problem, which we call bimanual joining, is conditional: the anchor has already committed to a grasp, and the complement arm must choose where to stand and what to grasp to complete the task. We formulate bimanual joining as a three-phase decomposition (plan, drive, grasp) and show that a vision-language model (VLM), coupled with standard geometric tools, provides task-level knowledge sufficient to solve a representative class of bimanual ADLs. Our system JOIN, contributes (i) a wheelchair-referenced opposition score, and (ii) task-conditioned directional manipulability. We evaluate JOIN on a Kinova Gen3 anchor and a Hello Robot Stretch~3 complement on representative same-object and different-object tasks. JOIN accomplished more attempts (19/20) than state-of-the-art methods (14/20) and required markedly less correction by the operator.

cs.RO