GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations

TL;DR

GesVLA adds gesture as an instruction modality to Vision-Language-Action models, reaching 94.3% target grounding accuracy across 88 real-world test scenes.

cs.RO Advanced 2026-05-22
Wenxuan Guo, Ziyuan Li, Meng Zhang, Yichen Liu, Yimeng Dong, Chuxi Xu, Yunfei Wei, Ze Chen, Erjin Zhou, Jianjiang Feng
Vision-Language-Action Gesture Recognition Robotics Manipulation Multimodal Fusion Synthetic Data Generation

Key Findings

Methodology

This paper introduces GesVLA, a gesture-aware Vision-Language-Action (VLA) model featuring a dual-VLM architecture that tightly fuses gesture and language modalities. Gesture features, extracted from hand keypoints via MediaPipe, are encoded directly into continuous latent tokens using an MLP-based embedding, enabling their participation in both high-level intent reasoning and low-level action generation. The model decouples intent reasoning (VLMint) from online perception and action generation (VLMper and a flow-based action expert), with cross-attention facilitating latent interaction. To address data scarcity, a scalable semi-synthetic gesture dataset is constructed by rendering articulated hand models onto real-world scene images, providing diverse motion patterns and precise 3D pointing annotations. Training proceeds in two stages: pretraining VLMint on semi-synthetic data for gesture-conditioned intent reasoning, followed by training VLMper and the action expert on real robot demonstrations for action prediction.
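As a rough illustration of the gesture embedding described above, the sketch below flattens four keypoints with (x, y, depth) coordinates into a 12-dimensional vector and passes it through a small MLP. The four-keypoint layout (wrist plus three index-finger joints) follows the Methodology list in the Deep Analysis section of this summary; the hidden width, token width, tanh activation, random weights, and example coordinates are assumptions made for illustration, not values from the paper.

```python
import math
import random

KEYPOINTS_PER_FRAME = 4   # wrist + three index-finger joints (from the summary)
COORDS_PER_KEYPOINT = 3   # (x, y, depth)
INPUT_DIM = KEYPOINTS_PER_FRAME * COORDS_PER_KEYPOINT  # 12
HIDDEN_DIM = 32           # assumption: not given in the source
TOKEN_DIM = 16            # assumption: real VLM token widths are far larger


def init_linear(rng, n_in, n_out):
    """Return (weights, biases) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(x, layer):
    """Apply a dense layer to a single input vector."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def flatten_keypoints(keypoints):
    """Concatenate four (x, y, depth) triples into one 12-dimensional vector."""
    if len(keypoints) != KEYPOINTS_PER_FRAME:
        raise ValueError("expected 4 keypoints per keyframe")
    return [c for kp in keypoints for c in kp]


def embed_keyframe(keypoints, layer1, layer2):
    """Two-layer MLP: 12 -> HIDDEN_DIM -> TOKEN_DIM, one latent token per keyframe."""
    hidden = [math.tanh(h) for h in linear(flatten_keypoints(keypoints), layer1)]
    return linear(hidden, layer2)


rng = random.Random(0)
layer1 = init_linear(rng, INPUT_DIM, HIDDEN_DIM)
layer2 = init_linear(rng, HIDDEN_DIM, TOKEN_DIM)

# Hypothetical normalized keypoints for one keyframe: wrist, then index-finger joints.
keyframe = [(0.42, 0.61, 0.80), (0.47, 0.55, 0.78), (0.50, 0.51, 0.76), (0.53, 0.47, 0.75)]
token = embed_keyframe(keyframe, layer1, layer2)
print(len(flatten_keypoints(keyframe)), len(token))
```

Because the output is a continuous vector rather than text, a token like this can sit next to language and vision tokens in the model's input sequence, which is the property the summary emphasizes.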

Key Results

  • In real-world intent reasoning tests over 88 scenes, GesVLA's VLMint achieves 94.3% accuracy, outperforming a geometric pipeline baseline (59.1%) by 35.2 percentage points and a prompted multimodal large language model (38.6%) by 55.7 percentage points.
  • In robotic manipulation tasks including block pick-and-place, jelly cup selection, and fruit/vegetable sorting, GesVLA attains an average success rate of 83.3%, significantly surpassing text-only VLA (31.7%) and geometric pipeline augmented VLA (41.7%).
  • Ablation studies reveal that removing the gesture MLP embedding reduces accuracy to 84.1%, omitting data augmentation drops it to 89.8%, and disabling coordinate jitter causes a severe decline to 42.0%, highlighting the critical role of gesture encoding and data diversity.
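The ablation above attributes the largest accuracy drop to disabling coordinate jitter. This summary does not say how the jitter is applied, so the following is only one plausible reading: perturb each keypoint coordinate with small zero-mean Gaussian noise during training so the model does not overfit to the exact coordinates of rendered hands. The function name, the noise scale, and the clamping to [0, 1] are all assumptions.

```python
import random


def jitter_keypoints(keypoints, sigma=0.01, rng=None):
    """Add zero-mean Gaussian noise to every (x, y, depth) coordinate.

    Assumes coordinates are normalized to [0, 1]; results are clamped to that range.
    `sigma` is a hypothetical value, not taken from the paper.
    """
    rng = rng or random.Random()
    jittered = []
    for kp in keypoints:
        noisy = tuple(min(1.0, max(0.0, c + rng.gauss(0.0, sigma))) for c in kp)
        jittered.append(noisy)
    return jittered


# Hypothetical keypoints for one keyframe: wrist, then three index-finger joints.
clean = [(0.42, 0.61, 0.80), (0.47, 0.55, 0.78), (0.50, 0.51, 0.76), (0.53, 0.47, 0.75)]
augmented = jitter_keypoints(clean, sigma=0.01, rng=random.Random(0))
```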

Significance

GesVLA addresses a fundamental limitation of existing VLA models: they rely solely on textual instructions, which struggle with spatial ambiguity in cluttered scenes. By introducing gesture as a first-class instruction modality and tightly integrating it with language and vision, GesVLA significantly enhances robots' ability to ground targets and execute tasks accurately in complex environments. This makes human-robot interaction more natural and efficient, and it opens new avenues for multimodal instruction following in robotics by bridging a gap between how humans communicate and what robots perceive.

Technical Contribution

Key technical contributions include: 1) a novel latent embedding scheme for gestures enabling seamless multimodal fusion with language and vision; 2) a dual-VLM architecture that decouples intent reasoning and action generation while maintaining tight latent interaction via cross-attention; 3) a scalable semi-synthetic gesture data generation pipeline that synthesizes diverse hand trajectories with precise 3D pointing annotations on real scene backgrounds, mitigating sim-to-real gaps; 4) a two-stage training strategy that effectively transfers gesture-conditioned spatial reasoning learned from synthetic data to real robot action generation, achieving robust sim-to-real performance.

Novelty

GesVLA is presented as the first VLA model to treat gesture as a primary instruction modality alongside language, encoding gesture features directly into continuous latent tokens rather than converting gestures to discrete text or using them as auxiliary inputs. Its dual-VLM design and semi-synthetic data generation pipeline address the challenge of tightly coupling gesture perception with action policy learning, enabling robust spatial disambiguation and improved task execution in cluttered real-world scenes.

Limitations

  • The model's performance degrades under severe occlusion or when hand keypoint detection fails, because it relies on MediaPipe, which has limited robustness in challenging visual conditions.
  • While the semi-synthetic data reduces the sim-to-real gap, it cannot fully capture the diversity of real-world hand appearances and motion variations, limiting generalization.
  • Current experiments focus on pointing gestures, lacking coverage of richer gesture vocabularies, thus constraining interaction complexity and expressiveness.

Future Work

Future directions include developing more robust hand keypoint detection methods leveraging multi-view inputs to improve gesture recognition under occlusion. Expanding the data generation pipeline to encompass a broader range of gesture types and more complex motion patterns will enhance interaction diversity. Additionally, applying GesVLA to dynamic environments and multi-robot collaborative tasks will further validate and extend its applicability, advancing toward practical deployment in real-world human-robot interaction scenarios.

AI Executive Summary

Robotic manipulation has recently benefited from Vision-Language-Action (VLA) models that unify perception, language understanding, and action generation, enabling robots to interpret natural language instructions and perform complex tasks. However, existing VLA systems primarily depend on textual commands, which often fail to resolve spatial ambiguity in cluttered environments with multiple similar objects. This limitation hinders precise target grounding and task execution.

To overcome this, the authors propose GesVLA, a gesture-aware VLA model that incorporates hand gestures as a parallel instruction modality alongside language and vision. The model employs a dual-VLM architecture: one VLM (VLMint) performs gesture-conditioned intent reasoning, while another (VLMper) handles online perception and action generation. Gesture features are extracted from hand keypoints and embedded directly into continuous latent tokens, enabling seamless multimodal fusion without converting gestures into discrete text.

A key innovation is the scalable semi-synthetic gesture data generation pipeline, which renders articulated hand models onto real-world scene images, producing diverse gesture trajectories with precise 3D pointing annotations. This approach mitigates the sim-to-real gap and provides rich supervision for training. The training is conducted in two stages: pretraining VLMint on the semi-synthetic dataset for spatial reasoning, followed by training VLMper and a flow-based action expert on real robot demonstrations for action prediction.
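Stage 1 is described in the Methodology list of this summary as using a teacher-forced autoregressive cross-entropy loss. A minimal sketch of that loss follows: at each position the model is scored on the ground-truth next token while being conditioned on the ground-truth prefix rather than on its own samples. The `toy_logits` function stands in for VLMint and is purely hypothetical, as are the vocabulary size and token ids.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def teacher_forced_loss(model, target_tokens, bos_token=0):
    """Mean next-token cross-entropy with ground-truth prefixes as context."""
    total = 0.0
    prefix = [bos_token]
    for target in target_tokens:
        log_probs = log_softmax(model(prefix))
        total -= log_probs[target]
        prefix.append(target)  # teacher forcing: feed the true token, not a sample
    return total / len(target_tokens)


VOCAB_SIZE = 5  # hypothetical


def toy_logits(prefix):
    """Hypothetical stand-in for VLMint: favors (last token + 1) mod VOCAB_SIZE."""
    favored = (prefix[-1] + 1) % VOCAB_SIZE
    return [2.0 if i == favored else 0.0 for i in range(VOCAB_SIZE)]


def uniform_logits(prefix):
    """Baseline that assigns equal probability to every token."""
    return [0.0] * VOCAB_SIZE


matching = teacher_forced_loss(toy_logits, [1, 2, 3, 4])
uniform = teacher_forced_loss(uniform_logits, [1, 2, 3, 4])
```

The toy model that favors the correct next token gets a lower loss than the uniform baseline, whose loss equals the log of the vocabulary size.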

Extensive experiments demonstrate that GesVLA significantly outperforms baselines in both intent reasoning and robotic manipulation tasks. In 88 real-world scenes, VLMint achieves 94.3% accuracy in target grounding, surpassing the geometric pipeline and prompted multimodal LLM baselines by large margins. In manipulation tasks involving block pick-and-place and object selection, GesVLA attains an 83.3% success rate, compared to 31.7% for text-only VLA. Ablation studies confirm the importance of gesture embedding, data augmentation, and the two-stage training strategy.

GesVLA advances the state of the art by enabling robots to interpret multimodal human instructions more naturally and accurately, particularly in complex and cluttered environments. This work paves the way for more intuitive human-robot collaboration and has broad implications for service robotics, industrial automation, and assistive technologies.

Despite these achievements, challenges remain, including robustness to occlusions in hand detection, coverage of diverse gesture vocabularies, and computational efficiency. Future work aims to address these limitations and extend GesVLA to dynamic and multi-agent settings, further bridging the gap between human communication and robotic action.

Deep Analysis

Background

In recent years, robotic manipulation has increasingly leveraged Vision-Language-Action (VLA) models that integrate visual perception, natural language understanding, and action generation into a unified framework. Notable works such as PaLM-E and SayCan have demonstrated the potential of large-scale vision-language pretrained models to enable robots to follow open-world instructions and perform complex tasks. These models typically adopt hierarchical architectures combining high-level planning with low-level policy execution. Despite these advances, existing VLA systems predominantly rely on textual instructions as the sole modality for conveying human intent. This reliance presents challenges in real-world scenarios where language alone is insufficient to disambiguate spatial references, especially in cluttered scenes with multiple similar objects. Humans naturally use deictic gestures like pointing to resolve such ambiguities, but current VLA models rarely incorporate gestures as a primary input modality. Prior works often treat gestures as auxiliary signals or convert them into text, resulting in information loss and limited spatial grounding precision. Moreover, large-scale datasets aligning gesture, language, and action with precise spatial annotations are scarce due to the high cost and difficulty of manual collection and labeling. These limitations motivate the development of models that tightly integrate gesture with language and vision, supported by scalable data generation pipelines.

Core Problem

The core problem addressed is the spatial ambiguity in robotic instruction following when relying solely on language. Ambiguous commands such as “pick up this one” or “place it there” fail to uniquely specify targets in environments with multiple similar objects. Existing VLA models lack mechanisms to incorporate spatially precise gesture cues, limiting their ability to ground instructions accurately. Challenges include: 1) How to represent and fuse gesture information with language and vision in a unified model without losing spatial detail; 2) How to obtain sufficiently large and diverse training data with accurate gesture annotations to enable robust learning; 3) How to design model architectures that effectively leverage gesture cues for both intent reasoning and action generation. Addressing these challenges is crucial for enabling robots to understand and execute instructions reliably in complex, cluttered real-world settings.

Innovation

The paper introduces several key innovations: 1) Gesture as a first-class instruction modality: Unlike prior works that treat gestures as auxiliary or convert them to text, GesVLA encodes gesture features directly into continuous latent tokens, preserving spatial precision and enabling deep multimodal fusion. 2) Dual-VLM architecture: The model separates intent reasoning (VLMint) from online perception and action generation (VLMper and action expert), connected via cross-attention that allows latent interaction without intermediate discretization, improving efficiency and robustness. 3) Semi-synthetic gesture data engine: By rendering articulated hand models onto real scene images with precise 3D pointing annotations, the pipeline generates diverse, scalable gesture-language-action data, mitigating the sim-to-real gap and enabling effective pretraining. 4) Two-stage training strategy: Pretraining VLMint on semi-synthetic data for gesture-conditioned spatial reasoning, followed by freezing VLMint and training VLMper and the action expert on real robot demonstrations, ensures robust sim-to-real transfer and improved task performance.
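The latent interaction in item 2 can be pictured as standard scaled dot-product cross-attention, with queries coming from VLMper and keys and values coming from VLMint's latent states. The sketch below is a single-head version without learned projections; those simplifications, the tiny dimensions, and the example vectors are assumptions, and the paper's actual layer design is not given in this summary.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries: latent states of the attending module (here, VLMper)
    keys/values: latent states of the attended module (here, VLMint)
    """
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical latents: two perception queries, three intent tokens.
perception_latents = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
intent_latents = [[4.0, 0.0, 0.0, 0.0], [0.0, 4.0, 0.0, 0.0], [0.0, 0.0, 4.0, 0.0]]

attended, weights = cross_attention(perception_latents, intent_latents, intent_latents)
```

Each perception query ends up as a weighted mix of intent latents, so information passes between the two models as continuous vectors with no text in between.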

Methodology

  • Input modalities: The model receives RGB-D visual observations, natural language instructions, and gesture video sequences. Gesture videos are processed to extract keyframes based on hand motion dynamics.

  • Gesture embedding: Using MediaPipe, four keypoints per keyframe (wrist and three index finger joints) are extracted as (x, y, depth) coordinates, concatenated into 12-dimensional vectors, and projected into latent space via a multi-layer perceptron (MLP).

  • Dual-VLM architecture:
      • VLMint performs multimodal intent reasoning by jointly modeling gesture embeddings and language instructions, outputting textual and visual reasoning representations.
      • VLMper takes scene observations and attends to VLMint’s latent states through cross-attention, producing latent representations for action prediction.

  • Action expert: A flow-based policy iteratively denoises an initial noise vector conditioned on VLMper’s latent representation and current robot state to generate smooth, continuous action trajectories.

  • Data generation pipeline:
      • GroundingDINO detects candidate objects in real RGB-D scenes.
      • Targets are randomly sampled; their 3D coordinates are computed using depth maps and camera intrinsics.
      • Hand pointing motions are synthesized by interpolating hand poses from random initial positions toward targets, incorporating parabolic lifting for naturalness.
      • Hand meshes are rendered onto real scene images to produce semi-synthetic gesture videos paired with language instructions and precise spatial annotations.

  • Training:
      • Stage 1: VLMint is pretrained on the semi-synthetic dataset using teacher-forced autoregressive cross-entropy loss to jointly learn semantic reasoning and spatial grounding.
      • Stage 2: VLMint is frozen; VLMper and the action expert are trained on real robot demonstration data using a flow matching objective to optimize action generation.
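Two geometric steps in the data generation pipeline above lend themselves to a short sketch: lifting a target pixel to 3D with the depth map and camera intrinsics, and interpolating a hand position toward a goal with a parabolic lift. The back-projection uses the standard pinhole camera model. The trajectory function is only one way to read "parabolic lifting": linear interpolation plus an offset that is zero at both ends and peaks midway. The lift height, the choice of -Y as "up", the use of the target itself as the end position, and all numeric values are assumptions.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with metric depth to camera-frame XYZ."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def pointing_trajectory(start, goal, steps, lift=0.05, up=(0.0, -1.0, 0.0)):
    """Interpolate from start to goal, adding a parabolic offset along `up`.

    The offset is 4 * lift * t * (1 - t): zero at both ends, `lift` at t = 0.5.
    `up` defaults to -Y because image-aligned camera frames usually have Y pointing down.
    """
    if steps < 2:
        raise ValueError("need at least two steps")
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        bump = 4.0 * lift * t * (1.0 - t)
        point = tuple(s + t * (g - s) + bump * a for s, g, a in zip(start, goal, up))
        path.append(point)
    return path


# Hypothetical intrinsics and detection center; not values from the paper.
fx = fy = 600.0
cx, cy = 320.0, 240.0
target_xyz = backproject(u=380.0, v=300.0, depth=0.9, fx=fx, fy=fy, cx=cx, cy=cy)

# Hypothetical random initial hand position in the camera frame (meters).
hand_start = (-0.20, 0.10, 0.50)
path = pointing_trajectory(hand_start, target_xyz, steps=11, lift=0.05)
```

In a full pipeline each point on such a path would drive the pose of the rendered hand mesh, and `target_xyz` would serve as the 3D pointing annotation.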

Experiments

The experimental evaluation consists of two main parts: intent reasoning and robotic manipulation. Intent reasoning is tested on 88 real-world scenes with varying object counts and arrangements. The fixed instruction “Pick this up and put it there” is used, and the model must correctly identify both the pointed block and plate. Three cameras capture the scene: two global views and one mounted on the robot gripper. Robotic manipulation tasks include: 1) Pick-and-Place Block: grasping a specified block from clutter and placing it on a plate; 2) Select Jelly: sequentially picking jelly cups from plates in the indicated order; 3) Select Fruit/Vegetable: sequentially picking bananas from bins. Each task is repeated 20 times under simple (few objects) and hard (multiple objects) conditions. Baselines include text-only VLA, prompted multimodal large language models combined with VLA, geometric pipeline plus VLA, and a decoupled VLM variant of GesVLA. Ablation studies analyze the impact of gesture encoding, data augmentation, training strategies, and prompt design. Training uses the AdamW optimizer with cosine learning rate scheduling on RTX 4090 GPUs.

Results

GesVLA’s VLMint achieves 94.3% accuracy on the intent reasoning test set, outperforming the geometric pipeline baseline (59.1%) and prompted multimodal LLM baseline (38.6%) by large margins. In robotic manipulation, GesVLA attains an average success rate of 83.3%, significantly higher than text-only VLA (31.7%) and geometric pipeline augmented VLA (41.7%). Ablations show that removing the gesture MLP embedding reduces intent reasoning accuracy to 84.1%, omitting data augmentation drops it to 89.8%, and disabling coordinate jitter causes a severe accuracy decline to 42.0%. Two-stage training with frozen VLMint outperforms joint training, indicating the importance of staged optimization. Visual prompts are critical for action generation, while textual prompts add no significant benefit. These results demonstrate the effectiveness of tightly integrating gesture and language in latent space for robust spatial grounding and action execution.
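The margins of 35.2 and 55.7 quoted in the Key Results are absolute differences in percentage points rather than relative improvements. The short check below recomputes them from the accuracies in this paragraph and does the same for the manipulation success rates; the dictionary keys and the helper name are just labels chosen here.

```python
def margin_in_points(ours, baseline):
    """Absolute difference between two percentages, in percentage points."""
    return round(ours - baseline, 1)


# Figures as reported in this summary.
intent_accuracy = {
    "GesVLA (VLMint)": 94.3,
    "geometric pipeline": 59.1,
    "prompted multimodal LLM": 38.6,
}
manipulation_success = {
    "GesVLA": 83.3,
    "text-only VLA": 31.7,
    "geometric pipeline + VLA": 41.7,
}

intent_margins = {
    name: margin_in_points(intent_accuracy["GesVLA (VLMint)"], acc)
    for name, acc in intent_accuracy.items()
    if name != "GesVLA (VLMint)"
}
manipulation_margins = {
    name: margin_in_points(manipulation_success["GesVLA"], rate)
    for name, rate in manipulation_success.items()
    if name != "GesVLA"
}

print(intent_margins)
print(manipulation_margins)
```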

Applications

GesVLA is applicable in multimodal human-robot interaction scenarios requiring precise spatial grounding, especially in cluttered or complex environments. Use cases include warehouse picking where workers can use gestures and language to direct robots, service robots assisting in homes or healthcare settings with natural hand-gesture commands, and industrial assembly lines where operators guide robots via combined gesture-language instructions to improve accuracy and flexibility. The system relies on multi-camera setups and depth sensing, suitable for structured or semi-structured environments. Future extensions may enable multi-robot collaboration and dynamic task execution in unstructured settings.

Limitations & Outlook

The model depends on MediaPipe for hand keypoint detection, which suffers from reduced robustness under occlusion or challenging lighting, leading to degraded gesture recognition. Although the semi-synthetic data generation pipeline reduces the sim-to-real gap, it cannot fully capture the diversity of real-world hand appearances and motions, limiting generalization. Current experiments focus primarily on pointing gestures, lacking coverage of richer gesture vocabularies needed for complex interactions. Additionally, the flow-based action generation incurs computational overhead, impacting real-time performance. Addressing these limitations requires improved hand detection algorithms, expanded gesture datasets, and optimized model architectures.
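The overhead mentioned above follows from how flow-based policies are typically sampled: an action is produced by integrating a learned velocity field over several steps, and each step costs one forward pass of the network. The sketch below uses fixed-step Euler integration with a hand-written velocity field standing in for the action expert; the field, the step count, and the 2-D action are assumptions for illustration, not the paper's configuration.

```python
import random


class CountingVelocityField:
    """Hypothetical stand-in for the action expert's learned velocity network.

    Returns the straight-line velocity toward `target_action`, which a trained
    network would instead predict from VLMper latents and the robot state.
    """

    def __init__(self, target_action):
        self.target_action = target_action
        self.calls = 0

    def __call__(self, x, t):
        self.calls += 1  # one call corresponds to one network forward pass
        return [(a - xi) / (1.0 - t) for a, xi in zip(self.target_action, x)]


def sample_action(velocity_field, noise, num_steps):
    """Fixed-step Euler integration from t = 0 (noise) to t = 1 (action)."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # always below 1, so the field never divides by zero
        v = velocity_field(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(2)]
field = CountingVelocityField(target_action=[0.3, -0.1])
action = sample_action(field, noise, num_steps=10)
print(field.calls)
```

The call counter makes the cost visible: the number of forward passes per generated action equals `num_steps`, so reducing the step count is one of the more direct levers on latency.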

Plain Language (accessible to non-experts)

Imagine you’re in a kitchen and you want your robot helper to grab a red apple. You say “grab that apple,” but there are several red apples on the table, so the robot isn’t sure which one you mean. If you point at the apple, the robot instantly knows which one to pick. GesVLA teaches robots to understand both your words and your hand gestures together, so they can find exactly what you want.

The robot’s cameras watch your hand movements and turn the positions of your wrist and finger joints into numbers it can understand. To teach the robot, researchers created lots of videos by digitally placing a hand model onto photos of real scenes, showing different pointing motions and labeling exactly what the hand points at.

First, the robot learns how to interpret your gestures combined with your words. Then, it learns how to move its arm to pick up the right object. Tests show that the robot can correctly identify and grab the right item, especially when the table is messy with many similar objects. This is much better than robots that only listen to words.

It’s like talking to a friend: you don’t just say “that one,” you also point. GesVLA helps robots understand both, making human-robot teamwork smoother and more natural.

ELI14 (explained like you're 14)

Hey! Imagine you’re playing a game and want your robot buddy to grab something for you. You say “grab that,” but there are lots of things that look the same, so your robot gets confused! But if you point at it with your finger, your robot instantly knows which one you mean.

This paper is about teaching robots to watch your hand gestures and listen to your words at the same time. The researchers made tons of training videos by putting a computer-generated hand onto real pictures, showing how hands point to things.

The robot first learns to understand what you mean by combining your words and pointing. Then it learns how to move and grab the right thing. The results show the robot is way better at finding the right stuff, especially when the table is messy, compared to robots that only listen to words.

So next time you want your robot to help, don’t forget to point! It’ll make your robot super smart and helpful!

Glossary

Vision-Language-Action Model

A robotic model that unifies visual perception, language understanding, and action generation to interpret instructions and perform tasks.

GesVLA is a VLA model enhanced with gesture modality.

Gesture Embedding

The process of encoding hand gesture keypoint data into continuous latent vectors for multimodal fusion.

GesVLA uses an MLP to embed MediaPipe-extracted hand keypoints.

Dual-VLM Architecture

An architecture with two Vision-Language Models, one for intent reasoning and another for perception and action, connected via cross-attention.

Core design enabling tight gesture-language interaction in GesVLA.

Flow-based Action Generation

An iterative denoising method using flow matching to generate smooth continuous robot action trajectories.

Used by GesVLA’s action expert module.

Semi-synthetic Gesture Dataset

A dataset created by rendering synthetic hand gestures onto real scene images with precise 3D pointing annotations.

Used to pretrain GesVLA’s intent reasoning module.

MediaPipe

Google’s open-source framework for on-device perception pipelines; its Hands solution detects hand keypoints in real time.

Used to extract hand keypoints for gesture embedding.

GroundingDINO

A vision-language object detection model used to identify candidate objects in scenes.

Used in GesVLA’s data generation pipeline.

Cross-attention

An attention mechanism allowing one model component to attend to the latent representations of another.

Connects VLMper to VLMint in GesVLA.

Teacher-forced Autoregressive Cross-entropy

A loss function for training sequence models by predicting the next token conditioned on previous ground truth tokens.

Used to train VLMint for intent reasoning.

Sim-to-real Gap

The performance drop when transferring models trained in simulation to real-world environments.

Mitigated by GesVLA’s semi-synthetic data pipeline.

Open Questions (unanswered questions from this research)

  1. Improving robustness of hand keypoint detection under occlusion and challenging lighting remains an open challenge, as current reliance on MediaPipe is limited.
  2. The semi-synthetic data generation pipeline does not yet capture the full diversity of real-world hand appearances and motions, restricting generalization.
  3. Expanding gesture vocabularies beyond pointing to support richer, more complex interactions is an unresolved problem.
  4. Optimizing the computational efficiency and real-time performance of flow-based action generation is necessary for practical deployment.
  5. Effective fusion of multimodal instructions in multi-robot collaborative settings remains unexplored.
  6. Maintaining stable gesture recognition and action generation in dynamic, unstructured environments is a technical challenge.
  7. Adaptive weighting between language and gesture modalities for instruction interpretation has not been thoroughly investigated.

Applications

Immediate Applications

Intelligent Warehouse Picking

Workers can use combined gesture and language commands to direct robots for accurate and efficient item picking in warehouses.

Service Robot Assistance

Home or healthcare robots can interpret natural hand gestures alongside speech to perform complex object retrieval and delivery tasks.

Industrial Assembly Line Operation

Operators can guide robots via multimodal instructions, improving precision and flexibility in assembly tasks.

Long-term Vision

Multimodal Human-Robot Collaboration Platforms

Future systems integrating gesture, language, and vision for seamless human-robot teamwork in complex environments.

Augmented Reality Assisted Robotics

Combining AR with real-time gesture and language recognition to guide robots in dynamic tasks, advancing smart homes and industrial automation.

Abstract

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robot manipulation by unifying perception and action. However, existing VLA systems primarily rely on textual instructions and struggle to resolve spatial ambiguity in complex scenes with multiple similar objects. To address this limitation, we introduce gesture as a parallel instruction modality and propose a Gesture-aware Vision-Language-Action model (GesVLA). Our approach encodes gesture features directly into the latent space, enabling them to participate in both high-level reasoning and low-level action generation, and adopts a dual-VLM architecture to achieve tight coupling between gesture representations and action policies. At the data level, we construct a scalable gesture data generation pipeline by rendering hand models onto real-world scene images. This reduces the sim-to-real visual gap while producing rich data with diverse motion patterns and corresponding pointing annotations. In addition, we employ a two-stage training strategy to equip the model with both gesture perception and action prediction capabilities. We evaluate our approach on multiple real-world robotic tasks, including a controlled block manipulation task for validation and more practical scenarios such as product and produce selection. Experimental results show that incorporating gesture consistently improves target grounding accuracy and human-robot interaction efficiency, especially in complex and cluttered environments. Project page: https://gwxuan.github.io/GesVLA/.

cs.RO cs.CV

References (20)

  • π0: A Vision-Language-Action Flow Model for General Robot Control. Kevin Black, Noah Brown, Danny Driess et al. 2024.
  • Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models. L. Shi, Brian Ichter, Michael Equi et al. 2025.
  • DM0: An Embodied-Native Vision-Language-Action Model towards Physical AI. En Yu, Haoran Lv, Jianjian Sun et al. 2026.
  • DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning. Tianyuan Yuan, Yicheng Liu, Chenhao Lu et al. 2025.
  • Qwen3 Technical Report. An Yang, Anfeng Li, Baosong Yang et al. 2025.
  • VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation. Chaofan Zhang, Peng Hao, Xiaoge Cao et al. 2025.
  • Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. Shilong Liu, Zhaoyang Zeng, Tianhe Ren et al. 2023.
  • PointVLA: Injecting the 3D World into Vision-Language-Action Models. Chengmeng Li, Junjie Wen, Yan Peng et al. 2025.
  • OpenVLA: An Open-Source Vision-Language-Action Model. Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti et al. 2024.
  • MediaPipe Hands: On-device Real-time Hand Tracking. Fan Zhang, Valentin Bazarevsky, Andrey Vakunov et al. 2020.
  • A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM. ByungOk Han, Jaehong Kim, Jinhyeok Jang. 2024.
  • GestLLM: Advanced Hand Gesture Interpretation via Large Language Models for Human-Robot Interaction. Oleg Kobzarev, Artem Lykov, Dzmitry Tsetserukou. 2025.
  • GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. Nvidia, Johan Bjorck, Fernando Castañeda et al. 2025.
  • OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning. Fanqi Lin, Ruiqian Nai, Yingdong Hu et al. 2025.
  • Gesture-Informed Robot Assistance via Foundation Models. Li-Heng Lin, Yuchen Cui, Yilun Hao et al. 2023.
  • Diver Interest via Pointing: Human-Directed Object Inspection for AUVs. Chelsey Edge, Junaed Sattar. 2022.
  • Pointing-Guided Target Estimation via Transformer-Based Attention. Lucas-Raphael Müller, Hassan Ali, Philipp Allgeuer et al. 2025.
  • Learning from Unscripted Deictic Gesture and Language for Human-Robot Interactions. Cynthia Matuszek, Liefeng Bo, Luke Zettlemoyer et al. 2014.
  • Point What You Mean: Visually Grounded Instruction Policy. Hang Yu, Juntu Zhao, Yufeng Liu et al. 2025.
  • PaliGemma: A versatile 3B VLM for transfer. L. Beyer, A. Steiner, André Susano Pinto et al. 2024.