GazeVLA: Learning Human Intention for Robotic Manipulation

TL;DR

GazeVLA learns human intention to enhance robotic manipulation, significantly outperforming baseline methods.

cs.RO · Advanced · 2026-04-24
Chengyang Li Kaiyi Xiong Yuan Xu Lei Qian Yizhou Wang Wentao Zhu
robotic manipulation human intention vision-language models gaze tracking cross-domain learning

Key Findings

Methodology

The GazeVLA framework learns and transfers human intention to facilitate robotic manipulation through the Vision-Language-Intention-Action (VLIA) model. The method first pretrains the model on a large-scale egocentric human dataset to capture human intention and its synergy with action, followed by finetuning on a small set of robot and human data. During inference, the model adopts a Chain-of-Thought reasoning paradigm, predicting intention before executing action.

Key Results

  • In the AV-ALOHA benchmark, GazeVLA performs strongly in both in-distribution and out-of-distribution scenarios, achieving a 22% relative improvement over the π0.5 model in out-of-distribution settings.
  • In real-world robot experiments, GazeVLA excels at grasping and fine manipulation, reaching an 85% success rate on grasping and roughly twice the π0.5 model's success rate on fine manipulation.
  • Ablation studies show that the intention-action reasoning chain significantly improves manipulation performance, especially in long-horizon tasks and fine-grained operations.

Significance

The introduction of GazeVLA holds significant implications for both academia and industry. By introducing human intention as an intermediate representation, it addresses the embodiment gap between humans and robots. This method not only enhances the generalization capabilities of robotic manipulation but also provides new insights for future cross-domain learning research. Its excellent performance in long-horizon tasks and fine-grained operations demonstrates its potential in complex robotic tasks.

Technical Contribution

GazeVLA's technical contributions lie in its innovative use of human intention as an intermediate representation, leveraging gaze signals for intention modeling, and implementing an intention-to-action reasoning chain through a vision-language model. Compared to state-of-the-art methods, this approach offers a new intermediate abstraction and practical engineering benefits, particularly for handling complex tasks and improving generalization.

Novelty

GazeVLA is the first to explicitly model human intention as an intermediate representation, capturing it through gaze signals. Compared to existing methods based on visual or behavioral imitation, it provides deeper intention understanding and cross-domain knowledge transfer capabilities.

Limitations

  • In certain complex scenarios, gaze signals may not accurately reflect human intention, leading to prediction biases.
  • The reliance on large-scale, high-quality human data may limit the method's applicability.
  • The effectiveness of intention transfer may be affected in the absence of intention annotations in robot data.

Future Work

Future research directions include exploring more efficient intention modeling methods, reducing dependence on large-scale human data, and validating GazeVLA's effectiveness on more diverse robotic platforms. Additionally, integrating other perception signals (such as speech or gestures) to enhance intention understanding is a promising direction.

AI Executive Summary

In recent years, significant progress has been made in the field of robotic manipulation, particularly in foundational models that integrate visual and language information. However, these models heavily rely on large-scale real-robot data, which is costly and difficult to scale, becoming a bottleneck for further development. To address this issue, researchers have begun exploring the use of human data as a training source. However, the embodiment gap between humans and robots poses a major challenge in effectively extracting transferable knowledge.

The GazeVLA framework facilitates robotic manipulation by learning and transferring human intention. Specifically, it models intention through gaze, as gaze naturally precedes physical actions and serves as an observable proxy for human intent. The model is first pretrained on a large-scale egocentric human dataset to capture human intention and its synergy with action, followed by finetuning on a small set of robot and human data. During inference, the model adopts a Chain-of-Thought reasoning paradigm, predicting intention before executing action.

Extensive evaluations in both simulation and real-world settings demonstrate GazeVLA's superior performance across long-horizon and fine-grained tasks, as well as under few-shot and robustness benchmarks. Notably, in the AV-ALOHA benchmark, GazeVLA excels in both in-distribution and out-of-distribution scenarios, achieving a 22% relative improvement over the π0.5 model in out-of-distribution settings. In real-world robot experiments, GazeVLA excels at grasping and fine manipulation, reaching an 85% success rate on grasping and roughly twice the π0.5 model's success rate on fine manipulation.

The introduction of GazeVLA holds significant implications for both academia and industry. By introducing human intention as an intermediate representation, it addresses the embodiment gap between humans and robots. This method not only enhances the generalization capabilities of robotic manipulation but also provides new insights for future cross-domain learning research. Its excellent performance in long-horizon tasks and fine-grained operations demonstrates its potential in complex robotic tasks.

However, GazeVLA also has limitations. In certain complex scenarios, gaze signals may not accurately reflect human intention, leading to prediction biases. Additionally, the reliance on large-scale, high-quality human data may limit the method's applicability. The effectiveness of intention transfer may be affected in the absence of intention annotations in robot data. Future research directions include exploring more efficient intention modeling methods, reducing dependence on large-scale human data, and validating GazeVLA's effectiveness on more diverse robotic platforms.

Deep Analysis

Background

In recent years, the field of robotic manipulation has seen significant advancements, driven by improvements in computational power and data collection technologies. Many studies have focused on enhancing robotic intelligence through the integration of visual and language information. For instance, Vision-Language Models (VLMs) have demonstrated exceptional performance in combining visual and language information. However, these models typically rely on large-scale real-robot data for training, which is costly and difficult to scale, becoming a bottleneck for further development. To overcome this limitation, researchers have begun exploring the use of human data as a training source. Human data is not only easier to collect but also naturally encodes rich high-level behavioral structures, including operational intent, task decomposition, and object-centric affordances, which are valuable for learning transferable manipulation skills.

Core Problem

Despite the potential of human data as a training source, effectively extracting and transferring knowledge from it remains a major challenge. The embodiment gap between humans and robots makes direct imitation of human behavior difficult. Existing methods largely rely on visual or behavioral imitation, lacking deep understanding of human intention. Additionally, achieving cross-domain transfer of intention in the absence of robot intention annotations is an unsolved problem.

Innovation

The GazeVLA framework addresses these challenges through the following innovations:


  • Intention Modeling: For the first time, human intention is explicitly modeled as an intermediate representation and captured through gaze signals. This approach provides deeper intention understanding and cross-domain knowledge transfer capabilities.

  • Chain-of-Thought Reasoning: A Chain-of-Thought reasoning paradigm is adopted, predicting intention before executing action, enhancing the model's reasoning and generalization capabilities (a minimal inference sketch follows this list).

  • Vision-Language-Intention-Action Model (VLIA): By integrating visual and language information, intention modeling is achieved, facilitating an intention-to-action reasoning chain and improving the precision and robustness of robotic manipulation.
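To make the intention-to-action reasoning chain concrete, below is a minimal, self-contained sketch of a single Chain-of-Thought inference step, with the vision-language model and the action expert replaced by trivial stand-ins. All names here (Observation, predict_intention, generate_actions, vlia_step) and the 16-step, 14-dimensional action chunk are illustrative assumptions, not the paper's actual interfaces.

    # Hedged sketch of "predict intention first, then act", with stand-ins for
    # the VLM and the flow-matching action expert. Names and shapes are
    # illustrative assumptions, not the paper's API.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Observation:
        image: np.ndarray   # RGB frame from the robot or egocentric camera
        instruction: str    # natural-language task description

    def predict_intention(obs: Observation) -> tuple[int, int]:
        """Stand-in for the VLM: returns a (row, col) gaze bin as the intention."""
        # A real model would autoregressively decode discrete gaze tokens here.
        return (3, 5)

    def generate_actions(obs: Observation, intention: tuple[int, int],
                         horizon: int = 16, action_dim: int = 14) -> np.ndarray:
        """Stand-in for the action expert: returns a chunk of continuous actions."""
        # A real expert would integrate a learned velocity field conditioned on
        # the observation, instruction, and the predicted intention.
        rng = np.random.default_rng(0)
        return rng.standard_normal((horizon, action_dim))

    def vlia_step(obs: Observation) -> np.ndarray:
        """Chain-of-Thought step: infer the intention first, then act on it."""
        intention = predict_intention(obs)
        return generate_actions(obs, intention)

    obs = Observation(image=np.zeros((224, 224, 3), dtype=np.uint8),
                      instruction="pick up the red cube")
    actions = vlia_step(obs)
    print(actions.shape)   # (16, 14): a 16-step chunk of 14-DoF bimanual actions

The design point this illustrates is that the predicted intention is an explicit, inspectable intermediate output, so the same interface can be supervised from human gaze data and then reused on robot data.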

Methodology

The implementation of the GazeVLA framework involves the following key steps:


  • Data Collection and Processing: A large-scale egocentric human dataset is constructed, containing hand and gaze annotations. The dataset covers diverse scenes and interaction types, providing rich prior knowledge for learning human behavior and intention.

  • Model Architecture: PaliGemma is used as the VLM backbone, incorporating a SigLIP vision encoder and a Gemma-2B language model to process multimodal information. The action expert generates high-frequency continuous actions through conditional flow matching.

  • Intention-Action Reasoning Chain: An intention-action reasoning chain is introduced, explicitly decomposing decision-making into perception, intention inference, and action generation. Gaze is adopted as an explicit representation of intention and discretized into tokens via spatial binning (see the sketch after this list).

  • Loss Function: The objective includes an intention prediction loss for the VLM and an action generation loss for the action expert. The intention loss is a standard autoregressive next-token prediction objective, and the action loss follows a flow matching formulation.

  • Training Strategy: A staged training strategy is adopted, initially freezing the vision encoder and the vision-language model and optimizing only the action expert; subsequently, all model parameters are unfrozen and jointly optimized.
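As referenced above, here is a small, runnable sketch of the spatial-binning step: a normalized gaze point is discretized into a single token id and can be mapped back to the corresponding bin center. The 32x32 grid and row-major token layout are assumptions for illustration; the paper's exact binning scheme is not specified in this summary.

    import numpy as np

    GRID = 32  # bins per image axis (illustrative; the actual grid size may differ)

    def gaze_to_token(gaze_xy) -> int:
        """Map a normalized gaze point (x, y) in [0, 1]^2 to a discrete token id."""
        x, y = np.clip(np.asarray(gaze_xy, dtype=float), 0.0, 1.0 - 1e-9)
        col = int(x * GRID)        # horizontal bin index
        row = int(y * GRID)        # vertical bin index
        return row * GRID + col    # row-major flattening into one token vocabulary

    def token_to_gaze(token: int) -> np.ndarray:
        """Recover the bin-center gaze point from a token id."""
        row, col = divmod(token, GRID)
        return np.array([(col + 0.5) / GRID, (row + 0.5) / GRID])

    tok = gaze_to_token((0.62, 0.31))
    print(tok, token_to_gaze(tok))  # 307 [0.609375 0.296875]: approximate round-trip

Because each gaze location becomes an ordinary discrete token, the intention prediction loss reduces to the same next-token cross-entropy used for text, consistent with the autoregressive intention loss described above.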

Experiments

The experimental design includes extensive evaluations of GazeVLA in both simulation and real-world settings. The benchmarks used include AV-ALOHA and real-world robot experiments. In the AV-ALOHA benchmark, the robot platform consists of two 7-DoF arms for bimanual manipulation and an additional 7-DoF arm equipped with a camera for active vision. Human gaze annotations are collected via teleoperation using a VR device. The experiments compare several baseline methods, including LFA, DP, H-RDT, and the π0.5 model. Each model is trained using 100 trajectories per task and evaluated over 100 inference trials. To rigorously assess robustness, distractors and lighting variations are introduced during evaluation.

Results

Experimental results show that GazeVLA performs strongly in both in-distribution and out-of-distribution scenarios, achieving a 22% relative improvement over the π0.5 model in out-of-distribution settings. In real-world robot experiments, GazeVLA excels at grasping and fine manipulation, reaching an 85% success rate on grasping and roughly twice the π0.5 model's success rate on fine manipulation. Ablation studies show that the intention-action reasoning chain significantly improves manipulation performance, especially in long-horizon tasks and fine-grained operations. Intention prediction remains robust under background changes, and intention-guided action generation enables GazeVLA to better handle out-of-distribution scenarios.

Applications

The application scenarios of GazeVLA include long-horizon tasks and fine-grained operations in complex robotic tasks. Its excellent performance in grasping, fine manipulation, and bimanual dexterous manipulation demonstrates its potential in industrial automation, smart manufacturing, and service robotics. By introducing human intention as an intermediate representation, GazeVLA can achieve more efficient operations in diverse scenarios, reducing dependence on large-scale robot data.

Limitations & Outlook

Despite GazeVLA's outstanding performance in several aspects, it also has limitations. In certain complex scenarios, gaze signals may not accurately reflect human intention, leading to prediction biases. Additionally, the reliance on large-scale, high-quality human data may limit the method's applicability. The effectiveness of intention transfer may be affected in the absence of intention annotations in robot data. Future research directions include exploring more efficient intention modeling methods, reducing dependence on large-scale human data, and validating GazeVLA's effectiveness on more diverse robotic platforms.

Plain Language (Accessible to non-experts)

Imagine you're cooking in a kitchen. You first decide what dish to make (intention), then gather the ingredients according to the recipe (visual and language information), and finally start cooking (action). GazeVLA is like a smart assistant that can guess what dish you want to make by observing your gaze (intention), then helps you gather all the necessary ingredients and guides you through the cooking process. This assistant can help you cook not only in your familiar kitchen but also in an unfamiliar one because it understands your intention and adjusts its actions based on different environments. In this way, GazeVLA can achieve more efficient operations in diverse scenarios, reducing dependence on large-scale robot data.

ELI14 (Explained like you're 14)

Hey there! Have you ever wondered how robots know what we want them to do? Like, if you want a robot to pick up a book from the table, how does it know your intention? That's where GazeVLA comes in! It can guess your intention by observing your gaze, just like when you're playing a game and your eyes are focused on a certain spot on the screen, the game character knows you want to go there. GazeVLA is like a super-smart robot assistant that understands your intention and helps you complete tasks. Whether at home or school, it can perform excellently because it adjusts its actions based on different environments. Isn't that cool?

Glossary

GazeVLA

GazeVLA is a framework that enhances robotic manipulation by learning human intention. It models intention through gaze signals and implements an intention-to-action reasoning chain via a vision-language model.

In the paper, GazeVLA is used to address the embodiment gap between humans and robots.

Vision-Language Model

A vision-language model combines visual and language information, enabling the understanding and generation of multimodal information.

In GazeVLA, the vision-language model is used to process multimodal information and achieve intention-to-action reasoning.

Egocentric Human Dataset

An egocentric human dataset is collected from a first-person perspective, typically containing rich multimodal information such as gaze and hand movements.

GazeVLA uses an egocentric human dataset for pretraining to capture human intention and its synergy with action.

Chain-of-Thought Reasoning

Chain-of-Thought reasoning is a paradigm in which a model produces explicit intermediate reasoning steps before its final output, which can improve reasoning and generalization.

In GazeVLA, the chain consists of predicting the gaze-based intention before generating the action.

Intention Modeling

Intention modeling refers to capturing and representing human intention in a form a model can learn from and predict. In GazeVLA, intention is modeled through gaze signals.

Intention modeling is a core innovation of GazeVLA, used to address the embodiment gap between humans and robots.

Flow Matching

Flow matching is a generative modeling technique that trains a velocity field to transport samples from a simple noise distribution to the target data distribution; conditioning this field on context yields conditional flow matching.

In GazeVLA, flow matching is used by the action expert to generate high-frequency continuous actions.
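For reference, a common linear-interpolation form of the conditional flow matching objective for an action expert is shown below in LaTeX; this is the standard formulation from the flow matching literature, and the paper's exact parameterization is not given in this summary.

    \mathcal{L}_{\text{action}}
      = \mathbb{E}_{\tau \sim \mathcal{U}[0,1],\; a_0 \sim \mathcal{N}(0, I),\; (o, a_1) \sim \mathcal{D}}
        \Big[ \big\| v_\theta(a_\tau, \tau, o) - (a_1 - a_0) \big\|^2 \Big],
      \qquad a_\tau = (1 - \tau)\, a_0 + \tau\, a_1

Here a_1 is a demonstrated action chunk, a_0 is Gaussian noise, o is the conditioning context (observation, instruction, and in GazeVLA the predicted intention), and v_\theta is the learned velocity field that the action expert integrates at inference time to produce an action chunk.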

PaliGemma

PaliGemma is an open vision-language model that combines a SigLIP vision encoder with a Gemma-2B language model, and it serves as the VLM backbone here.

In GazeVLA, PaliGemma is used to process multimodal information and achieve intention-to-action reasoning.

SigLIP

SigLIP is an image encoder trained with a sigmoid-based contrastive image-text objective, producing visual features for downstream models.

In GazeVLA, SigLIP is part of PaliGemma, used to process visual information.

Gemma-2B

Gemma-2B is a 2-billion-parameter open language model from the Gemma family, used to process language information.

In GazeVLA, Gemma-2B is part of PaliGemma, used to process language information.

AV-ALOHA Benchmark

The AV-ALOHA benchmark is a bimanual manipulation benchmark with an additional active-vision camera arm; it is used to evaluate robotic manipulation performance and, in this work, integrates human gaze supervision with active visual perception.

In GazeVLA's experiments, the AV-ALOHA benchmark is used to assess the model's performance in simulation environments.

Open Questions (Unanswered questions from this research)

  1. How can intention be effectively modeled and transferred without large-scale, high-quality human data? Current methods rely on large-scale data, which may limit their application in resource-constrained environments.
  2. In complex scenarios, gaze signals may not accurately reflect human intention. How can the accuracy of intention prediction be improved, especially in multi-task or distractor-rich environments?
  3. How can GazeVLA be applied to more diverse robotic platforms? Current experiments focus on specific robotic platforms, and verifying its effectiveness on others is a promising direction.
  4. How can other perception signals (such as speech or gestures) be integrated to enhance intention understanding? While gaze signals are effective, they may not fully capture human intention in some cases.
  5. How can intention be transferred across domains in the absence of intention annotations in robot data? Current methods lack intention annotations in robot data, which may affect the effectiveness of intention transfer.

Applications

Immediate Applications

Industrial Automation

GazeVLA can be used in complex robotic tasks in industrial automation, such as fine operations on assembly lines. By learning human intention, robots can achieve more efficient operations in diverse scenarios.

Smart Manufacturing

In smart manufacturing, GazeVLA can help robots better understand and execute complex manufacturing tasks, reducing dependence on large-scale robot data and improving production efficiency.

Service Robotics

GazeVLA can be applied in the field of service robotics, such as home assistant robots, providing more intelligent and personalized services by understanding human intention.

Long-term Vision

Cross-Domain Learning

GazeVLA's intention modeling method provides new insights for future cross-domain learning research, potentially enabling effective knowledge transfer in more fields.

Human-Robot Collaboration

By better understanding human intention, GazeVLA is expected to play a significant role in future human-robot collaboration, promoting more natural and efficient cooperation.

Abstract

Embodied foundation models have achieved significant breakthroughs in robotic manipulation, yet they still depend heavily on large-scale robot demonstrations. Although recent works have explored leveraging human data to alleviate this dependency, effectively extracting transferable knowledge remains a significant challenge due to the inherent embodiment gap between human and robot. We argue that the intention underlying human actions can serve as a powerful intermediate representation for bridging this gap. In this paper, we introduce a novel framework that explicitly learns and transfers human intention to facilitate robotic manipulation. Specifically, we model intention through gaze, as it naturally precedes physical actions and serves as an observable proxy for human intent. Our model is first pretrained on a large-scale egocentric human dataset to capture human intention and its synergy with action, followed by finetuning on a small set of robot and human data. During inference, the model adopts a Chain-of-Thought reasoning paradigm, sequentially predicting intention before executing the action. Extensive evaluations in simulation and real-world settings, across long-horizon and fine-grained tasks, and under few-shot and robustness benchmarks, show that our method consistently outperforms strong baselines, generalizes better, and achieves state-of-the-art performance.

References (20)

  • Emergence of Human to Robot Transfer in Vision-Language-Action Models. Simar Kareer, Karl Pertsch, James Darpinian et al. 2025 (17 citations).
  • Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. Cheng Chi, S. Feng, Yilun Du et al. 2023 (2884 citations).
  • H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation. Hongzhe Bi, Lingxuan Wu, Tianwei Lin et al. 2025 (23 citations).
  • Active Vision Might Be All You Need: Exploring Active Vision in Bimanual Robotic Manipulation. Ian Chuang, Andrew Lee, Dechen Gao et al. 2024 (34 citations).
  • Learning Video Representations from Large Language Models. Yue Zhao, Ishan Misra, Philipp Krahenbuhl et al. 2022 (250 citations).
  • EMMA: Scaling Mobile Manipulation via Egocentric Human Data. Lawrence Y. Zhu, Pranav Kuppili, Ryan Punamiya et al. 2025 (16 citations).
  • Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers. Ian Chuang, Andrew Lee, Dechen Gao et al. 2025 (11 citations).
  • Egocentric Video-Language Pretraining. Kevin Lin, Alex Wang, Mattia Soldan et al. 2022 (271 citations).
  • EgoMe: A New Dataset and Challenge for Following Me via Egocentric View in Real World. Heqian Qiu, Zhaofeng Shi, Lanxiao Wang et al. 2025 (4 citations).
  • Ego4D: Around the World in 3,000 Hours of Egocentric Video. K. Grauman, Andrew Westbury, Eugene Byrne et al. 2021 (1687 citations).
  • Embodied Hands: Modeling and Capturing Hands and Bodies Together. Javier Romero, Dimitrios Tzionas. 2017 (1225 citations).
  • OpenVLA: An Open-Source Vision-Language-Action Model. Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti et al. 2024 (1982 citations).
  • EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data. Ruijie Zheng, Dantong Niu, Yuqi Xie et al. 2026 (5 citations).
  • HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations. Xiaomeng Xu, Jisang Park, Han Zhang et al. 2026 (4 citations).
  • AgiBot World Colosseo: A Large-Scale Manipulation Platform for Scalable and Intelligent Embodied Systems. AgiBot-World-Contributors, Qingwen Bu, Jisong Cai et al. 2025 (290 citations).
  • FLARE: Robot Learning with Implicit World Modeling. Ruijie Zheng, Jing Wang, Scott Reed et al. 2025 (44 citations).
  • LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion. Jiangran Lyu, Kai Liu, Xuheng Zhang et al. 2026 (2 citations).
  • CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning. Jiange Yang, Yansong Shi, Haoyi Zhu et al. 2025 (22 citations).
  • Visual Instruction Tuning. Haotian Liu, Chunyuan Li, Qingyang Wu et al. 2023 (9067 citations).
  • ViPRA: Video Prediction for Robot Actions. Sandeep Routray, Hengkai Pan, Unnat Jain et al. 2025 (6 citations).