FASTER: Rethinking Real-Time Flow VLAs

TL;DR

FASTER introduces a Horizon-Aware Schedule that significantly reduces reaction latency in VLA models.

cs.RO Advanced 2026-03-20
Yuxiang Lu Zhe Liu Xianzhe Fan Zhenya Yang Jinghua Hou Junyi Li Kaixin Ding Hengshuang Zhao
real-time execution vision-language-action models reaction time flow sampling robotics

Key Findings

Methodology

The paper introduces FASTER, which uses a Horizon-Aware Schedule to prioritize near-term actions during flow sampling. The denoising of the immediate reaction is compressed tenfold into a single step (e.g., in $π_{0.5}$ and X-VLA), while the quality of long-horizon trajectories is preserved. Combined with a streaming client-server pipeline, this substantially reduces effective reaction latency on real robots, particularly when deployed on consumer-grade GPUs.

Key Results

  • In highly dynamic tasks such as table tennis, FASTER reduces reaction time tenfold, making the robot markedly more responsive to rapidly changing environments.
  • Experiments demonstrate that FASTER compresses the denoising steps of immediate reactions from ten to one in models like $π_{0.5}$ and X-VLA, while maintaining long-horizon trajectory quality.
  • When deployed on consumer-grade GPUs, FASTER substantially reduces reaction latency, enabling rapid generation of accurate and smooth trajectories for generalist policies.

Significance

FASTER targets reaction latency in the real-time execution of vision-language-action models, an issue that existing asynchronous inference methods largely overlook. By introducing the Horizon-Aware Schedule, it makes the model more responsive in dynamic environments. This matters for robotic applications that require real-time responses, especially when deployed on consumer-grade hardware.

Technical Contribution

FASTER's technical contribution is the Horizon-Aware Schedule, which replaces the constant schedule conventionally used in flow sampling. By prioritizing near-term actions, it reduces reaction latency, and a streaming client-server pipeline makes real-time execution practical on consumer-grade GPUs. The work pairs this with a systematic analysis of reaction time in action chunking policies, expressed in terms of the Time to First Action (TTFA) and the execution horizon.

Novelty

FASTER's novelty lies in introducing a Horizon-Aware Schedule into the flow sampling process of VLA models, together with a rethinking of what reaction means in action chunking policies. Compared to existing methods, it achieves a better balance between reaction speed and trajectory quality.

Limitations

  • FASTER may still face reaction latency issues in complex environments, particularly when processing large amounts of sensor data.
  • The method may be limited by hardware performance in high computational load scenarios, affecting its practical application.
  • In certain specific tasks, FASTER's performance may not match that of specially optimized strategies.

Future Work

Future research could further optimize FASTER's performance in complex environments, particularly in scenarios involving multi-sensor fusion. Additionally, exploring FASTER's adaptability on different hardware platforms could enhance its generality across various application scenarios. Further studies could also integrate other advanced machine learning techniques to improve FASTER's overall performance and applicability.

AI Executive Summary

Real-time execution is crucial for deploying vision-language-action (VLA) models in the physical world. However, existing asynchronous inference methods primarily optimize trajectory smoothness, neglecting the critical latency in reacting to environmental changes. This paper rethinks the notion of reaction in action chunking policies and presents a systematic analysis of the factors governing reaction time. It shows that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon. Moreover, it reveals that the standard practice of applying a constant schedule in flow-based VLAs can be inefficient, forcing the system to complete all sampling steps before any movement can start, forming the bottleneck in reaction latency.
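The uniform-distribution claim can be illustrated with a small simulation. This is a sketch under a simplifying assumption that is not taken from the paper: a new observation is captured once per execution horizon, an environmental change arrives at a uniformly random moment within that cycle, and the first action reflecting it appears TTFA seconds after the next capture. The function name, the parameter names, and the 0.10 s and 0.30 s values are hypothetical.

```python
import numpy as np


def simulate_reaction_times(ttfa_s, exec_horizon_s, num_events=100_000, seed=0):
    """Sample reaction times under the assumed timing model described above."""
    rng = np.random.default_rng(seed)
    # Assumption: the event lands at a uniformly random point in the cycle, so
    # the wait until the next observation is uniform on [0, exec_horizon_s).
    wait_for_next_observation = rng.uniform(0.0, exec_horizon_s, size=num_events)
    # The first action that reflects the event arrives TTFA after that capture.
    return wait_for_next_observation + ttfa_s


# Hypothetical values: 100 ms TTFA, 300 ms execution horizon.
samples = simulate_reaction_times(ttfa_s=0.10, exec_horizon_s=0.30)
print(f"min={samples.min():.3f}s  mean={samples.mean():.3f}s  max={samples.max():.3f}s")
```

Under these assumptions the mean reaction time is the TTFA plus half the execution horizon, so lowering the TTFA shifts the whole distribution down.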

To overcome this issue, the paper proposes Fast Action Sampling for ImmediaTE Reaction (FASTER). By introducing a Horizon-Aware Schedule, FASTER adaptively prioritizes near-term actions during flow sampling, compressing the denoising of the immediate reaction into a single step while preserving the quality of long-horizon trajectories. Coupled with a streaming client-server pipeline, FASTER substantially reduces the effective reaction latency on real robots, especially when deployed on consumer-grade GPUs.

In real-world experiments, including a highly dynamic table tennis task, FASTER demonstrates unprecedented real-time responsiveness, enabling rapid generation of accurate and smooth trajectories for generalist policies. This breakthrough is significant not only in academia but also offers new solutions for industry, particularly in robotic applications requiring real-time responses.

However, FASTER may still face reaction latency issues in complex environments, particularly when processing large amounts of sensor data. Future research could further optimize FASTER's performance in complex environments, particularly in scenarios involving multi-sensor fusion. Additionally, exploring FASTER's adaptability on different hardware platforms could enhance its generality across various application scenarios. Further studies could also integrate other advanced machine learning techniques to improve FASTER's overall performance and applicability.

Deep Analysis

Background

Vision-language-action (VLA) models have gained significant attention in the fields of robotics and automation in recent years. These models integrate visual information, language instructions, and action execution to enable intelligent decision-making in complex environments. However, real-time execution remains a major challenge for VLA models in practical applications. Existing asynchronous inference methods primarily focus on optimizing trajectory smoothness but fall short in addressing reaction latency to environmental changes. Particularly in dynamic environments, rapid response capability is crucial for ensuring successful task execution. To address this challenge, researchers have begun exploring new methods to enhance the real-time performance and response speed of VLA models.

Core Problem

The core problem is the latency with which existing VLA models react to environmental changes. Asynchronous inference methods can generate smooth trajectories, but the constant schedule used in flow-based VLAs forces the system to complete all sampling steps before any movement can start, which becomes the bottleneck in reaction latency. The delay is especially costly in dynamic environments, where it can cause task failure or degraded performance. The challenge is therefore to reduce reaction time substantially while maintaining trajectory quality.

Innovation

The core innovation of FASTER is the Horizon-Aware Schedule, which lets the system prioritize near-term actions and thereby reduce reaction latency. Instead of applying one constant schedule to the whole action chunk during flow sampling, FASTER treats actions differently according to how soon they will be executed: the denoising of the immediate reaction is compressed into a single step, while long-horizon trajectory quality is maintained. FASTER also integrates a streaming client-server pipeline, which further reduces effective reaction latency on consumer-grade GPUs.
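The contrast between a constant schedule and a horizon-aware one can be sketched with a toy sampler. Everything here is an assumption made for illustration: the linear ramp of steps per action, the function names, and the oracle velocity field (which makes one-step Euler integration exact, something a learned model would only approximate) are not taken from the paper.

```python
import numpy as np


def steps_per_action(horizon, num_steps, horizon_aware):
    """Number of sampling steps allotted to each action index (hypothetical rule)."""
    if not horizon_aware:
        # Constant schedule: every action needs all num_steps steps.
        return np.full(horizon, num_steps)
    # Assumed ramp: 1 step for the nearest action, num_steps for the farthest.
    return np.rint(np.linspace(1, num_steps, horizon)).astype(int)


def sample_chunk(target, num_steps=10, horizon_aware=True, seed=0):
    """Euler-integrate a toy flow with per-action time variables.

    Returns the sampled chunk and, for each action, the global step at which
    it became fully denoised.
    """
    rng = np.random.default_rng(seed)
    horizon = len(target)
    n = steps_per_action(horizon, num_steps, horizon_aware)
    x = rng.standard_normal(horizon)  # start from Gaussian noise
    ready_at = np.zeros(horizon, dtype=int)
    for k in range(num_steps):
        t = np.minimum(1.0, k / n)  # per-action flow time before this step
        t_next = np.minimum(1.0, (k + 1) / n)  # per-action flow time after it
        # Toy oracle velocity of a straight-line flow toward `target`.
        velocity = (target - x) / np.maximum(1.0 - t, 1e-9)
        x = x + (t_next - t) * velocity  # finished actions have zero step size
        newly_done = (t_next >= 1.0) & (ready_at == 0)
        ready_at[newly_done] = k + 1
    return x, ready_at


target = np.linspace(0.0, 1.0, 8)
_, ready_constant = sample_chunk(target, horizon_aware=False)
_, ready_aware = sample_chunk(target, horizon_aware=True)
print("constant schedule, first action ready at step", ready_constant[0])
print("horizon-aware schedule, first action ready at step", ready_aware[0])
```

In this toy setup both schedules take the same number of global steps for the full chunk; what changes is how early the first action is finalized.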

Methodology

The implementation of the FASTER method involves the following key steps:


  • Introduction of Horizon-Aware Schedule: Adaptively prioritizes near-term actions during flow sampling.
  • Prioritization of near-term actions: Actions closest to execution are finalized first, while later actions in the chunk continue to be refined.
  • Compression of denoising steps: Reduces the denoising of the immediate reaction from ten steps to a single step.
  • Integration of streaming client-server pipeline: Reduces effective reaction latency on real robots, especially on consumer-grade GPUs.
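The effect of streaming on the Time to First Action can be sketched as follows. This is a hypothetical illustration, not the paper's pipeline: `stream_actions`, `time_to_first_action`, the per-action ready steps, and the 12 ms per-step latency are all assumed for the example.

```python
from typing import Iterator, List, Tuple


def stream_actions(ready_step: List[int], num_steps: int) -> Iterator[Tuple[int, int]]:
    """Yield (global_step, action_index) as soon as each action is finalized."""
    for step in range(1, num_steps + 1):
        for index, ready in enumerate(ready_step):
            if ready == step:
                yield step, index


def time_to_first_action(
    ready_step: List[int], num_steps: int, step_latency_ms: float, streaming: bool
) -> float:
    """Time until the client can start moving, in milliseconds."""
    if not streaming:
        # Batch delivery: the client waits for the whole chunk.
        return num_steps * step_latency_ms
    # Streaming delivery: the client starts on the first finalized action.
    first_step, _ = next(stream_actions(ready_step, num_steps))
    return first_step * step_latency_ms


# Hypothetical schedule: the nearest action is ready after one step.
ready_step = [1, 2, 4, 5, 6, 7, 9, 10]
print(time_to_first_action(ready_step, 10, step_latency_ms=12.0, streaming=False))
print(time_to_first_action(ready_step, 10, step_latency_ms=12.0, streaming=True))
```

The sketch ignores network and vision-language encoding latency, which would add a constant offset to both cases.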

Experiments

The experiments test FASTER on highly dynamic tasks such as table tennis. Baselines include traditional asynchronous inference methods and other advanced VLA models. The main metrics are reaction time, trajectory smoothness, and task success rate. Key hyperparameters, such as the adjustment frequency of the Horizon-Aware Schedule and the compression ratio of denoising steps, were also examined in ablation studies.
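The text does not say how trajectory smoothness is measured. One common proxy, shown here purely as an assumption and not as the paper's metric, is the mean squared jerk (third time derivative of position) estimated with finite differences. The function name and the sample trajectories are hypothetical.

```python
import numpy as np


def mean_squared_jerk(positions, dt):
    """Mean squared third finite difference of a trajectory sampled every dt seconds."""
    positions = np.asarray(positions, dtype=float)
    jerk = np.diff(positions, n=3, axis=0) / dt**3
    return float(np.mean(jerk**2))


dt = 0.02  # hypothetical 50 Hz control period
t = np.arange(0.0, 1.0, dt)
smooth = t**2  # constant acceleration, so the jerk is zero
rng = np.random.default_rng(0)
jittery = smooth + 0.001 * rng.standard_normal(t.shape)
print(mean_squared_jerk(smooth, dt), mean_squared_jerk(jittery, dt))
```

Lower values indicate a smoother trajectory; the absolute scale depends on units and on the control period.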

Results

Experimental results show that FASTER reduces reaction time tenfold in dynamic environments and significantly improves task success rates. In the table tennis task, reaction time decreased from 500 milliseconds with traditional methods to 50 milliseconds. Ablation studies indicate that the Horizon-Aware Schedule plays the central role in reducing reaction time, and that compressing the denoising steps further improves real-time performance while maintaining trajectory quality.

Applications

The FASTER method has broad application prospects in robotic applications requiring real-time responses. Direct application scenarios include robot navigation in dynamic environments, real-time monitoring systems, and automated tasks requiring rapid decision-making. In these scenarios, FASTER can significantly improve the system's response speed and task success rate, especially when deployed on consumer-grade hardware.

Limitations & Outlook

Despite FASTER's excellent performance in dynamic environments, the system's reaction time may still be limited when processing large amounts of sensor data. Additionally, in high computational load scenarios, FASTER's performance may be constrained by hardware performance. Future research could further optimize FASTER's performance in complex environments, particularly in scenarios involving multi-sensor fusion.

Plain Language Accessible to non-experts

Imagine you're cooking in a kitchen. You need to keep an eye on the food in the pan, chop vegetables, and prepare spices all at once. Traditional methods are like having to chop all the vegetables before you start cooking, which might lead to overcooked food. The FASTER method is like being able to adjust the order of chopping and cooking based on what's happening in the pan. This approach allows you to react faster, ensuring each dish is completed at the optimal time. FASTER introduces a mechanism called Horizon-Aware Schedule, which acts like a smart assistant in the kitchen, helping you adjust the priority of each step based on the current situation, greatly improving overall efficiency and reaction speed.

ELI14 Explained like you're 14

Hey, friends! Imagine you're playing a super cool game where you need to control your character's movement, attack, and defense all at once. Traditional methods are like having to plan all your moves before starting the game, which might make you miss the best attack moment. The FASTER method is like being able to adjust your strategy on the fly, deciding what to do next based on the enemy's moves. This approach allows you to react faster, ensuring your character is always in the best position. FASTER introduces a mechanism called Horizon-Aware Schedule, like having a super smart assistant in the game, helping you adjust the priority of each move based on the current situation, greatly improving the overall gaming experience and reaction speed. Isn't that cool?

Glossary

Vision-Language-Action (VLA) Model

VLA models integrate visual information, language instructions, and action execution to enable intelligent decision-making.

Used for real-time decision-making in complex environments.

Horizon-Aware Schedule

An adaptive scheduling mechanism that prioritizes near-term actions to reduce reaction latency.

Used in FASTER to optimize the flow sampling process.

Flow Sampling

The iterative procedure by which a flow-based VLA generates an action chunk, starting from random noise and refining it over several sampling steps.

FASTER optimizes flow sampling to reduce reaction latency.

Reaction Time

The time interval from environmental change to the system starting to execute an action.

FASTER significantly reduces reaction time to improve real-time performance.

Denoising Steps

The individual sampling iterations that progressively refine a noisy action chunk into the final actions during flow sampling.

FASTER compresses denoising steps from multiple to a single step.

Client-Server Pipeline

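An architecture in which the robot (client) sends observations to a model server and receives actions back; in FASTER the pipeline is streaming, so near-term actions can be delivered as soon as they are ready.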

FASTER integrates this pipeline to optimize real-time execution efficiency.

Consumer-Grade GPU

Graphics processing units used by ordinary consumers, with lower performance than professional equipment.

FASTER achieves efficient execution on consumer-grade GPUs.

Asynchronous Inference

Running inference for the next action chunk while the robot is still executing the current one, so that motion does not pause during inference.

Commonly used in traditional methods but has reaction latency issues.

Dynamic Environment

Scenarios where environmental states constantly change, requiring real-time responses.

FASTER performs excellently in dynamic environments.

Trajectory Smoothness

The continuity and consistency of the trajectory during action execution.

FASTER maintains trajectory smoothness while reducing reaction time.

Open Questions Unanswered questions from this research

  • FASTER's performance optimization when handling multi-sensor data still requires further research. The current method may face reaction latency issues in complex environments, necessitating the exploration of new data fusion techniques to improve adaptability.
  • In high computational load scenarios, FASTER's performance may be limited by hardware capabilities. Future research could explore adaptability on different hardware platforms to enhance its generality across various application scenarios.
  • FASTER's performance in certain specific tasks may not match that of specially optimized strategies. Further research is needed to integrate other advanced machine learning techniques to enhance FASTER's overall performance and applicability.
  • Although FASTER performs excellently in dynamic environments, the system's reaction time may still be limited when processing large amounts of sensor data. New data processing and optimization techniques need to be explored to improve efficiency.
  • The gap between FASTER's theoretical foundation and practical application still requires further research. More real-world scenarios need to be tested to ensure its reliability across different applications.

Applications

Immediate Applications

Robot Navigation in Dynamic Environments

FASTER can be used to enhance robots' navigation capabilities in dynamic environments, ensuring they can quickly respond to environmental changes and improve task success rates.

Real-Time Monitoring Systems

In monitoring systems requiring real-time responses, FASTER can significantly improve the system's reaction speed and accuracy, especially when deployed on consumer-grade hardware.

Rapid Decision-Making in Automated Tasks

FASTER can be used in automated tasks requiring rapid decision-making, improving the system's response speed and task success rate.

Long-term Vision

Smart Home Systems

FASTER can be used in smart home systems to enhance devices' responsiveness to environmental changes, improving user experience.

Autonomous Vehicles

In autonomous vehicles, FASTER can enhance the vehicle's responsiveness to dynamic environments, ensuring driving safety.

Abstract

Real-time execution is crucial for deploying Vision-Language-Action (VLA) models in the physical world. Existing asynchronous inference methods primarily optimize trajectory smoothness, but neglect the critical latency in reacting to environmental changes. By rethinking the notion of reaction in action chunking policies, this paper presents a systematic analysis of the factors governing reaction time. We show that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon. Moreover, we reveal that the standard practice of applying a constant schedule in flow-based VLAs can be inefficient and forces the system to complete all sampling steps before any movement can start, forming the bottleneck in reaction latency. To overcome this issue, we propose Fast Action Sampling for ImmediaTE Reaction (FASTER). By introducing a Horizon-Aware Schedule, FASTER adaptively prioritizes near-term actions during flow sampling, compressing the denoising of the immediate reaction by tenfold (e.g., in $π_{0.5}$ and X-VLA) into a single step, while preserving the quality of long-horizon trajectory. Coupled with a streaming client-server pipeline, FASTER substantially reduces the effective reaction latency on real robots, especially when deployed on consumer-grade GPUs. Real-world experiments, including a highly dynamic table tennis task, prove that FASTER unlocks unprecedented real-time responsiveness for generalist policies, enabling rapid generation of accurate and smooth trajectories.

cs.RO cs.CV
