FASTER: Value-Guided Sampling for Fast RL

TL;DR

FASTER reduces computational cost by filtering action samples early in the denoising process, while maintaining RL performance.

cs.LG · Advanced · 2026-04-22
Perry Dong, Alexander Swerdlow, Dorsa Sadigh, Chelsea Finn
reinforcement learning · denoising process · Markov decision process · sampling method · computational efficiency

Key Findings

Methodology

FASTER models the denoising of multiple action candidates as a Markov Decision Process (MDP) and progressively filters candidates before denoising is complete. By learning a policy and value function in the denoising space, it predicts the downstream value of each candidate and filters them while maximizing returns. The method is lightweight and can be integrated into existing generative RL algorithms; a minimal sketch of the filtering loop follows.
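
To make the mechanism concrete, here is a minimal sketch of such a value-guided filtering loop. The interface is hypothetical: `denoise_step`, `value_fn`, and the halving schedule are illustrative stand-ins, not the paper's actual API.

```python
import torch

def filtered_denoise(denoise_step, value_fn, obs, action_dim,
                     num_candidates=16, num_steps=8, keep_schedule=None):
    """FASTER-style early filtering, sketched with a stand-in interface.

    denoise_step(x, t, obs) -> x : one reverse-diffusion step on a candidate batch.
    value_fn(x, t, obs) -> scores: predicted downstream value of partially
    denoised candidates (the learned denoising-space value function).
    """
    x = torch.randn(num_candidates, action_dim)         # independent noise seeds
    keep = keep_schedule or {2: 8, 4: 4, 6: 1}          # prune survivors at fixed steps

    for t in range(num_steps):
        x = denoise_step(x, t, obs)                     # denoise every surviving candidate
        if t in keep:                                   # filter at the noise level:
            scores = value_fn(x, t, obs)                # score before denoising finishes
            x = x[torch.topk(scores, keep[t]).indices]  # retain only the top-k candidates

    return x[0]                                         # one fully denoised action survives

# Toy stand-ins so the sketch runs end to end.
dummy_step = lambda x, t, obs: 0.9 * x                  # placeholder denoiser
dummy_value = lambda x, t, obs: -x.pow(2).sum(-1)       # prefers actions near zero
action = filtered_denoise(dummy_step, dummy_value, obs=None, action_dim=7)
```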

Key Results

  • FASTER consistently improves underlying policies on long-horizon manipulation tasks in both online and batch-online RL, outperforming the compared methods, with significant gains on Robomimic and LIBERO.
  • Applied to a pretrained VLA, FASTER achieves the same performance while substantially reducing training and inference compute: update-step time drops from 11.6s to 2.5s and inference time from 566ms to 335ms.
  • By filtering at the noise level, FASTER captures the sample-variance signal exploited by best-of-N selection without fully denoising all action samples (a back-of-envelope cost model follows this list).
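
The source of the compute saving can be seen with a back-of-envelope cost model. The step counts and pruning schedule below are our illustrative assumptions, not figures from the paper.

```python
# Cost unit: one network evaluation per surviving candidate per denoising step.
num_steps, candidates = 8, 16

best_of_n_cost = candidates * num_steps      # denoise all 16 candidates fully: 128 evals

survivors, faster_cost = candidates, 0       # FASTER-style: halve survivors every 2 steps
for t in range(num_steps):
    faster_cost += survivors                 # pay only for candidates still alive
    if t in (1, 3, 5):
        survivors //= 2                      # prune at the noise level

print(best_of_n_cost, faster_cost)           # 128 vs 60 evaluations under this schedule
```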

Significance

FASTER recovers the performance gains of sampling-based test-time scaling without incurring their full computational cost. By filtering action candidates early during denoising, it relieves the compute bottleneck that arises in large models such as modern Vision-Language-Action (VLA) models, making it a practical option for resource-constrained or latency-sensitive environments.

Technical Contribution

FASTER offers a new perspective on action-candidate filtering by modeling the denoising process as an MDP. Unlike existing methods that select among fully denoised samples, it filters at the noise level, reducing computational cost, and it is lightweight enough to integrate seamlessly with existing generative RL algorithms.

Novelty

FASTER is the first method to reduce computational cost by filtering action samples early during denoising. Where existing methods select after full denoising, FASTER filters at the noise level, which significantly reduces compute.

Limitations

  • While FASTER significantly improves computational efficiency, it does not enhance sample efficiency; its performance gains largely rely on the inherent sample efficiency of the base algorithms.
  • The method applies only to policy classes that sample from initial noise seeds and cannot be directly applied to policies lacking that structure.
  • On some complex tasks, especially those requiring high precision, FASTER may not fully replace traditional sampling methods.

Future Work

Future research directions include: (1) improving the sample efficiency of FASTER; (2) extending it to policy classes without initial noise seeds; and (3) exploring applications in other domains such as autonomous driving and complex-system control.

AI Executive Summary

In modern reinforcement learning, many of the most expressive algorithms require sampling multiple action candidates and selecting the best one at test time, leading to high computational costs. This is particularly problematic in large models like modern Vision-Language-Action (VLA) models, where computational demands can become a bottleneck.

FASTER addresses this issue by filtering action samples early in the denoising process. Specifically, it models the denoising of multiple action candidates as a Markov Decision Process (MDP) and progressively filters candidates before denoising is complete. By learning a policy and value function in the denoising space, FASTER predicts the downstream value of each candidate and filters while maximizing returns.

The core technical principle of FASTER is filtering at the noise level. Traditional methods must fully denoise all action samples, whereas FASTER filters early in the denoising process, significantly reducing computational demands. Experimental results show that FASTER outperforms the compared methods on long-horizon manipulation tasks, particularly Robomimic and LIBERO.

FASTER's value extends beyond theoretical interest: by easing computational bottlenecks, it offers a practical solution for resource-constrained or latency-sensitive environments, and it recovers the performance gains of sampling-based test-time scaling without incurring their full computational cost.

However, FASTER has its limitations. While it significantly improves computational efficiency, it does not enhance sample efficiency. Furthermore, the method is only applicable to policy classes using initial noise seeds and cannot be directly applied to those lacking such a structure. Future research directions include improving sample efficiency and extending to other policy classes.

Deep Analysis

Background

In recent years, reinforcement learning (RL) has made significant progress, particularly with policies built on generative models such as diffusion models, which are now widely applied in domains like image/video generation and robotics. However, their high computational cost at training and test time has become a barrier to wider deployment, especially in modern Vision-Language-Action (VLA) models, where compute demands can become a bottleneck. Sampling-based methods draw multiple action candidates and select the best one at test time, which is expensive. Although distillation can amortize this cost by training the policy to directly reproduce high-value behaviors, training a separate policy can itself be expensive. Recovering the performance gains of sampling-based test-time scaling without incurring their full computational cost has therefore become an important research problem.

Core Problem

Current high-performing reinforcement learning methods sample multiple action candidates and select the best one at test time, which is computationally expensive, particularly in large models such as modern VLA models. Because traditional sampling methods must fully denoise every action sample, they are impractical in resource-constrained or latency-sensitive environments. The core problem is thus to recover the gains of sampling-based test-time scaling without the full computational cost.

Innovation

FASTER tackles this high computational cost by filtering action samples early in the denoising process. Specifically, it models the denoising of multiple action candidates as a Markov Decision Process (MDP) and progressively filters candidates before denoising is complete. By learning a policy and value function in the denoising space, FASTER predicts the downstream value of each candidate and filters while maximizing returns. Unlike traditional methods, it filters at the noise level rather than selecting after full denoising, significantly reducing compute.

Methodology

The core of FASTER is modeling the denoising process as a Markov Decision Process (MDP) and filtering action candidates early during denoising. The main steps are:

  • Define the denoising MDP: treat denoising as an MDP whose states comprise the environment state, the denoising timestep, and the partially denoised intermediates, and whose actions select which candidates to retain.
  • Learn the denoising Q-function: use standard temporal-difference learning to train a Q-function and policy that decide which candidates to keep and which to drop (a training sketch follows this list).
  • Filtering policy: filter at the noise level, continuing to denoise only the most promising candidates, which cuts computational cost.
  • Experimental validation: evaluate the method on challenging tasks such as Robomimic and LIBERO.
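
Below is a minimal sketch of how such a denoising-space Q-function might be trained with temporal-difference learning. The architecture, scalar timestep embedding, and reward placement are our assumptions for illustration; the paper's exact objective may differ.

```python
import torch
import torch.nn as nn

class DenoisingQ(nn.Module):
    """Hypothetical denoising-space Q-function: scores a partially denoised
    action candidate given the observation and the denoising timestep."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, x_t, t):
        t = t.float().unsqueeze(-1)                      # scalar timestep embedding
        return self.net(torch.cat([obs, x_t, t], dim=-1)).squeeze(-1)

def td_loss(q, target_q, batch, gamma=0.99):
    """One temporal-difference step on the denoising MDP (illustrative).
    A transition advances the denoising timestep; environment reward is
    credited when denoising terminates (batch["done"] marks that boundary)."""
    pred = q(batch["obs"], batch["x_t"], batch["t"])
    with torch.no_grad():
        # Bootstrap from the next, less-noisy intermediate under the target network.
        next_v = target_q(batch["obs"], batch["x_next"], batch["t"] + 1)
        target = batch["reward"] + gamma * (1.0 - batch["done"]) * next_v
    return nn.functional.mse_loss(pred, target)
```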

Experiments

The experiments validate FASTER on challenging tasks such as Robomimic and LIBERO, against high-performing online RL baselines including EXPO and IDQL. Key hyperparameters include the number of denoising steps and the number of candidates (a hypothetical configuration sketch follows). Comparing FASTER with its unfiltered counterparts (e.g., EXPO and IDQL) verifies that it captures the sample-variance signal exploited by best-of-N selection without fully denoising all action samples.
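
For concreteness, a hypothetical configuration is sketched below; every value is an illustrative assumption, not a hyperparameter reported in the paper.

```python
# Hypothetical experiment configuration (all values are assumptions for illustration).
config = {
    "base_algorithm": "IDQL",        # or "EXPO": the unfiltered counterpart
    "num_denoising_steps": 8,        # reverse-diffusion steps per action
    "num_candidates": 16,            # initial noise seeds sampled per decision
    "filter_steps": [2, 4, 6],       # denoising steps at which candidates are pruned
    "keep_counts": [8, 4, 1],        # survivors retained after each pruning step
    "discount": 0.99,                # discount factor for the value function
}
```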

Results

Experimental results show that FASTER outperforms the compared methods on long-horizon manipulation tasks, particularly Robomimic and LIBERO. It recovers the performance gains of sampling-based test-time scaling without incurring their full computational cost: filtering at the noise level eases compute bottlenecks, which matters most in large models such as modern Vision-Language-Action (VLA) models. Concretely, update-step time falls from 11.6s to 2.5s and inference time from 566ms to 335ms.

Applications

FASTER's applications extend beyond theoretical research, demonstrating potential in practical applications. By reducing computational bottlenecks, this method provides a practical solution for resource-constrained or latency-sensitive environments. Specific application scenarios include autonomous driving, complex system control, and robotic operations. FASTER can significantly improve computational efficiency and reduce computational costs in these fields.

Limitations & Outlook

While FASTER significantly improves computational efficiency, it does not enhance sample efficiency. Its performance gains largely rely on the inherent sample efficiency of the base algorithms. Furthermore, the method is only applicable to policy classes using initial noise seeds and cannot be directly applied to those lacking such a structure. In some complex tasks, FASTER may not fully replace traditional sampling methods, especially in scenarios requiring high precision. Future research directions include improving sample efficiency and extending to other policy classes.

Plain Language (Accessible to non-experts)

Imagine you're in a kitchen cooking. Traditionally, you prepare a lot of ingredients and then try each one to see which tastes best. This is like traditional reinforcement learning methods, which require trying many actions and then selecting the best one. But this takes a lot of time and effort.

Now, the FASTER method is like being able to judge which ingredients are more likely to make a delicious dish while you're preparing them. This way, you don't have to try all the ingredients, just focus on the most promising ones. This greatly reduces your workload.

FASTER reduces computational costs by early filtering of action samples during the denoising process. It's like knowing which ingredients are the best choice before you start cooking. This not only saves time but also improves efficiency.

So, the core of the FASTER method is making wise choices early on to avoid unnecessary computation, just like selecting the best ingredients in the kitchen ahead of time.

ELI14 (Explained like you're 14)

Hey there, buddy! Imagine you're playing a super cool game. Usually, you need to try a lot of different strategies to find the best way to win the game. This is like traditional reinforcement learning methods, which require trying many actions and then selecting the best one.

But that can take a lot of time, right? So, scientists came up with a method called FASTER. It's like having a super helper in the game that can tell you which strategies are more likely to win before you even try them.

It's like having a magic compass in the game that tells you which direction is right. This way, you don't have to waste time on strategies that are less likely to succeed.

So, the FASTER method is like your game assistant, helping you find the best strategy faster, saving time and effort! Isn't that cool?

Glossary

Reinforcement Learning

A machine learning method where an agent learns to make decisions by interacting with an environment to maximize cumulative reward.

In this paper, reinforcement learning is used to train agents to select optimal actions in a given environment.

Denoising Process

In diffusion models, the iterative process that transforms random noise into a clean sample (here, an action).

FASTER reduces computational cost by early filtering of action samples during the denoising process.

Markov Decision Process

A mathematical model used to describe systems with randomness and decision-making processes.

FASTER models the denoising process as a Markov Decision Process to filter action samples during denoising.

Sampling Method

The process of drawing candidate outputs, here candidate actions, from a model or distribution.

Traditional reinforcement learning methods require sampling multiple action candidates and selecting the best one at test time.

Computational Efficiency

The ability to complete tasks with limited computational resources.

FASTER improves computational efficiency by reducing unnecessary computation.

Value Function

A function that predicts the future cumulative reward in a given state.

FASTER learns a value function in the denoising space to predict the downstream value of action candidates.

Policy

A rule or function for selecting actions in a given state.

FASTER learns a policy in the denoising space to filter action samples during denoising.

Generative RL Algorithm

A reinforcement learning algorithm that uses generative models to learn and optimize policies.

FASTER can be integrated into existing generative RL algorithms.

Sample Variance

The spread among sampled candidates; greater variance gives best-of-N selection more room to improve over a single sample.

FASTER captures the sample variance signal exploited by best-of-N selection by filtering at the noise level.

Vision-Language-Action Model

A multimodal model that combines vision, language, and action information.

FASTER demonstrates computational efficiency in modern Vision-Language-Action models.

Open Questions (Unanswered questions from this research)

  • How can FASTER be applied to policy classes without initial noise seeds? The current method applies only to policies that sample from initial noise seeds.
  • How can the sample efficiency of FASTER be improved? It significantly improves computational efficiency but does not enhance sample efficiency.
  • Can FASTER fully replace traditional sampling methods in complex tasks? In scenarios requiring high precision, it may not.
  • How can FASTER be adapted to other domains, such as autonomous driving and complex-system control? These fields may require adjustments to the method.
  • How does FASTER perform in other large-scale models? It demonstrates efficiency gains in modern Vision-Language-Action models, but its performance in other large models needs verification.

Applications

Immediate Applications

Autonomous Driving

The FASTER method can be used in autonomous driving systems to improve real-time decision-making efficiency and accuracy by reducing computational costs.

Robotic Operations

In robotic operations, the FASTER method can help robots select optimal actions faster, improving operational efficiency.

Complex System Control

In complex systems, the FASTER method can be used for real-time control, reducing computational bottlenecks and improving system response speed.

Long-term Vision

Smart Cities

The FASTER method can be applied to the management and control of smart cities, achieving more intelligent city management by improving computational efficiency.

Medical Diagnosis

In the medical field, the FASTER method can be used for real-time diagnosis and treatment plan selection, improving the efficiency and accuracy of medical services.

Abstract

Some of the most performant reinforcement learning algorithms today can be prohibitively expensive as they use test-time scaling methods such as sampling multiple action candidates and selecting the best one. In this work, we propose FASTER, a method for getting the benefits of sampling-based test-time scaling of diffusion-based policies without the computational cost by tracing the performance gain of action samples back to earlier in the denoising process. Our key insight is that we can model the denoising of multiple action candidates and selecting the best one as a Markov Decision Process (MDP) where the goal is to progressively filter action candidates before denoising is complete. With this MDP, we can learn a policy and value function in the denoising space that predicts the downstream value of action candidates in the denoising process and filters them while maximizing returns. The result is a method that is lightweight and can be plugged into existing generative RL algorithms. Across challenging long-horizon manipulation tasks in online and batch-online RL, FASTER consistently improves the underlying policies and achieves the best overall performance among the compared methods. Applied to a pretrained VLA, FASTER achieves the same performance while substantially reducing training and inference compute requirements. Code is available at https://github.com/alexanderswerdlow/faster.

cs.LG cs.AI

References (20)

  • EXPO: Stable Reinforcement Learning with Expressive Policies. Perry Dong, Qiyang Li, Dorsa Sadigh et al., 2025.
  • Efficient Online Reinforcement Learning with Offline Data. Philip J. Ball, Laura M. Smith, Ilya Kostrikov et al., 2023.
  • Steering Your Diffusion Policy with Latent Space Reinforcement Learning. Andrew Wagenmaker, Mitsuhiko Nakamoto, Yunchu Zhang et al., 2025.
  • Flow Q-Learning. Seohong Park, Qiyang Li, Sergey Levine, 2025.
  • The Crystal Ball Hypothesis in Diffusion Models: Anticipating Object Positions from Initial Noise. Yuanhao Ban, Ruochen Wang, Tianyi Zhou et al., 2024.
  • Posterior Behavioral Cloning: Pretraining BC Policies for Efficient RL Finetuning. Andrew Wagenmaker, Perry Dong, Raymond Tsao et al., 2025.
  • FIND: Fine-tuning Initial Noise Distribution with Policy Optimization for Diffusion Models. Changgu Chen, Libing Yang, Xiaoyan Yang et al., 2024.
  • Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps. Nanye Ma, Shangyuan Tong, Haolin Jia et al., 2025.
  • IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies. Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner et al., 2023.
  • Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. C. Snell, Jaehoon Lee, Kelvin Xu et al., 2024.
  • Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. Cheng Chi, S. Feng, Yilun Du et al., 2023.
  • One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation. Zhendong Wang, Zhaoshuo Li, A. Mandlekar et al., 2024.
  • Q-Learning with Adjoint Matching. Qiyang Li, Sergey Levine, 2026.
  • One Step Diffusion via Shortcut Models. Kevin Frans, Danijar Hafner, Sergey Levine et al., 2024.
  • Not All Noises Are Created Equally: Diffusion Noise Selection and Optimization. Zipeng Qi, Lichen Bai, Haoyi Xiong et al., 2024.
  • A Noise is Worth Diffusion Guidance. Donghoon Ahn, Jiwon Kang, Sanghyun Lee et al., 2024.
  • Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models. Yinlam Chow, Guy Tennenholtz, Izzeddin Gur et al., 2024.
  • Policy Representation via Diffusion Probability Model for Reinforcement Learning. Long Yang, Zhixiong Huang, Fenghao Lei et al., 2023.
  • Noise Hypernetworks: Amortizing Test-Time Compute in Diffusion Models. L. Eyring, Shyamgopal Karthik, Alexey Dosovitskiy et al., 2025.
  • Noise-Level Diffusion Guidance: Well Begun is Half Done. Harvey Mannering, Zhiwu Huang, Adam Prügel-Bennett, 2025.