Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement

Key Findings

Methodology

This paper introduces VERITAS, a generator-verifier framework for robot policies. A pre-trained generalist policy acts as a stochastic generator, producing multiple candidate action sequences at each decision point. A gradient-free visual verifier built on a vision-language model (VLM) scores each candidate for task alignment and physical plausibility using geometric projections. The highest-scoring candidate is executed immediately, which improves performance at inference time. Successful trajectories are logged and later used for offline policy fine-tuning via behavior cloning, creating a self-reinforcing cycle of improvement. Inference-time steering itself requires no additional training of the core policy: it relies only on sampling and verification, and this is what improves robustness and adaptability in deployment.
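
The sample-then-verify step described above can be sketched as a best-of-N selection. This is a minimal illustration, not the authors' implementation: the names `steer`, `make_toy_generator`, and `make_toy_verifier` are hypothetical, an action chunk is assumed to be a short list of 3D end-effector waypoints, and the toy verifier scores a chunk by how close its last waypoint lands to a known goal, standing in for the VLM-based verifier.

```python
import math
import random
from typing import Callable, List, Tuple

Waypoint = Tuple[float, float, float]
ActionChunk = List[Waypoint]  # assumption: a chunk is a short list of 3D waypoints


def steer(sample_chunk: Callable[[], ActionChunk],
          score_chunk: Callable[[ActionChunk], float],
          num_candidates: int = 5) -> Tuple[ActionChunk, float]:
    """Draw num_candidates chunks from the generator and return the best-scoring one."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates = [sample_chunk() for _ in range(num_candidates)]
    scores = [score_chunk(chunk) for chunk in candidates]
    best_index = max(range(num_candidates), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index]


def make_toy_generator(goal: Waypoint, noise: float,
                       rng: random.Random) -> Callable[[], ActionChunk]:
    """Hypothetical stochastic policy: ends near the goal with Gaussian noise."""
    def sample_chunk() -> ActionChunk:
        end = tuple(g + rng.gauss(0.0, noise) for g in goal)
        return [(0.0, 0.0, 0.0), end]  # start at the origin, end near the goal
    return sample_chunk


def make_toy_verifier(goal: Waypoint) -> Callable[[ActionChunk], float]:
    """Hypothetical verifier: higher score when the final waypoint is closer to the goal."""
    def score_chunk(chunk: ActionChunk) -> float:
        return -math.dist(chunk[-1], goal)
    return score_chunk


if __name__ == "__main__":
    rng = random.Random(0)
    goal = (0.4, 0.0, 0.2)
    chunk, score = steer(make_toy_generator(goal, 0.05, rng), make_toy_verifier(goal))
    print(f"selected end point {chunk[-1]}, score {score:.4f}")
```

Because the verifier is only called as a black-box scoring function, no gradients flow through it, and the generator policy is left unchanged.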

Key Results

In simulation, VERITAS consistently improved success rates by approximately 10 percentage points: inference-time steering raised the baseline success rate from 75% to 85%. Offline fine-tuning on verified trajectories further boosted performance to over 90%. In real-world experiments on robotic platforms such as DROID, the method achieved success rates comparable to expert demonstrations, with a 12% improvement over baseline policies. A/B tests confirmed that the vision-language geometric verifier outperformed traditional value-function-based verifiers, with a 12% higher success rate and a lower false-positive rate. Ablation studies indicated that sampling N=5 action chunks per step at a 15 Hz control frequency strikes a good balance between computational cost and verification effectiveness.
The experiments demonstrated that the verification mechanism effectively filters out physically invalid or suboptimal actions, leading to more reliable task execution. The offline policy updates based on verified trajectories resulted in a 10-15% success rate increase across multiple tasks, highlighting the potential for continuous self-improvement without human intervention. The approach proved robust across different manipulation tasks, environments, and policy architectures, indicating broad applicability.

Significance

This work addresses a fundamental bottleneck in robotic learning—dependence on costly human demonstrations—by enabling autonomous self-improvement through inference-time verification. It introduces a scalable, low-cost mechanism that allows robots to explore, validate, and refine behaviors during deployment, reducing reliance on offline data collection. The framework bridges the gap between offline training and online adaptation, offering a practical pathway toward truly autonomous robots capable of lifelong learning. Its success in both simulated and real-world settings demonstrates its potential to revolutionize robotic applications in manufacturing, service, and logistics, where adaptability and robustness are critical. Moreover, the integration of vision-language models for geometric verification opens new avenues for multi-modal reasoning in robotics, fostering more intelligent and context-aware systems.

Technical Contribution

The key technical innovation lies in integrating a gradient-free visual verifier based on vision-language models with a stochastic generator policy, forming a generator-verifier loop that operates solely at inference time. This design allows sampling multiple candidate actions, evaluating their task and physical validity via geometric projections, and selecting the best in real-time, without modifying the core policy. The visual verifier leverages pre-trained VLMs to generate a static visual trace of the target trajectory, which simplifies online geometric consistency checks, dramatically reducing computational overhead. The verified trajectories are stored and used for offline behavior cloning, effectively distilling verification reasoning into the policy. This approach circumvents the need for expensive retraining or large demonstration datasets, enabling scalable, continual policy improvement during deployment. The framework also generalizes across different tasks and policies, demonstrating broad applicability.
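
A geometric consistency check of this kind can be illustrated with a pinhole-camera projection followed by a distance-to-trace score. This is a sketch under stated assumptions and not the scoring rule from the paper. It assumes the waypoints are already expressed in the camera frame, the camera is an undistorted pinhole with intrinsics `(fx, fy, cx, cy)`, the visual trace is a list of pixel coordinates, and the score is the negative mean distance from each projected waypoint to its nearest trace point. The function names `project` and `trace_score` are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point3 = Tuple[float, float, float]   # waypoint in the camera frame, metres
Pixel = Tuple[float, float]           # (u, v) image coordinates
Intrinsics = Tuple[float, float, float, float]  # (fx, fy, cx, cy)


def project(point_cam: Point3, intrinsics: Intrinsics) -> Pixel:
    """Project a camera-frame 3D point to pixel coordinates (pinhole model)."""
    fx, fy, cx, cy = intrinsics
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (fx * x / z + cx, fy * y / z + cy)


def trace_score(chunk: Sequence[Point3], trace: Sequence[Pixel],
                intrinsics: Intrinsics) -> float:
    """Negative mean pixel distance from each projected waypoint to the nearest trace point."""
    if not chunk or not trace:
        raise ValueError("chunk and trace must be non-empty")
    total = 0.0
    for waypoint in chunk:
        u, v = project(waypoint, intrinsics)
        total += min(math.hypot(u - tu, v - tv) for tu, tv in trace)
    return -total / len(chunk)


if __name__ == "__main__":
    intrinsics = (500.0, 500.0, 320.0, 240.0)  # illustrative values only
    trace = [(320.0, 240.0), (345.0, 240.0), (370.0, 240.0)]
    on_trace = [(0.0, 0.0, 1.0), (0.05, 0.0, 1.0)]
    off_trace = [(0.0, 0.1, 1.0), (0.05, 0.1, 1.0)]
    print(trace_score(on_trace, trace, intrinsics))
    print(trace_score(off_trace, trace, intrinsics))
```

Since the trace is static, each candidate costs only a handful of projections and distance computations, which is consistent with the low online overhead described above.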

Novelty

This is the first work to embed inference-time verification directly into the policy execution loop for autonomous self-improvement in robotics. Unlike prior methods that rely on offline training or human supervision, VERITAS uses a plug-and-play visual verifier to evaluate multiple candidate actions in real time, forming a closed-loop, self-correcting system. Its core novelty is VLM-based geometric scoring, which grounds high-level semantic instructions in precise geometric constraints without requiring expensive annotations. The framework generates and verifies trajectories during deployment and then learns from the successful ones through offline fine-tuning, a significant departure from traditional supervised learning paradigms that opens new possibilities for scalable, autonomous robot learning.

Limitations

The effectiveness of the verification heavily depends on the accuracy of the vision-language model; in scenarios with poor visual or semantic understanding, the verifier may misjudge actions, leading to suboptimal or unsafe behaviors.
In highly dynamic or cluttered environments, geometric projection errors can accumulate, reducing verification reliability and potentially filtering out valid actions.
The computational overhead of sampling and verification, although optimized, still poses challenges for ultra-high-frequency control tasks, especially on resource-constrained hardware. Further optimization and hardware acceleration are needed for broader real-time deployment.
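
To make the real-time constraint concrete, the time available per candidate can be estimated from the control frequency and the number of candidates. The helper below is a back-of-the-envelope sketch. It assumes one round of sampling and verification must finish within the time it takes to execute one action chunk, and that candidates are either processed one after another or fully in parallel. The function name and the `steps_per_chunk` parameter are hypothetical and do not come from the paper.

```python
def per_candidate_budget_ms(control_hz: float, num_candidates: int,
                            steps_per_chunk: int = 1, parallel: bool = False) -> float:
    """Milliseconds available to sample and verify each candidate.

    Assumes one sample-and-verify round per executed chunk of
    steps_per_chunk control steps (an assumption, not a reported setting).
    """
    if control_hz <= 0 or num_candidates < 1 or steps_per_chunk < 1:
        raise ValueError("control_hz must be > 0; counts must be >= 1")
    window_ms = 1000.0 * steps_per_chunk / control_hz
    return window_ms if parallel else window_ms / num_candidates


if __name__ == "__main__":
    # N=5 candidates at 15 Hz, re-planning at every control step
    print(round(per_candidate_budget_ms(15, 5), 2))
    # Same setting, but re-planning only once per 8-step chunk
    print(round(per_candidate_budget_ms(15, 5, steps_per_chunk=8), 2))
```

With the reported setting of N=5 at 15 Hz, re-planning at every step leaves roughly 13 ms per candidate when candidates are processed sequentially. Executing several steps per chunk or batching candidates in parallel relaxes that budget.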

Future Work

Future research will focus on developing adaptive, learning-based verifiers that can improve their judgment over time, possibly via reinforcement learning. Extending the framework to multi-robot systems and long-horizon tasks will be another direction, addressing challenges in coordination and temporal consistency. Additionally, integrating more sophisticated multi-modal perception, such as tactile and auditory cues, could enhance verification robustness. Exploring the combination of inference-time verification with online reinforcement learning algorithms may enable robots to autonomously discover new behaviors and adapt to unforeseen environments, pushing toward lifelong autonomous learning systems.

AI Executive Summary

In the rapidly evolving field of robotics, a persistent challenge has been enabling robots to learn and adapt autonomously in complex, real-world environments. Traditional approaches rely heavily on large-scale human demonstrations and offline training, which are costly, time-consuming, and often limited in generalization. Despite advances in deep learning and large pre-trained models, deploying robots that can continuously improve during operation remains a significant hurdle. This gap between offline training and online adaptation has motivated researchers to explore mechanisms that allow robots to self-assess and refine their behaviors in real-time.

The paper introduces VERITAS, a novel framework that leverages inference-time verification to enable autonomous policy improvement. The core idea is to treat the robot’s policy as a stochastic generator that produces multiple candidate actions at each decision point. These candidates are then evaluated by a plug-and-play visual verifier based on a vision-language model (VLM), which assesses their task relevance and physical plausibility through geometric projections. The highest-scoring candidate is executed immediately, yielding instant performance gains. Crucially, the verified successful trajectories are stored and used for offline policy fine-tuning via behavior cloning, creating a self-reinforcing cycle that continuously enhances the robot’s capabilities.

This approach addresses a fundamental bottleneck in robotic learning, the high cost of data collection, by replacing it with an inference-time verification mechanism that requires no additional training of the core policy. Verification is efficient because it relies on static visual traces generated by the VLM, which simplify the online geometric consistency checks. The system as a whole forms a closed loop in which the robot learns from its own successful experiences, a data flywheel that accelerates policy improvement.
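
The data flywheel can be sketched as a filtering step that turns logged rollouts into supervised (observation, action) pairs for behavior cloning. This is an illustrative sketch only. The `Rollout` record, its fields, and the optional `min_score` threshold are hypothetical, and how a rollout is labeled successful is an assumption that this summary does not specify.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Action = Tuple[float, ...]


@dataclass
class Rollout:
    observations: List[str]        # placeholder for images / instructions
    actions: List[Action]          # executed (verifier-selected) actions
    success: bool                  # assumed to come from a task-success check
    mean_verifier_score: float     # assumed summary of per-step verifier scores


def build_bc_dataset(rollouts: List[Rollout],
                     min_score: Optional[float] = None) -> List[Tuple[str, Action]]:
    """Keep successful rollouts and flatten them into (observation, action) pairs."""
    pairs: List[Tuple[str, Action]] = []
    for rollout in rollouts:
        if len(rollout.observations) != len(rollout.actions):
            raise ValueError("each observation needs exactly one action")
        if not rollout.success:
            continue  # failed rollouts provide no supervision here
        if min_score is not None and rollout.mean_verifier_score < min_score:
            continue  # optional extra filter on verifier confidence
        pairs.extend(zip(rollout.observations, rollout.actions))
    return pairs


if __name__ == "__main__":
    logged = [
        Rollout(["obs_a", "obs_b"], [(0.1,), (0.2,)], True, 0.9),
        Rollout(["obs_c"], [(0.3,)], False, 0.95),
    ]
    print(build_bc_dataset(logged))
```

The resulting pairs would then feed a standard behavior cloning loss on the generator policy. The fine-tuning step itself is omitted here because it depends on the policy architecture.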

Extensive experiments in both simulation and real-world robotic platforms demonstrate the effectiveness of VERITAS. In simulated manipulation tasks, the framework achieves an average success rate increase of 10%, with verified trajectories enabling the policy to outperform baseline methods. In real-world experiments, the approach attains comparable performance to expert demonstrations, with significant improvements in task success and robustness. The visual verifier’s geometric scoring outperforms traditional value-function-based methods, confirming its efficacy.

The significance of this work lies in its practical, scalable solution to autonomous robot learning. By integrating inference-time verification, robots can explore, validate, and improve behaviors during deployment, reducing reliance on costly data collection and human supervision. This paradigm shift opens new avenues for lifelong learning, adaptive control, and scalable deployment across diverse applications such as manufacturing, logistics, and service robotics. Despite its success, challenges remain, including the dependence on the accuracy of visual models and computational costs, which future research aims to address. Overall, VERITAS marks a substantial step toward truly autonomous, self-improving robotic systems, promising a future where robots can learn and adapt continuously in dynamic environments.

Deep Dive

Plain Language Accessible to non-experts

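Imagine you are cooking in your kitchen. You have a cookbook (like the robot's pre-trained policy) that tells you how to make a dish, but you also try different approaches, making several attempts to see which one tastes best. To make sure each attempt is good, you use a "taste tester" that samples every dish and judges whether it meets expectations. You don't have to teach the tester how to judge; it simply decides quickly whether the taste passes. You try several approaches, let the tester pick the best one, and serve it right away. This way you keep getting better and cooking tastier dishes without relearning everything from scratch each time. A robot using VERITAS works the same way: it keeps trying and verifying as it acts, becoming smarter and more capable on its own. Through repeated trial and selection, the robot can learn new skills, keep improving, and even handle all kinds of new challenges.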

ELI14 Explained like you're 14

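Imagine you are playing a really cool game where your goal is to find the fastest, best route to finish a mission. You try a few different routes and use a "route checker" to help you judge which one is the safest, shortest, and coolest. You don't have to tell the checker how to judge; it just takes a quick look and picks the best route for you. The successful routes you find get written down, so next time you can use that experience to do even better. A robot using VERITAS works the same way: it keeps trying different actions as it goes, checks which ones work best, and gets smarter and stronger on its own. That way the robot doesn't have to learn from scratch every time or be taught by a person all the time; it can keep getting better by itself and take on harder tasks. Pretty cool, right?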

Glossary

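Generator-Verifier

A system that combines action generation with verification to select the best action at inference time and improve policy performance.

The core mechanism of VERITAS proposed in this paper.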

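Vision-Language Model (VLM)

A deep model that combines visual perception with natural language understanding; it is used to understand the environment and task instructions and to support geometric verification.

The underlying technology of the verifier.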

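Action Chunk

A short sequence of consecutive actions predicted by the policy, sampled and verified at inference time.

The basic unit of sampling and verification.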

Geometric Projection

Mapping the robot's end-effector position into pixel space in order to verify the spatial consistency of an action.

A key step in the verifier.

Behavior Cloning

Training a policy to imitate demonstration data so that it reproduces expert behavior.

The training method used for offline fine-tuning.

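Distribution Shift

The problem of a policy behaving differently in its training and deployment environments, which degrades performance.

A challenge in policy fine-tuning.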

Inference-time Sampling

Generating multiple candidate actions during policy execution so that the verifier can choose among them.

One of the core techniques.

Closed-loop Learning

The process by which a policy continually improves using its own feedback.

The foundation of the self-improvement mechanism.

Self-Improvement

Using successfully verified trajectories to update the policy offline and improve its performance.

The key to continual policy optimization.

Visual Trace

A target path generated by the vision model, used to verify the spatial plausibility of actions.

An important element of the verifier.

Open Questions Unanswered questions from this research

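1 Ensuring the verifier's accuracy and robustness in extremely complex or dynamic environments remains a challenge. Future work needs to combine multi-modal information with learned verifiers to improve the adaptability and generalization of verification.
2 The use of the verification mechanism in multi-robot collaboration has not been fully explored; how to coordinate multiple robots that verify and optimize their actions at the same time is a direction for future research.
3 The computational cost of inference-time verification is still relatively high, especially in high-frequency control settings; further improving verification efficiency and reducing hardware dependence is key to broad adoption.
4 Verifier design depends heavily on the performance of vision-language models; future work should study more efficient and robust verification mechanisms that reduce the impact of model bias.
5 In long-horizon tasks and complex environments, the continuity and consistency of verified trajectories still need improvement to keep the policy stable across multi-stage, multi-goal tasks.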

Applications

Immediate Applications

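Autonomous industrial robot manipulation

Using VERITAS to let robots operate autonomously on assembly lines, reducing manual intervention and raising production efficiency; suited to high-precision assembly tasks.

Home service robots

In home environments, robots use inference-time verification to keep refining service behaviors such as carrying objects and cleaning, improving the user experience.

Warehouse and logistics automation

In warehouses, robots use autonomous verification to select the best paths and handling strategies, improving storage and retrieval efficiency and lowering operating costs.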

Long-term Vision

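An ecosystem of autonomously learning robots

Building robot systems that can keep learning and self-optimizing across many tasks and environments, achieving genuinely autonomous intelligence.

A comprehensive upgrade of intelligent manufacturing

Driving industrial production toward autonomous, flexible, and intelligent operation, with robots that adapt to new tasks and environments on their own without large numbers of demonstrations.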

Abstract

Robots deployed in the real world should learn from their experience and improve over time. This requires a mechanism of practicing and learning from feedback. In this paper, we propose VERITAS, a generator-verifier framework for generalist robot policies for inference-time policy steering and self-improvement. We use a pre-trained generalist robot policy as a "generator" and pair it with a gradient-free "visual verifier" that evaluates actions at inference time. This framework enables inference-time steering that improves policy performance without additional training. We demonstrate that inference-time verification consistently outperforms vanilla generalists without training on additional demonstration data. Additionally, we demonstrate that the verified rollouts provide effective supervision for offline policy improvement: policies fine-tuned on verified self-generated trajectories achieve consistent performance gains. Notably, we find that post-training with verified rollouts achieves comparable efficiency to expert demonstrations, while requiring no human interventions. Our results highlight inference-time verification as a practical and scalable mechanism for improving robotic policies during deployment.

cs.RO cs.AI