How VLAs Fail Differently: Black-Box Action Monitoring Reveals Architecture-Specific Failure Signatures

TL;DR

This study reveals architecture-specific failure signatures in VLA models via black-box action monitoring, emphasizing the importance of architecture-matched monitors.

cs.RO Advanced 2026-05-28

Krishnam Gupta

robot control action monitoring architecture analysis safety deep learning

Key Findings

Methodology

The paper employs a training-free, black-box action monitoring framework called SafeContract, integrating conformal calibration and CUSUM change detection. Experiments involve three representative architectures—VQ-BeT, Diffusion Policy, and ACT—evaluated on 450 episodes across two tasks. Monitoring metrics such as reversal rate, jerk, and velocity violations are selected based on the underlying action generation mechanisms. Conformal calibration ensures statistical coverage of the metrics, while CUSUM detects abrupt changes indicative of failures. The methodology ensures architecture-specific insights by comparing failure signatures across models under identical conditions, with no access to internal model parameters, enabling a fair and scalable evaluation.
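As a rough illustration of the motion-level signals named above, the following sketch computes a reversal rate, a jerk magnitude, and a velocity-violation count from a one-dimensional action sequence. The function names, the 1-D simplification, and the exact definitions (sign flips between consecutive nonzero steps, mean absolute second difference, per-step magnitude above a limit) are assumptions made for illustration; the summary does not give SafeContract's formulas.

```python
from typing import List, Sequence


def diffs(xs: Sequence[float]) -> List[float]:
    """First-order finite differences of a 1-D sequence."""
    return [b - a for a, b in zip(xs, xs[1:])]


def reversal_rate(actions: Sequence[float]) -> float:
    """Fraction of consecutive nonzero steps whose direction flips sign.

    Assumed definition: a reversal is a sign change between two
    successive nonzero action increments.
    """
    steps = [d for d in diffs(actions) if d != 0.0]
    if len(steps) < 2:
        return 0.0
    flips = sum(1 for a, b in zip(steps, steps[1:]) if a * b < 0)
    return flips / (len(steps) - 1)


def mean_abs_jerk(actions: Sequence[float]) -> float:
    """Mean absolute second difference of the action signal.

    Follows the summary's description of jerk as the second
    derivative of the action; the averaging is an assumption.
    """
    second = diffs(diffs(actions))
    if not second:
        return 0.0
    return sum(abs(j) for j in second) / len(second)


def velocity_violations(actions: Sequence[float], limit: float) -> int:
    """Number of steps whose per-step change exceeds a fixed limit."""
    return sum(1 for d in diffs(actions) if abs(d) > limit)


smooth = [0.0, 1.0, 2.0, 3.0, 4.0]
oscillating = [0.0, 1.0, 0.0, 1.0, 0.0]
print(reversal_rate(smooth), reversal_rate(oscillating))
print(mean_abs_jerk(smooth), mean_abs_jerk(oscillating))
```

A real policy emits multi-dimensional actions, so in practice each function would be applied per dimension or to a norm of the action vector; that choice is left open here.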

Key Results

Reversal rate emerged as a universal predictor across all three architectures, achieving AUROC values of 0.93 (VQ-BeT), 0.79 (Diffusion), and 0.91 (ACT), with p<0.001, demonstrating its robustness and architecture independence.
Jerk monitoring showed high predictive power for discrete-token models (AUROC=0.88) but fell below chance (AUROC=0.41) for diffusion models, illustrating how the generation mechanism shapes failure signatures.
Velocity violations, despite being the most common safety mechanism in deployment, exhibited poor predictive performance (AUROC 0.41-0.69) across architectures, with some cases below chance, indicating their limited utility for failure prediction.
In continuous models, velocity monitoring provided negligible predictive signal (AUROC=0.52 on ACT and 0.41 on Diffusion), supporting the need for architecture-matched monitoring strategies.
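To make the AUROC figures above concrete, here is a minimal sketch of the rank-based AUROC: the probability that a randomly chosen failed episode has a higher monitor score than a randomly chosen successful one, with ties counted as one half. The function name and the toy scores are hypothetical, and this is not claimed to be the paper's evaluation code.

```python
from typing import Sequence


def auroc(scores: Sequence[float], failed: Sequence[bool]) -> float:
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    scores: one monitor score per episode (higher = more suspicious).
    failed: True where the episode ended in task failure.
    """
    pos = [s for s, f in zip(scores, failed) if f]
    neg = [s for s, f in zip(scores, failed) if not f]
    if not pos or not neg:
        raise ValueError("need at least one failed and one successful episode")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))


# Toy data: failed episodes score higher, so the monitor ranks them perfectly.
print(auroc([0.9, 0.8, 0.2, 0.1], [True, True, False, False]))
```

A value of 0.5 means the score carries no ranking information. A value below 0.5, such as the 0.41 reported for velocity violations, means higher scores are slightly more common in successful episodes than in failed ones.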

Significance

This research provides a foundational understanding of failure signatures specific to different VLA architectures, highlighting that a one-size-fits-all monitoring approach is ineffective. By systematically identifying architecture-dependent failure modes, it guides the design of tailored safety mechanisms, crucial for deploying autonomous robots in safety-critical applications. The insights bridge the gap between model architecture and operational safety, addressing a long-standing challenge in robotics: how to reliably predict and prevent task failures in complex, real-world environments. This work paves the way for safer, more reliable autonomous systems, especially as regulatory frameworks increasingly demand continuous safety assurance.

Technical Contribution

The paper introduces SafeContract, a novel, training-free, black-box monitoring framework that leverages conformal calibration and CUSUM detection to identify unsafe actions without requiring access to internal model parameters. It systematically compares failure signatures across discrete (VQ-BeT) and continuous (Diffusion, ACT) architectures, revealing fundamental differences in their failure modes. The study demonstrates that architecture-specific monitors—such as jerk and reversal rate for discrete models, reversal rate and momentum coherence for continuous models—significantly outperform generic safety checks like velocity violations. These contributions provide a scalable, architecture-aware approach to real-time safety monitoring, with theoretical guarantees on coverage and false alarms, enabling deployment-ready safety solutions.
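For readers unfamiliar with CUSUM, the sketch below shows a standard one-sided (upper) CUSUM detector in the sense of Page's scheme: it accumulates how far each observation exceeds a target plus a slack term and raises an alarm once the accumulated excess crosses a threshold. The parameter names and values are illustrative assumptions; the summary does not state how SafeContract sets them.

```python
from typing import Iterable, Optional


def cusum_alarm(
    values: Iterable[float],
    target: float,
    slack: float,
    threshold: float,
) -> Optional[int]:
    """Return the index of the first upper-CUSUM alarm, or None.

    target:    expected in-control level of the monitored metric.
    slack:     allowance subtracted each step so small noise does not accumulate.
    threshold: alarm level for the accumulated excess.
    """
    s = 0.0
    for t, x in enumerate(values):
        # Accumulate positive drift; reset to zero when the evidence goes negative.
        s = max(0.0, s + (x - target - slack))
        if s > threshold:
            return t
    return None


# Hypothetical metric trace: in control for five steps, then a sustained shift.
trace = [0.0] * 5 + [1.0] * 5
print(cusum_alarm(trace, target=0.0, slack=0.25, threshold=2.0))
```

With these toy settings the statistic grows by 0.75 per step after the shift, so the alarm fires on the third shifted sample. A larger slack or threshold trades detection delay for fewer false alarms.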

Novelty

This work is the first comprehensive empirical comparison of failure signatures across multiple VLA architectures within the same task environment, revealing architecture-dependent failure modes. It innovates by proposing architecture-matched monitoring strategies, combining conformal calibration with black-box detection, which do not require retraining or internal model access. The identification of reversal rate as a universal failure predictor and the detailed analysis of jerk and velocity signals across architectures represent significant advances in understanding the failure mechanisms of generative robot policies. These insights challenge the prevalent reliance on velocity-based safety checks, advocating for tailored, architecture-aware safety mechanisms.

Limitations

The experiments are primarily conducted in simulated environments, which may not fully capture the complexities and noise present in real-world robotic systems. The transferability of the identified failure signatures needs further validation on physical robots.
The study focuses on three main architecture types; other emerging hybrid or transformer-based models may exhibit different failure signatures, requiring additional investigation.
The monitoring metrics are primarily designed for motion-level failures; they do not account for perception, reasoning, or semantic errors that could also lead to task failure. Integrating multi-modal and higher-level safety checks remains an open challenge.

Future Work

Future research should extend validation to real robotic platforms, exploring how these failure signatures manifest under real-world conditions. Developing adaptive, online calibration methods for conformal bounds could improve robustness against distribution shifts. Additionally, integrating perception and task-level monitoring with motion-based signatures could create comprehensive safety frameworks. Exploring hybrid architectures and more complex models will further refine the understanding of failure modes. Ultimately, combining these insights with regulatory standards will facilitate the deployment of safe, autonomous robots in diverse environments.
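One published option for the adaptive calibration mentioned above is adaptive conformal inference (Gibbs and Candès, listed in the references), which nudges the working miscoverage level after every observation. The sketch below shows only that update rule; the function name and the step size are assumptions, and nothing here is claimed to be part of SafeContract.

```python
def aci_update(
    alpha_t: float,
    target_alpha: float,
    miscovered: bool,
    gamma: float = 0.01,
) -> float:
    """One step of the adaptive conformal inference update.

    alpha_t:      working miscoverage level used for the bound at step t.
    target_alpha: long-run miscoverage rate to aim for (e.g. 0.05).
    miscovered:   True if the observation at step t fell outside the bound.
    gamma:        step size (hypothetical default).
    """
    err = 1.0 if miscovered else 0.0
    # A miss lowers alpha (wider bounds next step); a hit raises it slightly.
    return alpha_t + gamma * (target_alpha - err)


alpha = 0.05
for miss in [False, False, True, False]:
    alpha = aci_update(alpha, target_alpha=0.05, miscovered=miss)
print(alpha)
```

The updated level can leave the interval [0, 1]; the usual convention is to treat a level at or below 0 as an unbounded interval and a level at or above 1 as an empty one.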

AI Executive Summary

The rapid advancement of autonomous robot control models, especially those leveraging deep learning, has revolutionized the field, enabling robots to perform complex tasks with unprecedented flexibility. However, ensuring the safety and reliability of these systems during real-world deployment remains a critical challenge. Traditional safety mechanisms, such as velocity limits and boundary checks, are often insufficient to detect subtle or architecture-specific failures that can lead to task failures or accidents.

Recent developments in action-space monitoring (ASM) have shown promise as a model-agnostic, external safety layer. This study introduces SafeContract, a novel, training-free monitoring toolkit that employs conformal calibration and CUSUM change detection to identify unsafe actions in real-time. Unlike prior methods that rely on internal model access or retraining, SafeContract operates purely as a black-box, making it highly scalable and adaptable.

The core innovation lies in systematically comparing three prominent VLA architectures (VQ-BeT, Diffusion Policy, and ACT) across two manipulation tasks with 450 episodes. The experiments reveal that failure signatures are architecture-dependent. For discrete models like VQ-BeT, high jerk and reversal rates serve as strong predictors, with AUROC scores of 0.88 and 0.93, respectively. Conversely, continuous models such as Diffusion and ACT exhibit failure signatures characterized by reversal rate and momentum coherence, with jerk losing predictive power. Surprisingly, velocity violations, despite being the default safety mechanism, show limited predictive value (AUROC 0.41-0.69), in some cases below chance.

These findings underscore the importance of architecture-matched monitoring strategies. A universal monitor cannot effectively predict failures across diverse models. Instead, tailored approaches—monitoring jerk and reversal rate for discrete models, reversal rate and momentum coherence for continuous models—are essential for reliable failure prediction. The study demonstrates that such tailored monitoring significantly improves failure detection accuracy without degrading task success rates.

Beyond theoretical insights, this work has immediate practical implications. It provides a scalable, easy-to-deploy safety layer that can be integrated into existing robotic systems without retraining. The approach enhances safety in industrial, service, and research robots, especially as regulatory frameworks demand continuous safety assurance. Looking forward, extending validation to real robots, incorporating perception and reasoning signals, and developing adaptive calibration methods will further strengthen the safety guarantees. This research marks a significant step toward safer autonomous robots capable of operating reliably in complex, dynamic environments.

Deep Analysis

Background

Robotic control has moved from rule-based systems to deep learning models capable of end-to-end autonomous operation. Discrete-token policies such as VQ-BeT (built on vector-quantized autoencoders), diffusion policies, and action-chunking transformer policies such as ACT have demonstrated strong capabilities in generating complex motions from visual and, in some cases, linguistic inputs. Notable prior works include OpenVLA, LeRobot, and π0, which integrate multi-modal perception with action generation. Despite these advances, safety, and in particular the prediction and prevention of motion failures, has been underexplored. Traditional safety measures rely on boundary checks and velocity limits, which are often insufficient for the nuanced failure modes of modern generative models. As models grow in complexity, failure signatures become more subtle and architecture-dependent, which calls for monitoring approaches that adapt to these differences without requiring access to internal model parameters.

Core Problem

The core challenge addressed in this paper is the lack of systematic understanding of how different VLA architectures fail at the motor command level. Existing safety mechanisms are largely generic and do not account for the distinct failure signatures arising from different model structures. This leads to ineffective failure prediction, risking task failure or unsafe behaviors during deployment. The problem is compounded by the fact that many current systems operate as black boxes, with limited visibility into their internal states. Therefore, developing architecture-aware, model-agnostic monitoring strategies that can reliably predict failures in real-time is crucial. The difficulty lies in identifying robust, architecture-specific failure signatures that can be monitored externally, without retraining or model access, and that can generalize across tasks and environments.

Innovation

This work introduces several key innovations. First, the SafeContract framework combines conformal calibration with CUSUM change detection to create a robust, training-free monitoring system that guarantees statistical coverage of monitored signals. Second, it systematically compares failure signatures across three distinct architectures—VQ-BeT (discrete token-based), Diffusion Policy (continuous denoising), and ACT (action chunking)—highlighting their fundamental differences. Third, it identifies that failure predictors such as reversal rate are architecture-independent, while jerk and velocity violations are architecture-dependent. Fourth, the paper proposes architecture-matched monitor selection, tailoring specific metrics to each model type, which significantly enhances failure prediction accuracy. These innovations collectively enable scalable, effective safety monitoring without internal model access, addressing a critical gap in current robotic safety research.
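The architecture-matched selection described above can be pictured as a small lookup table. In the sketch below, the pairing of metrics with the discrete and continuous families follows the summary; the dictionary layout, the function name, and the fallback for unknown architectures are assumptions added for illustration.

```python
from typing import Dict, Tuple

# Family assignment as stated in the summary.
ARCHITECTURE_FAMILY: Dict[str, str] = {
    "VQ-BeT": "discrete",
    "Diffusion Policy": "continuous",
    "ACT": "continuous",
}

# Metrics the summary recommends per family.
MONITORS_BY_FAMILY: Dict[str, Tuple[str, ...]] = {
    "discrete": ("jerk", "reversal_rate"),
    "continuous": ("reversal_rate", "momentum_coherence"),
}


def select_monitors(architecture: str) -> Tuple[str, ...]:
    """Return the monitor names to enable for a given policy architecture."""
    family = ARCHITECTURE_FAMILY.get(architecture)
    if family is None:
        # Assumed fallback: reversal rate is the only signal the summary
        # reports as predictive on all three evaluated architectures.
        return ("reversal_rate",)
    return MONITORS_BY_FAMILY[family]


for name in ("VQ-BeT", "Diffusion Policy", "ACT"):
    print(name, select_monitors(name))
```

Velocity violations are deliberately absent from both families here, reflecting the reported AUROC range of 0.41-0.69.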

Methodology

Selection of monitoring metrics: Based on the understanding of action generation mechanisms, five key metrics are chosen: reversal rate, jerk (second derivative of action), spectral energy, total variation, and momentum coherence.
Conformal calibration: Using an 80/20 split of demonstration episodes, non-conformity scores are computed on the calibration set. Bounds are constructed at α=0.05, which nominally guarantees at least 95% coverage; the stated coverage is 97.9%.
Real-time change detection: CUSUM detectors are applied to each metric, with thresholds set to detect significant shifts indicative of failures.
Architecture-specific monitor design: For discrete models, jerk and reversal rate are prioritized; for continuous models, reversal rate and momentum coherence are emphasized.
Experimental evaluation: All three architectures are tested on identical tasks, with 450 episodes in total across both tasks. Failure signatures are analyzed via AUROC scores, comparing the predictive power of each metric.
Statistical analysis: AUROC scores are computed for each metric, and significance tests confirm the architecture dependence of failure signatures.
Deployment validation: SafeContract enforces safety bounds during execution, with violations logged and analyzed to assess the impact on task success.
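The conformal calibration step listed above can be sketched with standard split conformal prediction: sort the calibration scores and take the ceil((n+1)(1-α))-th smallest as the upper bound, which covers a new exchangeable score with probability at least 1-α. The function names and toy numbers below are assumptions, and the summary does not say whether SafeContract uses exactly this quantile rule.

```python
import math
from typing import Sequence


def conformal_upper_bound(calib_scores: Sequence[float], alpha: float = 0.05) -> float:
    """Split-conformal upper bound on a non-conformity score.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score,
    or +inf when the calibration set is too small for the requested level.
    """
    n = len(calib_scores)
    if n == 0:
        raise ValueError("calibration set is empty")
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf
    return sorted(calib_scores)[rank - 1]


def empirical_coverage(test_scores: Sequence[float], bound: float) -> float:
    """Fraction of held-out scores that fall at or below the bound."""
    return sum(1 for s in test_scores if s <= bound) / len(test_scores)


# Toy calibration scores 1..20 (hypothetical), target miscoverage 10%.
calib = [float(i) for i in range(1, 21)]
bound = conformal_upper_bound(calib, alpha=0.1)
print(bound, empirical_coverage([3.0, 7.0, 18.0, 25.0], bound))
```

Because the guarantee is a lower bound, the coverage measured on held-out episodes can come out above 1-α, which is one way a figure such as 97.9% can arise at α=0.05.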

Experiments

The experiments are conducted on two main platforms: PushT, a 2D manipulation task, and ALOHA, a 14-DOF bimanual manipulation task. All models—VQ-BeT, Diffusion Policy, and ACT—are evaluated under identical conditions, with 200 episodes per architecture for PushT and 50 for ALOHA. The models are tested with the same random seeds and safety bounds, ensuring comparability. SafeContract monitors are integrated without retraining, providing real-time detection of violations. Failure signatures are analyzed through AUROC metrics across multiple monitoring signals. The experiments also include ablation studies, removing or replacing specific metrics to assess their contribution. Results demonstrate architecture-dependent failure signatures and the effectiveness of tailored monitoring strategies in predicting task failures.

Results

The experimental results confirm that reversal rate is a robust, architecture-independent predictor, with AUROC scores of 0.79 or higher across all models. Jerk monitoring is highly predictive for VQ-BeT (AUROC=0.88) but falls below chance (AUROC=0.41) for diffusion models. Velocity violations, despite being the default safety check, show AUROC scores below 0.7 and sometimes below chance, indicating poor failure prediction capability. The tailored, architecture-matched monitoring strategies outperform generic approaches, with reversal rate reaching AUROC scores above 0.9 on VQ-BeT and ACT. The experiments also reveal qualitative differences: VQ-BeT exhibits jerky, oscillatory failures, while diffusion models tend to stall with smooth trajectories. These insights support the importance of architecture-aware monitoring for safe deployment.

Applications

The proposed monitoring framework can be directly integrated into existing robotic control systems, especially those employing deep generative models. It enables real-time failure prediction without modifying the underlying models, making it suitable for industrial automation, service robots, and autonomous vehicles. By automatically selecting architecture-specific monitors, it reduces the need for manual tuning and expert intervention. Additionally, the framework can be used in robot debugging and maintenance, providing insights into failure modes and guiding model improvements. In the long term, combining such motion-level safety monitoring with perception and task-level checks can lead to comprehensive, multi-layered safety systems capable of operating reliably in complex, dynamic environments.

Limitations & Outlook

The current experiments are limited to simulated environments, which may not fully replicate real-world noise, sensor inaccuracies, and environmental variability. The generalization of identified failure signatures to physical robots remains to be validated. The study focuses on three main architecture types; other emerging models, such as hybrid or transformer-based architectures, may exhibit different failure modes. The monitoring metrics primarily target motion-level failures, leaving perception, reasoning, and semantic errors unaddressed. Additionally, the approach assumes static safety bounds, which may need adaptation for dynamic or uncertain environments. Future work should include real-robot validation, adaptive calibration methods, and multi-modal safety integration to address these limitations.

Abstract

We discover that VLA architectures fail in fundamentally different, predictable ways at the motor-command level. Running VQ-BeT, Diffusion Policy, and ACT on identical evaluation protocols (n=450 episodes across PushT and ALOHA 14-DOF bimanual manipulation), we find: (1) direction reversal rate is a universal failure predictor across all three architectures (AUROC=0.93, 0.79, 0.91; p<0.001); (2) jerk monitoring is predictive only for discrete-token architectures, following a discrete-to-continuous gradient (0.88, 0.69, 0.41); (3) velocity violations alone are non-predictive everywhere (AUROC 0.41-0.69), yet velocity checking is the most common safety mechanism in VLA deployment code; and (4) for continuous-family VLAs, velocity monitoring provides effectively zero predictive signal (AUROC=0.52 on ACT, 0.41 on Diffusion), proving that architecture-matched monitor selection is essential. These results quantify a monitoring consequence of the well-known discrete/continuous VLA distinction: the two families produce qualitatively different failure signatures that require different monitors. No single monitor works universally; architecture-matched selection is required. This finding was enabled by SafeContract, a training-free, black-box action monitoring toolkit with conformal calibration. Code: https://github.com/krishnam94/vla-edge

cs.RO cs.LG

References (20)

Kevin Black, Noah Brown, Danny Driess et al. π0: A Vision-Language-Action Flow Model for General Robot Control. 2024.
Jiani Guo, Zhen Wu, Changhe Tu et al. On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations. 2025.
Seungjae Lee, Yibin Wang, Haritheja Etukuru et al. Behavior Generation with Latent Actions. 2024.
Maximilian Tolle, Theo Gruner, Daniel Palenicek et al. Towards Safe Robot Foundation Models Using Inductive Biases. 2025.
Cheng Chi, S. Feng, Yilun Du et al. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. 2023.
E. S. Page. Continuous Inspection Schemes. 1954.
Ralf Römer, Adrian Kobras, Luca Worbis et al. Failure Prediction at Runtime for Generative Robot Policies. 2025.
Tony Zhao, Vikash Kumar, S. Levine et al. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. 2023.
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti et al. OpenVLA: An Open-Source Vision-Language-Action Model. 2024.
Qiao Gu, Yuanliang Ju, Shengxiang Sun et al. SAFE: Multitask Failure Detection for Vision-Language-Action Models. 2025.
Wei Xiao, Tsun-Hsuan Wang, Chuang Gan et al. SafeDiffuser: Safe Planning with Diffusion Probabilistic Models. 2023.
Mustafa Shukor, Dana Aubakirova, Francesco Capuano et al. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics. 2025.
Joonkyung Kim, Wenxi Chen, Davood Soleymanzadeh et al. Modular Safety Guardrails Are Necessary for Foundation-Model-Enabled Robots in the Real World. 2026.
Christopher Agia, Rohan Sinha, Jingyun Yang et al. Unpacking Failure Modes of Generative Policies: Runtime Monitoring of Consistency and Progress. 2024.
Borong Zhang, Yuhao Zhang, Jiaming Ji et al. SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning. 2025.
Aaron O. Feldman, D. Harp, Joseph Duncan et al. Conformal Safety Monitoring for Flight Testing: A Case Study in Data-Driven Safety Learning. 2025.
Isaac Gibbs, E. Candès. Adaptive Conformal Inference Under Distribution Shift. 2021.
Anthony Brohan, Noah Brown, Justice Carbajal et al. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. 2023.
Song Hu, Zeyi Liu, Shuang Liu et al. VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer. 2025.
Vladimir Vovk, A. Gammerman, G. Shafer. Algorithmic Learning in a Random World. 2005.
