How VLAs Fail Differently: Black-Box Action Monitoring Reveals Architecture-Specific Failure Signatures
This study reveals architecture-specific failure signatures in VLA models via black-box action monitoring, emphasizing the importance of architecture-matched monitors.
Key Findings
Methodology
The paper employs a training-free, black-box action monitoring framework called SafeContract, integrating conformal calibration and CUSUM change detection. Experiments involve three representative architectures—VQ-BeT, Diffusion Policy, and ACT—evaluated on 450 episodes across two tasks. Monitoring metrics such as reversal rate, jerk, and velocity violations are selected based on the underlying action generation mechanisms. Conformal calibration ensures statistical coverage of the metrics, while CUSUM detects abrupt changes indicative of failures. The methodology ensures architecture-specific insights by comparing failure signatures across models under identical conditions, with no access to internal model parameters, enabling a fair and scalable evaluation.
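The three action-level signals named above can be sketched directly from a logged action sequence. The definitions below are assumptions made for illustration (the summary does not give SafeContract's exact formulas): a reversal is a sign flip between consecutive per-dimension action deltas, jerk is taken as the second finite difference of the action, and a velocity violation is a step whose largest per-dimension delta exceeds a hypothetical `limit`.

```python
from typing import List, Sequence

Action = Sequence[float]


def deltas(actions: Sequence[Action]) -> List[List[float]]:
    """First finite difference between consecutive action vectors."""
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(actions, actions[1:])]


def reversal_rate(actions: Sequence[Action]) -> float:
    """Fraction of consecutive delta pairs whose sign flips (assumed definition)."""
    d = deltas(actions)
    flips, total = 0, 0
    for d0, d1 in zip(d, d[1:]):
        for x, y in zip(d0, d1):
            total += 1
            if x * y < 0:  # strict sign change; zero deltas do not count
                flips += 1
    return flips / total if total else 0.0


def mean_abs_jerk(actions: Sequence[Action]) -> float:
    """Mean magnitude of the second finite difference of the action."""
    second = deltas(deltas(actions))
    vals = [abs(v) for row in second for v in row]
    return sum(vals) / len(vals) if vals else 0.0


def velocity_violation_rate(actions: Sequence[Action], limit: float) -> float:
    """Fraction of steps whose largest per-dimension delta exceeds `limit`."""
    steps = [max(abs(v) for v in row) for row in deltas(actions)]
    return sum(s > limit for s in steps) / len(steps) if steps else 0.0


smooth = [[0.0], [1.0], [2.0], [3.0]]   # steady motion in one direction
zigzag = [[0.0], [1.0], [0.0], [1.0]]   # direction flips at every step
print(reversal_rate(smooth), reversal_rate(zigzag))
print(mean_abs_jerk(smooth), mean_abs_jerk(zigzag))
```

All three functions need only the emitted actions, which is what makes this style of monitoring black-box.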
Key Results
- Reversal rate emerged as a universal predictor across all three architectures, achieving AUROC values of 0.93 (VQ-BeT), 0.79 (Diffusion), and 0.91 (ACT), with p<0.001, demonstrating its robustness and architecture independence.
- Jerk monitoring showed high predictive power for the discrete-token model (AUROC=0.88) but degraded along the discrete-to-continuous gradient to 0.69 and then to 0.41, which is below chance, illustrating the influence of generation mechanisms on failure signatures.
- Velocity violations, despite being the most common safety mechanism in deployment, exhibited poor predictive performance (AUROC 0.41-0.69) across architectures, with some cases below chance, indicating their limited utility for failure prediction.
- In continuous models like Diffusion and ACT, velocity monitoring provided negligible predictive signal (AUROC=0.41 on Diffusion and 0.52 on ACT), supporting the case for architecture-matched monitoring strategies.
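To make the AUROC figures above concrete, here is a minimal sketch of how a per-episode monitor score can be turned into an AUROC using the rank (Mann-Whitney) formulation: the probability that a randomly chosen failed episode scores higher than a randomly chosen successful one. The function name and the toy data are illustrative assumptions, not the paper's evaluation code.

```python
from typing import Sequence


def auroc(scores: Sequence[float], failed: Sequence[bool]) -> float:
    """AUROC of `scores` as a predictor of failure, with ties counted as 0.5."""
    pos = [s for s, f in zip(scores, failed) if f]
    neg = [s for s, f in zip(scores, failed) if not f]
    if not pos or not neg:
        raise ValueError("need at least one failed and one successful episode")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: the metric is higher on failed episodes, so AUROC is 1.0.
print(auroc([0.9, 0.8, 0.2, 0.1], [True, True, False, False]))
# The same metric with the ordering inverted scores 0.0, i.e. below chance.
print(auroc([0.1, 0.2, 0.8, 0.9], [True, True, False, False]))
```

Under this reading, a value of 0.5 is chance, and a value such as 0.41 means the metric is, if anything, slightly higher on successful episodes than on failed ones.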
Significance
This research provides a foundational understanding of failure signatures specific to different VLA architectures, highlighting that a one-size-fits-all monitoring approach is ineffective. By systematically identifying architecture-dependent failure modes, it guides the design of tailored safety mechanisms, crucial for deploying autonomous robots in safety-critical applications. The insights bridge the gap between model architecture and operational safety, addressing a long-standing challenge in robotics: how to reliably predict and prevent task failures in complex, real-world environments. This work paves the way for safer, more reliable autonomous systems, especially as regulatory frameworks increasingly demand continuous safety assurance.
Technical Contribution
The paper introduces SafeContract, a novel, training-free, black-box monitoring framework that leverages conformal calibration and CUSUM detection to identify unsafe actions without requiring access to internal model parameters. It systematically compares failure signatures across discrete (VQ-BeT) and continuous (Diffusion, ACT) architectures, revealing fundamental differences in their failure modes. The study demonstrates that architecture-specific monitors—such as jerk and reversal rate for discrete models, reversal rate and momentum coherence for continuous models—significantly outperform generic safety checks like velocity violations. These contributions provide a scalable, architecture-aware approach to real-time safety monitoring, with theoretical guarantees on coverage and false alarms, enabling deployment-ready safety solutions.
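The architecture-matched selection described here amounts to a small lookup from architecture family to monitor set. The sketch below encodes only the pairing stated in this summary; the dictionary names, string keys, and function are hypothetical and are not SafeContract's API.

```python
from typing import Dict, Tuple

# Architecture -> family, as described in the summary (keys are illustrative).
ARCH_FAMILY: Dict[str, str] = {
    "vq-bet": "discrete",
    "diffusion": "continuous",
    "act": "continuous",
}

# Family -> monitors reported as predictive for that family.
MONITORS_BY_FAMILY: Dict[str, Tuple[str, ...]] = {
    "discrete": ("jerk", "reversal_rate"),
    "continuous": ("reversal_rate", "momentum_coherence"),
}


def select_monitors(architecture: str) -> Tuple[str, ...]:
    """Return the monitor names matched to an architecture family."""
    key = architecture.strip().lower()
    if key not in ARCH_FAMILY:
        raise ValueError(f"unknown architecture: {architecture!r}")
    return MONITORS_BY_FAMILY[ARCH_FAMILY[key]]


print(select_monitors("VQ-BeT"))
print(select_monitors("ACT"))
```

Reversal rate appears in both sets, consistent with its role as the one predictor that held up across all three architectures.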
Novelty
This work is the first comprehensive empirical comparison of failure signatures across multiple VLA architectures within the same task environment, revealing architecture-dependent failure modes. It innovates by proposing architecture-matched monitoring strategies, combining conformal calibration with black-box detection, which do not require retraining or internal model access. The identification of reversal rate as a universal failure predictor and the detailed analysis of jerk and velocity signals across architectures represent significant advances in understanding the failure mechanisms of generative robot policies. These insights challenge the prevalent reliance on velocity-based safety checks, advocating for tailored, architecture-aware safety mechanisms.
Limitations
- The experiments are primarily conducted in simulated environments, which may not fully capture the complexities and noise present in real-world robotic systems. The transferability of the identified failure signatures needs further validation on physical robots.
- The study focuses on three main architecture types; other emerging hybrid or transformer-based models may exhibit different failure signatures, requiring additional investigation.
- The monitoring metrics are primarily designed for motion-level failures; they do not account for perception, reasoning, or semantic errors that could also lead to task failure. Integrating multi-modal and higher-level safety checks remains an open challenge.
Future Work
Future research should extend validation to real robotic platforms, exploring how these failure signatures manifest under real-world conditions. Developing adaptive, online calibration methods for conformal bounds could improve robustness against distribution shifts. Additionally, integrating perception and task-level monitoring with motion-based signatures could create comprehensive safety frameworks. Exploring hybrid architectures and more complex models will further refine the understanding of failure modes. Ultimately, combining these insights with regulatory standards will facilitate the deployment of safe, autonomous robots in diverse environments.
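One candidate for the adaptive, online calibration mentioned above is the update rule from Adaptive Conformal Inference (Gibbs and Candès, listed in the references): the working miscoverage level is nudged up after a covered step and down after a miss. The sketch below is an assumption about how that rule could be applied to a monitor bound; it is not part of SafeContract as described, and the step size `gamma` is an illustrative choice.

```python
from typing import Iterable, List


def aci_update(alpha_t: float, target_alpha: float, covered: bool,
               gamma: float = 0.005) -> float:
    """One ACI step: alpha_{t+1} = alpha_t + gamma * (target_alpha - err_t)."""
    err = 0.0 if covered else 1.0  # err_t = 1 when the bound was violated
    return alpha_t + gamma * (target_alpha - err)


def run_aci(covered_flags: Iterable[bool], target_alpha: float = 0.05,
            gamma: float = 0.005) -> List[float]:
    """Trace the working alpha over a stream of covered / not-covered outcomes."""
    alpha_t = target_alpha
    trace = [alpha_t]
    for covered in covered_flags:
        alpha_t = aci_update(alpha_t, target_alpha, covered, gamma)
        trace.append(alpha_t)
    return trace


# A burst of misses lowers alpha, which would widen the bound at the next
# recalibration; a run of covered steps slowly raises it again.
trace = run_aci([True, True, False, False, True])
print([round(a, 5) for a in trace])
```

A lower working alpha corresponds to a more conservative (wider) bound, so the rule loosens the monitor after unexpected violations and tightens it gradually when the bound is holding.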
AI Executive Summary
The rapid advancement of autonomous robot control models, especially those leveraging deep learning, has revolutionized the field, enabling robots to perform complex tasks with unprecedented flexibility. However, ensuring the safety and reliability of these systems during real-world deployment remains a critical challenge. Traditional safety mechanisms, such as velocity limits and boundary checks, are often insufficient to detect subtle or architecture-specific failures that can lead to task failures or accidents.
Recent developments in action-space monitoring (ASM) have shown promise as a model-agnostic, external safety layer. This study introduces SafeContract, a novel, training-free monitoring toolkit that employs conformal calibration and CUSUM change detection to identify unsafe actions in real-time. Unlike prior methods that rely on internal model access or retraining, SafeContract operates purely as a black-box, making it highly scalable and adaptable.
The core innovation lies in systematically comparing three prominent VLA architectures (VQ-BeT, Diffusion Policy, and ACT) across two manipulation tasks with 450 episodes. The experiments reveal that failure signatures are architecture-dependent. For discrete models like VQ-BeT, high jerk and reversal rates serve as strong predictors, with AUROC scores of 0.88 and 0.93 respectively. Conversely, continuous models such as Diffusion and ACT exhibit failure signatures characterized by reversal rate and momentum coherence, with jerk losing predictive power. Surprisingly, velocity violations, despite being the default safety mechanism, show limited predictive value (AUROC 0.41-0.69), in some cases below chance.
These findings underscore the importance of architecture-matched monitoring strategies. A universal monitor cannot effectively predict failures across diverse models. Instead, tailored approaches—monitoring jerk and reversal rate for discrete models, reversal rate and momentum coherence for continuous models—are essential for reliable failure prediction. The study demonstrates that such tailored monitoring significantly improves failure detection accuracy without degrading task success rates.
Beyond theoretical insights, this work has immediate practical implications. It provides a scalable, easy-to-deploy safety layer that can be integrated into existing robotic systems without retraining. The approach enhances safety in industrial, service, and research robots, especially as regulatory frameworks demand continuous safety assurance. Looking forward, extending validation to real robots, incorporating perception and reasoning signals, and developing adaptive calibration methods will further strengthen the safety guarantees. This research marks a significant step toward safer autonomous robots capable of operating reliably in complex, dynamic environments.
Deep Analysis
Background
The evolution of robotic control has transitioned from rule-based systems to deep learning models capable of end-to-end autonomous operation. Architectures such as VQ-BeT (which quantizes actions into discrete latent tokens), diffusion models, and action-chunking policies like ACT have demonstrated remarkable capabilities in generating complex motions from visual and linguistic inputs. Notable prior work includes OpenVLA and π0, which integrate multi-modal perception with action generation, and the LeRobot library. Despite these advances, the safety aspect, particularly the prediction and prevention of motion failures, has been underexplored. Traditional safety measures rely on boundary checks and velocity limits, which are often insufficient for the nuanced failure modes of modern generative models. As models grow in complexity, failure signatures become more subtle and architecture-dependent, necessitating new monitoring paradigms that can adapt to these differences without requiring access to internal model parameters.
Core Problem
The core challenge addressed in this paper is the lack of systematic understanding of how different VLA architectures fail at the motor command level. Existing safety mechanisms are largely generic and do not account for the distinct failure signatures arising from different model structures. This leads to ineffective failure prediction, risking task failure or unsafe behaviors during deployment. The problem is compounded by the fact that many current systems operate as black boxes, with limited visibility into their internal states. Therefore, developing architecture-aware, model-agnostic monitoring strategies that can reliably predict failures in real-time is crucial. The difficulty lies in identifying robust, architecture-specific failure signatures that can be monitored externally, without retraining or model access, and that can generalize across tasks and environments.
Innovation
This work introduces several key innovations. First, the SafeContract framework combines conformal calibration with CUSUM change detection to create a robust, training-free monitoring system that guarantees statistical coverage of monitored signals. Second, it systematically compares failure signatures across three distinct architectures—VQ-BeT (discrete token-based), Diffusion Policy (continuous denoising), and ACT (action chunking)—highlighting their fundamental differences. Third, it identifies that failure predictors such as reversal rate are architecture-independent, while jerk and velocity violations are architecture-dependent. Fourth, the paper proposes architecture-matched monitor selection, tailoring specific metrics to each model type, which significantly enhances failure prediction accuracy. These innovations collectively enable scalable, effective safety monitoring without internal model access, addressing a critical gap in current robotic safety research.
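The CUSUM component mentioned first can be illustrated with the standard one-sided recursion from Page's continuous inspection scheme: accumulate how far each observation exceeds a reference level plus a slack, reset at zero, and raise an alarm once the sum crosses a threshold. The parameter names and values below are assumptions for illustration; the summary does not state how SafeContract sets them.

```python
from typing import Iterable, Optional


def cusum_first_alarm(values: Iterable[float], target: float,
                      slack: float, threshold: float) -> Optional[int]:
    """Index of the first step where the one-sided CUSUM exceeds `threshold`.

    S_t = max(0, S_{t-1} + (x_t - target - slack)); returns None if no alarm.
    """
    s = 0.0
    for t, x in enumerate(values):
        s = max(0.0, s + (x - target - slack))
        if s > threshold:
            return t
    return None


# Toy reversal-rate stream: nominal at 0.1, then a sustained shift to 0.6.
stream = [0.1] * 5 + [0.6] * 5
print(cusum_first_alarm(stream, target=0.1, slack=0.1, threshold=1.0))
# A stream that stays nominal never raises an alarm.
print(cusum_first_alarm([0.1] * 10, target=0.1, slack=0.1, threshold=1.0))
```

Because the statistic accumulates over time, a sustained moderate shift triggers an alarm after a few steps, while a single noisy sample below the threshold does not.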
Methodology
- Selection of monitoring metrics: Based on the action generation mechanism of each architecture, five key metrics are chosen: reversal rate, jerk (second derivative of action), spectral energy, total variation, and momentum coherence.
- Conformal calibration: Using an 80/20 split of demonstration episodes, non-conformity scores are computed on the calibration set. Bounds are constructed at α=0.05, which corresponds to a nominal coverage guarantee of at least 95%; the reported coverage is 97.9%.
- Real-time change detection: CUSUM detectors are applied to each metric, with thresholds set to detect significant shifts indicative of failures.
- Architecture-specific monitor design: For discrete models, jerk and reversal rate are prioritized; for continuous models, reversal rate and momentum coherence are emphasized.
- Experimental evaluation: All three architectures are tested on identical tasks, for 450 episodes in total. Failure signatures are analyzed via AUROC scores, comparing the predictive power of each metric.
- Statistical analysis: AUROC scores are computed for each metric, and significance tests confirm the architecture dependence of failure signatures.
- Deployment validation: SafeContract enforces safety bounds during execution, with violations logged and analyzed to assess the impact on task success.
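The conformal calibration step in the list above can be sketched with the standard split-conformal quantile: sort the calibration scores and take the ceil((n+1)(1-α))-th smallest as the upper bound. This is the textbook construction, shown under the assumption that each monitored metric is used directly as a one-sided non-conformity score; the summary does not specify SafeContract's exact score or split handling.

```python
import math
from typing import Sequence


def conformal_upper_bound(cal_scores: Sequence[float], alpha: float) -> float:
    """Split-conformal upper bound with nominal coverage of at least 1 - alpha."""
    n = len(cal_scores)
    if n == 0:
        raise ValueError("calibration set is empty")
    # Rank of the order statistic; the small epsilon guards against
    # floating-point error when (n + 1) * (1 - alpha) is an exact integer.
    k = math.ceil((n + 1) * (1 - alpha) - 1e-9)
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(cal_scores)[k - 1]


def empirical_coverage(test_scores: Sequence[float], bound: float) -> float:
    """Fraction of held-out scores that fall at or below the bound."""
    return sum(s <= bound for s in test_scores) / len(test_scores)


cal = [float(i) for i in range(1, 51)]   # 50 illustrative calibration scores
bound = conformal_upper_bound(cal, alpha=0.05)
print(bound)
print(empirical_coverage([10.0, 20.0, 48.0, 49.5], bound))
```

The guarantee holds on average over calibration draws and assumes exchangeable data, so the coverage measured on a particular held-out set can land above or below the nominal 95%.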
Experiments
The experiments are conducted on two main platforms: PushT, a 2D manipulation task, and ALOHA, a 14-DOF bimanual manipulation task. All models—VQ-BeT, Diffusion Policy, and ACT—are evaluated under identical conditions, with 200 episodes per architecture for PushT and 50 for ALOHA. The models are tested with the same random seeds and safety bounds, ensuring comparability. SafeContract monitors are integrated without retraining, providing real-time detection of violations. Failure signatures are analyzed through AUROC metrics across multiple monitoring signals. The experiments also include ablation studies, removing or replacing specific metrics to assess their contribution. Results demonstrate architecture-dependent failure signatures and the effectiveness of tailored monitoring strategies in predicting task failures.
Results
The experimental results confirm that reversal rate is a robust, architecture-independent predictor, with AUROC scores of at least 0.79 across all models. Jerk monitoring is highly predictive for VQ-BeT (AUROC=0.88) but loses its predictive power on the continuous architectures, falling as low as 0.41. Velocity violations, despite being the default safety check, show AUROC scores below 0.7 and sometimes below chance, indicating poor failure prediction capability. The tailored, architecture-matched monitoring strategies significantly outperform generic approaches, achieving AUROC scores above 0.9 for key failure signatures. The experiments also reveal qualitative differences: VQ-BeT exhibits jerky, oscillatory failures, while diffusion models tend to stall with smooth trajectories. These insights validate the importance of architecture-aware monitoring for safe deployment.
Applications
The proposed monitoring framework can be directly integrated into existing robotic control systems, especially those employing deep generative models. It enables real-time failure prediction without modifying the underlying models, making it suitable for industrial automation, service robots, and autonomous vehicles. By automatically selecting architecture-specific monitors, it reduces the need for manual tuning and expert intervention. Additionally, the framework can be used in robot debugging and maintenance, providing insights into failure modes and guiding model improvements. In the long term, combining such motion-level safety monitoring with perception and task-level checks can lead to comprehensive, multi-layered safety systems capable of operating reliably in complex, dynamic environments.
Limitations & Outlook
The current experiments are limited to simulated environments, which may not fully replicate real-world noise, sensor inaccuracies, and environmental variability. The generalization of identified failure signatures to physical robots remains to be validated. The study focuses on three main architecture types; other emerging models, such as hybrid or transformer-based architectures, may exhibit different failure modes. The monitoring metrics primarily target motion-level failures, leaving perception, reasoning, and semantic errors unaddressed. Additionally, the approach assumes static safety bounds, which may need adaptation for dynamic or uncertain environments. Future work should include real-robot validation, adaptive calibration methods, and multi-modal safety integration to address these limitations.
Abstract
We discover that VLA architectures fail in fundamentally different, predictable ways at the motor-command level. Running VQ-BeT, Diffusion Policy, and ACT on identical evaluation protocols (n=450 episodes across PushT and ALOHA 14-DOF bimanual manipulation), we find: (1) direction reversal rate is a universal failure predictor across all three architectures (AUROC=0.93, 0.79, 0.91; p<0.001); (2) jerk monitoring is predictive only for discrete-token architectures, following a discrete-to-continuous gradient (0.88, 0.69, 0.41); (3) velocity violations alone are non-predictive everywhere (AUROC 0.41-0.69), yet velocity checking is the most common safety mechanism in VLA deployment code; and (4) for continuous-family VLAs, velocity monitoring provides effectively zero predictive signal (AUROC=0.52 on ACT, 0.41 on Diffusion), proving that architecture-matched monitor selection is essential. These results quantify a monitoring consequence of the well-known discrete/continuous VLA distinction: the two families produce qualitatively different failure signatures that require different monitors. No single monitor works universally; architecture-matched selection is required. This finding was enabled by SafeContract, a training-free, black-box action monitoring toolkit with conformal calibration. Code: https://github.com/krishnam94/vla-edge
References (20)
π0: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess et al.
On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations
Jiani Guo, Zhen Wu, Changhe Tu et al.
Behavior Generation with Latent Actions
Seungjae Lee, Yibin Wang, Haritheja Etukuru et al.
Towards Safe Robot Foundation Models Using Inductive Biases
Maximilian Tolle, Theo Gruner, Daniel Palenicek et al.
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
Cheng Chi, S. Feng, Yilun Du et al.
Continuous Inspection Schemes
E. S. Page
Failure Prediction at Runtime for Generative Robot Policies
Ralf Römer, Adrian Kobras, Luca Worbis et al.
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
Tony Zhao, Vikash Kumar, S. Levine et al.
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti et al.
SAFE: Multitask Failure Detection for Vision-Language-Action Models
Qiao Gu, Yuanliang Ju, Shengxiang Sun et al.
SafeDiffuser: Safe Planning with Diffusion Probabilistic Models
Wei Xiao, Tsun-Hsuan Wang, Chuang Gan et al.
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Mustafa Shukor, Dana Aubakirova, Francesco Capuano et al.
Modular Safety Guardrails Are Necessary for Foundation-Model-Enabled Robots in the Real World
Joonkyung Kim, Wenxi Chen, Davood Soleymanzadeh et al.
Unpacking Failure Modes of Generative Policies: Runtime Monitoring of Consistency and Progress
Christopher Agia, Rohan Sinha, Jingyun Yang et al.
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning
Borong Zhang, Yuhao Zhang, Jiaming Ji et al.
Conformal Safety Monitoring for Flight Testing: A Case Study in Data-Driven Safety Learning
Aaron O. Feldman, D. Harp, Joseph Duncan et al.
Adaptive Conformal Inference Under Distribution Shift
Isaac Gibbs, E. Candès
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal et al.
VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer
Song Hu, Zeyi Liu, Shuang Liu et al.
Algorithmic Learning in a Random World
Vladimir Vovk, A. Gammerman, G. Shafer