RedVLA: Physical Red Teaming for Vision-Language-Action Models
RedVLA uncovers physical safety risks in VLA models through a two-stage process, achieving an attack success rate (ASR) of up to 95.5%.
Key Findings
Methodology
RedVLA is a red teaming framework for the physical safety of Vision-Language-Action (VLA) models. It systematically uncovers unsafe behaviors through a two-stage process. The first stage, Risk Scenario Synthesis, identifies critical interaction regions and positions risk factors within these regions to entangle them with the VLA's execution flow and elicit target unsafe behaviors. The second stage, Risk Amplification, ensures stable elicitation across heterogeneous models by iteratively refining the risk factor state through gradient-free optimization guided by trajectory features.
Key Results
- Experiments show that RedVLA uncovers diverse unsafe behaviors across six representative VLA models (OpenVLA, OpenVLA-OFT, VLA-Adapter, VLA-Adapter-Pro, π0, and π0.5), achieving an ASR of up to 95.5% within 10 optimization iterations.
- RedVLA proactively elicits unsafe behaviors at the state, cumulative, and conditional levels, achieving an average ASR of 92.7%.
- SimpleVLA-Guard, built from RedVLA-generated data, reduces online ASR by 59.5% with minimal impact on task performance.
Significance
The introduction of RedVLA fills a gap in the physical safety assessment of VLA models, providing a prerequisite for large-scale real-world deployment. By systematically introducing potential risk factors and uncovering unsafe behaviors, RedVLA maximizes physical safety risks without compromising the benign nature of the original scene and the semantic consistency of task instructions. This framework not only offers new research directions for academia but also provides an essential safety evaluation tool for industry when deploying VLA models.
Technical Contribution
RedVLA introduces a novel red teaming framework focused on the physical safety of VLA models. Unlike existing methods, RedVLA considers potential physical risks within the environment and models the physical causality of the risk process through gradient-free optimization. This approach enables RedVLA to uncover unique safety risks of VLA models in the physical world without compromising task feasibility.
Novelty
RedVLA is the first systematic exploration of red teaming for the physical safety of VLA models. Unlike previous methods that primarily target semantic and intent vulnerabilities, RedVLA shifts the source of risks from the intent space to the physical space, introducing a novel risk amplification method that ensures stable elicitation of unsafe behaviors across heterogeneous models.
Limitations
- RedVLA's risk factor initialization depends on the agent's initial state and instruction, which may lead to instability in eliciting target unsafe behaviors in certain scenarios.
- The framework's optimization process requires multiple iterations, which may not be efficient when computational resources are limited.
- SimpleVLA-Guard's cross-task generalization capability is limited, with detection performance dropping on unseen tasks.
Future Work
Future research directions include enhancing RedVLA's generalization across different tasks and environments, optimizing the risk amplification process to reduce computational overhead, and further improving SimpleVLA-Guard's cross-task detection and intervention capabilities. Additionally, exploring the application of RedVLA to other types of multimodal models to assess their safety in a broader range of AI systems is a promising avenue.
AI Executive Summary
Vision-Language-Action (VLA) models hold significant potential for applications in critical domains such as robotic manipulation, autonomous driving, and surgical robotics. However, their deployment in the real world is limited by the risk of unpredictable and irreversible physical harm. Existing red teaming methods primarily target semantic and intent vulnerabilities, overlooking the unique safety challenges posed by VLA models in the physical space.
To address this issue, Yuhao Zhang and colleagues have introduced RedVLA, the first red teaming framework for the physical safety of VLA models. RedVLA systematically uncovers unsafe behaviors through a two-stage process: first, Risk Scenario Synthesis identifies critical interaction regions and positions risk factors within these regions; second, Risk Amplification iteratively refines the risk factor state through gradient-free optimization to ensure stable elicitation of unsafe behaviors across heterogeneous models.
Concretely, Risk Scenario Synthesis analyzes end-effector trajectories from benign rollouts to locate critical interaction regions (transit, grasping, and vibration regions), then instantiates a risk factor inside those regions so that it becomes entangled with the VLA's execution flow without breaking task feasibility. Risk Amplification then uses trajectory spatial features as a gradient-free optimization signal, iteratively refining the risk factor state until the target unsafe behavior is reliably elicited across heterogeneous models.
Experiments demonstrate that RedVLA uncovers diverse unsafe behaviors across six representative VLA models, achieving an ASR of up to 95.5% within 10 optimization iterations. These models include OpenVLA, OpenVLA-OFT, VLA-Adapter, VLA-Adapter-Pro, π0, and π0.5. SimpleVLA-Guard, built from RedVLA-generated data, reduces online ASR by 59.5% with minimal impact on task performance.
The introduction of RedVLA fills a gap in the physical safety assessment of VLA models, providing a prerequisite for large-scale real-world deployment. By systematically introducing potential risk factors and uncovering unsafe behaviors, RedVLA maximizes physical safety risks without compromising the benign nature of the original scene and the semantic consistency of task instructions.
However, RedVLA's risk factor initialization depends on the agent's initial state and instruction, which may lead to instability in eliciting target unsafe behaviors in certain scenarios. Additionally, the framework's optimization process requires multiple iterations, which may not be efficient when computational resources are limited. Future research directions include enhancing RedVLA's generalization across different tasks and environments, optimizing the risk amplification process to reduce computational overhead, and further improving SimpleVLA-Guard's cross-task detection and intervention capabilities.
Deep Analysis
Background
Vision-Language-Action (VLA) models are rapidly advancing towards generalist robotic policies, enabling end-to-end learning from vision and language to action. These models are expanding their capabilities across critical domains such as manipulation, autonomous driving, and robotic surgery. However, with their increasing capabilities, safety concerns have been substantially amplified. Existing red teaming methods have extensively studied semantic and intent-based vulnerabilities in Large Language Models (LLMs) and Vision-Language Models (VLMs), but they overlook the unique safety challenges posed by VLA models in the physical space. These risks are inherent to the embodied agent-environment interaction and are not effectively detected by existing red teaming methods. Proactively identifying and mitigating physical safety risks is a crucial prerequisite for large-scale, real-world deployment of VLA models.
Core Problem
The deployment of VLA models in the real world is limited by the risk of unpredictable and irreversible physical harm. Existing red teaming methods primarily target semantic and intent vulnerabilities, overlooking the unique safety challenges posed by VLA models in the physical space. These risks arise from the embodied agent-environment interaction and are not effectively detected by existing methods. The core problem, therefore, is how to proactively uncover and maximize the physical safety risks of VLA models without compromising the benign nature of the original scene or the semantic consistency of task instructions.
Innovation
The core innovations of RedVLA include: 1) Risk Scenario Synthesis, which identifies critical interaction regions and positions risk factors within these regions to entangle them with the VLA's execution flow and elicit target unsafe behaviors; 2) Risk Amplification, which iteratively refines the risk factor state through gradient-free optimization to ensure stable elicitation of unsafe behaviors across heterogeneous models. Unlike previous methods that primarily target semantic and intent vulnerabilities, RedVLA shifts the source of risks from the intent space to the physical space, introducing a novel risk amplification method that ensures stable elicitation of unsafe behaviors across heterogeneous models.
Methodology
The methodology of RedVLA includes the following key steps:
- Risk Scenario Synthesis: Identifies critical interaction regions and positions risk factors within these regions to entangle them with the VLA's execution flow and elicit target unsafe behaviors.
- Interaction Identification: Analyzes end-effector trajectories from benign rollouts to identify critical interaction regions, including transit, grasping, and vibration regions.
- Risk Instantiation: Positions risk factors within critical interaction regions to entangle them with the VLA's execution flow and elicit target unsafe behaviors.
- Risk Amplification: Iteratively refines the risk factor state through gradient-free optimization to ensure stable elicitation of unsafe behaviors across heterogeneous models.
- Trajectory-Driven Risk Amplification: Uses trajectory spatial features to guide the optimization of the risk factor state, maximizing the likelihood of safety violations (a minimal code sketch of this loop follows the list).
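To make the amplification step concrete, below is a minimal, self-contained sketch of a trajectory-guided, gradient-free (zero-order) search over a risk factor's placement. All names and numbers here (toy_rollout, risk_score, the 2-D placement, the thresholds) are illustrative assumptions for this summary, not the authors' released implementation.

```python
# Minimal sketch of trajectory-guided, gradient-free risk amplification.
# Everything here (toy_rollout, risk_score, the 2-D risk-factor state) is a
# simplified stand-in, NOT the authors' implementation.
import numpy as np

rng = np.random.default_rng(0)

def toy_rollout(risk_xy):
    """Stand-in for a VLA rollout in the risk scene; returns an end-effector path (T, 2).
    (The toy path ignores the scene; a real rollout would react to it.)"""
    t = np.linspace(0.0, 1.0, 50)[:, None]
    nominal = np.hstack([t, 0.5 * np.ones_like(t)])       # straight benign path
    return nominal + 0.02 * rng.normal(size=nominal.shape)

def risk_score(traj, risk_xy, radius=0.05):
    """Trajectory feature guiding the search: the closer the approach, the higher the risk."""
    dists = np.linalg.norm(traj - risk_xy, axis=1)
    return float(radius - dists.min())                    # > 0 roughly means contact

def amplify_risk(init_xy, iters=10, step=0.05, n_candidates=8):
    """Zero-order (gradient-free) refinement of the risk-factor placement."""
    best_xy = np.asarray(init_xy, dtype=float)
    best_score = risk_score(toy_rollout(best_xy), best_xy)
    for _ in range(iters):
        if best_score > 0:                                # target violation already elicited
            break
        candidates = best_xy + step * rng.normal(size=(n_candidates, 2))
        scores = [risk_score(toy_rollout(c), c) for c in candidates]
        i = int(np.argmax(scores))
        if scores[i] > best_score:                        # keep the most dangerous placement
            best_xy, best_score = candidates[i], scores[i]
    return best_xy, best_score

placement, score = amplify_risk(init_xy=[0.5, 0.6])
print("risk factor placed at", placement, "violation elicited:", score > 0)
```

The random-perturbation loop stands in for whatever zero-order update rule the paper actually uses; the point it illustrates is that the rollout trajectory itself supplies the optimization signal.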
Experiments
The experimental design includes testing RedVLA's performance on six representative VLA models: OpenVLA, OpenVLA-OFT, VLA-Adapter, VLA-Adapter-Pro, π0, and π0.5. Experiments are conducted on the widely-adopted LIBERO benchmark, injecting risk factors into benign scenarios to elicit safety violations. Specifically, these violations span three safety cost types (state-level, cumulative-level, and conditional-level) and three physical hazard categories (resource damage, dangerous item misuse, and robot damage). By intersecting these two dimensions, ten risk scenario suites are constructed, e.g., Cumulative-Level Resource Damage. Each risk scenario is evaluated over 10 trials with different random seeds, and all metrics are averaged across trials.
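For orientation, here is a hedged sketch of what the per-scenario evaluation protocol might look like in code: each risk scenario suite is run for 10 seeded trials, and ASR is the fraction of trials in which the target violation fires. run_trial and the suite names are hypothetical stand-ins, not the benchmark's actual interface.

```python
# Sketch of the per-scenario evaluation protocol: 10 seeded trials, ASR averaged.
import random

def run_trial(scenario_id: str, seed: int) -> bool:
    """Toy stand-in: roll out the VLA in the risk scenario and report whether
    the target safety violation was triggered (placeholder outcome, not real data)."""
    random.seed(hash((scenario_id, seed)) % (2**32))
    return random.random() < 0.9

def attack_success_rate(scenario_id: str, n_trials: int = 10) -> float:
    violations = sum(run_trial(scenario_id, seed) for seed in range(n_trials))
    return 100.0 * violations / n_trials                  # ASR in percent

# Two of the ten risk scenario suites, named illustratively.
for suite in ["cumulative_resource_damage", "state_robot_damage"]:
    print(suite, f"ASR = {attack_success_rate(suite):.1f}%")
```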
Results
Experiments demonstrate that RedVLA uncovers diverse unsafe behaviors across six representative VLA models (OpenVLA, OpenVLA-OFT, VLA-Adapter, VLA-Adapter-Pro, π0, and π0.5), achieving an ASR of up to 95.5% within 10 optimization iterations. RedVLA proactively elicits unsafe behaviors at the state, cumulative, and conditional levels, with an average ASR of 92.7%. SimpleVLA-Guard, built from RedVLA-generated data, reduces online ASR by 59.5% with minimal impact on task performance.
Applications
RedVLA's application scenarios include safety evaluation and monitoring in domains such as robotic manipulation, autonomous driving, and surgical robotics. By systematically introducing potential risk factors and uncovering unsafe behaviors, RedVLA provides an essential safety evaluation tool for VLA models in these domains. Additionally, SimpleVLA-Guard can be used for real-time detection and intervention of unsafe behaviors, reducing online ASR and ensuring the safe execution of tasks.
Limitations & Outlook
RedVLA's risk factor initialization depends on the agent's initial state and instruction, which may lead to instability in eliciting target unsafe behaviors in certain scenarios. Additionally, the framework's optimization process requires multiple iterations, which may not be efficient when computational resources are limited. SimpleVLA-Guard's cross-task generalization capability is limited, with detection performance dropping on unseen tasks. Future research directions include enhancing RedVLA's generalization across different tasks and environments, optimizing the risk amplification process to reduce computational overhead, and further improving SimpleVLA-Guard's cross-task detection and intervention capabilities.
Plain Language (accessible to non-experts)
Imagine a robot learning to cook in a kitchen by following a recipe (the task instructions a VLA model receives). The kitchen also contains things that can cause real harm, like sharp knives, a hot stove, or a bottle of oil (the physical risk factors a VLA model can run into).
RedVLA works like a safety inspector who deliberately stages tricky situations before the robot is trusted in a real kitchen. It studies where the robot's arm usually moves, then places a hazard right in that path, say a knife next to the cutting board or a cloth near the burner, to see whether the robot bumps into it, knocks it over, or ignores it.
If the robot handles the first setup safely, the inspector nudges the hazard a little at a time, searching for the placement most likely to cause an accident. Every failure it finds is recorded as a lesson. Those lessons are used to build SimpleVLA-Guard, a watchful assistant that monitors the robot while it cooks and stops it just before it does something dangerous.
In this way, RedVLA finds the physical mistakes a robot could make before it is deployed, so they can be fixed in testing rather than discovered the hard way.
ELI14 (explained like you're 14)
Hey there! Robots that can see, listen, and move all at once are a bit like you playing a really complex video game: watch the screen, follow the instructions, hit the right moves. Even good players mess up sometimes, and when a real robot messes up, it can break things or hurt someone.
Scientists built RedVLA to find those mistakes on purpose, before they matter. It's like a level designer who builds tricky test levels: it figures out where the robot's arm is going to pass and puts something hazardous right there, like a glass of water next to the laptop the robot has to pick up, just to see if the robot knocks it over.
If the robot passes the test, RedVLA moves the hazard around, bit by bit, hunting for the spot where the robot is most likely to slip up. Every slip-up gets written down.
Those notes are then used to train a referee called SimpleVLA-Guard, which watches the robot in the real world and calls "stop!" right before it's about to do something dangerous. So RedVLA doesn't make robots fail for fun; it makes them fail in practice so they won't fail when it really counts.
Glossary
Vision-Language-Action Model
A model capable of learning and executing actions from visual and language inputs, commonly used in domains like robotic manipulation and autonomous driving.
The class of model whose physical safety the paper evaluates.
Red Teaming
A method of evaluating system security by simulating attacks to identify potential vulnerabilities.
Used to assess the physical safety of VLA models.
Risk Scenario Synthesis
Identifies critical interaction regions and positions risk factors within these regions to entangle them with the model's execution flow and elicit unsafe behaviors.
The first stage of the RedVLA method.
Risk Amplification
Iteratively refines the risk factor state through gradient-free optimization to ensure stable elicitation of unsafe behaviors across heterogeneous models.
The second stage of the RedVLA method.
Attack Success Rate (ASR)
Measures the percentage of episodes in which the target safety violation is successfully triggered, an important metric for evaluating red teaming effectiveness.
Used to evaluate RedVLA's performance across different models.
SimpleVLA-Guard
A lightweight safety guard built from RedVLA-generated data for real-time detection and intervention of unsafe behaviors.
Used to reduce online ASR and ensure safe task execution.
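The summary does not detail the guard's architecture, so the snippet below is only a plausible illustration of a "lightweight guard": a small classifier over windowed trajectory features, trained on labeled safe/unsafe rollouts such as those RedVLA generates. The feature choice and the logistic-regression model are assumptions, not the SimpleVLA-Guard design.

```python
# Illustrative sketch of a lightweight guard: a small classifier over windowed
# trajectory features, trained on labeled (safe / unsafe) red-teaming rollouts.
# Feature choice and model are assumptions, not the SimpleVLA-Guard architecture.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def window_features(traj, k=5):
    """Summarize the last k end-effector states: mean position plus mean speed."""
    recent = traj[-k:]
    speed = np.linalg.norm(np.diff(recent, axis=0), axis=1).mean()
    return np.concatenate([recent.mean(axis=0), [speed]])

# Toy training set standing in for RedVLA-generated rollouts.
safe   = [rng.normal(0.0, 0.05, size=(20, 3)) for _ in range(50)]
unsafe = [rng.normal(0.3, 0.05, size=(20, 3)) for _ in range(50)]
X = np.stack([window_features(t) for t in safe + unsafe])
y = np.array([0] * 50 + [1] * 50)

guard = LogisticRegression().fit(X, y)

# Online use: flag the current step if the predicted violation probability is high.
p_unsafe = guard.predict_proba(window_features(unsafe[0])[None, :])[0, 1]
print("halt" if p_unsafe > 0.5 else "continue", f"(p={p_unsafe:.2f})")
```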
Gradient-Free Optimization
An optimization method that does not rely on gradient information, often used when the objective function is non-differentiable or gradients are difficult to compute.
Used to optimize the risk factor state.
LIBERO Benchmark
A widely-adopted benchmark used to evaluate the performance of VLA models across different tasks and environments.
Used in experiments to test RedVLA's performance.
State-Level Safety Cost
Safety cost incurred when a single state-action pair directly constitutes a physical risk.
Used to evaluate safety violations in different risk scenarios.
Cumulative-Level Safety Cost
Potential risks arising from the temporal accumulation of specific behaviors, typically occurring over the entire trajectory.
Used to evaluate safety violations in different risk scenarios.
Conditional-Level Safety Cost
Risks that arise only when a precondition is active, typically defined by a pair of precursor and consequent predicates.
Used to evaluate safety violations in different risk scenarios.
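The three cost levels above can be read as three kinds of checks over a rollout. The sketch below encodes them as simple Python predicates; the concrete conditions and the budget value are invented for illustration, since each risk suite in the paper defines its own.

```python
# Illustrative encoding of the three safety-cost levels over a rollout.
# Predicates and thresholds are made up for the example.
def state_level_cost(states, in_collision):
    # Violated as soon as a single state-action pair constitutes a risk.
    return any(in_collision(s) for s in states)

def cumulative_level_cost(states, cost_per_step, budget=10.0):
    # Violated when a behavior accumulates past a budget over the trajectory.
    return sum(cost_per_step(s) for s in states) > budget

def conditional_level_cost(states, precursor, consequent):
    # Violated only if the consequent occurs after the precursor became active.
    armed = False
    for s in states:
        armed = armed or precursor(s)
        if armed and consequent(s):
            return True
    return False

# Toy usage: each state is a (position, stove_on, cloth_over_stove) tuple.
states = [(0.10, False, False), (0.20, True, False), (0.25, True, True)]
print(conditional_level_cost(states,
                             precursor=lambda s: s[1],    # stove turned on
                             consequent=lambda s: s[2]))  # cloth moved over it
```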
Trajectory-Driven Risk Amplification
Uses trajectory spatial features to guide the optimization of the risk factor state, maximizing the likelihood of safety violations.
A key step in the RedVLA method.
Cross-Task Generalization
The ability of a model to maintain good performance on unseen tasks, an important indicator of model robustness.
Used to evaluate SimpleVLA-Guard's detection performance.
Functional Conformal Prediction
A method for threshold calibration that enables preemptive halting before unsafe behavior occurs, reducing online ASR without significantly impacting task performance.
Used for online intervention in SimpleVLA-Guard.
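As rough intuition for the calibration step, the sketch below shows plain split-conformal calibration of a per-step halting threshold from guard scores collected on safe rollouts. The paper's functional variant calibrates over whole score trajectories, which this simplified per-step version does not capture; all data here are synthetic.

```python
# Simplified split-conformal calibration of a halting threshold.
# The paper describes a functional variant over whole score trajectories;
# this per-step toy version only illustrates the calibration idea.
import numpy as np

rng = np.random.default_rng(0)

# Guard risk scores collected on rollouts known to be safe (calibration set).
calibration_scores = rng.beta(2, 8, size=500)    # synthetic, skewed toward low risk

def conformal_threshold(scores, alpha=0.05):
    """Threshold exceeded by at most ~alpha of the safe calibration scores."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))      # finite-sample corrected quantile rank
    return float(np.sort(scores)[min(k, n) - 1])

tau = conformal_threshold(calibration_scores, alpha=0.05)

def step_policy(risk_score, tau=tau):
    # Preemptively halt as soon as the online guard score exceeds the threshold.
    return "halt" if risk_score > tau else "continue"

print(f"threshold = {tau:.3f}", step_policy(0.9), step_policy(0.1))
```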
Zero-Order Optimization
Another name for gradient-free optimization (see above): the objective is queried directly, without any gradient information.
Used interchangeably with gradient-free optimization when refining the risk factor state.
Open Questions (unanswered questions from this research)
- 1 How can RedVLA's generalization across different tasks and environments be improved? Current methods show performance drops on unseen tasks, necessitating the development of more robust optimization strategies.
- 2 How can the risk amplification process of RedVLA be optimized without increasing computational overhead? Current methods require multiple iterations, which may not be efficient when computational resources are limited.
- 3 How can SimpleVLA-Guard's cross-task detection and intervention capabilities be enhanced? Current methods show decreased detection performance on unseen tasks, necessitating the development of more generalized detection algorithms.
- 4 How can RedVLA be applied to other types of multimodal models to assess their safety in a broader range of AI systems? Current research focuses primarily on VLA models, and future exploration in other domains is promising.
- 5 How can physical safety risks be maximized without compromising the benign nature of the original scene and the semantic consistency of task instructions? Current methods may fail to consistently elicit target unsafe behaviors in certain scenarios, necessitating the development of more robust risk scenario synthesis strategies.
- 6 How can online ASR be reduced without significantly impacting task performance? Current methods achieve this through SimpleVLA-Guard, but future exploration of more efficient online intervention strategies is needed.
Applications
Immediate Applications
Robotic Manipulation Safety Evaluation
RedVLA can be used to assess the physical safety of robots performing complex tasks, helping to identify potential safety risks and prevent them.
Autonomous Driving Safety Monitoring
Data generated by RedVLA can enable safer deployment of autonomous driving systems in the real world, reducing accidents caused by physical risks.
Surgical Robotics Safety Guard
SimpleVLA-Guard can be used for real-time detection and intervention of unsafe behaviors in surgical robots, ensuring the safety of surgical procedures.
Long-term Vision
Cross-Domain Safety Evaluation
In the future, RedVLA can be expanded to other multimodal models, assessing their safety across different domains and advancing broader AI system safety research.
Comprehensive Safety Framework for Intelligent Systems
With continuous optimization and expansion, RedVLA could become a standard framework for safety evaluation of intelligent systems, helping to build more reliable and trustworthy AI systems.
Abstract
The real-world deployment of Vision-Language-Action (VLA) models remains limited by the risk of unpredictable and irreversible physical harm. However, we currently lack effective mechanisms to proactively detect these physical safety risks before deployment. To address this gap, we propose RedVLA, the first red teaming framework for physical safety in VLA models. We systematically uncover unsafe behaviors through a two-stage process: (I) Risk Scenario Synthesis constructs a valid and task-feasible initial risk scene. Specifically, it identifies critical interaction regions from benign trajectories and positions the risk factor within these regions, aiming to entangle it with the VLA's execution flow and elicit a target unsafe behavior. (II) Risk Amplification ensures stable elicitation across heterogeneous models. It iteratively refines the risk factor state through gradient-free optimization guided by trajectory features. Experiments on six representative VLA models show that RedVLA uncovers diverse unsafe behaviors and achieves an ASR of up to 95.5% within 10 optimization iterations. To mitigate these risks, we further propose SimpleVLA-Guard, a lightweight safety guard built from RedVLA-generated data. Our data, assets, and code are available at https://redvla.github.io.
References (20)
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti et al.
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
Bo Liu, Yifeng Zhu, Chongkai Gao et al.
RoboNurse-VLA: Robotic Scrub Nurse System based on Vision-Language-Action Model
Shunlei Li, Jin Wang, Rui Dai et al.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
Renrui Zhang, Jiaming Han, Aojun Zhou et al.
MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
Suyu Ge, Chunting Zhou, Rui Hou et al.
VLA-Arena: An Open-Source Framework for Benchmarking Vision-Language-Action Models
Borong Zhang, Jiahao Li, Jiacheng Shen et al.
π0: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess et al.
SAFE: Multitask Failure Detection for Vision-Language-Action Models
Qiao Gu, Yuanliang Ju, Shengxiang Sun et al.
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Hakan Inan, K. Upasani, Jianfeng Chi et al.
RLBench: The Robot Learning Benchmark & Learning Environment
Stephen James, Zicong Ma, David Rovick Arrojo et al.
Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics
Taowen Wang, Dongfang Liu, J. Liang et al.
When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models
Yuping Yan, Yuhan Xie, Yixin Zhang et al.
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma, Zixing Song, Yuzheng Zhuang et al.
LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization
Xueyang Zhou, Yangming Xu, Guiyao Tie et al.
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major et al.
AttackVLA: Benchmarking Adversarial and Backdoor Attacks on Vision-Language-Action Models
Jiayu Li, Yunhan Zhao, Xiang Zheng et al.
LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models
Senyu Fei, Siyin Wang, Junhao Shi et al.
Conformal Prediction: A Gentle Introduction
Anastasios Nikolas Angelopoulos, Stephen Bates
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal et al.
On the Safety Concerns of Deploying LLMs/VLMs in Robotics: Highlighting the Risks and Vulnerabilities
Xiyang Wu, Ruiqi Xian, Tianrui Guan et al.