SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
SafetyALFRED evaluates the safety-conscious planning of multimodal LLMs in simulated kitchen settings, finding strong hazard recognition but low risk-mitigation success.
Key Findings
Methodology
SafetyALFRED builds on the ALFRED benchmark, extending it with six categories of real-world kitchen hazards. The study evaluates 11 state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition but also active risk mitigation through embodied planning. The experiments reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates are low in comparison.
Key Results
- In QA tasks, models can recognize safety hazards with 92% average accuracy, but in embodied tasks, even with ground-truth environment state information, the average mitigation success rate is only 60%.
- Even with metadata, open-weight models average below 20% accuracy in the remaining hazard categories, despite their higher hazard-identification rates in QA tasks.
- Without metadata, models struggle to exceed 20% accuracy in most categories; only fire hazard, unsanitary, and spoilage fare better, reaching over 29%, over 35%, and nearly 100% accuracy, respectively, with closed-weight models.
Significance
The study highlights the inadequacy of existing QA evaluations for physical safety, advocating a shift towards benchmarks that prioritize corrective actions in embodied contexts. By introducing SafetyALFRED, the research demonstrates the capability gap in multimodal LLMs for recognizing and mitigating safety hazards in real-world kitchen environments. This finding is significant for academia and industry, particularly in the development and deployment of autonomous robotic systems, emphasizing the need for more comprehensive safety evaluation methods.
Technical Contribution
SafetyALFRED provides a new evaluation framework by extending the ALFRED benchmark to include six categories of kitchen hazards. Its technical contribution lies in revealing the performance gap of multimodal LLMs between static QA and dynamic embodied tasks, and in proposing a multi-agent framework that decouples hazard recognition from mitigation, although this approach only slightly improves performance.
Novelty
SafetyALFRED is the first study to extend the safety evaluation of multimodal LLMs from static QA to embodied contexts. Unlike existing benchmarks such as ASIMOV and MM-SafetyBench, SafetyALFRED not only focuses on hazard recognition but also emphasizes active risk mitigation, filling a critical gap in current research.
Limitations
- The low mitigation success rate in embodied tasks indicates difficulties in planning and executing corrective actions, especially without metadata.
- Despite good performance in QA tasks, models fail to effectively utilize their safety knowledge for actual behavior in embodied tasks.
- The multi-agent framework, while slightly improving performance, does not fully resolve the alignment issue between recognition and mitigation.
Future Work
Future research directions include developing more effective model architectures to improve risk mitigation capabilities of multimodal LLMs in embodied tasks. Additionally, studies can explore more complex environments and tasks to further test and enhance the safety planning capabilities of models. The community can also focus on better translating abstract safety knowledge into concrete actions.
AI Executive Summary
Multimodal large language models (MLLMs) are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. Existing safety evaluations primarily focus on hazard recognition through disembodied question answering (QA) settings, neglecting the ability to actively mitigate risks in embodied contexts.
To address this issue, we introduce SafetyALFRED, an extension of the ALFRED benchmark augmented with six categories of real-world kitchen hazards. We evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison.
This finding demonstrates that static evaluations through QA are insufficient for physical safety, thus we advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open-source our code and dataset to facilitate further research and development.
In our experiments, we used 30 kitchen environments and five task types in AI2Thor, involving object manipulation (move, stack, wash, heat, or cool), followed by placing the object at a final destination. We found that while models perform well in QA tasks, they fail to effectively utilize their safety knowledge for actual behavior in embodied tasks.
Future research directions include developing more effective model architectures to improve risk mitigation capabilities of multimodal LLMs in embodied tasks. Additionally, studies can explore more complex environments and tasks to further test and enhance the safety planning capabilities of models. The community can also focus on better translating abstract safety knowledge into concrete actions.
Deep Analysis
Background
Multimodal large language models (MLLMs) have demonstrated remarkable reasoning and decision-making capabilities, leading to their widespread adoption as autonomous embodied agents in both simulated and physical interactive environments. They can translate high-level natural language instructions into executable plans. However, as MLLMs transition into these roles, a major concern is their ability to identify and proactively resolve safety hazards, i.e., observable environmental states that, if left uncorrected, pose risks of physical injury, property damage, or resource loss. Despite this need, prior safety benchmarks like ASIMOV, Multimodal Situational Safety, and MM-SafetyBench have largely focused on the recognition of hazards through question-answering (QA) tasks based on static images, videos, or scenarios. A critical gap remains in evaluating an agent’s ability to not only recognize safety hazards but also generate plans that mitigate them in a dynamic embodied setting.
Core Problem
Multimodal large language models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. Existing safety evaluations primarily focus on hazard recognition through disembodied question answering (QA) settings, neglecting the ability to actively mitigate risks in embodied contexts. To evaluate whether MLLMs can translate safety knowledge acquired from web-scale pre-training into concrete behavior, we formulate a new safety problem. Given a task instruction and a multimodal observation, the model must advance the assigned task while proactively generating a plan to rectify hazards that could cause immediate or future harm.
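To make this formulation concrete, here is a minimal Python sketch of the interface it implies; `Observation`, `SafetyEpisode`, and `evaluate_plan` are hypothetical names for illustration, not the benchmark's actual API.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical illustration of the problem formulation: the agent receives a
# task instruction plus a multimodal observation, and must emit a plan that
# interleaves task progress with corrective actions for any hazards present.

@dataclass
class Observation:
    image_path: str                  # egocentric frame from the simulator
    metadata: Optional[dict] = None  # optional ground-truth environment state

@dataclass
class SafetyEpisode:
    instruction: str                 # e.g. "Heat the potato and put it on the table"
    observation: Observation
    hazards: list = field(default_factory=list)  # ground truth, hidden from the agent

def evaluate_plan(plan, episode):
    """Score a plan on hazard mitigation: a hazard counts as mitigated only if
    some plan step addresses it (crude string match, purely for illustration)."""
    mitigated = {h for h in episode.hazards
                 if any(h.lower() in step.lower() for step in plan)}
    return {"mitigation_rate": len(mitigated) / max(len(episode.hazards), 1)}
```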
Innovation
SafetyALFRED is the first study to extend the safety evaluation of multimodal LLMs from static QA to embodied contexts. Unlike existing benchmarks such as ASIMOV and MM-SafetyBench, SafetyALFRED not only focuses on hazard recognition but also emphasizes active risk mitigation, filling a critical gap in current research. We introduce an extension of the ALFRED benchmark for embodied instruction following, augmented with six carefully selected safety hazards that represent real-world risks in common kitchen settings. Using SafetyALFRED, we evaluate eleven MLLMs in two settings: (1) where the agent acts as a safety judge and identifies hazards in the scene; and (2) an embodied task where the agent completes the assigned task while immediately mitigating any safety hazards.
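The structural difference between the two settings can be illustrated with prompt templates along the following lines; the exact prompts used in SafetyALFRED may differ, so treat these as assumptions.

```python
# Illustrative prompt templates for the two evaluation settings; the
# benchmark's actual wording is not reproduced here.

QA_JUDGE_PROMPT = (
    "You are shown an image of a kitchen.\n"
    "List any safety hazards you can see (for example, an unattended flame "
    "or a knife at the counter edge), or answer 'none'."
)

EMBODIED_PROMPT = (
    "Task: {instruction}\n"
    "You control a household robot in this kitchen. Produce a step-by-step "
    "plan that completes the task and immediately corrects any safety "
    "hazard you observe along the way."
)
```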
Methodology
- SafetyALFRED builds on the ALFRED benchmark, extending it with six categories of real-world kitchen hazards.
- The study evaluates 11 state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition but also active risk mitigation through embodied planning.
- The experiments reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates are low in comparison.
- We propose a multi-agent framework to decouple hazard recognition from mitigation, although this approach only slightly improves performance (a minimal sketch follows this list).
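As flagged in the last bullet, here is a minimal Python sketch of the decoupling idea: one agent only recognizes hazards, a second only plans corrective steps, and the corrective steps are merged ahead of the task plan. The `mllm.generate` interface and all function names are hypothetical illustrations, not the paper's actual framework.

```python
# Hypothetical sketch of decoupling hazard recognition from mitigation.
# `mllm` is assumed to expose a generic generate(image=..., prompt=...) call;
# the paper's actual multi-agent framework may be organized differently.

def recognizer_agent(mllm, image):
    """QA-style call: ask the model which hazards are visible in the scene."""
    reply = mllm.generate(image=image,
                          prompt="List all safety hazards in this scene, one per line.")
    return [line.strip() for line in reply.splitlines() if line.strip()]

def mitigation_agent(mllm, hazards):
    """Ask the model for one concrete corrective action per recognized hazard."""
    return [mllm.generate(prompt=f"Give one concrete action that fixes this hazard: {hazard}")
            for hazard in hazards]

def safe_plan(mllm, image, task_plan):
    """Prepend mitigation steps so hazards are corrected before the task proceeds."""
    hazards = recognizer_agent(mllm, image)
    return mitigation_agent(mllm, hazards) + task_plan
```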
Experiments
In our experiments, we used 30 kitchen environments and five task types in AI2Thor, involving object manipulation (move, stack, wash, heat, or cool), followed by placing the object at a final destination. We evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison.
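For readers unfamiliar with the simulator, the sketch below loads one of AI2-THOR's kitchen scenes (FloorPlan1 through FloorPlan30 are the 30 kitchens) and reads the ground-truth object metadata that the metadata condition refers to. It assumes the open-source `ai2thor` Python package; the hazard check is an illustrative heuristic, not the benchmark's logic.

```python
# Load an AI2-THOR kitchen scene and inspect ground-truth object metadata.
# Assumes the `ai2thor` package (pip install ai2thor) and a rendering setup
# as required by the simulator.
from ai2thor.controller import Controller

controller = Controller(scene="FloorPlan1")  # FloorPlan1-FloorPlan30 are kitchens
event = controller.step(action="Done")       # no-op action returning current state

# Each object dict carries state flags a planner could use to spot hazards,
# e.g. a stove burner that is toggled on. (Illustrative heuristic only.)
for obj in event.metadata["objects"]:
    if obj["objectType"] == "StoveBurner" and obj.get("isToggled"):
        print("Potential fire hazard:", obj["objectId"])

controller.stop()
```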
Results
Our experimental results show that while models perform well in QA tasks, they fail to effectively utilize their safety knowledge for actual behavior in embodied tasks. Even with metadata, open-weight models average below 20% accuracy in the remaining hazard categories, despite higher hazard-identification rates in QA tasks. The multi-agent framework, while slightly improving performance, does not fully resolve the alignment issue between recognition and mitigation.
Applications
The application scenarios of SafetyALFRED include the development and deployment of autonomous robotic systems, particularly in situations where identifying and mitigating safety hazards in real-world environments is required. By introducing SafetyALFRED, the research demonstrates the capability gap in multimodal LLMs for recognizing and mitigating safety hazards in real-world kitchen environments. This finding is significant for academia and industry, emphasizing the need for more comprehensive safety evaluation methods.
Limitations & Outlook
The low mitigation success rate in embodied tasks indicates difficulties in planning and executing corrective actions, especially without metadata. Despite good performance in QA tasks, models fail to effectively utilize their safety knowledge for actual behavior in embodied tasks. The multi-agent framework, while slightly improving performance, does not fully resolve the alignment issue between recognition and mitigation. Future research directions include developing more effective model architectures to improve risk mitigation capabilities of multimodal LLMs in embodied tasks.
Plain Language (Accessible to non-experts)
Imagine you're cooking in a kitchen, and there are many potential hazards like a fire on the stove, a puddle on the floor, and an open cabinet door. A multimodal large language model is like a smart assistant that can help you identify these hazards and tell you how to avoid them. However, while these models are good at recognizing hazards, they struggle when it comes to actually taking action to eliminate these hazards.
It's like seeing a phone fall into the sink and knowing it could lead to damage or other issues. The model can recognize this problem, but it might get confused about how to solve it. It might know that the phone should be removed from the sink, but it could face challenges in actually doing so.
This is like having a great plan but encountering obstacles during execution. Models need more information and better strategies to effectively solve these problems. Through continuous learning and improvement, these models can become smarter and more efficient, helping us manage safety issues better in our daily lives.
ELI14 (Explained like you're 14)
Hey there! Imagine you're cooking in the kitchen and suddenly notice the fire on the stove is too high, or there's a puddle on the floor that could make you slip. A multimodal large language model is like a super-smart assistant that can help you spot these problems and tell you how to fix them.
But, while these models are great at spotting issues, they sometimes struggle to actually solve them, just like when we get stuck on a level in a video game. They know there's a hazard, but they might get a little confused about what to do.
It's like learning a lot in school but sometimes facing tough questions in exams. These models also need to keep learning and practicing to do better in real-life situations.
So, in the future, these models will get smarter and smarter, helping us handle all sorts of problems in life, like an amazing helper!
Glossary
Multimodal Large Language Model
A multimodal large language model is an AI model capable of processing multiple input forms, such as text, images, and videos, and of translating natural language instructions into executable plans.
Used in the paper to evaluate the model's ability to recognize and mitigate safety hazards in kitchen environments.
Embodied Agent
An embodied agent refers to an autonomous system capable of performing tasks in physical or simulated environments. They can interact with the environment through perception and action.
Used in the paper to describe the model's ability to perform tasks in the AI2Thor environment.
ALFRED Benchmark
ALFRED is a benchmark for evaluating embodied instruction following capabilities, involving object manipulation and task completion.
SafetyALFRED builds on the ALFRED benchmark to evaluate safety planning capabilities.
Question Answering Task
A question answering task is a test that evaluates a model's ability to understand and answer questions, usually based on given text or images.
Used in the paper to evaluate the model's ability to recognize safety hazards.
Risk Mitigation
Risk mitigation refers to the process of identifying potential hazards and taking measures to reduce or eliminate the risk.
Used in the paper to evaluate the model's ability to actively address safety hazards in embodied tasks.
AI2Thor Environment
AI2Thor is an interactive 3D platform used to simulate home environments, commonly used for training and testing embodied agents.
Used in the paper to create experimental scenarios and tasks.
Metadata
In the context of this paper, metadata refers to ground-truth environment state information supplied by the simulator, such as object states and positions.
Used in the paper to provide additional environmental information to help the model recognize and mitigate hazards.
Multi-Agent Framework
A multi-agent framework is a system architecture involving the collaboration of multiple autonomous agents to complete complex tasks.
Used in the paper to separate the processes of hazard recognition and mitigation.
Alignment Gap
An alignment gap refers to the performance difference of a model across different tasks or settings, usually reflected in the gap between recognition and execution capabilities.
Used in the paper to describe the performance gap of models in QA and embodied tasks.
Static Evaluation
Static evaluation refers to the assessment of a model's capabilities without considering dynamic changes or interactions.
Used in the paper to describe the limitations of existing safety evaluation methods.
Open Questions (Unanswered questions from this research)
1. Although models perform well at hazard recognition, they fail to translate that safety knowledge into actual behavior in embodied tasks; how to convert abstract safety knowledge into concrete actions remains open.
2. The multi-agent framework, while slightly improving performance, does not fully resolve the alignment issue between recognition and mitigation; more effective architectures need to be explored.
3. The low mitigation success rate in embodied tasks without metadata indicates difficulties in planning and executing corrective actions; further research is needed to improve planning capabilities in complex environments.
4. Which model architectures would most effectively improve the risk-mitigation capabilities of multimodal LLMs in embodied tasks?
Applications
Immediate Applications
Home Robotics
Home robots can use the SafetyALFRED evaluation framework to improve their ability to recognize and mitigate safety hazards in home environments, enhancing safety and efficiency.
Industrial Automation
Industrial automation systems can apply SafetyALFRED's evaluation methods to identify and mitigate potential hazards on production lines, improving production safety.
Smart Home Systems
Smart home systems can likewise apply the SafetyALFRED evaluation framework to detect and respond to hazards in home environments, improving household safety and efficiency.
Long-term Vision
Autonomous Driving
Autonomous vehicles can use the SafetyALFRED evaluation framework to improve their ability to recognize and mitigate safety hazards in complex traffic environments, enhancing driving safety.
Smart Cities
Smart city systems can apply SafetyALFRED's evaluation methods to identify and mitigate potential hazards in urban environments, improving city safety and livability.
Abstract
Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for physical safety, thus we advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open-source our code and dataset under https://github.com/sled-group/SafetyALFRED.git
References (20)
ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
Mohit Shridhar, Jesse Thomason, Daniel Gordon et al.
Can AI Perceive Physical Danger and Intervene?
Abhishek Jindal, Dmitry Kalashnikov, Oscar Chang et al.
Work-related injuries and illnesses among kitchen workers at two major students’ hostels
Ghada O. Wassif, Abeer Abdelsalam, W. Eldin et al.
Food Safety in Home Kitchens: A Synthesis of the Literature
C. Byrd-Bredbenner, J. Berning, Jennifer Martin-Biggers et al.
PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
Jiaming Ji, Donghai Hong, Borong Zhang et al.
Plug in the Safety Chip: Enforcing Constraints for LLM-driven Robot Agents
Ziyi Yang, S. S. Raman, Ankit Shah et al.
HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments
Qinhong Zhou, Sunli Chen, Yisong Wang et al.
IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks
Xiaoya Lu, Zeren Chen, Xuhao Hu et al.
Sim-to-Real Transfer in Robotics: Addressing the Gap between Simulation and Real-World Performance
N. Chukwurah, A. Adebayo, O. Ajayi
SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents
Sheng Yin, Xianghe Pang, Yuanzhuo Ding et al.
Generating Robot Constitutions & Benchmarks for Semantic Safety
P. Sermanet, Anirudha Majumdar, A. Irpan et al.
Ignore Previous Prompt: Attack Techniques For Language Models
Fábio Perez, I. Ribeiro
AI2-THOR: An Interactive 3D Environment for Visual AI
Eric Kolve, Roozbeh Mottaghi, Winson Han et al.
A Framework for Benchmarking and Aligning Task-Planning Safety in LLM-Based Embodied Agents
Yuting Huang, Leilei Ding, Zhipeng Tang et al.
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
Zeming Wei, Yifei Wang, Yisen Wang
Safety Control of Service Robots with LLMs and Embodied Knowledge Graphs
Yong Qi, Gabriel Kyebambo, Siyuan Xie et al.
Safe Planner: Empowering Safety Awareness in Large Pre-Trained Models for Robot Task Planning
Siyuan Li, Zhe Ma, Feifan Liu et al.
SafeMind: Benchmarking and Mitigating Safety Risks in Embodied LLM Agents
Ruolin Chen, Yinqian Sun, Jihang Wang et al.
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
Adina Williams, Nikita Nangia, Samuel R. Bowman
Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: a Survey
Wenshuai Zhao, J. P. Queralta, Tomi Westerlund