SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

TL;DR

SafetyALFRED evaluates the safety-conscious planning of multimodal LLMs in kitchen settings, finding strong hazard recognition but low risk-mitigation success.

cs.AI · Advanced · 2026-04-22
Josue Torres-Fonseca, Naihao Deng, Yinpei Dai, Shane Storks, Yichi Zhang, Rada Mihalcea, Casey Kennington, Joyce Chai
multimodal LLMs · safety planning · risk mitigation · ALFRED · kitchen environment

Key Findings

Methodology

SafetyALFRED builds on the ALFRED benchmark, extending it with six categories of real-world kitchen hazards. The study evaluates 11 state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition but also active risk mitigation through embodied planning. The experiments reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates are low in comparison.

Key Results

  • In QA tasks, models can recognize safety hazards with 92% average accuracy, but in embodied tasks, even with ground-truth environment state information, the average mitigation success rate is only 60%.
  • Even with metadata, open-weight models average below 20% mitigation accuracy across most hazard categories, despite their high hazard-identification rates in QA tasks.
  • Without metadata, models struggle to exceed 20% accuracy in most categories; only fire hazard, unsanitary, and spoilage fare better, reaching over 29%, over 35%, and nearly 100% accuracy, respectively, with closed-weight models.
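
The alignment gap behind these numbers can be made concrete with a small calculation. The sketch below is a minimal Python illustration; the per-episode records are hypothetical, and only the headline averages (92% QA accuracy, 60% mitigation success with ground-truth state) come from the paper:

```python
# Minimal sketch: computing the QA-vs-embodied "alignment gap".
# The individual records below are invented for illustration; only
# the headline averages (92% QA accuracy, 60% mitigation success)
# come from the paper.

def qa_accuracy(answers):
    """Fraction of QA hazard questions answered correctly (1 = correct)."""
    return sum(answers) / len(answers)

def mitigation_success_rate(episodes):
    """Fraction of embodied episodes where the hazard was rectified."""
    return sum(ep["hazard_mitigated"] for ep in episodes) / len(episodes)

qa = qa_accuracy([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0])          # 11/12 ~ 92%
episodes = [{"hazard_mitigated": m} for m in [1, 1, 1, 0, 0]]     # 3/5 = 60%
gap = qa - mitigation_success_rate(episodes)
print(f"QA: {qa:.0%}, mitigation: {mitigation_success_rate(episodes):.0%}, gap: {gap:.0%}")
```

The gap is simply the difference between what the model knows (QA accuracy) and what it does (mitigation success).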

Significance

The study highlights the inadequacy of existing QA evaluations for physical safety, advocating a shift towards benchmarks that prioritize corrective actions in embodied contexts. By introducing SafetyALFRED, the research demonstrates the capability gap in multimodal LLMs for recognizing and mitigating safety hazards in real-world kitchen environments. This finding is significant for academia and industry, particularly in the development and deployment of autonomous robotic systems, emphasizing the need for more comprehensive safety evaluation methods.

Technical Contribution

SafetyALFRED provides a new evaluation framework by extending the ALFRED benchmark to include six categories of kitchen hazards. The technical contribution of the study lies in revealing the performance gap of multimodal LLMs between static QA and dynamic embodied tasks, proposing a multi-agent framework to decouple hazard recognition from mitigation, although this approach only slightly improves performance.

Novelty

SafetyALFRED is the first study to extend the safety evaluation of multimodal LLMs from static QA to embodied contexts. Unlike existing benchmarks such as ASIMOV and MM-SafetyBench, SafetyALFRED not only focuses on hazard recognition but also emphasizes active risk mitigation, filling a critical gap in current research.

Limitations

  • The low mitigation success rate in embodied tasks indicates difficulties in planning and executing corrective actions, especially without metadata.
  • Despite good performance in QA tasks, models fail to effectively utilize their safety knowledge for actual behavior in embodied tasks.
  • The multi-agent framework, while slightly improving performance, does not fully resolve the alignment issue between recognition and mitigation.

Future Work

Future research directions include developing more effective model architectures to improve risk mitigation capabilities of multimodal LLMs in embodied tasks. Additionally, studies can explore more complex environments and tasks to further test and enhance the safety planning capabilities of models. The community can also focus on better translating abstract safety knowledge into concrete actions.

AI Executive Summary

Multimodal large language models (MLLMs) are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. Existing safety evaluations primarily focus on hazard recognition through disembodied question answering (QA) settings, neglecting the ability to actively mitigate risks in embodied contexts.

To address this issue, we introduce SafetyALFRED, an extension of the ALFRED benchmark augmented with six categories of real-world kitchen hazards. We evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison.

This finding demonstrates that static evaluations through QA are insufficient for physical safety, thus we advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open-source our code and dataset to facilitate further research and development.

In our experiments, we used 30 kitchen environments and five task types in AI2Thor, involving object manipulation (move, stack, wash, heat, or cool), followed by placing the object at a final destination. We found that while models perform well in QA tasks, they fail to effectively utilize their safety knowledge for actual behavior in embodied tasks.

Future research directions include developing more effective model architectures to improve risk mitigation capabilities of multimodal LLMs in embodied tasks. Additionally, studies can explore more complex environments and tasks to further test and enhance the safety planning capabilities of models. The community can also focus on better translating abstract safety knowledge into concrete actions.

Deep Analysis

Background

Multimodal large language models (MLLMs) have demonstrated remarkable reasoning and decision-making capabilities, leading to their widespread adoption as autonomous embodied agents in both simulated and physical interactive environments. They can translate high-level natural language instructions into executable plans. However, as MLLMs transition into these roles, a major concern is their ability to identify and proactively resolve safety hazards, i.e., observable environmental states that, if left uncorrected, pose risks of physical injury, property damage, or resource loss. Despite this need, prior safety benchmarks like ASIMOV, Multimodal Situational Safety, and MM-SafetyBench have largely focused on the recognition of hazards through question-answering (QA) tasks based on static images, videos, or scenarios. A critical gap remains in evaluating an agent’s ability to not only recognize safety hazards but also generate plans that mitigate them in a dynamic embodied setting.

Core Problem

Multimodal large language models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. Existing safety evaluations primarily focus on hazard recognition through disembodied question answering (QA) settings, neglecting the ability to actively mitigate risks in embodied contexts. To evaluate whether MLLMs can translate safety knowledge acquired from web-scale pre-training into concrete behavior, we formulate a new safety problem. Given a task instruction and a multimodal observation, the model must advance the assigned task while proactively generating a plan to rectify hazards that could cause immediate or future harm.

Innovation

SafetyALFRED is the first study to extend the safety evaluation of multimodal LLMs from static QA to embodied contexts. Unlike existing benchmarks such as ASIMOV and MM-SafetyBench, SafetyALFRED not only focuses on hazard recognition but also emphasizes active risk mitigation, filling a critical gap in current research. We introduce an extension of the ALFRED benchmark for embodied instruction following, augmented with six carefully selected safety hazards that represent real-world risks in common kitchen settings. Using SafetyALFRED, we evaluate eleven MLLMs in two settings: (1) where the agent acts as a safety judge and identifies hazards in the scene; and (2) an embodied task where the agent completes the assigned task while immediately mitigating any safety hazards.
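
The two evaluation settings can be sketched as two prompt templates: one for the safety-judge QA setting and one for the embodied planning setting. The prompt wording here is invented for illustration; the paper's actual prompts may differ:

```python
# Hedged sketch of the two evaluation settings. Prompt text is an
# illustrative assumption, not the paper's actual wording.

def qa_judge_prompt(scene_description: str) -> str:
    """Setting 1: the model acts as a safety judge over a static scene."""
    return (f"Scene: {scene_description}\n"
            "Question: List any safety hazards present in this scene.")

def embodied_prompt(scene_description: str, instruction: str) -> str:
    """Setting 2: the model must plan the task AND mitigate hazards."""
    return (f"Scene: {scene_description}\n"
            f"Task: {instruction}\n"
            "Produce a step-by-step plan. Rectify any safety hazard "
            "before or while completing the task.")
```

The first setting tests recognition only; the second requires the model to act on that recognition, which is where the paper finds performance drops.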

Methodology

  • SafetyALFRED builds on the ALFRED benchmark, extending it with six categories of real-world kitchen hazards.
  • The study evaluates 11 state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition but also active risk mitigation through embodied planning.
  • The experiments reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates are low in comparison.
  • We propose a multi-agent framework to decouple hazard recognition from mitigation, although this approach only slightly improves performance.
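
The multi-agent decoupling in the last point can be sketched as a two-stage pipeline: one agent names hazards, a second maps them to corrective actions. The toy lookup rules below are illustrative assumptions, not the paper's models:

```python
# Sketch of decoupling hazard recognition from mitigation, as in the
# multi-agent framework described above. The rule tables are toy
# stand-ins for the actual MLLM agents.

def recognizer(scene: dict) -> list[str]:
    """Agent 1: name the hazards present in the scene state."""
    rules = {"stove_on_unattended": "fire hazard",
             "raw_meat_on_counter": "spoilage",
             "knife_on_edge": "sharp object"}
    return [label for state, label in rules.items() if scene.get(state)]

def mitigator(hazards: list[str]) -> list[str]:
    """Agent 2: turn each recognized hazard into a corrective action."""
    fixes = {"fire hazard": "TurnOff StoveBurner",
             "spoilage": "Put Meat Fridge",
             "sharp object": "Put Knife Drawer"}
    return [fixes[h] for h in hazards if h in fixes]

scene = {"stove_on_unattended": True, "knife_on_edge": True}
print(mitigator(recognizer(scene)))  # ['TurnOff StoveBurner', 'Put Knife Drawer']
```

The separation makes failures attributable: with real models, an unmitigated hazard can be traced to either a missed recognition or a bad corrective plan.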

Experiments

In our experiments, we used 30 kitchen environments and five task types in AI2Thor, involving object manipulation (move, stack, wash, heat, or cool), followed by placing the object at a final destination. We evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison.
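
The experimental grid described above (30 kitchens, five task types, each ending with a placement step) can be sketched as follows. Scene names follow AI2-THOR's FloorPlan convention for its kitchen scenes; the paper's exact episode construction may differ:

```python
# Sketch of the experimental grid: 30 AI2-THOR kitchen scenes x five
# manipulation task types, each followed by placing the object at a
# final destination. Episode layout is an assumption for illustration.

TASK_TYPES = ["move", "stack", "wash", "heat", "cool"]
KITCHENS = [f"FloorPlan{i}" for i in range(1, 31)]  # AI2-THOR kitchens 1-30

def build_episodes():
    """One (scene, task) episode per combination: 30 x 5 = 150."""
    return [{"scene": scene,
             "task": task,
             "final_step": "place object at destination"}
            for scene in KITCHENS for task in TASK_TYPES]

episodes = build_episodes()
print(len(episodes))  # 150
```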

Results

Our experimental results show that while models perform well in QA tasks, they fail to effectively utilize their safety knowledge for actual behavior in embodied tasks. Even with metadata, open-weight models achieve less than 20% accuracy on average in all other categories, despite higher hazard identification rates in QA tasks. The multi-agent framework, while slightly improving performance, does not fully resolve the alignment issue between recognition and mitigation.

Applications

The application scenarios of SafetyALFRED include the development and deployment of autonomous robotic systems, particularly in situations where identifying and mitigating safety hazards in real-world environments is required. By introducing SafetyALFRED, the research demonstrates the capability gap in multimodal LLMs for recognizing and mitigating safety hazards in real-world kitchen environments. This finding is significant for academia and industry, emphasizing the need for more comprehensive safety evaluation methods.

Limitations & Outlook

The low mitigation success rate in embodied tasks indicates difficulties in planning and executing corrective actions, especially without metadata. Despite good performance in QA tasks, models fail to effectively utilize their safety knowledge for actual behavior in embodied tasks. The multi-agent framework, while slightly improving performance, does not fully resolve the alignment issue between recognition and mitigation. Future research directions include developing more effective model architectures to improve risk mitigation capabilities of multimodal LLMs in embodied tasks.

Plain Language (accessible to non-experts)

Imagine you're cooking in a kitchen, and there are many potential hazards like a fire on the stove, a puddle on the floor, and an open cabinet door. A multimodal large language model is like a smart assistant that can help you identify these hazards and tell you how to avoid them. However, while these models are good at recognizing hazards, they struggle when it comes to actually taking action to eliminate these hazards.

It's like seeing a phone fall into the sink and knowing it could lead to damage or other issues. The model can recognize this problem, but it might get confused about how to solve it. It might know that the phone should be removed from the sink, but it could face challenges in actually doing so.

This is like having a great plan but encountering obstacles during execution. Models need more information and better strategies to effectively solve these problems. Through continuous learning and improvement, these models can become smarter and more efficient, helping us manage safety issues better in our daily lives.

ELI14 (explained like you're 14)

Hey there! Imagine you're cooking in the kitchen and suddenly notice the fire on the stove is too high, or there's a puddle on the floor that could make you slip. A multimodal large language model is like a super-smart assistant that can help you spot these problems and tell you how to fix them.

But, while these models are great at spotting issues, they sometimes struggle to actually solve them, just like when we get stuck on a level in a video game. They know there's a hazard, but they might get a little confused about what to do.

It's like learning a lot in school but sometimes facing tough questions in exams. These models also need to keep learning and practicing to do better in real-life situations.

So, in the future, these models will get smarter and smarter, helping us handle all sorts of problems in life, like an amazing helper!

Glossary

Multimodal Large Language Model

A multimodal large language model is an AI model capable of processing multiple input modalities, such as text, images, and video. It can translate natural language instructions into executable plans.

Used in the paper to evaluate the model's ability to recognize and mitigate safety hazards in kitchen environments.

Embodied Agent

An embodied agent is an autonomous system that performs tasks in a physical or simulated environment, interacting with it through perception and action.

Used in the paper to describe the model's ability to perform tasks in the AI2Thor environment.

ALFRED Benchmark

ALFRED is a benchmark for evaluating embodied instruction following capabilities, involving object manipulation and task completion.

SafetyALFRED builds on the ALFRED benchmark to evaluate safety planning capabilities.

Question Answering Task

A question answering task is a test that evaluates a model's ability to understand and answer questions, usually based on given text or images.

Used in the paper to evaluate the model's ability to recognize safety hazards.

Risk Mitigation

Risk mitigation refers to the process of identifying potential hazards and taking measures to reduce or eliminate the risk.

Used in the paper to evaluate the model's ability to actively address safety hazards in embodied tasks.

AI2Thor Environment

AI2Thor is an interactive 3D platform used to simulate home environments, commonly used for training and testing embodied agents.

Used in the paper to create experimental scenarios and tasks.

Metadata

Metadata is structured data describing other data; in this paper, it refers to ground-truth information about the environment's state, supplied alongside the visual observation.

Used in the paper to provide additional environmental information to help the model recognize and mitigate hazards.

Multi-Agent Framework

A multi-agent framework is a system architecture involving the collaboration of multiple autonomous agents to complete complex tasks.

Used in the paper to separate the processes of hazard recognition and mitigation.

Alignment Gap

An alignment gap refers to the performance difference of a model across different tasks or settings, usually reflected in the gap between recognition and execution capabilities.

Used in the paper to describe the performance gap of models in QA and embodied tasks.

Static Evaluation

Static evaluation refers to the assessment of a model's capabilities without considering dynamic changes or interactions.

Used in the paper to describe the limitations of existing safety evaluation methods.

Open Questions (unanswered questions from this research)

  1. Models recognize hazards accurately in QA settings yet fail to act on that knowledge in embodied tasks. How can abstract safety knowledge be translated into concrete corrective actions?
  2. The multi-agent framework only slightly improves performance and does not fully close the recognition-mitigation alignment gap. What architectures could?
  3. Without metadata, mitigation success rates are low, indicating difficulties in planning and executing corrective actions. How can planning in complex environments be improved?
  4. Which model architectures would most improve the risk-mitigation capabilities of multimodal LLMs in embodied tasks?

Applications

Immediate Applications

Home Robotics

Home robots can use the SafetyALFRED evaluation framework to improve their ability to recognize and mitigate safety hazards in home environments, enhancing safety and efficiency.

Industrial Automation

Industrial automation systems can apply SafetyALFRED's evaluation methods to identify and mitigate potential hazards on production lines, improving production safety.

Smart Home Systems

Smart home systems can use the SafetyALFRED evaluation framework to improve their ability to recognize and mitigate safety hazards in home environments, enhancing safety and efficiency.

Long-term Vision

Autonomous Driving

Autonomous vehicles can use the SafetyALFRED evaluation framework to improve their ability to recognize and mitigate safety hazards in complex traffic environments, enhancing driving safety.

Smart Cities

Smart city systems can apply SafetyALFRED's evaluation methods to identify and mitigate potential hazards in urban environments, improving city safety and livability.

Abstract

Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for physical safety, thus we advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open-source our code and dataset under https://github.com/sled-group/SafetyALFRED.git

cs.AI cs.CL cs.RO

References (20)

1. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks. Mohit Shridhar, Jesse Thomason, Daniel Gordon et al., 2019. 1028 citations.
2. Can AI Perceive Physical Danger and Intervene? Abhishek Jindal, Dmitry Kalashnikov, Oscar Chang et al., 2025. 6 citations.
3. Work-related injuries and illnesses among kitchen workers at two major students' hostels. Ghada O. Wassif, Abeer Abdelsalam, W. Eldin et al., 2024. 4 citations.
4. Food Safety in Home Kitchens: A Synthesis of the Literature. C. Byrd-Bredbenner, J. Berning, Jennifer Martin-Biggers et al., 2013. 252 citations.
5. PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference. Jiaming Ji, Donghai Hong, Borong Zhang et al., 2024. 160 citations.
6. Plug in the Safety Chip: Enforcing Constraints for LLM-driven Robot Agents. Ziyi Yang, S. S. Raman, Ankit Shah et al., 2023. 92 citations.
7. HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments. Qinhong Zhou, Sunli Chen, Yisong Wang et al., 2024. 33 citations.
8. IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks. Xiaoya Lu, Zeren Chen, Xuhao Hu et al., 2025. 20 citations.
9. Sim-to-Real Transfer in Robotics: Addressing the Gap between Simulation and Real-World Performance. N. Chukwurah, A. Adebayo, O. Ajayi, 2024. 20 citations.
10. SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents. Sheng Yin, Xianghe Pang, Yuanzhuo Ding et al., 2024. 71 citations.
11. Generating Robot Constitutions & Benchmarks for Semantic Safety. P. Sermanet, Anirudha Majumdar, A. Irpan et al., 2025. 18 citations.
12. Ignore Previous Prompt: Attack Techniques For Language Models. Fábio Perez, I. Ribeiro, 2022. 765 citations.
13. AI2-THOR: An Interactive 3D Environment for Visual AI. Eric Kolve, Roozbeh Mottaghi, Winson Han et al., 2017. 1387 citations.
14. A Framework for Benchmarking and Aligning Task-Planning Safety in LLM-Based Embodied Agents. Yuting Huang, Leilei Ding, Zhipeng Tang et al., 2025. 21 citations.
15. Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations. Zeming Wei, Yifei Wang, Yisen Wang, 2023. 455 citations.
16. Safety Control of Service Robots with LLMs and Embodied Knowledge Graphs. Yong Qi, Gabriel Kyebambo, Siyuan Xie et al., 2024. 8 citations.
17. Safe Planner: Empowering Safety Awareness in Large Pre-Trained Models for Robot Task Planning. Siyuan Li, Zhe Ma, Feifan Liu et al., 2024. 10 citations.
18. SafeMind: Benchmarking and Mitigating Safety Risks in Embodied LLM Agents. Ruolin Chen, Yinqian Sun, Jihang Wang et al., 2025. 2 citations.
19. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. Adina Williams, Nikita Nangia, Samuel R. Bowman, 2017. 4979 citations.
20. Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: a Survey. Wenshuai Zhao, J. P. Queralta, Tomi Westerlund, 2020. 976 citations.