Metadata-Aware Multi-Prompt Reasoning for Zero-Shot Accident Understanding

TL;DR

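This paper introduces a metadata-aware multi-prompt reasoning framework for zero-shot accident understanding. It reaches a harmonic-mean score of 0.4015 on the CVPR 2026 zero-shot ACCIDENT benchmark, a relative improvement of about 15% over the center-of-frame baseline (0.3487).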

cs.CV Advanced 2026-06-10

Tarandeep Singh, Soumyanetra Pal, Soham Biswas, Nishanth Chandran

Computer Vision Multimodal Learning Zero-Shot Reasoning Video Understanding Accident Detection

Key Findings

Methodology

The proposed approach decomposes the accident understanding task into three stages: temporal localization, semantic classification, and spatial grounding. The first stage employs vision-language similarity combined with motion cues via Meta’s Perception Encoder (PE) to identify a short temporal window around the impact event. The second stage utilizes a structured multi-prompt scheme with five complementary prompts (baseline, motion, geometry, contrast, and tiebreaker) fed into Qwen-3.5-VL 9B, with an entropy-gated voting mechanism to resolve conflicts. The third stage localizes the impact region using an open-vocabulary detector (OWL-v2), conditioned on the predicted accident type and scene layout, and aggregates detections across keyframes through a score-weighted centroid. This pipeline leverages rich metadata and multiple reasoning views to enhance robustness and interpretability.
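The three-stage decomposition described above can be pictured as three independent functions chained together. The following is only a structural sketch: the names `AccidentPrediction`, `run_pipeline`, `localize_time`, `classify_type`, and `ground_location` are hypothetical and do not come from the paper, and the stage implementations are passed in as callables rather than tied to any particular model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class AccidentPrediction:
    time_s: float                      # when: estimated impact time in seconds
    accident_type: str                 # what: predicted accident label
    location_xy: Tuple[float, float]   # where: (x, y) position in the frame


def run_pipeline(
    frames: Sequence[object],
    metadata: dict,
    localize_time: Callable[[Sequence[object]], Tuple[float, Sequence[object]]],
    classify_type: Callable[[Sequence[object], dict], str],
    ground_location: Callable[[Sequence[object], str, dict], Tuple[float, float]],
) -> AccidentPrediction:
    """Chain the three stages; each stage is a hypothetical callable."""
    # Stage 1 (when): estimate the impact time and keep the keyframes around it.
    time_s, keyframes = localize_time(frames)
    # Stage 2 (what): classify from the keyframes plus scene metadata.
    accident_type = classify_type(keyframes, metadata)
    # Stage 3 (where): ground the impact, conditioned on the predicted type.
    location_xy = ground_location(keyframes, accident_type, metadata)
    return AccidentPrediction(time_s, accident_type, location_xy)
```

Keeping the stages as separate callables mirrors the modularity the summary emphasizes: any one stage could be swapped out without touching the other two.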

Key Results

On the CVPR 2026 zero-shot ACCIDENT benchmark, the method achieved a harmonic mean score of 0.4015, surpassing the center-of-frame baseline score of 0.3487 by over 15%. The spatial localization component contributed the most, improving the score by 0.053, followed by temporal localization (0.039) and multi-prompt classification (0.0054). The approach demonstrated superior performance across all three tasks—time, type, and location—especially under challenging conditions such as low resolution, occlusion, and adverse weather.
Ablation studies confirmed that expanding the temporal window by ±2 seconds improved accuracy, and integrating multiple prompts with adjudication further enhanced robustness. The spatial localization, conditioned on accident type and scene context, yielded the largest performance boost, validating the effectiveness of task decomposition. The multi-prompt voting and pairwise adjudication mechanisms effectively reduced ambiguity and misclassification, especially in complex traffic scenarios.
The experimental results indicate that explicitly breaking down the accident understanding into temporal, semantic, and spatial components, combined with rich multimodal cues, significantly outperforms end-to-end single-prompt models, setting a new state-of-the-art in zero-shot accident detection.
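The "over 15%" figure above is a relative gain, and it can be checked directly from the two reported scores. The short sketch below does that arithmetic. It also illustrates how a harmonic mean behaves, under the assumption (not confirmed by this summary) that the benchmark score is the harmonic mean of the time, type, and location sub-scores; the sub-score values used are made up for illustration only.

```python
from statistics import harmonic_mean


def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`."""
    return (new - base) / base


baseline = 0.3487   # center-of-frame baseline, as reported
proposed = 0.4015   # full pipeline, as reported

print(f"absolute gain: {proposed - baseline:.4f}")               # 0.0528
print(f"relative gain: {relative_gain(proposed, baseline):.1%}")  # 15.1%

# Hypothetical sub-scores (NOT from the paper): one weak task pulls the
# harmonic mean well below the arithmetic mean.
sub_scores = [0.9, 0.9, 0.1]
hm = harmonic_mean(sub_scores)
am = sum(sub_scores) / len(sub_scores)
print(f"harmonic mean {hm:.3f} < arithmetic mean {am:.3f}")
```

The per-component figures quoted above (0.053, 0.039, and 0.0054) sum to 0.0974, which is more than the 0.0528 overall gain. Ablation deltas measured on a harmonic mean are not generally additive, so the two sets of numbers need not match.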

Significance

This research advances the field of zero-shot multimodal understanding by demonstrating that task decomposition—dividing the complex accident scene analysis into localized temporal detection, semantic classification, and spatial grounding—can substantially improve performance. The framework leverages large vision-language models and open-vocabulary detectors, enabling generalization to unseen accident types without fine-tuning. Its practical impact lies in enhancing real-time traffic monitoring, autonomous driving safety, and emergency response systems, especially in scenarios lacking labeled training data. The approach addresses long-standing challenges of robustness, interpretability, and scalability in accident understanding, paving the way for more reliable intelligent surveillance solutions.

Technical Contribution

The paper’s main technical contributions include: 1) a novel temporal localization method combining vision-language similarity with motion cues via Meta’s PE, effectively narrowing down the impact window; 2) a structured multi-prompt classification scheme with five complementary prompts, improving semantic understanding and robustness; 3) a type- and scene-conditioned spatial localization strategy utilizing OWL-v2, which grounds the accident impact region conditioned on predicted accident type and scene context; 4) an entropy-gated voting and pairwise adjudication mechanism that resolves conflicts among multiple prompts, reducing ambiguity and increasing reliability. These innovations collectively enhance the zero-shot accident understanding pipeline, making it more accurate, interpretable, and adaptable.

Novelty

This work is pioneering in integrating multi-view structured prompts with metadata-conditioned spatial localization within a unified zero-shot framework for accident understanding. Unlike prior approaches that rely solely on end-to-end models or single prompts, this method decomposes the task into interpretable subcomponents, each optimized independently. The combination of vision-language similarity for temporal detection, multi-prompt voting with adjudication, and open-vocabulary spatial grounding conditioned on accident type and scene context, represents a significant departure from existing methods, establishing a new paradigm for robust, explainable accident analysis.

Limitations

The model’s performance degrades under adverse weather conditions such as heavy rain or fog, primarily due to reduced visibility affecting detection and localization accuracy.
The spatial localization relies heavily on OWL-v2, which may miss detections or produce false positives in cluttered or occluded scenes, limiting precision.
The current pipeline is computationally intensive, requiring high-end GPUs and significant inference time, which may hinder real-time deployment in resource-constrained environments.
Handling multiple simultaneous impacts or highly dynamic scenes remains challenging, indicating the need for more sophisticated temporal modeling.
The approach depends on accurate scene metadata; inaccuracies or missing metadata could impair localization and classification performance.

Future Work

Future research will explore integrating additional sensor modalities such as LiDAR and radar to improve robustness in poor visibility conditions. Enhancing temporal modeling with transformer-based architectures could better capture complex impact sequences. Developing lightweight models and optimizing inference pipelines will be crucial for real-time deployment. Moreover, expanding the dataset to include more diverse accident scenarios and adverse conditions will help improve generalization. Investigating unsupervised or semi-supervised learning strategies to reduce reliance on metadata accuracy is also a promising direction.

AI Executive Summary

Understanding accidents from surveillance videos is a critical challenge with profound implications for traffic safety, autonomous driving, and emergency response. Traditional methods often struggle in real-world scenarios due to the complexity and variability of accident scenes, especially under zero-shot conditions where labeled data is scarce or unavailable. Existing vision-language models like CLIP have demonstrated remarkable zero-shot classification capabilities, but their direct application to accident understanding remains limited by the complexity of the task.

This paper introduces a novel three-stage pipeline that decomposes the accident understanding process into temporal localization, semantic classification, and spatial grounding. The first stage employs a contrastive vision-language similarity measure, combined with motion cues from Meta’s Perception Encoder, to identify a short window of frames likely containing the impact event. This step effectively narrows down the temporal search space, focusing computational resources on relevant segments.
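A minimal sketch of this kind of temporal localization follows. It assumes per-frame image embeddings and one text embedding are already available (for example from a vision-language encoder), and that a per-frame motion magnitude has been computed separately. The weighted-sum fusion, the `alpha` weight, and the function names are assumptions made for illustration; the summary does not say how the two cues are actually combined. The default half-width of 2 seconds follows the ±2 second window mentioned in the ablations.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant sequence maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def impact_window(
    frame_embs: Sequence[Sequence[float]],
    text_emb: Sequence[float],
    motion: Sequence[float],
    fps: float,
    alpha: float = 0.5,          # assumed fusion weight, not from the paper
    half_width_s: float = 2.0,   # window half-width in seconds
) -> Tuple[int, int, int]:
    """Return (peak, start, end) frame indices; `end` is inclusive."""
    if not frame_embs or len(frame_embs) != len(motion):
        raise ValueError("need one motion value per frame embedding")
    sims = min_max([cosine(e, text_emb) for e in frame_embs])
    mot = min_max(motion)
    # Assumed fusion: convex combination of the two normalized cues.
    scores = [alpha * s + (1 - alpha) * m for s, m in zip(sims, mot)]
    peak = max(range(len(scores)), key=scores.__getitem__)
    half = int(round(half_width_s * fps))
    return peak, max(0, peak - half), min(len(scores) - 1, peak + half)
```

Only the frames between `start` and `end` would then be passed on to the later stages, which is how the search space gets narrowed.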

In the second stage, the authors leverage a structured multi-prompt scheme with five complementary prompts—baseline, motion, geometry, contrast, and tiebreaker—to classify the accident type. Each prompt emphasizes different scene attributes, and their predictions are aggregated through an entropy-gated voting mechanism that resolves conflicts and ambiguities. This multi-view reasoning enhances robustness, especially in ambiguous or conflicting scenarios.
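One way to picture entropy-gated voting is sketched below. The summary does not give the exact rule, so this is an assumed variant: compute the Shannon entropy of the vote distribution, accept the majority label when the entropy is below a threshold and the leader is strictly ahead, and otherwise hand the top two candidates to an adjudicator. The threshold `max_entropy_bits`, the `adjudicate` callback (which could, for instance, wrap the tiebreaker prompt), and the function names are all hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict


def vote_entropy(counts: Counter) -> float:
    """Shannon entropy (in bits) of the empirical vote distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_gated_vote(
    predictions: Dict[str, str],              # prompt name -> predicted label
    adjudicate: Callable[[str, str], str],    # hypothetical pairwise adjudicator
    max_entropy_bits: float = 1.0,            # assumed gate threshold
) -> str:
    """Majority vote, falling back to pairwise adjudication when uncertain."""
    votes = Counter(predictions.values())
    ranked = votes.most_common(2)
    top_label, top_count = ranked[0]
    if len(ranked) == 1:
        return top_label  # unanimous: nothing to resolve
    runner_up, runner_count = ranked[1]
    confident = (
        vote_entropy(votes) <= max_entropy_bits and top_count > runner_count
    )
    # Low entropy and a clear leader: trust the vote. Otherwise adjudicate
    # between the two strongest candidates only.
    return top_label if confident else adjudicate(top_label, runner_up)
```

For example, three votes for one label and one for another give an entropy of about 0.81 bits, so the majority is accepted under the assumed 1.0-bit gate; a 2-1-1 split gives 1.5 bits and is sent to the adjudicator.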

The final stage localizes the impact region using an open-vocabulary detector (OWL-v2), conditioned on the predicted accident type and scene layout metadata. By pooling detections across multiple keyframes and computing a score-weighted centroid, the system produces a precise spatial localization of the impact point. This task decomposition and multi-modal fusion significantly outperform traditional single-prompt models.

Experimental results on the CVPR 2026 zero-shot ACCIDENT benchmark demonstrate a substantial performance boost, with the harmonic mean score reaching 0.4015, a notable improvement over baseline methods. Ablation studies confirm that each component—temporal window expansion, multi-prompt voting, and type-conditioned localization—contributes meaningfully to the overall performance.

This work marks a significant step forward in zero-shot accident understanding, addressing key challenges in robustness, interpretability, and generalization. Its modular design allows for easy upgrades and integration with future sensor modalities and modeling techniques. The approach’s ability to operate without fine-tuning and its reliance on open models make it highly practical for deployment in real-world surveillance and autonomous systems.

Looking ahead, future research will focus on improving robustness under adverse weather, reducing computational costs for real-time applications, and expanding datasets to cover more diverse accident scenarios. Overall, this framework opens new avenues for intelligent traffic monitoring and accident analysis, promising safer roads and smarter cities.

Deep Dive

Plain Language Accessible to non-experts

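Imagine you work in a factory and have to make sure the machines keep running properly every day. The factory has many different sensors and surveillance cameras watching the state of each machine. Sometimes a machine suddenly fails: it makes a strange noise, or it stops working. To find the problem, you would not rely on a single camera view. You would use several tools and methods to analyze it, such as observing the machine's motion, its contact points, and its angles, or even using a special magnifying glass to inspect a specific part.

This system is like the factory's smart assistant. It first finds the stretch of time in which something may have gone wrong, as if using a special "time sieve" to filter out the likely moment of failure. Then it uses five different "observers", each focused on a different clue such as motion, geometric relationships, or contrast, to judge the type of accident. Each observer gives its own conclusion, and the system votes to decide which one is most reliable. If the observers cannot agree, it calls on a final "referee" to help make the decision.

Finally, the system marks the exact location of the fault on the factory floor plan, like a radar sweep picking out the trouble spot. This way, even without being told the specific type of fault in advance, it can find the problem by analyzing it from several angles and with several sources of information. This multi-tool approach is more reliable and smarter than any single tool. As the technology develops, it could become faster and more accurate, helping factories run more safely and efficiently.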

ELI14 Explained like you're 14

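Imagine you are playing a detective game at school, and the teacher asks you to use different clues to work out what happened, where it happened, and when it happened. You have five little helpers: the first looks at timing, the second observes the scene, the third pays attention to angles, the fourth uses elimination, and the fifth helps decide when nobody is sure.

These helpers are like the five "prompt" tools in the paper. Each one focuses on a different clue, such as the vehicles' motion, the angle of the collision, or the point of contact. Each tool gives its own answer, and then the system votes to decide which answer is most reliable. If everyone is hesitant, the final "referee" helps make the decision.

Finally, the system marks the exact spot where the accident happened on the traffic map, like using a magnifying glass to find the trouble spot. This multi-tool approach is much more reliable than judging with one tool alone. It is like working with friends on a hard puzzle: each person looks at the problem from a different angle, and then you decide on the answer together. The method is clever because it can find the time, place, and type of an accident on its own, without being told the answer in advance. In the future, systems like this could make traffic monitoring smarter and our trips safer!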

Abstract

In this paper, we address the problem of zero-shot understanding of accidents from surveillance videos by identifying when an impact event occurs, what type of impact it is, and where in the frame it occurs using natural language. We propose a three-stage pipeline that decomposes accident understanding into when, what, and where. The first stage extracts a short temporal window around the impact using vision-language similarity. In the second stage, we perform metadata-driven multi-prompt reasoning with five complementary views (baseline, motion, geometry, contrast, and tiebreaker) and resolve disagreement via an entropy-gated pairwise adjudicator. Finally, we localize the impact with an open-vocabulary detector queried on the predicted accident type and scene layout, and aggregate detections across keyframes using a score-weighted centroid. Our pipeline achieves a substantial improvement in the harmonic-mean score over a centre-of-frame baseline on the zero-shot ACCIDENT @ CVPR benchmark. We show that decomposing zero-shot video understanding into temporal localization, semantic classification, and spatial grounding enables more reliable reasoning with vision-language models than direct prompting alone.

cs.CV cs.AI stat.ML
