Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection

TL;DR

The proposed VisAnomReasoner, fine-tuned on VisAnomBench, achieves 74.30% precision and 72.17% F1 in time-series anomaly detection, surpassing all baselines by at least 21.23 and 23.87 percentage points, respectively.

cs.AI · Advanced · 2026-05-29

Xiaona Zhou Muntasir Wahed Tianjiao Yu Constantin Brif Ismini Lourentzou

Time-Series Analysis Vision-Language Models Anomaly Detection Multimodal Reasoning Model Fine-tuning

Key Findings

Methodology

This paper introduces a multimodal framework for time-series anomaly detection based on vision-language models (VLMs). Its core is VisAnomBench, a dataset that integrates multiple public time-series datasets across diverse domains and enriches them with high-quality natural-language explanations grounded in visual evidence. A parameter-efficient VLM, VisAnomReasoner, is fine-tuned on this dataset using a reward-driven supervised learning paradigm. The model uses a multi-layer Transformer architecture that combines a visual feature extractor with a natural-language generation module to jointly localize anomalies and produce interpretable explanations. During training, it is supervised with structured outputs containing anomaly intervals and step-by-step rationales. These supervision targets are built by having multiple large VLMs generate candidate outputs, which are scored with a composite reward that measures temporal overlap, visual groundedness, axis consistency, and explanation clarity; only high-scoring candidates are kept as supervision signals. As a result, the model learns not only where anomalies occur but also why they are abnormal, grounded in visual evidence from time-series plots.

Key Results

On the VisAnomBench benchmark, the 7B variant of VisAnomReasoner outperforms all baseline models, reaching 74.30% precision and 72.17% F1. These are improvements of at least 21.23 and 23.87 percentage points, respectively, over the baselines, including large general-purpose VLMs such as GPT-4o-based models. The model also localizes anomalies more accurately: its overlap score surpasses all competitors by at least 6.26 percentage points, indicating tighter temporal boundary alignment.
On the TSB-AD-U benchmark, the 7B version achieves 60.91% precision and 62.91% F1, outperforming the second-best model by 9.57 and 13.39 percentage points, respectively. These results indicate strong cross-benchmark generalization. Ablation studies show that supervised fine-tuning with explanation supervision significantly reduces false positives, leading to a 13-point F1 improvement and highlighting the importance of explanation-grounded training.
A comparison with classical detectors such as IForest and Matrix Profile shows that VisAnomReasoner maintains a balanced trade-off between recall and precision, with notably higher boundary localization accuracy. The model's explanations are preferred in 69.6% of cases, suggesting improved interpretability and user trust.
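To make the relationship between interval predictions and the reported precision and F1 concrete, the sketch below computes point-wise precision, recall, and F1 from predicted and ground-truth intervals. It is an illustration under assumptions: the summary does not state which evaluation protocol the paper uses (point-wise, range-based, or otherwise), and the function name `pointwise_prf` and the inclusive-interval convention are hypothetical.

```python
def pointwise_prf(pred: list[tuple[int, int]],
                  truth: list[tuple[int, int]]) -> tuple[float, float, float]:
    """Point-wise precision, recall, and F1 for inclusive (start, end) intervals."""
    pred_points = {t for s, e in pred for t in range(s, e + 1)}
    true_points = {t for s, e in truth for t in range(s, e + 1)}

    true_positives = len(pred_points & true_points)
    # Precision: fraction of flagged time steps that are truly anomalous.
    precision = true_positives / len(pred_points) if pred_points else 0.0
    # Recall: fraction of truly anomalous time steps that were flagged.
    recall = true_positives / len(true_points) if true_points else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1
```

Under this definition, every falsely flagged time step lowers precision and therefore F1, which is consistent with the ablation finding that fewer false positives translate into a higher F1.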

Significance

This work marks a significant advancement in the integration of multimodal reasoning into time-series anomaly detection. By grounding anomaly localization and explanation generation in visual evidence, it addresses longstanding issues of interpretability and reliability. The approach enhances transparency in critical applications such as industrial process monitoring, healthcare diagnostics, and cybersecurity, where understanding the 'why' behind anomalies is as important as detection itself. The methodology paves the way for future research on multimodal, explainable AI systems capable of handling complex, real-world sequential data, bridging the gap between high-performance detection and human-understandable reasoning.

Technical Contribution

The paper's main technical contributions include: (1) framing time-series anomaly detection as a vision-language reasoning task, enabling joint localization and explanation; (2) constructing VisAnomBench, a large, annotated dataset with natural language rationales aligned with anomaly intervals; (3) designing VisAnomReasoner, a parameter-efficient Transformer-based model that leverages explanation-augmented supervision via a composite reward function to improve detection accuracy and interpretability; (4) demonstrating superior performance over 15 baselines, including large general-purpose VLMs, specialized anomaly detectors, and classical methods, across multiple datasets and metrics; (5) establishing a new paradigm for integrating visual evidence and natural language reasoning in sequential anomaly detection, with potential for broad applications.
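Contribution (1), joint localization and explanation, implies that each model response carries both anomaly intervals and a rationale for each. The sketch below shows one way such a structured response could be parsed and sanity-checked. The JSON layout, the field names `anomalies`, `start`, `end`, and `rationale`, and the `parse_response` helper are assumptions made for illustration; the summary does not specify the actual output format.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Anomaly:
    start: int       # first anomalous time step (inclusive)
    end: int         # last anomalous time step (inclusive)
    rationale: str   # natural-language explanation for this interval


def parse_response(raw: str, series_length: int) -> list[Anomaly]:
    """Parse a JSON response, keeping only well-formed, in-range anomalies."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    items = payload.get("anomalies")
    if not isinstance(items, list):
        return []

    anomalies = []
    for item in items:
        if not isinstance(item, dict):
            continue
        start, end = item.get("start"), item.get("end")
        rationale = item.get("rationale")
        # Require plain integers (not booleans) for the interval bounds.
        if type(start) is not int or type(end) is not int:
            continue
        # Require a non-empty explanation for every interval.
        if not isinstance(rationale, str) or not rationale.strip():
            continue
        # Drop intervals that are reversed or fall outside the plotted series.
        if 0 <= start <= end < series_length:
            anomalies.append(Anomaly(start, end, rationale.strip()))
    return anomalies
```

Rejecting intervals that fall outside the series is a simple guard in the spirit of the axis-consistency check described in the methodology, although the paper's own check may work differently.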

Novelty

This research introduces the novel concept of transforming time-series anomaly detection into a multimodal reasoning problem, emphasizing the joint localization and explanation grounded in visual evidence. Unlike prior works that rely solely on scalar scores or binary labels, this approach leverages natural language explanations aligned with visual features, enabling models to produce interpretable, plot-consistent rationales. The creation of VisAnomBench as an explanation-augmented dataset is also pioneering, providing rich supervision signals that significantly enhance model performance and interpretability. This represents a fundamental shift from traditional anomaly detection paradigms, opening new avenues for explainable AI in sequential data analysis.

Limitations

Despite impressive results, the model's performance may degrade in scenarios with extremely noisy or highly complex signals where visual cues are ambiguous or misleading, potentially affecting explanation quality.
The computational cost of generating and evaluating multiple candidate outputs during training is high, which may limit scalability to very large datasets or real-time applications.
Current models are primarily validated on publicly available datasets; their robustness in industrial or domain-specific environments with unique data distributions remains to be further tested.

Future Work

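Future research will focus on enhancing the model's robustness in noisy or highly variable environments, optimizing inference speed for real-time deployment, and extending its multimodal reasoning capabilities to support multi-task learning (such as forecasting and classification). In addition, active learning mechanisms are planned so that the model can adapt to dynamically changing time-series data, improving its adaptability and continual-learning ability. Exploring deeper mechanisms for multimodal information fusion, combining more types of visual and textual information, will further advance the practical deployment of multimodal AI in areas such as industrial intelligence and medical monitoring.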

AI Executive Summary

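Time-series anomaly detection plays a central role in critical sectors such as industrial automation, medical diagnosis, and cybersecurity. Traditional methods mostly rely on numerical scores or binary labels and cannot provide the specific reasons behind an anomaly, which limits their transparency and trustworthiness in practice. In recent years, vision-language models (VLMs) have made breakthroughs in multimodal reasoning, combining visual evidence with natural-language generation and showing strong understanding and explanation capabilities. However, applying these models directly to time-series anomaly detection faces several challenges: time series lack explicit spatial structure, fine-grained supervision signals are scarce, and models struggle to perform end-to-end anomaly localization and explanation. To address these challenges, this paper proposes VisAnomReasoner, a fine-tuned, parameter-efficient model designed specifically for time-series anomaly detection. The core innovation is the construction of VisAnomBench, an explanation-augmented dataset spanning multiple domains and anomaly types, combined with a reward-based training mechanism that enables the model to both localize anomaly intervals precisely and explain them soundly. The model adopts a multi-layer Transformer architecture that fuses visual feature extraction with natural-language generation and, validated on multiple public datasets, significantly outperforms existing baselines. On VisAnomBench, the 7B version of VisAnomReasoner improves precision and F1 for anomaly localization by more than 21 and 23 percentage points, respectively, over the best competing model, demonstrating its advantages in cross-scenario generalization and explanation quality. This work advances the use of multimodal AI in time-series analysis and offers more transparent and trustworthy solutions for fields such as industrial monitoring and medical diagnosis. Going forward, the model's robustness and efficiency will continue to be optimized and its multi-task capabilities extended, supporting the broader upgrade of intelligent systems.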

Deep Dive

Abstract

Recent advances in Vision-Language Models (VLMs) have achieved impressive performance across many tasks, yet prior studies report unsatisfactory performance when applying large language or multimodal models to finding abnormal patterns in sequential data. Public anomaly detection benchmarks typically provide interval annotations but not natural-language rationales, making it difficult to fine-tune VLMs to produce grounded, interpretable decisions. To address this gap, we construct VisAnomBench, a curated benchmark built from public time-series datasets and augmented with high-quality anomaly explanations selected from multiple large VLMs using fine-grained, task-specific rewards. Through fine-tuning on this benchmark, we develop VisAnomReasoner, a parameter-efficient VLM for time-series anomaly detection. Experimental results on VisAnomBench show that VisAnomReasoner achieves more accurate anomaly localization and consistently outperforms all baselines, with improvements of at least 21.23 and 23.87 percentage points in precision and F1, respectively. Additional experiments on the TSB-AD-U benchmark demonstrate strong cross-benchmark generalization, with VisAnomReasoner improving precision and F1 by 9.57 and 13.39 percentage points, respectively.
