VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking
VideoSeek actively seeks critical evidence using video logic flow, reducing frame usage by 93% and improving LVBench accuracy by 10.2 points.
Key Findings
Methodology
VideoSeek is a long-horizon video agent model that actively seeks critical evidence by following the video's logic flow instead of densely parsing the entire video. Its core is a think-act-observe loop paired with a well-designed toolkit for observing the video at multiple granularities: an overview tool establishes a global view of the video, a skim tool coarsely scans candidate segments, and a focus tool deeply analyzes short clips. This approach allows VideoSeek to maintain or even improve video understanding capability while using far fewer frames.
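To make the toolkit concrete, here is a minimal Python sketch of the three observation granularities. The tool names follow the paper; the signatures, frame budgets, and the `sample_frames` helper are illustrative assumptions, not the authors' actual API.

```python
from dataclasses import dataclass

def sample_frames(video, start: float, end: float, n: int):
    """Uniformly sample n frame timestamps from [start, end) (hypothetical helper)."""
    step = (end - start) / n
    return [start + i * step for i in range(n)]

@dataclass
class VideoToolkit:
    video: object    # handle to the loaded video
    duration: float  # video length in seconds

    def overview(self, n_frames: int = 16):
        """Global pass: a sparse, uniform sample over the whole video."""
        return sample_frames(self.video, 0.0, self.duration, n_frames)

    def skim(self, start: float, end: float, n_frames: int = 8):
        """Coarse scan of a candidate segment to narrow the search space."""
        return sample_frames(self.video, start, end, n_frames)

    def focus(self, start: float, end: float, fps: float = 2.0):
        """Dense look at a short clip to extract critical details."""
        n = max(1, int((end - start) * fps))
        return sample_frames(self.video, start, end, n)
```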
Key Results
- VideoSeek achieved a 10.2-point accuracy improvement over its base model, GPT-5, on LVBench while reducing frame usage by 93% (a quick sanity check on these numbers follows the list). Without subtitles, VideoSeek averaged 92.3 frames at 68.4% accuracy; with subtitles, frame usage dropped to 27.2 frames while accuracy rose to 76.7%.
- On VideoMME and LongVideoBench, VideoSeek used an average of 60.9 and 29.6 frames respectively, achieving accuracies of 70.1% and 73.5% and significantly outperforming other multimodal models and video agents.
- On the complex video reasoning benchmark Video-Holmes, VideoSeek achieved an overall accuracy of 47.3% using an average of only 42.7 frames, surpassing strong models including Gemini 2.5 Pro.
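The headline numbers admit a quick sanity check. Assuming the 93% reduction is measured against GPT-5's frame usage in the no-subtitle setting (the exact pairing is not stated here), the implied dense baseline is roughly 1,300 frames:

```python
# Back-of-envelope check; the pairing of the 93% figure with the
# no-subtitle 92.3-frame average is an assumption.
frames_videoseek = 92.3
reduction = 0.93
implied_baseline = frames_videoseek / (1 - reduction)
print(f"Implied GPT-5 frame usage: {implied_baseline:.0f} frames")  # ~1319
```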
Significance
VideoSeek's significance lies in improving accuracy while cutting computational cost. The method addresses the high computational cost of traditional video agent models on long videos and applies broadly across multimodal video understanding tasks. By leveraging video logic flow, VideoSeek improves efficiency and offers a new perspective on video understanding and reasoning.
Technical Contribution
VideoSeek's technical contributions are twofold: it uses video logic flow to seek evidence, avoiding the high cost of dense video parsing, and its multi-granular toolkit lets the model observe video content at different levels of detail, enabling more efficient reasoning and understanding. Across multiple benchmarks, VideoSeek maintains or improves accuracy while reducing frame usage, demonstrating its effectiveness on long-horizon video understanding tasks.
Novelty
VideoSeek's novelty lies in its active seeking of critical evidence through video logic flow, rather than relying on dense video parsing. This approach fundamentally differs from the single-pass paradigm of traditional video agent models, offering a more efficient method for long-horizon video understanding and reasoning.
Limitations
- VideoSeek may perform poorly on videos without a clear logic flow, since that flow is what guides its evidence seeking.
- In some cases, the toolkit's selection may not be flexible enough, leading to over- or under-analysis of certain video segments.
- While VideoSeek excels in reducing frame usage, further optimization may be needed for extremely long videos to maintain efficiency.
Future Work
Future research directions include further optimizing VideoSeek's performance on extremely long videos, exploring a wider variety of video types and scenarios, and improving the toolkit's flexibility to adapt to different video content. Additionally, integrating other multimodal signals (such as audio) to enhance video understanding capabilities is a promising direction.
AI Executive Summary
Video understanding is a complex task, especially in long-horizon videos, where traditional methods often require dense parsing, leading to high computational costs. Existing large multimodal models have made progress in video-language tasks but still face challenges in handling long videos and complex reasoning tasks.
VideoSeek is an innovative long-horizon video agent model that actively seeks critical evidence by following the video's logic flow instead of densely parsing the video. Its core is a think-act-observe loop paired with a well-designed toolkit for observing the video at multiple granularities: an overview tool establishes a global view of the video, a skim tool coarsely scans candidate segments, and a focus tool deeply analyzes short clips.
The technical principle of VideoSeek lies in guiding evidence seeking through video logic flow, thereby reducing frame usage. This approach allows the model to maintain or even improve video understanding capabilities while reducing computational costs. Experimental results show that VideoSeek performs excellently across multiple benchmarks, particularly achieving a 10.2-point improvement in accuracy over its base model GPT-5 on LVBench while reducing frame usage by 93%.
The method's efficiency and broad applicability matter for the field. VideoSeek not only addresses the high computational cost of traditional video agent models on long videos but also offers new insight into video understanding and reasoning, and its ability to improve accuracy with fewer frames suggests wide applicability across multimodal video understanding tasks.
However, VideoSeek may perform poorly on videos without clear logic flow, as it relies on video logic flow to guide evidence seeking. Additionally, further optimization may be needed for extremely long videos to maintain efficiency. Future research directions include further optimizing VideoSeek's performance on extremely long videos, exploring a wider variety of video types and scenarios, and improving the toolkit's flexibility to adapt to different video content.
Deep Analysis
Background
Video understanding is a crucial research area in computer vision and natural language processing, with wide-ranging applications including multimodal assistants, autonomous driving, and vision-guided robotics. Recent advancements in large language models (LLMs) and large multimodal models (LMMs) have propelled progress in video-language understanding. However, existing methods predominantly follow a single-pass paradigm, which often falls short in handling long videos and complex reasoning tasks. Traditional video agent models typically rely on dense video parsing, leading to high computational costs, especially in long videos. Moreover, many existing methods lack flexibility in handling the diversity and complexity of video content.
Core Problem
The core problem in video understanding tasks is how to improve accuracy without increasing computational costs. Traditional methods often require dense video parsing, resulting in high computational costs. Additionally, existing video agent models often perform poorly in handling long videos and complex reasoning tasks due to their lack of flexibility in dealing with diverse and complex video content. Therefore, finding a way to improve video understanding accuracy while reducing frame usage is a pressing challenge.
Innovation
The core innovation of VideoSeek lies in actively seeking critical evidence through the video's logic flow rather than relying on dense video parsing. This fundamentally departs from the single-pass paradigm of traditional video agent models and yields a more efficient approach to long-horizon video understanding and reasoning. VideoSeek employs a think-act-observe loop together with a well-designed toolkit for observing the video at multiple granularities: an overview tool establishes a global view, a skim tool coarsely scans candidate segments, and a focus tool deeply analyzes short clips. This design lets the model flexibly observe video content at different granularities, enabling more efficient reasoning and understanding.
Methodology
The implementation of VideoSeek involves three key steps, iterated as a loop (a minimal sketch follows the list):
- Think: At each step, the model reasons over the query and the observations accumulated so far, plans the next action, and selects an appropriate tool.
- Act: The model invokes the selected tool (overview, skim, or focus) to gather new evidence from the video at the corresponding granularity.
- Observe: The newly gathered evidence is fed back to the model, and the loop repeats until sufficient evidence has been collected to produce the final answer.
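Below is a minimal sketch of this loop, assuming an `llm` callable that returns either a tool choice or a final answer and reusing the hypothetical `VideoToolkit` sketched earlier; the decision format, step budget, and fallback are illustrative, not the paper's implementation.

```python
def video_seek(query, toolkit, llm, max_steps=10):
    """Illustrative think-act-observe loop (not the authors' implementation)."""
    observations = []
    for _ in range(max_steps):
        # Think: reason over the query and accumulated observations,
        # then pick the next tool call (or decide to answer).
        decision = llm(query=query, observations=observations)
        if decision["action"] == "answer":
            return decision["content"]
        # Act: invoke the chosen tool to gather new evidence.
        tool = getattr(toolkit, decision["action"])  # "overview", "skim", or "focus"
        evidence = tool(**decision.get("args", {}))
        # Observe: feed the new evidence back into the next iteration.
        observations.append({"tool": decision["action"], "evidence": evidence})
    # Step budget exhausted: answer with whatever evidence was collected.
    return llm(query=query, observations=observations, force_answer=True)["content"]
```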
Experiments
The experiments evaluate VideoSeek on four challenging video understanding and reasoning benchmarks: LVBench, VideoMME, LongVideoBench, and Video-Holmes. GPT-5 serves as the base model; the ablation study swaps in alternative LLMs (e.g., o4-mini and GPT-4.1). The evaluation measures whether VideoSeek can maintain or improve accuracy while reducing frame usage, and analyzes how toolkit design and video logic flow affect performance.
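This evaluation grid can be summarized in a short sketch; the benchmark and model names come from the paper, while the `run_benchmark` harness and result bookkeeping are hypothetical.

```python
# Hypothetical evaluation grid; run_benchmark(model, bench) is an assumed
# harness returning (accuracy_percent, avg_frames_used) per benchmark.
BENCHMARKS = ["LVBench", "VideoMME", "LongVideoBench", "Video-Holmes"]
BASE_MODELS = ["gpt-5", "o4-mini", "gpt-4.1"]  # gpt-5 is the main base model

def evaluate(run_benchmark):
    results = {}
    for model in BASE_MODELS:
        for bench in BENCHMARKS:
            accuracy, avg_frames = run_benchmark(model, bench)
            results[(model, bench)] = {"accuracy": accuracy, "avg_frames": avg_frames}
    return results
```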
Results
VideoSeek performs strongly across all four benchmarks. On LVBench, it improves accuracy by 10.2 points over its base model, GPT-5, while reducing frame usage by 93%. On VideoMME and LongVideoBench, it uses an average of 60.9 and 29.6 frames respectively, achieving 70.1% and 73.5% accuracy and significantly outperforming other multimodal models and video agents. On the complex video reasoning benchmark Video-Holmes, it reaches 47.3% overall accuracy with an average of only 42.7 frames, surpassing strong models including Gemini 2.5 Pro.
Applications
VideoSeek has broad applications in the field of video understanding. Its efficient evidence-seeking capability makes it suitable for multimodal assistants, autonomous driving, and vision-guided robotics. Additionally, VideoSeek's ability to improve accuracy while reducing frame usage makes it valuable in long video processing and complex reasoning tasks.
Limitations & Outlook
Despite VideoSeek's excellent performance across multiple benchmarks, it may perform poorly on videos without clear logic flow, as it relies on video logic flow to guide evidence seeking. Additionally, further optimization may be needed for extremely long videos to maintain efficiency. Future research directions include further optimizing VideoSeek's performance on extremely long videos, exploring a wider variety of video types and scenarios, and improving the toolkit's flexibility to adapt to different video content.
Plain Language (accessible to non-experts)
Imagine you're watching a movie, but you don't have time to watch it from start to finish. You might quickly skim the movie's synopsis to get a general idea of the storyline, then jump to the parts you think are important and watch them closely. VideoSeek is like a smart assistant that helps you quickly find the most important parts of the movie without wasting time watching the whole thing. It uses a think-act-observe loop to guide evidence seeking through the video's logic flow. First, it establishes a global overview of the video, like quickly skimming the movie's synopsis. Then, it coarsely scans segments that might contain important information, like jumping to key scenes in the movie. Finally, it deeply analyzes short clips that need close observation, like carefully watching the movie's climax. This approach allows VideoSeek to maintain or even improve video understanding capabilities while using fewer frames.
ELI14 (explained like you're 14)
Hey there, have you ever thought about what it would be like if we could watch movies like a super detective? VideoSeek is like that super detective! It helps us quickly find the most important parts of a movie without having to watch it from start to finish. Imagine you're playing a super complex game, and you need to find hidden treasures. VideoSeek is like your game assistant, telling you where the treasures might be hidden so you don't waste time searching all over the place! It first quickly skims the entire game map, then tells you where the treasures might be, and finally takes you to those places to search carefully. Isn't that cool? This way, we can find the treasures faster and win the game!
Glossary
VideoSeek
A long-horizon video agent model that actively seeks critical evidence using video logic flow instead of densely parsing the video.
VideoSeek is used in the paper to improve video understanding efficiency and accuracy.
Logic Flow
The temporal and causal structure within a video used to guide evidence seeking and help the model quickly locate important segments.
Logic flow is used in VideoSeek to guide the model in selecting appropriate tools for evidence seeking.
Think-Act-Observe Loop
The core workflow of VideoSeek, involving continuous thinking, acting, and observing to gather evidence until the final answer is generated.
This loop is used in VideoSeek's evidence seeking process.
Overview Tool
One of VideoSeek's tools used to establish a global overview of the video, helping the model form an initial plan.
The overview tool is used in VideoSeek to quickly browse the overall structure of the video.
Skim Tool
One of VideoSeek's tools used to coarsely scan candidate segments, helping the model narrow down the search space.
The skim tool is used in VideoSeek to quickly locate segments that might contain important information.
Focus Tool
One of VideoSeek's tools used to deeply analyze short clips to obtain critical details.
The focus tool is used in VideoSeek to closely observe segments that need verification or precise information extraction.
LVBench
A video understanding and reasoning benchmark used to evaluate model performance on long videos.
LVBench is used in the paper to evaluate VideoSeek's performance.
VideoMME
A comprehensive multimodal benchmark for evaluating model performance in long video understanding.
VideoMME is used in the paper to evaluate VideoSeek's performance.
LongVideoBench
A long video understanding benchmark used to evaluate model performance on long videos.
LongVideoBench is used in the paper to evaluate VideoSeek's performance.
Video-Holmes
A complex video reasoning benchmark used to evaluate model performance on complex reasoning tasks.
Video-Holmes is used in the paper to evaluate VideoSeek's performance.
Open Questions (unanswered questions from this research)
1. How can VideoSeek perform well on videos without a clear logic flow? The current method relies on video logic flow to guide evidence seeking and may degrade on videos that lack it.
2. How can VideoSeek be further optimized for extremely long videos? Although it excels at reducing frame usage, maintaining efficiency on extremely long videos may require further work.
3. How can the toolkit be made more flexible across different video content? The current toolkit may sometimes over- or under-analyze certain video segments.
4. How can other multimodal signals (such as audio) be integrated to enhance video understanding? The current method relies mainly on visual information, and additional signals may further improve understanding.
5. How far can frame usage be reduced while maintaining or improving accuracy? The current method already cuts frame usage substantially, but some cases may still admit further optimization.
Applications
Immediate Applications
Multimodal Assistants
VideoSeek can be used in multimodal assistants to help users quickly access critical information in videos through efficient video understanding capabilities.
Autonomous Driving
In autonomous driving, VideoSeek can be used to analyze real-time video captured by onboard cameras, quickly identifying important information on the road.
Vision-Guided Robotics
VideoSeek can be used in vision-guided robotics to help robots quickly locate and identify target objects in complex environments.
Long-term Vision
Intelligent Surveillance Systems
VideoSeek can be used in intelligent surveillance systems to detect and identify abnormal behavior in real-time through efficient video analysis capabilities.
Film Production
In film production, VideoSeek can be used to quickly analyze and edit long videos, helping production teams improve work efficiency.
Abstract
Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still heavily rely on greedy parsing over densely sampled video frames, resulting in high computational cost. We present VideoSeek, a long-horizon video agent that leverages video logic flow to actively seek answer-critical evidence instead of exhaustively parsing the full video. This insight allows the model to use far fewer frames while maintaining, or even improving, its video understanding capability. VideoSeek operates in a think-act-observe loop with a well-designed toolkit for collecting multi-granular video observations. This design enables query-aware exploration over accumulated observations and supports practical video understanding and reasoning. Experiments on four challenging video understanding and reasoning benchmarks demonstrate that VideoSeek achieves strong accuracy while using far fewer frames than prior video agents and standalone LMMs. Notably, VideoSeek achieves a 10.2-point absolute improvement on LVBench over its base model, GPT-5, while using 93% fewer frames. Further analysis highlights the significance of leveraging video logic flow, strong reasoning capability, and the complementary roles of toolkit design.
References (20)
Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding
Xiaoyi Zhang, Zhaoyang Jia, Zongyu Guo et al.
LVBench: An Extreme Long Video Understanding Benchmark
Weihan Wang, Zehai He, Wenyi Hong et al.
MR. Video: "MapReduce" is the Principle for Long Video Understanding
Ziqi Pang, Yu-Xiong Wang
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
Junhao Cheng, Yuying Ge, Teng Wang et al.
DrVideo: Document Retrieval Based Long Video Understanding
Ziyu Ma, Chenhui Gou, Hengcan Shi et al.
Multimodal Behavior Therapy: Treating the "Basic Id"
A. Lazarus
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
Chan Hee Song, Valts Blukis, Jonathan Tremblay et al.
Video Instruction Tuning With Synthetic Data
Yuanhan Zhang, Jinming Wu, Wei Li et al.
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
Jun Xu, Tao Mei, Ting Yao et al.
A Survey on Vision-Language-Action Models for Autonomous Driving
Sicong Jiang, Zilin Huang, Kangan Qian et al.
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin et al.
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
Yan Shu, Peitian Zhang, Zheng Liu et al.
Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Qingyang Wu et al.
Video-R1: Reinforcing Video Reasoning in MLLMs
Kaituo Feng, Kaixiong Gong, Bohao Li et al.
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan et al.
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Lin Xu, Yilin Zhao, Daquan Zhou et al.
ReAct: Synergizing Reasoning and Acting in Language Models
Shunyu Yao, Jeffrey Zhao, Dian Yu et al.
Dense-Captioning Events in Videos
Ranjay Krishna, K. Hata, F. Ren et al.
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
Lin Chen, Xilin Wei, Jinsong Li et al.
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Orr Zohar, Xiaohan Wang, Yann Dubois et al.