Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs

TL;DR

DeltaDirect, a projector-level auxiliary training objective used together with the MoDirect dataset family, raises motion direction accuracy from 25.9% to 85.4% on the synthetic benchmark MoDirect-SynBench.

cs.CV, Advanced, 2026-05-22

Jongseo Lee, Hyuntak Lee, Sunghun Kim, Sooa Kim, Jihoon Chung, Jinwoo Choi

video understanding, large language models, motion direction recognition, vision-language models, instruction tuning

Key Findings

Methodology

This paper diagnoses a systematic failure in Video Large Language Models (Video-LLMs) to correctly recognize signed image-plane motion directions. The authors construct the MoDirect dataset family, comprising synthetic and real-world subsets with controlled foreground and background variations, designed as multiple-choice questions (MCQ) to test left, right, up, and down motion recognition. They trace motion direction information through the Video-LLM pipeline, showing that while the vision encoder, projector, and LLM hidden states retain linearly decodable motion signals, the final language output fails to bind these signals to the correct verbal answer, termed the 'direction binding gap.' To address this, they propose DeltaDirect, a training-only auxiliary objective that predicts normalized 2-D motion vectors from adjacent-frame projector feature deltas, thereby strengthening signed displacement cues at the vision-language interface. Training combines standard next-token prediction loss with the motion vector prediction loss, updating only the projector and prediction head, while inference remains unchanged.
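The combined training objective described above can be written compactly. This is only a sketch of one plausible form: the weight \lambda, the pooling operator, the head h, and all symbol names are assumptions for illustration and are not taken from the paper; the squared-error form follows the summary's description of a mean squared error term.

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{NTP}} + \lambda \, \mathcal{L}_{\text{MV}},
\qquad
\mathcal{L}_{\text{MV}} = \frac{1}{T-1} \sum_{t=1}^{T-1}
\left\lVert h\!\left(\operatorname{pool}\left(z_{t+1} - z_{t}\right)\right) - \hat{v}_{t} \right\rVert_2^2
```

Here z_t denotes the projector output for frame t, pool is a spatial pooling over tokens, h is the lightweight prediction head, and \hat{v}_t is the normalized 2-D motion vector between frames t and t+1. Since h is used only during training, dropping it leaves the inference path unchanged.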

Key Results

On the MoDirect-SynBench synthetic dataset, instruction tuning with DeltaDirect improves motion direction accuracy from 25.9% to 85.4%, significantly outperforming baseline models.
On the MoDirect-RealBench real-world dataset, DeltaDirect achieves a 21.9 percentage point improvement in motion direction accuracy over the vanilla baseline without requiring real-world motion direction labels, while preserving overall video understanding performance.
Ablation studies reveal that applying motion vector prediction supervision at the projector output yields the best results, and that directly predicting normalized 2-D motion vectors outperforms alternative motion signals, validating the design choices.

Significance

This work is the first to systematically reveal the severe deficiency of current Video-LLMs in basic motion direction perception, identifying the 'direction binding gap' between visual encoding and language output. The proposed DeltaDirect training strategy effectively strengthens motion signals at the vision-language interface, substantially enhancing models' understanding of motion direction. This advances foundational perceptual capabilities in video understanding and provides critical insights for future multimodal models to better integrate perception and language. The findings have broad implications for both academic research and industrial applications requiring precise dynamic scene understanding.

Technical Contribution

Key technical contributions include: 1) precisely localizing the failure in motion direction recognition to the language readout stage via linear probing and concept vector analysis; 2) introducing the MoDirect dataset family for controlled instruction tuning and evaluation across synthetic and real domains; 3) designing DeltaDirect, a projector-level auxiliary training objective that predicts normalized 2-D motion vectors from adjacent-frame feature deltas to reinforce signed displacement cues; 4) integrating this auxiliary loss with standard language modeling loss during training while keeping inference unchanged; 5) empirically demonstrating significant accuracy gains on motion direction tasks without degrading general video understanding, establishing a new paradigm for training-driven vision-language interface enhancement.

Novelty

This study is the first to systematically diagnose the motion direction blindness in Video-LLMs and conceptualize the 'direction binding gap,' highlighting that motion signals exist internally but are not effectively utilized by the language module. Unlike prior approaches that add motion-specific encoders or tokens at inference, DeltaDirect strengthens motion signals during training at the vision-language interface without increasing inference complexity. This novel approach achieves efficient and robust motion direction understanding, filling a critical gap in existing Video-LLMs' foundational perceptual abilities.

Limitations

The method experiences accuracy degradation in highly complex real-world scenes due to weakened motion signal magnitude, indicating limited generalization under visual complexity.
DeltaDirect relies on synthetic motion direction annotations for training, and scarcity of real-world labeled data may constrain direct applicability in some domains.
Current work focuses on four cardinal motion directions and does not address more complex or 3D motion patterns.

Future Work

Future research directions include enhancing motion signal robustness and generalization in complex real-world scenarios, exploring self-supervised or weakly supervised approaches to reduce dependence on labeled data, extending the framework to cover richer motion types and 3D motion understanding, and integrating temporal reasoning for comprehensive dynamic scene interpretation in Video-LLMs.

AI Executive Summary

Video Large Language Models (Video-LLMs) have recently achieved remarkable progress in temporal video understanding, enabling sophisticated reasoning over dynamic scenes. However, this paper uncovers a fundamental blind spot: these models fail to reliably recognize basic signed motion directions—left, right, up, and down—in simple videos featuring a single moving object. Despite humans effortlessly perceiving such motion, most Video-LLMs perform near chance levels (~25%), with occasional above-chance results attributed to prediction biases rather than true understanding. This systematic failure is termed “directional motion blindness.”

To diagnose this issue, the authors trace motion direction information through the Video-LLM pipeline, including the vision encoder, projector, and language model hidden states. They find that motion direction signals remain linearly decodable at all stages except the final language output, where the model fails to bind the perceived direction to the correct verbal answer option. This “direction binding gap” is a structural limitation across multiple Video-LLMs, indicating that the problem lies not in perception but in the integration of motion signals into language generation.
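A linear probe of the kind described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the closed-form ridge classifier, the function names, and the synthetic "frozen features" are all assumptions standing in for real encoder, projector, or hidden-state activations.

```python
import numpy as np

def fit_linear_probe(features, labels, num_classes=4, ridge=1e-3):
    """Fit a ridge-regularized least-squares linear classifier on frozen features."""
    n, d = features.shape
    x = np.hstack([features, np.ones((n, 1))])   # append a bias column
    y = np.eye(num_classes)[labels]              # one-hot targets
    # Closed-form ridge solution: W = (X^T X + ridge * I)^-1 X^T Y
    return np.linalg.solve(x.T @ x + ridge * np.eye(d + 1), x.T @ y)

def probe_accuracy(w, features, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    x = np.hstack([features, np.ones((len(features), 1))])
    return float(np.mean(np.argmax(x @ w, axis=1) == labels))

# Synthetic stand-in for frozen representations (assumption: the direction is
# embedded linearly in a 32-d feature space, plus small Gaussian noise).
rng = np.random.default_rng(0)
# left, right, up, down in image coordinates (y grows downward; a convention chosen here)
directions = np.array([[-1, 0], [1, 0], [0, -1], [0, 1]], dtype=float)
labels = rng.integers(0, 4, size=400)
mixing = rng.normal(size=(2, 32))
features = directions[labels] @ mixing + 0.1 * rng.normal(size=(400, 32))

w = fit_linear_probe(features[:300], labels[:300])
acc = probe_accuracy(w, features[300:], labels[300:])
print(f"held-out probe accuracy: {acc:.2f}")
```

High probe accuracy at a given stage indicates only that the direction is linearly decodable there; it says nothing about whether the language output uses it, which is the gap the paper describes.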

To address this, the researchers construct MoDirect, a dataset family with four subsets combining synthetic and real backgrounds and foregrounds, designed as multiple-choice questions with randomized answer orders to rigorously test motion direction binding. Instruction tuning on the simplest synthetic domain improves binding on that domain but generalizes poorly to more complex ones. Concept vector analysis reveals that while the orientation of motion direction representations aligns across domains, their magnitude diminishes with visual complexity, weakening the signal and reopening the binding gap.
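The concept vector comparison can be illustrated with a small sketch. It assumes a difference-in-means definition (target-class mean minus the mean of the remaining classes) and uses synthetic features in which the complex domain carries the same direction axes at a smaller scale; the names and the scale factors are hypothetical.

```python
import numpy as np

def concept_vector(features, labels, target):
    """Difference-in-means: mean of target-class features minus mean of the rest."""
    return features[labels == target].mean(axis=0) - features[labels != target].mean(axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim = 64
direction_axes = rng.normal(size=(4, dim))   # hypothetical per-direction feature axes
labels = rng.integers(0, 4, size=2000)

def make_domain(signal_scale, noise_scale=0.5):
    """Synthetic features: scaled direction signal plus isotropic noise."""
    noise = noise_scale * rng.normal(size=(len(labels), dim))
    return signal_scale * direction_axes[labels] + noise

simple_domain = make_domain(signal_scale=1.0)    # stand-in for a visually simple domain
complex_domain = make_domain(signal_scale=0.3)   # stand-in for a visually complex domain

v_simple = concept_vector(simple_domain, labels, target=0)
v_complex = concept_vector(complex_domain, labels, target=0)

alignment = cosine(v_simple, v_complex)
magnitude_ratio = float(np.linalg.norm(v_complex) / np.linalg.norm(v_simple))
print(f"cosine alignment: {alignment:.3f}, magnitude ratio: {magnitude_ratio:.3f}")
```

In this toy setup the two vectors point almost the same way while the complex-domain vector is much shorter, which mirrors the pattern reported in the summary: shared orientation, reduced magnitude.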

Motivated by this insight, the authors propose DeltaDirect, a training-only auxiliary objective applied at the projector output layer. DeltaDirect predicts normalized 2-D motion vectors from adjacent-frame feature deltas, directly supervising signed displacement cues before they enter the language model. This auxiliary branch is discarded at inference, preserving the original model architecture and input format. Experiments demonstrate that DeltaDirect dramatically improves motion direction accuracy from 25.9% to 85.4% on MoDirect-SynBench and yields a 21.9-point gain on MoDirect-RealBench without real-world motion labels, while maintaining or improving general video understanding benchmarks.
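The auxiliary objective can be sketched as follows. Several details here are assumptions made for illustration rather than facts from the paper: mean pooling over spatial tokens, a single linear prediction head, unit-length normalization of the target vectors, and a mean squared error loss.

```python
import numpy as np

def delta_direct_loss(projector_tokens, motion_vectors, head_weight, head_bias):
    """
    Sketch of a motion-vector auxiliary loss on adjacent-frame feature deltas.

    projector_tokens: (T, N, D) projector outputs for T frames, N spatial tokens each.
    motion_vectors:   (T-1, 2) displacement of the object between adjacent frames.
    head_weight, head_bias: linear head parameters with shapes (D, 2) and (2,).
    """
    deltas = projector_tokens[1:] - projector_tokens[:-1]   # (T-1, N, D) adjacent-frame deltas
    pooled = deltas.mean(axis=1)                            # (T-1, D) spatial mean pooling (assumed)
    pred = pooled @ head_weight + head_bias                 # (T-1, 2) predicted motion vectors
    norms = np.linalg.norm(motion_vectors, axis=1, keepdims=True)
    target = motion_vectors / np.maximum(norms, 1e-8)       # unit-length targets (assumed normalization)
    return float(np.mean((pred - target) ** 2))             # mean squared error

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16, 32))      # 8 frames, 16 tokens per frame, 32-d features
motion = np.tile([3.0, 0.0], (7, 1))       # object moves right by 3 pixels per frame
head_w = np.zeros((32, 2))                 # untrained head: always predicts (0, 0)
head_b = np.zeros(2)

aux_loss = delta_direct_loss(tokens, motion, head_w, head_b)
print(aux_loss)  # 0.5: squared error of 1 on x and 0 on y, averaged
```

Because the head reads only from projector outputs and is dropped after training, a loss of this shape adds no parameters or computation at inference time.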

This work not only reveals a critical perceptual-language integration gap in Video-LLMs but also offers a practical, effective solution that enhances foundational motion understanding without inference overhead. The findings pave the way for more robust multimodal models capable of precise dynamic scene interpretation, with implications for video question answering, robotics, and autonomous systems. Future work may focus on improving generalization to complex real-world scenes, reducing reliance on labeled data, and extending motion understanding to richer and 3D motion patterns.

Deep Analysis

Background

Video Large Language Models (Video-LLMs) integrate visual encoders with large language models to enable understanding and generation of language grounded in video content. Recent advances, including models like LLaVA-Video and Gemini2.5-Flash, have demonstrated impressive capabilities in temporal reasoning, action recognition, and long-form video comprehension. Benchmarks such as Something-Something V2 (SSv2), KTH action dataset, and TOMATO evaluate these temporal and motion-related abilities. Despite these advances, foundational perceptual primitives like signed image-plane motion direction recognition remain underexplored. Motion direction is critical for visual navigation, physical interaction, and temporal reasoning, serving as a building block for complex video understanding. The gap in evaluating and improving this primitive limits the overall effectiveness and reliability of Video-LLMs in dynamic scene interpretation.

Core Problem

The core problem addressed is the inability of current Video-LLMs to accurately recognize basic signed motion directions—left, right, up, and down—in videos featuring a single moving object. Although seemingly trivial, experiments reveal that most models perform near random chance (~25%) on this task. The bottleneck is not the absence of motion direction information—linear probes confirm its presence in visual encoder outputs, projector outputs, and LLM hidden states—but rather the failure of the final language output token to bind this internal motion signal to the correct prompt-specific answer option. This 'direction binding gap' prevents the model from verbalizing its perceptual understanding correctly, undermining downstream tasks requiring precise motion comprehension.

Innovation

This work introduces several key innovations: 1) The identification and formalization of the 'direction binding gap,' a novel conceptual framework pinpointing the failure mode in Video-LLMs' motion direction recognition as a language readout binding issue rather than perceptual absence. 2) The creation of the MoDirect dataset family, comprising four controlled subsets combining synthetic and real foregrounds and backgrounds, designed as randomized multiple-choice questions to rigorously test motion direction binding. 3) The use of concept vector analysis to reveal that while motion direction representations share orientation across domains, their magnitude diminishes with increasing visual complexity, explaining out-of-domain generalization failures. 4) The design of DeltaDirect, a training-only auxiliary objective applied at the projector output layer that predicts normalized 2-D motion vectors from adjacent-frame feature deltas, directly reinforcing signed displacement cues before language decoding. 5) A training regime integrating standard next-token prediction with motion vector prediction loss, improving motion direction accuracy without altering inference-time architecture or input formats.

Methodology

Dataset Construction: MoDirect comprises four subsets: Primitive-on-Syn (geometric shapes on uniform backgrounds), Cutout-on-Syn (real object cutouts on uniform backgrounds), Primitive-on-Real (geometric shapes on natural backgrounds), and Cutout-on-Real (real object cutouts on natural backgrounds). Each video contains a single object moving in one of four directions.

Task Design: Multiple-choice questions with randomized answer option orders require models to bind perceived motion direction to the correct textual answer, preventing fixed label shortcuts.

Diagnostic Analysis: Linear probing classifiers are trained on frozen representations from the vision encoder, projector, and LLM hidden states to assess motion direction information accessibility.

Concept Vector Analysis: Difference-in-means vectors are computed per motion direction and domain to assess orientation alignment and magnitude differences across domains.

DeltaDirect Auxiliary Objective: At the projector output, adjacent-frame feature differences are spatially pooled to form motion-change descriptors. A lightweight prediction head regresses normalized 2-D motion vectors representing signed displacement directions.

Training Procedure: The total loss combines standard next-token cross-entropy with mean squared error for motion vector prediction. Only the projector and prediction head are updated; the vision encoder and LLM weights remain frozen.

Inference: The auxiliary prediction head is removed, preserving original input formats, model architecture, and decoding procedures.
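The randomized multiple-choice format from the task design step can be sketched as below. The question wording, option letters, and function name are hypothetical; the point is only that the correct letter changes from sample to sample, so a model cannot succeed by memorizing a fixed letter for each direction.

```python
import random

DIRECTIONS = ["left", "right", "up", "down"]
LETTERS = "ABCD"

def build_mcq(true_direction, rng):
    """Return (prompt, answer_letter) with the four options in a random order."""
    options = DIRECTIONS[:]
    rng.shuffle(options)                                   # randomize option order per sample
    lines = [f"{letter}. {option}" for letter, option in zip(LETTERS, options)]
    prompt = "In which direction does the object move?\n" + "\n".join(lines)
    answer = LETTERS[options.index(true_direction)]        # letter now tied to this prompt only
    return prompt, answer

rng = random.Random(0)
prompt, answer = build_mcq("left", rng)
print(prompt)
print("Answer:", answer)
```

Scoring then compares the model's chosen letter with `answer`, so a correct response requires binding the perceived direction to the option text shown in that specific prompt.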

Experiments

Experiments are conducted on MoDirect synthetic and real subsets, as well as real-world benchmarks including Something-Something V2 (SSv2), TOMATO, and KTH datasets. Baseline models include LLaVA-Video-7B, Gemini2.5-Flash, and others. Evaluation metric is multiple-choice question accuracy on motion direction recognition. Models are fine-tuned with LoRA adapters on the projector and LLM, with the vision encoder frozen. Ablation studies compare supervision applied at different layers (vision encoder, pre/post-projector, LLM hidden states, final readout) and different motion signal formulations (feature concatenation, delta equivariance, normalized motion vectors). Hyperparameters and training details are provided in the appendix. The impact on general video understanding benchmarks is also assessed to verify no degradation.
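The accuracy metric, and the reason a biased model can sit at chance level, can be shown with a short sketch. The helper names and the toy predictions are illustrative assumptions; the example uses a balanced answer set and a model that always predicts the same direction.

```python
from collections import Counter

DIRECTIONS = ("left", "right", "up", "down")

def mcq_accuracy(predictions, answers):
    """Fraction of questions where the predicted direction matches the answer."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def prediction_distribution(predictions):
    """Share of predictions assigned to each direction, to expose answer bias."""
    counts = Counter(predictions)
    return {d: counts.get(d, 0) / len(predictions) for d in DIRECTIONS}

answers = list(DIRECTIONS) * 25          # balanced benchmark: 25 samples per direction
always_right = ["right"] * 100           # degenerate model that ignores the video

acc = mcq_accuracy(always_right, answers)
dist = prediction_distribution(always_right)
print(acc)    # 0.25
print(dist)   # all probability mass on "right"
```

Reporting the prediction distribution alongside accuracy helps separate genuine direction recognition from a skewed prior over answers, which the summary identifies as the source of some above-chance results.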

Results

DeltaDirect instruction tuning improves motion direction accuracy on MoDirect-SynBench from 25.9% to 85.4%, a dramatic gain over baselines. On MoDirect-RealBench, it achieves a 21.9-point improvement without real-world motion labels, demonstrating strong zero-shot transfer. Ablations show that applying supervision at the projector output yields the best performance, while readout layer supervision is ineffective despite the binding gap there. Predicting normalized 2-D motion vectors outperforms other motion signal targets. Fine-tuned models maintain or slightly improve performance on standard and fine-grained video understanding benchmarks, indicating that motion direction supervision complements rather than compromises general video comprehension.

Applications

The proposed method could benefit video question answering systems by enabling accurate motion direction comprehension, improving answer correctness and user experience. It could also strengthen intelligent surveillance systems' ability to detect and interpret object motion, aiding anomaly detection and event prediction. In robotics, improved motion direction perception supports visual navigation and obstacle avoidance, both critical for autonomous operation. Because DeltaDirect leaves the inference-time architecture and input format unchanged, it can be integrated into existing Video-LLMs deployed in autonomous driving, augmented reality, and other dynamic scene understanding applications.

Limitations & Outlook

Despite significant improvements, DeltaDirect's accuracy decreases in highly complex real-world scenes due to diminished motion signal magnitude, limiting generalization. The reliance on synthetic motion direction annotations constrains applicability where real-world labeled data is scarce. The current scope is limited to four cardinal motion directions, excluding complex trajectories and 3D motion patterns. Further work is needed to address these limitations and extend the framework's applicability.

Abstract

Video Large Language Models (Video-LLMs) have made rapid progress on temporal video understanding, yet many fail at a basic perceptual primitive: signed image-plane motion direction. On simple videos of a single object moving left, right, up, or down, most Video-LLMs perform near chance, with above-chance cases largely attributable to prediction biases rather than genuine direction understanding. We call this failure directional motion blindness. We localize the failure by tracing motion direction information through the Video-LLM pipeline. Motion direction remains linearly accessible from the vision encoder, projector, and LLM hidden states, but the readout fails to bind this signal to the correct verbal answer option, revealing a direction binding gap. Although synthetic motion direction instruction tuning reduces this gap on the source domain, motion direction concept vector analysis shows that visual complexity weakens the signal magnitude and limits out-of-domain generalization. We introduce MoDirect, a dataset family for motion direction instruction tuning and evaluation, and DeltaDirect, a diagnosis-driven, projector-level objective that predicts normalized 2-D motion vectors from adjacent-frame feature deltas. On MoDirect-SynBench, instruction tuning with DeltaDirect improves motion direction accuracy from 25.9% to 85.4%. On MoDirect-RealBench, DeltaDirect improves real-world motion direction accuracy by 21.9 points over the vanilla baseline without real-world tuning data, while preserving standard video-understanding performance. Code: https://github.com/KHU-VLL/DeltaDirect
