Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models

Key Findings

Methodology

The paper introduces STEVO-Bench, a benchmark that evaluates whether video world models keep evolving the world state while observation is interrupted. The benchmark applies observation control by inserting occluders, turning off the lights, or specifying camera 'lookaway' trajectories. By evaluating video models both with and without camera control, the study exposes how poorly current video world models decouple state evolution from observation.

Key Results

Result 1: Video models show less than 10% success rate in state evolution under observation control. For instance, Veo 3 and Sora 2 Pro have success rates of 8.7% and 8.1% respectively in state evolution tasks.
Result 2: Camera-controlled models almost never succeed in state evolution tasks, exhibiting a strong bias towards static scenes.
Result 3: Memory modules fail to effectively decouple state evolution from observation, exacerbating the bias towards static scenes.

Significance

This study highlights the limitations of current video world models in handling observation interruptions, which is crucial for generating larger-scale world models and supporting longer-horizon interactions. STEVO-Bench's evaluation allows researchers to identify failure modes in natural state evolution, providing guidance for future model improvements.

Technical Contribution

The paper presents STEVO-Bench, the first systematic benchmark to evaluate video world models' ability to evolve state during observation interruptions. This benchmark not only covers physical plausibility and coherence but also introduces evaluation of state evolution progress, filling a gap in existing benchmarks.

Novelty

STEVO-Bench is the first benchmark focused on evaluating video world models' ability to evolve state during observation interruptions. Unlike existing benchmarks, it comprehensively covers physical plausibility, coherence, and state evolution progress.

Limitations

Limitation 1: Video models show extremely low success rates in state evolution under observation control, indicating significant deficiencies in handling observation interruptions.
Limitation 2: Camera-controlled models exhibit a strong bias towards static scenes, leading to poor performance in dynamic processes.
Limitation 3: Memory modules fail to effectively decouple state evolution from observation, exacerbating the bias towards static scenes.

Future Work

Future research can explore new architectural designs to better support state evolution during observation interruptions. Additionally, developing new datasets and training strategies to reduce models' bias towards static scenes is an important direction.

AI Executive Summary

In the realm of artificial intelligence research, video world models are employed to generate visual worlds by synthesizing image frames to simulate changes in objects and properties. However, whether these models can continue evolving the world state during observation interruptions remains an unresolved question. To explore this, researchers have designed a benchmark called STEVO-Bench to evaluate video world models' ability to evolve state during observation interruptions.

STEVO-Bench applies observation control by inserting occluders, turning off the lights, or specifying camera 'lookaway' trajectories. By evaluating video models both with and without camera control, the researchers expose the limitations of current video world models in decoupling state evolution from observation. Experimental results show that video models succeed in less than 10% of state evolution tasks under observation control, while camera-controlled models almost never succeed.

This finding indicates that current video world models handle observation interruptions poorly, a capability that matters for generating larger-scale worlds and supporting longer-horizon interactions. STEVO-Bench's evaluation covers not only physical plausibility and coherence but also state evolution progress, filling a gap in existing benchmarks.

Furthermore, the study found that memory modules fail to effectively decouple state evolution from observation, exacerbating the bias towards static scenes. This result suggests that future research needs to explore new architectural designs to better support state evolution during observation interruptions.

In conclusion, STEVO-Bench provides a new tool for evaluating video world models' ability to evolve state during observation interruptions, revealing the limitations of current models and pointing the way for future research.

Deep Analysis

Background

Video world models have become a significant research area in artificial intelligence. These models simulate changes in objects and properties by generating visual worlds, with applications in autonomous driving, robotic navigation, and more. Despite numerous studies aimed at improving video generation quality and consistency, whether models can continue evolving the world state during observation interruptions remains an unresolved question. Existing benchmarks primarily focus on physical plausibility and coherence, neglecting the critical aspect of state evolution progress.

Core Problem

The ability of video world models to evolve state during observation interruptions is an unsolved problem. Current models often fail to correctly evolve the world state when handling observation interruptions. The core issue lies in whether models can continue evolving state in the absence of observation, which is crucial for generating larger-scale world models and supporting longer-horizon interactions.

Innovation

STEVO-Bench is the first benchmark focused on evaluating video world models' ability to evolve state during observation interruptions. Its innovations include: 1) Introducing evaluation of state evolution progress, filling a gap in existing benchmarks; 2) Applying observation control by inserting occluders, turning off lights, or specifying camera 'lookaway' trajectories; 3) Providing an automated verification protocol to detect and disentangle failure modes in natural state evolution.
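To make the idea of disentangling failure modes concrete, the following is a minimal sketch, not the paper's protocol. It assumes each generated video has already been judged on the three aspects named above. The `Judgement` fields, the `Outcome` labels, the check order, and the `min_progress` threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    # Hypothetical outcome labels; the paper's own taxonomy may differ.
    SUCCESS = "success"
    INCOHERENT = "incoherent scene"
    IMPLAUSIBLE = "physically implausible"
    NO_PROGRESS = "state did not evolve"


@dataclass
class Judgement:
    # Assumed per-video verdicts, e.g. produced by an automated judge.
    coherent: bool
    physically_plausible: bool
    progress: float  # assumed fraction of the expected evolution, 0.0 to 1.0


def classify(j: Judgement, min_progress: float = 0.5) -> Outcome:
    """Assign exactly one outcome per video by checking aspects in a fixed order."""
    if not j.coherent:
        return Outcome.INCOHERENT
    if not j.physically_plausible:
        return Outcome.IMPLAUSIBLE
    if j.progress < min_progress:
        # Scene is consistent and plausible, but the process stalled.
        return Outcome.NO_PROGRESS
    return Outcome.SUCCESS
```

Checking the aspects in a fixed order means every failed video receives a single label, so a static-scene bias shows up as a high count of `NO_PROGRESS` rather than being mixed with coherence or plausibility failures.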

Methodology

The design of STEVO-Bench includes the following steps:

1. Apply observation control by inserting occluders, turning off the lights, or specifying camera 'lookaway' trajectories.
2. Evaluate video models both with and without camera control on their ability to evolve state during observation interruptions.
3. Use an automated verification protocol to detect and disentangle failure modes in natural state evolution.
4. Compare the performance of different models to reveal their limitations in decoupling state evolution from observation.
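The first step above can be sketched as a small prompt builder. This is an illustration only: the abstract says observation control is applied via instructions, but the wording of the templates, the `OBSERVATION_CONTROLS` keys, and the `build_instruction` function are hypothetical and are not taken from the benchmark.

```python
from typing import Optional

# Hypothetical instruction templates, one per observation-control type.
OBSERVATION_CONTROLS = {
    "occluder": "An opaque object moves in front of the scene and later moves away.",
    "lights_off": "The lights turn off and later turn back on.",
    "lookaway": "The camera turns away from the scene and later turns back.",
}


def build_instruction(process: str, control: Optional[str] = None) -> str:
    """Combine a description of an evolving process with an optional interruption.

    With control=None the prompt describes the uninterrupted process, which
    serves as the reference condition for the same task.
    """
    base = process.strip()
    if control is None:
        return base
    # Raises KeyError for an unknown control type instead of guessing.
    return f"{base} {OBSERVATION_CONTROLS[control]}"


if __name__ == "__main__":
    task = "Water is poured steadily into an empty glass."
    for control in (None, "occluder", "lights_off", "lookaway"):
        print(control, "->", build_instruction(task, control))
```

Generating the uninterrupted and interrupted prompts from the same process description keeps the two conditions comparable, so a difference in outcome can be attributed to the interruption.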

Experiments

The experimental design includes:

- Datasets: 225 unique tasks covering six different categories of natural evolution processes.
- Baselines: Compare the performance of different video models and camera-controlled models.
- Metrics: Evaluate physical plausibility, coherence, and state evolution progress.
- Hyperparameters: Adjust models' observation control strategies to optimize their performance in state evolution tasks.
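As a rough illustration of how per-task verdicts could be rolled up into the success rates reported for such a benchmark, the sketch below groups pass/fail results by category. The category names in the example and the `success_rates` helper are assumptions; the source states only that there are six categories and does not specify how a task is scored as a success.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of passed tasks per category, plus an 'overall' entry.

    `results` holds (category, passed) pairs, one per evaluated task.
    """
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)

    rates = {c: passes[c] / totals[c] for c in totals}
    n = sum(totals.values())
    # Overall rate weights every task equally, not every category.
    rates["overall"] = sum(passes.values()) / n if n else 0.0
    return rates


if __name__ == "__main__":
    # Hypothetical category labels and verdicts, for illustration only.
    demo = [("melting", True), ("melting", False),
            ("pouring", False), ("pouring", False)]
    print(success_rates(demo))
```

Reporting per-category rates alongside the overall rate helps show whether failures concentrate in particular kinds of evolution or are spread evenly.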

Results

Experimental results show:

- Video models have less than a 10% success rate in state evolution under observation control.
- Camera-controlled models almost never succeed in state evolution tasks, exhibiting a strong bias towards static scenes.
- Memory modules fail to effectively decouple state evolution from observation, exacerbating the bias towards static scenes.

Applications

The application scenarios of STEVO-Bench include:

- Evaluating the performance of video world models in autonomous driving, robotic navigation, and other fields.
- Helping researchers identify failure modes in natural state evolution, providing guidance for future model improvements.
- Providing reference for developing new datasets and training strategies to reduce models' bias towards static scenes.

Limitations & Outlook

STEVO-Bench provides a new tool for evaluating video world models' ability to evolve state during observation interruptions. Its results point to the following limitations of current models and directions for future work:

- Current models handle observation interruptions poorly, a capability that matters for generating larger-scale worlds and supporting longer-horizon interactions.
- Memory modules fail to effectively decouple state evolution from observation, exacerbating the bias towards static scenes.
- Future research needs to explore new architectural designs to better support state evolution during observation interruptions.

Plain Language (accessible to non-experts)

Imagine you're watching a theater performance, where actors perform various actions and scenes on stage. Suddenly, the stage lights go out, and you can't see what they're doing. But when the lights come back on, you expect to see the actors continuing their performance, not frozen in place or doing something illogical. This is similar to the ability of video world models to evolve state during observation interruptions. Researchers have designed a benchmark called STEVO-Bench to evaluate whether these models can continue evolving the world state when observation is interrupted. By inserting occluders, turning off lights, or specifying camera 'lookaway' trajectories, researchers tested the models' performance in such scenarios. The results show that current models have significant deficiencies in handling observation interruptions, especially in generating larger-scale world models and supporting longer-horizon interactions. This study provides important guidance for future model improvements.

ELI14 (explained like you're 14)

Hey there! Imagine you're playing a super cool video game where the world changes automatically, like trees growing or rivers flowing. Suddenly, your screen goes black, and you can't see the changes in the game. But when the screen lights up again, you expect the game world to keep changing, not just pause. That's like what scientists are studying with video world models! They designed a tool called STEVO-Bench to test if these models can keep changing when you can't see them. Turns out, most models don't do well in this situation, like the game just paused. This research is important because it tells us how to improve these models for future applications, like self-driving cars and robots!

Glossary

Video World Model

A model that simulates changes in objects and properties by generating visual worlds.

In this paper, the kind of model whose ability to evolve state during observation interruptions is evaluated.

State Evolution

The process of change in objects or properties over time.

Evaluating whether models can continue evolving during observation interruptions.

Observation Control

Controlling the observation process by inserting occluders, turning off lights, or specifying camera trajectories.

Used to test models' performance during observation interruptions.

STEVO-Bench

A benchmark for evaluating video world models' ability to evolve state during observation interruptions.

Used to reveal models' limitations in decoupling state evolution from observation.

Physical Plausibility

Whether state evolution conforms to physical laws.

Evaluating whether models' generated state evolution is reasonable.

Coherence

The consistency of objects and scenes in a video.

Evaluating whether models can maintain coherence during observation interruptions.

Memory Module

A module used to store and recall object states.

Evaluating its role in decoupling state evolution from observation.

Camera-Controlled Model

A model that generates videos based on specified camera trajectories.

Evaluating its ability to evolve state during observation interruptions.

Dynamic Process

The dynamic change of objects or properties over time.

Evaluating whether models can continue evolving during observation interruptions.

Benchmark

A standardized test used to evaluate model performance.

STEVO-Bench is used to evaluate video world models' state evolution ability.

Open Questions (unanswered questions from this research)

1. Current video world models evolve state poorly during observation interruptions, a capability that matters for generating larger-scale worlds and supporting longer-horizon interactions. Solving this issue requires new architectural designs and training strategies.
2. Memory modules fail to effectively decouple state evolution from observation, exacerbating the bias towards static scenes. Future research needs to explore how to design more effective memory modules to support state evolution during observation interruptions.
3. Camera-controlled models perform poorly in dynamic processes, exhibiting a strong bias towards static scenes. This issue may be related to training data biases, requiring the development of new datasets and training strategies.
4. STEVO-Bench's evaluation results show significant deficiencies in current models when handling observation interruptions. This suggests that future research needs to explore new architectural designs to better support state evolution during observation interruptions.
5. Although STEVO-Bench provides a new tool for evaluating video world models' ability to evolve state during observation interruptions, its effectiveness in different application scenarios still needs further validation.

Applications

Immediate Applications

Autonomous Driving

Evaluate autonomous driving systems' decision-making ability during observation interruptions to ensure vehicle safety in complex environments.

Robotic Navigation

Help robots continue navigation during observation interruptions to avoid collisions and getting lost.

Virtual Reality

Enhance the immersion of virtual reality systems during observation interruptions to ensure continuity of user experience.

Long-term Vision

Smart Cities

Improve smart city systems' response capabilities during observation interruptions to optimize city management.

Human-Computer Interaction

Develop smarter human-computer interaction systems that can continue understanding and responding to user needs during observation interruptions.

Abstract

Evolutions in the world, such as water pouring or ice melting, happen regardless of being observed. Video world models generate "worlds" via 2D frame observations. Can these generated "worlds" evolve regardless of observation? To probe this question, we design a benchmark to evaluate whether video world models can decouple state evolution from observation. Our benchmark, STEVO-Bench, applies observation control to evolving processes via instructions of occluder insertion, turning off the light, or specifying camera "lookaway" trajectories. By evaluating video models with and without camera control for a diverse set of naturally-occurring evolutions, we expose their limitations in decoupling state evolution from observation. STEVO-Bench proposes an evaluation protocol to automatically detect and disentangle failure modes of video world models across key aspects of natural state evolution. Analysis of STEVO-Bench results provides new insight into potential data and architecture bias of present-day video world models. Project website: https://glab-caltech.github.io/STEVOBench/. Blog: https://ziqi-ma.github.io/blog/2026/outofsight/

cs.CV
