WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG

Key Findings

Methodology

The WildWorld dataset is collected using an automated toolchain from a AAA action role-playing game, containing over 108 million frames with a rich action space and explicit state annotations. The dataset is designed to support long-horizon compositional action sequence modeling and state evolution analysis. The WildBench benchmark evaluates model performance through Action Following and State Alignment.

Key Results

The WildWorld dataset includes over 450 actions, providing rich semantic information and diverse interaction scenarios, supporting long-horizon world state consistency modeling.
Experimental results reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation.
The WildBench benchmark shows that existing models have limited performance in action following and state alignment, providing directions for future research.

Significance

The WildWorld dataset provides a significant foundation for dynamic world modeling in generative ARPGs. By offering explicit state annotations and a rich action space, it addresses the deficiencies in existing datasets regarding action semantics and state evolution, providing researchers with a powerful tool to develop and evaluate interactive world models.

Technical Contribution

WildWorld offers a large-scale, semantically rich action space with explicit state annotations through an automated data collection toolchain, supporting long-horizon state evolution analysis. The WildBench benchmark introduces two key evaluation metrics: Action Following and State Alignment, offering new perspectives for model performance evaluation.

Novelty

WildWorld is the first large-scale action-conditioned world modeling dataset with explicit state annotations, filling the gap in existing datasets that lack semantically rich action and state information.

Limitations

The dataset primarily originates from a single game environment, which may limit the model's generalization capabilities to other environments.
The rules of the automated toolchain may lead to insufficient behavioral diversity.
The scale and complexity of the dataset may impose high demands on computational resources.

Future Work

Future research can explore how to apply the WildWorld dataset in different gaming environments to develop more generalizable models. Additionally, improving the automated toolchain to increase behavioral diversity and optimizing the use of computational resources are important research directions.

AI Executive Summary

The introduction of the WildWorld dataset provides a new platform for dynamic world modeling in generative action role-playing games (ARPGs). Existing datasets often lack diverse and semantically meaningful action spaces, making it difficult for models to learn structured world dynamics and maintain consistent evolution over long horizons. WildWorld addresses these issues by automatically collecting data from a AAA action role-playing game, offering over 108 million frames and more than 450 actions, covering various interaction scenarios such as movement, attacks, and skill casting.

The dataset is designed to address the problem of actions being directly tied to visual observations in existing datasets. By providing explicit state annotations, WildWorld enables models to better learn action-conditioned state transitions, thereby supporting long-horizon world state consistency modeling. To evaluate model performance, the researchers also developed the WildBench benchmark, which assesses models through two key metrics: Action Following and State Alignment.

Experimental results indicate that existing models face challenges in modeling semantically rich actions and maintaining long-horizon state consistency. This finding underscores the necessity for state-aware video generation and provides directions for future research. The introduction of the WildWorld dataset not only offers researchers a powerful tool to develop and evaluate interactive world models but also lays the foundation for dynamic world modeling in generative ARPGs.

Despite the progress made in providing a rich action space and explicit state annotations, the WildWorld dataset primarily originates from a single game environment, which may limit the model's generalization capabilities to other environments. Additionally, the rules of the automated toolchain may lead to insufficient behavioral diversity. Future research can explore how to apply the WildWorld dataset in different gaming environments to develop more generalizable models.

Overall, the WildWorld dataset provides a significant foundation for dynamic world modeling in generative ARPGs. By offering explicit state annotations and a rich action space, it addresses the deficiencies in existing datasets regarding action semantics and state evolution, providing researchers with a powerful tool to develop and evaluate interactive world models.

Deep Analysis

Background

In recent years, significant progress has been made in video generation and world models. Many approaches attempt to learn environment dynamics from large-scale video datasets by training generative models. However, existing datasets often provide only simple action annotations with limited semantic meaning, making it difficult for models to learn structured action-conditioned dynamics and maintain consistent evolution over long horizons. The introduction of the WildWorld dataset aims to address these issues by providing explicit state annotations and a rich action space, offering a new platform for dynamic world modeling in generative ARPGs.

Core Problem

Existing datasets often lack diverse and semantically meaningful action spaces, making it difficult for models to learn structured world dynamics and maintain consistent evolution over long horizons. Additionally, actions are directly tied to visual observations, making it challenging for models to disentangle state transitions from observation variations. These issues limit the performance of current models in long-horizon prediction tasks, where small errors accumulate over time, leading to noticeable inconsistencies or instability in the generated results.

Innovation

The WildWorld dataset addresses these challenges by automatically collecting data from a AAA action role-playing game, offering over 108 million frames and more than 450 actions, covering various interaction scenarios such as movement, attacks, and skill casting. The dataset is designed to address the problem of actions being directly tied to visual observations in existing datasets. By providing explicit state annotations, WildWorld enables models to better learn action-conditioned state transitions, thereby supporting long-horizon world state consistency modeling.

Methodology

�� Data Collection: Automatically collected from a AAA action role-playing game using an automated toolchain.
�� Data Annotation: Provides explicit state annotations, including character skeletons, world states, camera poses, and depth maps.
�� Dataset Scale: Contains over 108 million frames and more than 450 actions.
�� WildBench Benchmark: Evaluates model performance through two key metrics: Action Following and State Alignment.

Experiments

The experimental design includes using the WildWorld dataset for long-horizon compositional action sequence modeling and state evolution analysis. The WildBench benchmark evaluates model performance through two key metrics: Action Following and State Alignment. Experimental results indicate that existing models face challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the necessity for state-aware video generation.

Results

The WildWorld dataset includes over 450 actions, providing rich semantic information and diverse interaction scenarios, supporting long-horizon world state consistency modeling. Experimental results reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation. The WildBench benchmark shows that existing models have limited performance in action following and state alignment, providing directions for future research.

Applications

The WildWorld dataset provides a significant foundation for dynamic world modeling in generative ARPGs. By offering explicit state annotations and a rich action space, it addresses the deficiencies in existing datasets regarding action semantics and state evolution, providing researchers with a powerful tool to develop and evaluate interactive world models.

Limitations & Outlook

Despite the progress made in providing a rich action space and explicit state annotations, the WildWorld dataset primarily originates from a single game environment, which may limit the model's generalization capabilities to other environments. Additionally, the rules of the automated toolchain may lead to insufficient behavioral diversity. Future research can explore how to apply the WildWorld dataset in different gaming environments to develop more generalizable models.

Plain Language Accessible to non-experts

Imagine you're playing a complex role-playing game. There are many characters, each with different actions like attacking, moving, and casting spells. Now, suppose you want a computer to learn how to control these characters in the game as smartly as you do. That's what the WildWorld dataset does. It's like a huge library of game recordings, capturing every action and state of each character in the game. Through these recordings, computers can learn how to make the right decisions in the game, just like an experienced player.

But WildWorld is not just an ordinary library of recordings. It also provides the 'secret information' behind each action, like the skeleton of the character, the state of the world, and the position of the camera. This information acts like a 'manual' for the game, helping the computer understand the true meaning of each action.

By learning from this information, computers can make smarter decisions in the game, like knowing when to attack and when to dodge. This ability is crucial for developing smarter game characters and more complex game worlds.

In short, the WildWorld dataset is like a 'wisdom book' for games, helping computers learn how to make smart decisions in complex game worlds.

ELI14 Explained like you're 14

Hey there! Imagine you're playing a super cool game like Monster Hunter. In the game, you can control characters to fight monsters, cast spells, and even ride mounts across the map. Now, imagine if a computer could play this game as smartly as you do. How cool would that be?

That's what the WildWorld dataset does. It's like a super big library of game recordings, capturing every action of each character in the game, like attacking, moving, and casting spells. Through these recordings, computers can learn how to make the right decisions in the game, just like an experienced player.

But that's not all! The WildWorld dataset also provides the 'secret information' behind each action, like the skeleton of the character, the state of the world, and the position of the camera. This information acts like a 'manual' for the game, helping the computer understand the true meaning of each action.

So, the WildWorld dataset is like a 'wisdom book' for games, helping computers learn how to make smart decisions in complex game worlds. Isn't that awesome?

Glossary

WildWorld

WildWorld is a large-scale action-conditioned world modeling dataset with explicit state annotations and a rich action space.

Used for dynamic world modeling in generative ARPGs.

AAA Game

AAA games are high-budget, high-quality video games typically developed by large game companies.

WildWorld dataset is collected from a AAA action role-playing game.

Action Role-Playing Game (ARPG)

ARPG is a genre of games combining action and role-playing elements, where players control characters to fight and explore.

WildWorld dataset supports dynamic world modeling in generative ARPGs.

Action Tracking

Action tracking evaluates whether the model accurately reproduces input actions in the generated video.

A key evaluation metric in the WildBench benchmark.

State Alignment

State alignment evaluates whether the model accurately reproduces input states in the generated video.

A key evaluation metric in the WildBench benchmark.

Generative Video

Generative video refers to synthetic videos generated by models, typically based on input images or text.

WildWorld dataset supports research in generative video.

Long-Horizon Sequence

Long-horizon sequence refers to a continuous data sequence over an extended period, often used for analyzing dynamic changes.

WildWorld dataset supports long-horizon world state consistency modeling.

Explicit State Annotation

Explicit state annotation refers to clearly marking the state information of each frame in the dataset, such as character skeletons and world states.

WildWorld dataset provides explicit state annotations to support model learning.

Automated Toolchain

An automated toolchain is a set of automated software tools used for efficiently collecting and processing data.

Used to automatically collect data for the WildWorld dataset.

Benchmark

A benchmark is a standardized test set or method used to evaluate model performance.

WildBench benchmark is used to evaluate model performance on the WildWorld dataset.

Open Questions Unanswered questions from this research

1 How can the WildWorld dataset be applied in different gaming environments to improve model generalization capabilities? Existing datasets primarily originate from a single game environment, which may limit model performance in other environments.
2 How can the automated toolchain be improved to increase behavioral diversity? Current toolchain rules may lead to insufficient behavioral diversity, affecting model learning outcomes.
3 How can computational resources be optimized to handle large-scale datasets? The scale and complexity of the WildWorld dataset may impose high demands on computational resources.
4 How can existing models' performance in long-horizon prediction tasks be improved? Current models may experience issues with error accumulation over long sequences.
5 How can explicit state annotations be applied to other types of datasets? The explicit state annotations in the WildWorld dataset provide important information for model learning, but how to achieve similar annotations in other datasets remains to be explored.

Applications

Immediate Applications

Game AI Development

Game developers can use the WildWorld dataset to train smarter game AI, enhancing the decision-making capabilities of game characters.

Video Generation Research

Researchers can use the dataset for generative video research, exploring new video generation techniques.

Interactive System Design

Designers can use the dataset to develop more interactive systems, improving user experience.

Long-term Vision

Intelligent Game Worlds

By continuously improving models and datasets, more intelligent and complex game worlds can be realized in the future.

Cross-Domain Applications

The technology from the WildWorld dataset can be applied to other fields, such as autonomous driving and robotics, driving cross-domain technological advancements.

Abstract

Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing partial information about the state. Recent video world models attempt to learn this action-conditioned dynamics from data. However, existing datasets rarely match the requirement: they typically lack diverse and semantically meaningful action spaces, and actions are directly tied to visual observations rather than mediated by underlying states. As a result, actions are often entangled with pixel-level changes, making it difficult for models to learn structured world dynamics and maintain consistent evolution over long horizons. In this paper, we propose WildWorld, a large-scale action-conditioned world modeling dataset with explicit state annotations, automatically collected from a photorealistic AAA action role-playing game (Monster Hunter: Wilds). WildWorld contains over 108 million frames and features more than 450 actions, including movement, attacks, and skill casting, together with synchronized per-frame annotations of character skeletons, world states, camera poses, and depth maps. We further derive WildBench to evaluate models through Action Following and State Alignment. Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation. The project page is https://shandaai.github.io/wildworld-project/.

cs.CV

Key Findings

Methodology

Key Results

Significance

Technical Contribution

Novelty

Limitations

Future Work

AI Executive Summary

Deep Analysis

Background

Core Problem

Innovation

Methodology

Experiments

Results

Applications

Limitations & Outlook

Plain Language Accessible to non-experts

ELI14 Explained like you're 14

Glossary

WildWorld

AAA Game

Action Role-Playing Game (ARPG)

Action Tracking

State Alignment

Generative Video

Long-Horizon Sequence

Explicit State Annotation

Automated Toolchain

Benchmark

Open Questions Unanswered questions from this research

Applications

Immediate Applications

Game AI Development

Video Generation Research

Interactive System Design

Long-term Vision

Intelligent Game Worlds

Cross-Domain Applications

Abstract

Related Papers

Deployment-Aligned Low-Precision Neural Architecture Search for Spaceborne Edge AI

DeepTaxon: An Interpretable Retrieval-Augmented Multimodal Framework for Unified Species Identification and Discovery

Learn&Drop: Fast Learning of CNNs based on Layer Dropping

SS3D: End2End Self-Supervised 3D from Web Videos

PASR: Pose-Aware 3D Shape Retrieval from Occluded Single Views

A Non-Invasive Alternative to RFID: Self-Sufficient 3D Identification of Group-Housed Livestock