DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning

TL;DR

DreamVideo-Omni achieves multi-subject video customization with latent identity reinforcement learning, enhancing identity fidelity and motion control precision.

cs.CV 🔴 Advanced 2026-03-13 14 views

Yujie Wei Xinyu Liu Shiwei Zhang Hangjie Yuan Jinbo Xing Zhekai Chen Xiang Wang Haonan Qiu Rui Zhao Yutong Feng Ruihang Chu Yingya Zhang Yike Guo Xihui Liu Hongming Shan

AI Reader Arxiv Page Download PDF

video generation motion control multi-subject reinforcement learning identity fidelity

Key Findings

Methodology

DreamVideo-Omni employs a unified framework with a progressive two-stage training paradigm for multi-subject video customization and omni-motion control. In the first stage, comprehensive control signals are integrated for joint training, encompassing subject appearances, global motion, local dynamics, and camera movements. A condition-aware 3D rotary positional embedding coordinates heterogeneous inputs, and a hierarchical motion injection strategy enhances global motion guidance. In the second stage, to mitigate identity degradation, a latent identity reward feedback learning paradigm is designed by training a latent identity reward model on a pretrained video diffusion backbone, providing motion-aware identity rewards that prioritize identity preservation aligned with human preferences.

Key Results

On the DreamOmni Bench, DreamVideo-Omni demonstrates superior performance in multi-subject and omni-motion control evaluation, with a 15% improvement in identity fidelity and motion control precision over existing methods.
By introducing latent identity reward feedback learning, DreamVideo-Omni achieves a 20% improvement in identity fidelity under large motion scenarios, effectively addressing identity degradation issues prevalent in most existing methods.
In multi-subject scenarios, DreamVideo-Omni significantly reduces motion signal ambiguity through group and role embeddings, achieving an 18% increase in accuracy.

Significance

The research on DreamVideo-Omni is significant in the field of video generation. It not only addresses the long-standing challenges of multi-subject identity fidelity and multi-granularity motion control but also opens new possibilities for practical applications in video generation. By introducing latent identity reward feedback learning, DreamVideo-Omni ensures precise control of identity and motion while maintaining high-quality video generation. This breakthrough provides new research directions for academia and offers more powerful tools for the industry in video customization applications.

Technical Contribution

DreamVideo-Omni's technical contributions lie in its innovative two-stage training paradigm and latent identity reward feedback learning. Unlike existing methods, it achieves coordination of heterogeneous inputs and enhancement of global motion through condition-aware 3D rotary positional embedding and hierarchical motion injection strategy. Additionally, by conducting identity reward feedback learning in the latent space, DreamVideo-Omni effectively addresses identity degradation issues under large motion scenarios, offering new engineering possibilities.

Novelty

DreamVideo-Omni is the first to introduce latent identity reward feedback learning to the field of video generation, addressing the long-standing challenges of multi-subject identity fidelity and motion control. Compared to existing methods, its innovation lies in coordinating heterogeneous inputs and enhancing global motion through condition-aware 3D rotary positional embedding and hierarchical motion injection strategy.

Limitations

DreamVideo-Omni may encounter ambiguity in control signals when handling extremely complex multi-subject scenarios, leading to a decline in identity fidelity in generated videos.
The method requires substantial computational resources for training, limiting its applicability in resource-constrained environments.
In certain specific motion patterns, DreamVideo-Omni may not fully maintain identity consistency.

Future Work

Future research directions include optimizing DreamVideo-Omni's performance in resource-constrained environments and further improving its identity fidelity and motion control precision in extremely complex scenarios. Additionally, exploring the application of latent identity reward feedback learning to other generative tasks such as image and text generation could be beneficial.

AI Executive Summary

In recent years, video generation technology has made significant progress, especially with the advent of diffusion models that enable high-fidelity video synthesis. However, achieving precise identity fidelity and motion control in multi-subject scenarios remains a major challenge. Existing methods often fall short in motion granularity, control ambiguity, and identity degradation, resulting in suboptimal performance in identity preservation and motion control.

To address these issues, this paper presents DreamVideo-Omni, a unified framework that achieves harmonious multi-subject customization and omni-motion control through a progressive two-stage training paradigm. In the first stage, comprehensive control signals are integrated for joint training, including subject appearances, global motion, local dynamics, and camera movements. A condition-aware 3D rotary positional embedding coordinates heterogeneous inputs, and a hierarchical motion injection strategy enhances global motion guidance.

In the second stage, to mitigate identity degradation, a latent identity reward feedback learning paradigm is designed by training a latent identity reward model on a pretrained video diffusion backbone, providing motion-aware identity rewards that prioritize identity preservation aligned with human preferences. This approach ensures precise control of identity and motion while maintaining high-quality video generation.

Experimental results show that DreamVideo-Omni demonstrates superior performance in multi-subject and omni-motion control evaluation, with a 15% improvement in identity fidelity and motion control precision over existing methods. Additionally, by introducing latent identity reward feedback learning, DreamVideo-Omni achieves a 20% improvement in identity fidelity under large motion scenarios.

This research is significant not only for academia but also for industry, providing more powerful tools for video customization applications. However, DreamVideo-Omni may encounter ambiguity in control signals when handling extremely complex multi-subject scenarios, leading to a decline in identity fidelity. Future research directions include optimizing its performance in resource-constrained environments and exploring its application to other generative tasks.

Deep Analysis

Background

Video generation technology has made significant strides in recent years, particularly with the introduction of diffusion models that enable high-fidelity video synthesis. Diffusion models generate videos through a gradual denoising process, allowing for the synthesis of complex scenes while maintaining high quality. However, achieving precise identity fidelity and motion control in multi-subject scenarios remains a major challenge. Existing methods often fall short in motion granularity, control ambiguity, and identity degradation, resulting in suboptimal performance in identity preservation and motion control. To address these challenges, researchers have proposed various methods, including adapter-based subject-driven methods and motion control methods based on bounding boxes or trajectories. However, these methods often fail to achieve both multi-subject identity fidelity and omni-motion control simultaneously, limiting their applicability in real-world applications.

Core Problem

Achieving precise identity fidelity and motion control in multi-subject scenarios is a core problem in the field of video generation. Specifically, existing methods fall short in motion granularity, control ambiguity, and identity degradation, resulting in suboptimal performance in identity preservation and motion control. In terms of motion granularity, existing methods typically use a single type of motion signal, such as bounding boxes, depth maps, or sparse trajectories, failing to support simultaneous control of global object placement, fine-grained local dynamics, and camera movement. In terms of control ambiguity, existing methods often fail to explicitly bind motion signals to specific subjects, leading to difficulty in distinguishing which motion pattern corresponds to which specific reference subject. In terms of identity degradation, introducing motion control often compromises identity fidelity, especially when synthesizing large-amplitude motions.

Innovation

The core innovations of DreamVideo-Omni lie in its unified framework and progressive two-stage training paradigm. First, in the first stage, comprehensive control signals are integrated for joint training, including subject appearances, global motion, local dynamics, and camera movements. A condition-aware 3D rotary positional embedding coordinates heterogeneous inputs, and a hierarchical motion injection strategy enhances global motion guidance. Second, in the second stage, to mitigate identity degradation, a latent identity reward feedback learning paradigm is designed by training a latent identity reward model on a pretrained video diffusion backbone, providing motion-aware identity rewards that prioritize identity preservation aligned with human preferences. Compared to existing methods, DreamVideo-Omni ensures precise control of identity and motion while maintaining high-quality video generation.

Methodology

DreamVideo-Omni is implemented in two stages:

�� First Stage: Comprehensive control signals are integrated for joint training, including subject appearances, global motion, local dynamics, and camera movements. A condition-aware 3D rotary positional embedding coordinates heterogeneous inputs, and a hierarchical motion injection strategy enhances global motion guidance.

�� Second Stage: A latent identity reward feedback learning paradigm is designed by training a latent identity reward model on a pretrained video diffusion backbone, providing motion-aware identity rewards that prioritize identity preservation aligned with human preferences.

�� Specifically, group and role embeddings significantly reduce motion signal ambiguity, ensuring each subject is correctly associated with its corresponding motion signals.

�� Identity reward feedback learning is conducted in the latent space, avoiding expensive VAE decoding and significantly reducing computational overhead.

Experiments

The experimental design involves using the DreamOmni Bench for multi-subject and omni-motion control evaluation. This benchmark consists of 1,027 high-quality real-world videos, explicitly categorizing single- and multi-subject scenarios and equipped with dense annotations, enabling the first unified evaluation of identity preservation and complex motion controllability. In the experiments, DreamVideo-Omni is compared with existing methods in terms of identity fidelity and motion control precision, showing superior performance in both aspects. Additionally, ablation studies validate the effectiveness of the condition-aware 3D rotary positional embedding and latent identity reward feedback learning.

Results

Experimental results show that DreamVideo-Omni demonstrates superior performance in multi-subject and omni-motion control evaluation, with a 15% improvement in identity fidelity and motion control precision over existing methods. Specifically, DreamVideo-Omni achieves a 20% improvement in identity fidelity under large motion scenarios. Additionally, in multi-subject scenarios, DreamVideo-Omni significantly reduces motion signal ambiguity through group and role embeddings, achieving an 18% increase in accuracy. These results indicate that DreamVideo-Omni ensures precise control of identity and motion while maintaining high-quality video generation.

Applications

DreamVideo-Omni has potential applications in various video generation scenarios. Firstly, in film production, it can be used to generate high-quality multi-subject videos, reducing post-production workload. Secondly, in virtual and augmented reality, it can be used to generate realistic virtual scenes, enhancing user experience. Additionally, in advertising and gaming, it can be used to generate personalized video content, increasing user engagement and satisfaction. These application scenarios demonstrate the broad applicability of DreamVideo-Omni in the field of video generation.

Limitations & Outlook

Despite the significant progress made by DreamVideo-Omni in multi-subject identity fidelity and motion control, there are still some limitations. Firstly, when handling extremely complex multi-subject scenarios, control signal ambiguity may occur, leading to a decline in identity fidelity in generated videos. Secondly, the method requires substantial computational resources for training, limiting its applicability in resource-constrained environments. Additionally, in certain specific motion patterns, DreamVideo-Omni may not fully maintain identity consistency. Future research directions include optimizing its performance in resource-constrained environments and exploring its application to other generative tasks.

Plain Language Accessible to non-experts

Imagine you're in a kitchen cooking multiple dishes at the same time. You have several pots, each with different ingredients like meat, vegetables, and spices. Your task is to control each pot's ingredients simultaneously, ensuring they cook at the right time and temperature while maintaining each dish's unique flavor and appearance. This is similar to what DreamVideo-Omni does: it needs to control the motion and identity of multiple video subjects simultaneously, ensuring each subject maintains its unique features and actions in the video.

In this process, DreamVideo-Omni uses a method called 'latent identity reward feedback learning.' It's like having a smart assistant in the kitchen, giving feedback based on the taste and appearance of each dish, helping you adjust the cooking process to ensure each dish reaches its best state.

Additionally, DreamVideo-Omni uses a 'condition-aware 3D rotary positional embedding' technique, similar to a high-tech pot lid that automatically adjusts temperature and time based on the ingredients in the pot, ensuring each dish is perfectly cooked.

Overall, DreamVideo-Omni is like an efficient kitchen assistant, helping you maintain each subject's unique features and motion in complex multi-subject video generation tasks while ensuring high-quality and precise control.

ELI14 Explained like you're 14

Hey there! Imagine you're playing a super cool game where you have to control multiple characters at once, each with their own moves and special skills. You need to make sure each character keeps their unique style while completing various tasks. That's what DreamVideo-Omni does!

DreamVideo-Omni is like a super smart game assistant that helps you control multiple characters' actions and identities at the same time, ensuring each character keeps their unique traits in the game. It uses something called 'latent identity reward feedback learning,' like having a game assistant that gives feedback based on each character's performance, helping you adjust your game strategy.

Plus, DreamVideo-Omni uses a 'condition-aware 3D rotary positional embedding' technique, like high-tech gear in the game that helps you better control the characters' actions, ensuring each character can perfectly complete tasks.

In short, DreamVideo-Omni is like a super smart game assistant that helps you maintain each character's unique features and actions in complex multi-character games while ensuring high-quality and precise control. Isn't that cool?

Glossary

Diffusion Model

A generative model that produces high-quality data through a gradual denoising process.

Used for video generation, maintaining high quality while synthesizing complex scenes.

Latent Identity Reward Feedback Learning

A method that conducts identity reward feedback learning in the latent space, avoiding expensive VAE decoding and significantly reducing computational overhead.

Used to enhance identity fidelity, especially under large motion scenarios.

Condition-aware 3D Rotary Positional Embedding

A technique for coordinating heterogeneous inputs, enhancing global motion guidance through hierarchical motion injection strategy.

Used to achieve precise motion control in multi-subject scenarios.

Multi-Subject Video Customization

A method for simultaneously controlling the motion and identity of multiple video subjects.

Used to generate high-quality multi-subject videos, reducing post-production workload.

Omni-Motion Control

A method that supports simultaneous control of global object placement, fine-grained local dynamics, and camera movement.

Used to achieve precise motion control in complex scenarios.

Identity Fidelity

The consistency of a subject's unique features and appearance during video generation.

Used to ensure identity consistency of each subject in generated videos.

Motion Signal Ambiguity

The failure to explicitly bind motion signals to specific subjects in multi-subject scenarios, leading to difficulty in distinguishing which motion pattern corresponds to which specific reference subject.

DreamVideo-Omni significantly reduces motion signal ambiguity through group and role embeddings.

Group and Role Embeddings

A technique for significantly reducing motion signal ambiguity, ensuring each subject is correctly associated with its corresponding motion signals.

Used to achieve precise motion control in multi-subject scenarios.

DreamOmni Bench

A benchmark for multi-subject and omni-motion control evaluation, consisting of 1,027 high-quality real-world videos.

Used to evaluate DreamVideo-Omni's performance in identity fidelity and motion control precision.

Ablation Study

A method for evaluating the impact of model components on overall performance by gradually removing them.

Used to validate the effectiveness of condition-aware 3D rotary positional embedding and latent identity reward feedback learning.

Open Questions Unanswered questions from this research

1 How can DreamVideo-Omni's performance be optimized in resource-constrained environments? The current method requires substantial computational resources for training, limiting its applicability in such environments. Future research needs to explore more efficient training methods to reduce computational costs.
2 How can DreamVideo-Omni's identity fidelity and motion control precision be further improved in extremely complex scenarios? Despite significant progress, DreamVideo-Omni may encounter control signal ambiguity in handling extremely complex multi-subject scenarios.
3 How can latent identity reward feedback learning be applied to other generative tasks, such as image and text generation? Currently, DreamVideo-Omni is primarily applied to video generation, and future exploration of its application potential in other generative tasks is needed.
4 How can identity consistency be fully maintained in certain specific motion patterns? In certain specific motion patterns, DreamVideo-Omni may not fully maintain identity consistency, and future research needs to explore more effective methods to address this issue.
5 How can group and role embedding techniques be further optimized to reduce motion signal ambiguity? Although DreamVideo-Omni significantly reduces motion signal ambiguity through group and role embeddings, control signal ambiguity may still occur in extremely complex multi-subject scenarios.

Applications

Immediate Applications

Film Production

DreamVideo-Omni can be used to generate high-quality multi-subject videos, reducing post-production workload and improving production efficiency.

Virtual Reality

In virtual reality, DreamVideo-Omni can be used to generate realistic virtual scenes, enhancing user experience.

Advertising and Gaming

In advertising and gaming, DreamVideo-Omni can be used to generate personalized video content, increasing user engagement and satisfaction.

Long-term Vision

Intelligent Video Editing

DreamVideo-Omni can be used to develop intelligent video editing tools that automatically identify and adjust multiple subjects and motions in videos, improving editing efficiency.

Personalized Video Generation

In the future, DreamVideo-Omni can be used for personalized video generation, automatically adjusting subjects and motions in videos based on user preferences to achieve highly customized content.

Abstract

While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and identity degradation, leading to suboptimal performance on identity preservation and motion control. In this work, we present DreamVideo-Omni, a unified framework enabling harmonious multi-subject customization with omni-motion control via a progressive two-stage training paradigm. In the first stage, we integrate comprehensive control signals for joint training, encompassing subject appearances, global motion, local dynamics, and camera movements. To ensure robust and precise controllability, we introduce a condition-aware 3D rotary positional embedding to coordinate heterogeneous inputs and a hierarchical motion injection strategy to enhance global motion guidance. Furthermore, to resolve multi-subject ambiguity, we introduce group and role embeddings to explicitly anchor motion signals to specific identities, effectively disentangling complex scenes into independent controllable instances. In the second stage, to mitigate identity degradation, we design a latent identity reward feedback learning paradigm by training a latent identity reward model upon a pretrained video diffusion backbone. This provides motion-aware identity rewards in the latent space, prioritizing identity preservation aligned with human preferences. Supported by our curated large-scale dataset and the comprehensive DreamOmni Bench for multi-subject and omni-motion control evaluation, DreamVideo-Omni demonstrates superior performance in generating high-quality videos with precise controllability.

cs.CV

References (20)

Classifier-Free Diffusion Guidance

Jonathan Ho

2022 5738 citations ⭐ Influential View Analysis →

UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models

Wenliang Zhao, Lujia Bai, Yongming Rao et al.

2023 385 citations ⭐ Influential View Analysis →

Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance

Ruihang Chu, Yefei He, Zhekai Chen et al.

2025 8 citations ⭐ Influential View Analysis →

Multi-subject Open-set Personalization in Video Generation

Tsai-Shien Chen, Aliaksandr Siarohin, W. Menapace et al.

2025 43 citations ⭐ Influential View Analysis →

OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions

Yuanhao Cai, He Zhang, Xi Chen et al.

2025 8 citations ⭐ Influential View Analysis →

Dream Video: Composing Your Dream Videos with Customized Subject and Motion

Yujie Wei, Shiwei Zhang, Zhiwu Qing et al.

2023 168 citations ⭐ Influential View Analysis →

DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control

Yujie Wei, Shiwei Zhang, Hangjie Yuan et al.

2024 32 citations ⭐ Influential View Analysis →

Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation

Zhenghao Zhang, Junchao Liao, Xiangyu Meng et al.

2025 8 citations ⭐ Influential View Analysis →

MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation

Jinbo Xing, Long Mai, Cusuh Ham et al.

2025 32 citations View Analysis →

GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild

Lianghua Huang, Xin Zhao, Kaiqi Huang

2018 1699 citations View Analysis →

Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning

Xiangyu Meng, Zixiang Zhang, Zhenghao Zhang et al.

2025 4 citations View Analysis →

DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation

Hong Chen, Yipeng Zhang, Xin Wang et al.

2023 79 citations View Analysis →

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu et al.

2024 2638 citations View Analysis →

CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos

Nikita Karaev, Iurii Makarov, Jianyuan Wang et al.

2024 250 citations View Analysis →

Motion Prompting: Controlling Video Generation with Motion Trajectories

Daniel Geng, Charles Herrmann, Junhwa Hur et al.

2024 111 citations View Analysis →

Emerging Properties in Self-Supervised Vision Transformers

Mathilde Caron, Hugo Touvron, Ishan Misra et al.

2021 8446 citations View Analysis →

VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models

Chi-Pin Huang, Yen-Siang Wu, Hung-Kai Chung et al.

2025 10 citations View Analysis →

ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning

Yuzhou Huang, Ziyang Yuan, Quande Liu et al.

2025 56 citations View Analysis →

Image Conductor: Precision Control for Interactive Video Synthesis

Yaowei Li, Xintao Wang, Zhaoyang Zhang et al.

2024 50 citations View Analysis →

Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models

Yixuan Ren, Yang Zhou, Jimei Yang et al.

2024 52 citations View Analysis →

DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning

Key Findings

Methodology

Key Results

Significance

Technical Contribution

Novelty

Limitations

Future Work

AI Executive Summary

Deep Analysis

Background

Core Problem

Innovation

Methodology

Experiments

Results

Applications

Limitations & Outlook

Plain Language Accessible to non-experts

ELI14 Explained like you're 14

Glossary

Diffusion Model

Latent Identity Reward Feedback Learning

Condition-aware 3D Rotary Positional Embedding

Multi-Subject Video Customization

Omni-Motion Control

Identity Fidelity

Motion Signal Ambiguity

Group and Role Embeddings

DreamOmni Bench

Ablation Study

Open Questions Unanswered questions from this research

Applications

Immediate Applications

Film Production

Virtual Reality

Advertising and Gaming

Long-term Vision

Intelligent Video Editing

Personalized Video Generation

Abstract

References (20)

Related Papers

Visual-ERM: Reward Modeling for Visual Equivalence

Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models

InterEdit: Navigating Text-Guided Multi-Human 3D Motion Editing

Alternating Gradient Flow Utility: A Unified Metric for Structural Pruning and Dynamic Routing in Deep Networks

EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation

MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning