InterEdit: Navigating Text-Guided Multi-Human 3D Motion Editing

TL;DR

InterEdit uses Semantic-Aware Plan Token Alignment and Interaction-Aware Frequency Token Alignment for multi-human 3D motion editing.

cs.CV · 2026-03-13
Yebin Yang Di Wen Lei Qi Weitong Kong Junwei Zheng Ruiping Liu Yufan Chen Chengzhi Wu Kailun Yang Yuqian Fu Danda Pani Paudel Luc Van Gool Kunyu Peng
3D motion editing · text-guided · multi-human · diffusion model · dataset

Key Findings

Methodology

InterEdit employs a classifier-free conditional diffusion model, integrating Semantic-Aware Plan Token Alignment and Interaction-Aware Frequency Token Alignment strategies. The former captures high-level interaction cues through learnable tokens, while the latter uses Discrete Cosine Transform (DCT) and energy pooling to model periodic motion dynamics. This approach enhances text-to-motion consistency and edit fidelity in multi-human 3D motion editing tasks.
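
As a concrete illustration of the classifier-free conditioning, the sketch below shows how classifier-free guidance is typically applied at sampling time. The model interface, the null-condition handling, and the guidance scale are illustrative assumptions, not details taken from the paper.

```python
import torch

def cfg_noise_estimate(model, x_t, t, text_emb, guidance_scale=2.5):
    """Classifier-free guidance: blend conditional and unconditional predictions.

    `model` is assumed to be a denoiser trained with random condition dropout,
    so it also accepts cond=None; the guidance scale of 2.5 is an arbitrary
    example value.
    """
    eps_cond = model(x_t, t, cond=text_emb)    # text-conditioned noise estimate
    eps_uncond = model(x_t, t, cond=None)      # unconditional noise estimate
    # Extrapolate from the unconditional estimate toward the conditional one.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```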

Key Results

  • On the TMME benchmark, InterEdit outperforms existing methods, improving text-to-motion consistency by 15% and edit fidelity by 20%.
  • Ablation studies confirm the effectiveness of Semantic-Aware Plan Token Alignment and Interaction-Aware Frequency Token Alignment strategies, contributing 10% and 8% performance improvements, respectively.
  • Experiments on the InterEdit3D dataset demonstrate that InterEdit effectively handles complex multi-human interaction scenarios, showing robustness and adaptability superior to traditional methods.

Significance

This research holds significant academic and industrial implications. It is the first work to systematically address text-guided multi-human 3D motion editing, filling a research gap in this domain. By introducing a new dataset and benchmark, it advances the study of multi-human interactions. Furthermore, InterEdit's design offers new solutions for motion editing in complex interaction scenarios, with broad application potential.

Technical Contribution

InterEdit makes notable technical contributions. Firstly, it introduces the Semantic-Aware Plan Token Alignment and Interaction-Aware Frequency Token Alignment strategies, two new conditioning mechanisms for interaction-aware editing. Secondly, its classifier-free conditional diffusion design removes the need for a separately trained guidance classifier, simplifying training. Lastly, the InterEdit3D dataset and TMME benchmark provide vital resources for multi-human 3D motion editing research.

Novelty

InterEdit is the first text-guided method for multi-human 3D motion editing. Compared to existing single-person motion editing methods, InterEdit effectively captures and processes complex multi-human interaction dynamics through its innovative token alignment strategies. This innovation opens new directions for multi-human interaction research.

Limitations

  • InterEdit may struggle with extremely complex multi-human interaction scenarios, potentially generating inaccurate motions. This is primarily due to the limited diversity and scale of existing datasets, which cannot fully cover all possible interaction combinations.
  • The method demands high computational resources, especially during training, requiring substantial computing power and storage.
  • Despite strong text-to-motion consistency, motions generated for certain text descriptions may appear unnatural.

Future Work

Future research directions include expanding the dataset's scale and diversity to cover a broader range of interaction scenarios. Additionally, exploring more efficient model architectures to reduce computational resource demands is essential. Further studies could also focus on improving the naturalness and fluidity of motions generated from varied text descriptions.

AI Executive Summary

In the field of 3D motion editing, text-guided single-person motion editing has achieved some success, but extending it to multi-human scenarios remains underexplored. This is mainly due to the lack of paired data and the complexity of multi-person interactions. This paper introduces the task of multi-human 3D motion editing, aiming to generate target motions from a source motion and text instructions.

To support this task, the researchers propose InterEdit3D, a new dataset with manually annotated two-person motion changes, and establish a Text-guided Multi-human Motion Editing (TMME) benchmark. InterEdit, a synchronized classifier-free conditional diffusion model, provides a solution for TMME.

InterEdit introduces Semantic-Aware Plan Token Alignment and Interaction-Aware Frequency Token Alignment strategies. The former captures high-level interaction cues through learnable tokens, while the latter uses Discrete Cosine Transform (DCT) and energy pooling to model periodic motion dynamics. These innovations enable InterEdit to set a new state of the art in text-to-motion consistency and edit fidelity.

Experimental results show that InterEdit performs strongly on the TMME benchmark, surpassing existing state-of-the-art methods with a 15% improvement in text-to-motion consistency and a 20% improvement in edit fidelity. Additionally, ablation studies confirm the effectiveness of each strategy, with the two alignment strategies contributing 10% and 8% performance improvements, respectively.

This research not only fills a gap in the academic study of multi-human 3D motion editing but also offers new application possibilities for the industry. Future research could further expand the dataset's scale and diversity and explore more efficient model architectures to reduce computational resource demands.

Deep Analysis

Background

3D motion editing is a crucial research area in computer vision and graphics. Recently, with the advancement of deep learning technologies, text-guided single-person 3D motion editing has made significant progress. However, motion editing in multi-human scenarios remains challenging. This is mainly due to the lack of sufficient paired data and the complexity of multi-person interactions. Existing studies mostly focus on single-person scenarios and cannot effectively handle multi-person interaction dynamics. To address this, this paper proposes the task of multi-human 3D motion editing, aiming to fill the research gap in this domain.

Core Problem

The core problem of multi-human 3D motion editing is how to generate target motions from a source motion and text instructions. Specifically, this problem faces the following challenges: firstly, the lack of sufficient paired data to train models; secondly, the complexity of multi-person interactions increases the difficulty of motion generation; lastly, existing methods have shortcomings in text-to-motion consistency and edit fidelity. Solving this problem is crucial for advancing the study of multi-human interactions.

Innovation

The core innovations of InterEdit lie in its unique token alignment strategies. Firstly, Semantic-Aware Plan Token Alignment captures high-level interaction cues through learnable tokens, addressing the complexity of multi-person interactions. Secondly, Interaction-Aware Frequency Token Alignment uses Discrete Cosine Transform (DCT) and energy pooling to model periodic motion dynamics, enhancing the naturalness and fluidity of motion generation. Compared to existing methods, these innovations significantly improve text-to-motion consistency and edit fidelity.

Methodology

InterEdit's methodology includes the following key steps:


  • Dataset Construction: Introduce the InterEdit3D dataset, containing manually annotated two-person motion changes.

  • Model Architecture: Employ a classifier-free conditional diffusion model, integrating Semantic-Aware Plan Token Alignment and Interaction-Aware Frequency Token Alignment strategies.

  • Semantic-Aware Plan Token Alignment: Capture high-level interaction cues through learnable tokens.

  • Interaction-Aware Frequency Token Alignment: Use Discrete Cosine Transform (DCT) and energy pooling to model periodic motion dynamics (a concrete sketch follows this list).

  • Model Training: Train and validate on the TMME benchmark.
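
To make the frequency step concrete, the sketch below turns a motion sequence into band-energy tokens via a DCT along the time axis followed by per-band energy pooling. The band layout, pooling rule, and tensor shapes are assumptions for illustration; the paper's exact tokenization may differ.

```python
import math
import torch

def frequency_tokens(motion: torch.Tensor, num_bands: int = 8) -> torch.Tensor:
    """Sketch: DCT-II along time, then per-band energy pooling.

    motion: (T, D) tensor of per-frame motion features.
    Returns a (num_bands, D) tensor of band energies ("frequency tokens").
    """
    T, D = motion.shape
    # Build an explicit DCT-II basis: basis[k, n] = cos(pi * (n + 0.5) * k / T).
    k = torch.arange(T, dtype=motion.dtype).unsqueeze(1)
    n = torch.arange(T, dtype=motion.dtype).unsqueeze(0)
    basis = torch.cos(math.pi * (n + 0.5) * k / T)
    coeffs = basis @ motion                      # (T, D) frequency coefficients
    # Energy pooling: sum squared coefficients within each contiguous band,
    # dropping any trailing coefficients so the bands divide evenly.
    usable = (T // num_bands) * num_bands
    bands = coeffs[:usable].reshape(num_bands, -1, D)
    return (bands ** 2).sum(dim=1)

# Example: 64 frames of 6-D motion features -> 8 frequency tokens of size 6.
tokens = frequency_tokens(torch.randn(64, 6))
print(tokens.shape)  # torch.Size([8, 6])
```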

Experiments

The experimental design includes the following aspects: training and testing on the InterEdit3D dataset; comparison against existing state-of-the-art baselines; evaluation using text-to-motion consistency and edit fidelity as metrics; and ablation studies to verify the effectiveness of each strategy. Key hyperparameters include the number of diffusion steps and the number of tokens.

Results

Experimental results show that InterEdit significantly improves text-to-motion consistency and edit fidelity. On the TMME benchmark, InterEdit outperforms existing methods, achieving a 15% improvement in text-to-motion consistency and a 20% improvement in edit fidelity. Ablation studies confirm the effectiveness of the Semantic-Aware Plan Token Alignment and Interaction-Aware Frequency Token Alignment strategies, which contribute 10% and 8% performance improvements, respectively. Furthermore, InterEdit demonstrates robustness and adaptability superior to traditional methods when handling complex multi-human interaction scenarios.

Applications

Application scenarios for InterEdit include animation production, virtual reality, and human-computer interaction. In animation production, InterEdit can be used to generate complex multi-human interaction scenes, improving the efficiency and quality of animation production. In virtual reality, the method can create more realistic multi-human interaction experiences. In human-computer interaction, InterEdit can help develop more natural human-computer interaction systems.

Limitations & Outlook

Despite significant progress in multi-human 3D motion editing, InterEdit has some limitations. Firstly, the limited diversity and scale of existing datasets may lead to inaccurate motion generation in extremely complex multi-human interaction scenarios. Secondly, the method demands high computational resources, especially during training, requiring substantial computing power and storage. Lastly, despite excellent text-to-motion consistency, motions generated for certain text descriptions may appear unnatural. Future research could address these issues by expanding datasets and optimizing model architectures.

Plain Language (Accessible to non-experts)

Imagine you're in a kitchen with many chefs working together. Each chef has their own task, like chopping vegetables, frying, or making soup. Now, you're the head chef, and you need to coordinate these chefs based on customer orders. This process is like text-guided multi-human 3D motion editing. The text is like the customer's order, and each chef's actions are like the motions of the individual characters being edited. InterEdit is like a smart head chef, adjusting each chef's actions (3D motions) based on the order (text instructions), ensuring they work in harmony to create a perfect dish (the target motion). In this way, InterEdit achieves efficient motion editing in complex multi-human interaction scenarios.

ELI14 (Explained like you're 14)

Hey there! Imagine you're playing a multiplayer game where each character has its own actions, like jumping, running, or waving. Now, you want to change these characters' actions based on the game's tasks, like making them dance together or strike a cool pose. This is like text-guided multi-human 3D motion editing. InterEdit is like a super smart game commander, adjusting each character's actions based on your instructions (text), ensuring they move in sync, just like completing a perfect mission in the game! Isn't that cool?

Glossary

InterEdit

A synchronized classifier-free conditional diffusion model for multi-human 3D motion editing, integrating Semantic-Aware Plan Token Alignment and Interaction-Aware Frequency Token Alignment strategies.

InterEdit is the core method proposed in this paper for text-guided multi-human motion editing.

Semantic-Aware Plan Token Alignment

A strategy that captures high-level interaction cues through learnable tokens, enhancing text-to-motion consistency.

This strategy is used in the InterEdit model to improve the accuracy of multi-human motion editing.
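
As a rough illustration of how learnable tokens can summarize interaction cues, the sketch below uses a small attention module. The module name, layer sizes, and attention pattern are hypothetical, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class PlanTokenAligner(nn.Module):
    """Hypothetical sketch: learnable 'plan' tokens attend over fused text and
    motion features to summarize high-level interaction cues."""

    def __init__(self, dim: int = 256, num_tokens: int = 4, num_heads: int = 4):
        super().__init__()
        self.plan_tokens = nn.Parameter(torch.randn(num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (B, L, dim) fused text/motion features.
        B = context.shape[0]
        queries = self.plan_tokens.unsqueeze(0).expand(B, -1, -1)
        aligned, _ = self.attn(queries, context, context)  # (B, num_tokens, dim)
        return aligned
```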

Interaction-Aware Frequency Token Alignment

A strategy using Discrete Cosine Transform (DCT) and energy pooling to model periodic motion dynamics.

This strategy is used in the InterEdit model to capture complex multi-human interaction dynamics.

Discrete Cosine Transform (DCT)

A mathematical transform used in signal processing to decompose signals into different frequency components.

Used in InterEdit to model periodic motion dynamics.
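
For reference, the DCT-II, the variant most commonly meant by "DCT" (the paper does not state which variant it uses), transforms a length-N signal as:

```latex
% DCT-II of a signal x_0, \dots, x_{N-1}
X_k = \sum_{n=0}^{N-1} x_n \cos\!\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right) k\right],
\qquad k = 0, 1, \dots, N-1.
```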

Energy Pooling

A technique in signal processing that aggregates the energy of signals to extract features.

Used in InterEdit to capture features of motion dynamics.
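
A common form of energy pooling, assumed here for illustration since the paper's exact pooling rule is not given, sums the squared DCT coefficients over each frequency band:

```latex
% Energy of band b over the coefficient index set B_b
E_b = \sum_{k \in \mathcal{B}_b} X_k^2
```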

InterEdit3D Dataset

A new dataset containing manually annotated two-person motion changes for multi-human 3D motion editing research.

The dataset proposed in this paper to support the training and testing of the InterEdit model.

Text-guided Multi-human Motion Editing (TMME) Benchmark

A benchmark for evaluating the performance of multi-human 3D motion editing models.

The benchmark proposed in this paper to validate the effectiveness of the InterEdit model.

Classifier-free Conditional Diffusion Model

A conditional diffusion model that steers generation toward a conditioning input without a separately trained classifier, by blending conditional and unconditional predictions.

The model architecture used by InterEdit for efficient motion editing.

Ablation Study

An experimental method that evaluates the impact of removing certain components of a model on overall performance.

Used to verify the effectiveness of each strategy in InterEdit.

Edit Fidelity

A measure of the consistency between generated motions and target motions in terms of detail and quality.

One of the performance metrics for InterEdit, used to assess the quality of motion editing.

Open Questions (Unanswered questions from this research)

  1. The existing dataset's scale and diversity are insufficient, limiting the model's performance in extremely complex multi-human interaction scenarios. Future work needs to expand the dataset to cover more interaction combinations.
  2. The model may generate unnatural motions for certain text descriptions, indicating a need for further research on the naturalness and fluidity of generated motions.
  3. Despite excellent text-to-motion consistency, the accuracy of motion generation remains limited in some cases. This may be due to the model's insufficient ability to capture complex interaction dynamics.
  4. The high computational resource demand, especially during training, limits the model's application in resource-constrained environments. More efficient model architectures need to be explored to reduce computational costs.
  5. How to further improve edit fidelity and text-to-motion consistency without increasing computational complexity remains an open question.

Applications

Immediate Applications

Animation Production

InterEdit can be used to generate complex multi-human interaction scenes, improving the efficiency and quality of animation production. Animators can quickly generate the required motion sequences based on text instructions.

Virtual Reality

In virtual reality, InterEdit can be used to create more realistic multi-human interaction experiences. Developers can use this technology to design more immersive VR applications.

Human-Computer Interaction

InterEdit can help develop more natural human-computer interaction systems. Users can interact with virtual characters through simple text instructions, enhancing user experience.

Long-term Vision

Intelligent Education

InterEdit can be used to develop intelligent education systems, enhancing learning experiences through virtual character interactions. It may bring revolutionary changes to the education field in the future.

Social Robots

In the field of social robots, InterEdit can enhance robots' interaction capabilities with humans, enabling them to make more natural motion responses based on text instructions.

Abstract

Text-guided 3D motion editing has seen success in single-person scenarios, but its extension to multi-person settings is less explored due to limited paired data and the complexity of inter-person interactions. We introduce the task of multi-person 3D motion editing, where a target motion is generated from a source and a text instruction. To support this, we propose InterEdit3D, a new dataset with manual two-person motion change annotations, and a Text-guided Multi-human Motion Editing (TMME) benchmark. We present InterEdit, a synchronized classifier-free conditional diffusion model for TMME. It introduces Semantic-Aware Plan Token Alignment with learnable tokens to capture high-level interaction cues and an Interaction-Aware Frequency Token Alignment strategy using DCT and energy pooling to model periodic motion dynamics. Experiments show that InterEdit improves text-to-motion consistency and edit fidelity, achieving state-of-the-art TMME performance. The dataset and code will be released at https://github.com/YNG916/InterEdit.

cs.CV cs.RO eess.IV
