MUA: Mobile Ultra-detailed Animatable Avatars
The MUA method achieves up to 2000X lower computational cost using Wavelet-guided Multi-level Spatial Factorized Blendshapes.
Key Findings
Methodology
This study proposes a novel animatable avatar representation called Wavelet-guided Multi-level Spatial Factorized Blendshapes, along with a corresponding distillation pipeline. By combining multi-level wavelet spectral decomposition with low-rank structural factorization in texture space, the method transfers motion-aware clothing dynamics and fine-grained appearance details from a pre-trained ultra-high-quality avatar model into a compact, efficient representation.
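To make the wavelet half of this concrete, below is a minimal sketch of a multi-level 2D wavelet decomposition applied to one channel of a texture map, using PyWavelets. The texture resolution, wavelet family, and level count here are illustrative assumptions, not the paper's reported settings.

```python
# Minimal sketch: multi-level 2D wavelet decomposition of a texture channel.
# The 512x512 resolution, 'haar' wavelet, and 3 levels are assumptions for
# illustration; the paper's actual configuration may differ.
import numpy as np
import pywt

texture = np.random.rand(512, 512).astype(np.float32)  # one channel of a UV texture

# Decompose into a coarse approximation plus per-level detail subbands.
coeffs = pywt.wavedec2(texture, wavelet="haar", level=3)
approx, details = coeffs[0], coeffs[1:]  # detail sets are ordered coarse -> fine

print("approximation (low frequency):", approx.shape)
for i, (cH, cV, cD) in enumerate(details):
    print(f"detail subbands, set {i}:", cH.shape)

# The decomposition is lossless: all subbands reconstruct the texture exactly.
recon = pywt.waverec2(coeffs, wavelet="haar")
assert np.allclose(recon, texture, atol=1e-5)
```

The value of this split for a compact avatar representation is that low-frequency structure and high-frequency detail land in separate subbands, so each band can be modeled and compressed at an appropriate resolution.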
Key Results
- Result 1: Compared to the original high-quality teacher avatar model, the MUA method achieves up to 2000X lower computational cost and a 10X smaller model size, while preserving visually plausible dynamics and appearance details closely resembling those of the teacher model.
- Result 2: Extensive comparisons with existing avatar approaches designed for mobile settings show that the MUA method significantly outperforms existing methods and achieves comparable or superior rendering quality to most approaches that can only run on servers.
- Result 3: The MUA method achieves over 180 FPS on a desktop PC and real-time native on-device performance at 24 FPS on a standalone Meta Quest 3.
Significance
This study significantly improves the practicality of high-fidelity avatars for immersive applications. By transferring the dynamics and details of an ultra-high-quality avatar model into a compact representation, the MUA method not only reduces computational costs but also enables high-quality rendering on resource-constrained platforms. This advancement addresses the long-standing trade-off between high fidelity and computational complexity in computer graphics and vision, offering new possibilities for applications in virtual and augmented reality.
Technical Contribution
The MUA method fundamentally differs from existing state-of-the-art methods. By integrating wavelet spectral decomposition and low-rank factorization, this method drastically reduces computational costs without sacrificing visual quality. Additionally, the MUA method opens new engineering possibilities, making high-quality animatable avatars feasible on mobile devices.
Novelty
The MUA method is the first to combine wavelet spectral decomposition with low-rank factorization for animatable avatar representation. This innovation not only stands out technically but also makes a breakthrough in resolving the trade-off between high fidelity and computational complexity.
Limitations
- Limitation 1: Although the MUA method performs excellently in most scenarios, it may experience detail loss in extremely complex dynamic scenes.
- Limitation 2: The method's performance might be limited on some low-end devices, especially when handling high-resolution textures.
- Limitation 3: The MUA method's performance heavily relies on the quality of the pre-trained teacher model.
Future Work
Future research directions include further optimizing the MUA method to support more complex dynamic scenes and exploring the possibility of achieving efficient rendering on a wider range of devices. Additionally, researchers could explore how to achieve similar performance and quality without relying on pre-trained teacher models.
AI Executive Summary
Building photorealistic, animatable full-body digital humans remains a longstanding challenge in computer graphics and vision. Existing animatable avatar modeling methods have largely progressed along two directions: improving the fidelity of dynamic geometry and appearance, or reducing computational complexity to enable deployment on resource-constrained platforms. However, existing approaches fail to achieve both goals simultaneously: ultra-high-fidelity avatars typically require substantial computation on server-class GPUs, whereas lightweight avatars often suffer from limited surface dynamics, reduced appearance details, and noticeable artifacts.
To bridge this gap, we propose a novel animatable avatar representation termed Wavelet-guided Multi-level Spatial Factorized Blendshapes, along with a corresponding distillation pipeline. This method combines multi-level wavelet spectral decomposition with low-rank structural factorization in texture space to transfer motion-aware clothing dynamics and fine-grained appearance details from a pre-trained ultra-high-quality avatar model into a compact, efficient representation.
The MUA method fundamentally differs from existing state-of-the-art methods. By integrating wavelet spectral decomposition and low-rank factorization, this method drastically reduces computational costs without sacrificing visual quality. Additionally, the MUA method opens new engineering possibilities, making high-quality animatable avatars feasible on mobile devices.
Extensive comparisons with existing avatar approaches designed for mobile settings show that the MUA method significantly outperforms existing methods and achieves comparable or superior rendering quality to most approaches that can only run on servers. The MUA method achieves over 180 FPS on a desktop PC and real-time native on-device performance at 24 FPS on a standalone Meta Quest 3.
This study significantly improves the practicality of high-fidelity avatars for immersive applications. By transferring the dynamics and details of an ultra-high-quality avatar model into a compact representation, the MUA method not only reduces computational costs but also enables high-quality rendering on resource-constrained platforms. This advancement addresses the long-standing trade-off between high fidelity and computational complexity in computer graphics and vision, offering new possibilities for applications in virtual and augmented reality.
Despite the MUA method's excellent performance in most scenarios, it may experience detail loss in extremely complex dynamic scenes. Future research directions include further optimizing the MUA method to support more complex dynamic scenes and exploring the possibility of achieving efficient rendering on a wider range of devices.
Deep Analysis
Background
In the field of computer graphics and vision, building photorealistic, animatable full-body digital humans has been a longstanding challenge. With the rapid development of virtual reality (VR) and augmented reality (AR) technologies, the demand for high-fidelity animatable avatars is increasing. Existing animatable avatar modeling methods have largely progressed along two directions: improving the fidelity of dynamic geometry and appearance, or reducing computational complexity to enable deployment on resource-constrained platforms. However, existing approaches fail to achieve both goals simultaneously: ultra-high-fidelity avatars typically require substantial computation on server-class GPUs, whereas lightweight avatars often suffer from limited surface dynamics, reduced appearance details, and noticeable artifacts.
Core Problem
The core problem in high-fidelity animatable avatar modeling is how to reduce computational complexity without sacrificing visual quality. Ultra-high-fidelity avatars typically require substantial computation on server-class GPUs, whereas lightweight avatars often suffer from limited surface dynamics, reduced appearance details, and noticeable artifacts. Solving this problem is crucial for enabling high-quality rendering on resource-constrained platforms.
Innovation
The core innovation of the MUA method lies in combining wavelet spectral decomposition and low-rank factorization to achieve efficient animatable avatar representation. Specifically:
1) Wavelet Spectral Decomposition: Multi-level wavelet spectral decomposition separates texture-space signals into frequency bands, capturing avatar dynamics at multiple scales.
2) Low-rank Factorization: Low-rank factorization in texture space yields an efficient representation of fine-grained appearance details.
3) Distillation Pipeline: A distillation pipeline transfers motion-aware clothing dynamics and fine-grained appearance details from a pre-trained ultra-high-quality avatar model into a compact, efficient representation (a minimal training-loop sketch follows this list).
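As a concrete reference for item 3, here is a hypothetical PyTorch training-loop sketch of teacher-to-student distillation: a frozen stand-in teacher produces pose-dependent textures, and a compact student is fit to its outputs. The module architectures, tensor shapes, and the L1 objective are all assumptions for illustration, not the paper's actual design.

```python
# Hypothetical distillation sketch; architectures, shapes, and the L1 loss
# are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class DummyTeacher(nn.Module):
    """Stand-in for the frozen, pre-trained ultra-high-quality avatar model."""
    def __init__(self, pose_dim=72, tex_hw=64):
        super().__init__()
        self.tex_hw = tex_hw
        self.net = nn.Linear(pose_dim, 3 * tex_hw * tex_hw)

    def forward(self, pose):
        return self.net(pose).view(-1, 3, self.tex_hw, self.tex_hw)

class CompactStudent(nn.Module):
    """Tiny pose-conditioned texture generator standing in for the
    factorized-blendshape student."""
    def __init__(self, pose_dim=72, tex_hw=64):
        super().__init__()
        self.tex_hw = tex_hw
        self.net = nn.Sequential(
            nn.Linear(pose_dim, 256), nn.ReLU(),
            nn.Linear(256, 3 * tex_hw * tex_hw),
        )

    def forward(self, pose):
        return self.net(pose).view(-1, 3, self.tex_hw, self.tex_hw)

teacher = DummyTeacher().eval()
for p in teacher.parameters():
    p.requires_grad_(False)  # the teacher stays frozen during distillation

student = CompactStudent()
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

for step in range(100):
    pose = torch.randn(8, 72)  # random poses stand in for motion-capture data
    with torch.no_grad():
        target = teacher(pose)           # teacher's pose-dependent texture
    pred = student(pose)
    loss = (pred - target).abs().mean()  # L1 distillation loss (assumed)
    opt.zero_grad()
    loss.backward()
    opt.step()
```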
Methodology
The implementation of the MUA method includes the following steps:
- Wavelet Spectral Decomposition: Multi-level wavelet spectral decomposition is applied to capture the dynamic features of avatars across frequency bands.
- Low-rank Factorization: Low-rank factorization is performed in texture space to represent fine-grained appearance details efficiently.
- Distillation Pipeline: A distillation pipeline transfers motion-aware clothing dynamics and fine-grained appearance details from the pre-trained ultra-high-quality avatar model into the compact representation.
- Model Compression: Combining wavelet spectral decomposition with low-rank factorization yields up to 2000X lower computational cost and a 10X smaller model size (see the parameter-count sketch after this list).
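The compression claim in the last step is easiest to see with a back-of-the-envelope parameter count. The sketch below compares storing one full texture per blendshape against a shared low-rank basis with per-blendshape coefficients; the resolution, blendshape count, and rank are made-up numbers for illustration, not the paper's configuration.

```python
# Illustrative parameter-count comparison for texture-space blendshapes.
# H, W, C, K, and r are assumed values, not the paper's reported settings.
import torch

H = W = 512  # texture resolution (assumed)
C = 3        # color channels
K = 100      # number of pose-dependent blendshapes (assumed)
r = 16       # rank of the factorization (assumed)

dense_params = K * C * H * W  # one full texture stored per blendshape

# Factorized: a shared rank-r spatial basis plus per-blendshape coefficients.
basis = torch.randn(r, C, H, W)   # shared low-rank texture basis
coeffs = torch.randn(K, r)        # per-blendshape mixing weights
factored_params = basis.numel() + coeffs.numel()

print(f"dense: {dense_params:,}  factored: {factored_params:,}  "
      f"ratio: {dense_params / factored_params:.1f}x")

# Reconstructing all blendshape textures is one small tensor contraction.
textures = torch.einsum("kr,rchw->kchw", coeffs, basis)  # (K, C, H, W)
```

With these made-up numbers the factorized form is already roughly 6x smaller, and the gap widens as the blendshape count grows relative to the rank; the method's reported 2000X compute and 10X size reductions come from its full design, not from this toy calculation.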
Experiments
The experimental design includes extensive comparisons and validations using multiple datasets. Benchmarks include existing state-of-the-art avatar methods and performance tests on different devices. Key hyperparameters include the number of levels in wavelet spectral decomposition and the dimensions of low-rank factorization. Experiments also include ablation studies to verify the contribution of each component.
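Since this summary does not report the actual values, the snippet below only illustrates where such hyperparameters would live; every value is a placeholder.

```python
# Purely illustrative hyperparameter set; all values are placeholders,
# not the paper's reported settings.
config = {
    "wavelet_levels": 3,        # depth of the multi-level wavelet decomposition
    "wavelet_family": "haar",   # choice of wavelet basis
    "factorization_rank": 16,   # rank of the texture-space low-rank factorization
    "texture_resolution": 512,  # side length of the UV texture maps
}
```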
Results
Experimental results show that the MUA method significantly outperforms existing methods on multiple benchmarks. Specifically, compared to the original high-quality teacher avatar model, the MUA method achieves up to 2000X lower computational cost and a 10X smaller model size. Additionally, the MUA method achieves over 180 FPS on a desktop PC and real-time native on-device performance at 24 FPS on a standalone Meta Quest 3. Ablation studies indicate that wavelet spectral decomposition and low-rank factorization play key roles in achieving efficient representation.
Applications
Application scenarios for the MUA method include high-fidelity animatable avatars in virtual and augmented reality. By reducing computational costs and model size, the MUA method enables high-quality rendering on resource-constrained platforms. This advancement offers new possibilities for applications in gaming, film production, and virtual social interactions.
Limitations & Outlook
Despite the MUA method's excellent performance in most scenarios, it may experience detail loss in extremely complex dynamic scenes. Additionally, the method's performance might be limited on some low-end devices, especially when handling high-resolution textures. Future research directions include further optimizing the MUA method to support more complex dynamic scenes and exploring the possibility of achieving efficient rendering on a wider range of devices.
Plain Language (accessible to non-experts)
Imagine you're cooking in a kitchen. You have a large and complex recipe that requires many steps and tools, but you only have a small kitchen and limited time. The MUA method is like a clever chef who can simplify the complex recipe into a few key steps while still keeping it delicious. By using wavelet spectral decomposition and low-rank factorization, this chef can drastically reduce the steps and tools needed without sacrificing taste. It's like turning a complex three-course meal into a simple yet delicious single dish. The MUA method makes high-quality cooking possible in a small kitchen, just like achieving high-quality animatable avatars on resource-constrained platforms.
ELI14 (explained like you're 14)
Hey there! Imagine you're playing a super cool game where your character looks just like a real person! But the problem is, these realistic characters usually need super powerful computers to run, just like you need a super fast car to win a race. The MUA method is like a magic tool that lets your character look realistic even on a regular computer! It's like giving your car a super engine so you can speed through the race track even on a regular road. This method uses some clever tricks, like wavelet spectral decomposition and low-rank factorization, just like putting a super lightweight armor on your character, allowing it to move freely in the game without needing a super powerful computer to support it. Isn't that cool?
Glossary
Wavelet Spectral Decomposition
A mathematical technique used to decompose signals into components of different frequencies for easier analysis and processing.
Used in the MUA method to capture the dynamic features of avatars.
Low-rank Factorization
A matrix decomposition technique that reduces data complexity by decomposing a matrix into a product of lower-rank matrices.
Used in the MUA method to achieve efficient representation of fine-grained appearance details.
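As a tiny worked example of this definition (sizes and values are arbitrary, not from the paper):

```python
# Low-rank factorization via truncated SVD: approximate a matrix by a
# product of thin factors. Sizes and rank are arbitrary illustrations.
import numpy as np

A = np.random.rand(100, 100)             # original matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 10                                    # keep only the top-r components
A_lowrank = (U[:, :r] * s[:r]) @ Vt[:r]   # rank-r approximation of A

# Storage drops from 100*100 values to 100*r + r + r*100 values.
print(A.size, "->", U[:, :r].size + r + Vt[:r].size)
```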
Distillation Pipeline
A technique for transferring knowledge from a complex model to a simpler model to reduce computational costs.
Used to transfer dynamics and details from a pre-trained ultra-high-quality avatar model into a compact representation.
Animatable Avatar
A digital avatar that can be animated and interacted with based on user input.
The core application object of the MUA method.
High-fidelity
Refers to having extremely high detail and realism in digital representation.
The MUA method aims to maintain high fidelity while reducing computational costs.
Computational Complexity
A measure of the resources (such as time and space) required by an algorithm during execution.
The MUA method reduces computational complexity to achieve efficient rendering.
Resource-constrained Platform
Refers to devices with limited computational resources, such as mobile devices and VR headsets.
The MUA method aims to enable high-quality rendering on these platforms.
Real-time Rendering
The ability to generate images instantly during user interaction.
The MUA method achieves real-time rendering on desktop PCs and Meta Quest 3.
Meta Quest 3
A standalone virtual reality headset capable of running applications without an external computer.
The MUA method achieves 24 FPS real-time performance on Meta Quest 3.
Ablation Study
An experimental method that evaluates the contribution of certain parts of a model by gradually removing them.
Used to verify the contribution of each component in the MUA method.
Open Questions (unanswered questions from this research)
1. How can similar performance and quality be achieved without pre-trained teacher models? Current methods rely on high-quality teacher models, limiting their applicability in some scenarios. Future research needs to explore how to build efficient animatable avatar representations without teacher models.
2. How can detail loss be avoided in extremely complex dynamic scenes? Although the MUA method performs excellently in most cases, it may lose detail when dealing with complex dynamic scenes. Further research is needed to maintain high fidelity in these scenarios.
3. How can the computational cost of the MUA method be reduced further? Although the MUA method has significantly reduced computational costs, it may still be limited on some low-end devices. Future research could explore more efficient algorithms and data structures.
4. How can efficient rendering be achieved on a wider range of devices? Current research focuses mainly on desktop PCs and the Meta Quest 3. Future research could explore efficient rendering on other devices.
5. How can the model size be compressed further without sacrificing visual quality? Although the MUA method has achieved a 10X smaller model size, some applications may still require smaller models.
Applications
Immediate Applications
Virtual Reality Gaming
By reducing computational costs, the MUA method enables high-quality animatable avatars in VR games, enhancing player immersion and gaming experience.
Film Production
In film production, the MUA method can be used to create realistic digital characters, reducing production time and costs.
Virtual Social Platforms
The MUA method can be used in virtual social platforms to enable users to interact and communicate in a more realistic manner.
Long-term Vision
Education and Training
By using high-fidelity animatable avatars in education and training, the MUA method can improve learning outcomes and engagement.
Healthcare and Rehabilitation
In healthcare and rehabilitation, the MUA method can be used to create realistic virtual patients and training environments, enhancing treatment outcomes.
Abstract
Building photorealistic, animatable full-body digital humans remains a longstanding challenge in computer graphics and vision. Recent advances in animatable avatar modeling have largely progressed along two directions: improving the fidelity of dynamic geometry and appearance, or reducing computational complexity to enable deployment on resource-constrained platforms, e.g., VR headsets. However, existing approaches fail to achieve both goals simultaneously: ultra-high-fidelity avatars typically require substantial computation on server-class GPUs, whereas lightweight avatars often suffer from limited surface dynamics, reduced appearance details, and noticeable artifacts. To bridge this gap, we propose a novel animatable avatar representation, termed Wavelet-guided Multi-level Spatial Factorized Blendshapes, and a corresponding distillation pipeline that transfers motion-aware clothing dynamics and fine-grained appearance details from a pre-trained ultra-high-quality avatar model into a compact, efficient representation. By coupling multi-level wavelet spectral decomposition with low-rank structural factorization in texture space, our method achieves up to 2000X lower computational cost and a 10X smaller model size than the original high-quality teacher avatar model, while preserving visually plausible dynamics and appearance details that closely resemble those of the teacher model. Extensive comparisons with state-of-the-art methods show that our approach significantly outperforms existing avatar approaches designed for mobile settings and achieves comparable or superior rendering quality to most approaches that can only run on servers. Importantly, our representation substantially improves the practicality of high-fidelity avatars for immersive applications, achieving over 180 FPS on a desktop PC and real-time native on-device performance at 24 FPS on a standalone Meta Quest 3.