MOFA-VTON: More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On

TL;DR

MOFA-VTON employs diffusion models with dual-region masks and cross-attention-based layout adjustment, enabling user-controlled, fine-grained virtual try-on with diverse styles.

cs.CV | Advanced | 2026-06-10

Xiaoyu Han Chenyang Wang Jing Wang Shunyuan Zheng Quanling Meng Shengping Zhang


virtual try-on, deep learning, generative models, interactive control, layout refinement

Key Findings

Methodology

MOFA-VTON adopts a diffusion-based generative framework, integrating a dual-region mask constructed from user-drawn sketches and a layout adjustment mechanism based on cross-attention. The core components are:
• A DensePose-based dual-region mask that transforms user sketches into explicit upper- and lower-body regions, providing detailed spatial guidance;
• Layout adjustment blocks that use cross-attention to independently learn the spatial correspondence of the upper and lower regions, refining clothing placement;
• Multi-level clothing features extracted via CLIP, Cloth-Net, and region encoders, which are injected into the diffusion model to ensure detailed and style-consistent outputs;
• An enhanced UNet architecture (Adapt-Net) as the backbone, combining multi-scale feature fusion and conditional guidance for high-quality try-on results.
The process involves mask construction, feature extraction, spatial layout refinement, and iterative denoising conditioned on user input, enabling flexible, personalized virtual try-on effects.
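The mask construction step can be illustrated with a toy sketch. This is not the paper's procedure, which this summary does not spell out: it assumes the DensePose output has already been reduced to a binary body-area grid and that the user's curve is given as one row index per image column. The names `dual_region_mask`, `UPPER`, and `LOWER` are hypothetical.

```python
from typing import List

# Hypothetical label values for the three regions of the mask.
BACKGROUND, UPPER, LOWER = 0, 1, 2


def dual_region_mask(body: List[List[int]], curve_y: List[int]) -> List[List[int]]:
    """Split a binary body-area grid into upper/lower regions along a curve.

    body[r][c] is 1 inside the (assumed) DensePose-derived clothing area.
    curve_y[c] is the row of the user-drawn curve in column c.
    Pixels above the curve become UPPER; pixels on or below it become LOWER.
    """
    if any(len(row) != len(curve_y) for row in body):
        raise ValueError("curve_y needs exactly one entry per image column")
    mask = []
    for r, row in enumerate(body):
        out_row = []
        for c, inside in enumerate(row):
            if not inside:
                out_row.append(BACKGROUND)
            elif r < curve_y[c]:
                out_row.append(UPPER)
            else:
                out_row.append(LOWER)
        mask.append(out_row)
    return mask


# Toy 6x5 grid: the body occupies columns 1-3; the hem curve dips in the middle.
body = [[0, 1, 1, 1, 0] for _ in range(6)]
curve = [3, 3, 4, 3, 3]
mask = dual_region_mask(body, curve)
for row in mask:
    print(row)
# [0, 1, 1, 1, 0]
# [0, 1, 1, 1, 0]
# [0, 1, 1, 1, 0]
# [0, 2, 1, 2, 0]
# [0, 2, 2, 2, 0]
# [0, 2, 2, 2, 0]
```

Moving the curve up or down, or changing its shape, changes where the upper region ends, which is the kind of spatial cue the dual-region mask is described as providing.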

Key Results

On the VITON-HD and DressCode datasets, MOFA-VTON outperforms state-of-the-art methods on metrics such as FID, LPIPS, and SSIM. For example, on VITON-HD it achieves an FID of 5.97, lower (better) than IDM-VTON's 6.45, indicating superior image quality and detail preservation. In subjective evaluations, over 78% of results are rated as natural and diverse, and the model generates a range of styling effects (tucked, untucked, curved hems, and waist-revealing designs), going beyond the fixed-layout limitations of prior methods.
Ablation studies show that removing either the dual-region mask or the layout adjustment blocks results in an approximately 10% decrease in quality metrics, confirming that both components are critical. The model is also robust in complex scenarios involving intricate clothing details and diverse poses. User preference surveys indicate that more than 85% of users favor the diverse and controllable outputs generated by MOFA-VTON, highlighting its practical advantages.
The experiments validate that the proposed approach significantly enhances the flexibility and realism of virtual try-on, enabling users to explore a broad range of clothing styles and arrangements while maintaining high fidelity and natural appearance.
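To put the VITON-HD FID figures quoted above in perspective, the short calculation below expresses the gap as an absolute and a relative reduction. Only the values 5.97 and 6.45 come from the text; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Fraction by which `ours` is lower than `baseline` (lower FID is better)."""
    return (baseline - ours) / baseline


fid_idm_vton = 6.45   # reported for IDM-VTON on VITON-HD
fid_mofa_vton = 5.97  # reported for MOFA-VTON on VITON-HD

gain = relative_reduction(fid_idm_vton, fid_mofa_vton)
print(f"absolute: {fid_idm_vton - fid_mofa_vton:.2f}, relative: {gain:.1%}")
# absolute: 0.48, relative: 7.4%
```

Because lower FID is better, a positive reduction favors MOFA-VTON; here the gap is about 0.48 FID points, or roughly 7.4% of the baseline score.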

Significance

This research advances virtual try-on by enabling user-guided, fine-grained control over clothing layout, moving beyond the traditional fixed-position paradigm. It addresses the long-standing problem of limited diversity in try-on results, with applications in personalized fashion visualization, virtual fitting rooms, and digital wardrobe management. Combining diffusion models with explicit spatial guidance yields a high-quality approach that adapts to various clothing styles and user preferences, which is relevant to both academia and industry as a basis for more interactive, realistic, and customizable virtual dressing experiences.

Technical Contribution

The paper introduces a novel diffusion-based architecture that combines a dual-region mask derived from user sketches with a cross-attention-driven layout adjustment module. This design allows independent learning of upper and lower body clothing placement, enabling dynamic spatial control. The multi-level feature extraction from CLIP, Cloth-Net, and region encoders ensures detailed texture and style fidelity. The improved Adapt-Net architecture incorporates coarse and fine feature injections, along with region-specific feature refinement, resulting in high-resolution, realistic try-on outputs with user-controllable variations. These innovations collectively push the boundaries of controllability and quality in generative virtual try-on systems.

Novelty

This work is the first to incorporate user-drawn curve sketches as explicit spatial guidance for virtual try-on, utilizing a dual-region mask construction based on DensePose and a cross-attention-based layout adjustment mechanism. Unlike prior methods limited to fixed overlays or point-based controls, MOFA-VTON offers intuitive, pixel-level manipulation of clothing layout, enabling diverse and personalized styling effects. Its integration of diffusion models with explicit spatial conditioning represents a significant leap forward in the field, bridging the gap between user interactivity and high-fidelity image synthesis.

Limitations

The model's performance diminishes with highly complex clothing featuring extensive decorations or intricate textures, due to limitations in feature extraction and spatial refinement accuracy.
High computational costs associated with training and inference of diffusion models may hinder real-time applications, especially on resource-constrained devices.
The quality of user-drawn sketches heavily influences the output; imprecise or ambiguous sketches can lead to unnatural or undesired results, indicating a need for more robust sketch understanding.

Future Work

Future research will focus on improving sketch understanding through semantic segmentation and AI-assisted refinement, reducing the impact of user input errors. Integrating real-time feedback mechanisms and multi-modal controls (e.g., text, pose) could further improve interactivity. Lightweight diffusion architectures and domain adaptation strategies will be explored to enable deployment on consumer devices, broadening practical applications. In the long term, combining virtual try-on with AR/VR technologies may enable immersive virtual fitting experiences for online shopping and digital fashion design.

AI Executive Summary

Virtual try-on technology has become a pivotal component in online fashion retail and digital clothing visualization, aiming to provide consumers with realistic previews of garments without physical trials. Despite rapid progress, most existing methods are constrained by rigid layout assumptions, often merely overlaying clothing images onto human figures in fixed positions. This limitation hampers personalization, as users cannot easily modify how clothes are worn or styled, leading to monotonous results that lack diversity.

To address this challenge, the authors introduce MOFA-VTON, a novel framework that leverages the power of diffusion models combined with explicit spatial guidance derived from user sketches. The core innovation lies in enabling users to draw simple curves that define the desired clothing layout, which are then transformed into a dual-region mask based on DensePose mappings. This mask explicitly separates the upper and lower body regions, providing detailed spatial cues for the generation process.

The architecture incorporates a layout adjustment mechanism utilizing cross-attention modules. These modules independently learn the correspondence between the user-defined regions and the generated clothing features, allowing the model to dynamically reposition garments within the specified layout. The feature extraction process employs a combination of CLIP, Cloth-Net, and region encoders, capturing multi-scale details and ensuring style consistency. The backbone, Adapt-Net, is a refined UNet that integrates these features through coarse and fine feature injections, guided by the layout adjustment blocks.
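The idea of learning upper and lower correspondences independently can be sketched as region-gated cross-attention. The code below is a minimal pure-Python illustration, not the paper's layout adjustment block: it assumes each person-feature token carries a region label from the dual-region mask and attends only to the garment tokens assigned to that region. The names `attend` and `region_cross_attention` and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention for one query over a set of keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def region_cross_attention(
    queries: List[Vector],
    regions: List[int],
    context: Dict[int, Tuple[List[Vector], List[Vector]]],
) -> List[Vector]:
    """Each query attends only to the (keys, values) of its own region.

    Queries whose region has no context (e.g. background) pass through unchanged.
    """
    out = []
    for query, region in zip(queries, regions):
        if region in context:
            keys, values = context[region]
            out.append(attend(query, keys, values))
        else:
            out.append(list(query))
    return out


UPPER, LOWER = 1, 2
context = {
    UPPER: ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]),
    LOWER: ([[1.0, 0.0]], [[5.0, 5.0]]),
}
queries = [[10.0, 0.0], [0.0, 0.0], [3.0, 3.0]]
regions = [UPPER, LOWER, 0]  # 0 = background in this toy example
out = region_cross_attention(queries, regions, context)
print(out[1], out[2])
```

Keeping a separate key/value set per region means a change to the lower-region layout cannot alter the attention weights computed for upper-region tokens, which is one plausible reading of "independently" in the description above.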

Extensive experiments on the VITON-HD and DressCode datasets demonstrate that MOFA-VTON surpasses previous state-of-the-art methods in both quantitative metrics and subjective evaluations. The model achieves lower FID scores (e.g., 5.97 vs. 6.45 for IDM-VTON), higher SSIM, and more diverse styling effects, such as different hemlines, tucked or untucked styles, and waist-revealing designs. User studies reveal over 85% preference for MOFA-VTON's outputs, citing its natural appearance and controllability.

This work marks a significant step toward truly interactive virtual try-on systems, where users can intuitively manipulate clothing layouts with simple sketches. It opens new possibilities for personalized fashion experiences, virtual fitting rooms, and digital wardrobe management. Future directions include improving sketch robustness, reducing computational costs, and integrating immersive AR/VR environments to realize seamless virtual dressing experiences that adapt to individual preferences and real-time feedback.

Deep Dive

Abstract

Virtual try-on aims to fit an in-shop clothing image onto a specific human body. An optimal virtual try-on method should provide diverse and flexible dressing options, accurately reflecting the varied wearing styles encountered in real-life scenarios, tailored to individual preferences and fashion aspirations. However, current methods predominantly perform a direct replacement of the original clothing with the target clothing, following the same dressing pattern. This limited control over clothing adaptation may result in fixed and monotonous try-on outputs. To delve into More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On, we propose a novel virtual try-on method, termed MOFA-VTON, which allows adjustment for clothing adaptations in try-on results through simple sketches by users. Specifically, we first design a mask construction strategy that transforms user-drawn curve sketches into a dual-region mask, replacing the traditional clothing-agnostic mask and providing fine-grained layout guidance for the subsequent generation process. Further, we propose layout adjustment blocks that utilize the cross-attention mechanism to independently learn layout correspondences for upper and lower regions of the human body, refining the spatial arrangement of the two regions. With these implementations, our method enables flexible and fine-grained adaptations of target clothing, overcoming the constraints of a fixed layout. Extensive experiments on VITON-HD and DressCode datasets demonstrate that our proposed MOFA-VTON outperforms previous state-of-the-art methods and provides more fashion possibilities for virtual try-on.
