HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions

TL;DR

HSImul3R uses physics-simulator feedback to jointly refine human dynamics and scene geometry, producing simulation-ready 3D reconstructions of human-scene interactions with substantially higher simulation stability.

cs.CV · Advanced · 2026-03-17
Yukang Cao Haozhe Xie Fangzhou Hong Long Zhuo Zhaoxi Chen Liang Pan Ziwei Liu
3D reconstruction human-scene interaction physics simulation machine learning robotics

Key Findings

Methodology

The HSImul3R framework achieves simulation-ready 3D reconstruction of human-scene interactions through a physics-in-the-loop bidirectional optimization process. Forward optimization employs scene-targeted reinforcement learning to ensure motion fidelity and contact stability; reverse optimization uses Direct Simulation Reward Optimization (DSRO) to refine scene geometry based on simulation feedback. Throughout, the physics simulator acts as an active supervisor that jointly refines human dynamics and scene geometry.
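The bidirectional loop can be sketched as alternating forward and reverse updates. The toy simulator, the scalar motion and geometry variables, and the learning rates below are illustrative assumptions for exposition, not the paper's actual pipeline (which uses a full physics simulator and reinforcement learning):

```python
# Toy sketch of physics-in-the-loop bidirectional optimization.
# All names here (toy_simulate, forward_step, reverse_step) are hypothetical.

def toy_simulate(geometry: float) -> float:
    """Stand-in 'simulator': stability reward peaks when geometry == 0."""
    return -geometry ** 2

def forward_step(motion: float, target: float, lr: float = 0.1) -> float:
    """Forward direction: pull human motion toward the reference (fidelity)."""
    return motion + lr * (target - motion)

def reverse_step(geometry: float, lr: float = 0.1, eps: float = 1e-3) -> float:
    """Reverse direction: nudge geometry along a finite-difference estimate
    of the simulation reward gradient (DSRO-style feedback)."""
    grad = (toy_simulate(geometry + eps) - toy_simulate(geometry - eps)) / (2 * eps)
    return geometry + lr * grad

def optimize(motion: float, geometry: float, target: float, iters: int = 100):
    for _ in range(iters):
        motion = forward_step(motion, target)    # motion fidelity supervision
        geometry = reverse_step(geometry)        # geometry refined by sim reward
    return motion, geometry

m, g = optimize(motion=5.0, geometry=2.0, target=1.0)
# motion converges to the reference; geometry converges to the stable configuration
```

The key structural point the sketch preserves is that neither variable is optimized in isolation: each iteration interleaves a human-side update with a geometry-side update driven by simulated outcomes.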

Key Results

  • Result 1: HSImul3R significantly improved simulation stability on the HSIBench dataset, increasing stability from the baseline of 10.52% to 53.68%.
  • Result 2: In terms of image-to-3D generation quality, HSImul3R outperformed MIDI and DSO in both stability and geometric accuracy, achieving stability up to 87.23%.
  • Result 3: Through fine-tuning with Direct Simulation Reward Optimization (DSRO), HSImul3R performed strongly across multiple scenarios, significantly reducing human-scene penetration.
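A simulation-stability percentage like the 10.52% → 53.68% figure in Result 1 could, under simplifying assumptions, be computed by dropping each reconstruction into a simulator and counting the fraction that settles within a displacement tolerance. The drop test, threshold, and helper names below are hypothetical, not the paper's protocol:

```python
# Hypothetical stability-rate metric: a reconstruction counts as stable if
# its height barely changes after settling under gravity in the simulator.

def is_stable(initial_height: float, settled_height: float,
              tol: float = 0.05) -> bool:
    """Assumed criterion: displacement after settling stays within tol."""
    return abs(initial_height - settled_height) <= tol

def stability_rate(instances) -> float:
    """Percentage of (before, after) height pairs judged stable."""
    stable = sum(is_stable(h0, h1) for h0, h1 in instances)
    return 100.0 * stable / len(instances)

# toy batch: (height before, height after settling under gravity)
batch = [(1.0, 0.98), (1.0, 0.40), (1.0, 1.01), (1.0, 0.97)]
rate = stability_rate(batch)  # 3 of 4 pairs within tolerance -> 75.0
```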

Significance

HSImul3R holds significant implications for both academia and industry. It addresses the issue of visually plausible but physically unstable reconstructions in existing methods, providing a more reliable foundation for real-world robotic applications. By introducing a physics feedback mechanism, this method not only enhances simulation stability but also offers new insights for future research in agent interaction.

Technical Contribution

HSImul3R's technical contributions lie in its innovative physics feedback bidirectional optimization process. Unlike existing methods, it is the first to use a physics simulator as an active supervisor, ensuring the physical stability of reconstructions. Additionally, the method introduces a new dataset, HSIBench, enriching the research resources for human-scene interaction.

Novelty

HSImul3R is the first framework to combine physics feedback with 3D reconstruction, overcoming the issue of visual-physical inconsistency in traditional methods. Unlike existing 2D image space optimization methods, HSImul3R optimizes in 3D space, ensuring both geometric and physical validity.

Limitations

  • Limitation 1: In complex interactions or multi-object scenarios, HSImul3R's computational cost is high, potentially affecting real-time applications.
  • Limitation 2: In some cases, the reconstructed scene may still have structural defects, affecting simulation stability.
  • Limitation 3: For scenes with extreme occlusion, the accuracy of reconstructions may be compromised.

Future Work

Future research directions include optimizing computational efficiency to support real-time applications, expanding the HSIBench dataset to cover more complex scenarios, and exploring more physics feedback mechanisms to further improve reconstruction accuracy and stability.

AI Executive Summary

In modern AI research, 3D reconstruction of human-scene interactions is a critical area. However, existing methods often appear visually plausible but are physically unstable, leading to poor performance in physics engines and failing to meet real-world application needs.

To address this issue, Yukang Cao et al. proposed HSImul3R, an innovative framework that achieves simulation-ready 3D reconstruction of human-scene interactions through a physics-in-the-loop bidirectional optimization process. This method uses a physics simulator as an active supervisor to jointly refine human dynamics and scene geometry, ensuring the physical stability of reconstructions.

The core technologies of HSImul3R include scene-targeted reinforcement learning and Direct Simulation Reward Optimization. The former provides dual supervision on motion fidelity and contact stability, while the latter refines scene geometry based on simulation feedback. The innovation of this method lies in introducing a physics feedback mechanism into the 3D reconstruction process, ensuring both geometric and physical validity.

Experimental results show that HSImul3R significantly improved simulation stability on the HSIBench dataset, increasing stability from the baseline of 10.52% to 53.68%. Additionally, in terms of image-to-3D generation quality, HSImul3R outperformed MIDI and DSO in both stability and geometric accuracy, achieving stability up to 87.23%.

The significance of this research lies in providing a more reliable foundation for real-world robotic applications, addressing the issue of visually plausible but physically unstable reconstructions in existing methods. By introducing a physics feedback mechanism, this method not only enhances simulation stability but also offers new insights for future research in agent interaction.

However, HSImul3R's computational cost is high in complex interactions or multi-object scenarios, potentially affecting real-time applications. Additionally, for scenes with extreme occlusion, the accuracy of reconstructions may be compromised. Future research directions include optimizing computational efficiency to support real-time applications, expanding the HSIBench dataset to cover more complex scenarios, and exploring more physics feedback mechanisms to further improve reconstruction accuracy and stability.

Deep Analysis

Background

3D reconstruction technology has made significant progress over the past decades, particularly in computer vision and robotics. Early methods such as structured light and multi-view stereo primarily relied on geometric information extraction. In recent years, the rise of deep learning has enabled monocular depth prediction and learning-based multi-view stereo, which perform well with sparse or unstructured imagery. However, despite advancements in static scene modeling, dynamic scene modeling remains a challenge. Existing methods like NeRF and DUSt3R have achieved some success in environmental geometry but still fall short in handling human dynamics and environmental physical coupling. Particularly in simulation and real-world applications, visually plausible reconstructions often lead to instability due to violations of physical constraints.

Core Problem

In existing 3D reconstruction methods, visually plausible but physically unstable reconstructions are pervasive. This visual-physical inconsistency stems primarily from modeling human dynamics and environmental geometry separately, which leads to poor performance in physics engines and fails to meet real-world application needs. In robotics and agent interaction especially, stable physical simulation is the foundation for reliable operation. A key open problem, therefore, is how to incorporate physics feedback into the 3D reconstruction process so that physical stability is guaranteed.

Innovation

The core innovations of HSImul3R lie in its physics feedback bidirectional optimization process. First, the method uses a physics simulator as an active supervisor to ensure the physical stability of reconstructions. Second, forward optimization employs scene-targeted reinforcement learning to ensure motion fidelity and contact stability. Finally, reverse optimization uses Direct Simulation Reward Optimization to refine scene geometry based on simulation feedback. Unlike existing 2D image space optimization methods, HSImul3R optimizes in 3D space, ensuring both geometric and physical validity.
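As a rough illustration of how reverse-direction simulation feedback might be aggregated, the sketch below folds two binary outcomes (gravitational stability and interaction success, the two signals named in the abstract) into a single scalar reward used to rank candidate geometries. The equal weighting and the candidate-selection scheme are assumptions, not the paper's DSRO formulation:

```python
# Hypothetical aggregation of binary simulation outcomes into a scalar
# DSRO-style reward; weights and field names are illustrative assumptions.

def dsro_reward(gravitationally_stable: bool, interaction_success: bool,
                w_stab: float = 0.5, w_int: float = 0.5) -> float:
    """Weighted sum of the two binary simulation outcomes."""
    return w_stab * float(gravitationally_stable) + w_int * float(interaction_success)

candidates = [
    {"name": "geom_a", "stable": True,  "success": False},
    {"name": "geom_b", "stable": True,  "success": True},
    {"name": "geom_c", "stable": False, "success": True},
]
best = max(candidates, key=lambda c: dsro_reward(c["stable"], c["success"]))
# geom_b satisfies both criteria and receives the highest reward
```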

Methodology

  • The HSImul3R framework achieves simulation-ready 3D reconstruction of human-scene interactions through a physics-in-the-loop bidirectional optimization process.
  • Forward optimization employs scene-targeted reinforcement learning to ensure motion fidelity and contact stability.
  • Reverse optimization uses Direct Simulation Reward Optimization to refine scene geometry based on simulation feedback.
  • The method integrates a physics simulator as an active supervisor to jointly refine human dynamics and scene geometry.
  • The HSIBench dataset enriches the research resources for human-scene interaction.
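The dual supervision in the forward direction could, for instance, be expressed as a weighted reward combining a motion-tracking term and a contact (penetration) term. The exponential forms, weights, and function names below are illustrative assumptions, not the paper's exact reward:

```python
import math

# Hypothetical composite reward for the forward (scene-targeted RL) direction:
# dual supervision of motion fidelity and contact stability.

def motion_fidelity(sim_pose, ref_pose, alpha: float = 2.0) -> float:
    """Exponential tracking reward: 1.0 when the simulated pose matches the reference."""
    err = sum((s - r) ** 2 for s, r in zip(sim_pose, ref_pose))
    return math.exp(-alpha * err)

def contact_stability(penetration_depth: float, beta: float = 10.0) -> float:
    """Contact reward: 1.0 with no penetration, decaying as penetration grows."""
    return math.exp(-beta * max(0.0, penetration_depth))

def forward_reward(sim_pose, ref_pose, penetration_depth,
                   w_motion: float = 0.7, w_contact: float = 0.3) -> float:
    """Weighted combination of the two supervision signals (weights assumed)."""
    return (w_motion * motion_fidelity(sim_pose, ref_pose)
            + w_contact * contact_stability(penetration_depth))

r = forward_reward([0.0, 0.0], [0.0, 0.0], penetration_depth=0.0)  # -> 1.0
```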

Experiments

The experimental design includes simulation stability tests conducted on the HSIBench dataset. This dataset comprises 19 objects and over 50 motion sequences recorded from three participants (two male, one female), totaling 300 unique interaction instances. In the experiments, HSImul3R is compared against existing methods such as HSfM and MIDI, with evaluation metrics covering simulation stability, image-to-3D generation quality, and reduction of human-scene penetration. The results demonstrate that HSImul3R significantly outperforms the baseline methods across all metrics.
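One of the evaluation signals, human-scene penetration, can be illustrated with a minimal signed-distance check against a flat floor. The planar scene, the point set, and the helper name are simplifying assumptions for illustration:

```python
# Hypothetical penetration check: body points below the floor plane count as
# penetrations; the deepest one is reported (0.0 if there is none).

def max_penetration(body_points, floor_height: float = 0.0) -> float:
    """Deepest penetration of any (x, y, z) point below the floor plane."""
    return max(0.0, max(floor_height - z for _, _, z in body_points))

points = [(0.1, 0.2, 0.5), (0.0, 0.0, -0.03), (0.2, 0.1, 0.02)]
depth = max_penetration(points)  # the deepest point sits 0.03 below the floor
```

In a full evaluation one would replace the plane with the reconstructed scene mesh and the point list with the posed body surface, but the metric's shape stays the same: a maximum (or sum) of negative signed distances.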

Results

Experimental results show that HSImul3R significantly improved simulation stability on the HSIBench dataset, increasing stability from the baseline of 10.52% to 53.68%. Additionally, in terms of image-to-3D generation quality, HSImul3R outperformed MIDI and DSO in both stability and geometric accuracy, achieving stability up to 87.23%. Through DSRO fine-tuning, HSImul3R demonstrated excellent performance across multiple scenarios, significantly reducing human-scene penetration issues.

Applications

Application scenarios for HSImul3R include robotics and agent interaction, virtual reality, and augmented reality. This method provides a more reliable foundation for real-world robotic applications, addressing the issue of visually plausible but physically unstable reconstructions in existing methods. By introducing a physics feedback mechanism, this method not only enhances simulation stability but also offers new insights for future research in agent interaction.

Limitations & Outlook

HSImul3R's computational cost is high in complex interactions or multi-object scenarios, potentially affecting real-time applications. Additionally, for scenes with extreme occlusion, the accuracy of reconstructions may be compromised. Future research directions include optimizing computational efficiency to support real-time applications, expanding the HSIBench dataset to cover more complex scenarios, and exploring more physics feedback mechanisms to further improve reconstruction accuracy and stability.

Plain Language (Accessible to non-experts)

Imagine you're building a LEGO scene, but you want it to not only look good but also stand stably in reality. HSImul3R is like having a smart assistant while building LEGO, which not only focuses on the appearance but also checks in real-time if each piece can stand stably under gravity. This assistant adjusts the position and angle of the pieces according to physical laws, ensuring the entire structure can exist stably in reality.

In this process, HSImul3R first generates an initial 3D model based on existing images, just like you build a rough LEGO model based on instructions. Then, it checks through a simulator whether this model is stable in reality, like gently pushing the LEGO model to see if it will collapse.

If the model is found to be unstable, HSImul3R makes adjustments, such as repositioning certain pieces or adding support structures, ensuring the entire model can exist stably in reality. Ultimately, you get not only a beautiful LEGO model but also a structure that can exist stably in reality.

ELI14 (Explained like you're 14)

Hey there! Imagine you're playing a super cool game where you need to build a virtual LEGO city. You want this city to not only look awesome but also run stably in the game. HSImul3R is like your game assistant, helping you make sure every building can stand firmly in the game.

First, HSImul3R generates an initial city model based on your design, just like you build a rough LEGO city based on game prompts. Then, it checks through the game's physics engine whether this city is stable in the game, like gently pushing the LEGO model to see if it will collapse.

If the city is found to be unstable, HSImul3R makes adjustments, such as repositioning certain buildings or adding support structures, ensuring the entire city can exist stably in the game. Ultimately, you get not only a beautiful LEGO city but also a city that can exist stably in the game.

So, next time you're building a city in the game, remember to let HSImul3R help you out; it will make your city cooler and more stable!

Glossary

HSImul3R

HSImul3R is a framework for simulation-ready 3D reconstruction of human-scene interactions, achieved through a physics-in-the-loop bidirectional optimization process.

Used to address the issue of visually plausible but physically unstable reconstructions.

Physics Feedback

Physics feedback refers to the process of using feedback information from a physics simulator to adjust and optimize 3D reconstructions.

Used to ensure the physical stability of reconstructions.

Bidirectional Optimization

Bidirectional optimization refers to optimizing in both forward and reverse directions to improve human dynamics and scene geometry simultaneously.

Used to finely tune human dynamics and scene geometry.

Scene-Targeted Reinforcement Learning

A reinforcement learning method used to optimize human motion, ensuring motion fidelity and contact stability.

Used in the forward optimization process.

Direct Simulation Reward Optimization

An optimization method that refines scene geometry using simulation feedback.

Used in the reverse optimization process.

HSIBench

HSIBench is a dataset containing various human-scene interaction scenarios, used to evaluate the performance of 3D reconstruction methods.

Used for experimental evaluation.

Simulation Stability

Simulation stability refers to whether the reconstructed scene can exist stably under gravity and interaction forces in a physics simulator.

Used to evaluate the physical validity of reconstruction methods.

Image-to-3D Generation

Image-to-3D generation refers to the process of generating a three-dimensional model from two-dimensional images.

Used for initial 3D model generation.

Human-Scene Penetration

Human-scene penetration refers to the phenomenon of mutual penetration between human models and scene models in 3D reconstruction.

Used to evaluate the geometric accuracy of reconstructions.

Physics Simulator

A physics simulator is a software tool used to simulate physical phenomena, capable of providing feedback on the physical validity of reconstructed models.

Used to provide physics feedback information.

Open Questions (Unanswered questions from this research)

  1. How can the computational efficiency of HSImul3R be improved in complex interaction scenarios to support real-time applications? Existing methods incur high computational costs in such scenarios, limiting real-time use.
  2. How can the HSIBench dataset be expanded to cover more complex scenarios? The existing dataset still lacks scenario diversity, limiting the method's generalization ability.
  3. How can reconstruction accuracy be improved in scenes with extreme occlusion? Occlusion is a major challenge in 3D reconstruction, and existing methods still have room for improvement here.
  4. How can the physics feedback mechanism be further optimized to improve reconstruction accuracy and stability? Existing feedback mechanisms may be insufficient to ensure physical validity in some cases.
  5. How can the physical stability of reconstructions be ensured in multi-object scenarios? Physical interactions between multiple objects are complex, and existing methods still have room for improvement in handling them.

Applications

Immediate Applications

Robotic Interaction

HSImul3R can be used to optimize robot-environment interactions, ensuring the stability and reliability of robotic operations.

Virtual Reality

In virtual reality, HSImul3R can be used to generate physically stable virtual environments, enhancing user experience.

Augmented Reality

In augmented reality applications, HSImul3R can be used to generate virtual objects consistent with the real environment, enhancing the realism of interactions.

Long-term Vision

Agent Interaction Research

HSImul3R provides new insights for future research in agent interaction, potentially promoting more complex interactions between agents and environments.

Large-Scale Dataset Generation

HSImul3R can be used to generate large-scale simulation datasets, supporting the training and optimization of machine learning models.

Abstract

We present HSImul3R, a unified framework for simulation-ready 3D reconstruction of human-scene interactions (HSI) from casual captures, including sparse-view images and monocular videos. Existing methods suffer from a perception-simulation gap: visually plausible reconstructions often violate physical constraints, leading to instability in physics engines and failure in embodied AI applications. To bridge this gap, we introduce a physically-grounded bi-directional optimization pipeline that treats the physics simulator as an active supervisor to jointly refine human dynamics and scene geometry. In the forward direction, we employ Scene-targeted Reinforcement Learning to optimize human motion under dual supervision of motion fidelity and contact stability. In the reverse direction, we propose Direct Simulation Reward Optimization, which leverages simulation feedback on gravitational stability and interaction success to refine scene geometry. We further present HSIBench, a new benchmark with diverse objects and interaction scenarios. Extensive experiments demonstrate that HSImul3R produces the first stable, simulation-ready HSI reconstructions and can be directly deployed to real-world humanoid robots.

cs.CV cs.RO

References (20)

  • Reconstructing People, Places, and Cameras. Lea Müller, Hongsuk Choi, Anthony Zhang et al., 2024. 17 citations.
  • Perpetual Humanoid Control for Real-time Simulated Avatars. Zhengyi Luo, Jinkun Cao, Alexander W. Winkler et al., 2023. 232 citations.
  • Retargeting Matters: General Motion Retargeting for Humanoid Motion Tracking. Joao Pedro Araujo, Yanjie Ze, Pei Xu et al., 2025. 43 citations.
  • MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation. Zehuan Huang, Yuan-Chen Guo, Xingqiao An et al., 2024. 54 citations.
  • DiffMimic: Efficient Motion Mimicking with Differentiable Physics. Jiawei Ren, Cunjun Yu, Siwei Chen et al., 2023. 23 citations.
  • SAM 2: Segment Anything in Images and Videos. Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu et al., 2024. 2681 citations.
  • Semantic Scene Completion from a Single Depth Image. Shuran Song, F. Yu, Andy Zeng et al., 2016. 1383 citations.
  • ECON: Explicit Clothed humans Optimized via Normal integration. Yuliang Xiu, Jinlong Yang, Xu Cao et al., 2022. 248 citations.
  • MatrixCity: A Large-scale City Dataset for City-scale Neural Rendering and Beyond. Yixuan Li, Lihan Jiang, Linning Xu et al., 2023. 167 citations.
  • DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation. Haozhe Xie, Beichen Wen, Jia Zheng et al., 2026. 2 citations.
  • Visual Imitation Enables Contextual Humanoid Control. Arthur Allshire, Hongsuk Choi, Junyi Zhang et al., 2025. 70 citations.
  • 2D Gaussian Splatting for Geometrically Accurate Radiance Fields. Binbin Huang, Zehao Yu, Anpei Chen et al., 2024. 1072 citations.
  • HOLD: Category-Agnostic 3D Reconstruction of Interacting Hands and Objects from Video. Zicong Fan, Maria Parelli, Maria Eleni Kadoglou et al., 2023. 59 citations.
  • PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics. Tianyi Xie, Zeshun Zong, Yuxing Qiu et al., 2023. 348 citations.
  • PhysPart: Physically Plausible Part Completion for Interactable Objects. Rundong Luo, Haoran Geng, Congyue Deng et al., 2024. 24 citations.
  • LoRA: Low-Rank Adaptation of Large Language Models. J. Hu, Yelong Shen, Phillip Wallis et al., 2021. 17211 citations.
  • 2D Semantic-Guided Semantic Scene Completion. Xianzhu Liu, Haozhe Xie, Shengping Zhang et al., 2024. 12 citations.
  • Atlas3D: Physically Constrained Self-Supporting Text-to-3D for Simulation and Fabrication. Yunuo Chen, Tianyi Xie, Zeshun Zong et al., 2024. 17 citations.
  • Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination. Leonardo Barcellona, Andrii Zadaianchuk, Davide Allegro et al., 2024. 28 citations.
  • Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. Lihe Yang, Bingyi Kang, Zilong Huang et al., 2024. 1585 citations.