MessyKitchens: Contact-rich object-level 3D scene reconstruction

TL;DR

The MessyKitchens dataset and the Multi-Object Decoder (MOD) together enable high-precision, object-level monocular 3D scene reconstruction, significantly improving the physical plausibility of inter-object contacts.

cs.CV · Advanced · 2026-03-18
Junaid Ahmed Ansari, Ran Ding, Fabio Pizzati, Ivan Laptev
3D reconstruction · monocular depth estimation · object-level scene reconstruction · dataset · physical plausibility

Key Findings

Methodology

This paper presents a novel 3D scene reconstruction method combining the MessyKitchens dataset and the Multi-Object Decoder (MOD). The MessyKitchens dataset features complex real-world kitchen scenes with high-fidelity 3D object shapes, poses, and accurate object contact information. The MOD algorithm extends the SAM 3D single-object reconstruction framework to simultaneously predict the geometry and poses of multiple objects in a scene, achieving physically plausible scene reconstruction.
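To make the joint-decoding idea concrete, here is a minimal, hypothetical sketch in PyTorch. None of the names (MultiObjectDecoder, scene_features, the head dimensions) come from the paper or the SAM 3D codebase; the only point it illustrates is that all objects are decoded together from shared scene features, so each object's prediction can condition on the others.

```python
import torch
import torch.nn as nn

class MultiObjectDecoder(nn.Module):
    """Illustrative sketch of joint multi-object decoding (all names hypothetical)."""

    def __init__(self, feat_dim=256, num_queries=32, latent_dim=64):
        super().__init__()
        # One learnable query per potential object in the scene.
        self.object_queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        # Queries attend to shared scene features, so every object "sees"
        # the rest of the scene while being decoded.
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.shape_head = nn.Linear(feat_dim, latent_dim)  # per-object shape latent
        self.pose_head = nn.Linear(feat_dim, 3 + 4 + 1)    # translation + quaternion + scale

    def forward(self, scene_features):
        # scene_features: (B, N_tokens, feat_dim) from an image encoder.
        q = self.object_queries.unsqueeze(0).expand(scene_features.shape[0], -1, -1)
        obj_tokens, _ = self.cross_attn(q, scene_features, scene_features)
        return self.shape_head(obj_tokens), self.pose_head(obj_tokens)
```

In use this would look like `shape_latents, poses = decoder(features)`, with the shape latents then fed to a mesh or SDF generator. The architectural point the paper makes is the joint decoding of all objects; the specific layers above are placeholders.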

Key Results

  • The MessyKitchens dataset significantly improves over existing datasets in inter-object contact quality and object registration accuracy, with an average depth error of only 1.62 mm compared to 3.22 mm for GraspClutter6D.
  • The MOD algorithm achieves object-level IoU scores of 0.445, 0.344, and 0.404 on the MessyKitchens, GraspNet-1B, and HouseCat6D datasets, respectively, outperforming SAM 3D and other baseline methods.
  • Ablation studies confirm that the MOD algorithm improves scene-level IoU by approximately 10%, demonstrating better handling of inter-object interactions in complex scenes.

Significance

This research holds significant implications for both academia and industry. It not only provides a new high-fidelity dataset, MessyKitchens, but also introduces a method capable of achieving physically plausible multi-object 3D scene reconstruction. This method addresses long-standing issues of inaccurate inter-object physical interactions in robotics and animation applications, laying a solid foundation for future research and applications.

Technical Contribution

The technical contributions of this paper centre on the Multi-Object Decoder (MOD), which extends single-object reconstruction to predict the geometry and poses of multiple objects simultaneously. Compared to existing methods, MOD improves both the physical plausibility of inter-object contacts and reconstruction accuracy across datasets.

Novelty

MessyKitchens is the first high-fidelity dataset focusing on complex kitchen scenes, and the MOD algorithm is the first decoder to extend single-object reconstruction to multiple objects, significantly enhancing the physical plausibility and accuracy of scene reconstruction.

Limitations

  • The MOD algorithm may encounter high computational costs when dealing with very complex scenes, especially those involving a large number of objects.
  • While the MessyKitchens dataset performs excellently in kitchen scenes, its generalizability to other types of scenes needs further validation.
  • The dataset construction process, relying on high-precision object scanning, is relatively complex, potentially limiting its direct application in other fields.

Future Work

Future research directions include extending the applicability of the MOD algorithm to other complex environments and optimizing the algorithm's computational efficiency for real-time applications.

AI Executive Summary

Monocular 3D scene reconstruction has seen significant progress in recent years; however, decomposing complex scenes into individual 3D objects remains a challenge. Existing methods struggle with object diversity, occlusions, and complex object relations, particularly in applications like robotics and animation, where physical plausibility of inter-object interactions is crucial.

To address these challenges, this paper introduces the MessyKitchens dataset and the Multi-Object Decoder (MOD). The MessyKitchens dataset comprises 100 real-world kitchen scenes, providing high-fidelity 3D object shapes, poses, and accurate object contact information. The MOD algorithm extends the SAM 3D single-object reconstruction framework to simultaneously predict the geometry and poses of multiple objects in a scene.

By reconstructing objects simultaneously, the MOD algorithm captures contextual relationships and enforces more physically plausible configurations. Experimental results demonstrate that the MessyKitchens dataset significantly outperforms existing datasets in terms of object registration accuracy and the physical plausibility of inter-object contacts. The MOD algorithm performs excellently on the MessyKitchens, GraspNet-1B, and HouseCat6D datasets, particularly in object-level and scene-level IoU metrics.

This research not only provides a new high-fidelity dataset but also introduces a method capable of achieving physically plausible multi-object 3D scene reconstruction. It addresses long-standing issues of inaccurate inter-object physical interactions in robotics and animation applications, laying a solid foundation for future research and applications.

However, the MOD algorithm may encounter high computational costs when dealing with very complex scenes, especially those involving a large number of objects. Additionally, while the MessyKitchens dataset performs excellently in kitchen scenes, its generalizability to other types of scenes needs further validation. Future research directions include extending the applicability of the MOD algorithm to other complex environments and optimizing the algorithm's computational efficiency for real-time applications.

Deep Analysis

Background

3D scene reconstruction plays a pivotal role in digital arts, content creation, industrial inspection, surgery, heritage preservation, navigation, and robot learning and simulation. Traditional geometry-based methods have gradually been replaced by learning-based approaches, which rely on learned inductive biases to achieve accurate shape predictions from a single image. Recent methods such as DepthAnything, VGGT, and Gen3C have made significant advances in monocular depth estimation. However, object-level scene reconstruction has received relatively less attention. Existing methods like MIDI and PartCrafter show impressive results in synthetic scenes, while SAM 3D enables the estimation of the shape and pose of single objects in real images. Nevertheless, progress in object-level scene reconstruction also requires realistic and high-fidelity benchmarks for training and evaluation.

Core Problem

The core problem of object-level 3D scene reconstruction is accurately decomposing and reconstructing individual objects in complex scenes. This task is challenging due to the large variety of object shapes, frequent occlusions, and complex object relations. Moreover, applications in robotics and animation require physically plausible scene reconstruction where objects obey physical principles of non-penetration and realistic contacts. These requirements make it difficult for existing methods to achieve high-precision reconstruction in complex scenes.

Innovation

The core innovations of this paper include:

1. Introducing the MessyKitchens dataset, which features complex real-world kitchen scenes with high-fidelity 3D object shapes, poses, and accurate object contact information.

2. Proposing the Multi-Object Decoder (MOD), which extends the SAM 3D single-object reconstruction framework to simultaneously predict the geometry and poses of multiple objects in a scene.

3. The MOD algorithm captures contextual relationships by reconstructing multiple objects simultaneously, enforcing more physically plausible configurations.

Methodology

  • Construction of the MessyKitchens dataset: 100 real scenes were collected, each composed of a variable number of kitchenware objects, using an Einstar Vega 3D scanner for high-precision scanning (an illustrative annotation sketch follows this list).
  • Difficulty levels of the dataset: scenes were categorized into easy, medium, and hard levels based on inter-object contact and complexity.
  • Design of the Multi-Object Decoder (MOD): built upon the SAM 3D framework, MOD adds a decoder capable of simultaneously predicting the geometry and poses of multiple objects.
  • Experimental design: the MOD algorithm was evaluated on the MessyKitchens, GraspNet-1B, and HouseCat6D datasets and compared with existing baseline methods.
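To picture what per-scene ground truth of this kind might contain, here is a purely hypothetical record layout; it is not the dataset's actual file format, only a way to visualise the shapes, poses, contact pairs, and difficulty tags described above.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ObjectAnnotation:
    """Hypothetical per-object ground truth: a scanned mesh plus a 6-DoF pose."""
    mesh_path: str            # high-fidelity scan of the object
    rotation: np.ndarray      # (3, 3) rotation matrix, object-to-scene
    translation: np.ndarray   # (3,) translation in metres

@dataclass
class SceneAnnotation:
    """Hypothetical per-scene record: objects, contact pairs, difficulty tag."""
    image_path: str
    objects: list = field(default_factory=list)    # list of ObjectAnnotation
    contacts: list = field(default_factory=list)   # list of (i, j) index pairs that touch
    difficulty: str = "easy"                       # "easy" | "medium" | "hard"
```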

Experiments

The experimental design includes evaluating the MOD algorithm's performance on the MessyKitchens, GraspNet-1B, and HouseCat6D datasets. Baseline methods used include PartCrafter, MIDI, and SAM 3D. Evaluation metrics include object-level and scene-level IoU and Chamfer Distance. The experiments also include ablation studies to verify the MOD algorithm's performance in different scenarios.
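The two headline metrics can be sketched on sampled point clouds as below. This is a generic illustration, not the paper's exact evaluation protocol (sampling density, voxel size, and whether Chamfer distances are squared all vary between papers).

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(pred_pts, gt_pts):
    """Symmetric Chamfer Distance between two (N, 3) point clouds (unsquared variant)."""
    d_pred_to_gt, _ = cKDTree(gt_pts).query(pred_pts)   # nearest GT point per predicted point
    d_gt_to_pred, _ = cKDTree(pred_pts).query(gt_pts)   # nearest predicted point per GT point
    return d_pred_to_gt.mean() + d_gt_to_pred.mean()

def voxel_iou(pred_pts, gt_pts, voxel=0.01):
    """Coarse IoU proxy: voxelise both point clouds at `voxel` metres and compare the sets."""
    pred_vox = {tuple(v) for v in np.floor(pred_pts / voxel).astype(int)}
    gt_vox = {tuple(v) for v in np.floor(gt_pts / voxel).astype(int)}
    union = len(pred_vox | gt_vox)
    return len(pred_vox & gt_vox) / union if union else 0.0
```

Object-level IoU would apply such a measure per object, while scene-level IoU would apply it to the union of all reconstructed objects, which helps explain why joint decoding that reduces inter-object penetration benefits the scene-level score in the ablations.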

Results

Experimental results show that the MOD algorithm significantly improves the physical plausibility of inter-object contacts and object registration accuracy on the MessyKitchens dataset. Specifically, the MOD achieves object-level IoU scores of 0.445, 0.344, and 0.404 on the MessyKitchens, GraspNet-1B, and HouseCat6D datasets, respectively, outperforming SAM 3D and other baseline methods. Additionally, the MOD also excels in scene-level IoU, demonstrating better handling of inter-object interactions in complex scenes.

Applications

The MOD algorithm has direct application scenarios in robotics and animation, particularly in tasks requiring physically plausible inter-object interactions. Its high-precision object reconstruction capability can be used in industrial inspection, surgical planning, and virtual reality scene construction.

Limitations & Outlook

Despite the MOD algorithm's excellent performance in complex scenes, it may encounter high computational costs, especially when dealing with a large number of objects. Additionally, while the MessyKitchens dataset performs excellently in kitchen scenes, its generalizability to other types of scenes needs further validation. Future research directions include extending the applicability of the MOD algorithm to other complex environments and optimizing the algorithm's computational efficiency for real-time applications.

Plain Language (Accessible to non-experts)

Imagine you're in a kitchen, and the table is cluttered with various dishes and ingredients. You need to take a picture of this table with a camera and then use a computer to reconstruct a 3D model that accurately shows the position and shape of each object. The MessyKitchens dataset is like a detailed kitchen guide that helps you identify the shape and position of each object. The Multi-Object Decoder (MOD) acts like a smart assistant that not only recognizes each object but also understands their relationships, such as which plate is stacked on which bowl, or which spoon is inserted into which cup. In this way, MOD can create a realistic 3D scene, allowing you to experience the kitchen's authenticity in a virtual world.

ELI14 (Explained like you're 14)

Hey there, imagine you're playing a super cool 3D game with a super complex kitchen scene filled with all kinds of pots and pans. You need to use a special camera to capture this scene and then use a computer to turn it into a 3D model. The MessyKitchens dataset is like the game's guide, telling you the shape and position of each object. And MOD is like a super smart assistant that not only recognizes each object but also understands their relationships, like which plate is stacked on which bowl. So, in the game, you can see a super realistic kitchen scene, just like you're really in the kitchen! Isn't that awesome?

Glossary

MessyKitchens

MessyKitchens is a high-fidelity dataset featuring complex kitchen scenes, providing 3D object shapes, poses, and accurate object contact information.

Used in the paper to validate the performance of the MOD algorithm.

Multi-Object Decoder (MOD)

MOD is an algorithm that extends SAM 3D to simultaneously predict the geometry and poses of multiple objects, achieving physically plausible scene reconstruction.

Used to improve the physical plausibility of inter-object contacts.

SAM 3D

SAM 3D is a framework for single-object reconstruction, capable of estimating the shape and pose of single objects in real images.

MOD algorithm is built upon this framework.

Object-level Scene Reconstruction

Object-level scene reconstruction involves decomposing complex scenes into individual 3D objects and accurately reconstructing each object's shape and pose.

The core problem addressed in this paper.

Chamfer Distance

Chamfer Distance is a metric used to evaluate the similarity between two point clouds, commonly used for assessing 3D reconstruction accuracy.

Used to evaluate the reconstruction accuracy of the MOD algorithm.
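A common form of the metric, written out (some papers use squared distances instead):

$$ d_{\mathrm{CD}}(P, Q) = \frac{1}{|P|}\sum_{p \in P}\min_{q \in Q}\lVert p - q\rVert \;+\; \frac{1}{|Q|}\sum_{q \in Q}\min_{p \in P}\lVert q - p\rVert $$

where $P$ and $Q$ are points sampled from the predicted and ground-truth surfaces.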

IoU

IoU (Intersection over Union) is a metric for evaluating the overlap between two shapes, commonly used for image segmentation and 3D reconstruction accuracy.

Used to evaluate the reconstruction accuracy of the MOD algorithm.
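In its volumetric form, for predicted and ground-truth occupied volumes (or voxel sets) $A$ and $B$:

$$ \mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|} $$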

Object Registration Accuracy

Object registration accuracy refers to the error between the predicted and true positions of objects in a 3D scene.

Used to evaluate the quality of the MessyKitchens dataset.

Physical Plausibility

Physical plausibility refers to the adherence of inter-object interactions in a 3D scene to physical principles such as non-penetration and realistic contacts.

An important evaluation metric for the MOD algorithm.
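One simple way to quantify the non-penetration part of this is sketched below using trimesh; the function name and sampling count are illustrative, and the paper's own penetration/contact metrics may be defined differently.

```python
import trimesh

def penetration_fraction(mesh_a, mesh_b, n_samples=2000):
    """Fraction of points sampled on mesh_b's surface that fall inside mesh_a.

    Zero means the two (watertight) meshes do not interpenetrate; larger values
    indicate physically implausible overlap between the reconstructed objects.
    """
    points, _ = trimesh.sample.sample_surface(mesh_b, n_samples)
    inside = mesh_a.contains(points)   # boolean mask; requires mesh_a to be watertight
    return float(inside.mean())
```

Realistic contacts can be checked in a similar spirit, e.g. by testing whether the nearest-surface distance between annotated contact pairs stays below a small threshold.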

Depth Error

Depth error is the difference between the predicted and true depths in 3D reconstruction.

Used to evaluate the quality of the MessyKitchens dataset.
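One common formulation, assuming the registered ground-truth objects are rendered into a depth map and compared with the sensor depth over the valid pixels $\Omega$ (the paper's exact aggregation may differ):

$$ e_{\mathrm{depth}} = \frac{1}{|\Omega|} \sum_{u \in \Omega} \left| \hat{d}(u) - d(u) \right| $$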

Ablation Study

An ablation study is a method of evaluating the impact of removing or modifying certain parts of a model on its overall performance.

Used to verify the performance of the MOD algorithm in different scenarios.

Open Questions (Unanswered questions from this research)

  1. Existing 3D scene reconstruction methods still face challenges with inaccurate inter-object interactions in complex scenes. How to improve the physical plausibility of reconstruction without increasing computational costs is an unsolved problem.
  2. While the MessyKitchens dataset performs excellently in kitchen scenes, its generalizability to other types of scenes needs further validation. How to extend the dataset's applicability is a future research direction.
  3. The MOD algorithm encounters high computational costs when dealing with a large number of objects. How to optimize the algorithm's computational efficiency for real-time applications is worth exploring.
  4. Although the MOD algorithm performs excellently in object-level and scene-level reconstruction accuracy, errors may still occur in very complex scenes. How to further improve the algorithm's robustness is a research direction.
  5. Most existing 3D reconstruction methods rely on high-precision object scanning, limiting their direct application in other fields. How to achieve high-precision reconstruction without relying on high-precision scanning is an important research topic.

Applications

Immediate Applications

Robotic Grasping

The MOD algorithm can be used in robotic grasping tasks to help robots recognize and grasp objects in complex scenes. Its high-precision object reconstruction capability can improve grasping success rates.

Virtual Reality

In virtual reality applications, the MOD algorithm can be used to construct realistic 3D scenes, enhancing user immersion. Its physically plausible scene reconstruction capability can enhance user experience.

Industrial Inspection

The MOD algorithm can be used in industrial inspection to help identify and detect objects in complex scenes. Its high-precision object reconstruction capability can improve detection accuracy.

Long-term Vision

Autonomous Driving

In the field of autonomous driving, the MOD algorithm can be used to recognize and predict objects in complex traffic scenes, improving the safety and reliability of autonomous driving systems.

Smart Home

In smart home applications, the MOD algorithm can be used to recognize and control objects in the home environment, enhancing the intelligence of smart home systems.

Abstract

Monocular 3D scene reconstruction has recently seen significant progress. Powered by the modern neural architectures and large-scale data, recent methods achieve high performance in depth estimation from a single image. Meanwhile, reconstructing and decomposing common scenes into individual 3D objects remains a hard challenge due to the large variety of objects, frequent occlusions and complex object relations. Notably, beyond shape and pose estimation of individual objects, applications in robotics and animation require physically-plausible scene reconstruction where objects obey physical principles of non-penetration and realistic contacts. In this work we advance object-level scene reconstruction along two directions. First, we introduce MessyKitchens, a new dataset with real-world scenes featuring cluttered environments and providing high-fidelity object-level ground truth in terms of 3D object shapes, poses and accurate object contacts. Second, we build on the recent SAM 3D approach for single-object reconstruction and extend it with Multi-Object Decoder (MOD) for joint object-level scene reconstruction. To validate our contributions, we demonstrate MessyKitchens to significantly improve previous datasets in registration accuracy and inter-object penetration. We also compare our multi-object reconstruction approach on three datasets and demonstrate consistent and significant improvements of MOD over the state of the art. Our new benchmark, code and pre-trained models will become publicly available on our project website: https://messykitchens.github.io/.

cs.CV cs.AI cs.RO

References (20)

T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-Less Objects

Tomas Hodan, Pavel Haluza, Stepán Obdrzálek et al.

2017 573 citations

GraspClutter6D: A Large-Scale Real-World Dataset for Robust Perception and Grasping in Cluttered Scenes

Seunghyeok Back, Joosoon Lee, Kangmin Kim et al.

2025 5 citations

GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping

Haoshu Fang, Chenxi Wang, Minghao Gou et al.

2020 753 citations

SAM 3D: 3Dfy Anything in Images

S. Team, Xingyu Chen, Fu-Jen Chu et al.

2025 46 citations

PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers

Yuchen Lin, Chenguo Lin, Panwang Pan et al.

2025 39 citations

MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation

Zehuan Huang, Yuan-Chen Guo, Xingqiao An et al.

2024 56 citations

HouseCat6D - A Large-Scale Multi-Modal Category Level 6D Object Perception Dataset with Household Objects in Realistic Scenarios

Hyunjun Jung, Guangyao Zhai, Shun-cheng Wu et al.

2022 49 citations

TARGO: Benchmarking Target-driven Object Grasping under Occlusions

Yan Xia, Ran Ding, Ziyuan Qin et al.

2024 7 citations

Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

Lihe Yang, Bingyi Kang, Zilong Huang et al.

2024 1591 citations

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

Tianhe Ren, Shilong Liu, Ailing Zeng et al.

2024 973 citations

ShapeNet: An Information-Rich 3D Model Repository

Angel X. Chang, T. Funkhouser, L. Guibas et al.

2015 6253 citations

PhoCaL: A Multi-Modal Dataset for Category-Level Object Pose Estimation with Photometrically Challenging Objects

Pengyuan Wang, Hyunjun Jung, Yitong Li et al.

2022 57 citations

AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation

Zijie Wu, Chaohui Yu, Fan Wang et al.

2025 14 citations

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu et al.

2025 134 citations

DiffCAD: Weakly-Supervised Probabilistic CAD Model Retrieval and Alignment from an RGB Image

Daoyi Gao, Dávid Rozenberszki, Stefan Leutenegger et al.

2023 30 citations

ROCA: Robust CAD Model Retrieval and Alignment from a Single Image

Can Gümeli, Angela Dai, M. Nießner

2021 67 citations

MP6D: An RGB-D Dataset for Metal Parts’ 6D Pose Estimation

Long Chen, Han Yang, Chenrui Wu et al.

2022 27 citations

SciPy 1.0: fundamental algorithms for scientific computing in Python

Pauli Virtanen, R. Gommers, T. Oliphant et al.

2019 30401 citations

A Method for Registration of 3-D Shapes

P. Besl, Neil D. McKay

1992 20999 citations

DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis

Jiapeng Tang, Yinyu Nie, Lev Markhasin et al.

2023 117 citations