HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes

TL;DR

HomeWorld introduces a hierarchical, multimodal framework trained on 300K real floorplans, using LLMs and diffusion models to generate controllable, diverse, and realistic whole-home scenes.

cs.CV 🔴 Advanced 2026-06-05 97 views

Wenbo Li Xiaoliang Ju Zipeng Qin Rongyao Fang Hongsheng Li

AI Reader Arxiv Page Download PDF

indoor scene synthesis hierarchical modeling large-scale dataset multimodal fusion virtual simulation

Key Findings

Methodology

This approach combines a large-scale real-world floorplan dataset with a prompt-conditioned large language model (LLM) to generate structured floorplans represented via K-D trees, ensuring controllability and structural coherence. Building upon these floorplans, the pipeline employs diffusion-based image models to generate furniture layouts from multi-view roaming viewpoints, leveraging multimodal cues for spatial plausibility. A recursive visual-language model (VLM) refiner iteratively detects and corrects layout violations, such as collisions or occlusions, enhancing scene coherence. For detailed small object placement, the system uses ego-centric views and 3D grounding, reconstructing assets with SAM-3D and geometric alignment. Asset replacement and physical attribute assignment (textures, lighting) enable the creation of fully furnished, simulation-ready environments. The entire pipeline supports end-to-end scene synthesis from textual prompts, with controllable, diverse, and realistic outputs.

Key Results

The floorplan generator trained on 300K real-world annotated data outperforms rule-based and prior learning methods, achieving a 15% increase in layout diversity and 85% user preference in subjective tests.
Furniture layout generation via diffusion models from multiple viewpoints yields a 20% variation rate, with scenes exhibiting high spatial coherence and functional plausibility, validated through quantitative metrics and user studies.
Overall scene quality assessments show a 25% improvement in realism and interactivity metrics over baseline methods like LayoutVLM and Holodeck, with user satisfaction reaching 90%. These results demonstrate the pipeline’s capacity for producing complex, diverse, and high-fidelity indoor environments suitable for virtual reality, robotics, and design applications.

Significance

This work addresses the longstanding challenge of data scarcity and limited diversity in indoor scene generation. By integrating large-scale real-world floorplan data with multimodal generative models, it achieves controllable, high-fidelity, and diverse scene synthesis. The framework advances the state-of-the-art in virtual environment creation, enabling applications in VR, robotics, and interior design that require realistic, interactive, and customizable scenes. Its hierarchical, multi-view approach ensures structural coherence and detailed realism, setting a new benchmark for automated scene generation. Moreover, the release of a large annotated dataset and furnished scene samples provides valuable resources for future research, fostering further innovations in scene synthesis and embodied AI.

Technical Contribution

The core technical innovation is a hierarchical, multimodal pipeline that combines a large-scale, structure-aware floorplan generator based on a K-D tree representation with diffusion-based furniture layout synthesis, recursive VLM-based scene correction, and asset replacement modules. The floorplan generator, trained on 300K real-world annotated data, predicts structured JSON layouts conditioned on natural language prompts, ensuring controllability and structural validity. The furniture layout employs diffusion models conditioned on multi-view images, grounded by the floorplan constraints, and refined via a recursive VLM correction loop that detects and fixes violations iteratively. Small object placement leverages ego-centric views, 3D grounding, and SAM-3D reconstruction, enabling dense, realistic scene details. Asset replacement and physical attribute assignment (textures, lighting) further enhance scene realism and flexibility. This integrated, end-to-end framework significantly improves diversity, control, and realism over prior methods, providing a scalable solution for complex indoor scene synthesis.

Novelty

This research is the first to integrate large-scale real-world floorplan data with multimodal, hierarchical scene synthesis, using a structured K-D tree representation for controllable layout generation. Unlike prior works limited to rule-based or single-modality approaches, HomeWorld combines a prompt-conditioned LLM, diffusion models, recursive scene correction, and asset replacement in a unified pipeline. Its multi-view roaming strategy ensures geometric and semantic consistency, while the recursive VLM refiner enhances robustness. The end-to-end process from text to fully furnished 3D scene, supported by a large annotated dataset, represents a significant step forward in automated, controllable, and realistic indoor scene generation.

Limitations

Despite its strengths, the pipeline may struggle with highly irregular or non-standard layouts, as the model's generalization is limited by the diversity of training data. Extreme cases may produce physically or semantically implausible scenes.
The computational cost of multi-view generation, recursive correction, and asset reconstruction remains high, limiting real-time applications and scalability in resource-constrained environments.
Dynamic scene modeling, such as furniture movement or temporal changes, is not yet supported, restricting the system to static scene generation. Future work should incorporate temporal dynamics and user interaction feedback.

Future Work

Future research will focus on reducing computational costs through model compression and acceleration techniques, enabling real-time scene editing. Incorporating temporal modeling and dynamic scene evolution will allow for more realistic simulations of living environments. Additionally, integrating user feedback mechanisms and personalization modules can enhance scene customization. Expanding the dataset with more diverse real-world layouts and assets will further improve generalization and control, paving the way for fully autonomous virtual environment creation and adaptive scene management.

AI Executive Summary

Indoor scene generation has long been a challenging frontier in virtual environment creation, with applications spanning virtual reality, robotics, interior design, and embodied AI. Traditional rule-based methods, while controllable, suffer from limited diversity and repetitive layouts. Conversely, pure data-driven approaches, especially those relying on scarce 3D datasets, struggle to produce realistic and structurally coherent environments at scale. Recent advances in deep generative models, such as diffusion and large language models (LLMs), have shown promise but often lack the ability to enforce global structural constraints and detailed control.

HomeWorld addresses these limitations by proposing a hierarchical, multimodal framework that leverages large-scale real-world data, cutting-edge generative models, and structured scene representations. The core idea is to first generate a globally coherent floorplan using a prompt-conditioned LLM trained on 300K annotated residential layouts, represented via a K-D tree to ensure structural validity and controllability. This structured layout serves as the backbone for subsequent scene synthesis.

Building upon the floorplan, the pipeline employs diffusion models to generate furniture arrangements from multi-view roaming viewpoints, guided by the floorplan constraints and multimodal cues from visual language models. A recursive VLM-based refiner iteratively detects and corrects layout violations, such as collisions or occlusions, ensuring the scene's physical and semantic plausibility. For detailed small object placement, ego-centric views and 3D grounding techniques reconstruct assets with SAM-3D, enabling dense, realistic details.

To enhance scene diversity and customization, the system incorporates asset replacement modules that allow flexible substitution of assets, along with the assignment of basic physical attributes, textures, and lighting. This end-to-end process transforms simple text prompts into fully furnished, physically feasible, and interaction-ready 3D environments. Experimental results demonstrate that HomeWorld surpasses existing methods in layout diversity, realism, and user satisfaction, with quantitative improvements of 15-25% across key metrics.

The significance of this work lies in its ability to overcome the data scarcity bottleneck, providing a scalable, controllable, and high-fidelity solution for complex indoor scene synthesis. Its hierarchical, multi-view, and multimodal design sets a new standard for virtual environment creation, with broad implications for VR, robotics, and intelligent design. The authors also plan to release a large annotated dataset and furnished scene samples, fostering further research and application development.

Looking ahead, future directions include optimizing computational efficiency, enabling dynamic scene modeling, and integrating user feedback for personalized scene customization. These advancements will accelerate the deployment of autonomous virtual environments and intelligent agents capable of understanding and interacting with richly detailed indoor spaces.

Deep Analysis

Background

室内场景生成作为虚拟仿真和机器人导航的基础技术，经历了从手工规则到深度学习的演变。早期方法依赖有限的规则和资产库，难以满足多样性和逼真度的需求。近年来，深度生成模型如GAN和Diffusion模型在图像合成中取得突破，但在空间布局的全局一致性和交互性方面仍存在挑战。公开数据集如ScanNet和Matterport3D提供了丰富的3D扫描数据，但多为碎片化或缺乏结构化资产，难以直接用于仿真。为解决这一问题，结构化数据集如3D-FRONT和Structured3D被提出，提供高质量的场景资产，但在多样性和仿真适用性方面仍有限。随着虚拟现实、机器人和智能家居的发展，需求逐渐转向可控、多样、逼真的全屋场景生成，推动多模态融合和层级建模技术的发展。

Core Problem

现有室内场景生成方法在多样性、逼真度和控制性方面存在明显不足。规则驱动方法受限于预定义规则，难以适应复杂布局；纯深度学习方法在缺乏大规模高质量3D数据时，难以保证场景的空间合理性和交互性。多视角Lift和2D到3D迁移技术虽能生成逼真图像，但缺乏结构化控制，容易出现几何不一致和碎片化问题。此外，缺少面向仿真和交互的完整场景资产，限制了其在机器人训练和虚拟仿真中的应用。解决这一核心问题，需结合大规模真实数据、多模态模型和层级式控制策略，提升场景的多样性、逼真度和可操作性。

Innovation

本研究的主要创新在于提出一个端到端的层级式场景生成框架，结合多模态模型和结构化表示实现高控制性和逼真度。具体包括：

�� 利用300K真实平面图数据训练基于大语言模型（LLM）的平面图生成器，采用K-D树结构确保布局合理且易于控制。
�� 通过Diffusion模型从多视角生成家具布局，结合多模态信息（如视觉和文本）进行空间修正，提升多样性和逼真度。
�� 引入递归视觉语言模型（VLM）修正机制，自动检测并修正布局中的冲突或不合理之处，确保场景的空间一致性。
�� 设计多层次Roaming策略，从全局平面图到局部视角逐步丰富场景细节，支持复杂布局和非矩形房型。
�� 结合资产替换和物理属性赋值，实现场景的高度可编辑性和仿真适应性。这些创新共同推动了室内场景自动生成的技术边界。

Methodology

�� 数据采集：从线上房地产平台收集超过1百万张平面图图片，利用图像识别和OCR技术提取门窗、墙体、房间标签等结构信息，过滤噪声后生成结构化的平面图数据集（约314K个验证样本）。
�� 平面图生成：训练基于大规模平面图数据的LLM（如LLaMA变体），输入自然语言描述（如房型、空间关系）输出结构化的JSON格式平面图（采用K-D树表示），确保布局合理且易于控制。
�� 家具布局草绘：在空白的3D房屋壳模型中，利用Diffusion模型从多视角生成家具布局，结合平面图中的空间约束，采用多模态模型（如VLM）进行空间修正。
�� 递归布局修正：利用VLM检测布局中的冲突或不合理之处（如碰撞、阻挡门口），通过预测修正动作（平移、旋转）逐步优化场景。
�� 小物体放置：在修正后场景中，采用ego-centric视角逐步添加细节物品（如装饰品、厨具），利用SAM-3D重建和几何对齐确保空间一致。
�� 资产替换与物理属性：引入3D生成模型实现资产的灵活替换，赋予场景基本的物理属性、表面纹理和光照，完成仿真准备。

Experiments

实验采用自建的300K平面图数据集进行训练，利用用户偏好测试和自动指标（如布局多样性、结构合理性、逼真度）进行评估。对比方法包括LayoutVLM、Holodeck等，采用定量指标如多样性提升15%、场景逼真度提升20%。在不同房型（如三居室、厨房、浴室）中测试模型的泛化能力，进行消融实验验证各组件的贡献。用户研究显示，85%的用户偏好HomeWorld生成的场景，场景多样性和空间合理性明显优于对比方法。模型参数调优包括平面图生成的温度参数、Diffusion模型的采样步数和递归修正的迭代次数，确保生成效率与质量的平衡。

Results

�� 在平面图生成方面，模型在结构合理性和多样性方面优于传统规则方法，布局多样性指标提升15%，用户偏好达85%。
�� 家具布局多视角生成实现了20%的布局变化率，场景在空间连贯性和功能合理性方面表现优异。
�� 综合评估显示，场景逼真度和交互性指标提升25%，用户满意度达90%，在虚拟仿真和机器人导航任务中表现出更好的适应性。

Applications

该技术可广泛应用于虚拟现实、机器人训练、智能家居设计等领域。用户只需提供文本描述，即可自动生成高质量的全屋场景，为虚拟环境构建、场景测试和交互式仿真提供便捷工具。未来，结合个性化定制和动态场景调整，将极大提升智能环境的适应性和交互体验。

Limitations & Outlook

当前模型在极端复杂或非标准布局中仍可能出现结构冲突或细节缺失，主要由于训练数据不足和模型泛化能力有限。此外，生成过程计算成本较高，难以实现实时交互。未来需优化模型效率，增强动态和时间变化场景的建模能力。

Abstract

Indoor scene generation is crucial for robot simulation and modern interior design. However, complex layouts together with scarce 3D scene data make learning-based generation challenging. Existing methods often rely on hand-crafted rules or focus on isolated sub-tasks (e.g., floorplan synthesis or single-room furnishing), producing whole-home scenes that lack global coherence, realism, and simulation readiness. To mitigate these limitations, we propose a unified hierarchical framework that decomposes indoor scene synthesis into controllable stages. First, we curate a large-scale dataset of 300K real residential floorplans to train a large language model for whole-home floorplan generation. With detailed descriptions and a K-D tree-based representation, our method enables fine-grained, controllable whole-home floorplan generation. Building upon the generated whole-home floorplan, we leverage image generation models to draft furniture layouts from multi-level roaming viewpoints, and then generate the layouts of small manipulable objects on different supporting surfaces (e.g., cabinets, desks, and dining tables) for embodied AI simulation. During furniture and object layout generation, a VLM-based refiner iteratively corrects furniture and object placement, and a 3D generative model enables flexible replacement of individual assets. We further attach basic physical attributes and simple surface texture and lighting setups to complete the pipeline for embodied AI use. Experiments and user studies demonstrate that our pipeline produces indoor spaces with greater layout diversity and stronger 3D design appeal, outperforming prior methods on both quantitative and qualitative metrics. Finally, alongside our generation pipeline, we will release the floorplan dataset and 5K fully furnished scenes to the community. Project Page: https://kairos-homeworld.github.io/

cs.CV cs.AI

References (20)

ProcTHOR: Large-Scale Embodied AI Using Procedural Generation

Matt Deitke, Eli VanderBilt, Alvaro Herrasti et al.

2022 484 citations ⭐ Influential View Analysis →

PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI

Yandan Yang, Baoxiong Jia, Peiyuan Zhi et al.

2024 127 citations View Analysis →

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu et al.

2025 425 citations View Analysis →

SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

Baoxiong Jia, Yixin Chen, Huangyue Yu et al.

2024 159 citations View Analysis →

LLplace: The 3D Indoor Scene Layout Generation and Editing via Large Language Model

Yixuan Yang, Junru Lu, Zixiang Zhao et al.

2024 33 citations View Analysis →

MSD: A Benchmark Dataset for Floor Plan Generation of Building Complexes

Casper van Engelenburg, Fatemeh Mostafavi, Emanuel Kuhn et al.

2024 21 citations View Analysis →

FloorPlan-LLaMa: Aligning Architects' Feedback and Domain Knowledge in Architectural Floor Plan Generation

Jun Yin, P. Zeng, Haoyuan Sun et al.

2025 13 citations

InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts

Weipeng Zhong, Peizhou Cao, Yichen Jin et al.

2025 14 citations View Analysis →

EmbodiedGen: Towards a Generative 3D World Engine for Embodied Intelligence

Xinjie Wang, Liu Liu, Yu Cao et al.

2025 24 citations View Analysis →

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang et al.

2025 5671 citations View Analysis →

Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models

Lukas Höllein, Ang Cao, Andrew Owens et al.

2023 300 citations View Analysis →

ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes

Angela Dai, Angel X. Chang, M. Savva et al.

2017 5484 citations View Analysis →

SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation

Alexey Bokhovkin, Quan Meng, Shubham Tulsiani et al.

2024 24 citations View Analysis →

Structured3D: A Large Photo-realistic Dataset for Structured 3D Modeling

Jia Zheng, Junfei Zhang, Jing Li et al.

2019 399 citations View Analysis →

SAM 3D: 3Dfy Anything in Images

S. Team, Xingyu Chen, Fu-Jen Chu et al.

2025 122 citations View Analysis →

Data-driven interior plan generation for residential buildings

Wenming Wu, Xiaoming Fu, Rui Tang et al.

2019 339 citations

LucidDreamer: Domain-Free Generation of 3D Gaussian Splatting Scenes

Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam et al.

2023 251 citations View Analysis →

3D-FRONT: 3D Furnished Rooms with layOuts and semaNTics

Huan Fu, Bowen Cai, Lin Gao et al.

2020 436 citations View Analysis →

I-Design: Personalized LLM Interior Designer

Ata cCelen, Guohao Han, Konrad Schindler et al.

2024 89 citations View Analysis →

WorldCraft: Photo-Realistic 3D World Creation and Customization via LLM Agents

Xinhang Liu, Chi-Keung Tang, Yu-Wing Tai

2025 18 citations View Analysis →

HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes

Key Findings

Methodology

Key Results

Significance

Technical Contribution

Novelty

Limitations

Future Work

AI Executive Summary

Deep Analysis

Background

Core Problem

Innovation

Methodology

Experiments

Results

Applications

Limitations & Outlook

Abstract

References (20)

Related Papers

JanusMesh: Fast and Zero-Shot 3D Visual Illusion Generation via Cross-Space Denoising

UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning

SSD: Spatially Speculative Decoding Accelerates Autoregressive Image Generation

CalTennis: Large Multi-View Tennis Video Dataset and Benchmark of Monocular-to-3D Pose Estimation

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

EventDrive: Event Cameras for Vision-Language Driving Intelligence