Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models

TL;DR

Proposes SEIG, a staged framework that uses a pretrained vision-language model (VLM) to reconstruct a single image as an editable 3D Blender scene, recovering geometry, materials, layout, and lighting with higher fidelity than monolithic baselines.

cs.CV, Advanced, 2026-06-02

Guangzhao He, Rundong Luo, Wei-Chiu Ma, Hadar Averbuch-Elor


inverse graphics, vision-language models, 3D reconstruction, program synthesis, staged optimization

Key Findings

Methodology

SEIG is a multi-stage, progressive framework built on a pretrained VLM. Starting from a single input image, it constructs a coarse scene scaffold via scene graph decomposition, then iteratively refines individual scene factors (geometry, materials, layout, and lighting) through dedicated generator-verifier cycles. In each stage the generator writes a Blender script that modifies the scene; the scene is then rendered, and the verifier's evaluation of that render guides the next refinement. Isolating the factors in this way reduces the complexity of the inverse graphics problem, because the VLM addresses one aspect of the scene at a time. The framework does not rely on specialized 2D or 3D models, differentiable rendering, or multi-view supervision. Evaluations on synthetic and real-world data, including the NeRF and VoxHammer datasets, show that staged reconstruction outperforms monolithic approaches across pixel-level, perceptual, and semantic metrics, with PSNR reaching 13.58 and semantic similarity scores exceeding 0.84.
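The control flow described above can be sketched as a nested loop: an outer pass over scene factors and an inner generator-verifier cycle per factor. This is a minimal illustration, not the authors' implementation. The stage order follows the order in which the summary lists the factors, and `SceneState`, `run_staged_reconstruction`, `max_rounds`, and the toy generator and verifier are all hypothetical names standing in for VLM calls and Blender renders.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Stage order is an assumption taken from the order the summary lists the factors.
STAGES = ["geometry", "materials", "layout", "lighting"]


@dataclass
class SceneState:
    # Script fragments accepted so far, grouped by stage (hypothetical layout).
    scripts: Dict[str, List[str]] = field(default_factory=dict)


def run_staged_reconstruction(
    generate: Callable[[str, SceneState, str], str],
    verify: Callable[[str, SceneState], Tuple[bool, str]],
    max_rounds: int = 3,
) -> SceneState:
    """Refine one scene factor at a time with a generator-verifier cycle."""
    state = SceneState()
    for stage in STAGES:
        feedback = ""
        for _ in range(max_rounds):
            # Generator: proposes a script edit for this stage only.
            script = generate(stage, state, feedback)
            state.scripts.setdefault(stage, []).append(script)
            # Verifier: judges the updated scene and returns textual feedback.
            accepted, feedback = verify(stage, state)
            if accepted:
                break
    return state


# Toy stand-ins for the VLM generator and the render-based verifier.
def toy_generate(stage: str, state: SceneState, feedback: str) -> str:
    n = len(state.scripts.get(stage, []))
    return f"# {stage} edit {n + 1}; feedback: {feedback or 'none'}"


def toy_verify(stage: str, state: SceneState) -> Tuple[bool, str]:
    ok = len(state.scripts[stage]) >= 2  # accept on the second round
    return ok, ("" if ok else "increase fidelity")


result = run_staged_reconstruction(toy_generate, toy_verify)
for stage in STAGES:
    print(stage, len(result.scripts[stage]))
```

In a real system the generator and verifier would both be VLM calls, with the verifier comparing a Blender render against the input image. The point of the sketch is only that each stage sees feedback about its own factor before the next stage begins.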

Key Results

On the NeRF synthetic dataset, SEIG achieved a PSNR of 13.58, LPIPS of 0.3433, DreamSim of 0.6293, and semantic similarity scores (DINO and CLIP) above 0.84. Its PSNR exceeds that of the baselines VIGA full (12.48) and VLM-only (11.52) by 1.10 and 2.06, respectively, indicating better pixel-level agreement with the reference images.
Qualitative assessments show that SEIG produces structured, editable Blender scenes that closely match reference images in geometry, surface appearance, and scene layout. The reconstructed scenes support multi-view rendering, relighting, and object editing, confirming their practical utility.
Ablation studies reveal that the staged approach, combined with the verifier mechanism, is critical for achieving high-quality reconstructions, especially in complex scenes. The method demonstrates robustness across diverse scenarios, including occlusions and cluttered environments.
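For readers unfamiliar with the pixel-level metric quoted above, PSNR is defined as 10·log10(MAX² / MSE), where MSE is the mean squared error between the reference and the rendered image. The sketch below is a plain-Python illustration of that standard formula. The function name `psnr` and the assumption that pixel values lie in [0, 1] are ours; the summary does not state the paper's exact evaluation setup.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], rendered: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(reference) != len(rendered) or len(reference) == 0:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel is off by 0.1, so MSE is 0.01 and PSNR is about 20 dB.
reference = [0.0, 0.5, 1.0]
rendered = [0.1, 0.6, 0.9]
value = psnr(reference, rendered)
print(round(value, 2))
```

Under the same [0, 1] assumption, a PSNR of 13.58 corresponds to an MSE of about 0.044, which gives a sense of how far single-image program reconstructions still are from pixel-exact matches.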

Significance

This work marks a significant advancement in single-image 3D scene reconstruction, bridging the gap between semantic understanding and geometric fidelity without requiring multi-view data or specialized models. By harnessing the semantic and reasoning capabilities of pretrained VLMs within a structured, staged pipeline, SEIG addresses longstanding challenges in inverse graphics, such as disentangling scene factors and ensuring scene editability. Its ability to generate high-quality, physically grounded, and editable 3D scenes from a single image opens new avenues for applications in virtual reality, gaming, film production, and robotics. Moreover, it demonstrates the untapped potential of large-scale pretrained models to serve as foundational tools for complex 3D reasoning tasks, reducing reliance on task-specific training and specialized datasets.

Technical Contribution

The core technical innovation of this paper lies in integrating a staged, multi-round generator-verifier pipeline with a pretrained VLM to perform executable inverse graphics. Unlike prior methods that attempt end-to-end optimization, SEIG decomposes the reconstruction into manageable, verifiable subproblems—geometry, materials, layout, and lighting—each addressed sequentially. This approach leverages scene graph decomposition, multiple scene sampling during initialization, and stage-specific verification to ensure each factor is accurately reconstructed. The use of Blender's Python API allows the generation of fully editable scene scripts, enabling downstream editing and physical simulation. The framework’s modular design facilitates interpretability, robustness, and scalability, setting a new standard for single-image 3D scene reconstruction without specialized geometric or rendering models.
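The "multiple scene sampling during initialization" step can be read as a best-of-N selection: several candidate scaffolds are proposed and the one a scoring function rates highest is kept. The summary does not say how candidates are ranked, so the following is only a sketch under that assumption. `select_best_candidate`, the candidate dictionaries, and the object-count score are all hypothetical.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def select_best_candidate(candidates: List[T],
                          score: Callable[[T], float]) -> Tuple[T, float]:
    """Return the highest-scoring candidate and its score (first wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scored = [(score(c), i) for i, c in enumerate(candidates)]
    best_score, best_index = max(scored, key=lambda pair: pair[0])
    return candidates[best_index], best_score


# Toy scaffolds: each sampled scene graph proposes a number of objects.
candidates = [
    {"name": "A", "objects": 2},
    {"name": "B", "objects": 5},
    {"name": "C", "objects": 4},
]
# Stand-in score: closeness to a target object count of 4.
# A real verifier would instead compare a render against the input image.
best, best_score = select_best_candidate(
    candidates, lambda c: -abs(c["objects"] - 4)
)
print(best["name"], best_score)
```

Sampling several scaffolds before refinement spends extra generation calls up front in exchange for a better starting point for the later stages.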

Novelty

This work applies a pretrained VLM directly to executable inverse graphics from a single image, using a staged, factor-wise refinement process. Unlike NeRF and other neural scene representations, which encode scenes in latent neural spaces, SEIG produces explicit, editable Blender scripts. Its staged approach, combined with a generator-verifier loop at each step, manages the complexity of the scene factors and leads to higher-fidelity reconstructions. The method's independence from specialized 2D/3D models and multi-view supervision distinguishes it from prior work and is a step toward general-purpose, scalable inverse graphics.
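To make "explicit, editable Blender scripts" concrete, the sketch below turns a small scene description into Blender Python source text. It is an illustration rather than SEIG's actual output format. The scene dictionary layout, `PRIMITIVE_OPS`, and `emit_blender_script` are hypothetical, while the emitted calls (`bpy.ops.mesh.primitive_cube_add`, `bpy.ops.mesh.primitive_uv_sphere_add`, `bpy.context.active_object`) are existing Blender API entry points. The generator itself runs without Blender; only the emitted script needs to be executed inside Blender.

```python
from typing import Dict, List

# Map a hypothetical shape label to a Blender operator call template.
PRIMITIVE_OPS = {
    "cube": "bpy.ops.mesh.primitive_cube_add(size={size}, location={location})",
    "sphere": "bpy.ops.mesh.primitive_uv_sphere_add(radius={size}, location={location})",
}


def emit_blender_script(objects: List[Dict]) -> str:
    """Emit Blender Python source that builds the described primitives."""
    lines = ["import bpy", ""]
    for obj in objects:
        template = PRIMITIVE_OPS[obj["shape"]]
        lines.append(
            template.format(size=obj["size"], location=tuple(obj["location"]))
        )
        # The newly added primitive becomes the active object; name it.
        lines.append(f"bpy.context.active_object.name = {obj['name']!r}")
    return "\n".join(lines) + "\n"


scene = [
    {"name": "table", "shape": "cube", "size": 2.0, "location": [0.0, 0.0, 1.0]},
    {"name": "ball", "shape": "sphere", "size": 0.5, "location": [0.0, 0.0, 2.5]},
]
script = emit_blender_script(scene)
print(script)
```

Because the scene lives in source text rather than network weights, an edit such as moving the ball is a one-line change to its `location` followed by re-running the script.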

Limitations

Despite strong performance, SEIG struggles with scenes featuring extreme complexity or occlusion, where the VLM’s spatial reasoning is insufficient, leading to inaccuracies in geometry and material details. The staged process, while effective, incurs high computational costs, limiting real-time applicability. The reliance on input image quality means that noisy or occluded images can degrade reconstruction fidelity. Additionally, the current framework assumes static scenes and does not handle dynamic or multi-view inputs, which are common in real-world scenarios. Addressing these limitations requires further research into more efficient algorithms, multi-modal data integration, and dynamic scene modeling.

Future Work

Future directions include integrating depth and semantic cues to improve scene understanding, optimizing the pipeline for real-time performance, and extending the framework to dynamic scenes with temporal coherence. Exploring reinforcement learning strategies for adaptive stage-wise refinement could further enhance robustness. Additionally, combining the approach with generative adversarial networks or diffusion models may enrich scene details and realism. Developing multi-view extensions and real-world dataset annotations will be crucial for industrial deployment, enabling applications in autonomous navigation, AR/VR content creation, and intelligent robotics.

AI Executive Summary

Reconstructing detailed 3D scenes from a single image has long been a central challenge in computer vision and graphics. Traditional methods often require multiple views, extensive annotations, or specialized models, making the process labor-intensive and limited in generality. Recent advances in large-scale pretrained vision-language models (VLMs) have demonstrated remarkable capabilities in semantic understanding, code generation, and instruction following, inspiring researchers to explore their potential in 3D scene reasoning.

This paper introduces SEIG (Staged Executable Inverse Graphics), a novel framework that leverages pretrained VLMs to perform high-fidelity, editable 3D scene reconstruction from a single image. Unlike prior approaches that attempt to optimize all scene factors simultaneously, SEIG adopts a staged, factor-wise refinement strategy. It decomposes the scene into geometry, materials, layout, and lighting, handling each aspect sequentially through dedicated generator-verifier loops. This design simplifies the complex inverse problem, reduces ambiguities, and enhances interpretability.

The core idea is inspired by how human artists build scenes: starting with a rough sketch, then progressively refining details. The framework begins by generating a coarse scene scaffold based on scene graph decomposition, then iteratively refines each factor—adjusting object shapes, assigning realistic materials, arranging scene layout, and tuning lighting parameters. Each stage employs a generator to produce Blender scripts and a verifier to evaluate the rendered scene against the input image, guiding subsequent refinements. This process results in a fully editable Blender scene that faithfully reproduces the original image’s geometry, appearance, and lighting conditions.

Experiments on synthetic data such as the NeRF synthetic dataset and on real-world images show that SEIG outperforms existing monolithic inverse graphics systems, achieving better scores across pixel-level (PSNR, SSIM), perceptual (LPIPS, DreamSim), and semantic (DINO, CLIP) metrics. The staged approach improves reconstruction fidelity, robustness, and scene consistency. The generated scenes also support downstream tasks such as multi-view rendering, relighting, and object editing, confirming their practical utility.

This work represents a major step toward autonomous, general-purpose scene understanding from minimal input. By harnessing the semantic reasoning power of VLMs within a structured, multi-stage pipeline, SEIG offers a scalable, interpretable, and highly effective solution to the longstanding challenge of inverse graphics. Its ability to produce physically grounded, editable 3D scenes from a single image opens new horizons for virtual content creation, immersive experiences, and intelligent scene manipulation. Future research will focus on extending this approach to dynamic scenes, multi-view inputs, and real-time applications, further bridging the gap between semantic understanding and geometric reconstruction in AI-driven graphics.

Deep Dive

Abstract

Inverse graphics is a longstanding and highly underconstrained problem that seeks to reconstruct images as editable 3D scenes which can be rendered, relit, and manipulated. In this work, we investigate whether pretrained vision-language models (VLMs) can perform executable inverse graphics directly from a single image by reconstructing a scene as an editable Blender program, without relying on specialized 2D or 3D foundation models, differentiable rendering, or multi-view supervision. We introduce Staged Executable Inverse Graphics (SEIG), an agentic framework that reconstructs a 3D scene from a single image by progressively refining scene factors including geometry, materials, composition, and lighting directly in executable Blender code space. We evaluate our framework across diverse scenes using a range of reconstruction metrics spanning pixel-level, perceptual, and semantic fidelity. Our experiments show that staged reconstruction substantially improves reconstruction fidelity, highlighting the importance of task decomposition for executable inverse graphics with general-purpose VLMs. Finally, we showcase various downstream applications enabled by the reconstructed editable Blender scenes.


