Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models
Proposes SEIG, a staged framework that uses a pretrained vision-language model (VLM) to reconstruct an editable Blender scene from a single image, refining geometry, materials, layout, and lighting in turn.
Key Findings
Methodology
SEIG takes a multi-stage, progressive approach built on a pretrained VLM. Starting from a single input image, it first constructs a coarse scene scaffold via scene graph decomposition, then iteratively refines four scene factors (geometry, materials, layout, and lighting) through dedicated generator-verifier cycles. In each stage the generator emits a Blender script that modifies the scene; the scene is then rendered, and the evaluation of that render guides the next refinement. Isolating one factor at a time reduces the complexity of the inverse graphics problem, because the VLM only has to reason about a single aspect of the scene in any given stage. The framework does not rely on specialized 2D or 3D models, differentiable rendering, or multi-view supervision. Evaluations on synthetic and real-world data, including NeRF and VoxHammer, show staged reconstruction outperforming monolithic approaches across pixel-level, perceptual, and semantic metrics; on the NeRF synthetic dataset, PSNR reaches 13.58 and the DINO and CLIP semantic similarity scores exceed 0.84.
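The control flow of the generator-verifier cycles can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `max_rounds` budget, the feedback strings, and the rule of always carrying the latest candidate forward are all hypothetical, and the toy generator and verifier stand in for VLM calls and Blender renders.

```python
from typing import Callable, List, Tuple

# Stage order follows the text: geometry, materials, layout, lighting.
STAGES = ["geometry", "materials", "layout", "lighting"]

# Hypothetical signatures (assumptions, not from the paper):
#   generator(stage, current_script, feedback) -> new script text
#   verifier(stage, candidate_script) -> (accepted, feedback)
Generator = Callable[[str, str, str], str]
Verifier = Callable[[str, str], Tuple[bool, str]]


def run_staged_refinement(scaffold: str, generate: Generator,
                          verify: Verifier, max_rounds: int = 3):
    """Refine one scene factor at a time, looping until the verifier accepts
    the candidate or the per-stage round budget is used up."""
    script = scaffold
    log: List[Tuple[str, int, bool]] = []
    for stage in STAGES:
        feedback = ""
        for round_index in range(max_rounds):
            candidate = generate(stage, script, feedback)
            accepted, feedback = verify(stage, candidate)
            log.append((stage, round_index, accepted))
            script = candidate  # assumption: always keep the latest candidate
            if accepted:
                break
    return script, log


# Toy stand-ins so the sketch runs without a VLM or Blender.
def toy_generate(stage: str, script: str, feedback: str) -> str:
    note = feedback or "initial"
    return script + f"# {stage} pass ({note})\n"


def toy_verify(stage: str, script: str) -> Tuple[bool, str]:
    # Accept a stage once it has been attempted twice.
    attempts = script.count(f"# {stage} pass")
    return attempts >= 2, f"retry {attempts}"


final_script, log = run_staged_refinement("# scaffold\n", toy_generate, toy_verify)
print(final_script)
```

In a real system the verifier would render the candidate scene and compare the render with the input image; here it only counts attempts, so each stage is accepted on its second round.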
Key Results
- On the NeRF synthetic dataset, SEIG achieved a PSNR of 13.58, LPIPS of 0.3433, DreamSim of 0.6293, and semantic similarity scores (DINO and CLIP) above 0.84. Its PSNR exceeds that of the baselines VIGA full (12.48) and VLM-only (11.52) by 1.10 and 2.06 respectively, indicating closer pixel-level agreement with the reference images.
- Qualitative assessments show that SEIG produces structured, editable Blender scenes that closely match reference images in geometry, surface appearance, and scene layout. The reconstructed scenes support multi-view rendering, relighting, and object editing, confirming their practical utility.
- Ablation studies reveal that the staged approach, combined with the verifier mechanism, is critical for achieving high-quality reconstructions, especially in complex scenes. The method demonstrates robustness across diverse scenarios, including occlusions and cluttered environments.
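To put the PSNR figures in context, the sketch below uses the standard definition, PSNR = 10 log10(MAX^2 / MSE), to compute PSNR for a flat list of pixel values and to convert a PSNR value back into a root-mean-square error. The helper names and the assumption that pixel values lie in [0, 1] are illustrative; the summary does not state how the authors normalize images.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], rendered: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def rmse_from_psnr(psnr_db: float, max_value: float = 1.0) -> float:
    """Invert the PSNR definition to recover the root-mean-square error."""
    return max_value / (10 ** (psnr_db / 20.0))


# RMSE implied by the reported PSNR values, as a fraction of the intensity range.
for name, value in [("SEIG", 13.58), ("VIGA full", 12.48), ("VLM-only", 11.52)]:
    print(f"{name}: PSNR {value} -> RMSE {rmse_from_psnr(value):.3f}")
```

Under this definition, a PSNR of 13.58 corresponds to an RMSE of roughly 0.21 of the full intensity range, and the 1.10 gap over VIGA full corresponds to a mean squared error that is roughly 22% lower.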
Significance
This work advances single-image 3D scene reconstruction by connecting semantic understanding to geometric reconstruction without multi-view data or specialized models. By using the semantic and reasoning capabilities of a pretrained VLM within a structured, staged pipeline, SEIG addresses two longstanding challenges in inverse graphics: disentangling scene factors and keeping the result editable. Producing editable, physically grounded 3D scenes from a single image is relevant to virtual reality, gaming, film production, and robotics. The results also suggest that large pretrained models can serve as general tools for complex 3D reasoning tasks, reducing reliance on task-specific training and specialized datasets.
Technical Contribution
The core technical contribution is a staged, multi-round generator-verifier pipeline built around a pretrained VLM for executable inverse graphics. Unlike prior methods that attempt to optimize all scene factors at once, SEIG decomposes reconstruction into verifiable subproblems (geometry, materials, layout, and lighting) and addresses them sequentially. It combines scene graph decomposition, sampling of multiple candidate scenes during initialization, and stage-specific verification of each factor. Because the output is expressed through Blender's Python API, the result is a fully editable scene script that supports downstream editing and physical simulation. The modular design is intended to aid interpretability, robustness, and scalability, and it requires no specialized geometric or rendering models.
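The "multiple scene sampling during initialization" step can be read as a best-of-N selection over candidate scaffolds. The sketch below illustrates that reading only: the summary does not say how candidates are compared, so the scoring function, the tie-breaking rule, and all names here are assumptions.

```python
from typing import Callable, Tuple


def select_best_scaffold(sample_scaffold: Callable[[int], str],
                         score_scaffold: Callable[[str], float],
                         num_samples: int = 4) -> Tuple[str, float]:
    """Draw several candidate scaffold scripts and keep the highest scoring one.
    Ties go to the earliest sample."""
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    candidates = [sample_scaffold(i) for i in range(num_samples)]
    scored = [(score_scaffold(c), i, c) for i, c in enumerate(candidates)]
    best_score, _, best = max(scored, key=lambda item: (item[0], -item[1]))
    return best, best_score


# Toy stand-ins: each sample proposes a different object count, and the
# hypothetical scorer prefers scaffolds with exactly three objects.
def toy_sample(seed: int) -> str:
    return f"# scaffold with {seed + 1} objects\n"


def toy_score(script: str) -> float:
    object_count = int(script.split()[3])
    return -abs(object_count - 3)


best, best_score = select_best_scaffold(toy_sample, toy_score)
print(best, best_score)
```

In practice the score would presumably come from comparing a render of each candidate against the input image; the toy scorer only shows where that comparison plugs in.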
Novelty
The work applies a pretrained VLM directly to executable inverse graphics from a single image, using a staged, factor-wise refinement process. Unlike NeRF and other neural scene representations, which encode scenes in latent neural spaces, SEIG produces explicit, editable Blender scripts. A generator-verifier loop at each stage keeps the complexity of the individual scene factors manageable, which the reported results associate with higher-fidelity reconstructions. Its independence from specialized 2D/3D models and multi-view supervision distinguishes it from prior work and moves toward general-purpose inverse graphics.
Limitations
- SEIG struggles with scenes featuring extreme complexity or occlusion, where the VLM's spatial reasoning is insufficient, leading to inaccuracies in geometry and material details.
- The staged process, while effective, incurs high computational costs, limiting real-time applicability.
- The method relies on input image quality: noisy or occluded images can degrade reconstruction fidelity.
- The current framework assumes static scenes and does not handle dynamic or multi-view inputs, which are common in real-world scenarios.
- Addressing these limitations requires further research into more efficient algorithms, multi-modal data integration, and dynamic scene modeling.
Future Work
Future directions include integrating depth and semantic cues to improve scene understanding, optimizing the pipeline for real-time performance, and extending the framework to dynamic scenes with temporal coherence. Exploring reinforcement learning strategies for adaptive stage-wise refinement could further enhance robustness. Additionally, combining the approach with generative adversarial networks or diffusion models may enrich scene details and realism. Developing multi-view extensions and real-world dataset annotations will be crucial for industrial deployment, enabling applications in autonomous navigation, AR/VR content creation, and intelligent robotics.
AI Executive Summary
Reconstructing detailed 3D scenes from a single image has long been a central challenge in computer vision and graphics. Traditional methods often require multiple views, extensive annotations, or specialized models, making the process labor-intensive and limited in generality. Recent advances in large-scale pretrained vision-language models (VLMs) have demonstrated remarkable capabilities in semantic understanding, code generation, and instruction following, inspiring researchers to explore their potential in 3D scene reasoning.
This paper introduces SEIG (Staged Executable Inverse Graphics), a novel framework that leverages pretrained VLMs to perform high-fidelity, editable 3D scene reconstruction from a single image. Unlike prior approaches that attempt to optimize all scene factors simultaneously, SEIG adopts a staged, factor-wise refinement strategy. It decomposes the scene into geometry, materials, layout, and lighting, handling each aspect sequentially through dedicated generator-verifier loops. This design simplifies the complex inverse problem, reduces ambiguities, and enhances interpretability.
The core idea is inspired by how human artists build scenes: starting with a rough sketch, then progressively refining details. The framework begins by generating a coarse scene scaffold based on scene graph decomposition, then iteratively refines each factor—adjusting object shapes, assigning realistic materials, arranging scene layout, and tuning lighting parameters. Each stage employs a generator to produce Blender scripts and a verifier to evaluate the rendered scene against the input image, guiding subsequent refinements. This process results in a fully editable Blender scene that faithfully reproduces the original image’s geometry, appearance, and lighting conditions.
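As a concrete illustration of the coarse scaffold step, the sketch below turns a tiny scene graph into the text of a Blender script that places one primitive per object. The scene graph schema (`name`, `primitive`, `location`, `scale`) is a hypothetical format chosen for this example, not the paper's representation. The emitted script uses standard `bpy.ops.mesh.primitive_*_add` operators and only runs inside Blender, while the emitter itself is plain Python.

```python
from typing import Dict, List

# Maps primitive names in the hypothetical scene graph to Blender operators.
PRIMITIVE_OPS = {
    "cube": "bpy.ops.mesh.primitive_cube_add",
    "sphere": "bpy.ops.mesh.primitive_uv_sphere_add",
    "cylinder": "bpy.ops.mesh.primitive_cylinder_add",
}


def scaffold_script(scene_graph: List[Dict]) -> str:
    """Emit Blender Python source that adds one primitive per scene graph node."""
    lines = ["import bpy", ""]
    for node in scene_graph:
        op = PRIMITIVE_OPS[node["primitive"]]
        location = tuple(float(v) for v in node.get("location", (0.0, 0.0, 0.0)))
        scale = tuple(float(v) for v in node.get("scale", (1.0, 1.0, 1.0)))
        lines.append(f"{op}(location={location!r}, scale={scale!r})")
        # A newly added primitive becomes the active object, so it can be named here.
        lines.append(f"bpy.context.active_object.name = {node['name']!r}")
    return "\n".join(lines) + "\n"


scene_graph = [
    {"name": "table", "primitive": "cube", "location": (0, 0, 0.5), "scale": (1, 1, 0.5)},
    {"name": "ball", "primitive": "sphere", "location": (0, 0, 1.25), "scale": (0.25, 0.25, 0.25)},
]
script = scaffold_script(scene_graph)
print(script)
```

Later stages would then edit this script (for example, replacing a cube with more detailed geometry or attaching materials), which is what keeps the final scene editable.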
Extensive experiments on the NeRF synthetic dataset and on real-world images demonstrate that SEIG outperforms existing monolithic inverse graphics systems, achieving better scores across pixel-level (PSNR, SSIM), perceptual (LPIPS, DreamSim), and semantic (DINO, CLIP) metrics. The staged approach significantly improves reconstruction fidelity, robustness, and scene consistency. Moreover, the generated scenes support downstream tasks such as multi-view rendering, relighting, and object editing, confirming their practical utility.
This work represents a major step toward autonomous, general-purpose scene understanding from minimal input. By harnessing the semantic reasoning power of VLMs within a structured, multi-stage pipeline, SEIG offers a scalable, interpretable, and highly effective solution to the longstanding challenge of inverse graphics. Its ability to produce physically grounded, editable 3D scenes from a single image opens new horizons for virtual content creation, immersive experiences, and intelligent scene manipulation. Future research will focus on extending this approach to dynamic scenes, multi-view inputs, and real-time applications, further bridging the gap between semantic understanding and geometric reconstruction in AI-driven graphics.
Deep Dive
Abstract
Inverse graphics is a longstanding and highly underconstrained problem that seeks to reconstruct images as editable 3D scenes which can be rendered, relit, and manipulated. In this work, we investigate whether pretrained vision-language models (VLMs) can perform executable inverse graphics directly from a single image by reconstructing a scene as an editable Blender program, without relying on specialized 2D or 3D foundation models, differentiable rendering, or multi-view supervision. We introduce Staged Executable Inverse Graphics (SEIG), an agentic framework that reconstructs a 3D scene from a single image by progressively refining scene factors including geometry, materials, composition, and lighting directly in executable Blender code space. We evaluate our framework across diverse scenes using a range of reconstruction metrics spanning pixel-level, perceptual, and semantic fidelity. Our experiments show that staged reconstruction substantially improves reconstruction fidelity, highlighting the importance of task decomposition for executable inverse graphics with general-purpose VLMs. Finally, we showcase various downstream applications enabled by the reconstructed editable Blender scenes.
References (20)
Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning
Shaofeng Yin, Jiaxin Ge, Z. Wang et al.
Non-rigid Point Cloud Registration with Neural Deformation Pyramid
Yang Li, Tatsuya Harada
MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds
Bingquan Dai, L. Luo, Qihong Tang et al.
SAM 3D: 3Dfy Anything in Images
SAM 3D Team, Xingyu Chen, Fu-Jen Chu et al.
Volumetric Disentanglement for 3D Scene Manipulation
Sagie Benaim, Frederik Warburg, Peter Ebert Christensen et al.
VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space
Lin Li, Zehuan Huang, Hao-li Feng et al.
NeRF: Representing scenes as neural radiance fields for view synthesis
B. Mildenhall, P. Srinivasan, M. Tancik et al.
Differentiable Blocks World: Qualitative 3D Decomposition by Rendering Primitives
Tom Monnier, J. Austin, Angjoo Kanazawa et al.
IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering
Parker Liu, Chenxin Li, Zhengxin Li et al.
Machine Perception of Three-Dimensional Solids
L. Roberts
The Scene Language: Representing Scenes with Programs, Words, and Embeddings
Yunzhi Zhang, Zizhang Li, Matt Zhou et al.
Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model
Long Le, Jason Xie, William Liang et al.
3DAxiesPrompts: Unleashing the 3D Spatial Task Capabilities of GPT-4V
Dingning Liu, Xiaomeng Dong, Renrui Zhang et al.
Learning Object-Compositional Neural Radiance Field for Editable Scene Rendering
Bangbang Yang, Yinda Zhang, Yinghao Xu et al.
Deep 3D Capture: Geometry and Reflectance From Sparse Multi-View Images
Sai Bi, Zexiang Xu, Kalyan Sunkavalli et al.
CSGNet: Neural Shape Parser for Constructive Solid Geometry
Gopal Sharma, Rishabh Goyal, Difan Liu et al.
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy et al.
Extracting Triangular 3D Models, Materials, and Lighting From Images
Jacob Munkberg, J. Hasselgren, Tianchang Shen et al.
GS-IR: 3D Gaussian Splatting for Inverse Rendering
Zhihao Liang, Qi Zhang, Yingfa Feng et al.
Learning to reconstruct shape and spatially-varying reflectance from a single image
Zhengqin Li, Zexiang Xu, R. Ramamoorthi et al.