3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding
3DCity-LLM enhances 3D city-scale perception with a coarse-to-fine feature encoding strategy, leveraging a 1.2M-sample dataset.
Key Findings
Methodology
3DCity-LLM employs a coarse-to-fine feature encoding strategy with three parallel branches for target object, inter-object relationship, and global scene. It is trained on the 3DCity-LLM-1.2M dataset, which includes approximately 1.2 million high-quality samples across seven task categories, from fine-grained object analysis to multi-faceted scene planning. A multi-dimensional protocol based on text-similarity metrics and LLM-based semantic assessment ensures accurate evaluations.
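The three-branch design can be sketched as follows. This is a minimal illustration, not the paper's architecture: the branch input dimensions, fusion by concatenation, and the final projection into a shared embedding space are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_linear(in_dim, out_dim):
    """Toy linear projection standing in for one encoder branch."""
    W = rng.standard_normal((in_dim, out_dim)) * 0.02
    return lambda x: x @ W

# Three parallel branches (all dimensions are illustrative assumptions).
object_branch = make_linear(256, 128)    # target-object geometry features
relation_branch = make_linear(64, 128)   # inter-object relationship features
scene_branch = make_linear(512, 128)     # global scene features
project = make_linear(3 * 128, 128)      # maps fused features into one shared space

def encode_city(obj_feat, rel_feat, scene_feat):
    """Fuse the three branch outputs (concatenation + projection)."""
    fused = np.concatenate(
        [object_branch(obj_feat), relation_branch(rel_feat), scene_branch(scene_feat)],
        axis=-1,
    )
    return project(fused)

emb = encode_city(rng.standard_normal(256),
                  rng.standard_normal(64),
                  rng.standard_normal(512))
print(emb.shape)  # (128,)
```

The key point the sketch conveys is that the branches run in parallel and only meet in a shared embedding space that the language model can consume.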
Key Results
- 3DCity-LLM significantly outperforms existing state-of-the-art methods on two benchmarks, with gains of 0.50 to 8.40 in BLEU-4, 1.07 to 10.69 in METEOR, and 0.16 to 1.51 in the reliability metric.
- Across seven task categories, 3DCity-LLM excels in complex tasks such as object analysis, relationship computation, and scene planning, demonstrating robust perception and understanding in complex urban environments.
- Ablation studies confirm the effectiveness of the coarse-to-fine feature encoding strategy, particularly in enhancing spatial reasoning capabilities when handling large-scale urban scenes.
Significance
The introduction of 3DCity-LLM opens new avenues for applying multimodal large language models in 3D city-scale environments. It not only demonstrates potential in spatial reasoning and urban intelligence within academia but also provides technical support for industry applications such as urban planning and intelligent transportation. The 3DCity-LLM-1.2M dataset fills a gap in existing datasets by providing rich training resources with explicit 3D spatial information.
Technical Contribution
3DCity-LLM's technical contributions lie in its innovative coarse-to-fine feature encoding strategy, which integrates object-level geometry, inter-object relationship topology, and global scene semantics into a shared embedding space. Additionally, the study introduces a multi-dimensional evaluation protocol that combines traditional text-similarity metrics with LLM-based semantic assessments, ensuring comprehensive evaluations of open-ended city-scale tasks.
Novelty
3DCity-LLM is the first to extend multimodal large language models to 3D city-scale scenes, addressing limitations of existing models in handling complex urban environments through its innovative feature encoding strategy and large-scale high-quality dataset. Compared to existing methods, 3DCity-LLM offers significant advantages in modeling object relationships and understanding global scenes.
Limitations
- 3DCity-LLM may face computational resource constraints when processing city scenes in real time, particularly for large-scale urban data.
- In extremely complex urban environments, the model may overlook certain details, leading to incomplete understanding.
- Although the dataset is high-quality, there may still be insufficient data for certain specific scenarios, affecting the model's generalization capabilities.
Future Work
Future research could focus on enhancing 3DCity-LLM's real-time processing capabilities and expanding the dataset to cover more diverse urban scenarios. Additionally, integrating other advanced visual and language model technologies to further enhance spatial reasoning and scene understanding capabilities is an important direction.
AI Executive Summary
In the realm of multimodal large language models, significant progress has been made in object-centric or indoor scenarios, yet scaling these models to 3D city-scale environments remains a formidable challenge. Existing models often lack comprehensive understanding of inter-object relationships and global scenes when handling complex urban environments.
To address this issue, the research team proposes 3DCity-LLM, a unified framework designed for 3D city-scale vision-language perception and understanding. This framework employs a coarse-to-fine feature encoding strategy comprising three parallel branches for target object, inter-object relationship, and global scene. To facilitate large-scale training, the study introduces the 3DCity-LLM-1.2M dataset, which includes approximately 1.2 million high-quality samples across seven task categories, from fine-grained object analysis to multi-faceted scene planning.
The core technical principle of 3DCity-LLM lies in its innovative feature encoding strategy, which integrates object-level geometry, inter-object relationship topology, and global scene semantics into a shared embedding space. Through task-driven instruction tuning, 3DCity-LLM can handle diverse tasks ranging from fine-grained object analysis to complex scene analysis and goal-oriented planning.
In experiments, 3DCity-LLM significantly outperforms existing state-of-the-art methods on two benchmarks, demonstrating robust perception and understanding in complex urban environments. It excels particularly in complex tasks such as object analysis, relationship computation, and scene planning.
The significance of this study lies not only in its demonstration of potential in spatial reasoning and urban intelligence within academia but also in providing technical support for industry applications such as urban planning and intelligent transportation. By introducing the 3DCity-LLM-1.2M dataset, the study fills a gap in existing datasets by providing rich training resources with explicit 3D spatial information.
However, 3DCity-LLM may face computational resource constraints when processing city scenes in real time, particularly with large-scale urban data. Future research could focus on enhancing the model's real-time processing capabilities and expanding the dataset to cover more diverse urban scenarios.
Deep Analysis
Background
In recent years, multimodal large language models (MLLMs) have rapidly transformed the field of artificial intelligence, demonstrating unprecedented capabilities in reasoning, generation, and multi-modality integration. Existing models such as ChatGPT-5, Qwen3, and LLaVA-Plus have shown that language-centric architectures can be adapted for cross-modality understanding. However, these models excel primarily in small-scale or object-centric scenarios, and their potential in 3D city-scale environments remains largely unexplored. Diverse city environments introduce a new level of complexity for multi-modality perception and understanding. Unlike indoor benchmarks that involve a limited number of objects, a city scene usually contains thousands of entities with heterogeneous attributes and intricate spatial relationships.
Core Problem
The core problem of multimodal perception and understanding in 3D city-scale scenes involves numerous heterogeneous objects and their complex spatial relationships. Existing multimodal large language models often lack comprehensive understanding of inter-object relationships and global scenes when handling such large-scale environments. Answering queries like 'Which hospital is closest to the railway station? And where is its emergency department located?' requires understanding object categories, precise spatial coordinates, relational proximity, and city scene layout. These tasks highlight the need for a unified framework that can simultaneously perform 3D object perception, relationship calculation, and holistic scene understanding.
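A query such as "Which hospital is closest to the railway station?" reduces, at its simplest, to a nearest-neighbor search over categorized objects with 3D coordinates. The sketch below is a toy illustration of that reasoning step; all object names and coordinates are invented, and the paper's actual scene representation is far richer.

```python
from math import dist

# Toy city objects with 3D centroids (names and coordinates are invented
# for illustration only).
objects = [
    {"name": "Central Hospital", "category": "hospital", "pos": (120.0, 40.0, 0.0)},
    {"name": "East Hospital",    "category": "hospital", "pos": (900.0, 310.0, 0.0)},
    {"name": "Railway Station",  "category": "station",  "pos": (150.0, 60.0, 0.0)},
]

def closest(category, anchor_name):
    """Answer 'which <category> is closest to <anchor>?' by 3D distance."""
    anchor = next(o for o in objects if o["name"] == anchor_name)
    candidates = [o for o in objects if o["category"] == category]
    return min(candidates, key=lambda o: dist(o["pos"], anchor["pos"]))

print(closest("hospital", "Railway Station")["name"])  # Central Hospital
```

The follow-up ("where is its emergency department located?") is what pushes the task beyond this simple lookup: it additionally requires fine-grained object understanding, which is why a unified framework is needed.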
Innovation
3DCity-LLM's core innovations include its coarse-to-fine feature encoding strategy and large-scale high-quality dataset. First, the model employs three parallel branches to achieve feature encoding for target objects, inter-object relationships, and global scenes, addressing limitations of existing models in handling complex urban environments. Second, the introduction of the 3DCity-LLM-1.2M dataset provides rich training resources, covering seven task categories from fine-grained object analysis to multi-faceted scene planning. Additionally, the study proposes a multi-dimensional evaluation protocol, combining traditional text-similarity metrics with LLM-based semantic assessments, ensuring comprehensive evaluations of open-ended city-scale tasks.
Methodology
- 3DCity-LLM employs a coarse-to-fine feature encoding strategy, comprising three parallel branches for target object, inter-object relationship, and global scene.
- It is trained on the 3DCity-LLM-1.2M dataset, which includes approximately 1.2 million high-quality samples across seven task categories, from fine-grained object analysis to multi-faceted scene planning.
- A multi-dimensional protocol based on text-similarity metrics and LLM-based semantic assessment ensures accurate evaluations.
- Through task-driven instruction tuning, 3DCity-LLM can handle diverse tasks ranging from fine-grained object analysis to complex scene analysis and goal-oriented planning.
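Task-driven instruction tuning typically pairs a natural-language instruction with scene context and a target answer. A hypothetical shape for one training sample is sketched below; the field names and values are assumptions for illustration, not the dataset's actual schema.

```python
import json

# Hypothetical instruction-tuning sample (field names are assumptions).
sample = {
    "task": "relationship_computation",   # one of the seven task categories
    "scene_id": "city_scene_0001",
    "instruction": "Which hospital is closest to the railway station?",
    "context": {
        "objects": [
            {"id": 12, "category": "hospital", "bbox_center": [120.0, 40.0, 0.0]},
            {"id": 47, "category": "station",  "bbox_center": [150.0, 60.0, 0.0]},
        ]
    },
    "answer": "The hospital with id 12 is closest to the railway station.",
}

print(json.dumps(sample, indent=2))
```

Structuring samples this way lets one dataset cover all seven task categories by varying only the `task`, `instruction`, and `answer` fields against shared scene context.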
Experiments
The experiments use two benchmarks to validate 3DCity-LLM's performance. Compared with existing state-of-the-art methods, the study demonstrates significant improvements in BLEU-4, METEOR, and reliability metrics. Ablation studies further confirm the effectiveness of the coarse-to-fine feature encoding strategy, particularly in enhancing spatial reasoning when handling large-scale urban scenes.
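To make the text-similarity side of the protocol concrete, here is a simplified sentence-level BLEU-4: the geometric mean of modified n-gram precisions (n = 1..4) with a brevity penalty. This is a pedagogical sketch; real evaluations normally use a standard, smoothed implementation (e.g. sacrebleu or nltk), and the LLM-based semantic assessment is a separate component not shown here.

```python
from collections import Counter
from math import exp, log

def bleu4(candidate, reference):
    """Simplified sentence-level BLEU-4 (no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, 5):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Modified precision: clip candidate n-gram counts by reference counts.
        overlap = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        precisions.append(overlap / max(sum(c_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else exp(1 - len(ref) / max(len(cand), 1))
    return bp * exp(sum(log(p) for p in precisions) / 4)

print(bleu4("the hospital is near the station",
            "the hospital is near the station"))  # 1.0
```

The limits of such n-gram metrics for open-ended answers are exactly why the protocol pairs them with LLM-based semantic assessment.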
Results
The experimental results show that 3DCity-LLM significantly outperforms existing state-of-the-art methods on two benchmarks, with gains of 0.50 to 8.40 in BLEU-4, 1.07 to 10.69 in METEOR, and 0.16 to 1.51 in reliability. It is particularly strong in complex tasks such as object analysis, relationship computation, and scene planning, demonstrating robust perception and understanding in complex urban environments.
Applications
Application scenarios for 3DCity-LLM include urban planning, intelligent transportation, and urban safety. By providing comprehensive understanding of city-scale scenes, the model can support decision-making in urban planning, optimize traffic flow, and enhance the efficiency of urban safety monitoring. In intelligent transportation, 3DCity-LLM can be used for real-time traffic flow analysis and route planning, improving traffic efficiency.
Limitations & Outlook
Despite 3DCity-LLM's excellent performance across tasks, it may face computational resource constraints when processing city scenes in real time, particularly with large-scale urban data. Additionally, in extremely complex urban environments, the model may overlook certain details, leading to incomplete understanding. Future research could focus on enhancing the model's real-time processing capabilities and expanding the dataset to cover more diverse urban scenarios.
Plain Language (accessible to non-experts)
Imagine 3DCity-LLM as a super city tour guide. As you walk around the city, wondering where the nearest hospital is or the history of a particular building, 3DCity-LLM acts like an all-knowing guide, quickly answering your questions. It not only tells you which hospital is closest but also provides detailed descriptions of each building's features and their relationships. Like a giant city map, it can see the location, shape, and distance of every building. In this way, 3DCity-LLM helps us better understand and plan cities, acting like a smart city brain.
ELI14 (explained like you're 14)
Hey there, imagine you're playing a super cool city simulation game, and you need to know where every building is and the best places to visit. 3DCity-LLM is like your game assistant, quickly telling you everything you want to know! For example, you want to know where the nearest hospital is or which park is best for a picnic. 3DCity-LLM is like a super-smart city guide, helping you find the answers. It can see the whole city's layout and knows every building's details, just like you see in the game. Isn't that awesome?
Glossary
3DCity-LLM
A unified framework designed for 3D city-scale vision-language perception and understanding, employing a coarse-to-fine feature encoding strategy.
Used for handling large-scale urban scene multimodal tasks.
Multimodal Large Language Model (MLLM)
Large language models that integrate multiple modalities such as text, images, and 3D data for understanding and generation.
Used for cross-modality understanding and task execution.
Coarse-to-fine Feature Encoding
A feature encoding strategy that extracts features for target objects, inter-object relationships, and global scenes in a hierarchical manner.
Used for feature extraction in 3DCity-LLM.
3DCity-LLM-1.2M Dataset
A dataset containing approximately 1.2 million high-quality samples, covering seven task categories to support 3DCity-LLM training.
Used for large-scale training and evaluation of 3DCity-LLM.
BLEU-4
A metric for evaluating the similarity between generated text and reference text, commonly used in machine translation and text generation tasks.
Used to evaluate the quality of 3DCity-LLM's generation.
METEOR
A text similarity evaluation metric that combines morphological, synonym, and word order information, commonly used in natural language processing tasks.
Used to evaluate the quality of 3DCity-LLM's generation.
Task-driven Instruction Tuning
Adjusting the model's behavior through specific task instructions, enabling it to adapt to diverse task requirements.
Used for task execution in 3DCity-LLM.
Inter-object Relationship Topology
The topological structure describing spatial relationships between objects, including adjacency, containment, and orientation.
Used for relationship modeling in 3DCity-LLM.
Global Scene Semantics
Semantic understanding of the entire scene, including object composition, spatial layout, and contextual cues.
Used for scene understanding in 3DCity-LLM.
Multi-dimensional Evaluation Protocol
Combines text similarity metrics and LLM-based semantic assessments to ensure comprehensive evaluations of tasks.
Used to evaluate the performance of 3DCity-LLM.
Open Questions (unanswered questions from this research)
1. Existing multimodal large language models often lack comprehensive understanding of inter-object relationships and global scenes in 3D city-scale environments, largely because most are trained on small-scale or object-centric scenarios without large-scale urban data. Future research needs larger, more diverse datasets to support training and evaluation in complex urban environments.
2. Although 3DCity-LLM performs well across tasks, it may face computational resource constraints when processing city scenes in real time; existing computational capabilities may not suffice for large-scale urban data. Future research needs more efficient computational methods to support real-time applications of the model.
3. In extremely complex urban environments, 3DCity-LLM may overlook certain details, leading to incomplete understanding, since training may not have sufficiently covered all possible urban scenarios. Future research needs to expand the dataset to cover more diverse urban scenarios.
4. Existing evaluation metrics, such as BLEU and METEOR, may not be sufficient to comprehensively evaluate 3DCity-LLM's performance on complex urban tasks, because they focus primarily on text similarity and neglect semantic understanding and reasoning. Future research needs new evaluation methods to assess model performance more comprehensively.
5. Although 3DCity-LLM has potential in urban planning and intelligent transportation, its effectiveness in practical applications still needs further verification, since performance under laboratory conditions may differ from real-world scenarios. Future research needs more field tests to verify the model's practical effectiveness.
Applications
Immediate Applications
Urban Planning
3DCity-LLM can be used in urban planning decision-making, helping planners better understand city layouts and object relationships to develop more reasonable planning schemes.
Intelligent Transportation
By providing comprehensive understanding of city-scale scenes, 3DCity-LLM can be used for real-time traffic flow analysis and route planning, improving traffic efficiency.
Urban Safety
3DCity-LLM can be used in urban safety monitoring, identifying potential safety hazards through comprehensive understanding of city scenes, enhancing urban safety levels.
Long-term Vision
Smart City
3DCity-LLM can serve as a core technology for smart cities, supporting intelligent management and operation of cities, improving overall efficiency and quality of life for residents.
Virtual Reality City Simulation
By combining virtual reality technology, 3DCity-LLM can be used for city simulation and training, helping planners and managers better understand and manage cities.
Abstract
While multi-modality large language models excel in object-centric or indoor scenarios, scaling them to 3D city-scale environments remains a formidable challenge. To bridge this gap, we propose 3DCity-LLM, a unified framework designed for 3D city-scale vision-language perception and understanding. 3DCity-LLM employs a coarse-to-fine feature encoding strategy comprising three parallel branches for target object, inter-object relationship, and global scene. To facilitate large-scale training, we introduce 3DCity-LLM-1.2M dataset that comprises approximately 1.2 million high-quality samples across seven representative task categories, ranging from fine-grained object analysis to multi-faceted scene planning. This strictly quality-controlled dataset integrates explicit 3D numerical information and diverse user-oriented simulations, enriching the question-answering diversity and realism of urban scenarios. Furthermore, we apply a multi-dimensional protocol based on text-similarity metrics and LLM-based semantic assessment to ensure faithful and comprehensive evaluations for all methods. Extensive experiments on two benchmarks demonstrate that 3DCity-LLM significantly outperforms existing state-of-the-art methods, offering a promising and meaningful direction for advancing spatial reasoning and urban intelligence. The source code and dataset are available at https://github.com/SYSU-3DSTAILab/3D-City-LLM.