Where Do Vision-Language Models Fail? World Scale Analysis for Image Geolocalization

TL;DR

This study systematically evaluates various vision-language models for country-level image geolocalization, revealing their limitations in capturing fine-grained geographic cues.

cs.CV · 2026-04-18
Siddhant Bharadwaj, Ashish Vashist, Fahimul Aleem, Shruti Vyas
Vision-Language Models · Geolocalization · Zero-Shot Reasoning · Semantic Reasoning · Multimodal

Key Findings

Methodology

The study employs a unified benchmarking framework to evaluate multiple state-of-the-art Vision-Language Models (VLMs) for country-level image geolocalization. Using three geographically diverse datasets, it focuses on ground-view images and carefully designed prompts for country prediction. Evaluation metrics include Top-1 and Top-5 accuracy, along with Environmental Stratification, Error Structure Analysis, and the Geographic Error Reasonableness (GER) score.
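As a concrete reference, Top-1 and Top-5 accuracy reduce to a few lines of Python. The predictions below are invented for illustration, not taken from the paper's data:

```python
def topk_accuracy(predictions, labels, k=1):
    """Fraction of samples whose true country appears among the model's
    top-k ranked country predictions."""
    hits = sum(1 for ranked, truth in zip(predictions, labels)
               if truth in ranked[:k])
    return hits / len(labels)

# Toy example (hypothetical predictions):
preds = [["France", "Belgium"], ["Japan", "South Korea"], ["Brazil", "Argentina"]]
truth = ["France", "South Korea", "Chile"]
print(topk_accuracy(preds, truth, k=1))  # 1/3: only the first sample is a Top-1 hit
print(topk_accuracy(preds, truth, k=5))  # 2/3: South Korea appears in the second ranking
```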

Key Results

  • Result 1: The Qwen3-VL-4B model achieved a Top-1 accuracy of 74.79% on the GeoGuessr-50k dataset and 65.78% on the CityGuessr dataset, demonstrating its strong capability in country-level geolocalization.
  • Result 2: The study found that the Qwen3-VL-8B model underperforms compared to smaller models in certain scenarios, indicating that increasing parameter count does not always enhance geographic reasoning.
  • Result 3: Through the Geographic Error Reasonableness (GER) score, the study reveals visually reasonable error patterns, such as confusion between neighboring countries.

Significance

This study provides the first systematic comparison of modern Vision-Language Models for country-level geolocalization, laying the foundation for future research at the intersection of multimodal reasoning and geographic understanding. It highlights the limitations of current VLMs in capturing fine-grained geographic cues and emphasizes the potential of semantic reasoning for coarse geolocalization. This is crucial for developing more precise geographic reasoning models in the future.

Technical Contribution

Technical contributions include introducing a standardized evaluation protocol for prompt-based geolocalization, reducing confounding factors related to training and architectural modifications. Additionally, the study introduces the Geographic Error Reasonableness (GER) score, a novel metric for evaluating whether incorrect predictions are visually justified, revealing key failure patterns such as the inverted scaling phenomenon in the Qwen3-VL family.
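The paper's exact GER formulation is not reproduced in this summary. The sketch below is one plausible border-adjacency version, in which an incorrect prediction counts as "reasonable" when the predicted country shares a land border with the true one; the adjacency map is a tiny invented subset:

```python
def ger_score(preds, labels, borders):
    """Share of *incorrect* predictions that fall in a country bordering
    the true one, i.e. visually plausible confusions. Returns 1.0 when
    there are no errors at all."""
    errors = [(p, t) for p, t in zip(preds, labels) if p != t]
    if not errors:
        return 1.0
    reasonable = sum(1 for p, t in errors if p in borders.get(t, set()))
    return reasonable / len(errors)

# Tiny illustrative adjacency map (not exhaustive):
BORDERS = {
    "Switzerland": {"Austria", "France", "Germany", "Italy"},
    "Brazil": {"Argentina", "Peru", "Colombia"},
}

# Austria-for-Switzerland is a "reasonable" miss; Japan-for-Brazil is not.
print(ger_score(["Austria", "Japan"], ["Switzerland", "Brazil"], BORDERS))  # 0.5
```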

Novelty

This study is the first to systematically compare modern Vision-Language Models in the task of country-level geolocalization. Unlike traditional retrieval-based geolocalization methods, this study employs prompt-based reasoning to directly infer the likely country of origin, pioneering a new paradigm for geolocalization.

Limitations

  • Limitation 1: The study reveals significant limitations of current VLMs in capturing fine-grained geographic cues, especially in distinguishing neighboring countries with similar visual features.
  • Limitation 2: The inverted scaling phenomenon observed in model performance suggests that increasing parameter count does not always enhance geographic reasoning, indicating that language decoding rather than visual representation may be the performance bottleneck.
  • Limitation 3: The datasets used in the study exhibit geographic bias, such as the Western/developed-country bias in the GeoGuessr-50k dataset.

Future Work

Future research directions include developing models that better capture fine-grained geographic cues and exploring how to enhance geographic reasoning by integrating multimodal data such as text and video. Additionally, studies can further analyze model performance differences across various geographic and cultural contexts to improve global applicability.

AI Executive Summary

Image geolocalization aims to determine the geographic location where a query image was captured, traditionally addressed through retrieval-based place recognition or geometry-based visual localization pipelines. However, these methods often require large, curated databases and complex matching procedures. Recent advances in Vision-Language Models (VLMs) have demonstrated strong zero-shot reasoning capabilities across multimodal tasks, yet their performance in geographic inference remains underexplored.

This study systematically evaluates multiple state-of-the-art VLMs for country-level image geolocalization using three geographically diverse datasets, focusing on ground-view imagery. Instead of relying on image matching or task-specific training, the study evaluates prompt-based country prediction, revealing substantial variation across models and their limitations in capturing fine-grained geographic cues.

The results show significant differences in model performance, with the Qwen3-VL-4B model achieving a Top-1 accuracy of 74.79% on the GeoGuessr-50k dataset and 65.78% on the CityGuessr dataset. However, the Qwen3-VL-8B model underperforms compared to smaller models in certain scenarios, indicating that increasing parameter count does not always enhance geographic reasoning.

Additionally, the study introduces the Geographic Error Reasonableness (GER) score, a novel metric for evaluating whether incorrect predictions are visually justified, revealing key failure patterns such as confusion between neighboring countries. The study also finds that all models perform better in urban scenes than in rural ones, highlighting their reliance on man-made landmarks and dense built environments.

This study lays the foundation for future research at the intersection of multimodal reasoning and geographic understanding, emphasizing the potential of semantic reasoning for coarse geolocalization. Future research directions include developing models that better capture fine-grained geographic cues and exploring how to enhance geographic reasoning by integrating multimodal data.

Deep Analysis

Background

Image geolocalization aims to determine the geographic location where a query image was captured. Traditional methods primarily rely on retrieval-based place recognition or geometry-based visual localization pipelines. These methods often require large, curated databases and complex matching procedures, limiting their applicability in dynamic and diverse environments. Recent advances in Vision-Language Models (VLMs) have demonstrated strong zero-shot reasoning capabilities across multimodal tasks, offering new possibilities for geolocalization. VLMs are pretrained on large-scale image-text pairs and can implicitly encode high-level semantic, cultural, architectural, and environmental cues indicative of geographic context. However, their performance in geographic inference remains underexplored, particularly their ability to capture fine-grained geographic cues.

Core Problem

The core problem is the unclear performance of current Vision-Language Models in country-level image geolocalization tasks, especially in capturing fine-grained geographic cues. Traditional geolocalization methods rely on explicit feature matching or metric learning, whereas VLMs use prompt-based reasoning to directly infer the likely country of origin. The challenge is to evaluate these models' intrinsic geospatial reasoning capabilities and reveal their performance differences across various geographic and cultural contexts.

Innovation

The core innovations of this study include:

1. Introducing a standardized evaluation protocol for prompt-based geolocalization, reducing confounding factors related to training and architectural modifications.

2. Proposing the Geographic Error Reasonableness (GER) score, a novel metric for evaluating whether incorrect predictions are visually justified.

3. Systematically evaluating multiple state-of-the-art VLMs, revealing their limitations in capturing fine-grained geographic cues and emphasizing the potential of semantic reasoning for coarse geolocalization.

4. Discovering the inverted scaling phenomenon in the Qwen3-VL family, indicating that increasing parameter count does not always enhance geographic reasoning.

Methodology

The study's methodology includes:

  • Using three geographically diverse datasets (GeoGuessr-50k, CityGuessr, and OSV5M), focusing on ground-view imagery.
  • Employing carefully designed prompts for country prediction, evaluating the models' zero-shot geographic reasoning capabilities.
  • Using Top-1 and Top-5 accuracy as primary evaluation metrics, along with Environmental Stratification, Error Structure Analysis, and the Geographic Error Reasonableness (GER) score for multidimensional evaluation.
  • Evaluating nine multimodal vision-language models, with model sizes ranging from 1B to 8B parameters, all in their publicly released pretrained form without task-specific fine-tuning.
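One practical detail of prompt-based evaluation is that VLMs answer in free text, so outputs must first be mapped onto the canonical country label space before scoring. A minimal sketch of such post-processing, with an invented alias table (the paper's actual parsing rules are not reproduced here):

```python
import re

# Hypothetical alias table and label space, for illustration only.
ALIASES = {
    "usa": "United States", "united states of america": "United States",
    "holland": "Netherlands", "the netherlands": "Netherlands",
    "uk": "United Kingdom",
}
COUNTRIES = {"United States", "Netherlands", "United Kingdom", "South Korea", "France"}

def normalize_answer(text):
    """Extract a canonical country name from a free-text VLM answer,
    or None if no known country is mentioned."""
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    for alias, canon in ALIASES.items():          # aliases first
        if re.search(rf"\b{alias}\b", cleaned):
            return canon
    for country in COUNTRIES:                     # then canonical names
        if re.search(rf"\b{country.lower()}\b", cleaned):
            return country
    return None

print(normalize_answer("This looks like it was taken in the Netherlands."))  # Netherlands
```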

Experiments

The experimental design includes:

  • Evaluating on three datasets: GeoGuessr-50k, CityGuessr, and OSV5M, covering different image sources, geographic scales, and label spaces.
  • Evaluation metrics include Top-1 and Top-5 accuracy, Environmental Stratification, Error Structure Analysis, and the Geographic Error Reasonableness (GER) score.
  • Using greedy decoding to ensure deterministic evaluation, with image preprocessing following each model's recommended inference pipeline.
  • Performing urban/rural classification and biome-level categorization using multiple annotators to mitigate dependence on any single representation model.
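The environmental stratification analysis amounts to computing accuracy separately per scene stratum. A minimal sketch with invented labels:

```python
from collections import defaultdict

def stratified_accuracy(preds, labels, strata):
    """Top-1 accuracy computed separately per environment stratum
    (e.g. 'urban' vs 'rural')."""
    hits, totals = defaultdict(int), defaultdict(int)
    for p, t, s in zip(preds, labels, strata):
        totals[s] += 1
        hits[s] += (p == t)
    return {s: hits[s] / totals[s] for s in totals}

# Toy example (invented values):
preds  = ["France", "France", "Peru", "Chile"]
truth  = ["France", "Germany", "Peru", "Peru"]
strata = ["urban", "urban", "rural", "rural"]
print(stratified_accuracy(preds, truth, strata))  # {'urban': 0.5, 'rural': 0.5}
```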

Results

Results analysis includes:

  • The Qwen3-VL-4B model achieved a Top-1 accuracy of 74.79% on the GeoGuessr-50k dataset and 65.78% on the CityGuessr dataset.
  • The Qwen3-VL-8B model underperforms smaller models in certain scenarios, indicating that increasing parameter count does not always enhance geographic reasoning.
  • The Geographic Error Reasonableness (GER) score reveals visually reasonable error patterns, such as confusion between neighboring countries.
  • All models perform better in urban scenes than in rural ones, highlighting their reliance on man-made landmarks and dense built environments.

Applications

Application scenarios include:

  • Direct use in country-level image geolocalization tasks, especially in the absence of large-scale databases and complex matching procedures.
  • Enhancing geographic reasoning by integrating multimodal data such as text and video.
  • Improving global applicability by analyzing model performance differences across geographic and cultural contexts.

Limitations & Outlook

Limitations & outlook include:

  • Current VLMs have significant limitations in capturing fine-grained geographic cues, especially in distinguishing neighboring countries with similar visual features.
  • The inverted scaling phenomenon observed in model performance suggests that language decoding rather than visual representation may be the performance bottleneck.
  • The datasets used in the study exhibit geographic bias, such as the Western/developed-country bias in the GeoGuessr-50k dataset.
  • Future research directions include developing models that better capture fine-grained geographic cues and exploring how to enhance geographic reasoning by integrating multimodal data.

Plain Language (accessible to non-experts)

Imagine you're at a large international airport trying to guess which country you're in by observing your surroundings. You see some iconic buildings, local billboards, and people's clothing. These are important clues to help you determine your location. A vision-language model is like a super-smart traveler that can infer the location of an image by observing these details.

However, this model sometimes makes mistakes, especially when countries have similar architectural styles or natural landscapes. For example, the Alpine landscapes of Switzerland and Austria might confuse the model because they look very similar.

To study how well these models do this, researchers used an approach called 'prompt-based reasoning', which is like asking the traveler directly: instead of matching the image against a complex database or training the model specifically for this task, they simply prompt the model to name the most likely country, drawing on its own built-in knowledge.

In this way, the model can offer a sensible guess even when there are no clear geographic markers. It's like being at an airport without seeing any iconic buildings but still guessing the country by observing people's clothing and language.

ELI14 (explained like you're 14)

Hey there! Have you ever played a game called GeoGuessr? In this game, you see a random street view image, and you have to guess which country it's from. Sounds cool, right?

Scientists are doing something similar. They use a super-smart program called a vision-language model to guess where an image was taken. This program is like a detective, looking at buildings, billboards, and natural landscapes to find clues.

But sometimes, this program makes mistakes, especially when countries look alike. For example, the mountain views in Switzerland and Austria can easily confuse the program.

To find out how good this program really is, scientists used a method called 'prompt-based reasoning': they simply ask the program which country a picture comes from and check its answers. That way they can measure when it guesses right, and spot the situations where it gets confused!

So next time you play GeoGuessr, imagine having a super-smart assistant helping you guess where the image was taken. Isn't that awesome?

Glossary

Vision-Language Models

Vision-Language Models are AI models capable of processing both visual and language information, typically pretrained on large-scale image-text pairs.

Used in this paper to evaluate their performance in geolocalization tasks.

Geolocalization

Geolocalization refers to the process of determining the location where an image was captured, often involving identifying geographic features and cultural cues.

Used in this paper to assess the geographic reasoning capabilities of vision-language models.

Zero-Shot Reasoning

Zero-Shot Reasoning is the ability of a model to perform new tasks without task-specific training.

Evaluated in this paper for country-level geolocalization performance.

Prompt-Based Reasoning

Prompt-Based Reasoning is a method of eliciting predictions from a model through natural-language instructions (prompts), without relying on task-specific training.

Used in this paper to obtain country predictions directly from pretrained models.

Geographic Error Reasonableness

Geographic Error Reasonableness is a metric for evaluating whether incorrect predictions are visually justified, considering visual similarity between neighboring countries.

Used in this paper to analyze model error patterns.

Environmental Stratification

Environmental Stratification involves classifying images based on environmental features (e.g., urban or rural) to evaluate model performance in different settings.

Used in this paper to analyze urban/rural performance differences.

Error Structure Analysis

Error Structure Analysis evaluates model performance by analyzing the geographic proximity of prediction errors.

Used in this paper to assess error patterns in geolocalization tasks.

Biome-Level Categorization

Biome-Level Categorization classifies images based on natural landscape features to evaluate model performance across different biomes.

Used in this paper to analyze biome performance differences.

Neighbor Hop Distance

Neighbor Hop Distance measures the geographic proximity of prediction errors, typically by counting border crossings.

Used in this paper to analyze error structure.
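Neighbor hop distance can be computed as a shortest path over a country border-adjacency graph. The sketch below uses a tiny invented subset of Europe's borders (the paper's full adjacency data is not reproduced here):

```python
from collections import deque

# Illustrative border-adjacency graph (tiny subset, not exhaustive).
BORDERS = {
    "France": ["Belgium", "Germany", "Switzerland", "Italy", "Spain"],
    "Belgium": ["France", "Germany", "Netherlands"],
    "Germany": ["France", "Belgium", "Switzerland", "Austria", "Netherlands"],
    "Switzerland": ["France", "Germany", "Italy", "Austria"],
    "Italy": ["France", "Switzerland", "Austria"],
    "Austria": ["Germany", "Switzerland", "Italy"],
    "Netherlands": ["Belgium", "Germany"],
    "Spain": ["France"],
}

def hop_distance(pred, true, borders):
    """Minimum number of land-border crossings from the true country to
    the predicted one (BFS over the adjacency graph); None if the two
    countries are not connected in the graph."""
    if pred == true:
        return 0
    seen, queue = {true}, deque([(true, 0)])
    while queue:
        country, d = queue.popleft()
        for nb in borders.get(country, []):
            if nb == pred:
                return d + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, d + 1))
    return None

print(hop_distance("Netherlands", "Spain", BORDERS))  # 3: Spain→France→Belgium→Netherlands
```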

Inverted Scaling Phenomenon

Inverted Scaling Phenomenon refers to the observation that increasing model parameter count does not always enhance performance, suggesting language decoding may be a bottleneck.

Used in this paper to explain performance differences.

Open Questions (unanswered questions from this research)

  • Open Question 1: Current vision-language models have significant limitations in capturing fine-grained geographic cues, especially in distinguishing neighboring countries with similar visual features. Models that better capture these details are needed.
  • Open Question 2: The inverted scaling phenomenon suggests that language decoding rather than visual representation may be the performance bottleneck. Further research is needed to understand the impact of language decoding on geographic reasoning.
  • Open Question 3: The datasets used in the study exhibit geographic bias, such as the Western/developed-country bias in the GeoGuessr-50k dataset. More representative datasets are needed to evaluate global applicability.
  • Open Question 4: How to enhance geographic reasoning by integrating multimodal data such as text and video remains open, especially in the absence of clear geographic markers.
  • Open Question 5: Model performance differences across geographic and cultural contexts have not been fully analyzed; further research is needed to improve global applicability.
  • Open Question 6: How the Geographic Error Reasonableness (GER) score, as a novel metric, can be applied and extended to other tasks remains to be explored.
  • Open Question 7: How to improve geographic reasoning capabilities without increasing computational complexity, especially in resource-constrained environments.

Applications

Immediate Applications

Country-Level Image Geolocalization

This technology can be used for country-level image geolocalization tasks, especially in the absence of large-scale databases and complex matching procedures.

Multimodal Data Integration

Enhancing geographic reasoning by integrating multimodal data such as text and video, applicable in scenarios requiring rapid and accurate localization.

Global Applicability Analysis

Analyzing model performance differences across various geographic and cultural contexts to improve global applicability, especially in diverse environments.

Long-term Vision

Smart City Planning

Utilizing model geographic reasoning capabilities for smart city planning, optimizing resource allocation and infrastructure development, overcoming current data limitations in urban planning.

Global Environmental Monitoring

Applied in global environmental monitoring, analyzing changes in natural landscapes through images, providing real-time environmental data support for sustainable development.

Abstract

Image geolocalization has traditionally been addressed through retrieval-based place recognition or geometry-based visual localization pipelines. Recent advances in Vision-Language Models (VLMs) have demonstrated strong zero-shot reasoning capabilities across multimodal tasks, yet their performance in geographic inference remains underexplored. In this work, we present a systematic evaluation of multiple state-of-the-art VLMs for country-level image geolocalization using ground-view imagery only. Instead of relying on image matching, GPS metadata, or task-specific training, we evaluate prompt-based country prediction in a zero-shot setting. The selected models are tested on three geographically diverse datasets to assess their robustness and generalization ability. Our results reveal substantial variation across models, highlighting the potential of semantic reasoning for coarse geolocalization and the limitations of current VLMs in capturing fine-grained geographic cues. This study provides the first focused comparison of modern VLMs for country-level geolocalization and establishes a foundation for future research at the intersection of multimodal reasoning and geographic understanding.


References (20)

  • OpenStreetView-5M: The Many Roads to Global Visual Geolocation. Guillaume Astruc, Nicolas Dufour, Ioannis Siglidis et al., 2024.
  • Visual Instruction Tuning. Haotian Liu, Chunyuan Li, Qingyang Wu et al., 2023.
  • GPT-4o System Card. Aaron Hurst, Adam Lerer, Adam P. Goucher et al. (OpenAI), 2024.
  • Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling. Zhe Chen, Weiyun Wang, Yue Cao et al., 2024.
  • Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. Machel Reid, N. Savinov, Denis Teplyashin et al., 2024.
  • On the location dependence of convolutional neural network features. Scott Workman, Nathan Jacobs, 2015.
  • Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization. Guopeng Li, Ming Qian, Gui-Song Xia, 2024.
  • Learned Contextual Feature Reweighting for Image Geo-Localization. Hyo Jin Kim, Enrique Dunn, Jan-Michael Frahm, 2017.
  • End-to-End Learning of Deep Visual Representations for Image Retrieval. Albert Gordo, Jon Almazán, Jérôme Revaud et al., 2016.
  • Cross-View Image Sequence Geo-localization. Xiaohan Zhang, Waqas Sultani, S. Wshah, 2022.
  • UAV Pose Estimation using Cross-view Geolocalization with Satellite Imagery. Akshay Shetty, G. Gao, 2018.
  • Adaptive-Attentive Geolocalization From Few Queries: A Hybrid Approach. G. Berton, Valerio Paolicelli, Carlo Masone et al., 2020.
  • RetinaFace: Single-shot Multi-level Face Localization in the Wild.
  • Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs. Jonathan Roberts, Timo Lüddecke, Rehan Sheikh et al., 2023.
  • Ground-to-Aerial Image Geo-Localization With a Hard Exemplar Reweighting Triplet Loss. Sudong Cai, Yulan Guo, Salman Hameed Khan et al., 2019.
  • GeoX-Bench: Benchmarking Cross-View Geo-Localization and Pose Estimation Capabilities of Large Multimodal Models. Yushuo Zheng, Jiangyong Ying, Huiyu Duan et al., 2025.
  • NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. Relja Arandjelović, Petr Gronát, A. Torii et al., 2015.
  • Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. Jinze Bai, Shuai Bai, Shusheng Yang et al., 2023.
  • Order Matters: Exploring Order Sensitivity in Multimodal Large Language Models. Zhijie Tan, Xu Chu, Weiping Li et al., 2024.
  • Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions. Pouya Pezeshkpour, Estevam Hruschka, 2023.