LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

TL;DR

LocateAnything uses Parallel Box Decoding and 138M samples to significantly boost visual grounding speed and accuracy.

cs.CV 🔴 Advanced 2026-05-27

Shihao Wang, Shilong Liu, Yuanguo Kuang, Xinyu Wei, Yangzhou Liu, Zhiqi Li, Yunze Man, Guo Chen, Andrew Tao, Guilin Liu, Jan Kautz, Lei Zhang, Zhiding Yu

Vision-Language Models, Visual Grounding, Parallel Decoding, Large-scale Dataset, Generative Models

Key Findings

Methodology

LocateAnything introduces a unified generative framework for visual grounding and detection based on Parallel Box Decoding (PBD). Unlike conventional vision-language models that serialize 2D bounding boxes into multiple 1D tokens decoded sequentially, PBD treats geometric elements such as bounding boxes and points as atomic units decoded in a single step. This preserves intra-box geometric coherence and unlocks substantial parallelism, overcoming the sequential bottleneck inherent in token-by-token decoding. The framework leverages a large-scale curated dataset, LocateAnything-Data, containing over 138 million diverse training samples, enhancing model generalization and localization precision. Extensive experiments across multiple benchmarks demonstrate that PBD improves both decoding throughput and high-IoU localization accuracy, validating the synergy between parallel decoding and large-scale data.

Key Results

LocateAnything achieves over 30% improvement in decoding speed on COCO and LVIS datasets, while increasing high-IoU (>0.7) localization accuracy by more than 5%, outperforming traditional token-by-token generation methods.
Training on the 138M-sample LocateAnything-Data significantly enhances model generalization, yielding a 4.3% AP increase on LVIS, particularly improving small object and complex background localization.
Ablation studies reveal that PBD reduces inference time by approximately 40% compared to sequential decoding, while maintaining or improving bounding box geometric consistency, confirming the effectiveness of parallel decoding.

Significance

LocateAnything addresses a fundamental bottleneck in vision-language models for visual grounding: the sequential tokenization and generation of bounding box coordinates that disrupt geometric coherence and limit inference speed. By introducing Parallel Box Decoding, the method preserves box geometry and enables substantial parallelism, significantly enhancing both accuracy and throughput. Coupled with a massive, diverse training dataset, LocateAnything advances the state-of-the-art in unified visual grounding and detection, facilitating real-time applications in autonomous driving, surveillance, and augmented reality. This work bridges a critical gap between model efficiency and precision, with broad implications for multimodal AI systems.

Technical Contribution

This work's technical contributions include: (1) the novel Parallel Box Decoding mechanism that treats bounding boxes as atomic units for simultaneous coordinate generation, overcoming the inefficiencies of token-by-token decoding and improving geometric coherence; (2) the development of a scalable data engine and the curation of the LocateAnything-Data dataset with over 138 million diverse training samples, substantially enhancing model robustness and generalization; (3) comprehensive experimental validation demonstrating the synergistic benefits of PBD and large-scale data, pushing the performance frontier of unified vision-language grounding and detection frameworks with practical engineering viability.

Novelty

LocateAnything is the first to systematically propose decoding 2D bounding boxes as atomic units in parallel rather than sequentially tokenizing coordinates. This fundamental shift preserves geometric structure within boxes and dramatically improves decoding efficiency. Unlike prior works that rely on token-by-token generation or end-to-end detection without explicit parallel decoding, LocateAnything uniquely combines parallel decoding with large-scale training data to achieve superior speed-accuracy trade-offs in visual grounding.

Limitations

Limitation 1: Performance degrades in scenarios with extreme occlusion and very small objects, primarily due to limited representation of such cases in the training data, constraining model generalization.
Limitation 2: The parallel decoding approach demands substantial hardware resources, especially GPU memory, posing challenges for deployment on resource-constrained devices.
Limitation 3: The current framework focuses on 2D bounding boxes and has not yet been extended to 3D spatial localization or more complex geometric shapes, limiting applicability in certain domains.

Future Work

Future directions include extending the Parallel Box Decoding mechanism to 3D visual grounding tasks, enhancing robustness to occlusion and small object detection, and optimizing model architectures to reduce computational and memory overhead. Additionally, expanding the training dataset to encompass more complex scenes and multimodal signals will further improve generalization. These efforts aim to facilitate broader real-world deployment and advance unified vision-language understanding.

AI Executive Summary

Vision-language models (VLMs) have revolutionized multimodal understanding by jointly processing visual and textual information. However, in visual grounding and detection tasks, conventional approaches typically serialize 2D bounding boxes into multiple 1D coordinate tokens, generating them sequentially. This token-by-token decoding disrupts the inherent geometric coherence within bounding boxes and imposes a strict sequential bottleneck on inference speed, limiting real-time applicability.

To overcome these challenges, LocateAnything proposes a unified generative framework based on Parallel Box Decoding (PBD). By decoding geometric elements such as bounding boxes and points as atomic units in a single step, PBD preserves intra-box geometric structure and unlocks significant parallelism, substantially accelerating inference without sacrificing accuracy. Complementing this architectural change, the authors curate LocateAnything-Data, a large-scale dataset comprising over 138 million diverse training samples, which improves the model's ability to generalize across complex scenes and object categories.

The core technical insight of PBD lies in treating the four coordinates of a bounding box as a single decoding unit rather than independent tokens. This holistic approach maintains geometric consistency and allows multiple boxes to be decoded simultaneously, breaking the sequential generation bottleneck inherent in prior models. The framework leverages a Transformer-based architecture to implement this parallel decoding strategy effectively.
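The effect on the number of decoder steps can be illustrated with simple counting. This is a rough sketch, not the paper's measurement: the function names and the `boxes_per_step` parameter are hypothetical, and the count ignores any separator, label, or end-of-sequence tokens a real model would also emit.

```python
import math


def sequential_steps(num_boxes: int, tokens_per_box: int = 4) -> int:
    """Decoder forward passes when every coordinate is its own token."""
    return num_boxes * tokens_per_box


def parallel_steps(num_boxes: int, boxes_per_step: int = 1) -> int:
    """Decoder forward passes when each step emits whole boxes.

    boxes_per_step is an assumed knob for how many boxes one step produces.
    """
    return math.ceil(num_boxes / boxes_per_step)


if __name__ == "__main__":
    n = 20
    print(sequential_steps(n))   # 80
    print(parallel_steps(n))     # 20
    print(parallel_steps(n, 8))  # 3
```

Step counts are only a proxy for wall-clock time: each parallel step may cost more than a single-token step, so the real speedup depends on the implementation and hardware.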

Extensive experiments on standard benchmarks such as COCO and LVIS demonstrate that LocateAnything achieves over 30% faster decoding speeds and improves high-IoU (>0.7) localization accuracy by more than 5% compared to traditional token-by-token methods. The large-scale training data further boosts performance, particularly in challenging scenarios involving small objects and cluttered backgrounds, with an AP increase of 4.3% on LVIS. Ablation studies confirm that PBD reduces inference time by approximately 40% while enhancing bounding box geometric integrity.

LocateAnything’s advances have significant implications for real-world applications requiring fast and precise visual grounding, including autonomous driving, intelligent surveillance, augmented reality, and robotics. By addressing the fundamental trade-off between decoding speed and localization accuracy, this work paves the way for more efficient and scalable vision-language systems.

Despite its strengths, LocateAnything has limitations: reduced robustness under extreme occlusion and for very small objects, high hardware resource demands, and a current restriction to 2D bounding boxes. Future work aims to extend PBD to 3D spatial localization, improve model efficiency, and expand training data diversity to further improve performance and applicability. Overall, LocateAnything combines a new decoding scheme with large-scale data to advance the speed-accuracy frontier in unified visual grounding and detection.

Deep Analysis

Background

Vision-language models (VLMs) have emerged as a powerful paradigm for integrating visual and textual modalities, enabling tasks such as image captioning, visual question answering, and visual grounding. Early influential models like ViLBERT and UNITER established joint embedding spaces for vision and language, facilitating semantic alignment. In visual grounding and detection, the goal is to localize objects in images based on natural language queries. Traditionally, generative VLMs formulate bounding box prediction as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens representing coordinates. While effective, this approach introduces two critical issues: it breaks the geometric coherence of bounding boxes by treating coordinates independently, and the sequential token generation imposes inference speed bottlenecks.

Recent advances in object detection, such as DETR and its variants, have explored end-to-end detection frameworks with set-based prediction, but these methods do not explicitly address the sequential decoding inefficiency in vision-language grounding. Moreover, existing datasets for visual grounding are limited in scale and diversity, constraining model generalization to complex real-world scenarios. Thus, there remains a pressing need for methods that can jointly improve decoding efficiency and localization accuracy while leveraging large-scale diverse data.

Core Problem

The core problem addressed by LocateAnything is the inefficiency and geometric inconsistency arising from token-by-token decoding of bounding box coordinates in vision-language models. Specifically, the standard approach serializes a 2D bounding box into multiple discrete tokens (e.g., x_min, y_min, x_max, y_max), which are generated sequentially and largely independently. This process disrupts the intrinsic geometric relationships among coordinates, leading to suboptimal localization precision. Furthermore, the strictly sequential decoding limits parallelism during inference, resulting in slow decoding speeds unsuitable for real-time applications. Additionally, the scarcity of large-scale, diverse training data hampers model robustness and generalization. Addressing these intertwined challenges—maintaining geometric coherence, enabling parallel decoding, and leveraging extensive data—is crucial for advancing visual grounding performance.
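To make the serialization step concrete, here is a minimal sketch of coordinate tokenization in the style commonly used by generative detectors. The bin count of 1000, the function names, and the bin-center decoding are assumptions for illustration; the source does not say which quantization scheme the baselines use.

```python
def box_to_tokens(box, image_size, num_bins=1000):
    """Quantize a pixel box (x_min, y_min, x_max, y_max) into 4 integer tokens."""
    width, height = image_size
    scales = (width, height, width, height)
    tokens = []
    for value, scale in zip(box, scales):
        frac = min(max(value / scale, 0.0), 1.0)  # normalize and clamp to [0, 1]
        tokens.append(min(int(frac * num_bins), num_bins - 1))
    return tokens


def tokens_to_box(tokens, image_size, num_bins=1000):
    """Map 4 tokens back to pixel coordinates at the center of each bin."""
    width, height = image_size
    scales = (width, height, width, height)
    return [(t + 0.5) / num_bins * s for t, s in zip(tokens, scales)]


def is_valid_box(tokens):
    """A box is geometrically valid only if min < max on both axes."""
    return tokens[0] < tokens[2] and tokens[1] < tokens[3]


if __name__ == "__main__":
    tokens = box_to_tokens((128, 64, 384, 256), (512, 512))
    print(tokens)                # [250, 125, 750, 500]
    print(is_valid_box(tokens))  # True
    # One wrong token (x_max) is enough to make the whole box degenerate.
    corrupted = [tokens[0], tokens[1], 100, tokens[3]]
    print(is_valid_box(corrupted))  # False
```

The last lines show why independent per-token prediction is fragile: validity is a property of the four values jointly, not of any single token.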

Innovation

LocateAnything introduces several key innovations:

- Parallel Box Decoding (PBD): Unlike traditional sequential token generation, PBD treats the entire bounding box as an atomic unit, decoding all coordinates simultaneously in a single step. This preserves geometric coherence within boxes and unlocks substantial parallelism, dramatically improving inference speed.

- Large-scale Data Engine and LocateAnything-Data: The authors develop a scalable data curation pipeline, assembling a dataset with over 138 million training samples spanning diverse scenes and object categories. This scale enhances model generalization and robustness.

- Unified Generative Framework: LocateAnything models both visual grounding and detection as a unified generation task, simplifying architecture and training.

- Systematic Experimental Validation: Extensive benchmarks and ablation studies demonstrate the complementary benefits of PBD and large-scale data, pushing the speed-accuracy frontier beyond prior art.
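One way to picture "a box as an atomic unit" is a head that maps a single decoder hidden state to all four coordinates in one step. The sketch below is purely illustrative: the source does not describe the actual head design, and the linear map, the center/size parameterization, and the names `make_box_head` and `decode_box` are all assumptions.

```python
import math
import random


def make_box_head(hidden_dim, seed=0):
    """Hypothetical head: 4 rows of weights, one per output (cx, cy, w, h)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(4)]
    biases = [0.0] * 4
    return weights, biases


def decode_box(hidden, head):
    """Single step: hidden state -> normalized (x_min, y_min, x_max, y_max)."""
    weights, biases = head
    raw = [sum(w * h for w, h in zip(row, hidden)) + b
           for row, b in zip(weights, biases)]
    # Sigmoid keeps center and size in (0, 1).
    cx, cy, w, h = (1.0 / (1.0 + math.exp(-r)) for r in raw)
    # Center/size form guarantees min < max by construction, then clamp to the image.
    x_min, y_min = max(cx - w / 2, 0.0), max(cy - h / 2, 0.0)
    x_max, y_max = min(cx + w / 2, 1.0), min(cy + h / 2, 1.0)
    return x_min, y_min, x_max, y_max


if __name__ == "__main__":
    head = make_box_head(hidden_dim=8)
    print(decode_box([0.5] * 8, head))
```

Because all four values come from the same hidden state in the same step, the box cannot end up with inverted edges, which is one reading of the "intra-box geometric coherence" the summary describes.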

Methodology

Detailed methodology of LocateAnything:

- Input Encoding: The model receives an image and a natural language query. A pre-trained visual encoder extracts dense image features, while a text encoder processes the language input.

- Parallel Box Decoding (PBD): The core innovation, in which a bounding box, defined by four coordinates (x_min, y_min, x_max, y_max), is decoded as a single atomic unit rather than as separate tokens. This is implemented within a Transformer-based decoder that generates multiple boxes in parallel, maintaining intra-box geometric consistency.

- Generative Framework: Generation stays autoregressive across decoding steps, but each step emits complete boxes (all four coordinates at once, for multiple boxes in parallel) rather than single coordinate tokens, enabling high throughput.

- Training Data: Leveraging the LocateAnything-Data dataset with 138 million samples, the model learns from diverse, high-precision annotations covering varied object categories and complex scenes.

- Loss Functions: Training optimizes a combination of bounding box regression losses (e.g., L1 loss, generalized IoU loss) and a language alignment loss to ensure semantic and spatial accuracy.

- Inference Optimization: The parallel decoding mechanism reduces sequential dependencies, allowing efficient batch processing and faster inference suitable for real-time applications.
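The box regression terms named under Loss Functions can be sketched as follows. This shows the standard L1 and generalized IoU (GIoU) definitions for a single box pair; the equal weights are an assumption, and the language alignment loss is omitted because the source gives no detail about it.

```python
def box_area(box):
    """Area of (x_min, y_min, x_max, y_max); empty boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(x2 - x1, 0.0) * max(y2 - y1, 0.0)


def giou(pred, target):
    """Generalized IoU: IoU minus the fraction of the enclosing box not covered."""
    inter = box_area((max(pred[0], target[0]), max(pred[1], target[1]),
                      min(pred[2], target[2]), min(pred[3], target[3])))
    union = box_area(pred) + box_area(target) - inter
    hull = box_area((min(pred[0], target[0]), min(pred[1], target[1]),
                     max(pred[2], target[2]), max(pred[3], target[3])))
    iou = inter / union if union > 0 else 0.0
    return iou - (hull - union) / hull if hull > 0 else iou


def l1_loss(pred, target):
    """Mean absolute coordinate error over the four coordinates."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / 4.0


def box_loss(pred, target, l1_weight=1.0, giou_weight=1.0):
    """Weighted sum of L1 and (1 - GIoU); the weights here are assumed values."""
    return l1_weight * l1_loss(pred, target) + giou_weight * (1.0 - giou(pred, target))


if __name__ == "__main__":
    print(box_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # 0.0
    print(box_loss((0, 0, 2, 2), (1, 1, 3, 3)))
```

Unlike plain IoU, GIoU still varies when the boxes do not overlap, which is why it is commonly paired with L1 for box regression.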

Experiments

The experimental setup includes:

- Datasets: Evaluation on COCO and LVIS benchmarks for visual grounding and detection, with training conducted on the large-scale LocateAnything-Data.

- Baselines: Comparison against traditional token-by-token generation models and state-of-the-art end-to-end detectors such as DETR.

- Metrics: Average Precision (AP), localization accuracy at high IoU thresholds (>0.7), and decoding speed measured in frames per second (FPS).

- Ablation Studies: Analysis of the impact of PBD versus sequential decoding, and the contribution of large-scale data to model performance.

- Hyperparameters: Careful tuning of decoding step size, number of parallel boxes decoded, and model capacity to balance speed and accuracy.
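As an illustration of the high-IoU metric listed above, the sketch below computes IoU for box pairs and the fraction of predictions that exceed a threshold. It assumes predictions are already matched one-to-one with ground-truth boxes, which is a simplification; full AP computation (ranking by confidence, per-class matching) is not shown, and `accuracy_at_iou` is a hypothetical helper name.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    inter_h = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds, targets, threshold=0.7):
    """Share of matched pairs whose IoU is strictly above the threshold."""
    if not targets:
        return 0.0
    hits = sum(1 for p, t in zip(preds, targets) if iou(p, t) > threshold)
    return hits / len(targets)


if __name__ == "__main__":
    preds = [(0, 0, 10, 10), (0, 0, 10, 10), (1, 0, 10, 10)]
    targets = [(0, 0, 10, 10), (5, 0, 15, 10), (0, 0, 10, 10)]
    print([round(iou(p, t), 3) for p, t in zip(preds, targets)])  # [1.0, 0.333, 0.9]
    print(accuracy_at_iou(preds, targets))  # 2 of 3 pairs exceed 0.7
```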

Results

Key experimental findings include:

- LocateAnything achieves over 30% faster decoding speeds on COCO and LVIS datasets compared to token-by-token baselines, with high-IoU localization accuracy improved by more than 5%.

- Training on the 138M-sample LocateAnything-Data yields a 4.3% AP increase on LVIS, notably enhancing performance on small objects and cluttered backgrounds.

- Ablation results show that PBD reduces inference time by approximately 40% while maintaining or improving bounding box geometric consistency.

- The model demonstrates robust generalization across diverse visual grounding scenarios, outperforming prior methods in both speed and precision.
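The findings above quote both a speed gain and an inference-time reduction, which are different quantities. The short sketch below converts between the two under the assumption that both refer to the same workload; the source does not confirm that the figures come from the same setup, so treat this only as a reading aid.

```python
def throughput_gain_from_time_reduction(time_reduction):
    """Fractional throughput gain when time per item drops by time_reduction."""
    if not 0.0 <= time_reduction < 1.0:
        raise ValueError("time_reduction must be in [0, 1)")
    return 1.0 / (1.0 - time_reduction) - 1.0


def time_reduction_from_throughput_gain(gain):
    """Fractional time saved per item when throughput rises by gain."""
    if gain < 0.0:
        raise ValueError("gain must be non-negative")
    return 1.0 - 1.0 / (1.0 + gain)


if __name__ == "__main__":
    # 40% less time per item corresponds to about 67% more items per second.
    print(round(throughput_gain_from_time_reduction(0.40), 4))  # 0.6667
    # 30% more items per second corresponds to about 23% less time per item.
    print(round(time_reduction_from_throughput_gain(0.30), 4))  # 0.2308
```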

Applications

LocateAnything's efficient and accurate visual grounding capabilities enable multiple practical applications:

- Autonomous Driving: Real-time detection and localization of pedestrians, vehicles, and obstacles to enhance safety and navigation.

- Intelligent Surveillance: Rapid identification and tracking of targets in video streams for security and anomaly detection.

- Augmented Reality (AR): Precise object localization to enable natural language-driven interactions and immersive experiences.

- Robotics: Multimodal perception combining vision and language for environment understanding and task execution.

- Multimodal Search: Language-based image retrieval with accurate object localization, improving search relevance and speed.

Limitations & Outlook

LocateAnything has several limitations:

- Performance drops under extreme occlusion and for very small objects, due to limited representation in training data, affecting robustness.

- The parallel decoding approach requires substantial GPU memory and computational resources, challenging deployment on edge or resource-constrained devices.

- The current framework is limited to 2D bounding boxes and does not yet support 3D spatial localization or complex geometric shapes, restricting applicability in certain domains.

Abstract

Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding mismatches the coupled structure of box geometry and creates a practical inference bottleneck due to strictly sequential generation. We introduce LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD). By decoding geometric elements such as bounding boxes and points as atomic units in a single step, LocateAnything preserves intra-box geometric coherence and unlocks substantial parallelism. We show that PBD improves both decoding throughput and localization accuracy. We further develop a scalable data engine and curate LocateAnything-Data, a large-scale dataset with more than 138 million training samples, substantially increasing data diversity for high-precision localization. Extensive evaluations show that LocateAnything advances the speed-accuracy frontier, achieving significantly higher decoding throughput while improving high-IoU localization quality across diverse benchmarks. The results highlight the complementary benefits of Parallel Box Decoding and large-scale training data in enabling efficient and precise unified visual grounding and detection.

cs.CV cs.AI cs.LG cs.RO
