SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models
Introduces SOCO benchmark with over 1 million keypoint pairs across 100 categories, revealing that vision foundation models encode strong semantics but poorly transfer correspondences across categories.
Key Findings
Methodology
This work constructs a hierarchical semantic concept framework and standardized keypoint annotations across 100 categories, totaling over 1 million correspondence pairs. The taxonomy distinguishes three types of relationships: concept matching (CC), within-object correspondence (SOC), and cross-category correspondence (Cross-SOC). Using zero-shot feature similarity matching—primarily cosine similarity—models such as CLIP, DINO, and iBOT are evaluated on their ability to align object parts across instances and categories. The evaluation employs metrics like PCK (Percentage of Correct Keypoints) and accuracy in both intra- and inter-category settings. The dataset supports multi-level analysis, from local semantic concepts to geometric configurations, enabling a comprehensive assessment of models’ spatial and semantic understanding.
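The zero-shot matching step described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function name `match_keypoints`, the `(H, W, C)` feature-map layout, and the random arrays standing in for real backbone features are all assumptions made for the example.

```python
import numpy as np

def match_keypoints(src_feats, tgt_feats, src_kps):
    """Nearest-neighbour keypoint matching by cosine similarity.

    src_feats, tgt_feats: (H, W, C) dense feature maps (hypothetical backbone output).
    src_kps: (N, 2) integer array of (row, col) keypoints on the source feature grid.
    Returns an (N, 2) array of predicted (row, col) locations on the target grid.
    """
    h, w, c = tgt_feats.shape
    # L2-normalise features so that a dot product equals cosine similarity.
    tgt = tgt_feats.reshape(-1, c)
    tgt = tgt / (np.linalg.norm(tgt, axis=1, keepdims=True) + 1e-8)
    queries = src_feats[src_kps[:, 0], src_kps[:, 1]]            # (N, C)
    queries = queries / (np.linalg.norm(queries, axis=1, keepdims=True) + 1e-8)
    sim = queries @ tgt.T                                        # (N, H*W)
    best = sim.argmax(axis=1)                                    # most similar target cell
    return np.stack(np.unravel_index(best, (h, w)), axis=1)

# Toy check: the target is the source map shifted by (1, 2) grid cells.
rng = np.random.default_rng(0)
src = rng.normal(size=(8, 8, 16))
tgt = np.roll(src, shift=(1, 2), axis=(0, 1))
kps = np.array([[2, 3], [5, 1]])
pred = match_keypoints(src, tgt, kps)
```

In the toy check each source keypoint should be matched to its shifted location in the target map. With real models, the feature maps would come from a frozen backbone, and no fine-tuning is involved.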
Key Results
- All evaluated models perform well on concept matching (CC), with DINOv2 reaching 78.9%, but performance drops to around 55.5% on within-object correspondence (SOC) and further to about 23.9% on cross-category matching (Cross-SOC), indicating a gap in geometric and semantic transfer capabilities.
- Large vision-language models (LVLMs), such as Qwen-8B, outperform pure vision models in text-guided part localization (accuracy up to 30.8%), yet still lag in cross-image matching tasks, exposing a gap between language-grounded localization and fine-grained visual correspondence.
- SOC scores correlate strongly with downstream dense tasks like segmentation, tracking, and 3D pose estimation, surpassing traditional ImageNet classification as a diagnostic metric, demonstrating the benchmark’s practical relevance.
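One way to quantify the kind of relationship described in the last bullet is a rank correlation between per-model benchmark scores and per-model downstream scores. The sketch below is only illustrative: the source does not state which correlation statistic is used, so Spearman correlation is an assumption, and the numbers are placeholders, not results from the paper.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation for sequences without tied values."""
    def ranks(v):
        order = np.argsort(v)
        r = np.empty(len(v), dtype=float)
        r[order] = np.arange(len(v))   # rank 0 = smallest value
        return r
    rx = ranks(np.asarray(x, dtype=float))
    ry = ranks(np.asarray(y, dtype=float))
    # Pearson correlation of the ranks is the Spearman coefficient.
    return float(np.corrcoef(rx, ry)[0, 1])

# Placeholder values for four hypothetical models (NOT taken from the paper).
benchmark_scores = [0.31, 0.42, 0.55, 0.48]
downstream_scores = [0.60, 0.66, 0.79, 0.71]
rho = spearman(benchmark_scores, downstream_scores)
```

A coefficient near 1 means that models ranked higher on the benchmark also rank higher on the downstream task. The same calculation could be repeated with ImageNet accuracy in place of the benchmark score to compare the two as predictors.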
Significance
This research pioneers a hierarchical, semantic-aware benchmark that addresses the limitations of existing datasets by emphasizing cross-category, multi-level correspondence evaluation. It reveals that current models excel at recognizing semantic concepts but struggle with geometric and spatial reasoning, especially across different object classes. The findings highlight the necessity for models to incorporate structured spatial and semantic hierarchies, which are crucial for real-world applications like robotics, augmented reality, and autonomous systems. By providing a standardized, scalable evaluation framework, SOCO facilitates systematic progress toward models with genuine spatial understanding, bridging the gap between high-level recognition and detailed spatial reasoning.
Technical Contribution
The core technical contribution lies in designing a hierarchical taxonomy for semantic correspondence, defining three relation types (concept, within-object, cross-category), and establishing a large-scale, multi-category dataset with consistent keypoint annotations and textual descriptions. The evaluation methodology combines feature similarity matching with zero-shot inference, enabling a comprehensive analysis of model capabilities across semantic and geometric dimensions. The study also systematically compares multiple state-of-the-art models, revealing their strengths and weaknesses in spatial reasoning, and introduces new metrics to quantify geometric awareness and semantic abstraction.
Novelty
This work is the first to formalize a hierarchical taxonomy for semantic object correspondence, integrating semantic concepts, object-relative positions, and cross-category relations into a unified framework. It extends beyond existing datasets like SPair-71k and MISC210K by emphasizing semantic consistency, multi-level evaluation, and language descriptions. The combination of a large, diverse dataset with a structured taxonomy and zero-shot evaluation strategy represents a significant advancement in benchmarking spatial understanding in vision models.
Limitations
- The dataset and evaluation focus primarily on static images, lacking temporal or dynamic scene analysis, which limits insights into motion-based correspondence and video understanding.
- Current matching strategies rely heavily on feature similarity, which can be affected by noise, occlusion, or domain shifts, necessitating more robust algorithms.
- Semantic hierarchy definitions, while structured, still have ambiguities, especially in complex or ambiguous cases, requiring further refinement with external knowledge bases.
Future Work
Future directions include expanding the dataset to dynamic scenes and videos, integrating external knowledge graphs to refine semantic hierarchies, and developing more robust matching algorithms that incorporate contextual and temporal information. Additionally, exploring the integration of active learning and human-in-the-loop annotation could improve label quality and diversity. Extending the benchmark to include more complex tasks like scene graph generation and reasoning will further push the boundaries of spatial and semantic understanding in AI models.
AI Executive Summary
In recent years, deep learning models have revolutionized visual understanding, achieving remarkable success in tasks like image classification, object detection, and semantic segmentation. Models such as ResNet, Vision Transformer (ViT), DINO, and CLIP have demonstrated impressive generalization capabilities across large-scale datasets. However, these models primarily excel at recognizing global categories and often lack a nuanced understanding of object structures and spatial relationships. This gap becomes critical in applications requiring detailed spatial reasoning, such as robotic manipulation, augmented reality, and autonomous navigation.
Traditional benchmarks like ImageNet focus on category recognition accuracy, which does not adequately reflect a model’s ability to understand the internal parts and spatial configurations of objects. Recognizing this limitation, recent research has introduced semantic correspondence (SC) tasks, which evaluate how well models can match object parts across different instances and categories. Existing datasets such as PF-PASCAL, SPair-71k, and MISC210K have contributed valuable insights but suffer from limited category diversity, ambiguous annotations, and lack of hierarchical semantic structures.
Addressing these gaps, the present study introduces SOCO, a comprehensive benchmark designed to evaluate structured, part-level understanding in vision models. The core idea is to organize object parts into a hierarchical taxonomy of semantic concepts, enabling precise, cross-category matching. The dataset encompasses 100 diverse categories, including animals and man-made objects, with over 1 million keypoint pairs annotated consistently across instances. These annotations include not only spatial positions but also textual descriptions, facilitating multimodal evaluation.
The evaluation framework employs feature similarity matching, primarily cosine similarity, to perform zero-shot correspondence detection. Using models like DINO, CLIP, and iBOT, the study assesses their ability to recognize semantic parts within and across categories. Results reveal that while models perform well on concept-level matching, their geometric and spatial understanding—especially in cross-category scenarios—remains limited. For example, DINOv2 achieves 78.9% on concept matching but drops to 55.5% on within-object correspondence and further to 23.9% on cross-category matching.
Importantly, the study finds that the SOC metric correlates strongly with downstream dense tasks such as segmentation, tracking, and 3D pose estimation, outperforming traditional classification metrics. This underscores the importance of structured spatial understanding for practical AI applications. Furthermore, large vision-language models like Qwen-8B demonstrate improved text-guided part localization, yet still face challenges in cross-image matching, exposing a gap between language grounding and visual spatial reasoning.
This work significantly advances the field by providing a structured, scalable, and multimodal evaluation platform. It highlights the need for models to incorporate hierarchical semantic and geometric reasoning, moving beyond mere category recognition. The findings motivate future research to develop more robust, context-aware, and dynamic models capable of understanding complex object structures in real-world scenarios. The authors plan to extend SOCO to video and dynamic environments, integrate external knowledge bases, and explore active learning strategies to refine annotations, aiming to push the boundaries of spatial and semantic AI understanding in the coming years.
Deep Analysis
Background
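In recent years, deep learning models have made major advances in visual understanding, particularly in image classification, object detection, and semantic segmentation. Representative models such as ResNet, ViT, DINO, and CLIP have driven generalization through large-scale, data-driven training. These models, however, focus mainly on global category recognition, and their grasp of internal object structure, local relations, and spatial configuration remains limited. Earlier semantic correspondence (SC) research, built on datasets such as PF-PASCAL, SPair-71k, and MISC210K, advanced local matching techniques but fell short in category diversity, semantic consistency, and cross-category matching. In complex scenes especially, models' understanding of local object structure is still far from human level, which limits their performance in applications such as robot navigation and augmented reality. With the rise of multimodal models, research has increasingly turned to their potential for spatial understanding, but the lack of systematic, hierarchical evaluation tools makes it hard to measure a model's structural spatial understanding comprehensively.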
Core Problem
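Current deep models fall short in fine-grained spatial understanding, especially in semantic matching across categories, across instances, and across levels of a semantic hierarchy. Traditional metrics such as classification accuracy do not reflect how deeply a model understands spatial structure and semantic relations. Existing datasets lack a hierarchical, semantically consistent annotation scheme, which makes it difficult to evaluate correspondence systematically across semantic levels and categories. As a result, spatial reasoning and object-relation understanding in complex environments remain limited, which affects applications in robotics, autonomous driving, and augmented reality. Addressing this requires an evaluation framework that is hierarchical, cross-category, and able to cover multiple scenes and tasks.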
Innovation
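The main innovation of this study is a hierarchical Semantic Object Correspondence (SOC) framework that explicitly distinguishes three relation types (concept matching, within-object correspondence, and cross-category correspondence) and establishes a standardized annotation scheme. By building a large, multi-category, multi-level annotated dataset and pairing it with feature-space matching on foundation models (such as CLIP, DINO, and iBOT), the study evaluates models' spatial understanding comprehensively. Consistent keypoint annotations and semantic descriptions support fine-grained matching across categories and instances, giving a new tool for model diagnosis. The framework moves beyond earlier work that focused only on geometry or category recognition and points to a new direction for spatial reasoning in complex scenes.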
Methodology
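- Dataset construction: images are collected for 100 categories and keypoints are annotated using Amazon Mechanical Turk, with attention to semantic consistency and hierarchical structure.
- Semantic concept definition: a hierarchical semantic scheme defines concept, within-object, and cross-category correspondence relations, keeping annotations uniform and extensible.
- Feature extraction: pretrained foundation models (such as CLIP, DINO, and iBOT) extract image features, providing a semantically rich feature space.
- Point matching: points are matched one-to-one in feature space by cosine similarity, and a distance threshold (as in PCK) decides whether a match counts as correct.
- Zero-shot inference: models are not fine-tuned; matching across instances and categories relies on feature similarity alone, which tests the models' spatial and semantic understanding.
- Evaluation metrics: PCK (Percentage of Correct Keypoints) and matching accuracy measure performance on tasks of varying complexity.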
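The PCK metric named in the list above can be sketched in a few lines. This is an illustrative implementation under stated assumptions: the source does not give SOCO's exact threshold, so the normalisation by a per-keypoint reference size (for example the longer side of the object bounding box) and the default `alpha=0.1` follow a common convention in the semantic-correspondence literature rather than anything confirmed for this benchmark. The function name `pck` is hypothetical.

```python
import numpy as np

def pck(pred, gt, ref_sizes, alpha=0.1):
    """Percentage of Correct Keypoints.

    pred, gt:   (N, 2) predicted and ground-truth keypoint coordinates.
    ref_sizes:  (N,) reference size per keypoint (assumed: longer bbox side).
    alpha:      tolerance as a fraction of the reference size (assumed default).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=1)              # Euclidean error per keypoint
    correct = dist <= alpha * np.asarray(ref_sizes, dtype=float)
    return float(correct.mean())

# Two keypoints on a 100-pixel object: errors of 2 px and 30 px, tolerance 10 px.
score = pck(pred=[[10, 10], [50, 50]],
            gt=[[12, 10], [80, 50]],
            ref_sizes=[100, 100],
            alpha=0.1)
```

Here the first prediction falls within tolerance and the second does not, so half of the keypoints count as correct.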
Experiments
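- Dataset: 100 categories with over 1 million annotated keypoints, supporting both cross-category and within-object matching.
- Models: a range of advanced models is evaluated, including the DINO family, CLIP, iBOT, and MAE, and compared on the CC, SOC, and Cross-SOC tasks.
- Evaluation strategy: zero-shot matching by feature similarity, scored with PCK, with analysis of how performance varies across categories and levels of scene complexity.
- Subset analysis: subsets are split by semantic level and geometric complexity to expose where models are strong or weak in spatial structure and semantic understanding.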
Results
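- All models perform well on concept matching (CC), with DINOv2 reaching 78.9%, but drop markedly on within-object correspondence (SOC) to 55.5% and do worse still on cross-category correspondence (Cross-SOC) at 23.9%, showing that geometric and semantic understanding still lags.
- Large vision-language models outperform pure vision models on text-guided fine-grained localization (Qwen-8B reaches 30.8% accuracy with description guidance), but a clear gap remains in cross-image matching.
- The SOC metric correlates strongly with downstream tasks such as segmentation, tracking, and 3D pose estimation, more so than traditional ImageNet classification, which supports its use as an indicator of model understanding.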
Applications
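- Robot navigation: improving spatial perception and target tracking for autonomous systems in complex environments.
- Augmented reality: precise spatial alignment of virtual objects with real scenes for a better user experience.
- Autonomous driving: better understanding of object structure and relations in complex scenes, improving safety and decision-making.
The benchmark could also be applied in the future to multimodal interaction, virtual reality, and related areas.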
Limitations & Outlook
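- The annotation scheme is currently based on static images and does not account for object motion and change in dynamic scenes, which limits its use in video settings.
- The matching strategy relies mainly on feature-space similarity, which can be affected by noise, occlusion, and domain shift; more robust matching mechanisms are needed.
- The definition of cross-category semantic levels still leaves room for ambiguity and should be refined in future work using knowledge graphs and external semantic resources.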
Plain Language Accessible to non-experts
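Imagine you work in a large factory full of different machines and tools. Every tool has its own name, purpose, and storage place. Sometimes the factory brings in new machines or moves tools around, so you keep having to learn and remember what each tool looks like and where it is. The factory's management system is like a model's understanding: it needs to know each tool's name (the semantic concept), where it sits in the factory (the geometric relation), and how different tools relate to one another (cross-category matching). If the system can identify and match these tools accurately, just as a model can understand the details and relations of objects, the factory runs more smoothly. This research is like teaching the factory's management system to be smarter, so that it understands and handles all kinds of tools better and the whole factory becomes more capable and efficient.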
ELI14 Explained like you're 14
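Imagine you are in your school library, surrounded by lots of different books, each with its own title and place on the shelves. Sometimes you need to help a friend find a particular book, such as one about animals or one about cars. You find it by its title, its cover, or where it sits on the shelf. Now suppose you and your friend talk it through with a description, like "the red-covered book about cars". That process is like a model learning to find the matching object from a description. The models in this research are like a super-smart librarian who not only knows every book's title but also knows exactly where each book sits, and can even find similar books on different shelves. This research is about teaching computers to be as smart as that super librarian, understanding both what a book contains and where it is, so they can help us find what we are looking for.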
Glossary
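Semantic Correspondence
Finding points or parts with similar semantic meaning across different objects or instances, reflecting the objects' structural and functional relations.
Used to evaluate a model's fine-grained spatial understanding.
Keypoint
A point marked in an image that has a clear semantic meaning and describes an object's local structure.
The basic unit for annotation and matching.
PCK (Percentage of Correct Keypoints)
A metric for keypoint matching accuracy: the fraction of predicted points that fall within a given error tolerance.
Used to evaluate point-to-point matching performance.
Layered Semantic Concept
An organization of an object's semantic information into multiple levels, describing the object from coarse to fine.
The basis for the annotation scheme and model training.
Zero-shot Matching
Matching objects or parts through feature similarity, without task-specific training.
An important way to assess a model's generalization.
Multimodal Foundation Models
Deep models that fuse vision, text, and other modalities and can understand and reason across them.
CLIP is an example; the DINO and iBOT backbones evaluated alongside it are vision-only self-supervised models.
Cross-category
Matching or relations between objects of different categories, testing a model's capacity for semantic abstraction.
Used in this study to evaluate spatial understanding across categories.
Hierarchical Taxonomy
An organization of objects and concepts into a hierarchy that reflects their semantic and functional relations.
Used to define and annotate semantic correspondence relations.
Dense Self-supervised Learning
Learning local features from large amounts of unlabeled data to strengthen a model's spatial understanding.
The training strategy behind models such as DINO and MAE.
Feature Similarity
A measure of how similar two feature vectors are, commonly cosine similarity.
Used for point-to-point matching.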
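For reference, the feature-similarity and PCK entries above correspond to the following standard definitions. The notation is introduced here for illustration and is not taken from the source: f_a and f_b are feature vectors, \hat{p}_i and p_i are the predicted and annotated locations of keypoint i, d is a reference size such as the longer side of the object bounding box, and \alpha is the tolerance. The choice of d and \alpha for SOCO is not stated in the source.

```latex
\[
\operatorname{sim}(f_a, f_b) = \frac{f_a^{\top} f_b}{\lVert f_a \rVert_2 \, \lVert f_b \rVert_2},
\qquad
\mathrm{PCK}@\alpha = \frac{1}{N} \sum_{i=1}^{N}
  \mathbb{1}\!\left[ \lVert \hat{p}_i - p_i \rVert_2 \le \alpha \, d \right]
\]
```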
Open Questions Unanswered questions from this research
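- 1. Models' spatial understanding in dynamic scenes is still limited. How to combine temporal and motion information to improve spatial correspondence in video is an important direction for future research.
- 2. Multimodal models are not robust enough in complex environments, especially under occlusion, multi-object interaction, and background clutter. How to strengthen their spatial and semantic understanding remains unresolved.
- 3. The definition of cross-category semantic levels is ambiguous. Future work should draw on knowledge graphs and richer semantic hierarchies to improve semantic abstraction and generalization.
- 4. Matching strategies rely mainly on feature-space similarity and are sensitive to noise and bias. More robust matching mechanisms and learning strategies are needed.
- 5. Although the dataset is large, it still lacks joint annotations that are dynamic, multimodal, and multi-task. Future work should combine multiple information sources and enrich the annotations to deepen spatial understanding.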
Applications
Immediate Applications
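Robot navigation and manipulation
Use SOCO to evaluate models' spatial understanding in complex environments, improving autonomous robots' navigation, target recognition, and interaction in unfamiliar scenes.
Augmented and virtual reality
Use fine-grained spatial matching to blend virtual objects precisely into real scenes, improving immersion and interaction.
Autonomous driving systems
Improve vehicles' understanding of object structure and relations in complex scenes, making perception and decision-making more accurate.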
Long-term Vision
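Intelligent scene understanding and reasoning
Build on hierarchical semantic and spatial understanding so that intelligent systems can reason, plan, and make decisions in complex environments, advancing automation.
Cross-modal knowledge fusion
Combine knowledge graphs with multimodal data to build a richer framework for spatial and semantic understanding, improving the generalization and autonomy of multimodal AI.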
Abstract
Measuring structured object understanding in vision foundation models remains challenging due to inconsistent evaluation protocols and limited part-level supervision. Semantic correspondence (SC) evaluates this capability by testing whether object parts can be matched across instances and categories under large variations in appearance, viewpoint, and geometry. To enable a systematic SC evaluation, we introduce SOCO, a new benchmark for Semantic Object Correspondence that introduces a taxonomy of correspondence types and provides consistent, functionally meaningful keypoint annotations across 100 categories and over 1M correspondence pairs. In addition, SOCO includes keypoint language descriptions, enabling the evaluation of large vision-language models (LVLMs) and their fine-grained part-level understanding. Comprehensive experiments reveal that (i) vision foundation backbones encode strong semantic structure but transfer correspondences poorly across related categories and only partially capture object-part position, (ii) LVLMs are stronger at text-prompted part localization than at visual-reference cross-image matching, exposing a gap between language-grounded localization and fine-grained visual correspondence, and (iii) correspondence performance predicts performance on dense downstream tasks, including segmentation, tracking, 3D pose estimation, and 3D detection, more strongly than ImageNet classification. Together, these findings position SOCO as a benchmark for structured, part-level representation quality in vision and multimodal foundation models.
References (20)
ImageNet3D: Towards General-Purpose Object-Level 3D Understanding
Wufei Ma, Guanning Zeng, Guofeng Zhang et al.
DINOv2: Learning Robust Visual Features without Supervision
M. Oquab, Timothée Darcet, Théo Moutakanni et al.
SPair-71k: A Large-scale Benchmark for Semantic Correspondence
Juhong Min, Jongmin Lee, J. Ponce et al.
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy et al.
Scene Parsing through ADE20K Dataset
Bolei Zhou, Hang Zhao, Xavier Puig et al.
ImageNet: A large-scale hierarchical image database
Jia Deng, Wei Dong, R. Socher et al.
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya, Po-Yao Huang, Peize Sun et al.
BLINK: Multimodal Large Language Models Can See but Not Perceive
Xingyu Fu, Yushi Hu, Bangzheng Li et al.
Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence
Junyi Zhang, Charles Herrmann, Junhwa Hur et al.
Can Visual Foundation Models Achieve Long-term Point Tracking?
Görkay Aydemir, Weidi Xie, Fatma Güney
Common3D: Self-Supervised Learning of 3D Morphable Models for Common Objects in Neural Feature Space
Leonhard Sommer, Olaf Dünkel, C. Theobalt et al.
SIFT Flow: Dense Correspondence across Different Scenes
Ce Liu, J. Yuen, A. Torralba et al.
Emerging Properties in Self-Supervised Vision Transformers
Mathilde Caron, Hugo Touvron, Ishan Misra et al.
MMBench: Is Your Multi-modal Model an All-around Player?
Yuanzhan Liu, Haodong Duan, Yuanhan Zhang et al.
NAVI: Category-Agnostic Image Collections with High-Quality 3D Shape and Pose Annotations
V. Jampani, Kevis-Kokitsi Maninis, Andreas Engelhardt et al.
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach, A. Blattmann, Dominik Lorenz et al.
Indoor Segmentation and Support Inference from RGBD Images
N. Silberman, Derek Hoiem, Pushmeet Kohli et al.
Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence
Grace Luo, Lisa Dunlap, Dong Huk Park et al.
Normalized Object Coordinate Space for Category-Level 6D Object Pose and Size Estimation
He Wang, Srinath Sridhar, Jingwei Huang et al.
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li, Dongxu Li, Caiming Xiong et al.