RATS! Patches Talk Through Registers: Emergent Parts in Register Attention Transformers

TL;DR

RATS (Register Attention Transformers) discovers part-level structure without supervision using N learnable register tokens, and surpasses baselines by +12 mIoU on average across five segmentation benchmarks.

cs.CV 2026-06-13

Timing Yang Predrag Neskovic Jansen Seheult Wenchao Han Anand Bhattad Alan Yuille Feng Wang

Computer Vision, Transformer, Self-supervised Learning, Structured Representation, Image Segmentation

Key Findings

Methodology

This paper presents RATS, an architecture built on Vision Transformers that decomposes the [CLS] token into N learnable register tokens. Each transformer block routes patch information through these registers in a three-step attention process: compress, communicate, broadcast. The registers are partitioned across attention heads, so each head owns an independent subset, which encourages specialization. The whole model is trained by self-distillation with the DINO objective, without auxiliary losses or part annotations. During training, the registers spontaneously specialize into proto-semantic regions resembling object parts, as their similarity maps show. These emergent part representations outperform baselines by +12 mIoU on average across five segmentation benchmarks, with consistent gains on ADE20K (+1.11 mIoU) and COCO (+0.2 AP^m). The register dictionary also exhibits part-level consistency and semantic proximity across related categories, which supports transfer and interpretability.
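The compress, communicate, broadcast routing described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `attend` and `register_attention`, the tensor sizes, and the omission of learned query/key/value projections, residual connections, and normalization are all assumptions made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention: q (..., A, d), k and v (..., B, d) -> (..., A, d)."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def register_attention(patches, registers):
    """patches: (H, L, d) per-head patch features; registers: (H, n, d) with n = N // H.

    The leading axis is the head axis, so each head only ever sees its own registers.
    """
    regs = attend(registers, patches, patches)  # compress: L patches -> n registers
    regs = attend(regs, regs, regs)             # communicate: registers within a head
    out = attend(patches, regs, regs)           # broadcast: n registers -> L patches
    return out, regs

rng = np.random.default_rng(0)
H, L, N, d = 4, 196, 16, 32                     # illustrative sizes (assumed)
patches = rng.standard_normal((H, L, d))
registers = rng.standard_normal((H, N // H, d))
out, regs = register_attention(patches, registers)
print(out.shape, regs.shape)  # (4, 196, 32) (4, 4, 32)
```

Batching over the head axis is one simple way to keep each head's registers isolated from the others; a real block would also carry learned projections and residual paths.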

Key Results

RATS surpasses all baselines by an average of +12 mIoU across five segmentation benchmarks, with the largest gains on PartImageNet (up to 16.89 mIoU). It outperforms previous methods such as DINOv3, superpixel, and slot-based models, especially in part coherence and semantic grouping.
On downstream tasks, the register tokens serve as effective queries for Mask2Former, improving on the DINO baseline on ADE20K (+1.11 mIoU) and COCO (+0.2 AP^m).
The learned register dictionary demonstrates part-level consistency and semantic proximity across categories, enabling zero-shot compositional generalization, as shown by the decomposition of unseen object combinations like a Pegasus (horse + bird wings).
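For readers unfamiliar with the mIoU figures quoted in these results, the metric can be sketched as follows. This is a generic per-image version under stated assumptions: the function name `mean_iou` and the toy label maps are illustrative, and benchmark protocols typically accumulate statistics over a whole dataset rather than a single image.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps, so it is skipped
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

# Toy 2 x 3 label maps (made-up data for illustration)
pred = np.array([[0, 0, 1], [1, 1, 2]])
target = np.array([[0, 1, 1], [1, 1, 2]])
print(mean_iou(pred, target, num_classes=3))  # 0.75
```

Here the per-class IoUs are 1/2, 3/4, and 1, so the mean is 0.75.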

Significance

This work addresses a fundamental challenge in unsupervised visual learning: how to discover meaningful, reusable parts without supervision. By letting the model self-organize into part-aware structures, RATS aims to improve the interpretability and robustness of vision systems. Its internal tokens (registers) act as emergent part representations, bridging the gap between global features and fine-grained semantics. This is relevant to explainable AI, transfer learning, and structured scene understanding.

Technical Contribution

The key technical innovation is a register-based attention bottleneck inside each transformer block, which routes patch information explicitly through N learnable register tokens partitioned across the attention heads. The bottleneck uses a three-step attention process: compress (patches are aggregated into registers), communicate (registers exchange information), and broadcast (registers write back to patches). Because each head owns its own subset of registers, heads attend to different regions, which leads to emergent part discovery. DINO self-distillation trains the registers into semantically meaningful representations without supervision. The method also introduces a similarity-map visualization for interpretability and a dataset-wide part dictionary that captures cross-category part relationships, supporting transfer and compositional generalization.
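The similarity-map visualization mentioned above can be sketched as follows. This assumes cosine similarity between register and patch features; the summary does not specify the similarity function, and the name `similarity_maps`, the grid size, and the random features are illustrative only.

```python
import numpy as np

def similarity_maps(registers, patches, grid):
    """Cosine similarity between each register and each patch, laid out on the patch grid.

    registers: (N, d), patches: (L, d), grid: (rows, cols) with rows * cols == L.
    Returns maps of shape (N, rows, cols) and a hard assignment of shape (rows, cols).
    """
    r = registers / np.linalg.norm(registers, axis=1, keepdims=True)
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    sim = r @ p.T                              # (N, L), values in [-1, 1]
    maps = sim.reshape(len(registers), *grid)
    assignment = maps.argmax(axis=0)           # most similar register for each patch
    return maps, assignment

rng = np.random.default_rng(0)
registers = rng.standard_normal((4, 8))        # N = 4 registers, d = 8 (assumed sizes)
patches = rng.standard_normal((16, 8))         # L = 16 patches on a 4 x 4 grid
maps, assignment = similarity_maps(registers, patches, grid=(4, 4))
print(maps.shape, assignment.shape)  # (4, 4, 4) (4, 4)
```

Taking the argmax over registers turns the soft maps into a part-style segmentation, which is one plausible way to compare against mask annotations.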

Novelty

This research is the first to integrate a register attention bottleneck into a Vision Transformer for unsupervised part discovery, emphasizing the emergent specialization of tokens into semantic regions. Unlike prior slot or superpixel methods, RATS does not rely on explicit supervision, object proposals, or text annotations. Its core novelty is the three-step attention routing within each head, which enables the model to self-organize into part-aware structures purely through self-distillation. This approach fundamentally shifts the paradigm from global feature aggregation to explicit, interpretable part-level representations.

Limitations

The number of registers (N) influences the granularity of parts; too many registers can lead to over-segmentation, while too few may miss finer details. Automating this selection remains an open challenge.
The method's effectiveness diminishes in highly cluttered or highly variable scenes where parts are less visually consistent, potentially reducing the clarity of emergent parts.
Computational complexity increases with the number of registers and heads, which may limit real-time applications or deployment on resource-constrained devices. Further optimization is needed for efficiency.

Future Work

Future directions include developing adaptive mechanisms for automatically tuning the number of registers based on scene complexity, extending the approach to video data for dynamic part tracking, and integrating multi-modal cues (e.g., language, audio) to enrich semantic part representations. Additionally, efforts to reduce computational costs and improve scalability will be crucial for real-world deployment. Exploring the integration of RATS with other self-supervised frameworks and applying it to 3D data or multimodal understanding are promising avenues for advancing structured perception.

AI Executive Summary

Understanding the internal structure of visual scenes remains a central challenge in computer vision. Traditional models, including convolutional neural networks and early transformers, excel at capturing global features but lack explicit mechanisms for discovering and representing the constituent parts of objects. This limitation hampers interpretability and generalization, especially in complex or unseen scenarios. Recent advances in self-supervised learning, such as DINO, have demonstrated that rich feature representations can be learned without labels, revealing cross-image semantic correspondences. However, these features remain implicit and lack explicit part-level organization.

The present work introduces RATS (Register Attention Transformers), a novel architecture designed to self-organize into part-aware representations without supervision. Building upon the transformer backbone, RATS incorporates a register attention mechanism that decomposes the [CLS] token into N learnable register tokens. These registers serve as intermediate region representations, routing patch information through a three-step process—compression, communication, and broadcasting—within each transformer block. This design encourages different attention heads to specialize in distinct regions, fostering the emergence of semantic parts.

Training is conducted solely via the DINO self-distillation objective, without any auxiliary losses or part annotations. The model learns to produce similarity maps between registers and patches, which reveal localized, semantically meaningful regions. Quantitative evaluations across five segmentation benchmarks show that RATS outperforms all baselines by an average of +12 mIoU, with particular strength in part coherence and cross-category generalization. The learned register dictionary captures part-level consistency and semantic proximity, supporting zero-shot transfer and compositional reasoning.
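As a rough illustration of the DINO self-distillation objective referred to above, the sketch below computes the cross-entropy between a centered, sharpened teacher distribution and the student distribution, plus the exponential-moving-average teacher update. The temperatures, momentum, and function names are illustrative assumptions, and this summary does not state which token outputs feed the loss in RATS.

```python
import numpy as np

def log_softmax(x, temp):
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)      # numerical stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def dino_loss(student_logits, teacher_logits, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between the teacher's centered, sharpened distribution and the student's."""
    # In training, the teacher distribution is treated as a constant (no gradient).
    p_teacher = np.exp(log_softmax(teacher_logits - center, t_teacher))
    log_p_student = log_softmax(student_logits, t_student)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

def ema_update(teacher, student, momentum=0.996):
    """Teacher weights follow an exponential moving average of the student weights."""
    return momentum * teacher + (1.0 - momentum) * student

rng = np.random.default_rng(0)
student = rng.standard_normal((2, 16))         # logits for 2 views, 16 output dimensions
teacher = rng.standard_normal((2, 16))
center = teacher.mean(axis=0)                  # running center, here from a single batch
loss = dino_loss(student, teacher, center)
print(loss)
```

The lower teacher temperature sharpens the target distribution, while centering discourages collapse onto a single output dimension.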

Beyond segmentation, the register tokens serve as effective queries for downstream tasks like semantic segmentation and object detection, further demonstrating their semantic richness. The approach opens new avenues for interpretable, structured visual representations, bridging the gap between global features and fine-grained parts. Its architectural simplicity, combined with strong empirical results, suggests that RATS provides a promising prior for future research in unsupervised, structured, and explainable vision models.

Looking ahead, potential developments include adaptive register allocation, extension to video and multimodal data, and efficiency improvements to enable real-time applications. Overall, RATS marks a significant step toward models that not only recognize objects but understand their internal composition, much like humans do.

Deep Dive

Glossary

Register

In this context, a learnable token that routes and represents a specific semantic region within an image; analogous to a region-specific feature vector in neural models.

Used to decompose patch information into meaningful parts.

Self-distillation

A training method in which a student network learns to match the outputs of a teacher built from the student itself; in DINO, the teacher is an exponential moving average of the student's weights. In this paper, the DINO objective supplies the self-distillation signal.

Guides the training of RATS without explicit labels.

Attention mechanism

A process that computes weighted interactions between tokens, allowing models to focus on relevant parts; multi-head self-attention is a core component of transformers.

Used to route patch information through registers.

Part-level semantics

Semantic representations corresponding to object parts, such as wings or heads, rather than whole objects; crucial for interpretability.

Emerges spontaneously in RATS without supervision.

Similarity map

A visual or numerical representation indicating the similarity between register tokens and image patches, used to identify semantic regions.

Helps visualize emergent parts.

Proto-semantic region

An initial, emerging region that resembles a semantic part, which can further develop into a meaningful object part.

Specializes spontaneously in each register.

Cross-category proximity

Semantic closeness of parts across different object categories, indicating shared structures like wheels or wings.

Observed in the register dictionary.

Mask2Former

A transformer-based architecture for image segmentation (semantic, instance, and panoptic) that predicts masks from a set of learnable queries.

Used with register tokens for downstream segmentation.

mIoU

Mean Intersection over Union: the overlap between predicted and ground-truth regions divided by their union, averaged over classes; a standard metric for segmentation quality.

Main evaluation metric.

AP^m

Average Precision on COCO; the superscript m conventionally denotes mask AP, i.e., precision averaged over IoU thresholds for predicted instance masks.

Used in COCO detection evaluation.

PartImageNet

A dataset designed for evaluating part segmentation and discovery, containing images with annotated parts.

Used for quantitative assessment.

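Open Questions

Unanswered questions from this research

1. Although RATS discovers part structure without supervision, its performance on extremely complex or highly variable categories remains limited; improving generalization across diverse scenes is a priority for future work.
2. The model has so far been validated mainly on static images; structured representations and temporal consistency in video remain largely unexplored, and the approach needs to be extended to video understanding.
3. The number of registers strongly affects part granularity, yet automatically adjusting it to suit different scenes has not been achieved; an adaptive mechanism is needed.
4. The computational cost is high, especially with many registers and heads; improving efficiency enough for real-time use is an important direction.
5. The registers' semantics are interpretable to a degree, but their semantic consistency in complex scenes and their transfer across categories need further validation and improvement.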

Applications

Immediate Applications

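Scene understanding for autonomous driving

RATS's part segmentation could improve a vehicle's recognition and understanding of objects in complex environments, strengthening safety and robustness.

Robot perception and interaction

Robots could use registers to identify key parts in their environment, enabling finer-grained manipulation and interaction and greater autonomy.

Medical image analysis

Automatically discovering the structural parts of organs or lesions in medical images could give diagnosis more directly interpretable, part-level information.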

Long-term Vision

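Cross-modal structural understanding

Combining text, audio, and other modalities to build richer structured scene understanding, with semantic alignment and reasoning across modalities.

Autonomous learning and reasoning

Letting models discover and understand structural relationships in complex scenes on their own, advancing agents' capacity for autonomous learning and reasoning in unfamiliar environments.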

Abstract

When humans see a bird, they recognize far more than just "bird" -- they see a head, wings, and talons, a structured assembly of reusable parts that can be identified across every bird they have ever seen. We ask whether a self-supervised visual model can discover the same compositional structure on its own. To this end, we propose RATS (Register Attention Transformers), which decomposes the classification token into N learnable register tokens that route patch information through an L->N->N->L bottleneck via a three-step compress-communicate-broadcast attention. The N registers are partitioned across the H attention heads, so that registers assigned to different heads do not interact with each other. Without auxiliary losses or part annotations, each register spontaneously specializes into a proto-semantic region whose emerging structure resembles object parts. RATS surpasses all baselines by +12 mIoU on average across five segmentation benchmarks, with consistent gains on ADE20K (+1.11 mIoU) and COCO (+0.2 AP^m). Its register dictionary further exhibits part-level consistency and semantic proximity across related categories. Our results suggest that RATS may provide a useful architectural prior for structured and interpretable visual representation learning.
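The statement that registers assigned to different heads do not interact can be pictured as a block-diagonal interaction pattern. The snippet below is a small sketch under assumptions: the sizes are illustrative, and the contiguous assignment of registers to heads is a guess, since the abstract does not say how the partition is laid out.

```python
import numpy as np

N, H = 8, 4                                    # illustrative sizes, not from the paper
head_of_register = np.arange(N) // (N // H)    # assumed contiguous split: registers 0-1 -> head 0, ...
# mask[i, j] is True when registers i and j share a head and may exchange information
mask = head_of_register[:, None] == head_of_register[None, :]
print(mask.astype(int))
```

Each head contributes one (N/H) x (N/H) block on the diagonal, so only N^2 / H of the N^2 register pairs can communicate.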
