UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning

TL;DR

UNIEGO employs proxy-mediated hierarchical distillation from nine heterogeneous teachers to unify egocentric video representations, achieving state-of-the-art results on action recognition, video retrieval, and action segmentation across three ego-exo benchmarks.

cs.CV Advanced 2026-06-19

Wenhao Chi Arkaprava Sinha Dominick Reilly Hieu Le Srijan Das


video understanding knowledge distillation multi-modal learning proxy models egocentric videos

Key Findings

Methodology

This paper introduces a hierarchical multi-teacher distillation framework in which proxy models serve as intermediaries that translate diverse teacher knowledge into a homogeneous egocentric embedding space. The first level distills independently from nine teachers, spanning different modalities, viewpoints, and foundation models, into representation-specific proxies. These proxies mitigate architectural and geometric incompatibilities. The second level employs Selective Proxy Distillation (SPD), which adaptively filters proxies for each sample based on prediction correctness and confidence, so that only reliable supervision reaches the student. The unified encoder is initialized as a learned convex combination of proxy parameters, which stabilizes training. The framework integrates multi-view (ego-exo) and multi-modal (RGB, depth, skeleton) teachers together with foundation models, drawing on models such as DINOv2, SigLIP, and ST-GCN. Experiments on three benchmarks show improvements over naive multi-teacher baselines in action recognition, retrieval, and segmentation.
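
As a rough illustration of the per-sample filtering that SPD performs, the sketch below keeps only the proxies whose top prediction matches the label and whose cross-entropy falls under a threshold. The names `select_proxies` and `max_ce`, the threshold value, and the choice of cross-entropy as the confidence measure are assumptions made for this example; the summary does not specify them.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one sample, computed stably with log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label]


def select_proxies(proxy_logits, label, max_ce=0.7):
    """Return indices of proxies that are correct and confident for one sample.

    proxy_logits: one list of class logits per proxy, for a single sample.
    label: ground-truth class index.
    max_ce: hypothetical confidence threshold (lower cross-entropy = more confident).
    """
    selected = []
    for idx, logits in enumerate(proxy_logits):
        predicted = max(range(len(logits)), key=logits.__getitem__)
        if predicted == label and cross_entropy(logits, label) <= max_ce:
            selected.append(idx)
    return selected


# Three toy proxies scoring the same sample over three classes.
proxy_logits = [
    [4.0, 0.5, 0.1],  # predicts class 0 with a large margin
    [0.2, 3.0, 0.1],  # predicts class 1
    [0.6, 0.5, 0.4],  # predicts class 0, but only barely
]
print(select_proxies(proxy_logits, label=0))
```

With label 0, only the first proxy survives: the second is wrong and the third is correct but too uncertain. What happens when no proxy passes the filter is not described in the summary, so the sketch simply returns an empty list.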

Key Results

On egocentric action recognition benchmarks (EgoExo-Fitness, Assembly101, EgoExo4D), UNIEGO achieves 84.7%, 50.7%, and 41.1% accuracy respectively, surpassing naive distillation and previous SOTA by margins of +2.9% to +4.6%.
In video retrieval, UNIEGO's mAP scores are 0.543, 0.253, and 0.182 on the respective datasets, outperforming baseline methods, indicating richer feature representations.
For temporal action segmentation on Assembly101, UNIEGO's F1@50 reaches 12.3, outperforming naive distillation (9.8), demonstrating better temporal boundary detection.
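
For readers unfamiliar with F1@50, the sketch below shows the segmental F1 computation commonly used in temporal action segmentation: predicted segments count as true positives when their IoU with an unmatched ground-truth segment of the same class reaches the threshold. This follows the usual convention in the segmentation literature; whether the paper's evaluation code matches it in every detail is an assumption, and the function names are hypothetical.

```python
def segments(labels):
    """Collapse frame-wise labels into (label, start, end) runs, end exclusive."""
    runs = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append((labels[start], start, i))
            start = i
    return runs


def f1_at_k(predicted, truth, threshold=0.5):
    """Segmental F1 in [0, 1] at the given IoU threshold (0.5 for F1@50)."""
    pred_segs = segments(predicted)
    true_segs = segments(truth)
    matched = [False] * len(true_segs)
    tp = fp = 0
    for label, p_start, p_end in pred_segs:
        best_iou, best_j = 0.0, -1
        for j, (t_label, t_start, t_end) in enumerate(true_segs):
            if t_label != label:
                continue
            inter = max(0, min(p_end, t_end) - max(p_start, t_start))
            union = max(p_end, t_end) - min(p_start, t_start)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        # Each ground-truth segment can be matched at most once.
        if best_j >= 0 and best_iou >= threshold and not matched[best_j]:
            tp += 1
            matched[best_j] = True
        else:
            fp += 1
    fn = len(true_segs) - sum(matched)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


truth = [0, 0, 0, 0, 1, 1, 1, 1]
late_switch = [0, 0, 0, 0, 0, 0, 0, 1]
print(f1_at_k(late_switch, truth))
```

In the toy example the first predicted segment overlaps its ground truth with IoU 4/7 and counts as a hit, while the one-frame second segment has IoU 1/4 and counts as a miss, so the score depends on boundary placement rather than on frame-level accuracy alone.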

Significance

This work advances the field of egocentric video understanding by addressing the core challenge of integrating heterogeneous multi-source knowledge into a single, expressive model. The proxy-mediated hierarchical distillation framework effectively alleviates gradient conflicts and feature incompatibilities, enabling richer and more discriminative representations. Such a unified model is crucial for real-world applications like augmented reality, assistive robotics, and intelligent surveillance, where resource-efficient, robust, and comprehensive understanding is essential. The approach also opens new avenues for multi-modal, multi-view learning, setting a new benchmark for future research.

Technical Contribution

The paper's key technical innovation lies in the design of a two-stage distillation process with representation-specific proxies and a sample-wise selection mechanism. The proxies serve as structured mediators, transforming heterogeneous teacher signals into a unified egocentric space, thus resolving architectural and geometric incompatibilities. The second stage, SPD, dynamically filters proxies based on correctness and confidence, reducing conflicting gradients and improving robustness. The proxy merging initialization, based on convex combination, ensures a stable starting point in the loss landscape, facilitating effective training. The comprehensive experimental validation across multiple tasks and datasets demonstrates the framework's versatility and superiority over existing methods.

Novelty

This work is the first to systematically incorporate multi-modal, multi-view, and foundation model knowledge into a single egocentric encoder via proxy-mediated hierarchical distillation. Unlike prior approaches that assume homogeneous teachers or rely on simple ensemble methods, UNIEGO explicitly models the heterogeneity through structured proxies and adaptive sample-wise filtering. This approach effectively addresses the longstanding issues of gradient conflicts and feature incompatibilities in multi-source knowledge transfer, representing a significant leap forward in the field.

Limitations

The reliance on multiple proxy models increases computational and storage costs, which may hinder deployment in resource-constrained environments. Future work should explore model compression and efficiency improvements.
The framework assumes the availability of high-quality heterogeneous teachers; in scenarios with poor teacher performance or missing modalities, the effectiveness may diminish.
Current validation is limited to specific tasks like action recognition, retrieval, and segmentation; extending to multi-task learning or real-time applications requires further adaptation and optimization.

Future Work

Future research could focus on developing more efficient proxy architectures, possibly leveraging self-generating proxies or meta-learning strategies to reduce overhead. Additionally, extending the framework to multi-task learning scenarios, such as simultaneous action recognition and captioning, could enhance its versatility. Incorporating self-supervised pretraining for proxies and the unified encoder may further improve robustness, especially in unlabeled or noisy environments. Lastly, optimizing inference speed and resource efficiency will be critical for deploying this technology in real-time systems.

AI Executive Summary

Understanding human actions from egocentric videos remains a fundamental challenge in computer vision, with applications spanning augmented reality, assistive robotics, and activity analysis. Traditional approaches often rely on single-view, single-modality models, which are inherently limited by the narrow perspective and occlusion issues of wearable cameras. While recent efforts have incorporated auxiliary signals such as depth, skeleton data, and external viewpoints, these methods typically treat each modality or view separately, leading to fragmented representations.

The core difficulty lies in the heterogeneity of models trained on different modalities, viewpoints, and foundation models. These models often have incompatible architectures and feature geometries, making direct knowledge fusion problematic. Naive multi-teacher distillation methods tend to produce conflicting gradients, resulting in suboptimal training and limited representation richness.

Addressing this, Wenhao Chi et al. propose UNIEGO, a novel hierarchical distillation framework that leverages proxy models as structured mediators. The first stage involves training representation-specific proxies to convert heterogeneous teacher signals into a unified egocentric embedding space. These proxies serve as bridges, alleviating architectural and geometric incompatibilities. The second stage employs Selective Proxy Distillation (SPD), which dynamically filters proxies based on their prediction correctness and confidence for each training sample, ensuring only reliable supervision guides the student model.

This two-level approach is further stabilized by initializing the unified encoder as a convex combination of proxy parameters, placing it in a well-conditioned region of the loss landscape. The framework integrates nine teachers covering multiple modalities, viewpoints, and foundation models, drawing on models such as DINOv2, SigLIP, and ST-GCN, trained on datasets such as EgoExo-Fitness, Assembly101, and EgoExo4D.
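
The convex-combination initialization can be pictured with a minimal sketch. Here each proxy is a dictionary of flat parameter lists, and the weights are obtained by a softmax over unconstrained scores so that they stay non-negative and sum to one. The softmax parameterization, the toy parameter layout, and the names `convex_weights` and `merge_proxies` are assumptions for illustration; in the described method the coefficients are learned, whereas here they are fixed by hand.

```python
import math


def convex_weights(scores):
    """Map unconstrained scores to non-negative weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merge_proxies(proxy_params, alphas):
    """Weighted average of parameter dicts that share names and shapes."""
    merged = {}
    for name in proxy_params[0]:
        size = len(proxy_params[0][name])
        merged[name] = [
            sum(a * params[name][i] for a, params in zip(alphas, proxy_params))
            for i in range(size)
        ]
    return merged


# Two toy proxies with identical architecture but different parameters.
proxies = [
    {"w": [1.0, 0.0], "b": [0.0]},
    {"w": [0.0, 1.0], "b": [2.0]},
]
alphas = convex_weights([0.0, 0.0])  # equal scores give equal weights
merged = merge_proxies(proxies, alphas)
print(alphas, merged)
```

Averaging parameters this way is only meaningful because the proxies share one architecture, which is what allows their weights to be combined element by element.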

Experimental results demonstrate that UNIEGO achieves state-of-the-art performance across three key tasks: action recognition, video retrieval, and action segmentation. For instance, on EgoExo-Fitness, it reaches 84.7% accuracy, outperforming naive distillation (81.5%) by 3.2 percentage points. In retrieval, it reaches an mAP of 0.543 against 0.486 for naive distillation, indicating richer feature representations. The model also exhibits robustness across different backbone architectures, including lightweight models with only 22 million parameters.

This work marks a significant step forward in multi-source knowledge fusion for egocentric video understanding. By structuring the knowledge transfer through proxies and adaptive filtering, UNIEGO effectively mitigates the conflicts and incompatibilities that have hampered previous methods. Its ability to produce richer, more discriminative representations opens new horizons for practical applications in AR, robotics, and beyond. Future directions include optimizing computational efficiency, extending to multi-task learning, and exploring self-supervised pretraining to further enhance robustness and generalization.

Deep Analysis

Background

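Egocentric video understanding has become a vital area of research. With the rise of applications such as AR, VR, and intelligent robotics, researchers continue to explore how to extract rich action and scene information from the first-person viewpoint. Early work such as EgoVLP and LaViLa focused mainly on single-modality feature learning, attempting to capture the essence of actions within a limited field of view. However, because of the camera motion, occlusion, and viewpoint restrictions of egocentric cameras, single-modality models struggle to understand a scene fully. To compensate, researchers introduced multi-modal (e.g., depth, skeleton) and multi-view (egocentric and exocentric) information, using synchronously captured external viewpoints or sensor data to strengthen models. Representative work includes ViewpointRosetta, which uses a diffusion model for viewpoint mapping, and EgoDTM, which learns 3D-aware features through depth distillation. These methods relieve viewpoint and modality limitations to some extent, but they still face incompatible heterogeneous architectures and mismatched feature spaces, which limits deep fusion of multi-source information.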

Core Problem

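The core problem is how to effectively fuse heterogeneous knowledge from different modalities, viewpoints, and foundation models into a single, rich egocentric representation. Traditional multi-teacher distillation methods mostly assume that teacher architectures are identical or that their feature spaces are compatible. In practice, skeleton models, scene models, and foundation models such as DINOv2 and SigLIP differ greatly in architecture, which leads to gradient conflicts and optimization difficulty. In addition, differences in feature geometry across heterogeneous models make direct distillation ineffective, so the complementarity of multi-source information is hard to exploit. The key to solving this problem is a structured knowledge-mediation mechanism that both eases architectural differences and dynamically selects reliable knowledge sources, improving the richness and discriminability of the egocentric representation.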

Innovation

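The main innovations of this work are:

Proxy models are introduced as mediators for heterogeneous teacher knowledge, converting knowledge from different modalities, viewpoints, and foundation models into a unified egocentric space and easing incompatibilities in model architecture and feature geometry.
A two-level distillation strategy: the first level transfers knowledge from heterogeneous teachers to proxies; the second level applies sample-level Selective Proxy Distillation (SPD), which selects the most reliable proxies for each sample based on prediction correctness and confidence and suppresses erroneous signals.
The UNIEGO model is initialized as a convex combination of proxy parameters, so that training starts in a favorable region of the loss landscape, improving training stability and generalization.
Experiments show that the framework outperforms existing SOTA on multiple tasks and datasets, demonstrating its effectiveness for fusing multi-source heterogeneous knowledge.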

Methodology

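Proxy Learning: Multiple teacher models (Tr) extract features across different modalities, viewpoints, and foundation models, and dedicated proxy models (Pr) convert their knowledge into a unified egocentric space. All proxies share the same architecture but have independent parameters, and are optimized with feature distillation (cosine distance and cross-entropy loss).
Proxy Merging: In the second stage, convex combination coefficients (α) over the proxy parameters are learned by minimizing the classification loss on the training set. The combination initializes the UNIEGO model and places it in a flat region of the loss landscape.
Selective Proxy Distillation (SPD): For each sample, proxies that predict correctly with high confidence (judged via cross-entropy) are selected, and features and logits are distilled from them (cosine distance and KL divergence), suppressing interference from erroneous signals.
Training pipeline: First-level proxy learning produces the proxy models; proxy merging then initializes UNIEGO; sample-level selective distillation follows, yielding an egocentric encoder that carries rich multi-source knowledge.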
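
The second-stage objective described above combines a cosine distance on features with a KL divergence on logits, applied only to the proxies selected for a sample. A minimal sketch of that per-sample loss follows. The equal averaging over selected proxies, the `kl_weight` factor, the direction of the KL term, and the zero loss when nothing is selected are all assumptions made for this example, not details given in the summary.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def log_softmax(logits):
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum_exp for x in logits]


def kl_divergence(proxy_logits, student_logits):
    """KL(proxy || student) between the two softmax distributions."""
    log_p = log_softmax(proxy_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def spd_loss(student_feat, student_logits, proxies, selected, kl_weight=1.0):
    """Average feature + logit distillation loss over the selected proxies.

    proxies: list of (feature, logits) pairs, one per proxy.
    selected: indices of the proxies kept for this sample.
    """
    if not selected:
        return 0.0  # assumption: no reliable proxy means no distillation term
    total = 0.0
    for idx in selected:
        feat, logits = proxies[idx]
        total += cosine_distance(student_feat, feat)
        total += kl_weight * kl_divergence(logits, student_logits)
    return total / len(selected)


proxies = [([1.0, 0.0], [2.0, 0.0]), ([0.0, 1.0], [0.0, 2.0])]
print(spd_loss([1.0, 0.0], [2.0, 0.0], proxies, selected=[0]))
print(spd_loss([1.0, 0.0], [2.0, 0.0], proxies, selected=[0, 1]))
```

The first call distills from a proxy that already agrees with the student, so the loss is essentially zero; adding the disagreeing second proxy raises it. For the first-stage proxy learning, the same cosine term would be paired with a cross-entropy against the labels instead of the KL term.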

Experiments

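Datasets: Three public datasets, EgoExo-Fitness, Assembly101, and EgoExo4D, covering different scenes and action categories, are used to evaluate action recognition, video retrieval, and action segmentation.
Experimental design: TimeSformer serves as the backbone for training UNIEGO. Proxies cover multiple modalities (RGB, depth, skeleton) and viewpoints (ego, exo). Training details include 15 epochs, a batch size of 8, and a gradually decaying learning rate.
Baselines: Single models, naive multi-teacher distillation, and other SOTA models (e.g., π-ViT, ST-GCN).
Evaluation metrics: Top-1 accuracy for action recognition, mAP for video retrieval, and F1, edit distance, and frame accuracy for action segmentation.
Ablation studies: These verify the contribution of the proxy models, the merging strategy, and the selection mechanism, and analyze the effect of different hyperparameter settings.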
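
To make the retrieval metric concrete, the sketch below computes mean average precision from ranked relevance lists. It assumes that each query returns a ranked gallery in which an item is relevant when it shares the query's action class; the summary does not spell out the retrieval protocol, so that relevance definition and the function names are assumptions.

```python
def average_precision(ranked_relevance):
    """AP for one query, given booleans in rank order (True = relevant)."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_query_relevance):
    """Mean of per-query AP values."""
    aps = [average_precision(r) for r in per_query_relevance]
    return sum(aps) / len(aps)


queries = [
    [True, False, True],  # relevant items at ranks 1 and 3
    [False, True],        # relevant item at rank 2
]
print(mean_average_precision(queries))
```

The first query scores (1/1 + 2/3) / 2 and the second scores 1/2, so the metric rewards placing relevant clips early in the ranking, which is why it is read as a measure of embedding quality.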

Results

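In action recognition, UNIEGO reaches 84.7% on EgoExo-Fitness, surpassing naive distillation (81.5%) and π-ViT (80.1%) by a clear margin. On Assembly101 it reaches 50.7%, better than the 48.2% of other methods. On EgoExo4D it reaches 41.1%, also ahead of the compared methods.
In video retrieval, UNIEGO's mAP reaches 0.543, clearly better than naive distillation (0.486) and TimeSformer (0.474).
In action segmentation, UNIEGO's F1@50 is 12.3, better than the 9.8 of naive distillation, supporting its ability to capture fine-grained temporal information.
Ablation studies show that the proxy models, the merging strategy, and the sample selection mechanism all contribute materially to the gains; SPD's sample filtering in particular markedly reduces interference from erroneous signals.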

Applications

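Immediate applications: The model can provide efficient action recognition and behavior analysis in augmented reality, intelligent surveillance, and robot perception, and is particularly suited to resource-limited edge devices.
Long-term vision: Combining self-supervised and multi-task learning could yield more general and robust multi-modal understanding systems, advance deep fusion of multi-source information, and support intelligent applications across scenes and tasks.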

Limitations & Outlook

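Training and maintaining proxy models adds computational cost; in multi-modal, multi-view settings in particular, model size and storage requirements are large.
When modalities are severely missing or teacher models perform very poorly, the quality of proxy knowledge may degrade and hurt final performance.
The framework has so far been validated mainly on action recognition, retrieval, and segmentation; its adaptability and efficiency in multi-task or real-time settings have not been fully verified, and inference speed and resource consumption need future optimization.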

Abstract

Egocentric video understanding is inherently limited by the narrow perspective of wearable cameras: a single viewpoint, a single modality, a single model cannot capture the full richness of human action. We argue that a truly expressive egocentric representation must subsume complementary knowledge across viewpoints, modalities, and foundation model representations, yet remain deployable from egocentric video alone. To this end, we introduce a hierarchical multi-teacher distillation framework that produces UNIEGO, a unified egocentric encoder trained with nine teachers spanning ego-exo viewpoints, RGB, depth, and skeleton modalities, and four foundation models. Rather than distilling directly from heterogeneous teachers whose incompatible architectures and feature geometries induce conflicting gradients, our framework interposes a layer of representation-specific Proxy models that translate diverse teacher knowledge into a homogeneous egocentric space. A second distillation stage, Selective Proxy Distillation (SPD), then adaptively selects, for each training sample, the subset of proxies that are both correct and confident, distilling exclusively from reliable supervision and suppressing erroneous signals. SPD is further stabilized by initializing UNIEGO as a learned convex combination of proxy parameters, placing the unified model in a well-conditioned region of the loss landscape before distillation begins. UNIEGO achieves state-of-the-art performance across three egocentric video understanding tasks (action recognition, video retrieval, and action segmentation) on three challenging ego-exo benchmarks, outperforming naive multi-teacher distillation baselines and demonstrating that structured, proxy-mediated knowledge transfer yields richer and more discriminative egocentric representations.

cs.CV cs.LG
