Semantically-Aware Diver Activity Recognition Framework for Effective Underwater Multi-Human-Robot Collaboration

TL;DR

Proposes DAR-Net, a transformer-based framework with pixel-level scene supervision, achieving 73.33% accuracy in underwater diver activity recognition on the first UDA dataset.

cs.RO, 2026-06-11

Sadman Sakib Enan, Junaed Sattar

Underwater Robotics, Activity Recognition, Transformer, Multimodal Learning, Deep Learning, Scene Semantics, Dataset, Multi-task Learning

Key Findings

Methodology

The proposed DAR-Net employs ResNeXt-101 as the backbone for feature extraction, integrating a Transformer module for temporal reasoning. It adopts a multi-task training strategy with a combined loss function comprising classification cross-entropy and pixel-level semantic binary cross-entropy. The input consists of underwater video clips, which are encoded into spatio-temporal features. These features are processed through two branches: a Transformer-based classification branch capturing global temporal context, and an encoder-decoder segmentation branch leveraging scene semantics for local context. Positional encodings are added to preserve spatial information. The training involves data augmentation, dynamic loss weighting, and end-to-end optimization, resulting in a model capable of robust activity recognition even under low visibility conditions.
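The combined loss described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the toy inputs, and the fixed default weights `alpha` and `beta` are assumptions made here, and a real training loop would compute these terms on tensors in a deep learning framework.

```python
import math

def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of one clip's class logits against the true class index."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target_index]

def pixel_binary_cross_entropy(probabilities, mask, eps=1e-7):
    """Mean binary cross-entropy between predicted pixel probabilities and a 0/1 mask."""
    total = 0.0
    for p, y in zip(probabilities, mask):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(mask)

def combined_loss(logits, target_index, probabilities, mask, alpha=1.0, beta=1.0):
    """Weighted sum of the classification and segmentation terms (weights are assumed)."""
    cls = softmax_cross_entropy(logits, target_index)
    seg = pixel_binary_cross_entropy(probabilities, mask)
    return alpha * cls + beta * seg

# Toy example: six activity classes and a flattened 4-pixel mask (made-up values).
loss = combined_loss(
    logits=[2.0, 0.1, -1.0, 0.3, 0.0, -0.5],
    target_index=0,
    probabilities=[0.9, 0.8, 0.2, 0.1],
    mask=[1, 1, 0, 0],
)
print(f"combined loss: {loss:.4f}")
```

Setting `beta=0` reduces this to a classification-only objective, which is one plausible way to reproduce a "without semantic supervision" ablation.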

Key Results

DAR-Net achieved 73.33% accuracy on the test set, outperforming baseline models such as 3DResNet (53.33%) and SlowFast (56.67%). It attained a precision of 76.90%, recall of 73.33%, and F1-score of 72.17%, demonstrating balanced high performance. The incorporation of scene semantics significantly improved focus on relevant regions, reducing false positives. Confusion matrix analysis revealed strong classification across most categories, with some difficulty distinguishing subtle activities like 'busy' and 'robot-diver interaction'. Ablation studies confirmed the importance of semantic supervision, with attention maps showing more precise focus on key scene elements.
These results indicate that the model effectively captures complex underwater activities, even in challenging conditions, and surpasses current state-of-the-art methods in accuracy and robustness.
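To make the reported metrics concrete, the sketch below derives accuracy and support-weighted precision, recall, and F1 from a confusion matrix. The summary does not say which averaging scheme was used; support weighting is an assumption here, chosen because weighted recall always equals accuracy, which matches the reported 73.33% for both. The confusion matrix is made up for illustration and is not the paper's data.

```python
def classification_report(confusion):
    """Accuracy plus support-weighted precision, recall and F1.

    confusion[i][j] counts samples whose true class is i and predicted class is j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    precision = recall = f1 = 0.0
    for c in range(n_classes):
        support = sum(confusion[c])                               # true samples of class c
        predicted = sum(confusion[r][c] for r in range(n_classes))  # samples predicted as c
        p = confusion[c][c] / predicted if predicted else 0.0
        r = confusion[c][c] / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": correct / total, "precision": precision, "recall": recall, "f1": f1}

# Made-up 3-class confusion matrix, for illustration only.
toy = [
    [9, 1, 0],
    [2, 7, 1],
    [1, 2, 7],
]
report = classification_report(toy)
print({name: round(value, 4) for name, value in report.items()})
```

Under this averaging, precision can exceed recall when errors concentrate in a few predicted classes, which is the pattern the reported figures show.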

Significance

This study pioneers the integration of Transformer architectures with pixel-level scene semantics for underwater activity recognition, addressing a critical gap in autonomous underwater human-robot interaction. The approach enhances the perception capabilities of AUVs, enabling them to understand diver activities accurately and reliably in low-visibility, complex environments. The creation of the first annotated underwater diver activity dataset (UDA), with just over 2,600 images, provides a valuable resource for future research, fostering advancements in underwater perception, safety, and autonomous collaboration. The technology has broad implications for marine exploration, environmental monitoring, and disaster response, where intelligent, cooperative underwater systems are increasingly essential.

Technical Contribution

The main technical contributions include the development of DAR-Net, a novel end-to-end deep learning framework that combines ResNeXt-101 feature extraction with Transformer-based temporal modeling. The multi-task training strategy jointly optimizes activity classification and scene segmentation, guided by pixel-level semantic supervision. The model employs dynamic loss weighting, positional encoding, and a hybrid encoder-decoder architecture, enabling it to focus on critical regions and capture long-range dependencies. The introduction of the UDA dataset, with pixel-level annotations for six activity categories, further advances the field by providing a benchmark for underwater activity recognition. These innovations collectively improve recognition accuracy, robustness, and interpretability in underwater environments.

Novelty

This work is the first to apply Transformer models to underwater diver activity recognition, integrating pixel-level scene semantics for enhanced focus and interpretability. Unlike previous approaches relying solely on CNNs or shallow features, DAR-Net leverages self-attention to model long-term dependencies, significantly boosting performance. The creation of the UDA dataset, specifically tailored for multi-human-robot underwater interactions with pixel-level annotations, fills a critical gap in available resources. This combination of advanced architecture and specialized dataset marks a substantial step forward in underwater perception research, setting a new standard for future studies.

Limitations

The dataset, although pioneering, remains small at just over 2,600 images, which may restrict the model’s ability to generalize to more diverse, real-world open-water scenarios. Larger datasets are needed for broader applicability.
Experiments are confined to controlled, closed-water environments; the model’s robustness in open ocean conditions with varying currents, turbidity, and lighting remains untested.
Recognition of subtle or complex actions, such as differentiating between 'busy' and 'collaborative' activities, still faces challenges due to short video clips and ambiguous visual cues. Further multimodal integration and longer temporal context may be required.

Future Work

Future efforts will focus on expanding the dataset through synthetic data augmentation and real-world open-water collection, improving model generalization. Incorporating additional modalities like acoustic signals and pressure sensors could enhance scene understanding. Developing lightweight, real-time models will facilitate deployment on resource-constrained underwater robots. Moreover, exploring transfer learning and domain adaptation techniques will help adapt the model to different aquatic environments. Continued improvement of the model's robustness and efficiency will advance the practical deployment of autonomous underwater systems in marine science, resource exploration, and seabed safety.

AI Executive Summary

The exploration and utilization of underwater environments have become increasingly vital for scientific, environmental, and industrial purposes. Autonomous underwater vehicles (AUVs) are at the forefront of this work, capable of performing complex tasks such as mapping, inspection, and rescue. However, a significant challenge remains: enabling these robots to understand and interpret human diver activities accurately in the challenging underwater environment. Traditional activity recognition methods, primarily developed for terrestrial scenarios, struggle underwater due to poor visibility, dynamic lighting, and complex interactions.

Addressing this gap, the authors introduce DAR-Net, a transformer-based deep learning framework that integrates pixel-level scene semantics to recognize diver activities robustly. The core innovation lies in combining a ResNeXt-101 backbone with a Transformer module for temporal reasoning, guided by a multi-task loss function that jointly optimizes activity classification and scene segmentation. This approach allows the model to focus on critical scene elements such as divers, robots, and objects, even in low-visibility conditions.

To facilitate research, the authors also present the Underwater Diver Activity (UDA) dataset, the first of its kind, comprising over 2600 annotated underwater images across six activity categories. These annotations include pixel-level masks, enabling precise scene understanding. Extensive experiments demonstrate that DAR-Net achieves 73.33% accuracy, outperforming existing models like 3DResNet and SlowFast, and exhibits strong robustness and interpretability. The results highlight the importance of scene semantics in guiding deep models for underwater activity recognition.

This work has profound implications for underwater robotics, enabling safer, more efficient, and autonomous human-robot collaboration. It paves the way for intelligent systems capable of real-time understanding and decision-making in complex aquatic environments. Despite its success, the study acknowledges limitations such as dataset size and environmental scope, with future directions focusing on dataset expansion, multimodal integration, and deployment in open-water scenarios. Overall, this research marks a significant step toward realizing fully autonomous, perceptive underwater robotic systems that can operate seamlessly alongside human divers.

Deep Analysis

Background

The development of underwater robotics has seen rapid progress over the past decades, driven by applications in marine exploration, environmental monitoring, and resource extraction. Early efforts focused on autonomous navigation and target detection using sonar and acoustic sensors. With the advent of deep learning, vision-based methods employing CNNs such as VGG, ResNet, and later 3D CNN variants like C3D and SlowFast have been applied to underwater scene understanding. However, these models primarily address static object detection or simple motion tracking, lacking the capacity to interpret complex human activities. Existing datasets are limited, often capturing isolated diver poses or static scenes, which restricts the training of models capable of understanding nuanced interactions. Recently, Transformer architectures have revolutionized natural language processing and visual tasks, offering superior long-range dependency modeling. Applying these architectures to underwater activity recognition is promising but uncharted territory. The lack of large-scale, annotated datasets tailored for underwater multi-human-robot interactions further hampers progress. This paper bridges these gaps by introducing a novel framework and dataset, advancing the field toward intelligent underwater perception.

Core Problem

The core challenge in underwater diver activity recognition lies in the environment's inherent complexity—poor visibility, dynamic lighting, and water turbidity obscure visual cues. Additionally, the diversity of activities and interactions among multiple divers and robots creates a high-dimensional, ambiguous recognition problem. Existing models struggle to maintain accuracy under these conditions, especially with limited training data. The absence of large, annotated datasets prevents deep models from learning robust features. Furthermore, subtle differences between activities, such as 'busy' versus 'collaborative' states, are difficult to distinguish based solely on short video clips. These issues collectively hinder the deployment of reliable, real-time activity recognition systems essential for safe and efficient underwater operations.

Innovation

This work introduces several key innovations:

1. Transformer Integration: Leveraging Transformer modules for temporal modeling enhances the understanding of long-term dependencies in underwater activities, surpassing traditional CNNs.

2. Scene Semantics Supervision: Pixel-level annotations guide the model's focus on relevant scene elements, improving robustness in low-visibility scenarios.

3. Multi-task Learning: Joint optimization of activity classification and scene segmentation fosters comprehensive scene understanding.

4. Dataset Creation: The UDA dataset provides high-quality, pixel-annotated images across six activity categories, filling a critical resource gap.

These innovations collectively enable the model to better interpret complex, noisy underwater scenes, facilitating accurate activity recognition and interaction understanding.

Methodology

- Feature Extraction: Utilized ResNeXt-101 to extract deep features from underwater video frames, incorporating positional encoding to preserve spatial information.
- Temporal Modeling: Input features are processed through a Transformer encoder, employing self-attention to capture long-range temporal dependencies.
- Multi-task Architecture: The network branches into a classification head (Transformer-based) and a segmentation decoder (encoder-decoder structure), both trained jointly.
- Loss Functions: Defined a combined loss with classification cross-entropy and pixel-wise semantic binary cross-entropy, with trainable weights α and β to balance tasks.
- Training Strategy: Employed data augmentation techniques, trained on the UDA dataset for 200 epochs using AdamW optimizer, with a learning rate of 10^-5.
- Evaluation: Used accuracy, precision, recall, and F1-score on a separate test set, including ablation studies to assess the impact of semantic supervision.
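The temporal modeling step can be illustrated with a small, dependency-free sketch of positional encoding followed by self-attention over per-frame features. Several things here are assumptions rather than details from the source: the sinusoidal form of the encoding, its use to index frame order, the single attention head with identity projections, the mean pooling over time, and the toy feature values standing in for backbone outputs.

```python
import math

def sinusoidal_positions(num_steps, dim):
    """Fixed sinusoidal positional encodings, one row per time step."""
    table = []
    for pos in range(num_steps):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(features):
    """Single-head scaled dot-product attention with identity projections.

    A real Transformer layer would first apply learned query, key and value projections.
    """
    dim = len(features[0])
    outputs, weights = [], []
    for query in features:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in features]
        attn = softmax(scores)
        weights.append(attn)
        outputs.append([sum(a * value[d] for a, value in zip(attn, features)) for d in range(dim)])
    return outputs, weights

# Toy per-frame features (4 frames, 4 dimensions); stand-ins for backbone outputs.
frames = [[0.2, 0.1, 0.0, 0.3], [0.1, 0.4, 0.2, 0.0], [0.3, 0.3, 0.1, 0.1], [0.0, 0.2, 0.5, 0.2]]
positions = sinusoidal_positions(len(frames), len(frames[0]))
encoded = [[f + p for f, p in zip(frame, pos)] for frame, pos in zip(frames, positions)]
context, attention = self_attention(encoded)
clip_feature = [sum(col) / len(col) for col in zip(*context)]  # mean-pool over time
print([round(v, 3) for v in clip_feature])
```

Each row of `attention` shows how much one frame draws on every other frame, which is the mechanism that lets the classification branch relate events far apart in a clip.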

Experiments

- Dataset: The UDA dataset contains over 2600 images with pixel-level annotations, covering six activity categories in controlled water tank environments.
- Baselines: Compared DAR-Net against models like 3DResNet, R(2+1)D, SlowFast, and LateTemporal, retrained on the same dataset.
- Evaluation Metrics: Assessed performance using accuracy, precision, recall, F1-score, and confusion matrices.
- Hyperparameters: Batch size 4, learning rate 10^-5, 200 epochs, data augmentation applied.
- Ablation Studies: Compared models with and without semantic supervision, visualized attention maps, and analyzed misclassification cases to validate the importance of scene semantics.
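The listed hyperparameters and the with/without-semantics ablation can be captured in a small configuration object. The batch size, learning rate, epoch count, and optimizer come from the text above; the class name, the `use_semantic_supervision` switch, and the helper functions are hypothetical names introduced only for this sketch.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TrainConfig:
    batch_size: int = 4
    learning_rate: float = 1e-5
    epochs: int = 200
    optimizer: str = "AdamW"
    use_semantic_supervision: bool = True  # hypothetical switch for the ablation

def ablation_variants(base):
    """Return the full config and the same config without the segmentation loss."""
    return {
        "with_semantics": base,
        "without_semantics": replace(base, use_semantic_supervision=False),
    }

def steps_per_epoch(num_training_clips, batch_size):
    """Optimizer steps per epoch, counting a final partial batch."""
    return -(-num_training_clips // batch_size)  # ceiling division

base = TrainConfig()
for name, cfg in ablation_variants(base).items():
    print(name, cfg)
```

Keeping every other field identical between the two variants is what makes the comparison attributable to semantic supervision alone.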

Results

- DAR-Net achieved 73.33% accuracy, outperforming all baseline models significantly, with the next best being LateTemporal at 66.67%.
- Precision, recall, and F1-score metrics confirmed the model’s balanced performance, with precision at 76.90% and F1-score at 72.17%.
- Attention map analysis demonstrated that semantic supervision directs the model’s focus toward relevant scene regions, reducing false positives.
- Confusion matrices revealed high accuracy in most categories, with some confusion between 'busy' and 'collaborative' activities, indicating areas for further refinement.
- Ablation results confirmed that scene semantics supervision enhances model focus and accuracy, especially in cluttered or low-visibility scenarios.

Applications

- Immediate: Deployment in autonomous underwater vehicles for real-time diver activity monitoring, enhancing safety and operational efficiency.
- Long-term: Development of comprehensive underwater perception systems capable of understanding complex interactions, supporting scientific research, resource management, and disaster response.

Limitations & Outlook

- Dataset size remains limited, potentially restricting model generalization to diverse real-world scenarios.
- Experiments are confined to controlled water tank environments; performance in open ocean conditions with variable currents and turbidity needs validation.
- Recognizing subtle or complex activities remains challenging, requiring further multimodal data integration and longer temporal context analysis.

Plain Language Accessible to non-experts

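Imagine a busy kitchen where chefs are preparing different dishes at the same time. Some are chopping vegetables, some are stir-frying, and others are talking to one another. The lighting is not always bright, and cooking fumes sometimes block the view, but you can still tell roughly who is doing what. Now suppose you have a very smart robot assistant that can watch everyone in the kitchen, remember what they are doing, tell who is busy and who is resting, and even understand how they communicate. This robot uses a newer technique called a "Transformer" to understand complex scenes much as a person would, keeping track of each person's actions and interactions. It also uses a special kind of "eye", pixel-level scene understanding, to judge more accurately what each person is doing. By continually learning and observing, the robot becomes smarter: it can help assign tasks in the kitchen, remind chefs to stay safe, and even manage the kitchen while you are away. It is as if we have given the robot "eyes" and a "brain" so that underwater, just as in the kitchen, it can watch divers' movements, understand how they cooperate, and help them complete their tasks safely. The core of this technology is making robots smarter and more scene-aware so they can act autonomously in complex environments.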

ELI14 Explained like you're 14

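Imagine you are playing in a swimming pool with your friends, and everyone is doing something different. Some are diving, some are chatting, and others are helping carry things. Now imagine a super-smart robot that can watch all of you and tell what you are doing, such as sorting out your gear or waving to a friend. This robot uses a newer technique called a "Transformer" to understand your actions and interactions much as a person would. It also uses a special kind of "eye", pixel-level scene understanding, to see more clearly where you are in the water and how you are moving. That way it can tell who is busy and who is resting, and even understand how you communicate with each other. By continually learning, the robot gets smarter and can help underwater, for example by reminding you to stay safe or helping you find something you need. It is like an underwater companion that can watch, understand, and help, making diving safer and more fun. What makes the technology powerful is that it makes robots smarter, so they can act autonomously in complex underwater environments and help people with all kinds of tasks.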

Abstract

Effective multi-human-robot collaboration is essential for expanding human-led operations in the challenging and high-risk underwater environment. For autonomous underwater vehicles (AUVs) to become true teammates, they must be able to comprehend their surroundings and recognize a diver's activities to offer assistance and ensure safety. Towards this goal, we introduce DAR-Net, a novel transformer-based framework that analyzes complex underwater scenes to classify diver activities. Our contribution lies in a semantically guided learning formulation that couples transformer-based temporal reasoning with pixel-level scene supervision. This multi-loss training strategy explicitly aligns global activity recognition with local human-robot interaction semantics, which is particularly critical in low-visibility underwater conditions. To address the significant challenge of data scarcity in this domain, we present the first-ever Underwater Diver Activity (UDA) dataset, a foundational resource containing over 2,600 annotated images with pixel-level masks. Through rigorous experimental evaluations in a controlled environment, we demonstrate that DAR-Net achieves promising accuracy in recognizing six distinct diver activities, outperforming state-of-the-art models. While this dataset provides a crucial baseline, our work serves as a pioneering step, laying the groundwork for future research and facilitating the development of more intelligent, collaborative underwater robotic systems.

cs.RO cs.CV
