AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild

TL;DR

AnyMo is a geometry-aware, setup-agnostic framework for human motion modeling from wearable IMUs. It improves average zero-shot activity recognition accuracy by 11.7% and zero-shot text-to-IMU retrieval MRR by 28.6%.

cs.CV 2026-05-22

Baiyu Chen Zechen Li Wilson Wongso Lihuan Li Xiachong Lin Hao Xue Benjamin Tag Flora Salim


Human Motion Modeling Inertial Measurement Unit Geometry-Aware Zero-Shot Learning Multimodal Alignment

Key Findings

Methodology

AnyMo is a geometry-aware framework for setup-agnostic human motion modeling. It runs physics-grounded IMU simulation over dense body-surface placements derived from the Nymeria body model to generate diverse and plausible synthetic inertial signals. It then pre-trains a spatio-temporal graph convolutional encoder on paired synthetic placement views and masked partial observations, using masked cross-view predictive contrastive learning so that the learned motion representations are robust to sensor setup. Next, a product-quantized variational autoencoder tokenizes multi-position IMU data into compact full-body motion tokens. These tokens are aligned with a large language model (LLM) through multi-task contrastive instruction tuning, supporting zero-shot recognition, cross-modal retrieval, and motion captioning.

Key Results

On zero-shot human activity recognition across 14 unseen downstream datasets, AnyMo achieves an average accuracy improvement of 11.7%, macro-F1 increase of 11.6%, and Recall@2 boost of 22.6%, outperforming state-of-the-art baselines such as ImageBind and IMU2CLIP.
In bidirectional IMU-to-text and text-to-IMU cross-modal retrieval tasks, AnyMo improves mean reciprocal rank (MRR) by 15.9% and 28.6%, respectively, demonstrating effective motion-language alignment.
For zero-shot motion captioning, AnyMo attains an 18.8% increase in BERT-F1 score, indicating more accurate and semantically rich natural language descriptions of wearable IMU motion.

Significance

AnyMo addresses the fundamental challenge of IMU signal dependency on sensor placement, orientation, and hardware variations, which traditionally hinder cross-device and cross-dataset generalization. By integrating geometry-aware dense IMU simulation with setup-agnostic representation learning and full-body tokenization aligned to large language models, AnyMo enables robust, generalized human motion understanding in the wild. This advancement facilitates broader deployment of wearable IMUs beyond closed-set recognition, supporting continuous, context-aware AI systems in real-world environments and bridging the gap between inertial sensing and natural language understanding.

Technical Contribution

AnyMo introduces a physics-grounded, geometry-aware IMU simulation framework based on the Nymeria body model that generates dense synthetic IMU signals across diverse body-surface placements. It applies masked cross-view predictive contrastive learning to a spatio-temporal graph convolutional network to learn setup-invariant motion representations. It tokenizes full-body IMU data with a product-quantized variational autoencoder, producing compact discrete tokens that serve as an interface to large language models. Multi-task contrastive instruction tuning aligns these tokens with natural language, enabling zero-shot recognition, cross-modal retrieval, and motion captioning. The combination of simulation, representation learning, tokenization, and multimodal alignment in one framework distinguishes AnyMo from prior work.

Novelty

AnyMo is the first framework to combine physics-based dense body-surface IMU simulation, setup-agnostic graph-based pretraining, product-quantized full-body IMU tokenization, and large language model alignment for generalist wearable human motion understanding. Unlike prior approaches limited to sparse sensor placements or fixed setups, AnyMo's geometry-aware dense simulation and masked cross-view contrastive learning enable robust cross-setup generalization. Its tokenization strategy effectively bridges continuous multi-sensor inertial data with discrete language tokens, addressing the modality gap in motion-language tasks.

Limitations

AnyMo relies on the accuracy and generality of the Nymeria body model; its performance on extreme or atypical body shapes and motions remains unverified, potentially affecting synthetic signal realism and downstream robustness.
The hardware noise models used in simulation are estimated from limited real IMU streams, which may not capture the full diversity of device variations encountered in practice, possibly impacting real-world performance.
While supporting multi-position IMU inputs, AnyMo's performance may degrade under extremely sparse or missing sensor data scenarios, limiting applicability in constrained sensing environments.

Future Work

Future research directions include extending the Nymeria model to encompass a broader range of body morphologies and dynamic scenarios to enhance simulation diversity and realism; incorporating more sophisticated device noise and sampling variability models to improve robustness across heterogeneous hardware; and exploring more efficient tokenization schemes and larger-scale motion-language joint pretraining to enable real-time, accurate human motion understanding and interaction in diverse real-world applications.

AI Executive Summary

Human motion is a fundamental expression of human context, critical for developing proactive, context-aware AI systems. With the proliferation of wearable and mobile devices equipped with inertial measurement units (IMUs), continuous sensing of human motion in the wild has become feasible. However, IMU signals are highly dependent on sensor setup factors such as body location, mounting position, orientation, hardware, and sampling protocols. This dependency poses significant challenges for learning motion representations that generalize across devices and datasets, limiting the utility of wearable IMUs beyond closed-set activity recognition.

To address these challenges, Chen et al. introduce AnyMo, a geometry-aware framework for setup-agnostic human motion modeling. AnyMo leverages the Nymeria body model to perform physics-grounded IMU simulation densely over body-surface placements, generating diverse and physically plausible synthetic inertial signals. It pre-trains a spatio-temporal graph convolutional encoder using paired synthetic placement views and masked partial observations through a masked cross-view predictive contrastive learning objective, enabling the model to learn robust representations invariant to sensor setup variations.

AnyMo further innovates by tokenizing multi-position IMU data into compact full-body motion tokens via a product-quantized variational autoencoder. These tokens serve as a stable interface to a large language model (LLM), which is extended with IMU token embeddings and fine-tuned through multi-task contrastive instruction tuning. This alignment facilitates open-vocabulary motion-language understanding, supporting zero-shot activity recognition, cross-modal retrieval, and motion captioning.

Experiments support these claims. On zero-shot human activity recognition across 14 unseen datasets, AnyMo achieves an average accuracy improvement of 11.7%, a macro-F1 increase of 11.6%, and a Recall@2 boost of 22.6%. In bidirectional IMU-text retrieval, it improves IMU-to-text and text-to-IMU mean reciprocal rank by 15.9% and 28.6%, respectively. For zero-shot motion captioning, it attains an 18.8% increase in BERT-F1 score. Ablation studies confirm the contributions of geometry-aware simulation, masked cross-view contrastive learning, and tokenization.

AnyMo's contributions advance wearable motion understanding by overcoming setup dependency and bridging inertial sensing with natural language. This enables broader applications in health monitoring, sports analytics, and human-computer interaction. Despite its strengths, AnyMo's reliance on the Nymeria model and limited device noise modeling present challenges for extreme body types and diverse hardware. Future work aims to enhance simulation realism, robustness, and scalability, paving the way for real-time, generalized wearable motion intelligence in the wild.

Deep Analysis

Background

Human motion is a primary indicator of human context and interaction with the environment, essential for developing proactive AI systems that adapt to users' changing states. The advent of wearable and mobile devices equipped with inertial measurement units (IMUs) has enabled continuous sensing of human motion in real-world settings. Traditional motion understanding approaches often rely on visual data or fixed sensor setups, which are impractical for ubiquitous deployment. Recent advances have explored deep learning, graph neural networks, contrastive learning, and multimodal fusion to interpret IMU signals. However, IMU data is inherently dependent on sensor placement, orientation, and hardware characteristics, leading to significant challenges in cross-device and cross-user generalization. Existing synthetic data augmentation methods are limited by sparse sensor placements or fixed activity labels and lack physical and geometric consistency. Moreover, bridging continuous inertial signals with discrete natural language remains an open problem due to modality gaps. AnyMo addresses these challenges by integrating physics-based dense IMU simulation, setup-agnostic representation learning, full-body tokenization, and large language model alignment to achieve robust, generalized human motion understanding.

Core Problem

The core problem tackled by AnyMo is the strong dependency of IMU signals on sensor setup, including body location, mounting position, sensor orientation, device hardware, and sampling protocol. This dependency causes models trained on one setup to perform poorly when transferred to others, limiting scalability and real-world applicability. Additionally, collecting large-scale, richly annotated IMU datasets covering diverse setups is costly and fragmented. Synthetic data augmentation methods often fail to capture the full variability and physical realism of wearable IMUs across body surfaces. Furthermore, the modality gap between continuous multi-sensor inertial data and discrete textual descriptions complicates motion-language alignment, hindering open-vocabulary recognition and generation. Addressing these intertwined challenges requires a framework that can simulate diverse, realistic IMU signals, learn setup-invariant representations, and bridge motion with language effectively.

Innovation

AnyMo's innovations include:

1. Physics-grounded geometry-aware IMU simulation: Using the Nymeria body model, AnyMo simulates IMU signals densely over the body-surface vertices of 23 anatomical segments, incorporating local sensor frames and device noise. This yields a broad distribution of plausible wearable setups rather than a handful of sparse sensor placements.

2. Setup-agnostic spatio-temporal graph encoder pretraining: By sampling paired synthetic placement views and applying masked cross-view predictive contrastive learning, AnyMo trains a graph convolutional network to produce stable motion representations invariant to sensor placement and orientation variations.

3. Full-body IMU tokenization: Employing a product-quantized variational autoencoder, AnyMo discretizes continuous graph encoder outputs into compact IMU token sequences, preserving temporal order and enabling efficient multimodal alignment.

4. Motion-language alignment via multi-task contrastive instruction tuning: Extending the LLM vocabulary with IMU tokens and training with combined contrastive and generative objectives, AnyMo achieves open-vocabulary motion recognition, cross-modal retrieval, and captioning, bridging the modality gap effectively.

Methodology

Physics-Grounded IMU Simulation:
Input: Nymeria body mesh and skeleton motion data.
Process: Select candidate surface vertices per anatomical segment; compute local sensor frames based on surface tangents and normals; simulate IMU signals (acceleration and angular velocity) using second-order derivatives of vertex trajectories, subtracting gravity; add hardware noise estimated from real IMU streams.
Output: Dense synthetic IMU candidates covering diverse wearable locations and orientations.
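The simulation step above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `simulate_imu`, the z-up gravity convention, the small-angle gyroscope approximation, and the white Gaussian noise (standing in for a noise model estimated from real IMU streams) are all assumptions made for illustration.

```python
import numpy as np

GRAVITY = np.array([0.0, 0.0, -9.81])  # assumed world frame: z up, m/s^2


def simulate_imu(positions, rotations, dt, accel_std=0.0, gyro_std=0.0, seed=0):
    """Simulate one virtual IMU attached to a body-surface vertex (hypothetical helper).

    positions: (T, 3) world-frame vertex trajectory in metres.
    rotations: (T, 3, 3) sensor-to-world rotation of the local sensor frame.
    Returns (accel, gyro), each of shape (T - 2, 3).
    """
    rng = np.random.default_rng(seed)
    # Second-order central difference gives world-frame linear acceleration.
    acc_world = (positions[2:] - 2.0 * positions[1:-1] + positions[:-2]) / dt**2
    # Accelerometers sense specific force: subtract gravity, rotate into the sensor frame.
    accel = np.einsum("tji,tj->ti", rotations[1:-1], acc_world - GRAVITY)
    # Relative rotation across two frames; for small angles its skew-symmetric
    # part approximates the rotation vector (assumes slow motion relative to dt).
    rel = np.einsum("tji,tjk->tik", rotations[:-2], rotations[2:])
    skew = 0.5 * (rel - np.swapaxes(rel, 1, 2))
    gyro = np.stack([skew[:, 2, 1], skew[:, 0, 2], skew[:, 1, 0]], axis=1) / (2.0 * dt)
    # White Gaussian noise as a placeholder for a fitted hardware noise model.
    accel = accel + rng.normal(0.0, accel_std, accel.shape)
    gyro = gyro + rng.normal(0.0, gyro_std, gyro.shape)
    return accel, gyro


# Usage: a sensor at rest with identity orientation.
T, dt = 5, 0.01
accel, gyro = simulate_imu(np.zeros((T, 3)), np.tile(np.eye(3), (T, 1, 1)), dt)
```

With the sensor at rest, the simulated accelerometer reads about 9.81 m/s^2 along the sensor's up axis and the gyroscope reads zero, which is the expected behavior of a real IMU lying flat.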

Setup-Agnostic Representation Learning:
Input: Paired synthetic IMU views with different placements and orientations.
Process: Construct spatio-temporal graphs with nodes as segment IMU windows; apply random masking of nodes; encode with graph convolutional network; predict full-view latent from masked-view latent of paired view using a Transformer temporal predictor; optimize with cross-view predictive InfoNCE loss.
Output: Setup-invariant motion representations preserving temporal dynamics.
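The InfoNCE objective named above can be sketched as follows. This shows only the generic loss over a batch of paired embeddings; the masking, graph convolutional encoder, and Transformer predictor are omitted. The function name `info_nce` and the temperature value are assumptions, not details from the paper.

```python
import numpy as np


def info_nce(pred, target, temperature=0.1):
    """Generic InfoNCE: pred[i] should match target[i]; other rows act as negatives.

    pred:   (B, D) latents predicted from the masked view (illustrative input).
    target: (B, D) latents of the paired full view (illustrative input).
    """
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)
    logits = pred @ target.T / temperature              # (B, B) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))            # positives sit on the diagonal


# Usage: correctly paired views score a lower loss than mismatched ones.
views = np.eye(4)
loss_aligned = info_nce(views, views)
loss_shuffled = info_nce(views, np.roll(views, 1, axis=0))
```

The loss is low when each predicted latent is closest to its own paired view and high when the pairing is broken, which is what pushes the encoder toward representations that agree across sensor placements.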

Full-Body IMU Tokenization:
Input: Frozen graph encoder outputs from masked IMU observations.
Process: Project latent features; split into product quantization subspaces; quantize each chunk to nearest codebook vector; decode quantized sequence via temporal convolutional decoder; optimize reconstruction and commitment losses.
Output: Discrete IMU token sequences representing full-body motion.
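The split-and-quantize step can be sketched in NumPy as below. This is a minimal illustration of product quantization under assumed shapes; the name `product_quantize`, the codebook layout `(G, K, d)`, and the toy codebooks are hypothetical, and the projection, decoder, and training losses are left out.

```python
import numpy as np


def product_quantize(latents, codebooks):
    """Quantize each latent chunk to its nearest codebook vector.

    latents:   (T, D) per-time-step features.
    codebooks: (G, K, d) with D == G * d; one K-entry codebook per subspace.
    Returns token indices (T, G) and the quantized latents (T, D).
    """
    T, D = latents.shape
    G, K, d = codebooks.shape
    if D != G * d:
        raise ValueError("latent size must equal num_subspaces * subspace_dim")
    chunks = latents.reshape(T, G, d)
    # Squared Euclidean distance from every chunk to every code: (T, G, K).
    dists = ((chunks[:, :, None, :] - codebooks[None, :, :, :]) ** 2).sum(axis=-1)
    tokens = dists.argmin(axis=-1)
    # Gather the selected code for each (time step, subspace) pair.
    quantized = codebooks[np.arange(G)[None, :], tokens]
    return tokens, quantized.reshape(T, D)


# Usage with two subspaces of three 2-D codes each (toy values).
codebooks = np.array(
    [[[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
     [[0.0, 0.0], [-1.0, -1.0], [-2.0, -2.0]]]
)
latents = np.array([[1.1, 0.9, -2.1, -1.9], [2.0, 2.1, 0.1, 0.0]])
tokens, quantized = product_quantize(latents, codebooks)
```

Each time step yields one index per subspace, so G small codebooks of K entries can represent up to K^G distinct combinations without storing a single codebook of that size.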

Motion-Language Modeling:
Input: IMU token sequences and natural language narrations/activity labels.
Process: Extend LLM vocabulary with IMU tokens; map codebook vectors to LLM embedding space; perform causal language model pretraining; conduct multi-task contrastive instruction tuning combining narration-level and label-level contrastive losses with generative captioning loss.
Output: Aligned motion-language model supporting zero-shot recognition, retrieval, and captioning.
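One way to picture the vocabulary extension is as an ID mapping that places IMU codes after the existing text vocabulary. The sketch below is purely illustrative: the function `imu_tokens_to_vocab_ids`, the per-subspace ID ranges, and the time-major flattening are assumptions, and the source does not specify how AnyMo lays out its IMU token IDs.

```python
def imu_tokens_to_vocab_ids(tokens, base_vocab_size, codebook_size):
    """Map per-subspace code indices to IDs appended after the text vocabulary.

    tokens: list of time steps, each a list with one code index per subspace.
    Assumed layout: subspace g owns IDs
    [base + g * codebook_size, base + (g + 1) * codebook_size).
    """
    ids = []
    for step in tokens:  # keep temporal order
        for group, code in enumerate(step):
            if not 0 <= code < codebook_size:
                raise ValueError(f"code {code} outside codebook of size {codebook_size}")
            ids.append(base_vocab_size + group * codebook_size + code)
    return ids


# Usage: two time steps, two subspaces, a hypothetical 1000-token text vocabulary.
vocab_ids = imu_tokens_to_vocab_ids([[1, 2], [0, 3]], base_vocab_size=1000, codebook_size=4)
```

Under this layout the language model's embedding table would grow by `num_subspaces * codebook_size` rows, and the new rows could be initialized from the mapped codebook vectors as the process above describes.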

Experiments

AnyMo is pretrained on the Nymeria dataset, which provides synchronized body mesh and skeleton motion, atomic-action text annotations, and real IMU streams from the head and wrists. The dense IMU simulation uses the mesh and skeleton data, while the real IMU streams are used to estimate device noise. No downstream datasets are used during training. Evaluation covers three axes:

1. Zero-shot human activity recognition on 14 unseen datasets spanning diverse body locations, devices, sampling protocols, and activity vocabularies, categorized into easy (<10 classes), medium (10–20 classes), and hard (>20 classes) settings.

2. Bidirectional IMU-text cross-modal retrieval on held-out Nymeria subjects and the out-of-domain EgoExo4D dataset.

3. Wearable IMU motion captioning with zero-shot transfer.

Baselines include ImageBind, IMU2CLIP, IMUGPT, HARGPT, UniMTS, NormWear, and Gemma. Metrics include Accuracy, macro-F1, Recall@2 for recognition; Recall@K and MRR for retrieval; BLEU, ROUGE-L, METEOR, and BERT-F1 for captioning. Ablation studies analyze contributions of geometry-aware simulation, masked cross-view learning, and tokenization.
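For readers unfamiliar with the retrieval metrics, the sketch below shows the standard definitions of MRR and Recall@K. It assumes exactly one relevant item per query, which may not match the paper's evaluation protocol; the function names and toy rankings are illustrative.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """Return 1 / rank of the relevant item, or 0.0 if it was not retrieved."""
    for rank, candidate in enumerate(ranked_ids, start=1):
        if candidate == relevant_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings, relevant):
    """Average the reciprocal rank over all queries."""
    return sum(reciprocal_rank(r, g) for r, g in zip(rankings, relevant)) / len(rankings)


def recall_at_k(rankings, relevant, k):
    """Fraction of queries whose relevant item appears in the top k results."""
    hits = sum(1 for r, g in zip(rankings, relevant) if g in r[:k])
    return hits / len(rankings)


# Usage: three queries whose correct item is ranked 1st, 2nd, and 3rd.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
relevant = ["a", "a", "a"]
mrr = mean_reciprocal_rank(rankings, relevant)   # (1 + 1/2 + 1/3) / 3
r_at_2 = recall_at_k(rankings, relevant, k=2)    # 2 of 3 queries hit in the top 2
```

MRR rewards placing the correct match near the top of the ranking, while Recall@K only checks whether it appears within the first K results.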

Results

AnyMo achieves an average accuracy of 35.7% on zero-shot human activity recognition across 14 unseen datasets, outperforming the strongest baseline by 11.7%. Macro-F1 rises by 11.6% to reach 29.5%, and Recall@2 rises by 22.6% to reach 57.5%. In cross-modal retrieval, IMU-to-text and text-to-IMU mean reciprocal rank (MRR) improve by 15.9% and 28.6%, respectively. For zero-shot motion captioning, BERT-F1 increases by 18.8%, indicating more accurate natural language descriptions. Ablations confirm that geometry-aware dense simulation and masked cross-view predictive contrastive learning are critical for robust setup-invariant representations, while full-body tokenization enables effective multimodal alignment.

Applications

AnyMo enables continuous, robust human motion understanding in real-world settings, facilitating applications such as intelligent health monitoring for chronic disease management and rehabilitation through accurate multi-position IMU-based activity recognition. In sports analytics, it supports cross-device and cross-user motion capture and evaluation, aiding performance optimization. For human-computer interaction, AnyMo's motion-language alignment enables natural, motion-based command recognition and generation, enhancing wearable device usability. Additionally, it offers wearable device manufacturers a unified motion understanding foundation model, reducing cross-device adaptation costs and fostering a cohesive smart wearable ecosystem.

Limitations & Outlook

AnyMo's reliance on the Nymeria body model limits its validation on extreme body shapes and atypical motions, potentially affecting synthetic IMU realism and downstream performance. The hardware noise priors used in simulation are derived from limited real IMU data, which may not capture the full spectrum of device variability, impacting robustness. Performance may degrade under extremely sparse or missing sensor data, limiting applicability in constrained sensing scenarios. Computational complexity poses challenges for real-time deployment on resource-constrained devices. These limitations highlight areas for future enhancement.

Abstract

As wearable and mobile devices become increasingly embedded in daily life, they offer a practical way to continuously sense human motion in the wild. But inertial signals are highly dependent on the sensing setup, including body location, mounting position, sensor orientation, device hardware, and sampling protocol. This setup dependence makes it difficult to learn motion representations that transfer across devices and datasets, and limits the broader use of wearable IMUs beyond closed-set recognition. We introduce AnyMo, a geometry-aware framework for setup-agnostic human motion modeling. AnyMo uses physics-grounded IMU simulation over dense body-surface placements to generate diverse and plausible synthetic signals, pre-trains a graph encoder from paired synthetic placement views and masked partial observations, tokenizes multi-position IMU into full-body motion tokens, and aligns these tokens with an LLM for motion-language understanding. We evaluate AnyMo on three complementary tasks: zero-shot activity recognition across 14 unseen downstream datasets, cross-modal retrieval, and wearable IMU motion captioning, where it improves average Accuracy/F1/R@2 by 11.7%/11.6%/22.6% on HAR, increases zero-shot IMU-to-text and text-to-IMU retrieval MRR by 15.9% and 28.6%, respectively, and improves zero-shot captioning BERT-F1 by 18.8%. These results support AnyMo as a generalist model for wearable motion understanding in the wild. Project page: https://baiyuchen.com/project/AnyMo.

cs.CV cs.AI cs.CL cs.HC

