CalTennis: Large Multi-View Tennis Video Dataset and Benchmark of Monocular-to-3D Pose Estimation

TL;DR

CalTennis is a large multi-view tennis video dataset with over 11 million frames, used to evaluate monocular-to-3D pose estimation, highlighting challenges in depth and foot contact accuracy.

cs.CV 2026-06-19

Ilona Demler, Xinran Xie, Blake Werner, Anna Szczuka, Pietro Perona

Human Pose Estimation, Multi-view Video, Sports Analytics, Depth Estimation, Motion Analysis

Key Findings

Methodology

This study introduces a label-free evaluation framework that uses multi-view consistency to assess the accuracy of monocular 3D human pose estimation. The authors recorded tennis practice and matches with 2-6 synchronized consumer cameras and built an automated calibration and synchronization pipeline that relies on court geometry and temporal alignment. Poses are reconstructed with the SMPL-X model, and the discrepancies between views (in translation, joint angles, and shape) serve as lower bounds on error: two estimates of the same moment that disagree cannot both be correct. Five state-of-the-art monocular pose estimation models (PromptHMR, WHAM, GVHMR, TRAM, and GENMO) were evaluated on the CalTennis dataset using MPJPE and PA-MPJPE together with two new measures, footwork and stability, to reveal failure modes in depth, foot contact, and body shape estimation.
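As a rough illustration of the two standard metrics named above, the following sketch computes MPJPE and PA-MPJPE between two sets of 3D joints, for example the same frame estimated from two different cameras. It assumes both arrays have shape (J, 3), list joints in the same order, and use the same units; the function names are hypothetical and this is not the authors' evaluation code.

```python
import numpy as np

def mpjpe(a, b):
    """Mean per-joint position error between two (J, 3) joint arrays."""
    return float(np.linalg.norm(a - b, axis=1).mean())

def procrustes_align(src, dst):
    """Similarity-align src to dst (rotation, uniform scale, translation)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    s0, d0 = src - mu_s, dst - mu_d
    # SVD of the cross-covariance gives the optimal rotation (Kabsch/Umeyama).
    U, S, Vt = np.linalg.svd(s0.T @ d0)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    sign = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    D = np.diag([1.0, 1.0, sign])
    R = Vt.T @ D @ U.T
    scale = (S * np.diag(D)).sum() / (s0 ** 2).sum()
    return scale * (R @ s0.T).T + mu_d

def pa_mpjpe(a, b):
    """MPJPE after Procrustes alignment of a onto b."""
    return mpjpe(procrustes_align(a, b), b)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    view_a = rng.normal(size=(22, 3))           # synthetic joints, not real data
    view_b = view_a + np.array([0.0, 0.0, 0.5])  # same pose, shifted in depth
    print("MPJPE:", mpjpe(view_a, view_b))
    print("PA-MPJPE:", pa_mpjpe(view_a, view_b))
```

Because PA-MPJPE removes global rotation, scale, and translation before comparing, it isolates articulated pose error, while plain MPJPE also penalizes placement differences.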

Key Results

On CalTennis, joint-angle recovery was relatively accurate (MPJPE around 105 mm), but depth estimates were unstable, with an average error of 942 mm, causing rapid, unrealistic shifts in estimated body position. Foot contact detection was inconsistent across frames and views, and feet were often misclassified as floating or in contact. Body shape estimates varied notably between views, with limb lengths and proportions differing systematically, which affects downstream biomechanical analysis. Overall multi-view consistency was substantially worse than on laboratory datasets, indicating a gap between controlled and real-world scenarios.
Among the evaluated models, PromptHMR achieved the lowest translation and pose errors (0.942 m and 105 mm on average, respectively) but still showed large errors in depth and shape. WHAM performed best on foot velocity and foot height consistency, with a foot velocity error of 0.72 m/s and a foot height error of 0.06 m, owing to its iterative refinement process. GENMO was the most stable in body shape and foot height metrics, yet overall errors remained high compared with lab-based benchmarks. These results underscore the persistent challenge of accurate depth and contact estimation in unconstrained sports scenes.
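To make the translation error concrete, here is a minimal sketch of how cross-view translation disagreement could be measured once cameras are calibrated. It assumes each view yields a root trajectory of shape (T, 3) in its own camera frame and that extrinsics follow the convention x_cam = R @ x_world + t. The function names and that convention are assumptions for illustration, not details taken from the paper.

```python
import numpy as np
from itertools import combinations

def cam_to_world(p_cam, R, t):
    """Map (T, 3) camera-frame points to the world frame.

    Assumes x_cam = R @ x_world + t, so x_world = R.T @ (x_cam - t).
    For row vectors this is (p_cam - t) @ R.
    """
    return (p_cam - t) @ R

def translation_disagreement(roots_cam, extrinsics):
    """Mean pairwise distance between per-view root trajectories in world space."""
    world = [cam_to_world(p, R, t) for p, (R, t) in zip(roots_cam, extrinsics)]
    dists = [np.linalg.norm(a - b, axis=1).mean() for a, b in combinations(world, 2)]
    return float(np.mean(dists))
```

If one view's estimate is pushed 1 m along its optical axis while the other views stay correct, the disagreement with each of those views is 1 m, which is how a pure depth error surfaces in this kind of measure.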
The introduction of footwork and stability metrics provided new insights into model performance. Footwork measures the agreement of foot joint velocities and heights across views, exposing failures in dynamic motion capture. Stability assesses whether the estimated center of mass aligns with grounded foot positions, revealing inconsistencies in motion balance. These metrics uncovered failure modes invisible to traditional error measures, emphasizing the need for models that better integrate multi-view geometric cues and temporal consistency to improve depth perception and motion stability.
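The footwork idea, agreement of foot joint velocities and heights across views, can be sketched as follows. This is an illustrative approximation, not the paper's definition: it assumes both foot trajectories are already expressed in a shared world frame with a known up axis, and it uses simple finite differences at the 60 Hz capture rate reported for the dataset. All names are hypothetical.

```python
import numpy as np

FPS = 60.0  # capture rate reported for CalTennis

def foot_speed(foot_world, fps=FPS):
    """Finite-difference speed (units per second) of a (T, 3) foot trajectory."""
    return np.linalg.norm(np.diff(foot_world, axis=0), axis=1) * fps

def footwork_disagreement(foot_a, foot_b, up_axis=2, fps=FPS):
    """Return (mean speed difference, mean height difference) between two views."""
    speed_err = np.abs(foot_speed(foot_a, fps) - foot_speed(foot_b, fps)).mean()
    height_err = np.abs(foot_a[:, up_axis] - foot_b[:, up_axis]).mean()
    return float(speed_err), float(height_err)
```

Comparing speeds instead of positions makes the measure insensitive to a constant offset between views, so it isolates disagreement in the motion itself, while the height term catches feet that float above or sink below the court in one view.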

Significance

This work marks a significant advancement in human motion analysis by providing a large-scale, real-world, multi-view dataset tailored for sports scenarios, particularly tennis. Unlike traditional MOCAP systems, CalTennis leverages accessible consumer cameras, enabling scalable data collection in natural environments. The evaluation framework based on multi-view consistency offers a practical, label-free approach to benchmark monocular 3D pose estimation, addressing the critical gap between laboratory accuracy and real-world robustness. The dataset and metrics facilitate research into challenging aspects such as depth estimation, foot contact, and body shape consistency, which are vital for applications in sports analytics, injury prevention, and biomechanics. Overall, this study pushes the boundary of monocular pose estimation towards real-world deployment, with broad implications for sports science, robotics, and embodied AI.

Technical Contribution

Key technical innovations include:
• Development of a multi-view, label-free evaluation framework that uses geometric consistency as a lower bound for error, eliminating the need for costly ground-truth annotations;
• Automated camera calibration and synchronization pipeline utilizing court geometry and timestamp alignment, making multi-view data collection accessible with consumer devices;
• Construction of the CalTennis dataset, comprising over 11 million frames capturing diverse tennis motions under natural conditions, providing a rich resource for training and benchmarking;
• Introduction of new metrics (footwork and stability) that quantify motion detail fidelity and pose balance, revealing failure modes in depth and shape estimation that are critical for downstream applications.
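One way court geometry can anchor calibration is by fitting a homography between known court-plane coordinates and their pixel locations. The sketch below uses the standard direct linear transform (DLT) with the four outer corners of a regulation doubles court (23.77 m by 10.97 m). It is a generic technique shown for illustration, not the authors' pipeline: it recovers only the court-plane-to-image mapping, not full camera intrinsics and extrinsics, and the function names are hypothetical.

```python
import numpy as np

COURT_LENGTH_M = 23.77  # regulation court length
COURT_WIDTH_M = 10.97   # regulation doubles width

# Outer court corners in court-plane coordinates (metres).
COURT_CORNERS = np.array([
    [0.0, 0.0],
    [COURT_WIDTH_M, 0.0],
    [COURT_WIDTH_M, COURT_LENGTH_M],
    [0.0, COURT_LENGTH_M],
])

def fit_homography(court_xy, image_uv):
    """Estimate the 3x3 homography mapping court points to pixels via DLT."""
    rows = []
    for (x, y), (u, v) in zip(court_xy, image_uv):
        rows.append([-x, -y, -1.0, 0.0, 0.0, 0.0, u * x, u * y, u])
        rows.append([0.0, 0.0, 0.0, -x, -y, -1.0, v * x, v * y, v])
    # The homography is the null vector of the stacked constraint matrix.
    _, _, Vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def project(H, court_xy):
    """Apply homography H to (N, 2) court points, returning (N, 2) pixels."""
    pts = np.column_stack([court_xy, np.ones(len(court_xy))]) @ H.T
    return pts[:, :2] / pts[:, 2:3]
```

In practice the pixel locations would come from detected court lines or corners; additional keypoints, such as service-line intersections, would over-determine the system and make the fit less sensitive to detection noise.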

Novelty

This research's novelty lies in:
• First large-scale, multi-view, real-world tennis dataset captured with consumer-grade cameras, enabling scalable data collection outside laboratory settings;
• The innovative use of multi-view disagreement as a label-free error metric, bypassing the need for expensive ground-truth data;
• The design of new evaluation metrics that focus on motion dynamics and pose stability, exposing weaknesses in existing monocular models under natural, unconstrained conditions.
These contributions collectively advance the field by providing practical tools and datasets for real-world pose estimation challenges.

Limitations

Despite the large dataset and advanced evaluation framework, the models still struggle with depth estimation stability, especially during rapid movements and occlusions, limiting their immediate applicability in high-precision tasks.
Automated calibration relies on court geometry and assumes minimal calibration errors; in more complex or cluttered environments, calibration inaccuracies could affect evaluation reliability.
Current models exhibit significant cross-view shape and size inconsistencies, indicating a need for better multi-view feature integration and shape priors to improve robustness in natural scenes.

Future Work

Future directions include integrating multi-view geometric cues with deep learning models to enhance depth stability, developing real-time calibration and synchronization methods for live applications, and expanding datasets to include more diverse sports and outdoor activities. Additionally, exploring multi-modal data fusion, such as combining IMU sensors or depth cameras, could further improve pose accuracy and stability. The ultimate goal is to enable robust, real-time, markerless motion capture suitable for widespread sports analytics, injury prevention, and human-computer interaction in uncontrolled environments.

AI Executive Summary

Accurate three-dimensional human pose estimation has long been a cornerstone of motion analysis, with applications spanning healthcare, sports, entertainment, and robotics. Traditional motion capture (MOCAP) systems, while highly precise, are prohibitively expensive, requiring specialized equipment and controlled environments. This limits their deployment in real-world scenarios, especially in outdoor sports or daily activities. Over the past decade, monocular video-based pose estimation has gained traction due to its low cost and ease of use, but its performance in unconstrained environments remains inadequate.

This paper introduces CalTennis, a large-scale multi-view tennis video dataset designed to evaluate monocular-to-3D pose estimation in natural settings. Collected using 2-6 consumer-grade synchronized cameras placed around tennis courts, CalTennis comprises over 11 million frames (51 hours) from 40 players, capturing diverse actions such as serves, volleys, and footwork. The dataset’s scale and multi-view nature enable a label-free evaluation approach based on multi-view geometric consistency, sidestepping the need for expensive ground-truth annotations.

The authors developed an automated calibration and synchronization pipeline that leverages court geometry and timestamp alignment, allowing anyone with basic equipment to replicate data collection. This democratizes large-scale data gathering and fosters broader research participation. Using this setup, five state-of-the-art monocular pose estimation models were evaluated, revealing significant gaps in depth accuracy, foot contact detection, and shape consistency. While joint-angle estimates were relatively accurate, depth errors averaged 942mm, and foot contact detection was often inconsistent, especially during dynamic motions.
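This summary does not spell out how the timestamp alignment is done, so the following is only a minimal sketch of one simple possibility: pairing each frame of a reference camera with the nearest-in-time frame of another camera. It assumes both cameras' timestamps are already on a common clock, sorted, and given in seconds; the function name and the half-frame tolerance are assumptions.

```python
import numpy as np

def match_frames(ts_ref, ts_other, max_gap=1.0 / 120.0):
    """For each reference timestamp, return the index of the nearest frame
    in ts_other, or -1 when the nearest frame is more than max_gap away.

    ts_other must be sorted ascending and contain at least two entries.
    """
    ts_ref = np.asarray(ts_ref, dtype=float)
    ts_other = np.asarray(ts_other, dtype=float)
    # Candidate neighbours are the entries just before and at the insertion point.
    idx = np.clip(np.searchsorted(ts_other, ts_ref), 1, len(ts_other) - 1)
    left, right = ts_other[idx - 1], ts_other[idx]
    nearest = np.where(ts_ref - left <= right - ts_ref, idx - 1, idx)
    gap = np.abs(ts_other[nearest] - ts_ref)
    return np.where(gap <= max_gap, nearest, -1)
```

The default tolerance of half a frame period at 60 Hz means a reference frame is left unmatched when the other camera has no frame close enough, instead of being paired with a frame from a visibly different instant.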

To better understand these shortcomings, the study introduced two new metrics—footwork and stability—that quantify motion detail fidelity and pose balance. These metrics exposed failure modes in existing models, such as oscillations in estimated body position and inconsistent shape reconstructions across views. The findings underscore that, despite progress, current models still face substantial challenges in real-world, high-speed sports scenarios.
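As a rough illustration of the pose-balance idea behind the stability metric, the sketch below measures how far the ground projection of an approximate centre of mass lies from the line segment joining the two feet. It makes strong simplifying assumptions that are not from the paper: the centre of mass is approximated by the unweighted mean of the joints, both feet are treated as grounded, and the support region is reduced to a segment. All names are hypothetical.

```python
import numpy as np

def point_to_segment(p, a, b):
    """Distance from 2D point p to the segment from a to b."""
    ab = b - a
    denom = float(ab @ ab)
    # Clamp the projection parameter so the closest point stays on the segment.
    s = 0.0 if denom == 0.0 else float(np.clip((p - a) @ ab / denom, 0.0, 1.0))
    return float(np.linalg.norm(p - (a + s * ab)))

def stability_offset(joints, left_foot, right_foot, up_axis=2):
    """Horizontal distance between the approximate centre of mass and the
    segment joining the feet. joints is (J, 3); each foot is a length-3 array."""
    ground = [i for i in range(3) if i != up_axis]
    com = joints.mean(axis=0)[ground]  # crude, unweighted centre-of-mass proxy
    return point_to_segment(com, left_foot[ground], right_foot[ground])
```

A large offset in a frame where the player is not mid-stride or airborne suggests the estimated body is leaning or floating in a physically implausible way, which per-joint error measures would not flag.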

The significance of this work lies in its practical approach to large-scale, real-world data collection and evaluation. By enabling inexpensive, scalable data gathering and providing comprehensive benchmarks, it paves the way for more robust, application-ready pose estimation systems. The dataset and evaluation framework will serve as valuable resources for researchers aiming to improve depth perception, motion stability, and shape accuracy, ultimately advancing the deployment of markerless motion capture in sports analytics, injury prevention, and beyond.

Despite these advances, limitations remain. The models' instability in depth estimation during rapid movements and occlusions highlights the need for integrating multi-view cues more effectively. Future work will focus on enhancing robustness, expanding dataset diversity, and exploring multi-modal data fusion to realize real-time, markerless motion capture for widespread use in natural environments.

Deep Dive

Abstract

The Caltech Tennis Dataset (CalTennis) is a large-scale video benchmark for evaluating monocular-to-3D pose estimation in the wild. CalTennis comprises over 11 million frames (51 hours) of tennis practice and match play from 40 players, captured with 2-6 synchronized cameras at 60 Hz. It is 10 times larger than existing in-the-wild human motion video datasets and 3 times larger than existing MOCAP-ground-truthed datasets, and it is the first large-scale benchmark to provide synchronized multi-view recordings of expert athletic motion. The multi-view setup enables inexpensive, label-free evaluation of monocular-to-3D pose estimation algorithms. We describe a simple, standardized protocol that enables data collection without specialized equipment or expertise, along with fully automated video calibration and synchronization. Benchmarking state-of-the-art monocular-to-3D pose methods on CalTennis, we find that while 3D joint angle recovery is now quite accurate, all models struggle to estimate depth and foot contact consistently. We further propose two novel performance metrics, footwork and stability, as well as qualitatively study body shape inconsistency. These metrics expose previously underexplored failure modes and point to concrete opportunities for improvement in pose estimation and action analysis.
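The headline figures in the abstract can be cross-checked with simple arithmetic. The snippet below assumes the 51 hours refers to recorded footage at the stated 60 Hz; under that assumption the frame count comes out just above 11 million, consistent with "over 11 million frames".

```python
HOURS = 51          # total footage reported in the abstract
FPS = 60            # capture rate reported in the abstract
SECONDS_PER_HOUR = 3600

frames = HOURS * SECONDS_PER_HOUR * FPS
print(f"{frames:,} frames")  # 11,016,000 frames
```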
