Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

TL;DR

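CUActSpot is a benchmark for complex GUI interactions across five modalities and multiple action types. Trained on data from a renderer-based synthesis pipeline, Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters.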

cs.CV Advanced 2026-05-13

Miaosen Zhang Xiaohan Zhao Zhihong Tan Zhou Huoshen Yijia Fan Yifan Yang Kai Qiu Bei Liu Justin Wagle Chenzhong Yin Mingxi Cheng Ji Li Qi Dai Chong Luo Xu Yang Xin Geng Baining Guo


data synthesis benchmark multimodal complex interaction GUI operations

Key Findings

Methodology

This paper introduces a new benchmark, CUActSpot, to evaluate models' capabilities in complex GUI interactions. CUActSpot covers five modalities: GUI, text, table, canvas, and natural image, supporting various action types (click, drag, draw, etc.). Additionally, a renderer-based data synthesis pipeline is designed to automatically generate scenes, record screenshots and element coordinates, and use a large language model to produce matching instructions and action traces.

Key Results

Result 1: After training on the synthesized corpus, Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters on the CUActSpot benchmark.
Result 2: The data synthesis pipeline produced 50M samples for pre-training or mid-training, and training on them improved the model's multimodal interaction capabilities.
Result 3: Increasing data diversity across modalities improves the model's general interactive capability substantially more than simply scaling the amount of training data within a single modality.

Significance

This study addresses the current lack of evaluation benchmarks and large-scale datasets for complex GUI operations by introducing the CUActSpot benchmark and data synthesis pipeline. This work not only provides academia with a more realistic evaluation tool but also offers data support and technical reference for the industry in developing smarter computer-use agents.

Technical Contribution

Technical contributions include: 1) Introducing a new benchmark, CUActSpot, covering a broader range of interaction types than prior click-centric benchmarks; 2) Designing a renderer-based data synthesis pipeline capable of automatically generating multimodal interaction data; 3) Showing experimentally that data diversity improves the model's general interactive capability.

Novelty

CUActSpot targets complex GUI interactions and covers a wider range of interaction types than previous benchmarks. Unlike existing click-centric benchmarks that focus mainly on GUI widgets, it spans five modalities and actions such as click, drag, and draw, which brings it closer to real-world use and allows a more accurate assessment of a model's practical operational capabilities.

Limitations

Limitation 1: Although CUActSpot covers various interaction types, it may not fully encompass all complex interactions in real-world applications.
Limitation 2: Data generated by the synthesis pipeline may not fully align with real-world data, potentially affecting model performance in real scenarios.
Limitation 3: Current models still face performance bottlenecks when handling extremely complex multimodal interactions.

Future Work

Future directions include: 1) Expanding the CUActSpot benchmark to cover more interaction types in real-world applications; 2) Optimizing the data synthesis pipeline to generate data closer to real-world scenarios; 3) Developing more efficient models to handle extremely complex multimodal interactions.

AI Executive Summary

In recent years, computer-use agents (CUAs) have made significant strides in automating on-screen operations. However, existing models still perform poorly in handling complex, low-frequency interactions, limiting user trust. To address this challenge, researchers have proposed a new benchmark, CUActSpot, to evaluate models' capabilities in complex interactions.

CUActSpot covers five modalities: GUI, text, table, canvas, and natural image, and supports various action types such as clicking, dragging, and drawing. Unlike previous benchmarks that primarily focus on GUI widget clicks, CUActSpot is more aligned with real-world scenarios, allowing for a more accurate assessment of a model's practical operational capabilities.
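To make the difference between click-only and multi-action evaluation concrete, here is a minimal sketch of how a predicted action might be represented and checked. The `Box`, `Action`, and `is_correct` names, and the scoring rule itself (the action type must match and each key point must land inside its target region), are assumptions made for illustration; the summary does not describe CUActSpot's actual scoring.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Box:
    """Axis-aligned target region in screenshot pixel coordinates."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, p: Point) -> bool:
        x, y = p
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class Action:
    """A predicted action: a click has one key point, a drag has two."""
    kind: str
    points: List[Point]


def is_correct(pred: Action, kind: str, targets: List[Box]) -> bool:
    """Hypothetical rule: the kind matches and the i-th point hits the i-th box."""
    if pred.kind != kind or len(pred.points) != len(targets):
        return False
    return all(box.contains(p) for p, box in zip(pred.points, targets))


# Example: drag a value from one table cell to another.
cell_a = Box(100, 200, 180, 230)
cell_b = Box(100, 260, 180, 290)
drag = Action("drag", [(140, 215), (150, 275)])
click = Action("click", [(140, 215)])

print(is_correct(drag, "drag", [cell_a, cell_b]))   # True
print(is_correct(click, "drag", [cell_a, cell_b]))  # False
```

Under this rule a model that only ever clicks fails every drag task even when its click lands on the right element, which is the kind of gap a click-centric benchmark cannot expose.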

To generate data for training complex interactions, researchers designed a renderer-based data synthesis pipeline. This pipeline can automatically generate scenes for each modality, record screenshots and element coordinates, and use a large language model to produce matching instructions and action traces. Using this approach, researchers generated 50M samples for model pre-training or mid-training.

Experimental results show that the trained Phi-Ground-Any-4B model excels in the CUActSpot benchmark, outperforming all open-source models with fewer than 32B parameters. Additionally, the study found that increasing data diversity substantially improves the model's general interactive capability compared to simply scaling the amount of training data within a single modality.

This research not only provides academia with a more realistic evaluation tool but also offers data support and technical reference for the industry in developing smarter computer-use agents. Future research directions include expanding the CUActSpot benchmark to cover more interaction types in real-world applications and optimizing the data synthesis pipeline to generate data closer to real-world scenarios.

Deep Analysis

Background

Computer-use agents (CUAs) are a key direction for enhancing productivity through automated on-screen operations. CUAs are primarily divided into command-line interface (CLI) based and graphical user interface (GUI) based paradigms. Compared to CLI, GUI offers strong cross-platform generalization and user-friendly interaction. However, existing GUI operation models still show significant deficiencies in handling complex interactions, especially in multimodal and low-frequency interaction scenarios. Although several challenging GUI benchmarks have emerged in recent years, they often focus on single-click operations and fail to fully reflect the complex interaction needs of real applications.

Core Problem

Existing GUI operation models perform poorly in handling complex, low-frequency interactions, mainly due to the lack of evaluation benchmarks and large-scale datasets for complex interactions. This deficiency leads to frequent operational failures in real applications, especially in scenarios involving multimodal interactions. To improve the practical operational capabilities of models, there is an urgent need to develop new benchmarks and data generation methods to cover a broader range of interaction types.

Innovation

The core innovations of this paper include: 1) Introducing a new benchmark, CUActSpot, covering five modalities and various action types, allowing for a more accurate assessment of models' capabilities in complex interactions; 2) Designing a renderer-based data synthesis pipeline that can automatically generate multimodal interaction data for model training; 3) Showing experimentally that data diversity improves the model's general interactive capability, which suggests a direction for future research.

Methodology

• Introduce the CUActSpot benchmark, covering five modalities: GUI, text, table, canvas, and natural image.
• Design a data synthesis pipeline to automatically generate scenes for each modality and record screenshots and element coordinates.
• Use a large language model to produce matching instructions and action traces.
• Generate 50M samples for model pre-training or mid-training.
• Experimentally verify that data diversity improves the model's general interactive capability.
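The steps above can be sketched end to end for one modality. This is a toy illustration under stated assumptions, not the paper's pipeline: the scene is a small table, the "renderer" emits SVG text rather than a rasterized screenshot, and a fixed template stands in for the LLM that writes the instruction. All function names are hypothetical.

```python
import random
from typing import Dict, Tuple


def generate_table_scene(rng: random.Random, rows: int = 3, cols: int = 3,
                         cell_w: int = 120, cell_h: int = 40) -> Dict:
    """Lay out a grid of cells and record each cell's pixel box."""
    elements = []
    for r in range(rows):
        for c in range(cols):
            elements.append({
                "name": f"cell R{r + 1}C{c + 1}",
                "text": str(rng.randint(0, 99)),
                "box": (c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h),
            })
    return {"width": cols * cell_w, "height": rows * cell_h, "elements": elements}


def render_svg(scene: Dict) -> str:
    """Stand-in renderer: because we draw the scene, coordinates are known exactly."""
    parts = ['<svg xmlns="http://www.w3.org/2000/svg" '
             f'width="{scene["width"]}" height="{scene["height"]}">']
    for el in scene["elements"]:
        x0, y0, x1, y1 = el["box"]
        parts.append(f'<rect x="{x0}" y="{y0}" width="{x1 - x0}" '
                     f'height="{y1 - y0}" fill="white" stroke="black"/>')
        parts.append(f'<text x="{x0 + 8}" y="{y1 - 12}">{el["text"]}</text>')
    parts.append("</svg>")
    return "\n".join(parts)


def center(box: Tuple[int, int, int, int]) -> Tuple[float, float]:
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def make_drag_sample(scene: Dict, rng: random.Random) -> Dict:
    """Pick two distinct cells; the template below stands in for an LLM call."""
    src, dst = rng.sample(scene["elements"], 2)
    return {
        "instruction": f"Drag the value in {src['name']} onto {dst['name']}.",
        "action": {"type": "drag", "from": center(src["box"]), "to": center(dst["box"])},
    }


rng = random.Random(0)
scene = generate_table_scene(rng)
screenshot = render_svg(scene)
sample = make_drag_sample(scene, rng)
print(sample["instruction"])
print(sample["action"])
```

The point of rendering scenes programmatically is that ground-truth coordinates come for free: no human annotation is needed to know where each element sits, so the instruction and action trace can be generated and checked automatically.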

Experiments

The experimental design includes using the CUActSpot benchmark to evaluate models' capabilities in complex interactions. Researchers generated 50M samples for model pre-training or mid-training and compared the impact of different data compositions on model performance. Various benchmarks, such as ScreenSpot-Pro and UI-Vision, were used to verify model performance in different scenarios. Through ablation studies, researchers analyzed the impact of data diversity and data scale on model performance.
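A comparison of data compositions like the one described here could be set up by holding the sample budget fixed and varying only the modality mix. The sketch below is a hypothetical illustration of that setup: the `build_mixture` helper, the toy pools, and the weights are assumptions, and the summary does not state the compositions the authors actually used.

```python
import random
from collections import Counter
from typing import Dict, List

MODALITIES = ["gui", "text", "table", "canvas", "natural_image"]


def build_mixture(pools: Dict[str, List[dict]], weights: Dict[str, float],
                  budget: int, seed: int = 0) -> List[dict]:
    """Draw `budget` samples, choosing each sample's modality by weight."""
    rng = random.Random(seed)
    names = [m for m in pools if weights.get(m, 0) > 0]
    picks = rng.choices(names, weights=[weights[m] for m in names], k=budget)
    return [rng.choice(pools[m]) for m in picks]


# Toy pools standing in for the synthesized corpora of each modality.
pools = {m: [{"modality": m, "id": i} for i in range(1000)] for m in MODALITIES}

# Same budget, two compositions: one modality only vs. an even mix of all five.
single = build_mixture(pools, {"gui": 1.0}, budget=500)
diverse = build_mixture(pools, {m: 1.0 for m in MODALITIES}, budget=500)

print(Counter(s["modality"] for s in single))
print(Counter(s["modality"] for s in diverse))
```

Keeping the budget identical across the two runs is what lets an ablation attribute any performance difference to diversity rather than to the amount of data.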

Results

Applications

The CUActSpot benchmark and data synthesis pipeline can be directly applied to evaluate and train computer-use agents, especially in scenarios involving multimodal and complex interactions. The industry can use this tool to develop smarter automation software to enhance productivity and user experience. Additionally, the research results can provide new research directions and technical references for academia.

Limitations & Outlook

Although CUActSpot covers various interaction types, it may not fully encompass all complex interactions in real-world applications. Additionally, data generated by the synthesis pipeline may not fully align with real-world data, potentially affecting model performance in real scenarios. Current models still face performance bottlenecks when handling extremely complex multimodal interactions, requiring further optimization.

Plain Language Accessible to non-experts

Imagine you're in a kitchen preparing a large meal. Traditional benchmarks are like only teaching you how to cook a pot of rice, while CUActSpot requires you to prepare an entire table of dishes, including stir-frying, soup-making, and baking. CUActSpot is like a comprehensive recipe guide, instructing you on how to prepare various dishes. To help you master these skills better, researchers designed a data synthesis method, akin to providing a set of virtual kitchen tools, allowing you to practice cooking in a virtual environment. Through continuous practice, you not only master the preparation of each dish but also improve your overall cooking skills. Eventually, you'll be able to prepare various delicacies in any kitchen, not just limited to a specific dish.

ELI14 Explained like you're 14

Hey there! Imagine you're playing a super complex game where you need to control multiple characters at once. Existing game tutorials only teach you how to click one button, but CUActSpot is like a brand new game guide that tests whether you can operate characters in different scenarios, like dragging, drawing, and clicking. To help players get the hang of it faster, the researchers also built a virtual training scenario, like a simulated game environment, where you can repeatedly practice various operations. After some training, you'll not only be able to tackle various challenges in the game with ease but also become a game master, even helping other players solve problems. Isn't that cool?

Glossary

CUActSpot

CUActSpot is a new benchmark for evaluating models' capabilities in complex GUI interactions, covering five modalities and various action types.

Used to assess model performance in complex interactions.

Data Synthesis Pipeline

A renderer-based data generation method capable of automatically generating multimodal interaction data for model training support.

Used for generating training data.

Phi-Ground-Any-4B

A trained model that excels in the CUActSpot benchmark, outperforming all open-source models with fewer than 32B parameters.

The model trained on the synthesized data and evaluated in this work.

Multimodal

Involving multiple data modalities; in CUActSpot these are GUI, text, table, canvas, and natural image.

Used to assess model performance across different modalities.

Complex Interaction

Complex GUI interactions involving various action types such as clicking, dragging, and drawing.

Used to evaluate models' practical operational capabilities.

Large Language Model (LLM)

A model used to generate matching instructions and action traces, supporting the data synthesis pipeline.

Used for generating training data.

ScreenSpot-Pro

An existing GUI grounding benchmark for professional, high-resolution computer use.

Used alongside CUActSpot to verify model performance in other scenarios.

UI-Vision

An existing desktop-centric GUI benchmark for visual perception and interaction.

Used alongside CUActSpot to verify model performance in other scenarios.

Ablation Study

A method of analyzing model performance by removing or altering experimental conditions.

Used to verify the impact of data diversity on model performance.

Natural Image

A data modality involving operations on natural images, such as clicking or dragging over specific image regions.

Used to assess model performance on natural images.

Open Questions Unanswered questions from this research

Open Question 1: Although CUActSpot covers various interaction types, it may not fully encompass all complex interactions in real-world applications. Future research needs to further expand the benchmark to cover more real-world application scenarios.
Open Question 2: Data generated by the synthesis pipeline may not fully align with real-world data, potentially affecting model performance in real scenarios. More advanced data generation methods are needed to improve data authenticity.
Open Question 3: Current models still face performance bottlenecks when handling extremely complex multimodal interactions. Further optimization of model structure and training methods is needed to improve performance.
Open Question 4: How can we improve the model's general interactive capability without increasing computational costs? New model architectures and training strategies need to be explored.
Open Question 5: How can we achieve better knowledge transfer between different modalities? Cross-modal learning methods and techniques need to be studied.

Applications

Immediate Applications

Smart Software Automation

The CUActSpot benchmark and data synthesis pipeline can be used to develop smarter automation software, enhancing productivity and user experience.

Multimodal Interaction Evaluation

The industry can use CUActSpot to evaluate computer-use agents' performance in multimodal interaction scenarios, optimizing product design.

Academic Research Support

CUActSpot and the data synthesis pipeline provide new research directions and technical references for academia, promoting the development of related fields.

Long-term Vision

General Intelligent Agents

Continued refinement of CUActSpot and the data synthesis pipeline is expected to support the development of general intelligent agents capable of handling a wide variety of complex interactions.

Cross-Modal Knowledge Transfer

Future research may achieve knowledge transfer between different modalities, improving model performance in multimodal scenarios.

Abstract

Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark CUActSpot for evaluating models' capabilities on complex interactions across five modalities: GUI, text, table, canvas, and natural image, as well as a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git
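The long-tail claim in the abstract can be made concrete with a short calculation. The numbers below are invented purely for illustration and are not taken from the paper: suppose 10% of interactions are complex and fail 60% of the time, while the remaining 90% are simple and fail 5% of the time.

```python
def failure_share(p_complex: float, fail_complex: float, fail_simple: float) -> float:
    """Fraction of all failures that come from complex interactions."""
    complex_failures = p_complex * fail_complex
    simple_failures = (1.0 - p_complex) * fail_simple
    return complex_failures / (complex_failures + simple_failures)


# Illustrative values only; none of these rates are reported in the source.
share = failure_share(p_complex=0.10, fail_complex=0.60, fail_simple=0.05)
print(f"{share:.2f}")  # 0.57
```

With these assumed rates, one interaction in ten accounts for more than half of all failures, which is the shape of the pattern the abstract describes.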


References (20)

Kanzhi Cheng, Qiushi Sun, Yougang Chu et al. SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents. 2024.

Shravan Nayak, Xiangru Jian, K. Lin et al. UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction. 2025.

Satoshi Suzuki, K. Abe. Topological structural analysis of digitized binary images by border following. 1985.

Boyu Gou, Ruohan Wang, Boyuan Zheng et al. Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents. 2024.

Tianbao Xie, Jiaqi Deng, Xiaochuan Li et al. Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis. 2025.

Haiyang Xu, Xi Zhang, Hao Liu et al. Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents. 2026.

Marah Abdin, Sam Ade Jacobs, A. A. Awan et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. 2024.

Xinyuan Wang, Bowen Wang, Dunjie Lu et al. OpenCUA: Open Foundations for Computer-Use Agents. 2025.

Yuhang Liu, Pengxiang Li, Congkai Xie et al. InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners. 2025.

Haoming Wang, Haoyang Zou, Huatong Song et al. UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning. 2025.

Boyuan Zheng, Boyu Gou, Jihyung Kil et al. GPT-4V(ision) is a Generalist Web Agent, if Grounded. 2024.

Yuhang Liu, Zeyu Liu, Shuanghe Zhu et al. InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization. 2025.

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang et al. AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents. 2024.

Kaixin Li, Ziyang Meng, Hongzhan Lin et al. ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use. 2025.

Tianci Xue, Weijian Qi, Tianneng Shi et al. An Illusion of Progress? Assessing the Current State of Web Agents. 2025.

Shuyan Zhou, Frank F. Xu, Hao Zhu et al. WebArena: A Realistic Web Environment for Building Autonomous Agents. 2023.

Wenyi Hong, Weihan Wang, Qingsong Lv et al. CogAgent: A Visual Language Model for GUI Agents. 2023.

Xiao Liu, Tianjie Zhang, Yu Gu et al. VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents. 2024.

Tianbao Xie, Danyang Zhang, Jixuan Chen et al. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. 2024.

Saaket Agashe, Kyle Wong, Vincent Tu et al. Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents. 2025.

