Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
CUActSpot benchmarks complex GUI interactions across five modalities; a renderer-based data synthesis pipeline supplies the training data, and Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters.
Key Findings
Methodology
This paper introduces a new benchmark, CUActSpot, to evaluate models' capabilities in complex GUI interactions. CUActSpot covers five modalities: GUI, text, table, canvas, and natural image, supporting various action types (click, drag, draw, etc.). Additionally, a renderer-based data synthesis pipeline is designed to automatically generate scenes, record screenshots and element coordinates, and use a large language model to produce matching instructions and action traces.
Key Results
- Result 1: The Phi-Ground-Any-4B model excels in the CUActSpot benchmark, outperforming all open-source models with fewer than 32B parameters, demonstrating significant advantages in complex interaction tasks.
- Result 2: Training on the 50M samples generated by the data synthesis pipeline improved the model, particularly its multimodal interaction capabilities.
- Result 3: Experiments show that increasing data diversity substantially improves the model's general interactive capability compared to simply scaling the amount of training data within a single modality.
Significance
This study addresses the current lack of evaluation benchmarks and large-scale datasets for complex GUI operations by introducing the CUActSpot benchmark and data synthesis pipeline. This work not only provides academia with a more realistic evaluation tool but also offers data support and technical reference for the industry in developing smarter computer-use agents.
Technical Contribution
Technical contributions include: 1) Introducing a new benchmark, CUActSpot, covering a broader range of interaction types; 2) Designing a renderer-based data synthesis pipeline capable of automatically generating multimodal interaction data; 3) Experimentally verifying that data diversity improves the model's general interactive capability.
Novelty
CUActSpot targets complex GUI interactions and covers a wider range of interaction types than previous click-centric benchmarks, which focus mainly on GUI widgets. This brings it closer to real-world usage and allows a more accurate assessment of a model's practical operational capabilities.
Limitations
- Limitation 1: Although CUActSpot covers various interaction types, it may not fully encompass all complex interactions in real-world applications.
- Limitation 2: Data generated by the synthesis pipeline may not fully align with real-world data, potentially affecting model performance in real scenarios.
- Limitation 3: Current models still face performance bottlenecks when handling extremely complex multimodal interactions.
Future Work
Future directions include: 1) Expanding the CUActSpot benchmark to cover more interaction types in real-world applications; 2) Optimizing the data synthesis pipeline to generate data closer to real-world scenarios; 3) Developing more efficient models to handle extremely complex multimodal interactions.
AI Executive Summary
In recent years, computer-use agents (CUAs) have made significant strides in automating on-screen operations. However, existing models still perform poorly in handling complex, low-frequency interactions, limiting user trust. To address this challenge, researchers have proposed a new benchmark, CUActSpot, to evaluate models' capabilities in complex interactions.
CUActSpot covers five modalities: GUI, text, table, canvas, and natural image, and supports various action types such as clicking, dragging, and drawing. Unlike previous benchmarks that primarily focus on GUI widget clicks, CUActSpot is more aligned with real-world scenarios, allowing for a more accurate assessment of a model's practical operational capabilities.
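To make the action space concrete, the following is a minimal sketch of how click, drag, and draw actions over the five modalities could be represented as records. The class names, field names, and the `to_record` helper are assumptions made for illustration; the source does not describe CUActSpot's actual data format.

```python
from dataclasses import dataclass, asdict
from typing import List, Tuple, Union

Point = Tuple[int, int]  # (x, y) in screenshot pixels

# The five modalities named in the text; the string labels are illustrative.
MODALITIES = ("gui", "text", "table", "canvas", "natural_image")


@dataclass(frozen=True)
class Click:
    point: Point


@dataclass(frozen=True)
class Drag:
    start: Point
    end: Point


@dataclass(frozen=True)
class Draw:
    path: List[Point]  # ordered points the pointer passes through


Action = Union[Click, Drag, Draw]


def to_record(modality: str, instruction: str, action: Action) -> dict:
    """Bundle one instruction and its target action into a plain dict."""
    if modality not in MODALITIES:
        raise ValueError(f"unknown modality: {modality}")
    return {
        "modality": modality,
        "instruction": instruction,
        "action_type": type(action).__name__.lower(),
        "action": asdict(action),
    }


record = to_record("table", "Select cells A1 to B3", Drag((10, 20), (110, 80)))
```

Keeping one small type per action makes it explicit that a drag needs two points and a draw needs a whole path, which is exactly where click-centric formats fall short.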
To generate data for training complex interactions, researchers designed a renderer-based data synthesis pipeline. This pipeline can automatically generate scenes for each modality, record screenshots and element coordinates, and use a large language model to produce matching instructions and action traces. Using this approach, researchers generated 50M samples for model pre-training or mid-training.
Experimental results show that the trained Phi-Ground-Any-4B model excels in the CUActSpot benchmark, outperforming all open-source models with fewer than 32B parameters. Additionally, the study found that increasing data diversity substantially improves the model's general interactive capability compared to simply scaling the amount of training data within a single modality.
This research not only provides academia with a more realistic evaluation tool but also offers data support and technical reference for the industry in developing smarter computer-use agents. Future research directions include expanding the CUActSpot benchmark to cover more interaction types in real-world applications and optimizing the data synthesis pipeline to generate data closer to real-world scenarios.
Deep Analysis
Background
Computer-use agents (CUAs) are a key direction for enhancing productivity through automated on-screen operations. Traditional CUAs are primarily divided into command-line interface (CLI) and graphical user interface (GUI) based paradigms. Compared to CLI, GUI offers strong cross-platform generalization and user-friendly interaction. However, existing GUI operation models still show significant deficiencies in handling complex interactions, especially in multimodal and low-frequency interaction scenarios. Although several challenging GUI benchmarks have emerged in recent years, they often focus on single-click operations and fail to fully reflect the complex interaction needs in real applications.
Core Problem
Existing GUI operation models perform poorly in handling complex, low-frequency interactions, mainly due to the lack of evaluation benchmarks and large-scale datasets for complex interactions. This deficiency leads to frequent operational failures in real applications, especially in scenarios involving multimodal interactions. To improve the practical operational capabilities of models, there is an urgent need to develop new benchmarks and data generation methods to cover a broader range of interaction types.
Innovation
The core innovations of this paper include: 1) Introducing a new benchmark, CUActSpot, covering five modalities and various action types, allowing for a more accurate assessment of models' capabilities in complex interactions; 2) Designing a renderer-based data synthesis pipeline that can automatically generate multimodal interaction data, providing rich data support for model training; 3) Experimentally verifying that data diversity improves the model's general interactive capability, which suggests a direction for future research.
Methodology
- Introduce the CUActSpot benchmark, covering five modalities: GUI, text, table, canvas, and natural image.
- Design a data synthesis pipeline to automatically generate scenes for each modality, record screenshots and element coordinates.
- Use a large language model to produce matching instructions and action traces.
- Generate 50M samples for model pre-training or mid-training.
- Experimentally verify the improvement of model general interactive capability through data diversity.
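The pipeline steps above can be sketched for a single modality as follows. This is a simplified illustration, not the authors' implementation: the scene is a table grid whose layout stands in for the renderer, no screenshot is produced, and a fixed text template stands in for the LLM that writes instructions. All names (`Element`, `layout_table`, `synthesize_sample`) and the trace event vocabulary are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Element:
    name: str
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

    @property
    def center(self) -> Tuple[int, int]:
        left, top, right, bottom = self.bbox
        return ((left + right) // 2, (top + bottom) // 2)


def layout_table(rows: int, cols: int, cell_w: int = 80, cell_h: int = 24) -> List[Element]:
    """Lay out a grid of cells and record each cell's bounding box (cols <= 26)."""
    cells = []
    for r in range(rows):
        for c in range(cols):
            name = f"{chr(ord('A') + c)}{r + 1}"  # spreadsheet-style name, e.g. "B2"
            bbox = (c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h)
            cells.append(Element(name, bbox))
    return cells


def synthesize_sample(rng: random.Random, rows: int = 5, cols: int = 4) -> Dict:
    """Generate one scene, its element coordinates, an instruction, and a trace."""
    cells = layout_table(rows, cols)
    src, dst = rng.sample(cells, 2)  # two distinct cells
    # Template stand-in for the LLM-written instruction.
    instruction = f"Drag from cell {src.name} to cell {dst.name}."
    trace = [
        {"type": "mouse_down", "at": src.center},
        {"type": "mouse_move", "at": dst.center},
        {"type": "mouse_up", "at": dst.center},
    ]
    return {
        "modality": "table",
        "elements": {cell.name: cell.bbox for cell in cells},
        "instruction": instruction,
        "trace": trace,
    }


sample = synthesize_sample(random.Random(0))
```

Because the generator owns the layout, every element's coordinates are known exactly, so the action trace is correct by construction and needs no manual annotation.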
Experiments
The experimental design includes using the CUActSpot benchmark to evaluate models' capabilities in complex interactions. Researchers generated 50M samples for model pre-training or mid-training and compared the impact of different data compositions on model performance. Various benchmarks, such as ScreenSpot-Pro and UI-Vision, were used to verify model performance in different scenarios. Through ablation studies, researchers analyzed the impact of data diversity and data scale on model performance.
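As an illustration of how predictions on such a benchmark might be scored, the sketch below counts a click as correct when it lands inside the target bounding box and a drag as correct when both endpoints fall within a pixel tolerance of the reference. These rules, the `tol` default, and the function names are assumptions; the source does not state CUActSpot's actual scoring criteria.

```python
from math import hypot
from typing import Iterable, Tuple

Point = Tuple[float, float]
BBox = Tuple[float, float, float, float]  # (left, top, right, bottom)


def click_hit(pred: Point, target: BBox) -> bool:
    """A click counts as correct if it lands inside the target box."""
    left, top, right, bottom = target
    x, y = pred
    return left <= x <= right and top <= y <= bottom


def drag_hit(pred_start: Point, pred_end: Point,
             gt_start: Point, gt_end: Point, tol: float = 10.0) -> bool:
    """A drag counts as correct if both endpoints are within `tol` pixels."""
    start_err = hypot(pred_start[0] - gt_start[0], pred_start[1] - gt_start[1])
    end_err = hypot(pred_end[0] - gt_end[0], pred_end[1] - gt_end[1])
    return start_err <= tol and end_err <= tol


def accuracy(outcomes: Iterable[bool]) -> float:
    """Fraction of correct predictions; 0.0 for an empty list."""
    outcomes = list(outcomes)
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

A drag is stricter than a click under these rules because two points must be right at once, which is one plausible reason multi-point actions are harder to get credit for.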
Results
On CUActSpot, the trained Phi-Ground-Any-4B model outperforms all open-source models with fewer than 32B parameters. The ablation studies further indicate that increasing data diversity improves the model's general interactive capability more than simply scaling the amount of training data within a single modality.
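The diversity-versus-scale comparison can be framed as a set of training mixtures built under a sample budget. The sketch below sets up three hypothetical arms: one modality at the base budget, the same modality at twice the budget, and all five modalities sharing the base budget. The arm names, the even split, and the 2x factor are assumptions for illustration and are not taken from the paper's ablation setup.

```python
from typing import Dict, Sequence

ALL_MODALITIES = ["gui", "text", "table", "canvas", "natural_image"]


def allocate(budget: int, modalities: Sequence[str]) -> Dict[str, int]:
    """Split a fixed sample budget as evenly as possible across modalities."""
    if not modalities:
        raise ValueError("at least one modality is required")
    base, extra = divmod(budget, len(modalities))
    # The first `extra` modalities each receive one leftover sample.
    return {m: base + (1 if i < extra else 0) for i, m in enumerate(modalities)}


def ablation_arms(budget: int) -> Dict[str, Dict[str, int]]:
    """Build three training mixtures that separate scale from diversity."""
    return {
        "single_modality": allocate(budget, ["gui"]),
        "single_modality_2x": allocate(2 * budget, ["gui"]),
        "diverse": allocate(budget, ALL_MODALITIES),
    }


arms = ablation_arms(1002)
```

Comparing `diverse` against `single_modality` holds the budget fixed and varies only diversity, while `single_modality_2x` varies only scale, so the two effects can be read off separately.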
Applications
The CUActSpot benchmark and data synthesis pipeline can be directly applied to evaluate and train computer-use agents, especially in scenarios involving multimodal and complex interactions. The industry can use this tool to develop smarter automation software to enhance productivity and user experience. Additionally, the research results can provide new research directions and technical references for academia.
Limitations & Outlook
Although CUActSpot covers various interaction types, it may not fully encompass all complex interactions in real-world applications. Additionally, data generated by the synthesis pipeline may not fully align with real-world data, potentially affecting model performance in real scenarios. Current models still face performance bottlenecks when handling extremely complex multimodal interactions, requiring further optimization.
Plain Language (accessible to non-experts)
Imagine you're in a kitchen preparing a large meal. Traditional benchmarks are like only teaching you how to cook a pot of rice, while CUActSpot requires you to prepare an entire table of dishes, including stir-frying, soup-making, and baking. CUActSpot is like a comprehensive recipe guide, instructing you on how to prepare various dishes. To help you master these skills better, researchers designed a data synthesis method, akin to providing a set of virtual kitchen tools, allowing you to practice cooking in a virtual environment. Through continuous practice, you not only master the preparation of each dish but also improve your overall cooking skills. Eventually, you'll be able to prepare various delicacies in any kitchen, not just limited to a specific dish.
ELI14 (explained like you're 14)
Hey there! Imagine you're playing a super complex game where you need to control multiple characters at once. Existing game tutorials only teach you how to click one button, but CUActSpot is like a brand new game guide that tests how well you operate characters in different scenarios, like dragging, drawing, and clicking. To help you get the hang of it faster, the researchers also designed a virtual training scenario, like a simulated game environment, where you can repeatedly practice various operations. After some training, you'll not only be able to tackle various challenges in the game with ease but also become a game master, even helping other players solve problems. Isn't that cool?
Glossary
CUActSpot
CUActSpot is a new benchmark for evaluating models' capabilities in complex GUI interactions, covering five modalities and various action types.
Used to assess model performance in complex interactions.
Data Synthesis Pipeline
A renderer-based data generation method that automatically produces multimodal interaction data for model training.
Used for generating training data.
Phi-Ground-Any-4B
A trained model that excels in the CUActSpot benchmark, outperforming all open-source models with fewer than 32B parameters.
The model trained on the synthesized data and evaluated on CUActSpot.
Multimodal
Involving multiple data modalities; in CUActSpot these are GUI, text, table, canvas, and natural image.
Used to assess model performance across different modalities.
Complex Interaction
Complex GUI interactions involving various action types such as clicking, dragging, and drawing.
Used to evaluate models' practical operational capabilities.
Large Language Model (LLM)
A model used to generate matching instructions and action traces, supporting the data synthesis pipeline.
Used for generating training data.
ScreenSpot-Pro
An existing GUI grounding benchmark for professional, high-resolution computer use.
Used alongside CUActSpot to verify model performance in different scenarios.
UI-Vision
An existing desktop-centric GUI benchmark for visual perception and interaction.
Used alongside CUActSpot to verify model performance in different scenarios.
Ablation Study
A method of attributing model performance to individual factors by removing or altering one component or condition at a time.
Used to verify the impact of data diversity on model performance.
Natural Image
A data modality involving operations on natural images, such as clicking or dragging over specific image regions.
Used to assess model performance on natural images.
Open Questions (unanswered questions from this research)
- Open Question 1: Although CUActSpot covers various interaction types, it may not fully encompass all complex interactions in real-world applications. Future research needs to further expand the benchmark to cover more real-world application scenarios.
- Open Question 2: Data generated by the synthesis pipeline may not fully align with real-world data, potentially affecting model performance in real scenarios. More advanced data generation methods are needed to improve data authenticity.
- Open Question 3: Current models still face performance bottlenecks when handling extremely complex multimodal interactions. Further optimization of model structure and training methods is needed to improve performance.
- Open Question 4: How can we improve the model's general interactive capability without increasing computational costs? New model architectures and training strategies need to be explored.
- Open Question 5: How can we achieve better knowledge transfer between different modalities? Cross-modal learning methods and techniques need to be studied.
Applications
Immediate Applications
Smart Software Automation
The CUActSpot benchmark and data synthesis pipeline can be used to develop smarter automation software, enhancing productivity and user experience.
Multimodal Interaction Evaluation
The industry can use CUActSpot to evaluate computer-use agents' performance in multimodal interaction scenarios, optimizing product design.
Academic Research Support
CUActSpot and the data synthesis pipeline provide new research directions and technical references for academia, promoting the development of related fields.
Long-term Vision
General Intelligent Agents
Continued refinement of CUActSpot and the data synthesis pipeline is expected to support the development of general intelligent agents capable of handling a wide range of complex interactions.
Cross-Modal Knowledge Transfer
Future research may achieve knowledge transfer between different modalities, improving model performance in multimodal scenarios.
Abstract
Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark CUActSpot for evaluating models' capabilities on complex interactions across five modalities: GUI, text, table, canvas, and natural image, as well as a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git
References (20)
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
Kanzhi Cheng, Qiushi Sun, Yougang Chu et al.
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
Shravan Nayak, Xiangru Jian, K. Lin et al.
Topological structural analysis of digitized binary images by border following
Satoshi Suzuki, K. Abe
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
Boyu Gou, Ruohan Wang, Boyuan Zheng et al.
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
Tianbao Xie, Jiaqi Deng, Xiaochuan Li et al.
Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
Haiyang Xu, Xi Zhang, Hao Liu et al.
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Marah Abdin, Sam Ade Jacobs, A. A. Awan et al.
OpenCUA: Open Foundations for Computer-Use Agents
Xinyuan Wang, Bowen Wang, Dunjie Lu et al.
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
Yuhang Liu, Pengxiang Li, Congkai Xie et al.
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
Haoming Wang, Haoyang Zou, Huatong Song et al.
GPT-4V(ision) is a Generalist Web Agent, if Grounded
Boyuan Zheng, Boyu Gou, Jihyung Kil et al.
InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization
Yuhang Liu, Zeyu Liu, Shuanghe Zhu et al.
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
Christopher Rawles, Sarah Clinckemaillie, Yifan Chang et al.
ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
Kaixin Li, Ziyang Meng, Hongzhan Lin et al.
An Illusion of Progress? Assessing the Current State of Web Agents
Tianci Xue, Weijian Qi, Tianneng Shi et al.
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou, Frank F. Xu, Hao Zhu et al.
CogAgent: A Visual Language Model for GUI Agents
Wenyi Hong, Weihan Wang, Qingsong Lv et al.
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
Xiao Liu, Tianjie Zhang, Yu Gu et al.
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Tianbao Xie, Danyang Zhang, Jixuan Chen et al.
Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents
Saaket Agashe, Kyle Wong, Vincent Tu et al.