ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

Key Findings

Methodology

ToolCUA employs a staged training paradigm. Initially, an Interleaved GUI-Tool Trajectory Scaling Pipeline generates diverse GUI-Tool trajectories. This is followed by Tool-Bootstrapped GUI Reinforcement Finetuning (RFT), combining warmup Supervised Finetuning (SFT) with single-turn Reinforcement Learning (RL) to improve decisions at critical GUI-Tool switching points. Finally, ToolCUA is optimized with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths.

Key Results

ToolCUA achieves 46.85% accuracy on OSWorld-MCP, a relative improvement of approximately 66% over the baseline, setting a new state of the art among models of comparable scale.
ToolCUA also improves by 3.9% over GUI-only settings, demonstrating effective GUI-Tool orchestration.
ToolCUA shows excellent out-of-distribution generalization, achieving 23.9% accuracy on unseen tasks, indicating strong cross-task and cross-platform generalization capabilities.
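
The relative-improvement figure can be unpacked with a little arithmetic. The sketch below backs out the baseline accuracy implied by the two reported numbers. Because the gain is only reported as "approximately 66%", the implied baseline is an estimate, not a figure taken from the paper.

```python
# Reported figures (from the summary above).
toolcua_acc = 46.85    # accuracy on OSWorld-MCP, in percent
relative_gain = 0.66   # "approximately 66%" relative improvement

# relative_gain = (toolcua_acc - baseline) / baseline
# => baseline = toolcua_acc / (1 + relative_gain)
implied_baseline = toolcua_acc / (1 + relative_gain)
absolute_gain = toolcua_acc - implied_baseline

print(f"implied baseline accuracy: about {implied_baseline:.1f}%")
print(f"absolute gain: about {absolute_gain:.1f} percentage points")
```

Under these assumptions the baseline sits near 28%, so the gain is roughly 19 percentage points in absolute terms.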

Significance

The significance of ToolCUA lies in providing a new paradigm for path selection in computer use agents, addressing the confusion in hybrid action spaces faced by traditional methods. By introducing interleaved GUI-Tool trajectories and a Tool-Efficient Path Reward, ToolCUA not only enhances task completion rates but also significantly shortens execution paths, showcasing potential in real-world digital automation.

Technical Contribution

ToolCUA's technical contributions include its staged training paradigm and tool-bootstrapped reinforcement learning strategy. Unlike existing methods, ToolCUA optimizes GUI-Tool switching decisions at the trajectory level and achieves more efficient path selection through a Tool-Efficient Path Reward. Additionally, the results suggest that training agents in a hybrid action space is a promising paradigm for real-world digital agents.

Novelty

ToolCUA is the first to propose an Interleaved GUI-Tool Trajectory Scaling Pipeline combined with a tool-bootstrapped reinforcement learning strategy to address path selection in hybrid action spaces. Compared to existing methods, ToolCUA provides more detailed supervision at the trajectory level, significantly improving task completion efficiency.

Limitations

ToolCUA still has room for improvement in tool invocation accuracy, especially when tools are unavailable or unstable.
In some complex tasks, ToolCUA may still rely on lengthy GUI operations, failing to fully leverage tool invocation.

Future Work

Future research could explore further improving ToolCUA's tool invocation efficiency in complex tasks and validating its generalization capabilities in more diverse application scenarios. Additionally, incorporating more environmental feedback signals could help further optimize path selection.

AI Executive Summary

Computer Use Agents (CUAs) are becoming increasingly important in modern digital workflows. However, traditional CUAs primarily rely on atomic GUI actions, such as clicking and scrolling, which, while broadly applicable, are prone to cascading errors in long-horizon tasks. On the other hand, structured tool calls offer superior efficiency and precision, but their application is limited by service coverage and stability. Therefore, a hybrid GUI-Tool action space is essential for next-generation CUAs.

ToolCUA addresses this issue through a staged training paradigm. First, the authors introduce an Interleaved GUI-Tool Trajectory Scaling Pipeline, which repurposes abundant static GUI trajectories and synthesizes a tool library to generate diverse GUI-Tool trajectories. Then, Tool-Bootstrapped GUI Reinforcement Finetuning (RFT) combines warmup Supervised Finetuning (SFT) with single-turn Reinforcement Learning (RL) to improve decisions at critical GUI-Tool switching points. Finally, ToolCUA is optimized with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths.

Experimental results show that ToolCUA achieves 46.85% accuracy on OSWorld-MCP, a relative improvement of approximately 66% over the baseline, setting a new state of the art among models of comparable scale. Additionally, ToolCUA improves by 3.9% over GUI-only settings, demonstrating effective GUI-Tool orchestration. ToolCUA also shows excellent out-of-distribution generalization, achieving 23.9% accuracy on unseen tasks, indicating strong cross-task and cross-platform generalization capabilities.

The significance of ToolCUA lies in providing a new paradigm for path selection in computer use agents, addressing the confusion in hybrid action spaces faced by traditional methods. By introducing interleaved GUI-Tool trajectories and a Tool-Efficient Path Reward, ToolCUA not only enhances task completion rates but also significantly shortens execution paths, showcasing potential in real-world digital automation.

However, ToolCUA still has room for improvement in tool invocation accuracy, especially when tools are unavailable or unstable. In some complex tasks, ToolCUA may still rely on lengthy GUI operations, failing to fully leverage tool invocation. Future research could explore further improving ToolCUA's tool invocation efficiency in complex tasks and validating its generalization capabilities in more diverse application scenarios. Additionally, incorporating more environmental feedback signals could help further optimize path selection.

Deep Analysis

Background

With the rapid evolution of Multimodal Large Language Models (MLLMs), Computer Use Agents (CUAs) have become a frontier topic for automating native desktop workflows. Traditionally, CUAs primarily rely on atomic GUI actions, such as clicking and scrolling, which, while broadly applicable, are prone to cascading errors in long-horizon tasks. In contrast, structured tool calls provide superior efficiency and precision. For example, a file operation can be completed by a single API call, whereas a pure GUI solution requires a long sequence of clicks and types. However, tool-based APIs are constrained by service coverage and stability, limiting applicability in diverse scenarios. Therefore, a hybrid GUI-Tool action space is essential for next-generation CUAs.
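
The file-operation contrast above can be made concrete with a small sketch of a hybrid action space. The class names, field names, and the `rename_file` tool are hypothetical, chosen only for illustration. They are not ToolCUA's actual action schema.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class GuiAction:
    """One atomic GUI step (hypothetical schema)."""
    kind: str          # e.g. "click", "type", "press"
    target: str = ""   # UI element or key the step acts on
    text: str = ""     # text to enter, if any


@dataclass
class ToolCall:
    """One structured tool invocation (hypothetical schema)."""
    name: str
    arguments: dict = field(default_factory=dict)


# A hybrid action space lets the agent emit either kind of action.
Action = Union[GuiAction, ToolCall]

# Renaming a file through the GUI takes several atomic steps...
gui_path = [
    GuiAction("right_click", target="report.txt"),
    GuiAction("click", target="Rename"),
    GuiAction("type", text="report_final.txt"),
    GuiAction("press", target="Enter"),
]

# ...while an assumed file tool does it in one call.
tool_path = [
    ToolCall("rename_file", {"src": "report.txt", "dst": "report_final.txt"}),
]

print(len(gui_path), "GUI steps vs", len(tool_path), "tool call")
```

Each extra GUI step is another chance to misclick, which is why long GUI-only paths are prone to cascading errors.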

Core Problem

Although GUI actions and tool calls are complementary, simply exposing both action spaces to an MLLM does not solve the problem. In practice, agents are often confused about when to use GUI actions and when to invoke tools, leading to suboptimal execution paths. Existing approaches fall short in two fundamental aspects. First, there is a lack of high-quality interleaved GUI-Tool trajectories, resulting in a deficit in tool-calling knowledge. Second, existing supervision provides limited guidance for GUI-Tool path selection, as most methods focus on step-level action imitation or final task completion and offer little trajectory-level feedback on whether GUI-Tool switching leads to a more effective execution path.

Innovation

ToolCUA addresses these challenges through a staged training paradigm. First, the authors introduce an Interleaved GUI-Tool Trajectory Scaling Pipeline, which repurposes abundant static GUI trajectories and synthesizes a tool library to generate diverse GUI-Tool trajectories. Then, Tool-Bootstrapped GUI Reinforcement Finetuning (RFT) combines warmup Supervised Finetuning (SFT) with single-turn Reinforcement Learning (RL) to improve decisions at critical GUI-Tool switching points. Finally, ToolCUA is optimized with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths.

Methodology

1. Introduce an Interleaved GUI-Tool Trajectory Scaling Pipeline to generate diverse GUI-Tool trajectories.
2. Apply Tool-Bootstrapped GUI Reinforcement Finetuning (RFT), combining warmup SFT with single-turn RL to improve decisions at critical switching points.
3. Optimize ToolCUA with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward.

Experiments

Experiments were conducted on OSWorld-MCP, using Qwen3-VL-8B-Instruct as the baseline model. The training included three stages: warmup SFT, single-turn RL, and online agentic RL. Evaluation metrics included accuracy, Tool Invocation Rate (TIR), and Average Completion Steps (ACS). Results showed that ToolCUA set a new state of the art among models of comparable scale and demonstrated excellent out-of-distribution generalization.
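
As a rough illustration of the three metrics, the sketch below computes them over a handful of made-up episodes. The definitions are assumptions, since this summary does not spell them out: accuracy is the fraction of successful episodes, TIR is the fraction of episodes with at least one tool call, and ACS is the mean number of steps per episode. The benchmark's official definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Outcome of one task rollout (hypothetical record layout)."""
    success: bool
    steps: int
    tool_calls: int


def summarize(episodes):
    """Compute accuracy, TIR, and ACS under the assumed definitions."""
    n = len(episodes)
    return {
        "accuracy": sum(e.success for e in episodes) / n,
        "tir": sum(e.tool_calls > 0 for e in episodes) / n,
        "acs": sum(e.steps for e in episodes) / n,
    }


# Made-up episodes for illustration only; not data from the paper.
episodes = [
    Episode(success=True, steps=6, tool_calls=2),
    Episode(success=False, steps=15, tool_calls=0),
    Episode(success=True, steps=9, tool_calls=1),
    Episode(success=True, steps=10, tool_calls=0),
]

metrics = summarize(episodes)
print(metrics)
```

Read together, a higher TIR alongside a lower ACS would indicate that tool calls are actually shortening execution paths.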

Results

ToolCUA achieves 46.85% accuracy on OSWorld-MCP, a relative improvement of approximately 66% over the baseline. It also improves by 3.9% over GUI-only settings, demonstrating effective GUI-Tool orchestration. ToolCUA shows excellent out-of-distribution generalization, achieving 23.9% accuracy on unseen tasks, indicating strong cross-task and cross-platform generalization capabilities.

Applications

ToolCUA can be used for automating desktop workflows, particularly in scenarios requiring efficient path selection. Its hybrid action space makes it broadly applicable across diverse application scenarios, significantly improving task completion efficiency.

Limitations & Outlook

ToolCUA still has room for improvement in tool invocation accuracy, especially when tools are unavailable or unstable. In some complex tasks, ToolCUA may still rely on lengthy GUI operations, failing to fully leverage tool invocation. Future research could explore further improving ToolCUA's tool invocation efficiency in complex tasks and validating its generalization capabilities in more diverse application scenarios.

Plain Language: Accessible to non-experts

Imagine you're in a kitchen cooking a meal. Traditionally, you would chop vegetables, fry them, and season them yourself. This is like using GUI actions, where you can complete all the steps but it takes a lot of time and effort. ToolCUA is like a smart kitchen assistant that not only chops the vegetables for you but also adjusts the heat and seasoning according to the recipe. This way, you only need to do a few simple operations to complete a delicious dinner.

ToolCUA achieves more efficient path selection by combining GUI actions and tool invocation. Just like in the kitchen, you can choose to chop vegetables with a knife (GUI action) or use a vegetable chopper (tool invocation). When you need to complete tasks quickly, the vegetable chopper is clearly the better choice.

However, sometimes the vegetable chopper may not be flexible enough, such as when you need to cut vegetables into special shapes. In this case, you need to go back to manual operation. This is like ToolCUA still needing to rely on GUI actions in some complex tasks.

Overall, ToolCUA achieves more efficient task completion by intelligently choosing when to use GUI actions and tool invocation, just like using tools wisely in the kitchen can greatly improve cooking efficiency.

ELI14: Explained like you're 14

Hey there! Today, I'm going to tell you about something super cool called ToolCUA. Imagine you're playing a super complex game where you have to keep clicking and dragging to complete tasks. That's like traditional GUI actions. You can finish the tasks, but sometimes it feels exhausting, right?

ToolCUA is like a super assistant inside the game that can automatically complete some repetitive actions for you, like opening treasure chests with one click or automatically organizing your inventory. That's the magic of tool invocation!

But sometimes, there are special tasks in the game, like solving puzzles, where ToolCUA smartly lets you do it yourself. This way, you can enjoy the fun of the game while completing tasks faster!

In short, ToolCUA is like your game buddy, helping you make the smartest choices in the game so you can play more easily and happily!

Glossary

GUI (Graphical User Interface)

A user interface that allows users to interact with a computer using graphical elements.

In ToolCUA, GUI actions refer to basic operations like clicking and typing.

Tool Invocation

Calling high-level functions through APIs or other methods.

In ToolCUA, tool invocation is used to replace lengthy GUI operations.

Path Selection

Choosing the optimal path among multiple possible execution paths.

ToolCUA achieves more optimal path selection through staged training.

Reinforcement Learning (RL)

A machine learning method that learns optimal policies through interaction with the environment.

ToolCUA uses RL to optimize GUI-Tool switching decisions.

Interleaved Trajectory

A hybrid trajectory combining GUI actions and tool invocation.

ToolCUA generates diverse training data through an interleaved trajectory scaling pipeline.
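
A toy sketch can show what "interleaved" means in practice. It assumes a tool library that maps a fixed run of GUI steps to a single tool, and rewrites a static GUI trajectory by collapsing each matching run into one tool call. The step strings, the `save_file_as` tool, and the exact-match rule are all hypothetical. The actual pipeline is not described at this level of detail here.

```python
def interleave(gui_steps, tool_library):
    """Collapse each run of GUI steps that matches a tool pattern into one tool call.

    tool_library maps a non-empty tuple of GUI steps to a tool name (assumed format).
    """
    out = []
    i = 0
    while i < len(gui_steps):
        for pattern, tool_name in tool_library.items():
            n = len(pattern)
            if tuple(gui_steps[i:i + n]) == pattern:
                out.append(("tool", tool_name))
                i += n
                break
        else:
            # No tool pattern starts here, so keep the GUI step as-is.
            out.append(("gui", gui_steps[i]))
            i += 1
    return out


gui_steps = [
    "click:Open report",
    "click:File",
    "click:Save As",
    "type:report.pdf",
    "click:Save",
    "click:Close",
]
tool_library = {
    ("click:File", "click:Save As", "type:report.pdf", "click:Save"): "save_file_as",
}

trajectory = interleave(gui_steps, tool_library)
print(trajectory)
```

The six-step GUI trajectory becomes a three-step interleaved one: a GUI step, a tool call, and a final GUI step.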

Tool-Efficient Path Reward

A reward mechanism that encourages appropriate tool use and shorter execution paths.

ToolCUA uses this reward to optimize path selection.
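
One way such a reward could be shaped is sketched below. This is not the paper's formula, which is not given in this summary. The function name, the weights, and the notion of a "useful" tool call are assumptions made for illustration.

```python
def path_reward(success, steps, max_steps, tool_calls, useful_tool_calls,
                length_weight=0.3, tool_weight=0.2):
    """Hypothetical reward: task success first, then bonuses for brevity and good tool use."""
    if not success:
        # No credit for short or tool-heavy paths that fail the task.
        return 0.0
    steps = min(steps, max_steps)
    length_bonus = length_weight * (1 - steps / max_steps)
    # Reward the share of tool calls that helped, so spamming tools does not pay.
    tool_bonus = tool_weight * (useful_tool_calls / tool_calls) if tool_calls else 0.0
    return 1.0 + length_bonus + tool_bonus


short_tool_path = path_reward(True, steps=5, max_steps=20, tool_calls=2, useful_tool_calls=2)
long_gui_path = path_reward(True, steps=15, max_steps=20, tool_calls=0, useful_tool_calls=0)
failed_path = path_reward(False, steps=3, max_steps=20, tool_calls=1, useful_tool_calls=1)

print(short_tool_path, long_gui_path, failed_path)
```

Gating both bonuses on success keeps the agent from trading correctness for shorter paths, and both successful paths still outscore the failed one.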

OSWorld-MCP

A benchmark dataset for evaluating computer use agents.

ToolCUA achieves 46.85% accuracy on this benchmark.

Multimodal Large Language Model (MLLM)

A large language model capable of processing multimodal data.

ToolCUA uses MLLMs to generate a tool library.

Warmup SFT

A supervised finetuning method used to initialize models.

ToolCUA uses warmup SFT in tool-bootstrapped GUI reinforcement finetuning.

Single-Turn RL

Reinforcement learning conducted at a single decision point.

ToolCUA uses single-turn RL to optimize decisions at critical GUI-Tool switching points.
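
As a minimal sketch, a switching point can be read as any step whose action type differs from the previous step's. That definition is an assumption. How the authors identify critical switching points is not stated in this summary.

```python
def switching_points(trajectory):
    """Return indices where the action type changes from the previous step.

    trajectory is a list of (action_type, payload) pairs, with action_type
    either "gui" or "tool" (assumed format).
    """
    return [
        i for i in range(1, len(trajectory))
        if trajectory[i][0] != trajectory[i - 1][0]
    ]


# Hypothetical interleaved trajectory.
trajectory = [
    ("gui", "click:Open report"),
    ("gui", "click:File"),
    ("tool", "save_file_as"),
    ("gui", "click:Close"),
]

points = switching_points(trajectory)
print(points)
```

Each returned index marks a single decision that could be optimized in isolation, without rolling out the whole task.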

Open Questions: Unanswered questions from this research

1. How can ToolCUA improve tool invocation accuracy when tools are unavailable or unstable? Current methods perform poorly in these scenarios, requiring more robust strategies.
2. ToolCUA still relies on lengthy GUI operations in some complex tasks, failing to fully leverage tool invocation. How can path selection be further optimized in these tasks?
3. Validating ToolCUA's generalization capabilities in more diverse application scenarios is an open question. Current research focuses primarily on specific benchmark datasets.
4. How can more environmental feedback signals be incorporated to further optimize ToolCUA's path selection? Current methods rely mainly on tool-efficient path rewards.
5. How can ToolCUA's performance in cross-platform tasks be further improved? Current research shows variability in its performance across different platforms.

Applications

Immediate Applications

Desktop Automation

ToolCUA can be used for automating desktop workflows, particularly in scenarios requiring efficient path selection. Its hybrid action space makes it broadly applicable across diverse application scenarios.

Long-term Vision

Intelligent Assistant

ToolCUA can be part of an intelligent assistant, helping users complete tasks in complex digital environments, improving work efficiency.

Abstract

Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations, but this hybrid action space often leaves them uncertain about when to continue with GUI actions or switch to tools, leading to suboptimal execution paths. This difficulty stems from the scarcity of high-quality interleaved GUI-Tool trajectories, the cost and brittleness of collecting real tool trajectories, and the lack of trajectory-level supervision for GUI-Tool path selection. In this paper, we propose ToolCUA, an end-to-end agent designed to learn optimal GUI-Tool path selection through a staged training paradigm. We first introduce an Interleaved GUI-Tool Trajectory Scaling Pipeline that repurposes abundant static GUI trajectories and synthesizes a grounded tool library, enabling diverse GUI-Tool trajectories without manual engineering or real tool-trajectory collection. We then perform Tool-Bootstrapped GUI RFT, combining warmup SFT with single-turn RL to improve decisions at critical GUI-Tool switching points. Finally, we optimize ToolCUA with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths. Experiments on OSWorld-MCP show that ToolCUA achieves 46.85% accuracy, a relative improvement of approximately 66% over the baseline, establishing a new state of the art among models of comparable scale. It also improves by 3.9% over GUI-only settings, demonstrating effective GUI-Tool orchestration. The results further suggest that training in a hybrid action space is a promising paradigm for real-world digital agents. Open-sourced here: https://x-plug.github.io/ToolCUA/

cs.AI
