OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards
The OS-Themis framework improves GUI agent performance by 10.3% on AndroidWorld using a multi-agent critic mechanism.
Key Findings
Methodology
OS-Themis is a multi-agent critic framework designed to enhance the robustness of GUI agents in stochastic environments. The framework decomposes trajectories into verifiable milestones and employs a review mechanism to strictly audit the evidence chain before making the final verdict. Its core components include the Milestone Verification Module and the Verdict Calibration Module, which respectively handle trajectory decomposition and evidence auditing.
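The two-stage flow described above can be sketched in code. This is a minimal illustration under assumptions: the names `Milestone`, `verify_milestones`, and `calibrate_verdict` are hypothetical and stand in for the paper's Milestone Verification and Verdict Calibration Modules, and keyword matching stands in for the actual multimodal verification.

```python
from dataclasses import dataclass

# Hypothetical sketch of the two-stage critic pipeline; names and the
# keyword-based "verification" are illustrative, not the paper's method.

@dataclass
class Milestone:
    subgoal: str    # explicit, observable sub-goal
    evidence: str   # trajectory snippet supporting the check
    verified: bool  # did the verifier confirm this sub-goal?

def verify_milestones(trajectory: list[str], subgoals: list[str]) -> list[Milestone]:
    """Milestone Verification: tie each sub-goal to evidence in the trajectory."""
    return [
        Milestone(
            subgoal=g,
            evidence=next((step for step in trajectory if g in step), ""),
            verified=any(g in step for step in trajectory),
        )
        for g in subgoals
    ]

def calibrate_verdict(milestones: list[Milestone]) -> bool:
    """Verdict Calibration: audit the evidence chain; any unverified or
    evidence-free milestone vetoes an overly optimistic success verdict."""
    return all(m.verified and m.evidence for m in milestones)

traj = ["open settings app", "tap wifi toggle", "wifi enabled"]
milestones = verify_milestones(traj, ["settings", "wifi"])
print(calibrate_verdict(milestones))  # success only if every milestone has evidence
```

The key design point mirrored here is that the final verdict is conjunctive over audited milestones, so a single unsupported claim blocks a success reward rather than being averaged away.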
Key Results
- In experiments on AndroidWorld, OS-Themis achieved a 10.3% performance improvement when used for online RL training and a 6.9% gain for trajectory validation and filtering. These results highlight OS-Themis's potential in supporting online RL training and self-training loops.
- On the OmniGUIRewardBench, OS-Themis outperformed all tested models, with an average accuracy increase of 18.8%, precision improvement of 29.6%, recall enhancement of 16.9%, and F1-score boost of 26.2%.
- In RL training at different scales, OS-Themis achieved a 10.3% performance improvement on the Qwen3-VL-4B model, demonstrating that its gains hold across foundation models of different sizes.
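The trajectory validation and filtering use case above can be sketched as a critic-gated self-training loop. This is an illustrative toy under assumptions: `critic_score` is a keyword-based stand-in for OS-Themis's verdict, and the threshold and data shapes are not from the paper.

```python
# Illustrative sketch: using a critic verdict to filter rollouts before
# they enter a self-training dataset. The scoring rule is a toy assumption.

def critic_score(trajectory: list[str], goal: str) -> float:
    """Toy critic: fraction of goal keywords observed anywhere in the trajectory."""
    words = goal.lower().split()
    hits = sum(any(w in step.lower() for step in trajectory) for w in words)
    return hits / len(words)

def filter_for_self_training(rollouts: list[list[str]], goal: str,
                             threshold: float = 1.0) -> list[list[str]]:
    """Keep only rollouts the critic judges successful, so erroneous
    success signals do not propagate into the training set."""
    return [t for t in rollouts if critic_score(t, goal) >= threshold]

rollouts = [
    ["open clock", "set alarm 7am"],  # completes the goal
    ["open clock", "close app"],      # misses the alarm step
]
kept = filter_for_self_training(rollouts, "alarm clock")
print(len(kept))  # 1
```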
Significance
OS-Themis addresses the limitations of existing reward methods in scalability and performance, significantly enhancing the robustness of GUI agents in stochastic environments. Its multi-agent critic mechanism provides a novel perspective for acquiring reward signals, effectively isolating decision-critical evidence and preventing the propagation of erroneous signals. This framework holds significant academic importance and offers new possibilities for GUI agent development in industry, particularly in applications requiring high precision and robustness.
Technical Contribution
The technical contributions of OS-Themis are primarily reflected in its design of a multi-agent critic framework, which provides finer trajectory verification and evidence auditing mechanisms compared to existing single-agent methods. By introducing the Milestone Verification and Verdict Calibration Modules, OS-Themis effectively reduces erroneous judgments and improves the accuracy of reward signals. Additionally, its successful application in cross-platform GUI reward modeling demonstrates its generality and adaptability across different environments.
Novelty
OS-Themis is the first to apply a multi-agent critic mechanism to GUI reward modeling, addressing the low signal-to-noise ratio issue in long-horizon tasks by decomposing trajectories and rigorously auditing evidence. Compared to existing methods, OS-Themis not only improves the accuracy of reward signals but also prevents the propagation of erroneous signals through structured evidence chain auditing.
Limitations
- OS-Themis may face challenges in decomposing milestones adequately in extremely complex GUI tasks, leading to insufficient verification.
- In some cross-platform applications, OS-Themis's performance may be affected by platform-specific characteristics, resulting in inconsistent outcomes.
- Due to the complexity of the framework, OS-Themis has high computational costs, which may not be suitable for resource-constrained environments.
Future Work
Future work can focus on optimizing the computational efficiency of OS-Themis to lower its application threshold in resource-constrained environments. Additionally, exploring its potential in more complex GUI tasks and ensuring performance consistency across different platforms are worth investigating. Further research could also explore integrating OS-Themis with other reinforcement learning methods to enhance its adaptability in dynamic environments.
AI Executive Summary
In modern digital environments, the robustness and adaptability of graphical user interface (GUI) agents are crucial. However, existing reinforcement learning methods perform poorly in stochastic environments, primarily due to the quality of the reward function. The OS-Themis framework offers an innovative solution by introducing a multi-agent critic mechanism.
At the core of OS-Themis is its multi-agent critic framework, which decomposes trajectories into verifiable milestones and employs a review mechanism to strictly audit the evidence chain before making the final verdict. Its Milestone Verification Module and Verdict Calibration Module respectively handle trajectory decomposition and evidence auditing, ensuring the accuracy of reward signals.
In experiments, OS-Themis performed exceptionally well on AndroidWorld, achieving a 10.3% performance improvement in online RL training and a 6.9% gain in trajectory validation and filtering. On the OmniGUIRewardBench, OS-Themis outperformed all tested models, demonstrating its potential in cross-platform applications.
The success of OS-Themis lies not only in its technical innovation but also in its broad applicability in academia and industry. By addressing the limitations of existing methods in scalability and performance, OS-Themis offers a new perspective for GUI agent development.
However, OS-Themis also faces challenges, such as inadequate milestone decomposition in extremely complex tasks and performance consistency issues in cross-platform applications. Future research can focus on optimizing its computational efficiency and exploring more application scenarios.
In summary, the OS-Themis framework provides an innovative solution for GUI reward modeling. Its multi-agent critic mechanism excels in improving the accuracy and robustness of reward signals, offering broad application prospects.
Deep Analysis
Background
In recent years, with the advancement of computational power and the proliferation of deep learning, graphical user interface (GUI) agents have become increasingly prevalent in digital tasks. However, despite mastering routine workflows through large-scale training, these agents still exhibit brittleness in stochastic environments, struggling to recover from deviations or generalize to unseen scenarios. This issue has prompted researchers to turn to reinforcement learning (RL) for adaptive correction. However, the success of RL heavily relies on reliable reward signals, making reward modeling a critical challenge. Existing reward acquisition methods mainly fall into three categories: rule-based methods, verifiers from human feedback, and generalized reasoning using foundational models. While each method has its pros and cons, the low signal-to-noise ratio issue in long-horizon tasks remains.
Core Problem
In GUI environments, reward modeling is crucial for the success of reinforcement learning. However, existing methods struggle with the low signal-to-noise ratio issue in long-horizon tasks, making it difficult to extract decision-critical evidence. Additionally, converting critical information in trajectories into precise rewards is challenging. Existing methods often lead to overly optimistic judgments, feeding erroneous signals into online RL and misleading policy updates. These issues make it difficult for existing reward methods to achieve both scalability and performance.
Innovation
The OS-Themis framework introduces a multi-agent critic mechanism to innovatively address the shortcomings of existing methods in reward modeling. Its core innovations include:
1) Milestone Verification Module: Decomposes trajectories into verifiable milestones, assigning explicit and observable sub-goals to effectively isolate decision-critical evidence.
2) Verdict Calibration Module: Employs a review mechanism to strictly audit the evidence chain, correcting overly optimistic assessments and preventing erroneous signal propagation.
3) Cross-platform GUI reward modeling: OS-Themis demonstrates its generality and adaptability across different environments on the OmniGUIRewardBench.
Methodology
The design of the OS-Themis framework includes the following key steps:
- Milestone Verification Module: Decomposes trajectories into milestones, assigning explicit and observable sub-goals.
- Verdict Calibration Module: Employs a review mechanism to strictly audit the evidence chain, correcting overly optimistic assessments.
- Multi-agent collaboration: Ensures the accuracy and robustness of reward signals through a collaborative workflow.
- Cross-platform application: Extensive testing on the OmniGUIRewardBench to verify its generality and adaptability across different environments.
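The multi-agent collaboration step above can be sketched as a reviewer auditing the findings of independent milestone critics. This is a hypothetical sketch: `reviewer_audit` and the `(verdict, evidence)` finding format are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch of the review mechanism: independent critics each
# emit a (verdict, evidence) finding, and a reviewer agent vetoes the
# aggregate success verdict if any claim lacks supporting evidence.

def reviewer_audit(findings: list[tuple[bool, str]]) -> bool:
    """A success verdict survives review only when every critic's finding
    is both positive and backed by concrete evidence."""
    return all(verdict and evidence for verdict, evidence in findings)

findings = [
    (True, "screenshot shows settings page"),
    (True, "toggle state reads ON"),
    (True, ""),  # optimistic verdict with no supporting evidence
]
print(reviewer_audit(findings))  # False: the unbacked claim is rejected
```

This captures the framework's stated goal of correcting overly optimistic assessments: agreement alone is not enough, since each positive verdict must also carry its evidence through the audit.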
Experiments
The experimental design includes extensive testing on AndroidWorld and the OmniGUIRewardBench. On AndroidWorld, OS-Themis achieved a 10.3% performance improvement when used for online RL training and a 6.9% gain for trajectory validation and filtering. On the OmniGUIRewardBench, OS-Themis outperformed all tested models, with an average accuracy increase of 18.8%, precision improvement of 29.6%, recall enhancement of 16.9%, and F1-score boost of 26.2%. These experimental results validate the effectiveness and adaptability of OS-Themis across different environments.
Results
The experimental results show that OS-Themis achieved a 10.3% performance improvement on AndroidWorld when used for online RL training and a 6.9% gain for trajectory validation and filtering. On the OmniGUIRewardBench, OS-Themis outperformed all tested models, with an average accuracy increase of 18.8%, precision improvement of 29.6%, recall enhancement of 16.9%, and F1-score boost of 26.2%. These results highlight OS-Themis's potential in supporting online RL training and self-training loops.
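The benchmark metrics reported above (accuracy, precision, recall, F1) are standard quantities computed from a critic's binary verdicts against gold labels. A minimal sketch with made-up toy data (not the paper's numbers):

```python
# Standard binary classification metrics over critic verdicts vs. gold
# labels; the example data below is illustrative, not from the paper.

def binary_metrics(pred: list[int], gold: list[int]) -> tuple[float, float, float, float]:
    tp = sum(p == g == 1 for p, g in zip(pred, gold))           # true positives
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gold))     # false positives
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gold))     # false negatives
    acc = sum(p == g for p, g in zip(pred, gold)) / len(gold)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

acc, prec, rec, f1 = binary_metrics([1, 1, 0, 1], [1, 0, 0, 1])
print(round(acc, 2), round(prec, 2), round(rec, 2), round(f1, 2))
```

For a reward critic, precision is especially important: a false positive (an unearned success reward) feeds an erroneous signal directly into policy updates, which is the failure mode the Verdict Calibration Module targets.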
Applications
The OS-Themis framework holds broad application potential in GUI reward modeling. Its multi-agent critic mechanism effectively improves the accuracy and robustness of reward signals, making it suitable for applications requiring high precision and robustness. Additionally, its successful application in cross-platform GUI reward modeling demonstrates its generality and adaptability across different environments.
Limitations & Outlook
Despite OS-Themis's outstanding performance in reward modeling, it may face challenges in decomposing milestones adequately in extremely complex GUI tasks, leading to insufficient verification. In some cross-platform applications, OS-Themis's performance may be affected by platform-specific characteristics, resulting in inconsistent outcomes. Due to the complexity of the framework, OS-Themis has high computational costs, which may not be suitable for resource-constrained environments. Future research can focus on optimizing its computational efficiency and exploring more application scenarios.
Plain Language (Accessible to non-experts)
Imagine you're in a kitchen cooking a meal. Each dish has several key steps, like chopping vegetables, frying, and seasoning. OS-Themis is like a smart kitchen assistant that breaks down the entire cooking process into these key steps and checks each step to ensure it's done correctly. Even if you make a small mistake during cooking, it can catch and correct it, ensuring the final dish is delicious. This assistant can work in different kitchens, like Chinese, Western, and Japanese cuisines. It can adjust its working style according to different cuisines to ensure each dish reaches its best flavor. The multi-agent system of OS-Themis is like multiple kitchen assistants working together, each with its expertise, ensuring the entire process is efficient and accurate. Even in complex dishes, it can ensure each step is completed smoothly through division of labor and cooperation.
ELI14 (Explained like you're 14)
Hey there! Did you know that sometimes computer programs are like playing games, needing to react in different environments? Imagine you're playing a super complex game, and each level has many small tasks, like finding keys, opening doors, and defeating monsters. OS-Themis is like a super smart game assistant that helps you break each level into these small tasks and guides you step by step. Even if you get lost in a level, it can help you find the right direction. This assistant can help you in different games, like action, puzzle, and adventure games. It can adjust its strategy according to different game types to ensure you can pass smoothly. The multi-agent system of OS-Themis is like multiple game assistants working together, each with its expertise, ensuring the entire game process is efficient and accurate. Even in super complex games, it can ensure each task is completed smoothly through division of labor and cooperation.
Glossary
OS-Themis
OS-Themis is a multi-agent critic framework designed to enhance the robustness of GUI agents in stochastic environments.
In this paper, OS-Themis is used to optimize the acquisition of GUI reward signals.
Multi-Agent System
A multi-agent system is a collaborative workflow where multiple agents work together to complete tasks.
OS-Themis achieves precise reward signal acquisition through a multi-agent system.
Milestone Verification Module
The Milestone Verification Module is responsible for decomposing trajectories into verifiable milestones, ensuring the isolation of decision-critical evidence.
In OS-Themis, this module is used to improve the accuracy of reward signals.
Verdict Calibration Module
The Verdict Calibration Module employs a review mechanism to strictly audit the evidence chain, correcting overly optimistic assessments.
In OS-Themis, this module prevents the propagation of erroneous signals.
OmniGUIRewardBench
OmniGUIRewardBench is a cross-platform GUI reward model benchmark used to evaluate the performance of different models.
In this paper, OS-Themis performed exceptionally well on OmniGUIRewardBench.
Reinforcement Learning
Reinforcement learning is a machine learning method that optimizes strategies through reward signals.
In this paper, reinforcement learning is used to enhance the robustness of GUI agents.
Reward Function
A reward function is a key component in reinforcement learning used to guide strategy optimization.
In this paper, OS-Themis optimizes the reward function through a multi-agent critic mechanism.
Signal-to-Noise Ratio
Signal-to-noise ratio is a measure of signal quality; a low ratio can lead to erroneous judgments.
OS-Themis addresses the low signal-to-noise ratio that existing methods suffer from in long-horizon tasks.
Cross-Platform Application
Cross-platform application refers to achieving consistent performance across different platforms.
OS-Themis demonstrates its potential for cross-platform application on OmniGUIRewardBench.
Self-Evolution
Self-evolution refers to a system's ability to continuously improve performance through its learning and adaptation capabilities.
OS-Themis achieves self-evolution of GUI agents through the optimization of reward signals.
Open Questions (Unanswered questions from this research)
1. In extremely complex GUI tasks, OS-Themis may face challenges in adequately decomposing milestones, leading to insufficient verification. This issue requires further research to enhance its adaptability in complex tasks.
2. In cross-platform applications, OS-Themis's performance may be affected by platform-specific characteristics, resulting in inconsistent outcomes. Future research can focus on improving its performance consistency across platforms.
3. OS-Themis has high computational costs, which may not be suitable for resource-constrained environments. Optimizing its computational efficiency is an important direction for future research.
4. The existing multi-agent critic mechanism may not completely isolate decision-critical evidence in some cases, leading to the propagation of erroneous signals. Further research can focus on enhancing its evidence isolation capability.
5. In dynamic environments, OS-Themis's adaptability may be limited. Future research can explore how to integrate it with other reinforcement learning methods to enhance its adaptability.
Applications
Immediate Applications
Mobile Application Testing
OS-Themis can be used to enhance the robustness and accuracy of GUI agents in mobile application testing, ensuring consistency across different devices and environments.
Cross-Platform Software Development
In cross-platform software development, OS-Themis can help developers optimize GUI reward signals, improving performance consistency across different platforms.
Automated User Interface Design
OS-Themis can be used in automated user interface design for reward modeling, helping designers optimize interface layout and user experience.
Long-term Vision
Smart Home Systems
OS-Themis can be used to optimize GUI agents in smart home systems, enhancing adaptability and robustness across different environments.
Autonomous Vehicle Interfaces
In autonomous vehicles, OS-Themis can be used to optimize in-car interface reward signals, improving driving safety and user experience.
Abstract
Reinforcement Learning (RL) has the potential to improve the robustness of GUI agents in stochastic environments, yet training is highly sensitive to the quality of the reward function. Existing reward approaches struggle to achieve both scalability and performance. To address this, we propose OS-Themis, a scalable and accurate multi-agent critic framework. Unlike a single judge, OS-Themis decomposes trajectories into verifiable milestones to isolate critical evidence for decision making and employs a review mechanism to strictly audit the evidence chain before making the final verdict. To facilitate evaluation, we further introduce OmniGUIRewardBench (OGRBench), a holistic cross-platform benchmark for GUI outcome rewards, where all evaluated models achieve their best performance under OS-Themis. Extensive experiments on AndroidWorld show that OS-Themis yields a 10.3% improvement when used to support online RL training, and a 6.9% gain when used for trajectory validation and filtering in the self-training loop, highlighting its potential to drive agent evolution.