OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards

TL;DR

The OS-Themis framework improves GUI agent performance by 10.3% on AndroidWorld using a multi-agent critic mechanism.

cs.AI Β· 2026-03-20
Zehao Li Zhenyu Wu Yibo Zhao Bowen Yang Jingjing Xie Zhaoyang Liu Zhoumianze Liu Kaiming Jin Jianze Liang Zonglin Li Feng Wu Bowen Zhou Zun Wang Zichen Ding
Reinforcement Learning GUI Agents Reward Function Multi-Agent System Self-Evolution

Key Findings

Methodology

OS-Themis is a multi-agent critic framework designed to enhance the robustness of GUI agents in stochastic environments. The framework decomposes trajectories into verifiable milestones and employs a review mechanism to strictly audit the evidence chain before making the final verdict. Its core components include the Milestone Verification Module and the Verdict Calibration Module, which respectively handle trajectory decomposition and evidence auditing.
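This two-stage flow can be sketched as follows; the `Milestone` dataclass and function names are illustrative stand-ins, not the paper's actual implementation. Verification first binds each sub-goal to observable evidence, and the calibrated verdict then vetoes any success claim with a gap in the evidence chain.

```python
from dataclasses import dataclass, field

@dataclass
class Milestone:
    # A verifiable sub-goal extracted from the task trajectory.
    description: str
    evidence: list = field(default_factory=list)  # e.g. screenshots, UI states
    verified: bool = False

def verify_milestones(milestones):
    # Stage 1 (sketch): a milestone counts as verified only if it is
    # backed by at least one piece of observable evidence.
    for m in milestones:
        m.verified = len(m.evidence) > 0
    return milestones

def calibrated_verdict(milestones):
    # Stage 2 (sketch): audit the full evidence chain and declare success
    # only if every milestone passed, correcting optimistic judgments.
    return all(m.verified for m in milestones)

steps = [
    Milestone("Open the Settings app", evidence=["screen_01.png"]),
    Milestone("Enable airplane mode"),  # no supporting evidence captured
]
verify_milestones(steps)
ok = calibrated_verdict(steps)  # the unverified milestone vetoes success
```

A single unverified milestone is enough to withhold the reward, mirroring the strict audit the framework performs before its final verdict.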

Key Results

  • In experiments on AndroidWorld, OS-Themis achieved a 10.3% performance improvement when used for online RL training and a 6.9% gain for trajectory validation and filtering. These results highlight OS-Themis's potential in supporting online RL training and self-training loops.
  • On the OmniGUIRewardBench, OS-Themis outperformed all tested models, with an average accuracy increase of 18.8%, precision improvement of 29.6%, recall enhancement of 16.9%, and F1-score boost of 26.2%.
  • In RL training at different model scales, OS-Themis again delivered a 10.3% performance improvement when training Qwen3-VL-4B, suggesting that its reward signals remain effective across foundation models of different sizes.
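The trajectory-validation-and-filtering use case can be sketched in a few lines; the `critic` below is a hypothetical stand-in for OS-Themis's verdict, not its actual interface:

```python
def filter_for_self_training(trajectories, critic):
    # Sketch: keep only trajectories the critic accepts, so the
    # self-training loop fine-tunes on verified successes.
    return [t for t in trajectories if critic(t)]

# Hypothetical critic: accepts a trajectory whose final state matches the goal.
critic = lambda t: t["final_state"] == t["goal"]

pool = [
    {"goal": "wifi_on", "final_state": "wifi_on"},
    {"goal": "wifi_on", "final_state": "wifi_off"},  # failed attempt, filtered out
]
kept = filter_for_self_training(pool, critic)
```

Only verified successes enter the self-training set, which is what keeps erroneous positive labels out of the loop.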

Significance

OS-Themis addresses the limitations of existing reward methods in scalability and performance, significantly enhancing the robustness of GUI agents in stochastic environments. Its multi-agent critic mechanism provides a novel perspective for acquiring reward signals, effectively isolating decision-critical evidence and preventing the propagation of erroneous signals. This framework holds significant academic importance and offers new possibilities for GUI agent development in industry, particularly in applications requiring high precision and robustness.

Technical Contribution

The technical contributions of OS-Themis are primarily reflected in its design of a multi-agent critic framework, which provides finer trajectory verification and evidence auditing mechanisms compared to existing single-agent methods. By introducing the Milestone Verification and Verdict Calibration Modules, OS-Themis effectively reduces erroneous judgments and improves the accuracy of reward signals. Additionally, its successful application in cross-platform GUI reward modeling demonstrates its generality and adaptability across different environments.

Novelty

OS-Themis is the first to apply a multi-agent critic mechanism to GUI reward modeling, addressing the low signal-to-noise ratio issue in long-horizon tasks by decomposing trajectories and rigorously auditing evidence. Compared to existing methods, OS-Themis not only improves the accuracy of reward signals but also prevents the propagation of erroneous signals through structured evidence chain auditing.

Limitations

  • OS-Themis may face challenges in decomposing milestones adequately in extremely complex GUI tasks, leading to insufficient verification.
  • In some cross-platform applications, OS-Themis's performance may be affected by platform-specific characteristics, resulting in inconsistent outcomes.
  • Due to the complexity of the framework, OS-Themis has high computational costs, which may not be suitable for resource-constrained environments.

Future Work

Future work can focus on optimizing the computational efficiency of OS-Themis to lower the barrier to deployment in resource-constrained environments. Its behavior on more complex GUI tasks and its performance consistency across platforms also merit investigation. Further research could explore integrating OS-Themis with other reinforcement learning methods to improve its adaptability in dynamic environments.

AI Executive Summary

In modern digital environments, the robustness and adaptability of graphical user interface (GUI) agents are crucial. However, existing reinforcement learning methods perform poorly in stochastic environments, primarily due to the quality of the reward function. The OS-Themis framework offers an innovative solution by introducing a multi-agent critic mechanism.

At the core of OS-Themis is its multi-agent critic framework, which decomposes trajectories into verifiable milestones and employs a review mechanism to strictly audit the evidence chain before making the final verdict. Its Milestone Verification Module and Verdict Calibration Module respectively handle trajectory decomposition and evidence auditing, ensuring the accuracy of reward signals.

In experiments, OS-Themis performed exceptionally well on AndroidWorld, achieving a 10.3% performance improvement in online RL training and a 6.9% gain in trajectory validation and filtering. On the OmniGUIRewardBench, OS-Themis outperformed all tested models, demonstrating its potential in cross-platform applications.

The success of OS-Themis lies not only in its technical innovation but also in its broad applicability in academia and industry. By addressing the limitations of existing methods in scalability and performance, OS-Themis offers a new perspective for GUI agent development.

However, OS-Themis also faces challenges, such as inadequate milestone decomposition in extremely complex tasks and performance consistency issues in cross-platform applications. Future research can focus on optimizing its computational efficiency and exploring more application scenarios.

In summary, the OS-Themis framework provides an innovative solution for GUI reward modeling. Its multi-agent critic mechanism excels in improving the accuracy and robustness of reward signals, offering broad application prospects.

Deep Analysis

Background

In recent years, with advances in computational power and the spread of deep learning, graphical user interface (GUI) agents have become increasingly prevalent in digital tasks. However, despite mastering routine workflows through large-scale training, these agents remain brittle in stochastic environments, struggling to recover from deviations or generalize to unseen scenarios. This has prompted researchers to turn to reinforcement learning (RL) for adaptive correction. Yet the success of RL hinges on reliable reward signals, making reward modeling a critical challenge. Existing reward acquisition methods fall mainly into three categories: rule-based methods, verifiers trained from human feedback, and generalist reasoning with foundation models. Each has its pros and cons, but the low signal-to-noise ratio of long-horizon tasks remains unsolved.

Core Problem

In GUI environments, reward modeling is crucial for the success of reinforcement learning. However, existing methods struggle with the low signal-to-noise ratio issue in long-horizon tasks, making it difficult to extract decision-critical evidence. Additionally, converting critical information in trajectories into precise rewards is challenging. Existing methods often lead to overly optimistic judgments, feeding erroneous signals into online RL and misleading policy updates. These issues make it difficult for existing reward methods to achieve both scalability and performance.
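A one-line REINFORCE-style advantage weight shows why erroneous signals are so damaging (the baseline value here is illustrative): the reward enters the policy update directly, so a false verdict does not merely add noise, it sets the sign and magnitude of the update itself.

```python
def policy_update_weight(reward, baseline=0.5):
    # REINFORCE-style advantage (sketch): (reward - baseline) scales the
    # gradient of each action's log-probability, so a wrong reward from
    # the critic directly misleads the policy update.
    return reward - baseline

true_success = policy_update_weight(1.0)   # positive: reinforces the trajectory
missed_success = policy_update_weight(0.0) # negative: penalizes a good trajectory
```

An overly optimistic critic that assigns `reward=1.0` to a failed trajectory would reinforce the failure with the same strength as a genuine success.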

Innovation

The OS-Themis framework introduces a multi-agent critic mechanism to innovatively address the shortcomings of existing methods in reward modeling. Its core innovations include:

1) Milestone Verification Module: Decomposes trajectories into verifiable milestones, assigning explicit and observable sub-goals to effectively isolate decision-critical evidence.

2) Verdict Calibration Module: Employs a review mechanism to strictly audit the evidence chain, correcting overly optimistic assessments and preventing erroneous signal propagation.

3) Cross-platform GUI reward modeling: OS-Themis demonstrates its generality and adaptability across different environments on the OmniGUIRewardBench.
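The review mechanism of point 2 can be illustrated with a deliberately lenient first-pass judge and a stricter auditor; the keyword-matching logic is a toy stand-in for the paper's evidence audit, and the trajectory fields are invented for the example:

```python
def first_pass_judgment(trajectory):
    # A lenient judge (sketch): claims success if the final screen
    # mentions the goal phrase anywhere.
    goal, screens = trajectory["goal"], trajectory["screens"]
    return goal.lower() in screens[-1].lower()

def audit_evidence_chain(trajectory, claimed_success):
    # Reviewer (sketch): accept a success claim only if every required
    # sub-goal keyword appears somewhere in the observed screens.
    if not claimed_success:
        return False
    observed = " ".join(trajectory["screens"]).lower()
    return all(kw in observed for kw in trajectory["required_keywords"])

traj = {
    "goal": "alarm set",
    "screens": ["home screen", "clock app", "alarm set confirmation"],
    "required_keywords": ["clock", "alarm set"],
}
optimistic = first_pass_judgment(traj)
final = audit_evidence_chain(traj, optimistic)

bad = {
    "goal": "alarm set",
    "screens": ["home screen", "alarm set confirmation"],  # clock app never opened
    "required_keywords": ["clock", "alarm set"],
}
rejected = audit_evidence_chain(bad, first_pass_judgment(bad))
```

In the second trajectory the lenient judge still claims success, but the auditor finds a hole in the evidence chain and overturns the verdict, which is the calibration behavior the module is designed for.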

Methodology

The design of the OS-Themis framework includes the following key steps:

  • Milestone Verification Module: decomposes trajectories into milestones, assigning explicit, observable sub-goals.
  • Verdict Calibration Module: employs a review mechanism to strictly audit the evidence chain, correcting overly optimistic assessments.
  • Multi-agent collaboration: ensures the accuracy and robustness of reward signals through a collaborative workflow.
  • Cross-platform evaluation: extensive testing on the OmniGUIRewardBench verifies its generality and adaptability across environments.

Experiments

The experimental design includes extensive testing on AndroidWorld and the OmniGUIRewardBench. On AndroidWorld, OS-Themis achieved a 10.3% performance improvement when used for online RL training and a 6.9% gain for trajectory validation and filtering. On the OmniGUIRewardBench, OS-Themis outperformed all tested models, with an average accuracy increase of 18.8%, precision improvement of 29.6%, recall enhancement of 16.9%, and F1-score boost of 26.2%. These experimental results validate the effectiveness and adaptability of OS-Themis across different environments.
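For reference, the four reported metrics are the standard binary-classification scores computed over the critic's success/failure verdicts. A minimal implementation (the example inputs are synthetic, not benchmark data):

```python
def reward_model_metrics(predictions, labels):
    # Accuracy / precision / recall / F1 for binary success verdicts,
    # the metrics reported on OmniGUIRewardBench-style evaluations.
    tp = sum(p and y for p, y in zip(predictions, labels))
    fp = sum(p and not y for p, y in zip(predictions, labels))
    fn = sum(not p and y for p, y in zip(predictions, labels))
    tn = sum(not p and not y for p, y in zip(predictions, labels))
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Synthetic verdicts vs. ground-truth task outcomes.
m = reward_model_metrics([True, True, False, False], [True, False, False, True])
```

Precision is the metric most directly tied to the paper's motivation: a low-precision critic feeds false positives into online RL, which is exactly what the verdict calibration stage is meant to suppress.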

Results

The results are consistent across both settings: on AndroidWorld, a 10.3% improvement for online RL training and a 6.9% gain for trajectory validation and filtering; on the OmniGUIRewardBench, margins over all tested models of 18.8% in accuracy, 29.6% in precision, 16.9% in recall, and 26.2% in F1-score. Together, these results highlight OS-Themis's potential to support both online RL training and self-training loops.

Applications

The OS-Themis framework holds broad application potential in GUI reward modeling. Its multi-agent critic mechanism effectively improves the accuracy and robustness of reward signals, making it suitable for applications requiring high precision and robustness. Additionally, its successful application in cross-platform GUI reward modeling demonstrates its generality and adaptability across different environments.

Limitations & Outlook

Despite OS-Themis's outstanding performance in reward modeling, it may face challenges in decomposing milestones adequately in extremely complex GUI tasks, leading to insufficient verification. In some cross-platform applications, OS-Themis's performance may be affected by platform-specific characteristics, resulting in inconsistent outcomes. Due to the complexity of the framework, OS-Themis has high computational costs, which may not be suitable for resource-constrained environments. Future research can focus on optimizing its computational efficiency and exploring more application scenarios.

Plain Language Accessible to non-experts

Imagine you're in a kitchen cooking a meal. Each dish has several key steps, like chopping vegetables, frying, and seasoning. OS-Themis is like a smart kitchen assistant that breaks down the entire cooking process into these key steps and checks each step to ensure it's done correctly. Even if you make a small mistake during cooking, it can catch and correct it, ensuring the final dish is delicious. This assistant can work in different kitchens, like Chinese, Western, and Japanese cuisines. It can adjust its working style according to different cuisines to ensure each dish reaches its best flavor. The multi-agent system of OS-Themis is like multiple kitchen assistants working together, each with its expertise, ensuring the entire process is efficient and accurate. Even in complex dishes, it can ensure each step is completed smoothly through division of labor and cooperation.

ELI14 Explained like you're 14

Hey there! Did you know that sometimes computer programs are like playing games, needing to react in different environments? Imagine you're playing a super complex game, and each level has many small tasks, like finding keys, opening doors, and defeating monsters. OS-Themis is like a super smart game assistant that helps you break each level into these small tasks and guides you step by step. Even if you get lost in a level, it can help you find the right direction. This assistant can help you in different games, like action, puzzle, and adventure games. It can adjust its strategy according to different game types to ensure you can pass smoothly. The multi-agent system of OS-Themis is like multiple game assistants working together, each with its expertise, ensuring the entire game process is efficient and accurate. Even in super complex games, it can ensure each task is completed smoothly through division of labor and cooperation.

Glossary

OS-Themis

OS-Themis is a multi-agent critic framework designed to enhance the robustness of GUI agents in stochastic environments.

In this paper, OS-Themis is used to optimize the acquisition of GUI reward signals.

Multi-Agent System

A multi-agent system is a collaborative workflow where multiple agents work together to complete tasks.

OS-Themis achieves precise reward signal acquisition through a multi-agent system.

Milestone Verification Module

The Milestone Verification Module is responsible for decomposing trajectories into verifiable milestones, ensuring the isolation of decision-critical evidence.

In OS-Themis, this module is used to improve the accuracy of reward signals.

Verdict Calibration Module

The Verdict Calibration Module employs a review mechanism to strictly audit the evidence chain, correcting overly optimistic assessments.

In OS-Themis, this module prevents the propagation of erroneous signals.

OmniGUIRewardBench

OmniGUIRewardBench is a cross-platform GUI reward model benchmark used to evaluate the performance of different models.

In this paper, OS-Themis performed exceptionally well on OmniGUIRewardBench.

Reinforcement Learning

Reinforcement learning is a machine learning method that optimizes strategies through reward signals.

In this paper, reinforcement learning is used to enhance the robustness of GUI agents.

Reward Function

A reward function is a key component in reinforcement learning used to guide strategy optimization.

In this paper, OS-Themis optimizes the reward function through a multi-agent critic mechanism.

Signal-to-Noise Ratio

Signal-to-noise ratio is a measure of signal quality, with lower ratios potentially leading to erroneous judgments.

In long-horizon tasks, OS-Themis addresses the low signal-to-noise ratio that hampers existing methods.

Cross-Platform Application

Cross-platform application refers to achieving consistent performance across different platforms.

OS-Themis demonstrates its potential for cross-platform application on OmniGUIRewardBench.

Self-Evolution

Self-evolution refers to a system's ability to continuously improve performance through its learning and adaptation capabilities.

OS-Themis achieves self-evolution of GUI agents through the optimization of reward signals.

Open Questions Unanswered questions from this research

  • 1 In extremely complex GUI tasks, OS-Themis may face challenges in adequately decomposing milestones, leading to insufficient verification. This issue requires further research to enhance its adaptability in complex tasks.
  • 2 In cross-platform applications, OS-Themis's performance may be affected by platform-specific characteristics, resulting in inconsistent outcomes. Future research can focus on improving its performance consistency across platforms.
  • 3 OS-Themis has high computational costs, which may not be suitable for resource-constrained environments. Optimizing its computational efficiency is an important direction for future research.
  • 4 The existing multi-agent critic mechanism may not completely isolate decision-critical evidence in some cases, leading to the propagation of erroneous signals. Further research can focus on enhancing its evidence isolation capability.
  • 5 In dynamic environments, OS-Themis's adaptability may be limited. Future research can explore how to integrate it with other reinforcement learning methods to enhance its adaptability.

Applications

Immediate Applications

Mobile Application Testing

OS-Themis can be used to enhance the robustness and accuracy of GUI agents in mobile application testing, ensuring consistency across different devices and environments.

Cross-Platform Software Development

In cross-platform software development, OS-Themis can help developers optimize GUI reward signals, improving performance consistency across different platforms.

Automated User Interface Design

OS-Themis can be used in automated user interface design for reward modeling, helping designers optimize interface layout and user experience.

Long-term Vision

Smart Home Systems

OS-Themis can be used to optimize GUI agents in smart home systems, enhancing adaptability and robustness across different environments.

Autonomous Vehicle Interfaces

In autonomous vehicles, OS-Themis can be used to optimize in-car interface reward signals, improving driving safety and user experience.

Abstract

Reinforcement Learning (RL) has the potential to improve the robustness of GUI agents in stochastic environments, yet training is highly sensitive to the quality of the reward function. Existing reward approaches struggle to achieve both scalability and performance. To address this, we propose OS-Themis, a scalable and accurate multi-agent critic framework. Unlike a single judge, OS-Themis decomposes trajectories into verifiable milestones to isolate critical evidence for decision making and employs a review mechanism to strictly audit the evidence chain before making the final verdict. To facilitate evaluation, we further introduce OmniGUIRewardBench (OGRBench), a holistic cross-platform benchmark for GUI outcome rewards, where all evaluated models achieve their best performance under OS-Themis. Extensive experiments on AndroidWorld show that OS-Themis yields a 10.3% improvement when used to support online RL training, and a 6.9% gain when used for trajectory validation and filtering in the self-training loop, highlighting its potential to drive agent evolution.

