A History-Aware Visually Grounded Critic for Computer Use Agents

TL;DR

Proposes HiViG, a history-aware, visually grounded test-time framework that raises average GUI task success rates over the strongest baseline by 5.8% (Qwen3-VL-32B) and 9.0% (Gemini-3-Flash) by combining macro-action history with visually grounded verification of proposed actions.

cs.AI 2026-06-10

Jaewoo Lee, Zaid Khan, Archiki Prasad, Justin Chih-Yao Chen, Supriyo Chakraborty, Kartik Balasubramaniam, Sambit Sahu, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal


AI, Multimodal Learning, GUI Automation, Test-time Intervention, Deep Learning

Key Findings

Methodology

This work introduces HiViG, a multimodal critic framework that integrates long-term macro-action history tracking with real-time, visually grounded error analysis to make GUI agents more robust. The critic is trained on a large corpus of open-source GUI trajectories, using multi-stage data augmentation that includes macro-action compression and spatial verification labels. Its architecture is a multimodal Transformer that fuses visual features from screenshots with textual descriptions of actions, enabling it to predict state transitions and classify potential errors. During inference, HiViG embeds the critic in the policy decision loop, where it updates the macro-action history by recursively compressing past interactions, verifies proposed actions through pixel-level spatial validation against the current screenshot, and provides corrective feedback before execution. Evaluations across web, mobile, and desktop benchmarks show consistent improvements over scalar reward models and ungrounded critics, with average success rate gains over the strongest baseline of 5.8% for Qwen3-VL-32B and 9.0% for Gemini-3-Flash.
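The inference-time loop described above can be pictured with a minimal Python sketch. It is not the authors' implementation: the `Action` and `Critique` types, the `critic_guided_step` function, the `max_retries` parameter, and the three toy callables are all hypothetical names chosen for illustration, and the real policy and critic are learned models rather than the hard-coded stand-ins used here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Action:
    kind: str               # e.g. "click"
    xy: Tuple[int, int]     # raw execution coordinates on the screenshot


@dataclass(frozen=True)
class Critique:
    ok: bool                # True when the critic accepts the action
    feedback: str = ""      # corrective message handed back to the policy


def critic_guided_step(goal, screenshot, history, propose, verify, summarize,
                       max_retries=2):
    """Propose an action, let the critic check it, and re-propose on rejection.

    All callables are hypothetical interfaces (assumptions, not the paper's API):
      propose(goal, screenshot, history, feedback) -> Action
      verify(goal, screenshot, history, action)    -> Critique
      summarize(history, action)                   -> updated history string
    """
    feedback = ""
    action = propose(goal, screenshot, history, feedback)
    for _ in range(max_retries):
        critique = verify(goal, screenshot, history, action)
        if critique.ok:
            break
        feedback = critique.feedback
        action = propose(goal, screenshot, history, feedback)
    # If every retry is rejected, the last proposal is returned unverified.
    # Only the action that is returned enters the history.
    return action, summarize(history, action)


# Toy stand-ins so the sketch runs end to end.
def toy_propose(goal, screenshot, history, feedback):
    # First attempt clicks the wrong spot; with feedback it corrects itself.
    return Action("click", (10, 10)) if feedback else Action("click", (0, 0))


def toy_verify(goal, screenshot, history, action):
    if action.xy == (10, 10):
        return Critique(ok=True)
    return Critique(ok=False, feedback="coordinates miss the target element")


def toy_summarize(history, action):
    entry = f"{action.kind}@{action.xy}"
    return f"{history}; {entry}" if history else entry


action, history = critic_guided_step(
    "open settings", "<screenshot>", "", toy_propose, toy_verify, toy_summarize
)
print(action, "|", history)
```

The point of the sketch is the ordering: the check happens before execution, and the rejected proposal never reaches the environment or the history.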

Key Results

On WebArenaLitev2, Gemini-3-Flash's success rate increased from 30.5% to 45.5%, an absolute improvement of 15 percentage points; Qwen3-VL-32B improved by 5.8%, reaching 38.3%.
In mobile environments, Qwen3-VL-32B's success rate increased by 7.3%, and on desktop, the success rate improved by 2.3%.
Ablation studies confirmed that macro-action history mitigates short-sighted planning, while visually grounded error analysis reduces spatial and reasoning errors; both are critical for long-horizon tasks.
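Because the figures above mix "percent" and absolute differences, it may help to separate the two. The short Python sketch below uses only the 30.5% and 45.5% values reported for Gemini-3-Flash on WebArenaLitev2; the helper name `gains` is an illustrative assumption.

```python
from typing import Tuple


def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 1), round(relative, 1)


# Gemini-3-Flash on WebArenaLitev2, as reported above.
print(gains(30.5, 45.5))  # (15.0, 49.2)
```

The reported "15%" is therefore an absolute difference of 15 percentage points, which corresponds to a relative improvement of roughly 49% over the starting success rate.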

Significance

This research addresses fundamental limitations in GUI automation—short-sighted decision-making and spatial error detection—by integrating history-aware and visually grounded mechanisms. It advances the state-of-the-art in test-time intervention, enabling more reliable and interpretable AI agents capable of handling complex, multi-step tasks across diverse platforms. The framework's platform-agnostic design, relying solely on raw screenshots and pixel-level verification, broadens its applicability in real-world automation scenarios, from web browsing to enterprise software automation. The approach paves the way for more robust, scalable, and explainable AI systems in GUI environments, with potential impacts spanning industry automation, assistive technologies, and intelligent testing.

Technical Contribution

The core technical innovation lies in the development of a multimodal critic that combines macro-action history compression with spatial verification. The macro-action history, recursively distilled from raw interactions, provides a global task context, alleviating short-sighted planning. The visual verification module compares proposed execution coordinates against current screenshots, predicting the visual impact of actions and classifying errors before execution. The architecture employs a multi-stage training pipeline: supervised fine-tuning on a large-scale GUI trajectory dataset, incorporating multi-task objectives for state transition prediction and error classification. The critic's design enables real-time, preemptive error detection and correction, significantly improving task success rates over existing scalar reward and text-based critics.
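To make the idea of checking execution coordinates concrete, here is a deliberately simplified Python sketch. It is a geometric stand-in, not the learned critic: HiViG's critic judges raw screenshots directly, whereas this example assumes the bounding box of the intended element is already known. The `Box` type, the `classify_click` function, and the three error labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a UI element, in screenshot pixels."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: int, y: int) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def classify_click(xy: Tuple[int, int], target: Box, width: int, height: int) -> str:
    """Classify a proposed click before it is executed (labels are illustrative)."""
    x, y = xy
    if not (0 <= x < width and 0 <= y < height):
        return "off_screen"       # coordinates fall outside the screenshot
    if target.contains(x, y):
        return "ok"               # click lands on the intended element
    return "wrong_element"        # click is on screen but misses the target


save_button = Box(left=40, top=10, right=120, bottom=30)
print(classify_click((50, 20), save_button, width=1280, height=720))    # ok
print(classify_click((500, 400), save_button, width=1280, height=720))  # wrong_element
print(classify_click((2000, 20), save_button, width=1280, height=720))  # off_screen
```

A learned critic replaces the known bounding box with a judgment made from the screenshot itself, but the output plays the same role: an error class produced before the action runs.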

Novelty

This work is the first to systematically combine long-term macro-action history with real-time spatial verification within a single multimodal critic for GUI agents. Unlike prior approaches that rely solely on textual intent or scalar rewards, HiViG leverages visual grounding to verify spatial accuracy, and compresses interaction history to maintain global task awareness. This dual mechanism addresses critical gaps in existing methods, enabling preemptive error detection and correction in complex, multi-step environments. The integration of these components into a unified framework represents a significant leap forward in test-time intervention strategies for GUI automation.

Limitations

Despite its robustness, the critic may still struggle with highly cluttered or occluded UI elements, where visual verification becomes ambiguous. In such cases, errors might slip through or false positives may occur.
The training data, derived from open-source trajectories, could contain biases or gaps, limiting generalization to unseen or highly customized GUIs.
The computational overhead of multi-stage visual verification and macro-action compression may hinder real-time deployment in resource-constrained environments, necessitating further optimization.

Future Work

Future research will focus on enhancing the spatial verification module's robustness in cluttered or occluded scenarios, possibly through multi-view or temporal consistency checks. Incorporating reinforcement learning to adaptively refine error detection thresholds and correction strategies is another promising avenue. Additionally, efforts will be made to optimize the model architecture for faster inference, enabling deployment in real-time industrial settings. Extending the framework to handle multi-modal inputs beyond screenshots, such as audio or textual cues, could further broaden its applicability. Finally, integrating user feedback mechanisms to dynamically adapt the critic's judgment criteria will improve its robustness and user trust.

AI Executive Summary

In the rapidly evolving landscape of AI-driven automation, the ability of intelligent agents to reliably operate within complex graphical user interfaces (GUIs) remains a significant challenge. Traditional approaches, relying on rule-based systems or scalar reward signals, often fall short in long-horizon, multi-step tasks where errors can compound, and spatial reasoning is critical. These limitations hinder the deployment of autonomous systems in real-world scenarios such as web navigation, mobile app control, and desktop automation.

Recent advances in deep learning and multimodal modeling have opened new possibilities for GUI automation, but existing methods still struggle with short-sighted decision loops and spatial misalignments. Text-based critiques or scalar rewards lack the spatial grounding necessary to detect and correct errors proactively. Consequently, agents frequently make mistakes—clicking the wrong UI elements, revisiting completed steps, or losing track of overall progress—leading to low success rates and unreliable performance.

To address these issues, the authors introduce HiViG, a novel framework that integrates a history-aware, visually grounded critic into the test-time intervention process. The core idea is to equip the agent with two key capabilities: first, a macro-action history that compresses and summarizes past interactions, providing a global view of task progress; second, a spatial verification mechanism that compares proposed action coordinates with the current GUI screenshot, predicting the visual impact and identifying potential errors before execution.

The critic is built on a multi-modal Transformer architecture trained on a large corpus of open-source GUI trajectories. It learns to update macro-action histories recursively, capturing long-term goals, and to verify the spatial correctness of actions by predicting state transitions grounded in visual features. During inference, HiViG actively guides the policy by updating the macro-action history, verifying proposed actions, and providing corrective feedback if errors are detected. This preemptive approach significantly reduces execution errors and improves overall success rates.
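The recursive history update can be illustrated with a small Python sketch in which each new step is folded into the previous summary instead of being appended to an ever-growing log. This is an assumption-laden toy: in HiViG the critic model produces the summary, while here each step is assumed to arrive with a subgoal label, and `MacroAction`, `update_history`, and `render` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MacroAction:
    subgoal: str    # what this group of low-level steps achieved
    n_steps: int    # how many low-level steps were merged into it


History = Tuple[MacroAction, ...]


def update_history(history: History, subgoal: str) -> History:
    """history_t = update(history_{t-1}, step_t): merge or append one step."""
    if history and history[-1].subgoal == subgoal:
        last = history[-1]
        return history[:-1] + (MacroAction(subgoal, last.n_steps + 1),)
    return history + (MacroAction(subgoal, 1),)


def render(history: History) -> str:
    """Compact text form that could be placed in the policy's prompt."""
    return "; ".join(f"{m.subgoal} ({m.n_steps} steps)" for m in history)


steps = ["open settings", "open settings",
         "enable dark mode", "enable dark mode", "enable dark mode"]
history: History = ()
for label in steps:
    history = update_history(history, label)

print(render(history))  # open settings (2 steps); enable dark mode (3 steps)
```

Five low-level steps collapse into two entries, so the record grows with the number of achievements rather than with the number of clicks.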

Experiments across web, mobile, and desktop benchmarks demonstrate the effectiveness of HiViG. In the WebArenaLitev2 environment, the success rate of Gemini-3-Flash increased from 30.5% to 45.5%, an absolute gain of 15 percentage points. Similarly, Qwen3-VL-32B's success rate improved by 5.8%. In mobile and desktop settings, success rate improvements ranged from 2.3% to 7.3%. Ablation studies confirmed that both macro-action history and visual verification are essential components, with their combination yielding the highest performance. The framework outperforms existing scalar reward models and ungrounded critics and generalizes well across platforms.

This work marks a significant step forward in making AI agents more reliable and interpretable in GUI automation tasks. By integrating long-term memory and spatial reasoning, HiViG addresses core limitations of prior methods, paving the way for more robust industrial automation, assistive technologies, and intelligent testing systems. Future directions include optimizing inference speed, enhancing robustness in cluttered environments, and extending the framework to multi-modal inputs, aiming to realize fully autonomous, adaptable GUI agents capable of operating seamlessly across diverse real-world scenarios.

Deep Dive

Plain Language Accessible to non-experts

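Imagine you work in a factory full of machines and assembly lines. Whenever you need a machine to do something, such as assembling parts, you tell it the exact steps. Sometimes the machine makes mistakes, such as fitting the wrong part or forgetting the earlier steps. To avoid these errors, you keep checking the machine's state to make sure every step is completed correctly.

Now suppose you have a very smart assistant that not only remembers all of your operations but can also watch the factory floor in real time and confirm that each action is correct. For example, it checks whether the machine is in the right position and whether the right part has been fitted. If it spots a mistake, it tells you in advance so that you can fix it in time instead of repairing the damage afterwards.

This assistant is like a very smart factory manager who remembers the full operation history and also sees every detail clearly, making sure every stage runs smoothly. As a result, the whole production process becomes more efficient and reliable, and small mistakes no longer disrupt the entire line.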

ELI14 Explained like you're 14

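Imagine you are playing a super complicated jigsaw puzzle where you have to fit many different pieces together to complete a beautiful picture. Sometimes you forget which parts you have already finished, or you put a piece in the wrong place, and the whole puzzle turns into a mess.

Now suppose you have a smart assistant that not only remembers every step you have taken but can also look at the current state of the puzzle and help you check whether it is right. For example, it might tell you: "This piece is in the wrong place. Why not try a different spot?"

The assistant also warns you ahead of time about places where things might go wrong, for example: "This piece does not look quite right and could throw off the whole puzzle." That way, you can fix the problem before it happens and avoid wasting time.

In short, this smart assistant remembers everything you have done, watches what is happening right now, and spots problems early, helping you finish the puzzle quickly and well. It makes the puzzle easier and more fun to play!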

Abstract

Various test-time interventions for Computer Use Agents (CUAs), including critic models, have been developed to improve performance through pre-execution action evaluation in complex Graphical User Interface (GUI) environments. However, existing critics suffer from two key limitations: they (1) focus primarily on short-sighted decision loops (e.g., forgetting earlier actions) and (2) lack the visual grounding needed to detect flawed actions (e.g., clicking wrong UI elements). To address these, we introduce HiViG, a History-aware Visually Grounded test-time framework, built around a multimodal critic trained on real GUI trajectories to abstract past interactions into a compact record and to evaluate actions with visual grounding. At test time, HiViG integrates the critic into the policy decision loop to provide macro-action history, which summarizes the policy's completed achievements, and visually grounded critique, which verifies raw execution coordinates against the current screenshot to intercept errors before execution. Across web, mobile, and desktop benchmarks, HiViG consistently outperforms existing scalar and verbal critics, improving average success rates over the strongest baseline by 5.8% for Qwen3-VL-32B and 9.0% for Gemini-3-Flash, and demonstrates strong cross-platform generalization. Ablations show that macro-action history mitigates short-sighted planning and visually grounded critique reduces execution errors, with both components being critical for test-time scaling in long-horizon GUI tasks.

cs.AI cs.CL cs.CV

