VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation

TL;DR

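VISTA introduces a hybrid user simulator that combines UI and API actions, together with six metrics for realism, capability coverage, and interaction effectiveness. In e-commerce shopping and education customer service settings, it produces more realistic and comprehensive evaluations than existing methods.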

cs.CL 2026-06-10

Yunan Lu, Ryan Shea, Yusen Zhang, Zhou Yu


user simulation, interaction evaluation, multimodal interaction, metric system, deep learning

Key Findings

Methodology

VISTA employs an evaluation framework of six metrics spanning coverage, realism, cost, and failure detection. Its main innovation is a hybrid user simulator that integrates both UI-based actions and API calls, powered by large pre-trained language models such as GPT-5.4 and Qwen3.5-27B. The evaluation process involves scenario generation, tool invocation, UI action prediction, and an iterative observation-planning-action loop. Named metrics include TransitionEntropy, ToolDistrEntropy, and Trajectory Distance, which quantify diversity, behavioral coverage, and path variation. The system constructs task scenarios in domains such as e-commerce and education, then assesses simulation quality through both automated metrics and human judgment.
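The summary names TransitionEntropy and ToolDistrEntropy but does not give their formulas. As a rough illustration only, the sketch below assumes both are Shannon entropies over empirical counts: one over which actions or tools appear, the other over consecutive action pairs. The function names and action labels are hypothetical, and the scores quoted in this summary look normalized, which this sketch does not attempt.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Sequence


def shannon_entropy(items: Iterable[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def tool_distribution_entropy(trajectories: Sequence[Sequence[str]]) -> float:
    """Assumed reading: entropy over actions/tools pooled across trajectories."""
    return shannon_entropy(a for traj in trajectories for a in traj)


def transition_entropy(trajectories: Sequence[Sequence[str]]) -> float:
    """Assumed reading: entropy over consecutive (previous, next) action pairs."""
    return shannon_entropy(
        pair for traj in trajectories for pair in zip(traj, traj[1:])
    )


# Hypothetical action names, for illustration only.
runs = [
    ["search", "click", "add_to_cart"],
    ["search", "api:get_order", "click"],
]
print(tool_distribution_entropy(runs))
print(transition_entropy(runs))
```

Under this reading, a simulator that repeats the same few actions in the same order scores near zero, while one that mixes UI and API steps in varied orders scores higher.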

Key Results

In e-commerce scenarios, the hybrid simulator improved coverage metrics such as TransitionEntropy by up to 10% and detected 42% more unique agent failures than pure UI simulators, indicating broader exploration of agent capabilities. For example, with GPT-5.4 the hybrid simulator achieved a TransitionEntropy (TE) of 0.34, a ToolDistrEntropy (TDE) of 0.62, and a Trajectory Distance (TD) of 0.96, versus TE=0.33, TDE=0.59, and TD=0.94 for the UI-only version. In education settings, the hybrid simulator achieved 100% goal consistency, 6% above the UI-only approach, and exhibited higher behavioral diversity and robustness in complex tasks.
Human evaluations corroborated these findings: the hybrid simulator was rated approximately 6% higher on 'human-likeness,' 'coherence,' and 'goal consistency,' and was mistaken for a real user 6% more often. These results support the metrics' effectiveness and the greater realism of the hybrid approach, especially in complex multi-step interactions.
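To put the GPT-5.4 example in perspective, the short calculation below converts the quoted TE, TDE, and TD scores into relative gains of the hybrid simulator over the UI-only version. Only the six scores come from the text; the helper name is illustrative. The resulting gains are smaller than the headline coverage figure, which presumably refers to a different configuration that this summary does not specify.

```python
# Scores quoted in the text for the GPT-5.4 simulator in e-commerce.
hybrid = {"TE": 0.34, "TDE": 0.62, "TD": 0.96}
ui_only = {"TE": 0.33, "TDE": 0.59, "TD": 0.94}


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, as a percentage."""
    return (new - old) / old * 100.0


gains = {name: relative_gain(hybrid[name], ui_only[name]) for name in hybrid}
for name, gain in gains.items():
    print(f"{name}: {gain:.1f}%")
# TE: 3.0%
# TDE: 5.1%
# TD: 2.1%
```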

Significance

This research advances the field of interactive agent evaluation by introducing a multimodal, hybrid simulation framework that overcomes the limitations of single-mode approaches. The six metrics provide a systematic, quantitative assessment of simulation quality, covering diversity, realism, and failure detection. The framework's ability to generate more realistic and comprehensive interactions addresses longstanding challenges in evaluating complex, multi-step systems, thereby enhancing the reliability and robustness of deployed agents. Its successful application in e-commerce and educational domains indicates broad potential for industry adoption, paving the way for standardized, automated evaluation pipelines that can keep pace with rapid model evolution.

Technical Contribution

VISTA's technical innovations include the development of a hybrid user simulator capable of executing both UI and API actions, combined with a robust metric suite for multi-dimensional evaluation. The key components involve:
• Algorithms for measuring diversity (TransitionEntropy, ToolDistrEntropy) and behavioral variation (Trajectory Distance);
• Integration of large language models for multi-turn, context-aware interaction generation;
• An iterative observation-planning-action loop that dynamically adapts to task requirements;
• Modular scenario construction with user profiles, goals, and domain knowledge, enabling scalable testing across domains.
These advancements significantly improve the fidelity, coverage, and interpretability of simulation-based evaluation, surpassing prior state-of-the-art methods that relied solely on either UI or API interactions.
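The summary does not define Trajectory Distance, the metric listed above for behavioral variation. One plausible reading, assumed here purely for illustration, is the average length-normalized edit distance between every pair of simulated action sequences. The function names and action labels below are hypothetical and are not taken from the toolkit.

```python
from itertools import combinations
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two action sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def mean_pairwise_distance(trajectories: Sequence[Sequence[str]]) -> float:
    """Average length-normalized edit distance over all trajectory pairs."""
    pairs = list(combinations(trajectories, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        longest = max(len(a), len(b))
        total += edit_distance(a, b) / longest if longest else 0.0
    return total / len(pairs)


# Hypothetical trajectories: the two runs differ in one of three steps.
runs = [
    ["search", "click", "buy"],
    ["search", "api:get_order", "buy"],
]
print(mean_pairwise_distance(runs))
```

With this definition the score lies between 0 (all trajectories identical) and 1 (no pair shares any aligned step), which is at least compatible with the 0.94 to 0.96 range reported in this summary.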

Novelty

This work fuses UI-based and API-based user simulation into a unified framework, enabling more realistic and versatile interaction modeling. Unlike most previous frameworks, which are restricted to a single interaction mode, VISTA draws on the strengths of both: the naturalness and flexibility of UI actions, and the reliability and efficiency of API calls. The six-metric suite further distinguishes this work, providing a systematic way to evaluate simulation quality across multiple dimensions. Together, hybrid simulation and quantitative assessment open new avenues for scalable, high-fidelity agent evaluation.

Limitations

The reliance on large pre-trained models like GPT-5.4 entails high computational costs, which may limit deployment in resource-constrained environments. Future work could focus on model compression or distillation techniques.
While the metrics capture diversity and realism, they do not explicitly evaluate emotional or contextual nuances, which are critical for user satisfaction. Incorporating sentiment analysis and user experience metrics could be beneficial.
The current framework assumes well-defined scenarios and structured domain knowledge, which may not generalize well to unstructured or novel environments. Extending the system's adaptability to such scenarios remains an open challenge.

Future Work

Future directions include optimizing the simulator's efficiency to reduce computational overhead, integrating richer emotional and contextual understanding, and expanding multimodal capabilities to include speech and images. Additionally, validating VISTA across more diverse real-world applications—such as healthcare, finance, and social media—will be crucial. Developing standardized benchmarks and open datasets for multimodal interaction evaluation can further accelerate research. Ultimately, refining the framework to support real-time, adaptive evaluation in dynamic environments will be a key goal, pushing the boundaries of autonomous agent assessment.

AI Executive Summary

In the rapidly evolving landscape of artificial intelligence, the development and deployment of interactive agents—such as chatbots, virtual assistants, and customer service bots—have become increasingly prevalent across industries like e-commerce, education, and healthcare. However, evaluating the true capabilities and limitations of these agents remains a significant challenge. Traditional benchmark tests, often static and task-specific, fail to capture the complex, multi-step, and often unpredictable nature of real-world interactions. These methods tend to overlook failure modes such as poor action selection, tool misuse, or context misunderstanding, which are critical for ensuring reliability and user satisfaction.

To address these limitations, Yunan Lu and colleagues introduced VISTA, a versatile user simulation toolkit designed for comprehensive evaluation of interactive agents. Unlike most existing frameworks, which rely solely on UI-based or API-based interactions, VISTA integrates both into a hybrid simulation environment. This approach allows for more realistic, flexible, and thorough testing of agents across diverse scenarios, including e-commerce shopping and education customer service. Alongside the simulator, the toolkit provides six evaluation metrics, covering aspects such as capability coverage, realism, cost, and failure detection, that give a multi-dimensional assessment of simulation quality.

At the heart of VISTA is a sophisticated simulation engine powered by large-scale pre-trained language models such as GPT-5.4 and Qwen3.5-27B. These models generate multi-turn interactions, guided by a structured scenario framework that includes user profiles, goals, and domain knowledge. The simulation operates through an iterative loop of observation, planning, and action, where the system dynamically decides whether to invoke API tools or perform UI operations based on the current context. This flexible mechanism enables the simulation to emulate complex behaviors such as navigation, tool use, and multi-step workflows, closely mirroring real user behavior.
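The loop described above can be sketched in a few lines. This is a minimal illustration, not the toolkit's actual interface: the `Scenario` and `Action` classes, the `run_episode` function, and the toy planner and environment are all hypothetical, and a rule-based planner stands in for the language model that would normally choose between UI and API actions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Scenario:
    """Hypothetical scenario record: profile, goal, and domain knowledge."""
    user_profile: str
    goal: str
    domain_knowledge: List[str] = field(default_factory=list)


@dataclass
class Action:
    kind: str           # "ui", "api", or "stop"
    name: str
    argument: str = ""


Planner = Callable[[Scenario, str, List[Action]], Action]
Environment = Callable[[Action], str]


def run_episode(scenario: Scenario, plan: Planner, env: Environment,
                first_observation: str, max_steps: int = 10) -> List[Action]:
    """Observe, plan, act; repeat until the planner stops or steps run out."""
    history: List[Action] = []
    observation = first_observation
    for _ in range(max_steps):
        action = plan(scenario, observation, history)  # planning step
        if action.kind == "stop":
            break
        history.append(action)
        observation = env(action)                      # act, then observe
    return history


def toy_planner(scenario: Scenario, observation: str,
                history: List[Action]) -> Action:
    """Rule-based stand-in for an LLM planner (illustration only)."""
    if "order_status" in observation:
        return Action("stop", "done")
    if not history:
        return Action("ui", "click", "Orders tab")
    return Action("api", "get_order_status", "order-123")


def toy_env(action: Action) -> str:
    """Toy environment that returns a text observation for each action."""
    if action.kind == "api":
        return "order_status: shipped"
    return f"page after {action.name} on {action.argument}"


scenario = Scenario("returning shopper", "check where my order is")
trace = run_episode(scenario, toy_planner, toy_env, "home page")
print([(a.kind, a.name) for a in trace])
```

Keeping the planner and environment as plain callables means the same loop could drive a UI-only, API-only, or hybrid simulator simply by restricting which action kinds the planner may return.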

Experimental results demonstrate that VISTA’s hybrid simulator significantly outperforms traditional UI-only approaches. In e-commerce tasks, it increased coverage metrics by up to 10%, uncovered 42% more unique agent failures, and achieved higher realism scores, including 100% goal consistency in education scenarios. Human evaluations further confirmed that the hybrid approach produces interactions that are more human-like, coherent, and goal-oriented, with a 6% higher likelihood of being mistaken for real users. These findings highlight the importance of multimodal simulation in capturing the full spectrum of user behaviors and failure modes.

The broader impact of this work lies in establishing a systematic, quantitative framework for evaluating interactive agents. By providing a set of comprehensive metrics and a flexible simulation environment, VISTA paves the way for more reliable, scalable, and realistic assessment methods. This has profound implications for deploying AI agents in real-world settings, where robustness and user trust are paramount. Looking ahead, future research will focus on reducing computational costs, expanding multimodal capabilities, and applying the framework to new domains such as healthcare and social media. Overall, VISTA represents a significant step toward automated, high-fidelity evaluation of complex interactive systems, fostering the development of more capable and trustworthy AI agents.

Deep Dive

Abstract

Evaluation remains a critical bottleneck for interactive agent development. Existing evaluation methods often rely on static benchmarks, which fail to capture the dynamic, multi-step nature of agentic behavior and struggle to expose meaningful failure modes. While user-simulation-based evaluation offers a promising alternative, existing simulation frameworks suffer from two major limitations. First, they provide limited mechanisms for evaluating the quality and comprehensiveness of simulated interactions, making it difficult to assess whether a simulator sufficiently explores an agent's capabilities and failure modes. Second, most frameworks are restricted to either UI-only actions or API-only actions, limiting their ability to model the full range of realistic user behaviors. To address these limitations, we propose VISTA, a Versatile Interactive user Simulation Toolkit for Agent evaluation. Our toolkit includes a suite of six metrics for measuring the realism, capability coverage, and interaction effectiveness of simulated interactions. In addition, we develop a hybrid user simulator that integrates both UI-based interactions and API-based interactions, enabling more realistic and comprehensive evaluation across diverse interactive environments. We evaluate VISTA in e-commerce shopping and education customer service settings and demonstrate that it produces more realistic and comprehensive evaluations than existing methods.


