Learning User Simulators with Turing Rewards
Proposes Turing-RL, a reinforcement learning approach using discriminative Turing rewards to train human user simulators, outperforming traditional response matching methods.
Key Findings
Methodology
This paper introduces a novel framework called Turing-RL, which employs a discriminative Turing reward mechanism. A large language model (LLM) acts as a judge, scoring generated responses based on their indistinguishability from real user responses conditioned on user history and context. The training involves an initial supervised fine-tuning (SFT) phase, followed by reinforcement learning with Group Relative Policy Optimization (GRPO). The model generates multiple candidate responses via chain-of-thought reasoning, which are then evaluated by the judge. The reward signal guides the model to produce responses that are more human-like, focusing on indistinguishability rather than content matching. Experiments across multi-turn dialogue datasets (PRISM) and Reddit forum discussions (ConvoKit) demonstrate that Turing-RL consistently outperforms baseline methods like response similarity reward (Sim-RL) and log-probability maximization, achieving higher human-likeness scores and better content grounding.
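The group-relative step of GRPO can be illustrated with a minimal sketch. It assumes the commonly described form of GRPO, in which each candidate's reward is standardized against the other candidates sampled for the same prompt. The function name, the `eps` constant, the use of the population standard deviation, and the example scores are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of candidates for the same prompt.

    Candidates scored above the group mean get a positive advantage,
    those below get a negative one. `eps` avoids division by zero when
    every candidate receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical judge scores (1-7 scale) for four candidates of one prompt.
judge_scores = [3.0, 5.0, 6.0, 4.0]
advantages = group_relative_advantages(judge_scores)
```

Because advantages are relative to the group, only the ranking and spread of judge scores within a group matter, not their absolute level.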
Key Results
- In multi-turn dialogue experiments, Turing-RL achieved an average judge score of 5.3 on a 1-7 scale, compared with 4.7 for Sim-RL and 4.2 for SFT-Init, a relative improvement of roughly 13% over Sim-RL and 26% over SFT-Init. On Reddit discussions, it maintained superior performance, indicating robustness across domains.
- On content similarity, Turing-RL responses scored above 78% against ground-truth responses, comparable to Sim-RL, while achieving markedly higher human-likeness scores, indicating responses that are both content-consistent and more human-like.
- Human evaluation revealed that responses from Turing-RL were identified as real users 57% of the time, outperforming SFT-Init (49%), confirming its effectiveness in generating natural, human-like responses in practical scenarios.
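The relative gains implied by the reported averages can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the bullets above; the helper name `relative_gain` is an illustrative choice.

```python
def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0

# Average judge scores (1-7 scale) as reported in the summary.
judge = {"Turing-RL": 5.3, "Sim-RL": 4.7, "SFT-Init": 4.2}

gain_vs_sim = relative_gain(judge["Turing-RL"], judge["Sim-RL"])    # about 12.8%
gain_vs_sft = relative_gain(judge["Turing-RL"], judge["SFT-Init"])  # about 26.2%

# Human evaluation: share of responses identified as a real user.
human_gap_points = 57 - 49  # percentage points, Turing-RL vs SFT-Init
```

Note that the human-evaluation gap is a difference in percentage points, whereas the judge-score gains are relative percentages.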
Significance
This work advances the field by shifting the training objective from response matching to indistinguishability, addressing the core challenge of creating realistic user simulators. Such models can significantly improve training of conversational agents, enhance personalization systems, and provide new tools for social science research. The approach bridges the gap between content fidelity and behavioral realism, enabling more natural and diverse interactions. Its success across different domains suggests broad applicability, paving the way for more human-like AI systems that can better understand and emulate human behaviors, ultimately leading to more seamless human-AI interactions.
Technical Contribution
The main technical innovation lies in integrating a discriminative Turing reward with reinforcement learning, specifically using GRPO to optimize the user simulator. Unlike traditional maximum likelihood or similarity-based rewards, this method directly targets the indistinguishability from real users, leveraging a large-scale LLM judge. The framework employs chain-of-thought prompting for response generation, enabling logical and contextually rich responses. The combination of supervised fine-tuning and RL fine-tuning ensures stable training and improved generalization. The experimental setup includes datasets like PRISM and ConvoKit, with evaluation metrics encompassing judge scores, content similarity, and human judgments, demonstrating the model’s superior performance.
Novelty
This research is the first to formalize a reinforcement learning framework centered on the concept of indistinguishability from human responses, operationalized through a discriminative Turing reward. Unlike prior work focused solely on content matching or likelihood maximization, this approach emphasizes behavioral realism, capturing the nuanced variability of human responses. The integration of a large language model as a judge in an end-to-end training loop represents a significant methodological leap, enabling models to learn more authentic and diverse behaviors. The framework's applicability across different domains underscores its versatility and potential to redefine user simulation paradigms.
Limitations
- Despite promising results, the models still struggle with highly unpredictable or rare user behaviors, possibly due to biases in the judge model or training data limitations. This affects the diversity and robustness of the simulations.
- Training requires substantial computational resources, especially for large models and multiple evaluation steps, which may hinder large-scale deployment or real-time applications.
- While the approach improves human-likeness, it may still generate responses that lack true personalization or emotional depth, indicating room for integrating multimodal cues and richer user profiles.
Future Work
Future research will explore incorporating multimodal data, such as speech and visual cues, to enhance behavioral realism. Efforts will also focus on reducing training costs through model compression and more efficient algorithms. Additionally, extending the framework to multilingual and cross-cultural settings will be crucial for broader applicability. The authors aim to develop adaptive models that can dynamically adjust to individual user preferences, further bridging the gap between simulated and real human behaviors. Investigating the integration of reinforcement learning with other learning paradigms like unsupervised or semi-supervised learning will also be a key direction.
AI Executive Summary
The rapid evolution of large language models (LLMs) has revolutionized natural language processing, enabling unprecedented capabilities in understanding and generating human-like responses. However, a persistent challenge remains: how to develop user simulators that can accurately emulate human behaviors in interactive settings. Traditional approaches primarily focus on response matching, using metrics like BLEU or maximizing likelihood, which often result in responses that are technically correct but lack the natural variability and authenticity of real human responses.
This paper introduces Turing-RL, a framework that shifts the training objective from matching response content to indistinguishability from real users. Inspired by the classical Turing Test, the authors use a discriminative reward in which a large language model acts as a judge, evaluating whether a generated response could have been produced by a real user. This approach directly optimizes the model to produce responses that are indistinguishable from genuine human responses conditioned on user history and context.
The methodology combines supervised fine-tuning with reinforcement learning, specifically employing Group Relative Policy Optimization (GRPO), to refine the user simulator. The process involves generating multiple responses via chain-of-thought prompting, which are then scored by the judge. The reward signal guides the model to produce more natural, diverse, and contextually grounded responses. Extensive experiments on datasets like PRISM and ConvoKit demonstrate that Turing-RL significantly outperforms baseline methods such as response similarity reward (Sim-RL) and log-probability maximization, achieving higher human-likeness scores and better content grounding.
The results underscore the importance of optimizing for behavioral indistinguishability rather than mere content replication. The trained models exhibit responses that are not only content-consistent but also more aligned with human behavioral patterns, making them valuable tools for training conversational agents, enhancing personalization, and conducting social science research. Despite these advances, challenges remain, including high computational costs and the need for further improvements in response diversity and emotional depth.
Looking ahead, the authors plan to incorporate multimodal data, reduce training costs, and extend their framework across languages and cultures. This research paves the way for more realistic, adaptable, and human-like AI systems, fundamentally transforming human-AI interaction paradigms and opening new avenues for scientific exploration and industrial application.
Deep Analysis
Background
The development of large language models (LLMs) such as GPT-4 and Qwen has significantly advanced natural language understanding and generation. Early user simulators relied on rule-based systems or limited statistical models, which could not capture the richness and variability of human responses. Recent efforts like Naous et al. (2025) and Wu et al. (2026) introduced specialized user models and latent state alignment, aiming to improve realism. However, these methods often optimize for content similarity or likelihood, which does not necessarily translate into human-like behavior. The challenge has been to create models that can generate responses indistinguishable from real users, capturing the nuances, diversity, and contextual appropriateness of human communication. The classical Turing Test provides a conceptual foundation for this goal, but operationalizing it within neural models requires new training strategies and evaluation metrics. This background sets the stage for the authors’ approach, which directly targets indistinguishability from real users rather than superficial content matching.
Core Problem
The core problem addressed is how to train user simulators that produce responses not only content-accurate but also behaviorally realistic and diverse. Existing methods tend to optimize for similarity to a ground truth response, which limits variability and often results in responses that are overly generic or robotic. This mismatch hampers the utility of simulators in training robust conversational agents and conducting social science experiments. The fundamental difficulty lies in defining a training objective that captures the essence of human-like behavior, including stylistic, contextual, and behavioral nuances. Moreover, the vast space of plausible responses makes it infeasible to rely solely on content matching. The challenge is to develop a training paradigm that encourages models to produce responses that are indistinguishable from real users, balancing content fidelity with behavioral authenticity.
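The contrast between the objectives can be written schematically. In the sketch below, $h$ is the user history, $\rho$ an optional persona, $x$ the current context, $y^{*}$ the single ground-truth response, and $\pi_\theta$ the simulator policy. The symbols $\pi_\theta$, $y$, $y^{*}$, $\mathrm{sim}$, and the exact conditioning of the judge reward $r_{\text{turing}}$ are notational assumptions made here for illustration, not the paper's own formulation.

```latex
\begin{align*}
\text{Log-probability matching:}\quad
  &\max_{\theta}\; \mathbb{E}_{(h,\rho,x,y^{*})}
     \big[\log \pi_\theta(y^{*} \mid h, \rho, x)\big] \\
\text{Similarity reward:}\quad
  &\max_{\theta}\; \mathbb{E}_{(h,\rho,x,y^{*})}\,
     \mathbb{E}_{y \sim \pi_\theta(\cdot \mid h,\rho,x)}
     \big[\mathrm{sim}(y, y^{*})\big] \\
\text{Turing reward:}\quad
  &\max_{\theta}\; \mathbb{E}_{(h,\rho,x)}\,
     \mathbb{E}_{y \sim \pi_\theta(\cdot \mid h,\rho,x)}
     \big[r_{\text{turing}}(y \mid h, \rho, x)\big]
\end{align*}
```

The first two objectives are anchored to one reference $y^{*}$, while the third rewards any response the judge finds plausible for this user, which is how it accommodates the large space of valid replies.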
Innovation
The key innovation is the integration of a discriminative Turing reward within a reinforcement learning framework. This reward is derived from a large language model judge that scores responses based on their indistinguishability from real user responses, conditioned on user history and context. Unlike traditional methods that maximize likelihood or content similarity, this approach directly optimizes for human-like behavior. The authors employ chain-of-thought prompting to generate diverse candidate responses, which are then evaluated by the judge. The use of GRPO allows stable, end-to-end training, effectively guiding the model toward more natural responses. This paradigm shift from content matching to behavioral indistinguishability represents a significant advancement, enabling the creation of user simulators that better emulate real human behaviors across different domains.
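To make the limitation of single-reference matching concrete, the sketch below scores two candidate replies against one ground-truth reply with a simple string-similarity measure. `difflib.SequenceMatcher` is only a stand-in chosen for illustration; the paper's actual similarity metric is not specified here, and the example sentences are invented.

```python
from difflib import SequenceMatcher

def similarity_reward(candidate: str, ground_truth: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a matching reward."""
    return SequenceMatcher(None, candidate.lower(), ground_truth.lower()).ratio()

# Invented example: one real reply and two simulated candidates.
ground_truth = "lol no way, that movie was way too long"
near_copy = "lol no way, that movie was too long"
plausible_alt = "honestly i fell asleep halfway through"

near_copy_score = similarity_reward(near_copy, ground_truth)
plausible_alt_score = similarity_reward(plausible_alt, ground_truth)
```

Under a matching reward, the near copy scores far higher than the alternative, even though the same user could plausibly have written either reply. A judge-based reward is meant to credit both.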
Methodology
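- Input: user history (h), persona information (ρ, optional), and the current conversation context (x).
- Initialization: supervised fine-tuning (SFT) with chain-of-thought reasoning on real responses.
- Candidate generation: given the input, the model generates multiple candidate responses using chain-of-thought (CoT) reasoning.
- Judge model: a large LLM (e.g., Sonnet 4.6) acts as the judge, rating how closely each candidate resembles the real response on a 1-7 scale.
- Reward computation: the judge score is normalized and used as the reinforcement learning reward signal (r_turing).
- Reinforcement learning: GRPO optimizes the model parameters so that generated responses receive higher scores from the judge.
- Training objective: maximize the expected reward, so that the model produces more human-like responses.
- Evaluation: model performance is tested on the multi-turn dialogue and Reddit datasets, covering judge scores, content similarity, and human evaluation.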
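The judge-and-reward steps above can be sketched as follows. Everything in this block is an assumption made for illustration: the prompt wording, the `Score: <n>` output format, the `Judge` callable interface, and the min-max normalization of the 1-7 score to [0, 1] (the summary says only that the score is normalized). A stub stands in for the real LLM judge.

```python
import re
from typing import Callable, List, Optional

# Hypothetical interface: prompt text in, raw judge text out.
Judge = Callable[[str], str]

def build_judge_prompt(history: List[str], context: str, candidate: str,
                       persona: Optional[str] = None) -> str:
    """Assemble the judge input from h, optional rho, x, and one candidate."""
    parts = [
        "You will see a user's past messages and a candidate reply.",
        "Rate from 1 (clearly not this user) to 7 (indistinguishable).",
    ]
    if persona:
        parts.append(f"Persona: {persona}")
    parts.append("User history:\n" + "\n".join(f"- {m}" for m in history))
    parts.append(f"Current context:\n{context}")
    parts.append(f"Candidate reply:\n{candidate}")
    parts.append("Answer with 'Score: <1-7>'.")
    return "\n\n".join(parts)

def parse_score(judge_output: str) -> int:
    """Extract an integer score in 1..7, or raise if none is present."""
    match = re.search(r"Score:\s*([1-7])\b", judge_output)
    if match is None:
        raise ValueError(f"no valid score in judge output: {judge_output!r}")
    return int(match.group(1))

def turing_reward(judge: Judge, history: List[str], context: str,
                  candidate: str, persona: Optional[str] = None) -> float:
    """Map the 1-7 judge score to [0, 1] (assumed min-max normalization)."""
    prompt = build_judge_prompt(history, context, candidate, persona)
    return (parse_score(judge(prompt)) - 1) / 6.0

def stub_judge(prompt: str) -> str:
    """Stand-in for an LLM call; always returns the same score."""
    return "Score: 6"

reward = turing_reward(
    stub_judge,
    history=["i only watch horror movies"],
    context="Assistant: Any plans tonight?",
    candidate="prob rewatching the thing again lol",
)
```

In a training loop, `turing_reward` would be called once per sampled candidate, and the resulting rewards for one prompt would then be compared within their group by GRPO.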
Experiments
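The experiments use two datasets, PRISM (multi-turn dialogue) and ConvoKit (Reddit discussions), which contain multi-turn conversations and forum posts, respectively, from 1,500 users. The model is first fine-tuned with SFT and then with reinforcement learning, sampling 4 candidate responses, each of which is scored by the judge model (Sonnet 4.6). Evaluation metrics include the judge score (1-7 scale), response content similarity (percentage), contextual relevance, and persona consistency. Baselines include Sim-RL (content-matching reward) and Logprob-RL (log-probability maximization reward), as well as the model without RL training (SFT-Init) and larger models (GPT-5, Qwen 3.5-397B). In addition, human evaluation was conducted on the Prolific platform to assess the naturalness and realism of the model's responses.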
Results
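Turing-RL significantly outperforms the other models on judge score, averaging 5.3 versus 4.7 for Sim-RL and 4.2 for SFT-Init. On content similarity, Turing-RL and Sim-RL perform comparably (both above 78%), but Turing-RL's judge score is clearly higher, indicating that it stays content-consistent while being more human-like. In human evaluation, Turing-RL responses were identified as coming from real users 57% of the time, better than the other models. The model shows greater realism and naturalness in both settings, supporting its potential for practical use.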
Abstract
Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by training a large language model (LLM) to match a single ground truth response, either by maximizing the log probability or by using a similarity reward. We instead propose Turing-RL: a Turing-Test-based reinforcement learning approach for training user simulator models. Turing-RL uses a discriminative Turing reward with an LLM judge to score how indistinguishable a generated response is from the real user's given the user's history, and the user simulator LLM learns to produce responses indistinguishable from what the user could have said with such rewards. Across two different domains (conversational chat and Reddit forum discussion), we find that Turing-RL consistently outperforms baseline methods on both LLM and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.