Interplay: Training Independent Simulators for Reference-Free Conversational Recommendation
Proposes a reference-free simulation framework by training independent user and recommender simulators for more realistic dialogues.
Key Findings
Methodology
The paper introduces a reference-free simulation framework by training two independent large language models (LLMs), one as a user simulator and the other as a conversational recommender. These models interact in real-time without access to predetermined target items, inferring user preferences through dialogue. The user simulator operates based on preference summaries and attribute descriptions, while the recommender generates contextually appropriate recommendations based solely on the evolving conversation. This design eliminates data leakage and allows conversations to naturally evolve, reflecting the complexity of real recommendation scenarios.
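The interaction protocol can be sketched as a simple alternating loop. This is an illustrative toy, not the paper's actual implementation: the simulator functions here are stubs standing in for the two fine-tuned LLMs, and the message prefixes are invented for clarity. The key property shown is that neither side is ever handed a target item; only the dialogue history and a preference summary flow between them.

```python
# Minimal sketch of the reference-free interaction loop (all names illustrative).
# Each simulator sees only the dialogue history; neither is given a target item.

def user_turn(history, preference_summary):
    """Stub user simulator: states a preference, accepts after a recommendation."""
    if history and history[-1].startswith("REC:"):
        return "ACCEPT: That sounds great, I'll watch it."
    return f"USER: I'm looking for {preference_summary}."

def recommender_turn(history):
    """Stub recommender: infers preferences from the conversation alone."""
    last_user = next(m for m in reversed(history) if m.startswith("USER:"))
    return f"REC: Based on '{last_user[6:]}', you might like this title."

def simulate(preference_summary, max_turns=10):
    history = []
    for _ in range(max_turns):
        history.append(user_turn(history, preference_summary))
        if history[-1].startswith("ACCEPT:"):
            break
        history.append(recommender_turn(history))
    return history

dialogue = simulate("a slow-burn sci-fi thriller")
```

In the actual framework each stub would be a fine-tuned LLM call; the loop structure, and the absence of any target item in either function's inputs, is the point being illustrated.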
Key Results
- Result 1: The user simulator achieves a success rate of 93-95% in multi-turn dialogues, significantly outperforming larger zero-shot LLMs like Llama3.1 70B (36% success rate) and Qwen3 32B (77% success rate), validating that reference-free operation preserves realistic user behavior while eliminating artificial constraints.
- Result 2: The recommender simulator RecSim-Qwen8B achieves the highest performance with Recall@1 of 0.0217 and Match Score of 0.9333, significantly outperforming the larger Qwen3-32B baseline and exceeding the modular UniCRS system, demonstrating that fine-tuning smaller models for specific conversational recommendation tasks is more effective.
- Result 3: Human evaluations show that reference-free dialogues exhibit significantly more user control and conversational flow than reference-dependent methods, although they fall slightly short in naturalness.
Significance
This research addresses the prevalent issues of data leakage and rigid dialogues in conversational recommender systems by eliminating dependency on target items. By training independent user and recommender simulators, the generated dialogues are more realistic and diverse. This approach not only improves the quality of conversational recommendation data but also offers a scalable solution for generating high-quality conversational recommendation data without constraining conversations to predefined target items, making it significant for both academia and industry.
Technical Contribution
Technical contributions include: 1) proposing a reference-free simulation framework that eliminates data leakage; 2) achieving more realistic user and recommender behaviors through independently trained specialized models; 3) matching or exceeding existing methods in dialogue quality while using smaller open-source models for a more scalable and efficient solution.
Novelty
This study is the first to propose a reference-free simulation framework, distinguishing itself from traditional methods that rely on target items. By using target attributes instead of items, simulators engage in genuine exploration, generating more natural dialogues. This approach is a fundamental innovation in conversational recommender systems, addressing the issues of rigid dialogues and data leakage present in existing methods.
Limitations
- Limitation 1: Although the method performs well in the movie recommendation domain, its effectiveness in other domains such as e-commerce, music, and travel has not yet been validated, requiring future application and evaluation in different contexts.
- Limitation 2: The method employs a structured, task-oriented dialogue framework, which does not capture the full spectrum of real-world interactions, such as open-domain chit-chat and multi-intent utterances.
- Limitation 3: While Match Score is a useful proxy for recommendation quality, it has not yet been validated as a reliable metric that aligns with real user preferences, necessitating future user studies to evaluate the correlation between automatic metrics and human judgments.
Future Work
Future directions include: 1) applying and evaluating the reference-free simulation framework in different domains such as e-commerce, music, and travel; 2) enhancing simulators to handle more open, mixed-initiative conversations; 3) validating the correlation between automatic metrics and human judgments through user studies, further improving the realism and diversity of conversational recommender systems.
AI Executive Summary
Conversational Recommender Systems (CRS) have gained attention for their ability to provide personalized, context-sensitive recommendations through natural language conversations. However, the development of CRS relies on rich conversational data, and collecting human-annotated conversations is costly and limited in quality. Traditional simulation approaches often use a single large language model (LLM) that generates entire conversations with prior knowledge of the target items, leading to scripted and artificial dialogues.
This paper proposes a reference-free simulation framework by training two independent LLMs, one as a user simulator and the other as a conversational recommender. These models interact in real-time without access to predetermined target items, inferring user preferences through dialogue. The user simulator operates based on preference summaries and attribute descriptions, while the recommender generates contextually appropriate recommendations based solely on the evolving conversation. This design eliminates data leakage and allows conversations to naturally evolve, reflecting the complexity of real recommendation scenarios.
In experiments, the user simulator achieves a success rate of 93-95% in multi-turn dialogues, significantly outperforming larger zero-shot LLMs like Llama3.1 70B (36% success rate) and Qwen3 32B (77% success rate), validating that reference-free operation preserves realistic user behavior while eliminating artificial constraints. The recommender simulator RecSim-Qwen8B achieves the highest performance with Recall@1 of 0.0217 and Match Score of 0.9333, significantly outperforming the larger Qwen3-32B baseline and exceeding the modular UniCRS system, demonstrating that fine-tuning smaller models for specific conversational recommendation tasks is more effective.
Human evaluations show that reference-free dialogues exhibit significantly more user control and conversational flow than reference-dependent methods, although they fall slightly short in naturalness. This approach not only improves the quality of conversational recommendation data but also offers a scalable solution for generating high-quality conversational recommendation data without constraining conversations to predefined target items, making it significant for both academia and industry.
However, while the method performs well in the movie recommendation domain, its effectiveness in other domains such as e-commerce, music, and travel has not yet been validated, requiring future application and evaluation in different contexts. Additionally, while Match Score is a useful proxy for recommendation quality, it has not yet been validated as a reliable metric that aligns with real user preferences, necessitating future user studies to evaluate the correlation between automatic metrics and human judgments.
Deep Analysis
Background
Conversational Recommender Systems (CRS) have become a significant research direction in artificial intelligence and natural language processing. CRS can provide personalized, context-sensitive recommendations through natural language conversations. Unlike traditional systems that rely on static user-item interactions, CRS allows dynamic, interactive feedback from users and enables both parties to guide the conversation. However, the development of CRS depends on rich conversational data, and collecting human-annotated conversations is costly and limited in quality. Previous work highlights key issues in crowd-sourced CRS datasets: lack of genuine preferences, lack of depth and context, and lack of domain expertise among crowd workers, resulting in weak recommendations and poor explanations. With the growing popularity of large language models (LLMs), researchers have begun using LLM simulators to generate authentic and context-rich conversations, which are critical for training CRS. However, existing methods generally rely on reference-dependent generation, feeding target items to simulators in advance, leading to data leakage and rigid dialogues.
Core Problem
The core problem is how to generate realistic and diverse conversational data for training conversational recommender systems. Traditional simulation methods often use a single large language model (LLM) that generates entire conversations with prior knowledge of the target items, leading to scripted and artificial dialogues. Additionally, using general-purpose LLMs as static role-players for both conversational parties is problematic, as these models are designed as helpful assistants rather than realistic user simulators with diverse and sometimes inconsistent preferences. The resulting conversations often lack the natural exploration, uncertainty, and flexibility that characterize genuine human-recommender interactions.
Innovation
The core innovation of this paper is the proposal of a reference-free simulation framework by training two independent large language models (LLMs), one as a user simulator and the other as a conversational recommender. These models interact in real-time without access to predetermined target items, inferring user preferences through dialogue. Specific innovations include: 1) the reference-free design eliminates data leakage by ensuring that simulators discover information through conversation rather than having it predetermined; 2) independent training of specialized models creates more realistic user and recommender behaviors than using general-purpose LLMs for both roles; 3) the interactive nature allows flexible conversations that can naturally evolve in multiple directions, better reflecting the complexity of real recommendation scenarios.
Methodology
The methodology of this paper includes the following steps:
- Independently train user and recommender simulators: fine-tune on existing CRS data using a reference-free framework, ensuring no access to target items during generation.
- User simulator operation: operates based on preference summaries and attribute descriptions, providing realistic feedback without knowing specific targets.
- Recommender simulator generation: generates contextually appropriate recommendations based solely on the evolving conversation.
- Structured action generation: structure the entire model output to separate the action and natural language response, ensuring the model commits to an unambiguous action before generating its utterance.
- Role-specific loss masking: use a token masking strategy during loss computation, ensuring each simulator learns exclusively from its own turns.
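The final step, role-specific loss masking, can be sketched concretely. The mechanics below are an assumption based on the common convention (e.g., in Hugging Face Transformers) that a label of -100 is ignored by the cross-entropy loss; the paper's exact implementation is not specified. When training the user simulator, only tokens from user turns keep their labels, so gradients flow only through that role's utterances.

```python
# Illustrative sketch of role-specific loss masking (assumed mechanics, following
# the common convention that label -100 is ignored by cross-entropy loss).

IGNORE_INDEX = -100

def mask_labels(token_ids, roles, train_role):
    """roles[i] is the speaker ('user' or 'rec') of token i.
    Tokens spoken by the other role are masked out of the loss."""
    return [tid if role == train_role else IGNORE_INDEX
            for tid, role in zip(token_ids, roles)]

token_ids = [11, 12, 13, 21, 22, 31]
roles     = ["user", "user", "rec", "rec", "user", "rec"]

# Training the user simulator: recommender tokens become IGNORE_INDEX.
labels = mask_labels(token_ids, roles, "user")
```

Training the recommender simulator would use the same function with `train_role="rec"`, masking the user's tokens instead.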
Experiments
The experimental design includes comprehensive evaluations of both the user simulator and the recommender simulator. Baselines used include open-source models such as Llama3.1 70B and Qwen3 32B, as well as the modular UniCRS system. Evaluation metrics include success rate in multi-turn dialogues, single-turn response quality, and recommender performance (e.g., Recall@1 and Match Score). Experiments are conducted on a filtered test set to ensure fair comparison of user and recommender roles. Human evaluations are also conducted to verify the realism and fluency of reference-free dialogues.
Results
Experimental results show that the user simulator achieves a success rate of 93-95% in multi-turn dialogues, significantly outperforming larger zero-shot LLMs like Llama3.1 70B (36% success rate) and Qwen3 32B (77% success rate), validating that reference-free operation preserves realistic user behavior while eliminating artificial constraints. The recommender simulator RecSim-Qwen8B achieves the highest performance with Recall@1 of 0.0217 and Match Score of 0.9333, significantly outperforming the larger Qwen3-32B baseline and exceeding the modular UniCRS system, demonstrating that fine-tuning smaller models for specific conversational recommendation tasks is more effective. Human evaluations show that reference-free dialogues exhibit significantly more user control and conversational flow than reference-dependent methods, although they fall slightly short in naturalness.
Applications
Direct application scenarios of this method include:
- Online movie recommendation: generating more realistic dialogue data through the reference-free simulation framework to improve the performance of recommendation systems.
- E-commerce platforms: applying to product recommendations, helping users discover new products without preset targets.
- Customer service: applying reference-free simulation in customer service dialogues to improve the naturalness and user satisfaction of conversations.
Limitations & Outlook
Although the reference-free simulation framework performs well in the movie recommendation domain, its effectiveness in other domains such as e-commerce, music, and travel has not yet been validated, requiring future application and evaluation in different contexts. Additionally, the method employs a structured, task-oriented dialogue framework, which does not capture the full spectrum of real-world interactions, such as open-domain chit-chat and multi-intent utterances. While Match Score is a useful proxy for recommendation quality, it has not yet been validated as a reliable metric that aligns with real user preferences, necessitating future user studies to evaluate the correlation between automatic metrics and human judgments.
Plain Language: Accessible to non-experts
Imagine you're at a restaurant ordering food. The traditional way of ordering is like you already know every dish on the menu, and you just tell the waiter what you want. However, sometimes you might not know exactly what you want, just that you want something spicy, preferably with chicken. In this case, the waiter needs to recommend dishes based on your description, rather than giving you a fixed option.
This paper's method is like training two independent waiters, one responsible for understanding your preferences, and the other for dynamically recommending dishes based on the conversation. This way, the recommendation process is more like a natural conversation rather than a pre-set script.
Through this approach, the system can generate more realistic dialogue data because it no longer relies on predetermined targets but infers user preferences based on the evolving conversation. It's like the restaurant waiter understanding your taste through dialogue rather than giving you a fixed menu option.
This method not only improves the realism and diversity of conversational recommender systems but also offers a scalable solution for generating high-quality conversational recommendation data without preset target items.
ELI14: Explained like you're 14
Hey there! Imagine you're playing a game, and there's a robot assistant that recommends game quests based on your preferences. The traditional way is like the robot already knows all the quests, so it just recommends the ones it thinks you'll like.
But sometimes, those recommendations can get boring because the robot keeps suggesting the same things. So, scientists came up with a new way: let the robot guess what you like instead of giving it the answers.
It's like you and the robot are playing the game together. You tell it what kind of quests you like, like 'I want something challenging that makes me think.' Then, the robot recommends quests based on your description, not a fixed option.
This makes the game dialogues more fun because the robot adjusts its recommendations based on your feedback, like you're exploring the game world together. Isn't that cool? This method not only makes the game more interesting but also makes the robot smarter!
Glossary
Large Language Model
A large language model is a machine learning model capable of processing and generating natural language text. It typically uses deep learning techniques to understand context and generate coherent text.
Used in this paper to simulate user and recommender dialogues.
Conversational Recommender System
A conversational recommender system is a system that provides personalized recommendations through natural language conversations. It allows users to engage in multi-turn dialogues to receive more tailored recommendations.
The paper aims to enhance the realism and diversity of data generated for conversational recommender systems.
Reference-Free Simulation
Reference-free simulation is a method of generating dialogues without relying on predetermined target items. It infers user preferences through real-time interaction, producing more natural dialogues.
The proposed reference-free simulation framework eliminates data leakage issues.
Data Leakage
Data leakage refers to the unintended exposure of information during model training or testing, affecting the accuracy of performance evaluation.
Traditional dialogue generation methods suffer from data leakage due to pre-provided target items.
Structured Action Generation
Structured action generation is a method of structuring model output into explicit actions and natural language responses, ensuring the model commits to an unambiguous action before generating its utterance.
The method improves the controllability and analyzability of dialogues through structured action generation.
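A minimal sketch of what parsing such structured output could look like. The `[ACTION]` tag format below is hypothetical, invented for illustration; the paper's exact output schema is not specified here. What the sketch shows is the key property: the action is committed on its own line before the utterance, so it can be extracted unambiguously.

```python
# Hypothetical sketch of structured action generation: the model emits an
# explicit action line before its natural-language utterance, so the action
# can be parsed unambiguously. The tag format is illustrative only.

def parse_structured_output(text):
    action_line, _, utterance = text.partition("\n")
    assert action_line.startswith("[ACTION]"), "model must commit to an action first"
    return action_line[len("[ACTION]"):].strip(), utterance.strip()

raw = "[ACTION] recommend(title='Blade Runner')\nHave you seen Blade Runner?"
action, utterance = parse_structured_output(raw)
```

Separating the action from the surface text in this way is what makes the generated dialogues analyzable: the simulated recommender's decisions can be evaluated directly, independent of how they are phrased.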
Role-Specific Loss Masking
Role-specific loss masking is a strategy used during loss computation to ensure each simulator learns exclusively from its own turns.
The paper uses role-specific loss masking to avoid role-swapping issues.
Success Rate
Success rate refers to the proportion of dialogues in which the user simulator successfully accepts recommendations.
The experimental results show a success rate of 93-95% in multi-turn dialogues.
Match Score
Match Score is a metric used to evaluate recommendation quality, quantifying the similarity between recommended items and the ground-truth target items.
The paper evaluates the performance of the recommender simulator using Match Score.
Recall@1
Recall@1 is a metric for evaluating the performance of recommendation systems, indicating the proportion of dialogues in which the top-ranked recommendation matches the ground-truth target item.
The RecSim-Qwen8B achieves a Recall@1 of 0.0217.
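Computed over dialogues, the metric reduces to a simple fraction. The sketch below is a generic implementation of Recall@1, not code from the paper:

```python
# Recall@1 over dialogues: the fraction of dialogues whose top-ranked
# recommendation equals the ground-truth target item.

def recall_at_1(ranked_lists, targets):
    hits = sum(1 for ranked, target in zip(ranked_lists, targets)
               if ranked and ranked[0] == target)
    return hits / len(targets)

ranked = [["A", "B"], ["C", "A"], ["B", "C"], ["D"]]
truth  = ["A", "A", "B", "C"]
score = recall_at_1(ranked, truth)  # hits in dialogues 1 and 3: 2/4 = 0.5
```

The low absolute value reported (0.0217) is typical for exact-match recall over large item catalogs, which is why the paper pairs it with the softer Match Score.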
BERTScore
BERTScore is a metric for evaluating the similarity between generated text and reference text, based on embeddings from the BERT model.
The paper uses BERTScore to assess the single-turn response quality of the user simulator.
Dist-4
Dist-4 is a metric for evaluating the diversity of generated text, calculating the diversity of four-grams in the text.
The paper assesses the diversity of generated dialogues using Dist-4.
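Dist-n is straightforward to implement: the number of distinct n-grams divided by the total number of n-grams across the generated texts. The sketch below uses whitespace tokenization; the paper's exact tokenization is an assumption not specified here.

```python
# Sketch of Dist-n (distinct n-grams / total n-grams), a standard diversity
# metric. Dist-4 is the n=4 case. Tokenization here is simple whitespace
# splitting; the paper's exact preprocessing is assumed, not specified.

def dist_n(texts, n=4):
    total, distinct = 0, set()
    for text in texts:
        tokens = text.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        distinct.update(ngrams)
    return len(distinct) / total if total else 0.0

# Two identical sentences: every 4-gram appears twice, so diversity is 0.5.
diversity = dist_n(["a b c d e", "a b c d e"])
```

A score near 1.0 means almost every 4-gram is unique (highly diverse output); repetitive, scripted dialogues drive the score down.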
Open-Domain Chit-Chat
Open-domain chit-chat refers to a form of dialogue not constrained by specific tasks, allowing participants to freely discuss various topics.
The method does not capture open-domain chit-chat dialogue forms.
Multi-Intent Utterance
A multi-intent utterance is a dialogue turn containing multiple intents, such as asking a question and providing feedback simultaneously.
The method does not handle multi-intent utterance dialogue forms.
Modular System
A modular system is a design approach that decomposes system functionality into independent modules, each responsible for a specific function.
The experimental setup includes the modular UniCRS system as a baseline.
Human Evaluation
Human evaluation is a method of subjectively assessing system output through human participants, often used to evaluate the naturalness and fluency of dialogue systems.
The paper conducts human evaluations to verify the realism and fluency of reference-free dialogues.
Open Questions: Unanswered questions from this research
- Open Question 1: The effectiveness of the reference-free simulation framework in other domains such as e-commerce, music, and travel has not yet been validated, requiring future application and evaluation in different contexts.
- Open Question 2: How to enhance simulators to handle more open, mixed-initiative conversations, capturing the full spectrum of real-world interactions such as open-domain chit-chat and multi-intent utterances.
- Open Question 3: While Match Score is a useful proxy for recommendation quality, it has not yet been validated as a reliable metric that aligns with real user preferences, necessitating future user studies to evaluate the correlation between automatic metrics and human judgments.
- Open Question 4: How to further improve the realism and diversity of conversational recommender systems without increasing computational costs.
- Open Question 5: How to generate higher-quality conversational recommendation data without relying on predetermined target items to enhance the performance of conversational recommender systems.
Applications
Immediate Applications
Online Movie Recommendation
Generating more realistic dialogue data through the reference-free simulation framework to improve the performance of recommendation systems.
E-commerce Platforms
Applying to product recommendations, helping users discover new products without preset targets.
Customer Service
Applying reference-free simulation in customer service dialogues to improve the naturalness and user satisfaction of conversations.
Long-term Vision
Cross-Domain Conversational Recommendation
Applying the reference-free simulation framework to multiple domains such as music and travel, providing personalized recommendation services.
Intelligent Conversational Assistants
Developing intelligent conversational assistants capable of handling open-domain chit-chat and multi-intent utterances, enhancing the naturalness and diversity of human-computer interactions.
Abstract
Training conversational recommender systems (CRS) requires extensive dialogue data, which is challenging to collect at scale. To address this, researchers have used simulated user-recommender conversations. Traditional simulation approaches often utilize a single large language model (LLM) that generates entire conversations with prior knowledge of the target items, leading to scripted and artificial dialogues. We propose a reference-free simulation framework that trains two independent LLMs, one as the user and one as the conversational recommender. These models interact in real-time without access to predetermined target items, relying instead on preference summaries and target attributes, enabling the recommender to genuinely infer user preferences through dialogue. This approach produces more realistic and diverse conversations that closely mirror authentic human-AI interactions. Our reference-free simulators match or exceed existing methods in quality, while offering a scalable solution for generating high-quality conversational recommendation data without constraining conversations to pre-defined target items. We conduct both quantitative and human evaluations to confirm the effectiveness of our reference-free approach.
References (20)
Towards Unified Conversational Recommender Systems via Knowledge-Enhanced Prompt Learning
Xiaolei Wang, Kun Zhou, Ji-rong Wen et al.
Pearl: A Review-driven Persona-Knowledge Grounded Conversational Recommendation Dataset
Minjin Kim, Minju Kim, Hana Kim et al.
How Reliable is Your Simulator? Analysis on the Limitations of Current LLM-based User Simulators for Conversational Recommendation
Lixi Zhu, Xiaowen Huang, Jitao Sang
Multi-Objective Intrinsic Reward Learning for Conversational Recommender Systems
Zhendong Chu, Nan Wang, Hongning Wang
From Reviews to Dialogues: Active Synthesis for Zero-Shot LLM-based Conversational Recommender System
Rohan Surana, Junda Wu, Zhouhang Xie et al.
Recommendation as a Communication Game: Self-Supervised Bot-Play for Goal-oriented Dialogue
Dongyeop Kang, Anusha Balakrishnan, Pararth Shah et al.
A LLM-based Controllable, Scalable, Human-Involved User Simulator Framework for Conversational Recommender Systems
Lixi Zhu, Xiaowen Huang, Jitao Sang
BPR: Bayesian Personalized Ranking from Implicit Feedback
Steffen Rendle, Christoph Freudenthaler, Zeno Gantner et al.
PlatoLM: Teaching LLMs in Multi-Round Dialogue via a User Simulator
Chuyi Kong, Yaxin Fan, Xiang Wan et al.
Don't lie to your friends: Learning what you know from collaborative self-play
Jacob Eisenstein, Reza Aghajani, Adam Fisch et al.
Doing Personal LAPS: LLM-Augmented Dialogue Construction for Personalized Multi-Session Conversational Search
Hideaki Joko, Shubham Chatterjee, A. Ramsay et al.
Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models
Xiaolei Wang, Xinyu Tang, Wayne Xin Zhao et al.
LLM-REDIAL: A Large-Scale Dataset for Conversational Recommender Systems Created from User Behaviors with LLMs
Tingting Liang, Chenxin Jin, Lingzhi Wang et al.
Empowering Retrieval-based Conversational Recommendation with Contrasting User Preferences
Heejin Kook, Junyoung Kim, Seongmin Park et al.
ChatGPT as a Conversational Recommender System: A User-Centric Analysis
A. Manzoor, Samuel C. Ziegler, Klaus Maria. Pirker Garcia et al.
TinyBERT: Distilling BERT for Natural Language Understanding
Xiaoqi Jiao, Yichun Yin, Lifeng Shang et al.
A Conversation is Worth A Thousand Recommendations: A Survey of Holistic Conversational Recommender Systems
Chuang Li, Hengchang Hu, Yan Zhang et al.
Towards Deep Conversational Recommendations
Raymond Li, Samira Ebrahimi Kahou, Hannes Schulz et al.
FACE: A Fine-grained Reference Free Evaluator for Conversational Recommender Systems
Hideaki Joko, Faegheh Hasibi
BERTScore: Evaluating Text Generation with BERT
Tianyi Zhang, Varsha Kishore, Felix Wu et al.