Interplay: Training Independent Simulators for Reference-Free Conversational Recommendation
Proposes a reference-free simulation framework by training independent user and recommender simulators for more realistic dialogues.
Key Findings
Methodology
The paper introduces a reference-free simulation framework by training two independent large language models (LLMs), one as a user simulator and the other as a conversational recommender. These models interact in real-time without access to predetermined target items, inferring user preferences through dialogue. The user simulator operates based on preference summaries and attribute descriptions, while the recommender generates contextually appropriate recommendations based solely on the evolving conversation. This design eliminates data leakage and allows conversations to naturally evolve, reflecting the complexity of real recommendation scenarios.
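The interaction protocol can be sketched as a simple alternating loop. This is an illustrative toy, not the paper's actual implementation: the simulator functions here are stubs standing in for the two fine-tuned LLMs, and the message prefixes are invented for clarity. The key property shown is that neither side is ever handed a target item; only the dialogue history and a preference summary flow between them.

```python
# Minimal sketch of the reference-free interaction loop (all names illustrative).
# Each simulator sees only the dialogue history; neither is given a target item.

def user_turn(history, preference_summary):
    """Stub user simulator: states a preference, accepts after a recommendation."""
    if history and history[-1].startswith("REC:"):
        return "ACCEPT: That sounds great, I'll watch it."
    return f"USER: I'm looking for {preference_summary}."

def recommender_turn(history):
    """Stub recommender: infers preferences from the conversation alone."""
    last_user = next(m for m in reversed(history) if m.startswith("USER:"))
    return f"REC: Based on '{last_user[6:]}', you might like this title."

def simulate(preference_summary, max_turns=10):
    history = []
    for _ in range(max_turns):
        history.append(user_turn(history, preference_summary))
        if history[-1].startswith("ACCEPT:"):
            break
        history.append(recommender_turn(history))
    return history

dialogue = simulate("a slow-burn sci-fi thriller")
```

In the actual framework each stub would be a fine-tuned LLM call; the loop structure, and the absence of any target item in either function's inputs, is the point being illustrated.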
Key Results
- Result 1: The user simulator achieves a success rate of 93-95% in multi-turn dialogues, significantly outperforming larger zero-shot LLMs like Llama3.1 70B (36% success rate) and Qwen3 32B (77% success rate), validating that reference-free operation preserves realistic user behavior while eliminating artificial constraints.
- Result 2: The recommender simulator RecSim-Qwen8B achieves the highest performance with Recall@1 of 0.0217 and Match Score of 0.9333, significantly outperforming the larger Qwen3-32B baseline and exceeding the modular UniCRS system, demonstrating that fine-tuning smaller models for specific conversational recommendation tasks is more effective.
- Result 3: Human evaluations show that reference-free dialogues exhibit significantly more user control and conversational flow than reference-dependent methods, although they fall slightly short in naturalness.
Significance
This research addresses the prevalent issues of data leakage and rigid dialogues in conversational recommender systems by eliminating dependency on target items. By training independent user and recommender simulators, the generated dialogues are more realistic and diverse. This approach not only improves the quality of conversational recommendation data but also offers a scalable solution for generating high-quality conversational recommendation data without constraining conversations to predefined target items, making it significant for both academia and industry.
Technical Contribution
Technical contributions include: 1) proposing a reference-free simulation framework that eliminates data leakage; 2) achieving more realistic user and recommender behaviors through independently trained specialized models; 3) matching or exceeding existing methods in dialogue quality while using smaller open-source models for a more scalable and efficient solution.
Novelty
This study is the first to propose a reference-free simulation framework, distinguishing itself from traditional methods that rely on target items. By using target attributes instead of items, simulators engage in genuine exploration, generating more natural dialogues. This approach is a fundamental innovation in conversational recommender systems, addressing the issues of rigid dialogues and data leakage present in existing methods.
Limitations
- Limitation 1: Although the method performs well in the movie recommendation domain, its effectiveness in other domains such as e-commerce, music, and travel has not yet been validated, requiring future application and evaluation in different contexts.
- Limitation 2: The method employs a structured, task-oriented dialogue framework, which does not capture the full spectrum of real-world interactions, such as open-domain chit-chat and multi-intent utterances.
- Limitation 3: While Match Score is a useful proxy for recommendation quality, it has not yet been validated as a reliable metric that aligns with real user preferences, necessitating future user studies to evaluate the correlation between automatic metrics and human judgments.
Future Work
Future directions include: 1) applying and evaluating the reference-free simulation framework in different domains such as e-commerce, music, and travel; 2) enhancing simulators to handle more open, mixed-initiative conversations; 3) validating the correlation between automatic metrics and human judgments through user studies, further improving the realism and diversity of conversational recommender systems.
AI Executive Summary
Conversational Recommender Systems (CRS) have gained attention for their ability to provide personalized, context-sensitive recommendations through natural language conversations. However, the development of CRS relies on rich conversational data, and collecting human-annotated conversations is costly and limited in quality. Traditional simulation approaches often use a single large language model (LLM) that generates entire conversations with prior knowledge of the target items, leading to scripted and artificial dialogues.
This paper proposes a reference-free simulation framework by training two independent LLMs, one as a user simulator and the other as a conversational recommender. These models interact in real-time without access to predetermined target items, inferring user preferences through dialogue. The user simulator operates based on preference summaries and attribute descriptions, while the recommender generates contextually appropriate recommendations based solely on the evolving conversation. This design eliminates data leakage and allows conversations to naturally evolve, reflecting the complexity of real recommendation scenarios.
In experiments, the user simulator achieves a success rate of 93-95% in multi-turn dialogues, significantly outperforming larger zero-shot LLMs like Llama3.1 70B (36% success rate) and Qwen3 32B (77% success rate), validating that reference-free operation preserves realistic user behavior while eliminating artificial constraints. The recommender simulator RecSim-Qwen8B achieves the highest performance with Recall@1 of 0.0217 and Match Score of 0.9333, significantly outperforming the larger Qwen3-32B baseline and exceeding the modular UniCRS system, demonstrating that fine-tuning smaller models for specific conversational recommendation tasks is more effective.
Human evaluations show that reference-free dialogues exhibit significantly more user control and conversational flow than reference-dependent methods, although they fall slightly short in naturalness. This approach not only improves the quality of conversational recommendation data but also offers a scalable solution for generating high-quality conversational recommendation data without constraining conversations to predefined target items, making it significant for both academia and industry.
However, while the method performs well in the movie recommendation domain, its effectiveness in other domains such as e-commerce, music, and travel has not yet been validated, requiring future application and evaluation in different contexts. Additionally, while Match Score is a useful proxy for recommendation quality, it has not yet been validated as a reliable metric that aligns with real user preferences, necessitating future user studies to evaluate the correlation between automatic metrics and human judgments.
Deep Analysis
Background
Conversational Recommender Systems (CRS) have become a significant research direction in artificial intelligence and natural language processing. CRS can provide personalized, context-sensitive recommendations through natural language conversations. Unlike traditional systems that rely on static user-item interactions, CRS allows dynamic, interactive feedback from users and enables both parties to guide the conversation. However, the development of CRS depends on rich conversational data, and collecting human-annotated conversations is costly and limited in quality. Previous work highlights key issues in crowd-sourced CRS datasets: lack of genuine preferences, lack of depth and context, and lack of domain expertise among crowd workers, resulting in weak recommendations and poor explanations. With the growing popularity of large language models (LLMs), researchers have begun using LLM simulators to generate authentic and context-rich conversations, which are critical for training CRS. However, existing methods generally rely on reference-dependent generation, feeding target items to simulators in advance, leading to data leakage and rigid dialogues.
Core Problem
The core problem is how to generate realistic and diverse conversational data for training conversational recommender systems. Traditional simulation methods often use a single large language model (LLM) that generates entire conversations with prior knowledge of the target items, leading to scripted and artificial dialogues. Additionally, using general-purpose LLMs as static role-players for both conversational parties is problematic, as these models are designed as helpful assistants rather than realistic user simulators with diverse and sometimes inconsistent preferences. The resulting conversations often lack the natural exploration, uncertainty, and flexibility that characterize genuine human-recommender interactions.
Innovation
The core innovation of this paper is the proposal of a reference-free simulation framework by training two independent large language models (LLMs), one as a user simulator and the other as a conversational recommender. These models interact in real-time without access to predetermined target items, inferring user preferences through dialogue. Specific innovations include: 1) the reference-free design eliminates data leakage by ensuring that simulators discover information through conversation rather than having it predetermined; 2) independent training of specialized models creates more realistic user and recommender behaviors than using general-purpose LLMs for both roles; 3) the interactive nature allows flexible conversations that can naturally evolve in multiple directions, better reflecting the complexity of real recommendation scenarios.
Methodology
The methodology of this paper includes the following steps:
- Independently train user and recommender simulators: fine-tune on existing CRS data using a reference-free framework, ensuring no access to target items during generation.
- User simulator operation: operates based on preference summaries and attribute descriptions, providing realistic feedback without knowing specific targets.
- Recommender simulator generation: generates contextually appropriate recommendations based solely on the evolving conversation.
- Structured action generation: structure the entire model output to separate the action and natural language response, ensuring the model commits to an unambiguous action before generating its utterance.
- Role-specific loss masking: use a token masking strategy during loss computation, ensuring each simulator learns exclusively from its own turns.
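The final step, role-specific loss masking, can be sketched concretely. The mechanics below are an assumption based on the common convention (e.g., in Hugging Face Transformers) that a label of -100 is ignored by the cross-entropy loss; the paper's exact implementation is not specified. When training the user simulator, only tokens from user turns keep their labels, so gradients flow only through that role's utterances.

```python
# Illustrative sketch of role-specific loss masking (assumed mechanics, following
# the common convention that label -100 is ignored by cross-entropy loss).

IGNORE_INDEX = -100

def mask_labels(token_ids, roles, train_role):
    """roles[i] is the speaker ('user' or 'rec') of token i.
    Tokens spoken by the other role are masked out of the loss."""
    return [tid if role == train_role else IGNORE_INDEX
            for tid, role in zip(token_ids, roles)]

token_ids = [11, 12, 13, 21, 22, 31]
roles     = ["user", "user", "rec", "rec", "user", "rec"]

# Training the user simulator: recommender tokens become IGNORE_INDEX.
labels = mask_labels(token_ids, roles, "user")
```

Training the recommender simulator would use the same function with `train_role="rec"`, masking the user's tokens instead.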
Experiments
The experimental design includes comprehensive evaluations of both the user simulator and the recommender simulator. Baselines used include open-source models such as Llama3.1 70B and Qwen3 32B, as well as the modular UniCRS system. Evaluation metrics include success rate in multi-turn dialogues, single-turn response quality, and recommender performance (e.g., Recall@1 and Match Score). Experiments are conducted on a filtered test set to ensure fair comparison of user and recommender roles. Human evaluations are also conducted to verify the realism and fluency of reference-free dialogues.
Results
Experimental results show that the user simulator achieves a success rate of 93-95% in multi-turn dialogues, significantly outperforming larger zero-shot LLMs like Llama3.1 70B (36% success rate) and Qwen3 32B (77% success rate), validating that reference-free operation preserves realistic user behavior while eliminating artificial constraints. The recommender simulator RecSim-Qwen8B achieves the highest performance with Recall@1 of 0.0217 and Match Score of 0.9333, significantly outperforming the larger Qwen3-32B baseline and exceeding the modular UniCRS system, demonstrating that fine-tuning smaller models for specific conversational recommendation tasks is more effective. Human evaluations show that reference-free dialogues exhibit significantly more user control and conversational flow than reference-dependent methods, although they fall slightly short in naturalness.
Applications
Direct application scenarios of this method include:
- Online movie recommendation: generating more realistic dialogue data through the reference-free simulation framework to improve the performance of recommendation systems.
- E-commerce platforms: applying to product recommendations, helping users discover new products without preset targets.
- Customer service: applying reference-free simulation in customer service dialogues to improve the naturalness and user satisfaction of conversations.
Limitations & Outlook
Although the reference-free simulation framework performs well in the movie recommendation domain, its effectiveness in other domains such as e-commerce, music, and travel has not yet been validated, requiring future application and evaluation in different contexts. Additionally, the method employs a structured, task-oriented dialogue framework, which does not capture the full spectrum of real-world interactions, such as open-domain chit-chat and multi-intent utterances. While Match Score is a useful proxy for recommendation quality, it has not yet been validated as a reliable metric that aligns with real user preferences, necessitating future user studies to evaluate the correlation between automatic metrics and human judgments.
Plain Language: Accessible to non-experts
Imagine you're at a restaurant ordering food. The traditional way of ordering is like you already know every dish on the menu, and you just tell the waiter what you want. However, sometimes you might not know exactly what you want, just that you want something spicy, preferably with chicken. In this case, the waiter needs to recommend dishes based on your description, rather than giving you a fixed option.
This paper's method is like training two independent waiters, one responsible for understanding your preferences, and the other for dynamically recommending dishes based on the conversation. This way, the recommendation process is more like a natural conversation rather than a pre-set script.
Through this approach, the system can generate more realistic dialogue data because it no longer relies on predetermined targets but infers user preferences based on the evolving conversation. It's like the restaurant waiter understanding your taste through dialogue rather than giving you a fixed menu option.
This method not only improves the realism and diversity of conversational recommender systems but also offers a scalable solution for generating high-quality conversational recommendation data without preset target items.
ELI14: Explained like you're 14
Hey there! Imagine you're playing a game, and there's a robot assistant that recommends game quests based on your preferences. The traditional way is like the robot already knows all the quests, so it just recommends the ones it thinks you'll like.
But sometimes, those recommendations can get boring because the robot keeps suggesting the same things. So, scientists came up with a new way: let the robot guess what you like instead of giving it the answers.
It's like you and the robot are playing the game together. You tell it what kind of quests you like, like 'I want something challenging that makes me think.' Then, the robot recommends quests based on your description, not a fixed option.
This makes the game dialogues more fun because the robot adjusts its recommendations based on your feedback, like you're exploring the game world together. Isn't that cool? This method not only makes the game more interesting but also makes the robot smarter!
Glossary
Large Language Model
A large language model is a machine learning model capable of processing and generating natural language text. It typically uses deep learning techniques to understand context and generate coherent text.
Used in this paper to simulate user and recommender dialogues.
Conversational Recommender System
A conversational recommender system is a system that provides personalized recommendations through natural language conversations. It allows users to engage in multi-turn dialogues to receive more tailored recommendations.
The paper aims to enhance the realism and diversity of data generated for conversational recommender systems.
Reference-Free Simulation
Reference-free simulation is a method of generating dialogues without relying on predetermined target items. It infers user preferences through real-time interaction, producing more natural dialogues.
The proposed reference-free simulation framework eliminates data leakage issues.
Data Leakage
Data leakage refers to the unintended exposure of information during model training or testing, affecting the accuracy of performance evaluation.
Traditional dialogue generation methods suffer from data leakage due to pre-provided target items.
Structured Action Generation
Structured action generation is a method of structuring model output into explicit actions and natural language responses, ensuring the model commits to an unambiguous action before generating its utterance.
The method improves the controllability and analyzability of dialogues through structured action generation.
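A minimal sketch of what parsing such structured output could look like. The `[ACTION]` tag format below is hypothetical, invented for illustration; the paper's exact output schema is not specified here. What the sketch shows is the key property: the action is committed on its own line before the utterance, so it can be extracted unambiguously.

```python
# Hypothetical sketch of structured action generation: the model emits an
# explicit action line before its natural-language utterance, so the action
# can be parsed unambiguously. The tag format is illustrative only.

def parse_structured_output(text):
    action_line, _, utterance = text.partition("\n")
    assert action_line.startswith("[ACTION]"), "model must commit to an action first"
    return action_line[len("[ACTION]"):].strip(), utterance.strip()

raw = "[ACTION] recommend(title='Blade Runner')\nHave you seen Blade Runner?"
action, utterance = parse_structured_output(raw)
```

Separating the action from the surface text in this way is what makes the generated dialogues analyzable: the simulated recommender's decisions can be evaluated directly, independent of how they are phrased.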
Role-Specific Loss Masking
Role-specific loss masking is a strategy used during loss computation to ensure each simulator learns exclusively from its own turns.
The paper uses role-specific loss masking to avoid role-swapping issues.
Success Rate
Success rate refers to the proportion of dialogues in which the user simulator successfully accepts recommendations.
The experimental results show a success rate of 93-95% in multi-turn dialogues.
Match Score
Match Score is a metric used to evaluate recommendation quality, quantifying the similarity between recommended items and the ground-truth target items.
The paper evaluates the performance of the recommender simulator using Match Score.
Recall@1
Recall@1 is a metric for evaluating the performance of recommendation systems, indicating the proportion of dialogues in which the top-ranked recommendation matches the ground-truth target item.
The RecSim-Qwen8B achieves a Recall@1 of 0.0217.
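Computed over dialogues, the metric reduces to a simple fraction. The sketch below is a generic implementation of Recall@1, not code from the paper:

```python
# Recall@1 over dialogues: the fraction of dialogues whose top-ranked
# recommendation equals the ground-truth target item.

def recall_at_1(ranked_lists, targets):
    hits = sum(1 for ranked, target in zip(ranked_lists, targets)
               if ranked and ranked[0] == target)
    return hits / len(targets)

ranked = [["A", "B"], ["C", "A"], ["B", "C"], ["D"]]
truth  = ["A", "A", "B", "C"]
score = recall_at_1(ranked, truth)  # hits in dialogues 1 and 3: 2/4 = 0.5
```

The low absolute value reported (0.0217) is typical for exact-match recall over large item catalogs, which is why the paper pairs it with the softer Match Score.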
BERTScore
BERTScore is a metric for evaluating the similarity between generated text and reference text, based on embeddings from the BERT model.
The paper uses BERTScore to assess the single-turn response quality of the user simulator.
Dist-4
Dist-4 is a metric for evaluating the diversity of generated text, calculating the diversity of four-grams in the text.
The paper assesses the diversity of generated dialogues using Dist-4.
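Dist-n is straightforward to implement: the number of distinct n-grams divided by the total number of n-grams across the generated texts. The sketch below uses whitespace tokenization; the paper's exact tokenization is an assumption not specified here.

```python
# Sketch of Dist-n (distinct n-grams / total n-grams), a standard diversity
# metric. Dist-4 is the n=4 case. Tokenization here is simple whitespace
# splitting; the paper's exact preprocessing is assumed, not specified.

def dist_n(texts, n=4):
    total, distinct = 0, set()
    for text in texts:
        tokens = text.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        distinct.update(ngrams)
    return len(distinct) / total if total else 0.0

# Two identical sentences: every 4-gram appears twice, so diversity is 0.5.
diversity = dist_n(["a b c d e", "a b c d e"])
```

A score near 1.0 means almost every 4-gram is unique (highly diverse output); repetitive, scripted dialogues drive the score down.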
Open-Domain Chit-Chat
Open-domain chit-chat refers to a form of dialogue not constrained by specific tasks, allowing participants to freely discuss various topics.
The method does not capture open-domain chit-chat dialogue forms.
Multi-Intent Utterance
A multi-intent utterance is a dialogue turn containing multiple intents, such as asking a question and providing feedback simultaneously.
The method does not handle multi-intent utterance dialogue forms.
Modular System
A modular system is a design approach that decomposes system functionality into independent modules, each responsible for a specific function.
The experimental setup includes the modular UniCRS system as a baseline.
Human Evaluation
Human evaluation is a method of subjectively assessing system output through human participants, often used to evaluate the naturalness and fluency of dialogue systems.
The paper conducts human evaluations to verify the realism and fluency of reference-free dialogues.
Open Questions: Unanswered questions from this research
- Open Question 1: The effectiveness of the reference-free simulation framework in other domains such as e-commerce, music, and travel has not yet been validated, requiring future application and evaluation in different contexts.
- Open Question 2: How to enhance simulators to handle more open, mixed-initiative conversations, capturing the full spectrum of real-world interactions such as open-domain chit-chat and multi-intent utterances.
- Open Question 3: While Match Score is a useful proxy for recommendation quality, it has not yet been validated as a reliable metric that aligns with real user preferences, necessitating future user studies to evaluate the correlation between automatic metrics and human judgments.
- Open Question 4: How to further improve the realism and diversity of conversational recommender systems without increasing computational costs.
- Open Question 5: How to generate higher-quality conversational recommendation data without relying on predetermined target items to enhance the performance of conversational recommender systems.
Applications
Immediate Applications
Online Movie Recommendation
Generating more realistic dialogue data through the reference-free simulation framework to improve the performance of recommendation systems.
E-commerce Platforms
Applying to product recommendations, helping users discover new products without preset targets.
Customer Service
Applying reference-free simulation in customer service dialogues to improve the naturalness and user satisfaction of conversations.
Long-term Vision
Cross-Domain Conversational Recommendation
Applying the reference-free simulation framework to multiple domains such as music and travel, providing personalized recommendation services.
Intelligent Conversational Assistants
Developing intelligent conversational assistants capable of handling open-domain chit-chat and multi-intent utterances, enhancing the naturalness and diversity of human-computer interactions.
Abstract
Training conversational recommender systems (CRS) requires extensive dialogue data, which is challenging to collect at scale. To address this, researchers have used simulated user-recommender conversations. Traditional simulation approaches often utilize a single large language model (LLM) that generates entire conversations with prior knowledge of the target items, leading to scripted and artificial dialogues. We propose a reference-free simulation framework that trains two independent LLMs, one as the user and one as the conversational recommender. These models interact in real-time without access to predetermined target items, relying instead on preference summaries and target attributes, enabling the recommender to genuinely infer user preferences through dialogue. This approach produces more realistic and diverse conversations that closely mirror authentic human-AI interactions. Our reference-free simulators match or exceed existing methods in quality, while offering a scalable solution for generating high-quality conversational recommendation data without constraining conversations to pre-defined target items. We conduct both quantitative and human evaluations to confirm the effectiveness of our reference-free approach.
References (20)
Towards Unified Conversational Recommender Systems via Knowledge-Enhanced Prompt Learning
Xiaolei Wang, Kun Zhou, Ji-rong Wen et al.
Pearl: A Review-driven Persona-Knowledge Grounded Conversational Recommendation Dataset
Minjin Kim, Minju Kim, Hana Kim et al.
How Reliable is Your Simulator? Analysis on the Limitations of Current LLM-based User Simulators for Conversational Recommendation
Lixi Zhu, Xiaowen Huang, Jitao Sang
Multi-Objective Intrinsic Reward Learning for Conversational Recommender Systems
Zhendong Chu, Nan Wang, Hongning Wang
From Reviews to Dialogues: Active Synthesis for Zero-Shot LLM-based Conversational Recommender System
Rohan Surana, Junda Wu, Zhouhang Xie et al.
Recommendation as a Communication Game: Self-Supervised Bot-Play for Goal-oriented Dialogue
Dongyeop Kang, Anusha Balakrishnan, Pararth Shah et al.
A LLM-based Controllable, Scalable, Human-Involved User Simulator Framework for Conversational Recommender Systems
Lixi Zhu, Xiaowen Huang, Jitao Sang
BPR: Bayesian Personalized Ranking from Implicit Feedback
Steffen Rendle, Christoph Freudenthaler, Zeno Gantner et al.
PlatoLM: Teaching LLMs in Multi-Round Dialogue via a User Simulator
Chuyi Kong, Yaxin Fan, Xiang Wan et al.
Don't lie to your friends: Learning what you know from collaborative self-play
Jacob Eisenstein, Reza Aghajani, Adam Fisch et al.
Doing Personal LAPS: LLM-Augmented Dialogue Construction for Personalized Multi-Session Conversational Search
Hideaki Joko, Shubham Chatterjee, A. Ramsay et al.
Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models
Xiaolei Wang, Xinyu Tang, Wayne Xin Zhao et al.
LLM-REDIAL: A Large-Scale Dataset for Conversational Recommender Systems Created from User Behaviors with LLMs
Tingting Liang, Chenxin Jin, Lingzhi Wang et al.
Empowering Retrieval-based Conversational Recommendation with Contrasting User Preferences
Heejin Kook, Junyoung Kim, Seongmin Park et al.
ChatGPT as a Conversational Recommender System: A User-Centric Analysis
A. Manzoor, Samuel C. Ziegler, Klaus Maria. Pirker Garcia et al.
TinyBERT: Distilling BERT for Natural Language Understanding
Xiaoqi Jiao, Yichun Yin, Lifeng Shang et al.
A Conversation is Worth A Thousand Recommendations: A Survey of Holistic Conversational Recommender Systems
Chuang Li, Hengchang Hu, Yan Zhang et al.
Towards Deep Conversational Recommendations
Raymond Li, Samira Ebrahimi Kahou, Hannes Schulz et al.
FACE: A Fine-grained Reference Free Evaluator for Conversational Recommender Systems
Hideaki Joko, Faegheh Hasibi
BERTScore: Evaluating Text Generation with BERT
Tianyi Zhang, Varsha Kishore, Felix Wu et al.