LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation
LifeSim simulates user cognition via the Belief-Desire-Intention (BDI) model to enhance personalized assistant evaluation.
Key Findings
Methodology
LifeSim models user cognition through the Belief-Desire-Intention (BDI) model within physical environments to generate coherent life trajectories and simulate intention-driven user interactive behaviors. LifeSim-Eval is a comprehensive benchmark covering 8 life domains and 1,200 diverse scenarios, adopting a multi-turn interactive method to assess models' abilities to complete explicit and implicit intentions, recover user profiles, and produce high-quality responses.
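The paper does not release code, but the BDI structure it describes can be sketched in a few data types: beliefs (long-term profile plus situational context), desires (goals), and intentions derived from them, with each intention flagged as explicit or implicit. All names below are hypothetical illustrations, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Beliefs:
    long_term: dict    # stable profile facts, e.g. {"diet": "vegetarian"}
    short_term: dict   # situational context, e.g. {"weather": "rain"}

@dataclass
class Desires:
    goals: list        # e.g. ["book a table", "avoid spicy food"]

@dataclass
class Intention:
    description: str
    explicit: bool     # explicit = stated to the assistant; implicit = must be inferred

@dataclass
class UserState:
    beliefs: Beliefs
    desires: Desires
    intentions: list = field(default_factory=list)

def form_intentions(state: UserState) -> list:
    """Toy deliberation step: turn desires into intentions, marking as implicit
    any goal the user holds as an unstated preference in their long-term beliefs."""
    unstated = state.beliefs.long_term.get("unstated_preferences", [])
    state.intentions = [
        Intention(description=goal, explicit=goal not in unstated)
        for goal in state.desires.goals
    ]
    return state.intentions
```

In this reading, an evaluator can then check whether an assistant satisfies both the explicit intentions the simulated user states and the implicit ones it must infer from context.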
Key Results
- Experiments reveal significant limitations of current large language models in handling implicit intentions and long-term user preference modeling. Specifically, on the LifeSim-Eval benchmark, GPT-5 achieved 79.5% accuracy in explicit intention recognition but only 52.2% in implicit intention recognition.
- In long-horizon settings, while models maintain stable performance on explicit intentions, implicit intention completion rates significantly decline as conversation history grows.
- Simple profile memory offers limited benefits, indicating that effective personalization requires stable preference reasoning beyond simple retention.
Significance
The introduction of LifeSim and LifeSim-Eval provides a more realistic testing platform for evaluating personalized assistants. By simulating user cognition and life trajectories, this research fills the gap in existing benchmarks in capturing the complexity of external contexts and users' cognitive states. It not only advances academic research in personalized intelligence but also offers new evaluation tools for developing smarter AI assistants in the industry.
Technical Contribution
LifeSim introduces the BDI model in user simulation, combining an event engine to generate life trajectories and a user behavior engine to produce responses aligned with user cognition and external contexts. This approach contrasts sharply with existing static or short-context datasets, providing a high-fidelity long-horizon user-assistant interaction simulation framework.
Novelty
LifeSim is the first to integrate the BDI model with physical environments in user simulation, generating coherent life trajectories and intention-driven interactive behaviors. Compared to existing benchmarks, it offers higher fidelity in multi-scenario and long-horizon personalized evaluation.
Limitations
- Current models show significant limitations in handling implicit intentions and long-term user preference modeling, particularly as conversation history grows.
- LifeSim-Eval currently focuses on everyday life scenarios and does not yet cover high-stakes domains such as healthcare and legal consultation.
- LifeSim lacks multimodal user signals, simulating user behavior dynamics primarily through textual interactions.
Future Work
Future research directions include extending LifeSim-Eval to cover high-stakes domains, integrating multimodal information to enhance simulation realism, and developing more sophisticated user preference modeling methods to improve implicit intention recognition and long-term user modeling.
AI Executive Summary
With the rapid advancement of large language models (LLMs), the vision of universal AI assistants is becoming increasingly achievable. However, existing benchmarks for personalized assistants remain misaligned with real-world user-assistant interactions, failing to capture the complexity of external contexts and users' cognitive states. To bridge this gap, we propose LifeSim, a user simulator that models user cognition through the Belief-Desire-Intention (BDI) model within physical environments to generate coherent life trajectories and simulate intention-driven user interactive behaviors.
Based on LifeSim, we introduce LifeSim-Eval, a comprehensive benchmark covering 8 life domains and 1,200 diverse scenarios, adopting a multi-turn interactive method to assess models' abilities to complete explicit and implicit intentions, recover user profiles, and produce high-quality responses. Our experiments reveal that current large language models face significant limitations in handling implicit intentions and long-term user preference modeling.
LifeSim uses the BDI model combined with an event engine to generate life trajectories and a user behavior engine to produce responses aligned with user cognition and external contexts. This approach contrasts sharply with existing static or short-context datasets, providing a high-fidelity long-horizon user-assistant interaction simulation framework.
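The two-engine loop described above, where an event engine samples life events conditioned on beliefs and a behavior engine turns the resulting cognitive state into user utterances, can be sketched with simple rule-based stand-ins for the LLM components. Everything here (function names, candidate events, utterances) is a hypothetical illustration of the architecture, not the paper's implementation.

```python
import random

def event_engine(beliefs: dict, rng: random.Random) -> str:
    """Stand-in for the LLM-driven event engine: sample a plausible daily
    event conditioned on the user's long-term beliefs."""
    candidates = ["morning run", "team meeting", "grocery shopping"]
    if beliefs.get("has_dog"):
        candidates.append("walk the dog")
    return rng.choice(candidates)

def behavior_engine(event: str, beliefs: dict) -> str:
    """Stand-in for the user behavior engine: turn the current event and
    cognitive state into a user utterance for the assistant. Note the
    dietary preference stays implicit: it shapes the need but is unstated."""
    if event == "grocery shopping":
        return "Can you suggest a quick dinner recipe?"
    return f"Help me plan around my {event} today."

def simulate_day(beliefs: dict, rng: random.Random) -> tuple:
    """One step of the trajectory: event generation, then user behavior."""
    event = event_engine(beliefs, rng)
    return event, behavior_engine(event, beliefs)

rng = random.Random(0)  # seeded for a reproducible trajectory
trajectory = [simulate_day({"has_dog": True, "diet": "vegetarian"}, rng)
              for _ in range(3)]
```

The key design point this sketch preserves is that the assistant under evaluation only sees the utterances, while the simulator's beliefs (here, the vegetarian diet) remain hidden state that the assistant must recover over many turns.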
Experimental results show that while models perform well in explicit intention recognition, there is significant room for improvement in implicit intention recognition and long-term user preference modeling. Specifically, on the LifeSim-Eval benchmark, GPT-5 achieved 79.5% accuracy in explicit intention recognition but only 52.2% in implicit intention recognition.
The introduction of LifeSim and LifeSim-Eval provides a more realistic testing platform for evaluating personalized assistants. By simulating user cognition and life trajectories, this research fills the gap in existing benchmarks in capturing the complexity of external contexts and users' cognitive states. It not only advances academic research in personalized intelligence but also offers new evaluation tools for developing smarter AI assistants in the industry.
Future research directions include extending LifeSim-Eval to cover high-stakes domains, integrating multimodal information to enhance simulation realism, and developing more sophisticated user preference modeling methods to improve implicit intention recognition and long-term user modeling.
Deep Analysis
Background
In recent years, with the rapid development of large language models (LLMs), the vision of universal AI assistants is becoming increasingly achievable. Existing research has primarily focused on optimizing the model's ability to handle complex and knowledge-intensive tasks, as well as improving its social intelligence. However, there remains a clear gap between current evaluation frameworks and real-world scenarios, constraining advances in personalized intelligence. Ideal user-assistant interactions fundamentally differ from isolated question answering and involve complex external environments and users' cognitive states. User needs vary in terms of situational factors such as time, location, weather, and ongoing life events. User intentions arise from internal cognitive states, jointly shaped by evolving life experiences and relatively stable personalities and preferences. Real-world user data are constrained by privacy and ethical considerations, and publicly available interaction logs spanning multiple years and diverse scenarios remain extremely scarce. Therefore, establishing a realistic testbed with long-term user-assistant interactions at scale poses a fundamental problem.
Core Problem
Existing benchmarks for personalized assistants fail to capture the complexity of external contexts and users' cognitive states, leading to a misalignment with real-world user-assistant interactions. This mismatch constrains advances in personalized intelligence, as ideal user-assistant interactions involve complex external environments and users' cognitive states, rather than isolated question answering. User needs vary in terms of situational factors such as time, location, weather, and ongoing life events. User intentions arise from internal cognitive states, jointly shaped by evolving life experiences and relatively stable personalities and preferences.
Innovation
The core innovation of LifeSim lies in its use of the Belief-Desire-Intention (BDI) model to simulate user cognition within physical environments, generating coherent life trajectories and simulating intention-driven user interactive behaviors. Compared to existing static or short-context datasets, LifeSim provides a high-fidelity long-horizon user-assistant interaction simulation framework. LifeSim-Eval is a comprehensive benchmark covering 8 life domains and 1,200 diverse scenarios, adopting a multi-turn interactive method to assess models' abilities to complete explicit and implicit intentions, recover user profiles, and produce high-quality responses.
Methodology
- LifeSim models user cognition through the Belief-Desire-Intention (BDI) model within physical environments.
- An event engine generates life trajectories, and a user behavior engine produces responses aligned with user cognition and external contexts.
- LifeSim-Eval covers 8 life domains and 1,200 diverse scenarios, adopting a multi-turn interactive method to assess models' abilities to complete explicit and implicit intentions, recover user profiles, and produce high-quality responses.
- Experiments reveal significant limitations of current large language models in handling implicit intentions and long-term user preference modeling.
Experiments
The experimental design evaluates a range of open-source and proprietary models on the LifeSim-Eval benchmark, which covers 8 life domains and 1,200 diverse scenarios and adopts a multi-turn interactive method to assess models' abilities to complete explicit and implicit intentions, recover user profiles, and produce high-quality responses. Models are tested under both single-scenario and long-horizon settings.
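Separating completion rates by intention type, as the benchmark does, amounts to a simple grouped accuracy over judged dialogues. The record schema and function below are a minimal sketch under the assumption that each dialogue is labeled with its intention type and a completion verdict; they are not the authors' scoring code.

```python
def intention_scores(records: list) -> dict:
    """Compute completion rate separately for explicit and implicit intentions.

    records: list of dicts with keys 'type' ('explicit' | 'implicit') and
    'completed' (bool), as a judge might label each evaluated dialogue.
    Returns None for a type with no records rather than dividing by zero.
    """
    scores = {}
    for kind in ("explicit", "implicit"):
        subset = [r for r in records if r["type"] == kind]
        scores[kind] = (sum(r["completed"] for r in subset) / len(subset)
                        if subset else None)
    return scores

records = [
    {"type": "explicit", "completed": True},
    {"type": "explicit", "completed": True},
    {"type": "implicit", "completed": False},
    {"type": "implicit", "completed": True},
]
scores = intention_scores(records)  # {'explicit': 1.0, 'implicit': 0.5}
```

Bucketing the same records by conversation-history length would expose the long-horizon decline the results report for implicit intentions.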
Results
Experimental results show that while models perform well in explicit intention recognition, there is significant room for improvement in implicit intention recognition and long-term user preference modeling. Specifically, on the LifeSim-Eval benchmark, GPT-5 achieved 79.5% accuracy in explicit intention recognition but only 52.2% in implicit intention recognition. In long-horizon settings, while models maintain stable performance on explicit intentions, implicit intention completion rates significantly decline as conversation history grows.
Applications
LifeSim and LifeSim-Eval provide a more realistic testing platform for evaluating personalized assistants. By simulating user cognition and life trajectories, this research fills the gap in existing benchmarks in capturing the complexity of external contexts and users' cognitive states. It not only advances academic research in personalized intelligence but also offers new evaluation tools for developing smarter AI assistants in the industry.
Limitations & Outlook
Current models show significant limitations in handling implicit intentions and long-term user preference modeling, particularly as conversation history grows. LifeSim-Eval currently focuses on everyday life scenarios and does not yet cover high-stakes domains such as healthcare and legal consultation. The simulator also lacks multimodal user signals, modeling user behavior dynamics primarily through textual interactions.
Plain Language (Accessible to non-experts)
Imagine you have a virtual friend named LifeSim. This friend is incredibly smart and can predict your thoughts and needs by observing your behavior and environment. For example, when you're at home, it knows you might need some relaxing music, and when you're at work, it reminds you of important meetings. LifeSim is like a super-intelligent assistant that not only answers your questions but understands your life trajectory and preferences to provide personalized advice.
To achieve this, LifeSim uses something called the Belief-Desire-Intention (BDI) model. This is like its brain, helping it understand your thoughts and desires in different situations. By observing your behavior and environment, it can generate a coherent life trajectory and predict your possible needs and intentions.
LifeSim also uses something called an event engine to generate life trajectories. This is like its memory, helping it remember your past experiences and preferences. In this way, it can provide personalized advice in different life scenarios.
In short, LifeSim is like a super-intelligent friend that can predict your needs and intentions by observing your behavior and environment, providing personalized advice. Isn't that cool?
ELI14 (Explained like you're 14)
Hey there! Have you ever thought about what it would be like to have a super-smart assistant that understands your thoughts and needs? That's LifeSim, a super-intelligent virtual assistant!
LifeSim is like a friend who can read your mind. It guesses what you're thinking and what you want by watching your actions and the world around you. For instance, when you're at school, it knows you might need some study materials, and when you're at home, it reminds you to take a break.
To do this, LifeSim uses something called the Belief-Desire-Intention (BDI) model. This is like its brain, helping it understand your thoughts and desires in different situations. By observing your behavior and environment, it can generate a coherent life trajectory and predict your possible needs and intentions.
In short, LifeSim is like a super-intelligent friend that can predict your needs and intentions by observing your behavior and environment, providing personalized advice. Isn't that cool?
Glossary
Belief-Desire-Intention Model (BDI)
A psychological model used to simulate user cognition by describing the user's internal reasoning process through beliefs, desires, and intentions.
Used in LifeSim to generate user life trajectories and intention-driven interactive behaviors.
Large Language Model (LLM)
A model trained on vast amounts of data capable of generating natural language text and performing complex language tasks.
Used to generate event hypotheses and simulate user behavior.
User Behavior Engine
A component that generates responses aligned with user cognition and external contexts.
Used in LifeSim to simulate user interactive behaviors.
Event Engine
A component that generates life trajectories, guided by the BDI model to produce user life events.
Used in LifeSim to simulate user life trajectories.
Personalized Assistant
An intelligent assistant capable of providing personalized advice based on user preferences and needs.
LifeSim-Eval is used to evaluate the capabilities of personalized assistants.
Implicit Intention
Needs or desires not explicitly expressed by the user but inferred through context.
Used in LifeSim-Eval to assess models' ability to recognize and complete implicit intentions.
Long-Horizon
Involving user-assistant interactions over extended periods, considering long-term user preferences and history.
LifeSim provides a high-fidelity long-horizon user-assistant interaction simulation framework.
Multi-Turn Interaction
An interaction method involving multiple dialogue turns, allowing for more complex user-assistant exchanges.
LifeSim-Eval adopts a multi-turn interactive method to assess model capabilities.
User Profile
A collection of information about user demographics, personality traits, and long-term preferences.
Used to initialize the user's long-term belief state.
Event Hypothesis
Event predictions generated based on the user's long-term beliefs and recent life experiences.
Used to generate the user's short-term beliefs.
Open Questions (Unanswered questions from this research)
1. How can LifeSim be applied in high-stakes domains like healthcare and legal consultation? These areas require more rigorous domain knowledge and complex regulatory and ethical constraints.
2. How can multimodal information be integrated to enhance the realism of LifeSim simulations? Multimodal signals such as visual context or physiological data could provide richer insights into user intentions and emotional states.
3. How can implicit intention recognition and long-term user modeling be improved? Current models show significant limitations in handling implicit intentions and long-term user preference modeling.
4. How can real-world user data be collected without violating privacy? Privacy and ethical considerations constrain the acquisition of real-world user data.
5. How can more sophisticated user preference modeling be implemented in LifeSim? Simple profile memory offers limited benefits for personalization, indicating the need for more complex preference reasoning methods.
Applications
Immediate Applications
Personalized Recommendation Systems
Improve the personalization capabilities of recommendation systems by simulating user behavior and preferences with LifeSim, enhancing user satisfaction.
Smart Home Assistants
Optimize the automation control and personalized services of smart home devices by utilizing user life trajectories and intentions generated by LifeSim.
Personalized Learning in Education
Provide personalized learning suggestions and resources by simulating students' learning trajectories and preferences, enhancing learning outcomes.
Long-term Vision
Personalized Health Management in Healthcare
Improve patient health outcomes by providing personalized health management advice through simulating patients' health trajectories and preferences.
Intelligent Assistants in Legal Consultation
Enhance the efficiency and accuracy of legal services by providing personalized legal advice through simulating users' legal consultation needs and preferences.
Abstract
The rapid advancement of large language models (LLMs) has accelerated progress toward universal AI assistants. However, existing benchmarks for personalized assistants remain misaligned with real-world user-assistant interactions, failing to capture the complexity of external contexts and users' cognitive states. To bridge this gap, we propose LifeSim, a user simulator that models user cognition through the Belief-Desire-Intention (BDI) model within physical environments to generate coherent life trajectories, and simulates intention-driven user interactive behaviors. Based on LifeSim, we introduce LifeSim-Eval, a comprehensive benchmark for multi-scenario, long-horizon personalized assistance. LifeSim-Eval covers 8 life domains and 1,200 diverse scenarios, and adopts a multi-turn interactive method to assess models' abilities to complete explicit and implicit intentions, recover user profiles, and produce high-quality responses. Under both single-scenario and long-horizon settings, our experiments reveal that current LLMs face significant limitations in handling implicit intentions and long-term user preference modeling.