LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation
LifeSim simulates user cognition via the Belief-Desire-Intention (BDI) model to enhance personalized assistant evaluation.
Key Findings
Methodology
LifeSim models user cognition through the Belief-Desire-Intention (BDI) model within physical environments to generate coherent life trajectories and simulate intention-driven user interactive behaviors. LifeSim-Eval is a comprehensive benchmark covering 8 life domains and 1,200 diverse scenarios, adopting a multi-turn interactive method to assess models' abilities to complete explicit and implicit intentions, recover user profiles, and produce high-quality responses.
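The paper does not release code, but the BDI structure it describes can be sketched in a few data types: beliefs (long-term profile plus situational context), desires (goals), and intentions derived from them, with each intention flagged as explicit or implicit. All names below are hypothetical illustrations, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Beliefs:
    long_term: dict    # stable profile facts, e.g. {"diet": "vegetarian"}
    short_term: dict   # situational context, e.g. {"weather": "rain"}

@dataclass
class Desires:
    goals: list        # e.g. ["book a table", "avoid spicy food"]

@dataclass
class Intention:
    description: str
    explicit: bool     # explicit = stated to the assistant; implicit = must be inferred

@dataclass
class UserState:
    beliefs: Beliefs
    desires: Desires
    intentions: list = field(default_factory=list)

def form_intentions(state: UserState) -> list:
    """Toy deliberation step: turn desires into intentions, marking as implicit
    any goal the user holds as an unstated preference in their long-term beliefs."""
    unstated = state.beliefs.long_term.get("unstated_preferences", [])
    state.intentions = [
        Intention(description=goal, explicit=goal not in unstated)
        for goal in state.desires.goals
    ]
    return state.intentions
```

In this reading, an evaluator can then check whether an assistant satisfies both the explicit intentions the simulated user states and the implicit ones it must infer from context.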
Key Results
- Experiments reveal significant limitations of current large language models in handling implicit intentions and long-term user preference modeling. Specifically, on the LifeSim-Eval benchmark, GPT-5 achieved 79.5% accuracy in explicit intention recognition but only 52.2% in implicit intention recognition.
- In long-horizon settings, while models maintain stable performance on explicit intentions, implicit intention completion rates significantly decline as conversation history grows.
- Simple profile memory offers limited benefits, indicating that effective personalization requires stable preference reasoning beyond simple retention.
Significance
The introduction of LifeSim and LifeSim-Eval provides a more realistic testing platform for evaluating personalized assistants. By simulating user cognition and life trajectories, this research fills the gap in existing benchmarks in capturing the complexity of external contexts and users' cognitive states. It not only advances academic research in personalized intelligence but also offers new evaluation tools for developing smarter AI assistants in the industry.
Technical Contribution
LifeSim introduces the BDI model in user simulation, combining an event engine to generate life trajectories and a user behavior engine to produce responses aligned with user cognition and external contexts. This approach contrasts sharply with existing static or short-context datasets, providing a high-fidelity long-horizon user-assistant interaction simulation framework.
Novelty
LifeSim is the first to integrate the BDI model with physical environments in user simulation, generating coherent life trajectories and intention-driven interactive behaviors. Compared to existing benchmarks, it offers higher fidelity in multi-scenario and long-horizon personalized evaluation.
Limitations
- Current models show significant limitations in handling implicit intentions and long-term user preference modeling, particularly as conversation history grows.
- LifeSim-Eval currently focuses on everyday life scenarios and does not yet cover high-stakes domains such as healthcare and legal consultation.
- LifeSim lacks multimodal user signals, simulating user behavior dynamics primarily through textual interactions.
Future Work
Future research directions include extending LifeSim-Eval to cover high-stakes domains, integrating multimodal information to enhance simulation realism, and developing more sophisticated user preference modeling methods to improve implicit intention recognition and long-term user modeling.
AI Executive Summary
With the rapid advancement of large language models (LLMs), the vision of universal AI assistants is becoming increasingly achievable. However, existing benchmarks for personalized assistants remain misaligned with real-world user-assistant interactions, failing to capture the complexity of external contexts and users' cognitive states. To bridge this gap, we propose LifeSim, a user simulator that models user cognition through the Belief-Desire-Intention (BDI) model within physical environments to generate coherent life trajectories and simulate intention-driven user interactive behaviors.
Based on LifeSim, we introduce LifeSim-Eval, a comprehensive benchmark covering 8 life domains and 1,200 diverse scenarios, adopting a multi-turn interactive method to assess models' abilities to complete explicit and implicit intentions, recover user profiles, and produce high-quality responses. Our experiments reveal that current large language models face significant limitations in handling implicit intentions and long-term user preference modeling.
LifeSim uses the BDI model combined with an event engine to generate life trajectories and a user behavior engine to produce responses aligned with user cognition and external contexts. This approach contrasts sharply with existing static or short-context datasets, providing a high-fidelity long-horizon user-assistant interaction simulation framework.
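The two-engine loop described above, where an event engine samples life events conditioned on beliefs and a behavior engine turns the resulting cognitive state into user utterances, can be sketched with simple rule-based stand-ins for the LLM components. Everything here (function names, candidate events, utterances) is a hypothetical illustration of the architecture, not the paper's implementation.

```python
import random

def event_engine(beliefs: dict, rng: random.Random) -> str:
    """Stand-in for the LLM-driven event engine: sample a plausible daily
    event conditioned on the user's long-term beliefs."""
    candidates = ["morning run", "team meeting", "grocery shopping"]
    if beliefs.get("has_dog"):
        candidates.append("walk the dog")
    return rng.choice(candidates)

def behavior_engine(event: str, beliefs: dict) -> str:
    """Stand-in for the user behavior engine: turn the current event and
    cognitive state into a user utterance for the assistant. Note the
    dietary preference stays implicit: it shapes the need but is unstated."""
    if event == "grocery shopping":
        return "Can you suggest a quick dinner recipe?"
    return f"Help me plan around my {event} today."

def simulate_day(beliefs: dict, rng: random.Random) -> tuple:
    """One step of the trajectory: event generation, then user behavior."""
    event = event_engine(beliefs, rng)
    return event, behavior_engine(event, beliefs)

rng = random.Random(0)  # seeded for a reproducible trajectory
trajectory = [simulate_day({"has_dog": True, "diet": "vegetarian"}, rng)
              for _ in range(3)]
```

The key design point this sketch preserves is that the assistant under evaluation only sees the utterances, while the simulator's beliefs (here, the vegetarian diet) remain hidden state that the assistant must recover over many turns.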
Experimental results show that while models perform well in explicit intention recognition, there is significant room for improvement in implicit intention recognition and long-term user preference modeling. Specifically, on the LifeSim-Eval benchmark, GPT-5 achieved 79.5% accuracy in explicit intention recognition but only 52.2% in implicit intention recognition.
The introduction of LifeSim and LifeSim-Eval provides a more realistic testing platform for evaluating personalized assistants. By simulating user cognition and life trajectories, this research fills the gap in existing benchmarks in capturing the complexity of external contexts and users' cognitive states. It not only advances academic research in personalized intelligence but also offers new evaluation tools for developing smarter AI assistants in the industry.
Future research directions include extending LifeSim-Eval to cover high-stakes domains, integrating multimodal information to enhance simulation realism, and developing more sophisticated user preference modeling methods to improve implicit intention recognition and long-term user modeling.
Deep Analysis
Background
In recent years, with the rapid development of large language models (LLMs), the vision of universal AI assistants is becoming increasingly achievable. Existing research has primarily focused on optimizing the model's ability to handle complex and knowledge-intensive tasks, as well as improving its social intelligence. However, there remains a clear gap between current evaluation frameworks and real-world scenarios, constraining advances in personalized intelligence. Ideal user-assistant interactions fundamentally differ from isolated question answering and involve complex external environments and users' cognitive states. User needs vary in terms of situational factors such as time, location, weather, and ongoing life events. User intentions arise from internal cognitive states, jointly shaped by evolving life experiences and relatively stable personalities and preferences. Real-world user data are constrained by privacy and ethical considerations, and publicly available interaction logs spanning multiple years and diverse scenarios remain extremely scarce. Therefore, establishing a realistic testbed with long-term user-assistant interactions at scale poses a fundamental problem.
Core Problem
Existing benchmarks for personalized assistants fail to capture the complexity of external contexts and users' cognitive states, leading to a misalignment with real-world user-assistant interactions. This mismatch constrains advances in personalized intelligence, as ideal user-assistant interactions involve complex external environments and users' cognitive states, rather than isolated question answering. User needs vary in terms of situational factors such as time, location, weather, and ongoing life events. User intentions arise from internal cognitive states, jointly shaped by evolving life experiences and relatively stable personalities and preferences.
Innovation
The core innovation of LifeSim lies in its use of the Belief-Desire-Intention (BDI) model to simulate user cognition within physical environments, generating coherent life trajectories and simulating intention-driven user interactive behaviors. Compared to existing static or short-context datasets, LifeSim provides a high-fidelity long-horizon user-assistant interaction simulation framework. LifeSim-Eval is a comprehensive benchmark covering 8 life domains and 1,200 diverse scenarios, adopting a multi-turn interactive method to assess models' abilities to complete explicit and implicit intentions, recover user profiles, and produce high-quality responses.
Methodology
- LifeSim models user cognition through the Belief-Desire-Intention (BDI) model within physical environments.
- An event engine generates life trajectories, and a user behavior engine produces responses aligned with user cognition and external contexts.
- LifeSim-Eval covers 8 life domains and 1,200 diverse scenarios, adopting a multi-turn interactive method to assess models' abilities to complete explicit and implicit intentions, recover user profiles, and produce high-quality responses.
- Experiments reveal significant limitations of current large language models in handling implicit intentions and long-term user preference modeling.
Experiments
The experimental design evaluates a range of open-source and proprietary models on the LifeSim-Eval benchmark, which covers 8 life domains and 1,200 diverse scenarios and adopts a multi-turn interactive method to assess models' abilities to complete explicit and implicit intentions, recover user profiles, and produce high-quality responses. Models are tested under both single-scenario and long-horizon settings.
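Separating completion rates by intention type, as the benchmark does, amounts to a simple grouped accuracy over judged dialogues. The record schema and function below are a minimal sketch under the assumption that each dialogue is labeled with its intention type and a completion verdict; they are not the authors' scoring code.

```python
def intention_scores(records: list) -> dict:
    """Compute completion rate separately for explicit and implicit intentions.

    records: list of dicts with keys 'type' ('explicit' | 'implicit') and
    'completed' (bool), as a judge might label each evaluated dialogue.
    Returns None for a type with no records rather than dividing by zero.
    """
    scores = {}
    for kind in ("explicit", "implicit"):
        subset = [r for r in records if r["type"] == kind]
        scores[kind] = (sum(r["completed"] for r in subset) / len(subset)
                        if subset else None)
    return scores

records = [
    {"type": "explicit", "completed": True},
    {"type": "explicit", "completed": True},
    {"type": "implicit", "completed": False},
    {"type": "implicit", "completed": True},
]
scores = intention_scores(records)  # {'explicit': 1.0, 'implicit': 0.5}
```

Bucketing the same records by conversation-history length would expose the long-horizon decline the results report for implicit intentions.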
Results
Experimental results show that while models perform well in explicit intention recognition, there is significant room for improvement in implicit intention recognition and long-term user preference modeling. Specifically, on the LifeSim-Eval benchmark, GPT-5 achieved 79.5% accuracy in explicit intention recognition but only 52.2% in implicit intention recognition. In long-horizon settings, while models maintain stable performance on explicit intentions, implicit intention completion rates significantly decline as conversation history grows.
Applications
LifeSim and LifeSim-Eval provide a more realistic testing platform for evaluating personalized assistants. By simulating user cognition and life trajectories, this research fills the gap in existing benchmarks in capturing the complexity of external contexts and users' cognitive states. It not only advances academic research in personalized intelligence but also offers new evaluation tools for developing smarter AI assistants in the industry.
Limitations & Outlook
Current models show significant limitations in handling implicit intentions and long-term user preference modeling, particularly as conversation history grows. LifeSim-Eval currently focuses on everyday life scenarios and does not yet cover high-stakes domains such as healthcare and legal consultation. The simulator also lacks multimodal user signals, modeling user behavior dynamics primarily through textual interactions.
Plain Language (Accessible to non-experts)
Imagine you have a virtual friend named LifeSim. This friend is incredibly smart and can predict your thoughts and needs by observing your behavior and environment. For example, when you're at home, it knows you might need some relaxing music, and when you're at work, it reminds you of important meetings. LifeSim is like a super-intelligent assistant that not only answers your questions but understands your life trajectory and preferences to provide personalized advice.
To achieve this, LifeSim uses something called the Belief-Desire-Intention (BDI) model. This is like its brain, helping it understand your thoughts and desires in different situations. By observing your behavior and environment, it can generate a coherent life trajectory and predict your possible needs and intentions.
LifeSim also uses something called an event engine to generate life trajectories. This is like its memory, helping it remember your past experiences and preferences. In this way, it can provide personalized advice in different life scenarios.
In short, LifeSim is like a super-intelligent friend that can predict your needs and intentions by observing your behavior and environment, providing personalized advice. Isn't that cool?
ELI14 (Explained like you're 14)
Hey there! Have you ever thought about what it would be like to have a super-smart assistant that understands your thoughts and needs? That's LifeSim, a super-intelligent virtual assistant!
LifeSim is like a friend who can read your mind. It guesses what you're thinking and what you want by watching your actions and the world around you. For instance, when you're at school, it knows you might need some study materials, and when you're at home, it reminds you to take a break.
To do this, LifeSim uses something called the Belief-Desire-Intention (BDI) model. This is like its brain, helping it understand your thoughts and desires in different situations. By observing your behavior and environment, it can generate a coherent life trajectory and predict your possible needs and intentions.
In short, LifeSim is like a super-intelligent friend that can predict your needs and intentions by observing your behavior and environment, providing personalized advice. Isn't that cool?
Glossary
Belief-Desire-Intention Model (BDI)
A psychological model used to simulate user cognition by describing the user's internal reasoning process through beliefs, desires, and intentions.
Used in LifeSim to generate user life trajectories and intention-driven interactive behaviors.
Large Language Model (LLM)
A model trained on vast amounts of data capable of generating natural language text and performing complex language tasks.
Used to generate event hypotheses and simulate user behavior.
User Behavior Engine
A component that generates responses aligned with user cognition and external contexts.
Used in LifeSim to simulate user interactive behaviors.
Event Engine
A component that generates life trajectories, guided by the BDI model to produce user life events.
Used in LifeSim to simulate user life trajectories.
Personalized Assistant
An intelligent assistant capable of providing personalized advice based on user preferences and needs.
LifeSim-Eval is used to evaluate the capabilities of personalized assistants.
Implicit Intention
Needs or desires not explicitly expressed by the user but inferred through context.
Used in LifeSim-Eval to assess models' ability to recognize and complete implicit intentions.
Long-Horizon
Involving user-assistant interactions over extended periods, considering long-term user preferences and history.
LifeSim provides a high-fidelity long-horizon user-assistant interaction simulation framework.
Multi-Turn Interaction
An interaction method involving multiple dialogue turns, allowing for more complex user-assistant exchanges.
LifeSim-Eval adopts a multi-turn interactive method to assess model capabilities.
User Profile
A collection of information about user demographics, personality traits, and long-term preferences.
Used to initialize the user's long-term belief state.
Event Hypothesis
Event predictions generated based on the user's long-term beliefs and recent life experiences.
Used to generate the user's short-term beliefs.
Open Questions (Unanswered questions from this research)
1. How can LifeSim be applied in high-stakes domains like healthcare and legal consultation? These areas require more rigorous domain knowledge and complex regulatory and ethical constraints.
2. How can multimodal information be integrated to enhance the realism of LifeSim simulations? Multimodal signals such as visual context or physiological data could provide richer insights into user intentions and emotional states.
3. How can implicit intention recognition and long-term user modeling be improved? Current models show significant limitations in handling implicit intentions and long-term user preference modeling.
4. How can real-world user data be collected without violating privacy? Privacy and ethical considerations constrain the acquisition of real-world user data.
5. How can more sophisticated user preference modeling be implemented in LifeSim? Simple profile memory offers limited benefits for personalization, indicating the need for more complex preference reasoning methods.
Applications
Immediate Applications
Personalized Recommendation Systems
Improve the personalization capabilities of recommendation systems by simulating user behavior and preferences with LifeSim, enhancing user satisfaction.
Smart Home Assistants
Optimize the automation control and personalized services of smart home devices by utilizing user life trajectories and intentions generated by LifeSim.
Personalized Learning in Education
Provide personalized learning suggestions and resources by simulating students' learning trajectories and preferences, enhancing learning outcomes.
Long-term Vision
Personalized Health Management in Healthcare
Improve patient health outcomes by providing personalized health management advice through simulating patients' health trajectories and preferences.
Intelligent Assistants in Legal Consultation
Enhance the efficiency and accuracy of legal services by providing personalized legal advice through simulating users' legal consultation needs and preferences.
Abstract
The rapid advancement of large language models (LLMs) has accelerated progress toward universal AI assistants. However, existing benchmarks for personalized assistants remain misaligned with real-world user-assistant interactions, failing to capture the complexity of external contexts and users' cognitive states. To bridge this gap, we propose LifeSim, a user simulator that models user cognition through the Belief-Desire-Intention (BDI) model within physical environments to generate coherent life trajectories, and simulates intention-driven user interactive behaviors. Based on LifeSim, we introduce LifeSim-Eval, a comprehensive benchmark for multi-scenario, long-horizon personalized assistance. LifeSim-Eval covers 8 life domains and 1,200 diverse scenarios, and adopts a multi-turn interactive method to assess models' abilities to complete explicit and implicit intentions, recover user profiles, and produce high-quality responses. Under both single-scenario and long-horizon settings, our experiments reveal that current LLMs face significant limitations in handling implicit intentions and long-term user preference modeling.