Do Generative Recommenders Deepen the Information Cocoon? A Closed-Loop Simulation with LLM-powered User Simulators

TL;DR

This study introduces RecLoop, a closed-loop simulation framework with LLM-driven user simulators, and uses it to compare generative and traditional recommenders. Generative models preserve exposure diversity better, but feedback loops still concentrate their generated code space.

cs.IR · Advanced · 2026-06-16

Jiyuan Yang Gengxin Sun Mengqi Zhang Lingjie Wang Yuanzi Li Hongxi Cui Xin Xin Pengjie Ren

Generative Recommendation Information Cocoon Closed-Loop Simulation Large-Scale Experiments Model Capacity Tokenization

Key Findings

Methodology

This paper develops RecLoop, a comprehensive closed-loop simulation environment integrating large-scale LLM-driven user simulators. The framework models long-term recommendation-user interactions through multiple feedback cycles, involving recommendation models (both generative SID-sequence based and traditional ID-based), user simulators maintaining dynamic preferences, and periodic retraining of models. The user simulators leverage pre-trained language models (e.g., GPT-3) to emulate user preferences, short-term behaviors, and long-term memory, updating their states after each interaction. The evaluation employs exposure-level metrics—such as exposure range narrowing, user homogenization, and system concentration—and introduces the Code-Space Structural Cocoon metric to quantify content concentration within the model's generated code space. Experiments compare two generative recommenders and two baseline models across Amazon datasets (Office Products and Toys & Games), analyzing effects over 15 feedback cycles with model retraining, and assess the influence of tokenization strategies (collaborative signal vs. semantic) and model size (from millions to billions of parameters).
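The feedback cycle described above (recommend, simulate a user response, log it, retrain) can be sketched in a few lines. This is only a toy illustration of the loop structure, not RecLoop itself: the popularity-count recommender, the fixed-preference sampler standing in for the LLM user simulator, and every function name and parameter value below are assumptions made for the example.

```python
import random
from collections import Counter

def train_popularity_model(interactions):
    """Stand-in recommender: click counts per item (hypothetical, not a RecLoop model)."""
    return Counter(item for _, item in interactions)

def recommend(model, catalog, k, rng):
    """Rank items by learned popularity; shuffling first breaks ties randomly."""
    shuffled = catalog[:]
    rng.shuffle(shuffled)
    return sorted(shuffled, key=lambda item: -model[item])[:k]

def simulate_click(user_prefs, slate, rng):
    """Stand-in for the LLM user simulator: click in proportion to a fixed preference."""
    weights = [user_prefs[item] for item in slate]
    return rng.choices(slate, weights=weights, k=1)[0]

def run_closed_loop(num_users=50, num_items=100, cycles=15, k=10, seed=0):
    rng = random.Random(seed)
    catalog = list(range(num_items))
    prefs = [{i: rng.random() + 0.01 for i in catalog} for _ in range(num_users)]
    interactions = []          # accumulated (user, item) feedback
    coverage_per_cycle = []    # fraction of the catalog exposed in each cycle
    for _ in range(cycles):
        model = train_popularity_model(interactions)  # retrain once per cycle
        exposed = set()
        for user in range(num_users):
            slate = recommend(model, catalog, k, rng)
            exposed.update(slate)
            interactions.append((user, simulate_click(prefs[user], slate, rng)))
        coverage_per_cycle.append(len(exposed) / num_items)
    return coverage_per_cycle

coverage = run_closed_loop()
print([round(c, 2) for c in coverage])
```

Because the toy recommender only ever re-ranks items that were already clicked, exposure coverage collapses after the first cycle, which is the kind of self-reinforcing narrowing the exposure-level metrics are meant to capture.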

Key Results

Generative recommenders demonstrate superior content diversity over multiple feedback cycles, with less exposure range reduction (e.g., 15% decrease in Toys dataset versus 30% in ID-based models). The overall content distribution remains more uniform, indicating better long-term diversity preservation.
Model scale significantly influences diversity retention: larger models (over 1 billion parameters) maintain a broader code space, losing less than 10% of code diversity after multiple cycles, whereas smaller models lose more than 25%. Tokenization strategy also matters: collaborative-signal tokenization tends to induce stronger cocoon effects than semantic tokenization.
Despite better exposure diversity, feedback loops still induce content concentration within the generated code space, especially with simpler tokenization or smaller models. This highlights that cocoon formation is affected by both recommendation behavior and internal encoding strategies.
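To make the exposure-level quantities concrete, the sketch below computes an exposure range (share of the catalog shown at least once), its relative narrowing between two cycles, and a Gini coefficient as one possible measure of system concentration. The exact metric definitions used in the paper are not given in this summary, so these formulas and function names are assumptions; items are assumed to be integer IDs from 0 to `catalog_size - 1`.

```python
from collections import Counter

def exposure_range(slates, catalog_size):
    """Fraction of the catalog that appears in at least one slate."""
    exposed = {item for slate in slates for item in slate}
    return len(exposed) / catalog_size

def relative_narrowing(first, last):
    """Relative drop in exposure range between an early and a late cycle."""
    return (first - last) / first

def gini(slates, catalog_size):
    """Gini coefficient of exposure counts over the whole catalog (0 = uniform)."""
    counts = Counter(item for slate in slates for item in slate)
    values = sorted(counts.get(i, 0) for i in range(catalog_size))
    total = sum(values)
    if total == 0:
        return 0.0
    n = len(values)
    weighted = sum((rank + 1) * v for rank, v in enumerate(values))
    return (2 * weighted) / (n * total) - (n + 1) / n

# Hypothetical slates for three users over a six-item catalog.
early = [[0, 1], [2, 3], [4, 5]]
late = [[0, 1], [0, 1], [0, 2]]

early_range = exposure_range(early, 6)
late_range = exposure_range(late, 6)
print(early_range, late_range, relative_narrowing(early_range, late_range))
print(gini(early, 6), gini(late, 6))
```

A reported figure such as "15% decrease" corresponds to `relative_narrowing` returning 0.15 under this reading.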

Significance

This work provides critical insights into the long-term behavior of generative recommenders, revealing that they can better sustain content diversity than traditional models, yet remain susceptible to cocoon effects driven by encoding strategies and model capacity. The introduction of the Code-Space Structural Cocoon metric offers a new perspective for analyzing content concentration at the representation level, informing future design choices to mitigate bias and promote diversity. The findings have implications for deploying safer, fairer, and more inclusive recommendation systems in industry, addressing long-standing issues of content homogenization and echo chambers.

Technical Contribution

The paper's primary technical contribution is the design of RecLoop, a scalable, flexible simulation framework that captures the complex dynamics of recommendation-feedback loops over extended periods. The integration of large language models as user simulators allows for realistic, adaptive user behavior modeling. The novel Code-Space Structural Cocoon metric quantifies content concentration within the model's internal representation space, providing a new dimension for evaluating diversity beyond traditional exposure metrics. The systematic analysis of tokenization strategies and model size effects advances understanding of how internal encoding impacts long-term content diversity, offering actionable insights for model architecture and training strategies.
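The summary does not give the formula for the Code-Space Structural Cocoon metric. As a rough illustration of what "concentration within the generated code space" could mean, the sketch below scores how unevenly generated SID prefixes are distributed, relative to the prefixes available in the catalog. The entropy-based definition, the function name, and the toy SIDs are all assumptions, not the paper's metric.

```python
import math
from collections import Counter

def code_space_concentration(generated_sids, catalog_sids, depth):
    """1 - normalized entropy of generated SID prefixes of length `depth`.

    0 means generations spread evenly over every catalog prefix;
    1 means every generation shares a single prefix.
    """
    if not generated_sids:
        raise ValueError("generated_sids must not be empty")
    catalog_prefixes = {tuple(sid[:depth]) for sid in catalog_sids}
    if len(catalog_prefixes) <= 1:
        return 0.0  # only one possible prefix, so nothing to concentrate on
    counts = Counter(tuple(sid[:depth]) for sid in generated_sids)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return 1.0 - entropy / math.log(len(catalog_prefixes))

# Hypothetical two-level SIDs with two codes per level.
catalog = [(0, 0), (0, 1), (1, 0), (1, 1)]
early_generated = [(0, 0), (0, 1), (1, 0), (1, 1)]
late_generated = [(0, 0), (0, 0), (0, 0), (0, 1)]

for depth in (1, 2):
    print(depth,
          code_space_concentration(early_generated, catalog, depth),
          code_space_concentration(late_generated, catalog, depth))
```

Measuring per prefix depth separates collapse at the coarse level (all generations under one first code) from collapse among fine-grained codes, which an item-level exposure metric would not distinguish.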

Novelty

This research is pioneering in applying large-scale LLM-driven user simulators within a closed-loop recommendation environment to study long-term cocoon effects. The introduction of the Code-Space Structural Cocoon metric is a novel approach to measure content concentration at the model's internal representation level, bridging the gap between exposure-based and model-internal diversity assessments. Unlike prior work focusing solely on short-term accuracy or popularity bias, this study emphasizes the importance of internal generative space dynamics and their influence on content diversity over multiple feedback cycles, marking a significant step forward in recommendation system research.

Limitations

The reliance on simulated user behavior, although based on advanced LLMs, may not fully capture the complexity and variability of real user preferences, potentially limiting the ecological validity of the findings.
The computational costs associated with training large models and implementing sophisticated tokenization strategies pose practical challenges for real-world deployment, especially at scale.
The experiments are confined to two Amazon datasets, which may not encompass the full spectrum of recommendation scenarios, necessitating further validation across diverse domains and real user data.

Future Work

Future research should explore multi-modal recommendation settings incorporating images, videos, and text to assess cocoon effects in richer content environments. Developing adaptive tokenization strategies that dynamically balance diversity and relevance could further mitigate content concentration. Additionally, integrating user feedback mechanisms that explicitly promote diversity, along with fairness-aware training objectives, may enhance long-term content variety. Extending evaluations to real-world systems and diverse datasets will be crucial for translating these insights into industry practice.

AI Executive Summary

In an era overwhelmed by information, recommendation systems serve as vital gatekeepers, guiding users through vast content landscapes. However, this guiding role can inadvertently lead to the formation of 'information cocoons'—self-reinforcing loops where users are repeatedly exposed to similar content, narrowing their horizons and reinforcing existing preferences. Traditional recommendation models, especially those based on atomic item IDs, have long been scrutinized for their propensity to foster such content homogeneity. Yet, with the advent of generative recommendation systems that encode items as discrete code sequences (SID sequences) and generate recommendations autoregressively, a new question arises: do these models deepen or mitigate the cocoon phenomenon?

Addressing this, the authors introduce RecLoop, a sophisticated closed-loop simulation framework that leverages large language models (LLMs) to emulate user behavior over extended feedback cycles. This framework enables a detailed examination of how generative recommenders influence content diversity over time. By simulating 15 feedback cycles on two Amazon datasets—Office Products and Toys & Games—with thousands of virtual users, the study offers a comprehensive analysis of long-term cocoon dynamics. The user simulators maintain dynamic preferences, updating their states as they interact with recommendations, thus capturing realistic preference evolution.

A key innovation is the introduction of the Code-Space Structural Cocoon metric, which quantifies how concentrated the generated code representations become across feedback cycles. This allows the authors to assess whether the internal generative space itself narrows over time, complementing traditional exposure-level metrics. The experimental results reveal that generative recommenders generally outperform traditional ID-based models in maintaining exposure diversity, thereby slowing the homogenization process across users. Larger models, with more parameters, further enhance diversity preservation, indicating that model capacity plays a crucial role.

However, the study also uncovers that feedback loops can induce concentration within the generated code space, especially under certain tokenization strategies. Collaborative-signal tokenization tends to produce stronger cocoon effects than semantic tokenization, highlighting the importance of encoding choices. These findings suggest that the design of tokenization strategies and the scale of models are critical factors influencing content diversity in generative recommendation systems.
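For readers unfamiliar with how an item becomes a code sequence, one technique commonly used in the generative-recommendation literature is residual quantization of an item embedding. Whether the recommenders studied here use exactly this scheme is not stated in this summary, so treat the sketch as an assumption; the codebooks and vectors are made up. Under this reading, the tokenization strategies differ in where the embedding comes from (item content for semantic tokenization, interaction data for collaborative-signal tokenization), while the quantization step is the same.

```python
def nearest(codebook, vector):
    """Index of the codeword closest to `vector` in squared Euclidean distance."""
    def dist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def residual_quantize(vector, codebooks):
    """Map an embedding to a tuple of code indices, one per codebook level."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # The next level quantizes whatever this level failed to explain.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(codes)

# Hypothetical 2-level codebooks over 2-d embeddings.
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],    # coarse level
    [(0.0, 0.0), (0.2, -0.2)],   # fine level, applied to the residual
]

for embedding in [(1.2, 0.8), (0.9, 1.0), (0.1, 0.0)]:
    print(embedding, residual_quantize(embedding, codebooks))
```

Items with nearby embeddings share leading codes, so a model that over-generates a few prefixes concentrates exposure on whole neighborhoods of items at once.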

Overall, this research advances our understanding of the long-term behaviors of generative recommenders, demonstrating their potential to better preserve content diversity compared to traditional models. It also emphasizes the importance of internal model representations and encoding strategies in shaping cocoon effects. The insights gained pave the way for future innovations aimed at designing recommendation systems that balance personalization with content variety, ultimately fostering healthier, more inclusive information ecosystems. Despite some limitations—such as reliance on simulated users and dataset scope—this work provides a foundational step toward more transparent and controllable generative recommendation architectures.

Deep Analysis

Background

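With the explosive growth of online information, recommender systems have become a core tool for guiding what users consume. Traditional ID-based models (such as matrix factorization and deep neural networks) perform well for short-term recommendation, but over long feedback horizons they tend to produce "information cocoons": users keep encountering similar content, which narrows their interests and reduces content diversity. In recent years, generative recommendation models (such as Transformer architectures built on SID sequences) have gained prominence; they generate recommendations from discrete code sequences in an attempt to overcome the limits of traditional models. Existing studies mostly focus on short-term effects and lack analysis of how these models behave under long-term feedback. Content diversity is regarded as an important factor in improving user experience and reducing bias, but achieving it in generative models remains a challenge. The research community widely holds that content diversity depends not only on model accuracy but also on encoding strategy, model capacity, and training method.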

Core Problem

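The core question is whether generative recommendation aggravates or alleviates the information cocoon. Traditional models, by optimizing metrics such as click-through rate, may unintentionally reinforce popular content and push exposure toward uniformity. Generative models produce recommendations through a discrete code space; although this may increase diversity in the short term, over multiple feedback rounds the generated space may gradually converge onto a few classes of codes, forming a new kind of content concentration. Quantifying and understanding this long-term behavior is the main difficulty. In particular, the mechanism by which content diversity changes under different tokenization strategies (collaborative-signal versus semantic) and model scales (millions to billions of parameters) is still unclear. A systematic simulation environment is needed to simulate long-horizon feedback, quantify content concentration, and expose the potential risks of generative models.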

Innovation

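The main innovations of the paper are: 1) the RecLoop closed-loop simulation framework, which uses large-scale LLM user simulators to systematically simulate long-horizon dynamic interaction between recommender and users; 2) the Code-Space Structural Cocoon metric, which quantifies content concentration from the perspective of the model's generated space and covers what traditional exposure metrics miss; 3) a systematic analysis of how tokenization strategy (collaborative-signal versus semantic) and model scale affect content diversity, providing a theoretical basis for optimizing generative recommendation. Together these deepen the understanding of generative model behavior and offer new tools for designing recommender systems with better diversity.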

Methodology

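Build the RecLoop closed-loop simulation environment, combining recommendation models, user simulators, a data update mechanism, and a model retraining stage.
The user simulators are based on large pre-trained language models (e.g., GPT-3) and maintain user preferences, short-term behavior, and long-term memory, updating the preference state dynamically.
The recommendation models comprise two generative models (SID-sequence generation) and two traditional ID-based models (such as matrix factorization and deep neural networks), run through multiple feedback rounds on two Amazon datasets (Office Products and Toys & Games).
In each round, the recommendation model generates an exposure list from the user's history, the user simulator chooses items according to its preferences, and the user's behavior sequence is updated.
Exposure-level metrics (such as exposure range, cross-user homogenization, and system concentration) are used to evaluate content diversity.
The Code-Space Structural Cocoon metric is introduced to measure how concentrated the generated code space is.
Comparisons are made across tokenization strategies (collaborative-signal versus semantic tokenization) and model sizes (millions to billions of parameters).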
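Of the exposure-level metrics listed above, cross-user homogenization is the least self-explanatory. One plausible way to measure it, shown here as an assumption rather than the paper's definition, is the mean pairwise Jaccard overlap between the sets of items exposed to different users; the function names and toy data are illustrative.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two collections of item IDs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def user_homogenization(exposures):
    """Mean pairwise Jaccard overlap between users' exposed item sets."""
    pairs = list(combinations(exposures.values(), 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical exposure lists for three users in an early and a late cycle.
early = {"u1": [1, 2, 3], "u2": [4, 5, 6], "u3": [7, 8, 9]}
late = {"u1": [1, 2, 3], "u2": [1, 2, 4], "u3": [1, 2, 3]}

print(user_homogenization(early), user_homogenization(late))
```

A value that rises across feedback cycles means different users are being shown increasingly similar items, even if each individual slate still looks varied.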

Experiments

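The experiments run on two Amazon datasets, with 5,000 Office Products users and 20,000 Toys & Games users, and simulate 15 feedback cycles. The models are retrained after every cycle on the user behavior produced in the previous one. The metrics are exposure range narrowing, cross-user content homogenization, system concentration, and Code-Space Structural Cocoon. The comparison covers model type (generative versus ID-based), tokenization strategy, and model scale. All models are trained in the same hardware environment to keep the comparison fair.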

Results

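Generative recommenders show better content diversity over multiple feedback rounds, with a smaller drop in exposure range (for example, on the Toys dataset the generative models drop by 15% versus 30% for ID-based models), and the overall content distribution is more uniform. The larger the model, the better the diversity of the generated space is preserved (above 1 billion parameters, code-space diversity falls by less than 10%, whereas smaller models fall by more than 25%). On tokenization, collaborative-signal tokenization leads to more pronounced content concentration, while semantic tokenization partly mitigates the cocoon effect. These results indicate that model capacity and encoding strategy play a key role in controlling content diversity.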

Applications

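The study gives content recommendation platforms a theoretical basis and practical guidance for improving content diversity. It applies to e-commerce, content platforms, social media, and similar settings, where adjusting the tokenization strategy and model scale can slow content homogenization and improve user experience. Future systems could also add mechanisms that adjust to individual preferences, balancing diversity against recommendation accuracy dynamically. In the long run, the findings help build a fair and diverse recommendation ecosystem, reduce the spread of bias, and support the healthy development of recommendation technology.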

Limitations & Outlook

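The study relies mainly on simulated user behavior. Although the LLM simulators are fairly realistic, they may still deviate from real user behavior, which limits how far the conclusions generalize. Optimizing model scale and tokenization strategy is costly, and practical deployments may be constrained by compute resources. In addition, the experiments are validated on only two Amazon datasets and lack validation across scenarios and industries; future work needs to extend to more complex real-world environments. Model behavior for users with extreme preferences or cold-start users has not been fully evaluated, which is a further limitation.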

Abstract

Recommender systems alleviate information overload, yet repeated feedback between recommendations and user interactions can reinforce existing preferences and narrow users' exposure, forming information cocoons. While this phenomenon has been widely studied in traditional sequential recommendation, its impact on generative recommendation remains unclear. By replacing atomic item IDs with Semantic ID (SID) sequences, generative recommenders introduce a different recommendation mechanism whose role in information cocoon formation is not yet understood. To investigate whether generative recommenders deepen information cocoons, we propose RecLoop, a closed-loop simulation framework with LLM-driven user agents. We compare two generative recommenders and two traditional sequential baselines on two Amazon datasets across multiple feedback cycles. In addition to standard exposure-level metrics, we introduce Code-Space Structural Cocoon, a model-level metric that measures concentration in the generated SID space. Experimental results show that generative recommenders are generally less prone to exposure-level cocoon formation than traditional baselines, preserving broader exposure diversity and slowing cross-user homogenization. However, feedback loops can still induce concentration within the generated SID space. We further find that cocoon severity depends strongly on tokenization strategy and model scale: collaborative-signal tokenization produces stronger cocoon effects than semantic tokenization, whereas larger models maintain greater code-space diversity and better retain access to niche content. These findings suggest that information cocoons in generative recommendation are shaped not only by recommendation behavior, but also by item tokenization and model capacity. Our code is available at https://github.com/Dregen-Yor/RecLoop.
