ReCast: Recasting Learning Signals for Reinforcement Learning in Generative Recommendation

TL;DR

The ReCast framework improves Pass@1 by up to 36.6% in generative recommendation by repairing and recasting sparse-hit learning signals.

cs.LG · 2026-04-24
Peiyan Zhang, Hanmo Liu, Chengxuan Tong, Yuxia Wu, Wei Guo, Yong Liu
reinforcement learning · generative recommendation · signal reconstruction · contrastive learning · system optimization

Key Findings

Methodology

ReCast is a repair-then-contrast learning-signal framework designed for sparse-hit generative recommendation. It restores minimal learnability for all-zero groups by injecting a valid positive anchor and applies a boundary-focused contrastive update on the strongest positive and hardest negative, replacing full-group reward normalization. ReCast leaves the outer RL framework unchanged, modifies only within-group signal construction, and partially decouples rollout search width from actor-side update width.
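To make this concrete, below is a minimal PyTorch sketch of the within-group signal construction, under assumptions the summary does not spell out: the anchor is the log-probability of a ground-truth positive, selection is done on sequence log-probabilities, and the contrastive objective takes a simple logistic form. `recast_signals` and its arguments are illustrative names, not the paper's API.

```python
import torch
import torch.nn.functional as F

def recast_signals(seq_logps, rewards, anchor_logp=None):
    """Repair-then-contrast signal construction for one rollout group.

    seq_logps:   (G,) policy log-probs of the G sampled sequences.
    rewards:     (G,) binary hit rewards for the same sequences.
    anchor_logp: scalar log-prob of an injected ground-truth positive,
                 used only when the group is all-zero (the repair step).
    """
    pos_mask = rewards > 0
    if not pos_mask.any():
        # Repair: an all-zero group has no positive-negative boundary;
        # inject a valid positive anchor to restore minimal learnability.
        if anchor_logp is None:
            raise ValueError("all-zero group requires a positive anchor")
        pos_logp = anchor_logp
        neg_candidates = seq_logps
    else:
        # Strongest positive: the hit the policy already ranks highest.
        pos_logp = seq_logps[pos_mask].max()
        neg_candidates = seq_logps[~pos_mask]
    if neg_candidates.numel() == 0:
        return seq_logps.new_zeros(())  # no boundary to sharpen
    # Hardest negative: the zero-reward sequence with the highest
    # policy log-prob, i.e. the miss most confusable with a hit.
    neg_logp = neg_candidates.max()
    # Boundary-focused contrastive loss: push the strongest positive
    # above the hardest negative instead of normalizing all G rewards.
    return F.softplus(neg_logp - pos_logp)

# Example: an all-zero group repaired with an anchor.
loss = recast_signals(torch.randn(8), torch.zeros(8),
                      anchor_logp=torch.tensor(-2.0))
```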

Key Results

  • ReCast consistently outperforms OpenOneRec-RL across multiple generative recommendation tasks, achieving up to 36.6% relative improvement in Pass@1. In matched-budget comparisons its advantage is even larger: ReCast reaches the baseline's target performance with only 4.1% of the rollout budget, and this advantage widens with model scale.
  • At the system level, ReCast reduces actor-side update time by 16.60x, lowers peak allocated memory by 16.5%, and improves actor MFU by 14.2%.
  • Mechanism analysis shows that ReCast mitigates persistent all-zero/single-hit regimes, restores learnability when natural positives are scarce, and converts otherwise wasted rollout budget into more stable policy updates.

Significance

The ReCast framework holds significant value for generative recommendation: it not only enhances recommendation quality but also substantially improves the scaling efficiency of RL post-training. By addressing signal degeneracy in sparse-hit scenarios, where natural positives are scarce, ReCast restores learnability and stabilizes policy updates, offering new insights into reinforcement learning for generative recommendation.

Technical Contribution

ReCast's main technical contribution is its signal-construction method, which departs from traditional group-level reward normalization by employing a boundary-focused contrastive update. This improves the quality of learning signals while reducing computational cost. By partially decoupling search width from update width, ReCast is especially effective for large models and sparse-hit scenarios.

Novelty

ReCast is the first to introduce a repair-then-contrast signal design in generative recommendation, addressing signal degeneracy under sparse-hit conditions. Compared to existing methods, ReCast focuses not only on reward assignment but also on constructing learnable optimization events from sparse, structured supervision.

Limitations

  • ReCast's performance in multi-objective or delayed-feedback environments has not been validated, which may affect its applicability in more complex scenarios.
  • The repair mechanism may introduce bias when the model naturally forms learnable boundaries, requiring further optimization.
  • The current repair strategy may become unnecessary in stronger backbone networks, necessitating the development of an adaptive RL-SFT interface.

Future Work

Future work could explore ReCast's application in multi-objective and delayed-feedback environments. Additionally, developing an adaptive RL-SFT interface to dynamically adjust repair and signal update strategies could further enhance the model's adaptability and performance.

AI Executive Summary

In generative recommendation systems, traditional reinforcement learning methods often assume that sampled groups are already usable learning signals. However, in sparse-hit scenarios, this assumption frequently fails as many sampled groups never become trainable learning units.

The ReCast framework addresses this issue through a repair-then-contrast signal design. First, ReCast restores minimal learnability for all-zero groups by injecting a valid positive anchor. It then applies a boundary-focused contrastive update that touches only the strongest positive and the hardest negative, replacing full-group reward normalization.

Experimental results demonstrate that ReCast consistently outperforms existing methods across multiple generative recommendation tasks, achieving up to 36.6% relative improvement in Pass@1. In matched-budget comparisons its advantage is even larger: ReCast reaches the baseline's target performance with only 4.1% of the rollout budget, and this advantage widens with model scale.

ReCast not only enhances recommendation quality but also significantly improves the scaling efficiency of RL post-training. At the system level, ReCast reduces actor-side update time by 16.60x, lowers peak allocated memory by 16.5%, and improves actor MFU by 14.2%.

However, ReCast's performance in multi-objective or delayed-feedback environments has not been validated, which may affect its applicability in more complex scenarios. Future work could explore ReCast's application in these environments and develop an adaptive RL-SFT interface to dynamically adjust repair and signal update strategies, further enhancing the model's adaptability and performance.

Deep Analysis

Background

Generative recommendation systems have garnered significant attention in recent years, focusing on directly generating recommendation items through generative models rather than traditional candidate scoring. Reinforcement learning (RL) is widely applied to optimize metrics such as hit rate. However, existing methods largely inherit generic group-based RL approaches, assuming that sampled groups are already usable learning signals. In sparse-hit scenarios, this assumption often fails as many sampled groups never become trainable learning units.

Core Problem

In sparse-hit generative recommendation, many sampled groups never become trainable learning units. All-zero groups are unlearnable because they contain no positive-negative boundary: under group-level reward normalization, identical rewards yield zero advantages and hence no gradient. Single-hit groups, while trainable, are fragile, with updates dominated by one accidental hit and noisy group statistics. Binary supervision further collapses structured near misses into the same zero-reward class as fully irrelevant outputs.
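To see why an all-zero group yields no learning signal, consider GRPO-style group-level reward normalization, which the OpenOneRec-RL baseline uses (see Experiments below). A small numeric check, assuming a standard epsilon-stabilized form:

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    # Group-normalized advantages: A_i = (r_i - mean(r)) / (std(r) + eps).
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# All-zero group: every advantage is exactly 0, so the policy gradient
# vanishes and the group never becomes a trainable learning unit.
print(grpo_advantages(torch.zeros(8)))   # tensor([0., 0., ..., 0.])

# Single-hit group: the lone accidental hit gets a large advantage
# (~2.47) while the seven misses share a small one (~-0.35), so the
# update is dominated by that single hit and noisy group statistics.
single_hit = torch.zeros(8)
single_hit[0] = 1.0
print(grpo_advantages(single_hit))
```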

Innovation

The ReCast framework addresses signal degeneracy under sparse-hit conditions through a repair-then-contrast signal design. First, ReCast restores minimal learnability for all-zero groups by injecting a valid positive anchor. It then applies a boundary-focused contrastive update that touches only the strongest positive and the hardest negative, replacing full-group reward normalization. ReCast leaves the outer RL framework unchanged, modifies only within-group signal construction, and partially decouples rollout search width from actor-side update width.

Methodology

  • Repair all-zero groups: restore minimal learnability by injecting a valid positive anchor.
  • Boundary contrastive update: update only the strongest positive and the hardest negative, replacing full-group reward normalization.
  • Leave the outer RL framework unchanged, modifying only within-group signal construction.
  • Partially decouple rollout search width from actor-side update width, improving scaling efficiency (a toy illustration follows this list).
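A toy illustration of the decoupling in the last bullet, under the assumption that the wide rollout pass can be scored without building an autograd graph while gradients flow only through the selected boundary pair; the linear "actor" and top-k selection are stand-ins for the real recommender and the reward-based boundary selection above:

```python
import torch
import torch.nn as nn

actor = nn.Linear(16, 1)   # stand-in for the generative recommender
G, K = 64, 2               # rollout search width vs. actor update width

rollouts = torch.randn(G, 16)             # features of G sampled sequences

with torch.no_grad():                     # wide search: no autograd graph,
    scores = actor(rollouts).squeeze(-1)  # so memory stays flat in G

keep = scores.topk(K).indices           # stand-in for picking the strongest
                                        # positive / hardest negative pair
loss = -actor(rollouts[keep]).mean()    # narrow update: the graph covers
loss.backward()                         # only K sequences, so update time and
                                        # activation memory scale with K, not G
```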

Experiments

Experiments were conducted across multiple generative recommendation tasks, including short-video recommendation, ad recommendation, and product recommendation. The baseline was OpenOneRec-RL, which applies GRPO-style group-level reward normalization in its RL stage. The experiments measured ReCast's improvement under matched rollout budgets and analyzed the respective roles of repair and boundary-focused updating.

Results

ReCast consistently outperforms OpenOneRec-RL across multiple generative recommendation tasks, achieving up to 36.6% relative improvement in Pass@1. In matched-budget comparisons its advantage is even larger: ReCast reaches the baseline's target performance with only 4.1% of the rollout budget, and this advantage widens with model scale. At the system level, ReCast reduces actor-side update time by 16.60x, lowers peak allocated memory by 16.5%, and improves actor MFU by 14.2%.

Applications

ReCast can be directly applied to generative recommendation systems, particularly in scenarios where natural positives are scarce. By enhancing recommendation quality and system efficiency, ReCast holds significant commercial value in fields such as ad recommendation and product recommendation.

Limitations & Outlook

ReCast's performance in multi-objective or delayed-feedback environments has not been validated, which may affect its applicability in more complex scenarios. The repair mechanism may introduce bias when the model naturally forms learnable boundaries, requiring further optimization. The current repair strategy may become unnecessary in stronger backbone networks, necessitating the development of an adaptive RL-SFT interface.

Plain Language (accessible to non-experts)

Imagine you're in a kitchen, preparing a complex dish. The traditional method is to prepare all the ingredients at once, hoping they'll come together perfectly to create a delicious dish. But in reality, sometimes we find that some ingredients aren't fresh enough or don't pair well, leading to a dish that doesn't taste as expected.

ReCast is like a smart chef who continuously adjusts the combination of ingredients during cooking to ensure each step maximizes the potential of the ingredients. First, he checks all the ingredients to ensure none are completely useless. If he finds any that aren't fresh enough, he'll use some spices to enhance their flavor.

Next, he focuses on the ingredients that will most enhance the dish's flavor, rather than spreading his attention evenly across all ingredients. This way, he can ensure each dish reaches its best taste, rather than relying on luck.

Through this method, ReCast not only improves the overall quality of the dish but also reduces wasted ingredients and time. It's like a magician in the kitchen, able to create amazing flavors under limited conditions.

ELI14 (explained like you're 14)

Hey there, friends! Today I want to tell you about something super cool called ReCast. Imagine you're playing a game, and your goal is to find treasure hidden on a map. The traditional method is to explore the entire map at once, hoping to find the treasure. But sometimes, this method isn't very efficient because the map is too big and the treasure is too scarce.

ReCast is like a smart explorer who first checks the map to see where the treasure might be hidden. If he finds a place with no signs of treasure, he'll use some clues to help himself find potential treasure locations.

Then, he focuses on the places most likely to have treasure, rather than spreading his time evenly across the entire map. This way, he can find the treasure faster, rather than relying on luck.

Through this method, ReCast not only increases the chance of finding treasure but also reduces wasted time and effort. It's like a magic tool in the game, helping you reach your goal faster under limited conditions.

Glossary

ReCast

ReCast is a repair-then-contrast learning-signal framework designed for sparse-hit generative recommendation. It restores minimal learnability for all-zero groups by injecting a valid positive anchor and applies a boundary-focused contrastive update.

In the paper, ReCast is used to address signal degeneracy under sparse-hit conditions.

Reinforcement Learning

A machine learning method that learns policies by interacting with the environment to maximize cumulative rewards.

In generative recommendation, RL is used to optimize metrics such as hit rate.

Generative Recommendation

Directly generating recommendation items through generative models rather than traditional candidate scoring.

ReCast is applied in generative recommendation tasks to improve recommendation quality.

Signal Degeneracy

In sparse-hit scenarios, many sampled groups never become trainable learning units, leading to signal degeneracy.

ReCast addresses this issue through a repair-then-contrast signal design.

Boundary Contrastive Update

Updating only the strongest positive and hardest negative, replacing full-group reward normalization.

ReCast uses this method to improve the quality of learning signals.

All-zero Group

In sparse-hit scenarios, a group where all responses receive zero rewards, making it unlearnable due to the lack of a positive-negative boundary.

ReCast restores minimal learnability for all-zero groups by injecting a valid positive anchor.

Single-hit Group

In sparse-hit scenarios, a group with only one positive sample, where updates are dominated by one accidental hit and noisy group statistics.

ReCast improves the stability of single-hit groups through boundary contrastive updates.

Repair Mechanism

Restoring minimal learnability for all-zero groups by injecting a valid positive anchor.

ReCast's repair mechanism addresses signal degeneracy issues.

System Efficiency

Refers to the utilization efficiency of time, memory, and computational resources under the same budget.

ReCast significantly improves system efficiency by reducing actor-side update time and memory usage.

Sparse-hit

Scenarios in generative recommendation where natural positives are scarce.

ReCast optimizes recommendation quality in sparse-hit scenarios through a repair-then-contrast signal design.

Open Questions (unanswered questions from this research)

  1. ReCast's performance in multi-objective or delayed-feedback environments has not been validated, which may affect its applicability in more complex scenarios. Future research needs to explore its robustness in different environments.
  2. The repair mechanism may introduce bias when the model naturally forms learnable boundaries, requiring further optimization. Researchers need to develop an adaptive RL-SFT interface to dynamically adjust repair and signal update strategies.
  3. The current repair strategy may become unnecessary in stronger backbone networks, necessitating the exploration of more flexible repair strategies that adapt to different model scales and task requirements.
  4. The performance of ReCast's boundary contrastive update strategy in more complex recommendation tasks is unclear, requiring further research into its applicability across multiple recommendation metrics.
  5. The stability and performance retention of ReCast during long-term training need to be verified, especially on large-scale datasets and high-complexity tasks.

Applications

Immediate Applications

Ad Recommendation Optimization

By enhancing recommendation quality and system efficiency, ReCast can be directly applied to ad recommendation systems, helping advertisers more accurately reach target users and improve ad conversion rates.

Product Recommendation Enhancement

In e-commerce platforms, ReCast can be used to optimize product recommendations, enhancing user shopping experience and platform sales.

Short Video Recommendation

ReCast performs excellently in short video recommendation, helping platforms increase user engagement and watch time, thereby boosting ad revenue.

Long-term Vision

Cross-platform Recommendation Systems

ReCast's efficiency and adaptability make it a potential core technology for cross-platform recommendation systems, supporting personalized recommendations for various content forms.

Intelligent Content Generation

By optimizing generative recommendation, ReCast can drive the development of intelligent content generation technologies, supporting automated content creation and distribution to enhance user experience.

Abstract

Generic group-based RL assumes that sampled rollout groups are already usable learning signals. We show that this assumption breaks down in sparse-hit generative recommendation, where many sampled groups never become learnable at all. We propose ReCast, a repair-then-contrast learning-signal framework that first restores minimal learnability for all-zero groups and then replaces full-group reward normalization with a boundary-focused contrastive update on the strongest positive and the hardest negative. ReCast leaves the outer RL framework unchanged, modifies only within-group signal construction, and partially decouples rollout search width from actor-side update width. Across multiple generative recommendation tasks, ReCast consistently outperforms OpenOneRec-RL, achieving up to 36.6% relative improvement in Pass@1. Its matched-budget advantage is substantially larger: ReCast reaches the baseline's target performance with only 4.1% of the rollout budget, and this advantage widens with model scale. The same design also yields direct system-level gains, reducing actor-side update time by 16.60x, lowering peak allocated memory by 16.5%, and improving actor MFU by 14.2%. Mechanism analysis shows that ReCast mitigates the persistent all-zero / single-hit regime, restores learnability when natural positives are scarce, and converts otherwise wasted rollout budget into more stable policy updates. These results suggest that, for generative recommendation, the decisive RL problem is not only how to assign rewards, but how to construct learnable optimization events from sparse, structured supervision.

cs.LG cs.AI cs.IR

References (20)

Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation. Hongxun Ding, Keqin Bao, Jizhi Zhang et al. (2026)

Recommender Systems with Generative Retrieval. Shashank Rajput, Nikhil Mehta, Anima Singh et al. (2023)

EAGER-LLM: Enhancing Large Language Models as Recommenders through Exogenous Behavior-Semantic Integration. Minjie Hong, Yan Xia, Zehan Wang et al. (2025)

OpenOneRec Technical Report. Guorui Zhou, Honghui Bao, Jiaming Huang et al. (2025)

OneRec Technical Report. Guorui Zhou, Jiaxin Deng, Jinghao Zhang et al. (2025)

OxygenREC: An Instruction-Following Generative Framework for E-commerce Recommendation. Xuegang Hao, Ming Zhang, Alex Li et al. (2025)

Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5). Shijie Geng, Shuchang Liu, Zuohui Fu et al. (2022)

M6-Rec: Generative Pretrained Language Models are Open-Ended Recommender Systems. Zeyu Cui, Jianxin Ma, Chang Zhou et al. (2022)

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. DeepSeek-AI (2025)

OneRec-Think: In-Text Reasoning for Generative Recommendation. Zhanyun Liu, Shiyao Wang, Xing-Yao Wang et al. (2025)

EAGER: Two-Stream Generative Recommender with Behavior-Semantic Collaboration. Yejin Wang, Jiahao Xun, Ming Hong et al. (2024)

UNGER: Generative Recommendation with A Unified Code via Semantic and Collaborative Integration. Longtao Xiao, Haozhao Wang, Cheng Wang et al. (2025)

OneRec-V2 Technical Report. Guorui Zhou, Hengrui Hu, Hongtao Cheng et al. (2025)

Adapting Large Language Models by Integrating Collaborative Semantics for Recommendation. Bowen Zheng, Yupeng Hou, Hongyu Lu et al. (2023)

SAGE: Sequence-level Adaptive Gradient Evolution for Generative Recommendation. Yu Xie, Xingkai Ren, Ying Qi et al. (2026)

Rec-R1: Bridging Generative Large Language Models and User-Centric Recommendation Systems via Reinforcement Learning. Jiacheng Lin, Tian Wang, Kun Qian (2025)

Reinforced Latent Reasoning for LLM-based Recommendation. Yang Zhang, Wenxin Xu, Xiaoyan Zhao et al. (2025)

Reasoning over Semantic IDs Enhances Generative Recommendation. Y. He, Yanfan Sun, Junfei Tan et al. (2026)

GFlowGR: Fine-tuning Generative Recommendation Frameworks with Generative Flow Networks. Yejing Wang, Shengyu Zhou, Jinyu Lu et al. (2025)

Learnable Item Tokenization for Generative Recommendation. Wenjie Wang, Honghui Bao, Xinyu Lin et al. (2024)