ReCast: Recasting Learning Signals for Reinforcement Learning in Generative Recommendation
The ReCast framework improves Pass@1 by up to 36.6% in generative recommendation by recasting sparse-hit learning signals.
Key Findings
Methodology
ReCast is a repair-then-contrast learning-signal framework designed for sparse-hit generative recommendation. It restores minimal learnability for all-zero groups by injecting a valid positive anchor and applies a boundary-focused contrastive update on the strongest positive and hardest negative, replacing full-group reward normalization. ReCast leaves the outer RL framework unchanged, modifies only within-group signal construction, and partially decouples rollout search width from actor-side update width.
Key Results
- ReCast consistently outperforms OpenOneRec-RL across multiple generative recommendation tasks, achieving up to 36.6% relative improvement in Pass@1. In matched-budget comparisons, ReCast reaches the baseline's target performance with only 4.1% of the rollout budget, and this advantage widens with model scale.
- At the system level, ReCast reduces actor-side update time by 16.60x, lowers peak allocated memory by 16.5%, and improves actor MFU by 14.2%.
- Mechanism analysis shows that ReCast mitigates persistent all-zero/single-hit regimes, restores learnability when natural positives are scarce, and converts otherwise wasted rollout budget into more stable policy updates.
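The headline metrics above are straightforward to compute. A minimal sketch of Pass@1 (the fraction of requests whose single top-ranked generated item is a hit) and of the relative-improvement figure; the helper names are illustrative, not from the paper:

```python
def pass_at_1(predictions, ground_truth):
    """Fraction of requests whose top-ranked generated item is a hit.

    predictions:  one ranked item list per request.
    ground_truth: one set of relevant item ids per request.
    """
    hits = sum(1 for preds, relevant in zip(predictions, ground_truth)
               if preds and preds[0] in relevant)
    return hits / len(predictions)

def relative_improvement(new, base):
    """Relative gain of `new` over `base`, e.g. 0.366 for a 36.6% lift."""
    return (new - base) / base
```

For example, a baseline Pass@1 of 0.150 lifted to 0.2049 corresponds to the reported 36.6% relative improvement.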
Significance
The ReCast framework holds significant value for generative recommendation: it not only enhances recommendation quality but also substantially improves the scaling efficiency of RL post-training. By addressing signal degeneracy in sparse-hit scenarios, where natural positives are scarce, ReCast restores learnability and stabilizes policy updates, offering a new perspective on reinforcement learning for generative recommendation.
Technical Contribution
ReCast's technical contribution lies in its signal construction method: instead of traditional group-level reward normalization, it employs a boundary-focused contrastive update. This improves the quality of learning signals while reducing computational cost. By partially decoupling rollout search width from actor-side update width, ReCast scales well to large models and sparse-hit scenarios.
Novelty
ReCast is the first to introduce a repair-then-contrast signal design in generative recommendation, addressing signal degeneracy under sparse-hit conditions. Compared to existing methods, ReCast focuses not only on reward assignment but also on constructing learnable optimization events from sparse, structured supervision.
Limitations
- ReCast's performance in multi-objective or delayed-feedback environments has not been validated, which may affect its applicability in more complex scenarios.
- The repair mechanism may introduce bias when the model naturally forms learnable boundaries, requiring further optimization.
- The current repair strategy may become unnecessary in stronger backbone networks, necessitating the development of an adaptive RL-SFT interface.
Future Work
Future work could explore ReCast's application in multi-objective and delayed-feedback environments. Additionally, developing an adaptive RL-SFT interface to dynamically adjust repair and signal update strategies could further enhance the model's adaptability and performance.
AI Executive Summary
In generative recommendation systems, traditional reinforcement learning methods often assume that sampled groups are already usable learning signals. However, in sparse-hit scenarios, this assumption frequently fails as many sampled groups never become trainable learning units.
The ReCast framework addresses this issue through a repair-then-contrast signal design. Initially, ReCast restores minimal learnability for all-zero groups by injecting a valid positive anchor. It then applies a boundary-focused contrastive update, updating only the strongest positive and hardest negative, replacing full-group reward normalization.
Experimental results demonstrate that ReCast consistently outperforms existing methods across multiple generative recommendation tasks, achieving up to 36.6% relative improvement in Pass@1. In matched-budget comparisons, ReCast reaches the baseline's target performance with only 4.1% of the rollout budget, and this advantage widens with model scale.
ReCast not only enhances recommendation quality but also significantly improves the scaling efficiency of RL post-training. At the system level, ReCast reduces actor-side update time by 16.60x, lowers peak allocated memory by 16.5%, and improves actor MFU by 14.2%.
However, ReCast's performance in multi-objective or delayed-feedback environments has not been validated, which may affect its applicability in more complex scenarios. Future work could explore ReCast's application in these environments and develop an adaptive RL-SFT interface to dynamically adjust repair and signal update strategies, further enhancing the model's adaptability and performance.
Deep Analysis
Background
Generative recommendation systems have garnered significant attention in recent years, focusing on directly generating recommendation items through generative models rather than traditional candidate scoring. Reinforcement learning (RL) is widely applied to optimize metrics such as hit rate. However, existing methods largely inherit generic group-based RL approaches, assuming that sampled groups are already usable learning signals. In sparse-hit scenarios, this assumption often fails as many sampled groups never become trainable learning units.
Core Problem
In sparse-hit generative recommendation, many sampled groups never become trainable learning units. All-zero groups are unlearnable due to the lack of a positive-negative boundary, and single-hit groups, while trainable, are fragile, with updates dominated by one accidental hit and noisy group statistics. Binary supervision further collapses structured near misses into the same zero-reward class as fully irrelevant outputs.
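The degeneracy can be seen directly in GRPO-style group normalization: with binary rewards, an all-zero group has zero mean and zero spread, so every advantage collapses to zero and the group contributes no gradient, while a single-hit group produces one large positive outlier that dominates the update. A minimal numeric illustration (not the paper's code):

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: (r - mean) / (std + eps) within a rollout group."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

all_zero = group_advantages([0, 0, 0, 0])    # every advantage is 0.0: no gradient
single_hit = group_advantages([1, 0, 0, 0])  # one dominant positive, noisy negatives
```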
Innovation
The ReCast framework addresses signal degeneracy under sparse-hit conditions through a repair-then-contrast signal design. Initially, ReCast restores minimal learnability for all-zero groups by injecting a valid positive anchor. It then applies a boundary-focused contrastive update, updating only the strongest positive and hardest negative, replacing full-group reward normalization. ReCast leaves the outer RL framework unchanged, modifies only within-group signal construction, and partially decouples rollout search width from actor-side update width.
Methodology
- Repair all-zero groups: restore minimal learnability by injecting a valid positive anchor.
- Boundary contrastive update: update only the strongest positive and hardest negative, replacing full-group reward normalization.
- Keep the outer RL framework unchanged, modifying only within-group signal construction.
- Partially decouple rollout search width from actor-side update width, enhancing scaling efficiency.
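The steps above can be sketched as within-group signal construction. Everything here (function names, the margin-based choice of boundary examples, the anchor handling) is a simplified illustration under stated assumptions, not the paper's implementation:

```python
def recast_signal(group, margins, hits, anchor=None):
    """Repair-then-contrast signal construction for one rollout group.

    group:   generated items (rollouts) for one request.
    margins: per-rollout graded score (e.g. model likelihood), used only
             to pick boundary examples -- an illustrative choice.
    hits:    binary rewards in {0, 1} from exact-match supervision.
    anchor:  a known valid positive, injected when the group is all-zero.
    """
    group, margins, hits = list(group), list(margins), list(hits)

    # Repair: an all-zero group has no positive-negative boundary,
    # so inject a valid positive anchor to restore learnability.
    if not any(hits) and anchor is not None:
        group.append(anchor)
        margins.append(1.0)  # treat the anchor as a confident positive
        hits.append(1)

    pos = [i for i, h in enumerate(hits) if h]
    neg = [i for i, h in enumerate(hits) if not h]
    if not pos:
        return []  # still unlearnable: nothing to update on

    # Boundary contrast: only the strongest positive and the hardest
    # negative (highest-scoring miss) enter the actor-side update,
    # replacing full-group reward normalization.
    strongest = max(pos, key=lambda i: margins[i])
    pairs = [(group[strongest], 1.0)]
    if neg:
        hardest = max(neg, key=lambda i: margins[i])
        pairs.append((group[hardest], -1.0))
    return pairs
```

Because only two rollouts per group reach the actor update, the rollout search width (group size) is decoupled from the update width, which is where the system-level savings in update time and memory come from.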
Experiments
Experiments were conducted across multiple generative recommendation tasks, including short video recommendation, ad recommendation, and product recommendation. The baseline used was OpenOneRec-RL, which applies GRPO-style group-level reward normalization in its RL stage. The experiments evaluated ReCast's performance improvement under the same budget and analyzed the roles of repair and boundary-focused updating.
Results
ReCast consistently outperforms OpenOneRec-RL across multiple generative recommendation tasks, achieving up to 36.6% relative improvement in Pass@1. In matched-budget comparisons, ReCast reaches the baseline's target performance with only 4.1% of the rollout budget, and this advantage widens with model scale. At the system level, ReCast reduces actor-side update time by 16.60x, lowers peak allocated memory by 16.5%, and improves actor MFU by 14.2%.
Applications
ReCast can be directly applied to generative recommendation systems, particularly in scenarios where natural positives are scarce. By enhancing recommendation quality and system efficiency, ReCast holds significant commercial value in fields such as ad recommendation and product recommendation.
Limitations & Outlook
ReCast's performance in multi-objective or delayed-feedback environments has not been validated, which may affect its applicability in more complex scenarios. The repair mechanism may introduce bias when the model naturally forms learnable boundaries, requiring further optimization. The current repair strategy may become unnecessary in stronger backbone networks, necessitating the development of an adaptive RL-SFT interface.
Plain Language (accessible to non-experts)
Imagine you're in a kitchen, preparing a complex dish. The traditional method is to prepare all the ingredients at once, hoping they'll come together perfectly to create a delicious dish. But in reality, sometimes we find that some ingredients aren't fresh enough or don't pair well, leading to a dish that doesn't taste as expected.
ReCast is like a smart chef who continuously adjusts the combination of ingredients during cooking to ensure each step maximizes the potential of the ingredients. First, he checks all the ingredients to ensure none are completely useless. If he finds any that aren't fresh enough, he'll use some spices to enhance their flavor.
Next, he focuses on the ingredients that will most enhance the dish's flavor, rather than spreading his attention evenly across all ingredients. This way, he can ensure each dish reaches its best taste, rather than relying on luck.
Through this method, ReCast not only improves the overall quality of the dish but also reduces wasted ingredients and time. It's like a magician in the kitchen, able to create amazing flavors under limited conditions.
ELI14 (explained like you're 14)
Hey there, friends! Today I want to tell you about something super cool called ReCast. Imagine you're playing a game, and your goal is to find treasure hidden on a map. The traditional method is to explore the entire map at once, hoping to find the treasure. But sometimes, this method isn't very efficient because the map is too big and the treasure is too scarce.
ReCast is like a smart explorer who first checks the map to see where the treasure might be hidden. If he finds a place with no signs of treasure, he'll use some clues to help himself find potential treasure locations.
Then, he focuses on the places most likely to have treasure, rather than spreading his time evenly across the entire map. This way, he can find the treasure faster, rather than relying on luck.
Through this method, ReCast not only increases the chance of finding treasure but also reduces wasted time and effort. It's like a magic tool in the game, helping you reach your goal faster under limited conditions.
Glossary
ReCast
ReCast is a repair-then-contrast learning-signal framework designed for sparse-hit generative recommendation. It restores minimal learnability for all-zero groups by injecting a valid positive anchor and applies a boundary-focused contrastive update.
In the paper, ReCast is used to address signal degeneracy under sparse-hit conditions.
Reinforcement Learning
A machine learning method that learns policies by interacting with the environment to maximize cumulative rewards.
In generative recommendation, RL is used to optimize metrics such as hit rate.
Generative Recommendation
Directly generating recommendation items through generative models rather than traditional candidate scoring.
ReCast is applied in generative recommendation tasks to improve recommendation quality.
Signal Degeneracy
The failure mode in which sampled rollout groups carry no usable learning signal, as when many groups in sparse-hit scenarios never become trainable learning units.
ReCast addresses this issue through a repair-then-contrast signal design.
Boundary Contrastive Update
Updating only the strongest positive and hardest negative, replacing full-group reward normalization.
ReCast uses this method to improve the quality of learning signals.
All-zero Group
In sparse-hit scenarios, a group where all responses receive zero rewards, making it unlearnable due to the lack of a positive-negative boundary.
ReCast restores minimal learnability for all-zero groups by injecting a valid positive anchor.
Single-hit Group
In sparse-hit scenarios, a group with only one positive sample, where updates are dominated by one accidental hit and noisy group statistics.
ReCast improves the stability of single-hit groups through boundary contrastive updates.
Repair Mechanism
Restoring minimal learnability for all-zero groups by injecting a valid positive anchor.
ReCast's repair mechanism addresses signal degeneracy issues.
System Efficiency
Refers to the utilization efficiency of time, memory, and computational resources under the same budget.
ReCast significantly improves system efficiency by reducing actor-side update time and memory usage.
Sparse-hit
Scenarios in generative recommendation where natural positives are scarce.
ReCast optimizes recommendation quality in sparse-hit scenarios through a repair-then-contrast signal design.
Open Questions (unanswered questions from this research)
1. ReCast's performance in multi-objective or delayed-feedback environments has not been validated, which may affect its applicability in more complex scenarios. Future research needs to explore its robustness in different environments.
2. The repair mechanism may introduce bias when the model naturally forms learnable boundaries, requiring further optimization. Researchers need to develop an adaptive RL-SFT interface to dynamically adjust repair and signal update strategies.
3. The current repair strategy may become unnecessary in stronger backbone networks, necessitating the exploration of more flexible repair strategies to adapt to different model scales and task requirements.
4. The performance of ReCast's boundary contrastive update strategy in more complex recommendation tasks is unclear, requiring further research into its applicability across multiple recommendation metrics.
5. The stability and performance retention of ReCast during long-term training need to be verified, especially on large-scale datasets and high-complexity tasks.
Applications
Immediate Applications
Ad Recommendation Optimization
By enhancing recommendation quality and system efficiency, ReCast can be directly applied to ad recommendation systems, helping advertisers more accurately reach target users and improve ad conversion rates.
Product Recommendation Enhancement
In e-commerce platforms, ReCast can be used to optimize product recommendations, enhancing user shopping experience and platform sales.
Short Video Recommendation
ReCast performs excellently in short video recommendation, helping platforms increase user engagement and watch time, thereby boosting ad revenue.
Long-term Vision
Cross-platform Recommendation Systems
ReCast's efficiency and adaptability make it a potential core technology for cross-platform recommendation systems, supporting personalized recommendations for various content forms.
Intelligent Content Generation
By optimizing generative recommendation, ReCast can drive the development of intelligent content generation technologies, supporting automated content creation and distribution to enhance user experience.
Abstract
Generic group-based RL assumes that sampled rollout groups are already usable learning signals. We show that this assumption breaks down in sparse-hit generative recommendation, where many sampled groups never become learnable at all. We propose ReCast, a repair-then-contrast learning-signal framework that first restores minimal learnability for all-zero groups and then replaces full-group reward normalization with a boundary-focused contrastive update on the strongest positive and the hardest negative. ReCast leaves the outer RL framework unchanged, modifies only within-group signal construction, and partially decouples rollout search width from actor-side update width. Across multiple generative recommendation tasks, ReCast consistently outperforms OpenOneRec-RL, achieving up to 36.6% relative improvement in Pass@1. Its matched-budget advantage is substantially larger: ReCast reaches the baseline's target performance with only 4.1% of the rollout budget, and this advantage widens with model scale. The same design also yields direct system-level gains, reducing actor-side update time by 16.60x, lowering peak allocated memory by 16.5%, and improving actor MFU by 14.2%. Mechanism analysis shows that ReCast mitigates the persistent all-zero / single-hit regime, restores learnability when natural positives are scarce, and converts otherwise wasted rollout budget into more stable policy updates. These results suggest that, for generative recommendation, the decisive RL problem is not only how to assign rewards, but how to construct learnable optimization events from sparse, structured supervision.
References (20)
Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation
Hongxun Ding, Keqin Bao, Jizhi Zhang et al.
Recommender Systems with Generative Retrieval
Shashank Rajput, Nikhil Mehta, Anima Singh et al.
EAGER-LLM: Enhancing Large Language Models as Recommenders through Exogenous Behavior-Semantic Integration
Minjie Hong, Yan Xia, Zehan Wang et al.
OpenOneRec Technical Report
Guorui Zhou, Honghui Bao, Jiaming Huang et al.
OneRec Technical Report
Guorui Zhou, Jiaxin Deng, Jinghao Zhang et al.
OxygenREC: An Instruction-Following Generative Framework for E-commerce Recommendation
Xuegang Hao, Ming Zhang, Alex Li et al.
Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5)
Shijie Geng, Shuchang Liu, Zuohui Fu et al.
M6-Rec: Generative Pretrained Language Models are Open-Ended Recommender Systems
Zeyu Cui, Jianxin Ma, Chang Zhou et al.
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Adam Suma, Sam Dauncey
OneRec-Think: In-Text Reasoning for Generative Recommendation
Zhanyun Liu, Shiyao Wang, Xing-Yao Wang et al.
EAGER: Two-Stream Generative Recommender with Behavior-Semantic Collaboration
Yejin Wang, Jiahao Xun, Ming Hong et al.
UNGER: Generative Recommendation with A Unified Code via Semantic and Collaborative Integration
Longtao Xiao, Haozhao Wang, Cheng Wang et al.
OneRec-V2 Technical Report
Guorui Zhou, Hengrui Hu, Hongtao Cheng et al.
Adapting Large Language Models by Integrating Collaborative Semantics for Recommendation
Bowen Zheng, Yupeng Hou, Hongyu Lu et al.
SAGE: Sequence-level Adaptive Gradient Evolution for Generative Recommendation
Yu Xie, Xingkai Ren, Ying Qi et al.
Rec-R1: Bridging Generative Large Language Models and User-Centric Recommendation Systems via Reinforcement Learning
Jiacheng Lin, Tian Wang, Kun Qian
Reinforced Latent Reasoning for LLM-based Recommendation
Yang Zhang, Wenxin Xu, Xiaoyan Zhao et al.
Reasoning over Semantic IDs Enhances Generative Recommendation
Y. He, Yanfan Sun, Junfei Tan et al.
GFlowGR: Fine-tuning Generative Recommendation Frameworks with Generative Flow Networks
Yejing Wang, Shengyu Zhou, Jinyu Lu et al.
Learnable Item Tokenization for Generative Recommendation
Wenjie Wang, Honghui Bao, Xinyu Lin et al.