Reward Hacking in Rubric-Based Reinforcement Learning
The study proposes a framework for diagnosing reward hacking in rubric-based RL and finds that even strong verification does not eliminate it.
Anas Mahmoud, MohammadHossein Rezaei, Zihao Wang et al.