STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability
Proposes STARE, a surprisal-guided, token-level advantage reweighting method that stabilizes policy entropy and improves accuracy by 4%-8% on models ranging from 1.5B to 32B parameters.
Haipeng Luo, Qingfeng Sun, Songli Wu et al.