Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill
Skill-RM unifies heterogeneous evaluation criteria through agent skills that orchestrate evaluation resources dynamically, and it outperforms traditional judges by 3-6% on RewardBench2.
Key Findings
Methodology
The Skill-RM framework recasts reward modeling as a structured, skill-mediated execution process. Its central artifact is the Reward-Evaluation Skill (SRM), which encapsulates evaluation protocols, resource invocation schemas, and evidence collection schemas. Two further components complete the system: a resource bank (URM) and an agentic judge (πφ). During evaluation, the judge dynamically retrieves and executes heterogeneous resources, such as rubrics, references, and verifiers, following the explicit invocation protocols defined in MRM. It produces structured evidence for each criterion, and a deterministic readout function aggregates that evidence into the final reward. Reward assessment thereby shifts from static scalar scoring to an active, interpretable, resource-aware procedure built on criterion-level evidence collection, resource invocation protocols, and evidence aggregation rules. The framework is evaluated on RewardBench2, RM-Bench, and JudgeBench, where models such as Qwen-3.5-27B and Qwen-3.5-122B gain 3-6% over baselines.
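The flow described above (criteria, resource bank, judge, per-criterion evidence, deterministic readout) can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `Criterion`, `Evidence`, `RESOURCE_BANK`, `SKILL`, `judge`, and `readout` are hypothetical, the resources are toy checks, and the rule-based `judge` stands in for what the paper describes as an LLM agent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# All names here are illustrative assumptions, not identifiers from the paper.

@dataclass
class Criterion:
    name: str
    resource: str  # key of the resource in the bank that this criterion consults

@dataclass
class Evidence:
    criterion: str
    resource: str
    passed: bool
    note: str

# A resource is modeled as a check: (prompt, response) -> bool.
Resource = Callable[[str, str], bool]

# Toy resource bank: a ground-truth reference match and a length verifier.
RESOURCE_BANK: Dict[str, Resource] = {
    "reference": lambda prompt, response: "4" in response,
    "verifier": lambda prompt, response: len(response.split()) <= 20,
}

# Toy skill: the list of criteria and the resource each one relies on.
SKILL: List[Criterion] = [
    Criterion("correctness", "reference"),
    Criterion("conciseness", "verifier"),
]

def judge(prompt: str, response: str, skill: List[Criterion],
          bank: Dict[str, Resource]) -> List[Evidence]:
    """Stand-in for the agentic judge: one evidence record per criterion."""
    evidence = []
    for criterion in skill:
        ok = bank[criterion.resource](prompt, response)
        note = f"{criterion.resource} check {'passed' if ok else 'failed'}"
        evidence.append(Evidence(criterion.name, criterion.resource, ok, note))
    return evidence

def readout(evidence: List[Evidence]) -> float:
    """Assumed deterministic readout: fraction of criteria satisfied."""
    return sum(e.passed for e in evidence) / len(evidence)

evidence = judge("What is 2 + 2?", "The answer is 4.", SKILL, RESOURCE_BANK)
reward = readout(evidence)
```

Because the readout is a pure function of the evidence list, the same evidence always yields the same reward, and each record names the criterion and resource behind its verdict.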
Key Results
- On RewardBench2, Skill-RM achieved an average score of 86.2, ahead of traditional judges such as GPT-4 (65.9) and static reward models such as INF-ORM-Llama3.1-70B (74.0), a margin of 12.2 points over the latter and 20.3 points over GPT-4. The method remained robust across multiple evaluation dimensions, including content quality, factual correctness, and stylistic consistency.
- On RM-Bench and JudgeBench, Skill-RM outperformed comparator models by 4-6 percentage points on the content-subtlety and style metrics. In multi-resource scheduling scenarios, it integrated diverse evidence sources effectively, leading to more nuanced and accurate judgments.
- Ablation studies confirmed that performance gains primarily stem from the dynamic resource scheduling and evidence synthesis capabilities of Skill-RM, rather than merely increasing resource availability. When resource scheduling was disabled, performance dropped by approximately 4%, highlighting the importance of the proposed orchestration mechanism.
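As a quick arithmetic check on the RewardBench2 figures quoted in the first bullet, the margins can be computed directly from the three reported averages. The snippet uses only those numbers; the variable names are illustrative.

```python
# Average RewardBench2 scores as quoted in the results above.
scores = {
    "Skill-RM": 86.2,
    "INF-ORM-Llama3.1-70B": 74.0,
    "GPT-4": 65.9,
}

# Margin of Skill-RM over each comparator, rounded to one decimal place
# to avoid floating-point noise in the subtraction.
margins = {
    name: round(scores["Skill-RM"] - value, 1)
    for name, value in scores.items()
    if name != "Skill-RM"
}
```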
Significance
This work addresses a fundamental challenge in reward modeling: the integration of heterogeneous evaluation criteria and resources into a unified, interpretable framework. Traditional scalar reward models lack transparency and adaptability, limiting their effectiveness in complex, multi-faceted tasks. Skill-RM's agent skill paradigm enables explicit, resource-aware evaluation, which not only improves reward quality but also enhances interpretability and trustworthiness. This advancement holds significant implications for aligning large language models with human preferences, automating complex evaluation processes, and advancing reinforcement learning with structured feedback. By providing a scalable, modular architecture, Skill-RM paves the way for more reliable, transparent, and adaptable AI systems in both research and industry applications.
Technical Contribution
The core technical innovation lies in formalizing reward evaluation as a reusable, structured skill—Reward-Evaluation Skill—that orchestrates heterogeneous resources via explicit invocation protocols. This design enables modular, flexible, and interpretable reward computation, contrasting sharply with previous approaches relying on monolithic prompts or opaque scalar scores. The framework introduces criterion-level evidence collection, resource invocation schemas, and evidence aggregation rules, all encapsulated within a formal specification (MRM). The resource bank (URM) is curated through an LLM-assisted pipeline, ensuring reusability and consistency. The agentic judge (πφ) actively interacts with resources, generating structured evidence (𝑒𝑚) for each criterion, which is then aggregated into a final judgment 𝑧. The deterministic readout function (A) interprets the structured judgment into a scalar or selection output, unifying pointwise and multi-candidate evaluation paradigms. Extensive experiments validate that this modular, resource-aware approach outperforms existing models in multiple benchmarks, demonstrating its broad applicability and robustness.
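To illustrate how a single deterministic readout could serve both pointwise and multi-candidate evaluation, here is a minimal sketch. The shape of the structured judgment (one dictionary of per-criterion scores per candidate), the weighted-mean rule, and the `weights` argument are assumptions made for illustration; the source does not specify the actual aggregation rule.

```python
from typing import Dict, List, Union

# Hypothetical shape of the structured judgment z: one dict of per-criterion
# scores in [0, 1] for each candidate response.
Judgment = List[Dict[str, float]]

def readout(z: Judgment, weights: Dict[str, float]) -> Union[float, int]:
    """Assumed deterministic readout based on a weighted mean per candidate.

    One candidate   -> scalar reward (pointwise mode).
    Many candidates -> index of the highest-scoring one (selection mode).
    """
    total = sum(weights.values())
    scores = [sum(weights[c] * s[c] for c in weights) / total for s in z]
    if len(scores) == 1:
        return scores[0]
    # max() keeps the first maximal index, so ties resolve deterministically.
    return max(range(len(scores)), key=scores.__getitem__)

weights = {"correctness": 2.0, "style": 1.0}

# Pointwise mode: a single candidate yields a scalar reward.
pointwise = readout([{"correctness": 1.0, "style": 0.5}], weights)

# Selection mode: several candidates yield the index of the preferred one.
selected = readout(
    [{"correctness": 0.0, "style": 1.0}, {"correctness": 1.0, "style": 0.0}],
    weights,
)
```

Both modes share the same per-candidate scoring, so a pointwise reward and a best-of-N choice are guaranteed to be consistent with each other.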
Novelty
This research introduces the Reward-Evaluation Skill as a modular, executable artifact for reward modeling: an abstraction that encapsulates resource orchestration, evidence collection, and judgment synthesis. Unlike prior work limited to static prompts or single-resource evaluation, Skill-RM formalizes the entire reward computation as a structured, resource-driven process, enabling dynamic, input-adaptive evaluation. The explicit invocation protocol and structured evidence chain make each judgment traceable to the criteria and resources that produced it. The approach is described as the first to systematically unify heterogeneous evaluation resources within a formal, reusable skill framework.
Limitations
- The construction and maintenance of the resource bank (URM) require substantial manual effort and domain expertise, which may limit scalability and rapid deployment in diverse application scenarios.
- In highly complex or multi-turn interactions, the resource invocation and evidence aggregation process may introduce latency and computational overhead, impacting real-time applicability.
- The current framework primarily focuses on static evaluation scenarios; extending it to dynamic, multi-modal, and multi-turn contexts remains an open challenge requiring further research.
Future Work
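In future work, the authors plan to optimize resource scheduling strategies, reduce the maintenance cost of the resource bank, and explore mechanisms for automated resource generation and updating. They also intend to extend the framework to multi-modal, multi-turn dialogue, and real-time interaction scenarios, improving the system's generalization and response speed. Another important direction is combining reinforcement learning with self-supervised learning to make evidence integration and reward feedback more efficient. Introducing learning-driven resource scheduling policies would let the system adjust its resource invocation strategy autonomously across tasks and environments, which would greatly advance the practical adoption of reward models.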
AI Executive Summary
In the rapidly evolving landscape of large language models (LLMs), reward models (RMs) serve as a cornerstone for aligning model behaviors with human preferences and task-specific goals. Traditional reward mechanisms predominantly rely on static scalar scores or simple preference comparisons, which often lack transparency, interpretability, and adaptability across diverse evaluation scenarios. As LLM capabilities expand into reasoning, coding, and multi-modal interactions, the evaluation criteria become increasingly complex, involving external references, multi-step reasoning, safety constraints, and resource-dependent verification. Existing approaches struggle to integrate these heterogeneous resources into a coherent, flexible reward system, limiting their effectiveness and trustworthiness.
Addressing this challenge, Tao Chen and colleagues introduce Skill-RM, a novel framework that reconceptualizes reward modeling as a structured, agent skill-driven execution process. Central to this approach is the Reward-Evaluation Skill, a modular, reusable artifact that encapsulates evaluation protocols, resource invocation schemas, and evidence collection strategies. This skill orchestrates the entire reward computation lifecycle, from dynamically retrieving relevant resources—such as rubrics, references, verifiers—to synthesizing structured evidence and producing a final, interpretable reward output. Unlike traditional methods that treat reward as a monolithic scalar, Skill-RM emphasizes explicit, evidence-grounded decision-making, enhancing transparency and robustness.
The framework's core components include the explicit specification of evaluation criteria, a curated resource bank (URM), and a skill-mediated evaluation process driven by an agentic judge (πφ). During evaluation, the judge actively interacts with the resource bank, invoking relevant tools and resources based on input prompts and response candidates. This process generates structured evidence for each criterion, which is then aggregated into a final judgment via a deterministic readout function. The entire process is governed by a formal invocation protocol, ensuring reproducibility, interpretability, and adaptability across tasks.
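One way to picture an input-dependent invocation protocol is as a list of declarative rules, each pairing a resource with a trigger condition on the prompt and response. The sketch below is hypothetical: `InvocationRule`, `PROTOCOL`, `select_resources`, the resource names, and the trigger heuristics are all assumptions, and in the framework described above the selection is performed by the agentic judge rather than by fixed predicates.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class InvocationRule:
    resource: str
    when: Callable[[str, str], bool]  # (prompt, response) -> invoke this resource?

# Hypothetical protocol: which resource applies under which condition.
PROTOCOL: List[InvocationRule] = [
    # Run a code verifier only when the response contains a function definition.
    InvocationRule("code_verifier", lambda p, r: "def " in r),
    # Consult a math reference only when the prompt contains an arithmetic expression.
    InvocationRule(
        "math_reference",
        lambda p, r: bool(re.search(r"\d+\s*[-+*/]\s*\d+", p)),
    ),
    # The style rubric applies to every input.
    InvocationRule("style_rubric", lambda p, r: True),
]

def select_resources(prompt: str, response: str) -> List[str]:
    """Return the resources whose trigger fires, in protocol order."""
    return [rule.resource for rule in PROTOCOL if rule.when(prompt, response)]

math_plan = select_resources("Compute 12 * 7.", "12 * 7 = 84")
code_plan = select_resources(
    "Write an add function.", "def add(a, b):\n    return a + b"
)
```

Keeping the rules as data rather than burying them in a prompt is what makes the selection reproducible: the same input always triggers the same set of resources.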
Extensive experiments demonstrate the effectiveness of Skill-RM. On benchmarks like RewardBench2, RM-Bench, and JudgeBench, it outperforms existing models, including GPT-4 judges and static reward models, with improvements of 3-6 percentage points in average scores. The ablation studies confirm that the performance gains are primarily due to the dynamic resource scheduling and evidence synthesis capabilities of the framework. These results highlight the potential of Skill-RM to significantly improve reward quality, model alignment, and evaluation transparency.
In conclusion, Skill-RM offers a scalable, modular, and interpretable approach to reward modeling, capable of handling the increasing complexity of evaluation criteria in modern LLM applications. Its emphasis on explicit resource orchestration and structured evidence collection paves the way for more reliable, transparent, and adaptable AI systems. Future directions include automating resource management, extending to multi-modal and multi-turn scenarios, and integrating with reinforcement learning to further enhance model alignment and safety.
Deep Dive
Abstract
Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics, where a unified mechanism to integrate all types of evidence remains unexplored. To this end, we propose Skill Reward Model (Skill-RM), a unified framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. By treating reward computation as a structured agentic task, Skill-RM provides a consistent interface to orchestrate heterogeneous resources, dynamically selecting and aggregating evidence tailored to the specific requirements of each input. This approach enables the reward model to move beyond static evaluation, ensuring consistency and transparency across diverse tasks. Extensive experiments on reward benchmarks and downstream applications, including best-of-N selection and reinforcement learning, demonstrate that Skill-RM consistently outperforms traditional judge baselines. Our findings suggest that Skill-RM not only provides a unified solution for reward modeling but also achieves superior performance through the strategic and dynamic orchestration of evidence. The code is at https://github.com/Qwen-Applications/Skill-RM.