Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

TL;DR

Skill-RM unifies heterogeneous evaluation criteria through agent skills that orchestrate evaluation resources dynamically, and it outperforms traditional judges by a reported 3-6% on RewardBench2.

cs.LG 2026-06-03

Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang, Yihao Liu, Jingwei Ni, Jiaqi Guo, Mengyu Zhou, Kai Tang, Junling Liu, Qinliang Su, Xiaoxi Jiang, Guanjun Jiang

Reward Modeling, Resource Scheduling, Agent Skills, Evaluation Framework, Reinforcement Learning

Key Findings

Methodology

Skill-RM recasts reward modeling as a structured, skill-mediated execution process. Its central artifact is the Reward-Evaluation Skill (SRM), which encapsulates evaluation protocols, resource invocation schemas, and evidence collection schemas. The skill operates alongside a resource bank (URM) and an agentic judge (π_φ). During evaluation, the judge retrieves and executes heterogeneous resources such as rubrics, references, and verifiers, guided by the explicit invocation protocols defined in MRM. It produces structured evidence for each criterion, and a deterministic readout function aggregates that evidence into a final reward. Reward assessment thus shifts from static scalar scoring to an active, interpretable, resource-aware procedure. The main mechanisms are criterion-level evidence collection, resource invocation protocols, and evidence aggregation rules. The framework is evaluated on RewardBench2, RM-Bench, and JudgeBench, where Qwen-3.5-27B and Qwen-3.5-122B models show gains of 3-6% over baselines.
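As a rough illustration of the flow described above, the sketch below models a skill specification, a resource bank, and a loop that collects one evidence record per criterion and resource. All names (`Criterion`, `Evidence`, `collect_evidence`, the two toy resources) are hypothetical and are not taken from the Skill-RM code base. The agentic judge is replaced by a fixed loop that calls every permitted resource, so this is a simplification rather than the authors' method.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical resource signature: (prompt, response) -> score in [0, 1].
Resource = Callable[[str, str], float]


@dataclass
class Criterion:
    name: str
    resources: List[str]  # names of resources this criterion may invoke


@dataclass
class Evidence:
    criterion: str
    resource: str
    score: float


def length_verifier(prompt: str, response: str) -> float:
    """Toy rule-based verifier: passes if the response has at most 20 words."""
    return 1.0 if len(response.split()) <= 20 else 0.0


def reference_match(prompt: str, response: str) -> float:
    """Toy reference check: passes if the expected answer string appears."""
    return 1.0 if "Paris" in response else 0.0


# Toy resource bank, keyed by resource name (assumed layout).
RESOURCE_BANK: Dict[str, Resource] = {
    "length_verifier": length_verifier,
    "reference_match": reference_match,
}

# Toy skill specification: each criterion lists the resources it may call.
SKILL: List[Criterion] = [
    Criterion("conciseness", ["length_verifier"]),
    Criterion("correctness", ["reference_match"]),
]


def collect_evidence(prompt: str, response: str,
                     skill: List[Criterion],
                     bank: Dict[str, Resource]) -> List[Evidence]:
    """Invoke every permitted resource and record one Evidence per call."""
    records: List[Evidence] = []
    for criterion in skill:
        for name in criterion.resources:
            if name not in bank:
                continue  # skip resources missing from the bank
            score = bank[name](prompt, response)
            records.append(Evidence(criterion.name, name, score))
    return records


evidence = collect_evidence(
    "What is the capital of France?",
    "The capital of France is Paris.",
    SKILL,
    RESOURCE_BANK,
)
for record in evidence:
    print(record)
```

In the framework itself the judge decides which resources to call for a given input; calling every permitted resource here simply keeps the sketch deterministic and easy to follow.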

Key Results

On RewardBench2, Skill-RM achieved an average score of 86.2, ahead of traditional judges such as GPT-4 (65.9) and static reward models such as INF-ORM-Llama3.1-70B (74.0), a margin of 12.2 points over the stronger of these two baselines. The method remained robust across evaluation dimensions including content quality, factual correctness, and stylistic consistency.
On RM-Bench and JudgeBench, Skill-RM outperformed comparator models by 4-6 percentage points on the content-subtlety and style evaluation metrics. In multi-resource scheduling scenarios, it integrated diverse evidence sources effectively, yielding more nuanced and accurate judgments.
Ablation studies confirmed that the gains stem primarily from Skill-RM's dynamic resource scheduling and evidence synthesis, rather than from greater resource availability alone. When resource scheduling was disabled, performance dropped by approximately 4%, which underlines the importance of the orchestration mechanism.
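The RewardBench2 score gaps follow directly from the three averages quoted above. A quick check, using only those figures (the dictionary keys are labels chosen here for readability):

```python
# RewardBench2 averages as quoted in the results above.
scores = {
    "Skill-RM": 86.2,
    "GPT-4 judge": 65.9,
    "INF-ORM-Llama3.1-70B": 74.0,
}

# Margin of Skill-RM over each baseline, rounded to one decimal place.
margins = {
    name: round(scores["Skill-RM"] - value, 1)
    for name, value in scores.items()
    if name != "Skill-RM"
}
print(margins)
```

The gaps come to 20.3 points over the GPT-4 judge and 12.2 points over INF-ORM-Llama3.1-70B.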

Significance

This work addresses a fundamental challenge in reward modeling: the integration of heterogeneous evaluation criteria and resources into a unified, interpretable framework. Traditional scalar reward models lack transparency and adaptability, limiting their effectiveness in complex, multi-faceted tasks. Skill-RM's agent skill paradigm enables explicit, resource-aware evaluation, which not only improves reward quality but also enhances interpretability and trustworthiness. This advancement holds significant implications for aligning large language models with human preferences, automating complex evaluation processes, and advancing reinforcement learning with structured feedback. By providing a scalable, modular architecture, Skill-RM paves the way for more reliable, transparent, and adaptable AI systems in both research and industry applications.

Technical Contribution

The core technical innovation is to formalize reward evaluation as a reusable, structured skill, the Reward-Evaluation Skill, that orchestrates heterogeneous resources through explicit invocation protocols. This design makes reward computation modular, flexible, and interpretable, in contrast to earlier approaches that rely on monolithic prompts or opaque scalar scores. The framework introduces criterion-level evidence collection, resource invocation schemas, and evidence aggregation rules, all encapsulated in a formal specification (MRM). The resource bank (URM) is curated through an LLM-assisted pipeline to keep resources reusable and consistent. The agentic judge (π_φ) interacts with the resources and generates structured evidence (e_m) for each criterion, which is then aggregated into a final judgment z. A deterministic readout function (A) converts this structured judgment into a scalar or a selection, unifying the pointwise and multi-candidate evaluation paradigms. Experiments on multiple benchmarks indicate that this modular, resource-aware approach outperforms existing models.
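One way to picture a deterministic readout of this kind is a pair of functions: one maps a structured judgment to a scalar for a single response, the other returns the index of the best response among several candidates. The sketch below assumes a weighted average as the aggregation rule and uses hypothetical names (`readout_pointwise`, `readout_select`, the example weights); the summary does not state which rule the readout function A actually applies.

```python
from typing import Dict, List

# A structured judgment: criterion name -> evidence score in [0, 1].
Judgment = Dict[str, float]


def readout_pointwise(z: Judgment, weights: Dict[str, float]) -> float:
    """Weighted average over the criteria present in the judgment."""
    total = sum(weights[c] for c in z)
    if total == 0:
        raise ValueError("weights of the judged criteria must not sum to zero")
    return sum(weights[c] * score for c, score in z.items()) / total


def readout_select(zs: List[Judgment], weights: Dict[str, float]) -> int:
    """Index of the highest-reward candidate; ties go to the earliest one."""
    rewards = [readout_pointwise(z, weights) for z in zs]
    return max(range(len(rewards)), key=rewards.__getitem__)


# Assumed weights: correctness counts twice as much as conciseness.
weights = {"correctness": 2.0, "conciseness": 1.0}
candidates = [
    {"correctness": 0.0, "conciseness": 1.0},
    {"correctness": 1.0, "conciseness": 0.5},
]

print(readout_pointwise(candidates[1], weights))  # pointwise use
print(readout_select(candidates, weights))        # multi-candidate use
```

Because both functions share the same scalar rule, the multi-candidate case is just the pointwise readout applied to each candidate followed by a fixed tie-breaking choice, which is one simple way the two paradigms can share an interface.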

Novelty

This research introduces the concept of Reward-Evaluation Skill as a modular, executable artifact for reward modeling, a novel abstraction that encapsulates resource orchestration, evidence collection, and judgment synthesis. Unlike prior work limited to static prompts or single-resource evaluation, Skill-RM formalizes the entire reward computation as a structured, resource-driven process, enabling dynamic, input-adaptive evaluation. The explicit invocation protocol and structured evidence chain provide unprecedented interpretability and flexibility, setting a new paradigm in reward modeling. This approach is the first to systematically unify heterogeneous evaluation resources within a formal, reusable skill framework, significantly advancing the state-of-the-art in reward-based alignment and evaluation systems.

Limitations

The construction and maintenance of the resource bank (URM) require substantial manual effort and domain expertise, which may limit scalability and rapid deployment in diverse application scenarios.
In highly complex or multi-turn interactions, the resource invocation and evidence aggregation process may introduce latency and computational overhead, impacting real-time applicability.
The current framework primarily focuses on static evaluation scenarios; extending it to dynamic, multi-modal, and multi-turn contexts remains an open challenge requiring further research.

Future Work

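Going forward, the authors plan to optimize the resource scheduling strategy, reduce the maintenance cost of the resource bank, and explore mechanisms for generating and updating resources automatically. They also plan to extend the framework to multimodal, multi-turn dialogue, and real-time interaction scenarios in order to improve the system's generalization and response speed. Combining reinforcement learning with self-supervised learning to make evidence integration and reward feedback more efficient is another important direction. Learning-driven resource scheduling, which would let the system adjust its resource invocation strategy autonomously across tasks and environments, is expected to advance the practical deployment and adoption of reward models considerably.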

AI Executive Summary

In the rapidly evolving landscape of large language models (LLMs), reward models (RMs) serve as a cornerstone for aligning model behaviors with human preferences and task-specific goals. Traditional reward mechanisms predominantly rely on static scalar scores or simple preference comparisons, which often lack transparency, interpretability, and adaptability across diverse evaluation scenarios. As LLM capabilities expand into reasoning, coding, and multi-modal interactions, the evaluation criteria become increasingly complex, involving external references, multi-step reasoning, safety constraints, and resource-dependent verification. Existing approaches struggle to integrate these heterogeneous resources into a coherent, flexible reward system, limiting their effectiveness and trustworthiness.

Addressing this challenge, Tao Chen and colleagues introduce Skill-RM, a novel framework that reconceptualizes reward modeling as a structured, agent skill-driven execution process. Central to this approach is the Reward-Evaluation Skill, a modular, reusable artifact that encapsulates evaluation protocols, resource invocation schemas, and evidence collection strategies. This skill orchestrates the entire reward computation lifecycle, from dynamically retrieving relevant resources—such as rubrics, references, verifiers—to synthesizing structured evidence and producing a final, interpretable reward output. Unlike traditional methods that treat reward as a monolithic scalar, Skill-RM emphasizes explicit, evidence-grounded decision-making, enhancing transparency and robustness.

The framework's core components include the explicit specification of evaluation criteria, a curated resource bank (URM), and a skill-mediated evaluation process driven by an agentic judge (π_φ). During evaluation, the judge actively interacts with the resource bank, invoking relevant tools and resources based on input prompts and response candidates. This process generates structured evidence for each criterion, which is then aggregated into a final judgment via a deterministic readout function. The entire process is governed by a formal invocation protocol, ensuring reproducibility, interpretability, and adaptability across tasks.

Extensive experiments demonstrate the effectiveness of Skill-RM. On benchmarks like RewardBench2, RM-Bench, and JudgeBench, it outperforms existing models, including GPT-4 judges and static reward models, with improvements of 3-6 percentage points in average scores. The ablation studies confirm that the performance gains are primarily due to the dynamic resource scheduling and evidence synthesis capabilities of the framework. These results highlight the potential of Skill-RM to significantly improve reward quality, model alignment, and evaluation transparency.

In conclusion, Skill-RM offers a scalable, modular, and interpretable approach to reward modeling, capable of handling the increasing complexity of evaluation criteria in modern LLM applications. Its emphasis on explicit resource orchestration and structured evidence collection paves the way for more reliable, transparent, and adaptable AI systems. Future directions include automating resource management, extending to multi-modal and multi-turn scenarios, and integrating with reinforcement learning to further enhance model alignment and safety.

Deep Dive

Abstract

Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics, where a unified mechanism to integrate all types of evidence remains unexplored. To this end, we propose Skill Reward Model (Skill-RM), a unified framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. By treating reward computation as a structured agentic task, Skill-RM provides a consistent interface to orchestrate heterogeneous resources, dynamically selecting and aggregating evidence tailored to the specific requirements of each input. This approach enables the reward model to move beyond static evaluation, ensuring consistency and transparency across diverse tasks. Extensive experiments on reward benchmarks and downstream applications, including best-of-N selection and reinforcement learning, demonstrate that Skill-RM consistently outperforms traditional judge baselines. Our findings suggest that Skill-RM not only provides a unified solution for reward modeling but also achieves superior performance through the strategic and dynamic orchestration of evidence. The code is at https://github.com/Qwen-Applications/Skill-RM.

cs.LG cs.CL

References (20)

Sijun Tan, Siyuan Zhuang, Kyle Montgomery et al. JudgeBench: A Benchmark for Evaluating LLM-based Judges. 2024.

Saumya Malik, Valentina Pyatkin, Sander Land et al. RewardBench 2: Advancing Reward Model Evaluation. 2025.

Rafael Rafailov, Archit Sharma, E. Mitchell et al. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. 2023.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. 2023.

Yanna Jiang, Delong Li, Hai Deng et al. SoK: Agentic Skills - Beyond Tool Use in LLM Agents. 2026.

Ran Xu, Jingjing Chen, Jiayu Ye et al. Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning. 2025.

Hao Peng, Yunjia Qi, Xiaozhi Wang et al. VerIF: Verification Engineering for Reinforcement Learning in Instruction Following. 2025.

Guanzhi Wang, Yuqi Xie, Yunfan Jiang et al. Voyager: An Open-Ended Embodied Agent with Large Language Models. 2023.

Seungone Kim, Juyoung Suk, Shayne Longpre et al. Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. 2024.

Pengyu Cheng, Jiawen Xie, Ke Bai et al. Everyone Deserves A Reward: Learning Customized Human Preferences. 2023.

Seonghyeon Ye, Doyoung Kim, Sungdong Kim et al. FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets. 2023.

Haoxiang Wang, Wei Xiong, Tengyang Xie et al. Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts. 2024.

Mark Chen, Jerry Tworek, Heewoo Jun et al. Evaluating Large Language Models Trained on Code. 2021.

Ilgee Hong, Changlong Yu, Liang Qiu et al. Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models. 2025.

Sewon Min, Kalpesh Krishna, Xinxi Lyu et al. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. 2023.

Arduin Findeis, Floris Weers, Guoli Yin et al. Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge? 2025.

Seungone Kim, Jamin Shin, Yejin Cho et al. Prometheus: Inducing Fine-grained Evaluation Capability in Language Models. 2023.

Yuxin Jiang, Yufei Wang, Xingshan Zeng et al. FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models. 2023.

Nathan Lambert, Valentina Pyatkin, Jacob Daniel Morrison et al. RewardBench: Evaluating Reward Models for Language Modeling. 2024.

Hongliang Lu, Yuhang Wen, Pengyu Cheng et al. Search Self-play: Pushing the Frontier of Agent Capability without Supervision. 2025.
