SkillResolve-Bench: Measuring and Resolving Same-Capability Ambiguity in Agent Skill Retrieval

TL;DR

Introduces SkillResolve-Bench, a benchmark for same-capability ambiguity in agent skill retrieval, and SkillResolve, a reference method that reaches Recall@3 0.766 and NDCG@3 0.699 while keeping HSR@3 at 0.

cs.IR, Advanced, 2026-06-09

Jiandong Ding


AI Skill Retrieval, Capability Ambiguity, Risk Mitigation, Benchmarking, Knowledge Base Management

Key Findings

Methodology

This study presents SkillResolve-Bench 1.0, an auditable benchmark of 661 helpful/risky skill pairs with source-role and admission evidence, cue/leakage checks, query-disjoint splits, and a candidate pool of 7,982 skills. The reference method, SkillResolve, uses a Capability Resolver to identify candidate groups that belong to the same capability family, a query-conditioned Utility Scorer to estimate each candidate's usefulness, and a Representative Selector to keep one top candidate per group before final ranking. The utility model draws on resource bindings, preconditions, API scope, output schema, and procedure cues, and is trained with pairwise logistic regression to separate helpful skills from confusable negatives. At inference time, the system resolves candidate groups, scores the candidates within each group, and keeps the highest-utility representative, so that a risky sibling of the selected skill does not surface in the top-K list. Experiments compare SkillResolve against baselines such as SkillRouter and BGE reranking and report higher Recall@3 together with lower HSR@3.
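The resolve, score, and select flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Skill` record, its `family` and `utility` fields, and the toy skill names are hypothetical, and the utility values are hard-coded stand-ins for what a trained query-conditioned scorer would produce.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    skill_id: str
    family: str     # capability family label (hypothetical field)
    utility: float  # stand-in for a query-conditioned utility score


def resolve_and_rank(candidates, k=3, select_representative=True):
    """Group candidates by family, keep one representative each, rank by utility."""
    if select_representative:
        groups = defaultdict(list)
        for skill in candidates:
            groups[skill.family].append(skill)
        # One representative per family: the highest-utility member.
        pool = [max(group, key=lambda s: s.utility) for group in groups.values()]
    else:
        # Ablation-style variant: rank every candidate directly.
        pool = list(candidates)
    ranked = sorted(pool, key=lambda s: s.utility, reverse=True)
    return [s.skill_id for s in ranked[:k]]


# Toy candidates: two siblings in the "deploy" family, one of them stale.
candidates = [
    Skill("deploy_v2", "deploy", 0.91),
    Skill("deploy_v1_stale", "deploy", 0.88),
    Skill("rollback", "rollback", 0.40),
    Skill("notify", "notify", 0.35),
]

with_selection = resolve_and_rank(candidates, k=3)
without_selection = resolve_and_rank(candidates, k=3, select_representative=False)
print(with_selection)     # the stale sibling is filtered out
print(without_selection)  # the stale sibling appears in the top 3
```

In this toy case the stale sibling scores almost as high as the helpful skill, so plain utility ranking exposes it, while per-family selection keeps it out of the list.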

Key Results

On SkillResolve-Bench 1.0 (661 helpful/risky pairs), SkillResolve achieved Recall@3 0.766, NDCG@3 0.699, and a harmful sibling rate (HSR@3) of 0, outperforming SkillRouter (Recall@3 0.654, HSR@3 0.693) and BGE reranking (Recall@3 0.461, HSR@3 0.461). These results were obtained against the full candidate pool of 7,982 skills.
By integrating capability group resolution and within-group representative selection, the system significantly reduces the exposure of risky skills while maintaining high retrieval quality, effectively balancing recall and safety.
Ablation studies confirm that capability resolution and representative selection are critical components, with their removal leading to increased risk exposure and decreased recall, validating the design choices.
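To make the three reported metrics concrete, here is a small Python sketch of how they could be computed per query and averaged. It assumes one helpful skill and one risky sibling per query with binary relevance, which matches the pairing described in this summary. The exact NDCG formulation used in the paper is not given here, so the single-relevant-item form below is an assumption, and the toy rankings are invented for illustration.

```python
import math


def recall_at_k(ranked, helpful, k):
    """1.0 if the helpful skill appears in the top k, else 0.0."""
    return 1.0 if helpful in ranked[:k] else 0.0


def ndcg_at_k(ranked, helpful, k):
    """NDCG with a single relevant item: the ideal DCG is 1."""
    for position, skill_id in enumerate(ranked[:k]):
        if skill_id == helpful:
            return 1.0 / math.log2(position + 2)
    return 0.0


def hsr_at_k(ranked, risky, k):
    """1.0 if the risky sibling is exposed in the top k, else 0.0."""
    return 1.0 if risky in ranked[:k] else 0.0


def evaluate(runs, k=3):
    """runs: list of (ranked_ids, helpful_id, risky_id) tuples, one per query."""
    n = len(runs)
    return {
        "recall": sum(recall_at_k(r, h, k) for r, h, _ in runs) / n,
        "ndcg": sum(ndcg_at_k(r, h, k) for r, h, _ in runs) / n,
        "hsr": sum(hsr_at_k(r, x, k) for r, _, x in runs) / n,
    }


# Four toy queries; "a" is the helpful skill and "x" its risky sibling.
runs = [
    (["a", "b", "c"], "a", "x"),
    (["x", "a", "c"], "a", "x"),
    (["b", "c", "d"], "a", "x"),
    (["c", "a", "x"], "a", "x"),
]
metrics = evaluate(runs, k=3)
print(metrics)
```

A system can score well on recall and still have a high HSR, as in the second and fourth toy queries, which is why the benchmark reports the two together.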

Significance

This work addresses a gap in agent skill retrieval: managing ambiguity within a capability family so that a risky sibling is not exposed in place of the helpful skill. It provides a systematic benchmark and a practical reference method for improving safety and reliability in large skill libraries. The approach is a step toward agents that are less likely to fail a task or trigger a security problem because of the skill they retrieved, which is relevant to both academic research and industrial deployment.

Technical Contribution

The main technical innovation lies in combining capability family resolution with query-conditioned utility scoring and representative selection. Unlike prior relevance-based methods, SkillResolve explicitly models intra-family ambiguity, using resource, precondition, API, and output cues to score candidates. The framework includes a capability resolver that groups candidates, a pairwise logistic regression scorer trained on confusable negatives, and a representative selector that ensures only the highest-utility skill per family is exposed. This layered approach suppresses risky siblings, improves recall, and makes the retrieval process more interpretable and auditable.
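As an illustration of pairwise logistic regression on helpful versus confusable-negative pairs, the following Python sketch trains a linear scorer on feature differences. It is a generic version of the technique, not the paper's training code: the three binary features (resource match, precondition met, API scope match), the hyperparameters, and the toy pairs are all assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, features):
    """Linear utility: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise(pairs, dim, lr=0.5, epochs=200, l2=0.01):
    """pairs: list of (helpful_features, negative_features).

    Minimizes -log(sigmoid(w . (pos - neg))) plus an L2 penalty with SGD.
    """
    weights = [0.0] * dim
    for _ in range(epochs):
        for pos, neg in pairs:
            diff = [p - n for p, n in zip(pos, neg)]
            margin = score(weights, diff)
            grad_scale = sigmoid(margin) - 1.0  # derivative of the log loss
            weights = [
                w - lr * (grad_scale * d + l2 * w)
                for w, d in zip(weights, diff)
            ]
    return weights


# Hypothetical features: [resource_match, precondition_met, api_scope_match].
pairs = [
    ([1.0, 1.0, 1.0], [0.0, 1.0, 1.0]),  # negative binds a stale resource
    ([1.0, 1.0, 0.0], [1.0, 0.0, 0.0]),  # negative misses a precondition
    ([1.0, 1.0, 1.0], [1.0, 1.0, 0.0]),  # negative has the wrong API scope
]
weights = train_pairwise(pairs, dim=3)
print(weights)
```

Training on differences means the model only has to learn which cues separate a helpful skill from its look-alike, which is the distinction that a pure relevance score tends to miss.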

Novelty

This research is the first to formalize and benchmark the problem of same-capability execution-risk in skill retrieval, introducing a family-based resolution mechanism and a query-specific risk labeling system. It moves beyond traditional relevance metrics, integrating safety considerations directly into the retrieval pipeline. The combination of capability family parsing, query-conditioned scoring, and representative selection constitutes a novel framework that addresses a long-standing challenge in AI skill management.

Limitations

The approach relies on predefined capability family relations, which may be inaccurate or incomplete, affecting the effectiveness of the resolution process. Automating family relation learning remains an open challenge.
In highly complex or dynamic environments, the fixed candidate pool and static family relations may not capture evolving capabilities, leading to residual risks.
The training process depends on carefully mined confusable negatives, which may require extensive feature engineering and domain knowledge, limiting adaptability to new domains or languages.

Future Work

Future efforts will focus on dynamic learning of capability relations using unsupervised or semi-supervised methods, integrating multi-modal cues for richer contract-profile representations, and extending the framework to multi-task, multi-agent scenarios. Additionally, exploring real-time adaptation and online risk assessment will further enhance the safety and robustness of AI skill retrieval systems.

AI Executive Summary

In the rapidly expanding landscape of AI skill libraries, the challenge of accurately retrieving the most relevant capabilities while avoiding risky or irrelevant skills has become increasingly critical. Traditional relevance-based retrieval methods, such as dense embedding models or rerankers like BGE, have demonstrated strong semantic matching capabilities. However, they often struggle with the nuanced problem of intra-family ambiguity—where multiple skills share similar semantics but differ significantly in execution safety. This ambiguity can lead to the exposure of harmful skills, risking task failure or security breaches.

To address this, the authors introduce SkillResolve-Bench 1.0, a comprehensive benchmark designed to quantify and evaluate the ability of retrieval systems to distinguish helpful skills from query-specific risky siblings within large skill pools. The benchmark includes 661 pairs of helpful and risky skills, with detailed annotations on source roles, admission evidence, cue/leakage checks, and a candidate pool of nearly 8,000 skills. This setup simulates real-world scenarios where a system must not only find relevant skills but also avoid exposing potentially harmful ones.

Building upon this, the paper proposes SkillResolve, an innovative retrieval framework that integrates capability family resolution with query-conditioned utility scoring. The core components include a Capability Resolver that groups candidates into families, a Utility Scorer trained via pairwise logistic regression to evaluate skill usefulness considering resource bindings, preconditions, and output schemas, and a Representative Selector that picks the highest-utility skill from each family before the final ranking. This multi-stage process ensures that only the safest and most relevant skills are exposed to the agent.

Experimental results show that SkillResolve outperforms the baseline models. It achieves a Recall@3 of 0.766 and an NDCG@3 of 0.699 while keeping the harmful sibling rate at zero, meaning that no risky sibling appears in the top 3 results. Compared with SkillRouter, it improves Recall@3 by 0.112 and reduces HSR@3 from 0.693 to 0. Ablation studies confirm that capability resolution and representative selection are both needed to balance recall and safety.

This research advances the field by providing a formalized benchmark and a practical solution for intra-family skill ambiguity, addressing a critical gap in AI safety and reliability. Its implications extend to autonomous agents, knowledge management, and secure AI deployment, paving the way for more trustworthy AI systems. Future directions include learning dynamic family relations, incorporating multi-modal cues, and extending the framework to multi-task and multi-agent environments, aiming for adaptable, safe, and efficient AI capabilities.

Deep Analysis

Background

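As AI technology has advanced, skill libraries have become a core component that lets intelligent agents carry out complex tasks. Early skill retrieval relied mainly on keyword matching and simple relevance models such as TF-IDF and cosine similarity, which cannot meet the needs of large-scale, multi-capability, multi-role settings. More recently, deep pretrained models such as BERT and the GPT series have been applied to skill matching and have markedly improved semantic understanding, but capability ambiguity and risk exposure remain. Public skill libraries in particular contain many similar but distinct skills, which can cause execution deviations or even security risks. Related work such as SkillsVote, SkillRouter, and SkillRet tries to improve retrieval through evidence-driven methods and capability-relation parsing, but does not systematically address risk control within a single capability family. Researchers have also begun to study skill safety and governance, proposing measures such as permission management and malicious-skill detection, yet a systematic tool for evaluating capability-ambiguity risk has been missing. Against this background, the study proposes a capability-family-oriented risk benchmark and algorithmic framework to fill that gap.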

Core Problem

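The core problem is that existing skill retrieval systems can find the relevant capability but, within a capability family, tend to expose risky skills: skills that are related to the task yet may cause execution deviations or security problems. Because of this ambiguity, a system cannot reliably separate helpful skills from risky ones when it serves results, which raises the likelihood of task failure and safety incidents. Traditional methods focus on a single relevance signal and ignore representative selection within a family, so risky skills often rank near the top. Solving the problem requires adding capability family resolution and risk control to the retrieval process so that the returned skills are both effective and safe. How to define and identify capability families, and how to filter effectively over a large candidate pool, remain technical difficulties. In addition, the lack of standardized evaluation metrics makes it hard to compare methods directly and slows further research.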

Innovation

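The study's contributions fall into four areas. First, it introduces SkillResolve-Bench 1.0, an auditable benchmark of 661 helpful/risky skill pairs that records source roles, evidence, and cue checks, giving a standardized platform for evaluating capability-ambiguity risk. Second, it designs the SkillResolve framework, which combines capability family resolution (Capability Resolver) with query-conditioned utility scoring (Utility Scorer) to identify capability families and select representative skills in a large candidate pool. The method uses a linear model over contract-profile cues such as resource bindings, preconditions, and API scope to lower the chance of exposing risky skills. Third, it proposes query-specific risk labels, so that the risky sibling is identified per query, which improves retrieval safety. Finally, experiments show that the method outperforms traditional relevance-ranking models on several metrics, offering a new direction for capability management in agent systems.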

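Methodology

- Dataset construction: collect 661 helpful/risky skill pairs, record source roles, evidence, and cue-check information, and build a candidate pool of 7,982 skills.
- Capability family resolution: use the predefined family relation g⋆ to partition candidate skills into capability families, each representing a group of skills with similar capability.
- Query-conditioned scoring: design a linear model Fθ(q, s) that combines contract-profile cues such as resource bindings, preconditions, API scope, and output schema, and learn the parameters θ from positive and negative training examples.
- Negative mining: during training, select skills from the candidate pool that are similar to the positive example but were not adopted, and use them as confusable negatives to sharpen the model's discrimination.
- Candidate group formation: use the capability resolver (ρ) to identify candidate families, forming candidate capability groups G1(q), G2(q), ..., Gm(q), each representing one potential capability family.
- Representative selection: within each capability group, use the utility score U(s) to select the highest-scoring skill as the representative rep(G, q), so that only the best representative is exposed.
- Final ranking: collect the representative of each capability family, rank them globally by U(s), and output the top K skills as the final recommendation list.
- Evaluation metrics: use Recall@K, NDCG@K, and risk exposure (HSR@K) to evaluate the model, aiming for high recall while minimizing the exposure of risky skills.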
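The negative mining step in the list above can be sketched in Python as follows. This is only an illustration of the idea of picking look-alike skills as training negatives: token-level Jaccard similarity over skill descriptions is an assumed stand-in for whatever similarity the authors actually use, and the skill IDs and descriptions are hypothetical.

```python
def tokens(text):
    """Lowercased whitespace tokens of a skill description."""
    return set(text.lower().split())


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mine_confusable_negatives(helpful_id, library, top_n=2):
    """Return the top_n skills most similar to the helpful one, excluding it."""
    target = tokens(library[helpful_id])
    scored = [
        (jaccard(target, tokens(description)), skill_id)
        for skill_id, description in library.items()
        if skill_id != helpful_id
    ]
    # Highest similarity first; ties broken by skill ID for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [skill_id for similarity, skill_id in scored[:top_n] if similarity > 0]


# Hypothetical library: skill ID -> short description.
library = {
    "pg_backup_v2": "back up postgres database to s3 bucket",
    "pg_backup_v1": "back up postgres database to local disk",
    "mysql_backup": "back up mysql database to s3 bucket",
    "send_email": "send an email notification",
}
negatives = mine_confusable_negatives("pg_backup_v2", library)
print(negatives)
```

Unrelated skills such as the email sender are never selected, so the scorer is trained on the hard cases where surface similarity is high.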

Experiments

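Experiments are run on SkillResolve-Bench 1.0, using the 661 helpful/risky skill pairs for evaluation against a candidate pool of 7,982 skills. The model is compared with several baselines (such as SkillRouter, BGE reranking, and Attribution-listwise) on Recall@3, Recall@5, NDCG@3, NDCG@5, and risk exposure (HSR@3, HSR@5). Five-fold cross-validation is used, and model parameters are tuned on the validation set. Negative mining selects similar but non-adopted skills during training so that the model learns to discriminate confusable negatives. The model is an L2-regularized linear logistic regression, and a coefficient α controls how the utility score is blended with the base score. Ablation experiments measure the contribution of capability family resolution, representative selection, and negative mining.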
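The role of the blending coefficient α can be shown with a short Python sketch. The summary does not give the fusion formula, so the convex combination below is an assumption, as are the validation tuples and the grid of α values. The sketch picks the α that most often ranks the helpful skill above its risky sibling on a validation set.

```python
def fuse(base, utility, alpha):
    """Assumed fusion: convex combination of base and utility scores."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return (1.0 - alpha) * base + alpha * utility


def pick_alpha(validation, grid):
    """validation: list of ((base, utility) helpful, (base, utility) risky)."""
    best_alpha, best_accuracy = grid[0], -1.0
    for alpha in grid:
        correct = sum(
            fuse(*helpful, alpha) > fuse(*risky, alpha)
            for helpful, risky in validation
        )
        accuracy = correct / len(validation)
        if accuracy > best_accuracy:  # strict: keep the smallest winning alpha
            best_alpha, best_accuracy = alpha, accuracy
    return best_alpha, best_accuracy


# Hypothetical validation pairs: the base retriever prefers the risky sibling
# in the first two cases, and the utility score corrects that.
validation = [
    ((0.60, 0.90), (0.70, 0.20)),
    ((0.50, 0.80), (0.55, 0.30)),
    ((0.80, 0.40), (0.60, 0.50)),
]
best = pick_alpha(validation, grid=[0.0, 0.25, 0.5, 0.75, 1.0])
print(best)
```

With α = 0 only the base score is used and two of the three pairs are ordered wrongly, while a small amount of utility weight fixes them without overriding the base score in the third pair.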

Results

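In testing, SkillResolve reached Recall@3 0.766 and NDCG@3 0.699 with a risk exposure (HSR@3) of 0, clearly ahead of SkillRouter (Recall@3 0.654, HSR@3 0.693) and BGE reranking (Recall@3 0.461, HSR@3 0.461). Performance remained stable with the candidate pool at 7,982 skills, which the summary takes as evidence of good generalization. Ablation experiments show that capability family resolution and representative selection are the key contributors, and removing either one degrades performance. The model beats the baselines on every reported metric, supporting its effectiveness and robustness in practice.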

Applications

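The technique applies to intelligent assistants, workflow automation, knowledge base management, and similar settings, and is most useful for tasks that demand high safety and reliability. By identifying capability families and their representative skills, a system can avoid the harm that risky skills might cause and improve user experience and trust. In the future it could be combined with multi-modal information and dynamic capability relations and extended to complex multi-task, multi-role environments.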

Limitations & Outlook

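The method currently depends on a predefined capability family relation. If that relation is inaccurate or has poor coverage, selection quality may suffer. Risk exposure may persist in very complex or fast-changing settings, especially when candidate families are partitioned too coarsely or resource information is missing. The reliance on mined negatives during training also adds complexity and tuning effort. Future work should explore more automated family-relation learning and multi-modal fusion to improve adaptability and generalization.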

Plain Language Accessible to non-experts

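Imagine you work in a large restaurant with a long menu. Each time a customer orders, the chef recommends a few dishes based on what they want. Sometimes two dishes look very similar, for example both are called "beef fried noodles", but one may use beef that is no longer fresh, or seasoning that does not suit the customer. Skills in a skill library are like this: similar on the surface, but very different in what happens when they are actually executed.

In this restaurant, the chef has to learn to tell such similar dishes apart so that the recommendation both suits the customer and causes no trouble. The researchers are like that chef. They built a system that finds the most suitable skill among a huge number of skills while avoiding the "bad dishes" that could carry risk.

They designed a "menu filter" that recognizes the characteristics of each dish, such as ingredients, seasoning, and cooking method, and then picks the most suitable dish for the order. This corresponds to capability family resolution and utility scoring in the algorithm, which make sure the recommended skill is both useful and safe.

The experiments show that the system can find the most suitable skill among thousands of skills while avoiding the exposure of risky ones, and that it does much better than traditional methods. In the future, the technique could be used in intelligent assistants, autonomous driving, and other areas to make machines smarter and safer.

In short, like a chef who can tell good dishes from bad ones, this research teaches intelligent systems to pick the best and safest "dish" from a sea of options.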

ELI14 Explained like you're 14

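Imagine you are in your school library, which holds thousands of books. You want a book about "science experiments", but the shelves have many similar books, such as "chemistry experiments" and "physics experiments". If the library's search system only looks at titles, it may recommend books that are not what you need. Skill search works the same way: the system may find a set of skills that look very similar, but some of them do not fit your task and may even cause trouble.

This research is like designing a smart filtering method for the library. It does not just look at the title. It also considers the content, the author, the publication date, and other information to help you find the best match, while avoiding books that are unsuitable or even risky.

Specifically, the researchers built a system called SkillResolve. It sorts all skills into "capability families", much like grouping books by topic. Then, based on what you need, it picks the best representative skill from each group, just as a library would recommend the most relevant book. This way the system makes sure the skills you get are useful and do not carry hidden risk.

The experiments show that this method is smarter than traditional keyword-only search. It greatly reduces the risk of a wrong recommendation and raises the chance of finding the right skill. In the future, this technique could make intelligent assistants safer and more reliable.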

Glossary

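Capability Family

A set of skills with similar task capability, representing different implementations of one kind of function. Technically it is defined through a capability relation graph, and in practice it is used to select representative skills.

Used for capability resolution and representative selection, to avoid the risk that comes from skill ambiguity.

Utility Score

A measure of how useful a skill is for a specific query, used to rank candidates and select representative skills. Technically it is computed from a feature vector with a linear model.

One of the core components, used to select the best representative among candidate skills.

Capability Resolver

The algorithm that identifies which capability family a candidate skill belongs to, partitioning skills into families according to a predefined relation or metadata. Technically it uses graph-structure matching or rule-based parsing to keep family assignments accurate.

A key step that ensures skills are grouped into the correct family.

Query-conditioned

A mechanism that adjusts model scoring or selection strategy dynamically according to the user input or task requirement. Technically it is realized through feature fusion and tunable parameters, which improves model adaptability.

Improves the relevance and safety of retrieval.

Risky Skill

A skill within a capability family that is related to the task but may cause execution deviations or security problems. Technically it is defined as a query-specific risk label.

The target that retrieval specifically needs to avoid.

Candidate Pool

The set of all skills that may be considered in a retrieval task. Technically it consists of a predefined skill library and is usually large.

The base data source that the model selects from.

Harmful Sibling Rate (HSR)

A metric for the proportion of final rankings that expose a risky skill. Technically it is the frequency with which the risky skill appears in the top K.

An important metric for evaluating model safety.

Representative Skill

The best skill in a capability family, chosen as its representative after selection. Technically it is the skill with the highest utility score.

Ensures that the returned skill is both representative and safe.

Query-disjoint Split

A partition of the dataset's queries such that the training and test sets share no queries, which prevents information leakage. Technically it is implemented through hashing or indexing.

Keeps model evaluation fair.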
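One common way to obtain a query-disjoint split by hashing is sketched below in Python. It is an assumption about how such a split could be built, not a description of the released benchmark's procedure: the normalization rule, the 20% test fraction, and the example queries are all hypothetical.

```python
import hashlib


def split_for_query(query, test_fraction=0.2):
    """Assign a query to 'train' or 'test' from a stable hash of its text."""
    normalized = query.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def split_pairs(pairs, test_fraction=0.2):
    """pairs: list of dicts with a 'query' key. Same query, same split."""
    splits = {"train": [], "test": []}
    for pair in pairs:
        splits[split_for_query(pair["query"], test_fraction)].append(pair)
    return splits


# Hypothetical pairs; the first query occurs twice with different siblings.
pairs = [
    {"query": "back up the database", "helpful": "pg_backup_v2", "risky": "pg_backup_v1"},
    {"query": "back up the database", "helpful": "pg_backup_v2", "risky": "mysql_backup"},
    {"query": "deploy the service", "helpful": "deploy_v2", "risky": "deploy_v1"},
    {"query": "rotate api keys", "helpful": "rotate_v3", "risky": "rotate_v1"},
]
splits = split_pairs(pairs)
print({name: len(items) for name, items in splits.items()})
```

Because the assignment depends only on the query text, every pair that shares a query lands in the same split, and the split is reproducible without storing an index.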

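Linear Model

A simple machine learning model that predicts from a linear combination of features. Technically it uses logistic regression or linear regression.

Used to train the utility score.

contract-profile cues

Execution-contract information describing a skill's resource bindings, preconditions, API scope, and similar properties. Technically it is obtained through text extraction and feature encoding.

Helps the scoring model judge how well a skill fits the execution context.

Capability family relation g⋆

The skill family relation defined by the study, used to mark which capability a skill belongs to. Technically it is realized through metadata or a manually defined relation graph.

The base relation for capability resolution.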

Abstract

Agent skill libraries are becoming routable software assets: a retrieved skill can contribute instructions, scripts, resource bindings, and execution assumptions to an agent. This makes skill retrieval more than broad relevance matching. A retriever can find the right capability family yet expose the wrong same-capability representative. We study this failure as same-capability execution-risk retrieval. Each query pairs a helpful skill with a query-specific risky sibling that shares the capability family but can lead execution toward a stale resource, missing precondition, or wrong procedure. We introduce SkillResolve-Bench 1.0, an auditable benchmark for this setting with 661 helpful/risky pairs, source-role and admission evidence, cue/leakage checks, query-disjoint splits, and a 7,982-candidate pool that includes 6,660 public SkillRet candidates. The benchmark reports helpful ranking together with harmful sibling rate (HSR@K), the top-K exposure of the risky sibling. We also provide SkillResolve, a reference method that resolves active candidate families, scores query-conditioned utility from confusable library negatives and contract-profile cues, and selects one representative from each family before the final top-K list. Under the released family relation, SkillResolve reaches Recall@3 0.766 and NDCG@3 0.699 while keeping HSR@3=0. It improves over SkillRouter by 0.112 Recall@3 and 0.165 NDCG@3 while reducing HSR@3 from 0.693 to 0. Without representative selection, HSR@3 rises to 0.236 under the same scorer, identifying within-family representative choice as the mechanism that turns capability retrieval into safer procedural exposure.
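As a quick consistency check on the figures in the abstract, the following Python snippet derives the SkillRouter scores implied by the stated gains. The Recall@3 value it recovers matches the 0.654 reported elsewhere on this page. The NDCG@3 value is only implied by the arithmetic and is not stated directly in this summary.

```python
# Figures quoted in the abstract.
skillresolve = {"recall@3": 0.766, "ndcg@3": 0.699, "hsr@3": 0.0}
gain_over_skillrouter = {"recall@3": 0.112, "ndcg@3": 0.165}

# Implied SkillRouter scores: SkillResolve score minus the stated gain.
implied_skillrouter = {
    metric: round(skillresolve[metric] - gain, 3)
    for metric, gain in gain_over_skillrouter.items()
}

# HSR@3 with and without representative selection, same scorer.
hsr_without_selection = 0.236
hsr_increase = round(hsr_without_selection - skillresolve["hsr@3"], 3)

print(implied_skillrouter)
print(hsr_increase)
```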

cs.IR cs.AI

