AgentSearchBench: A Benchmark for AI Agent Search in the Wild
AgentSearchBench evaluates agent search with execution signals, exposing the gap between semantic similarity and actual performance and showing that execution-aware signals improve ranking quality.
Key Findings
Methodology
AgentSearchBench formalizes the agent search problem as retrieval and reranking tasks, using execution signals rather than textual similarity to evaluate relevance. Built from nearly 10,000 real-world agents, the benchmark supports both executable task queries and high-level task descriptions. By generating fine-grained relevance annotations through execution signals, AgentSearchBench provides a scalable evaluation pipeline.
Key Results
- Experiments reveal a consistent gap between semantic similarity and actual agent performance, highlighting the limitations of description-based retrieval and reranking methods.
- Lightweight behavioral signals, including execution-aware probing, can substantially improve ranking quality, emphasizing the importance of incorporating execution signals into agent discovery.
- On task queries, tool-aware retrievers outperform sparse and dense baselines, while on task descriptions, dense retrievers become more competitive, with BGE achieving the strongest overall performance.
Significance
AgentSearchBench provides a large-scale benchmark for agent search in open ecosystems, revealing a significant semantic-performance gap. This study emphasizes the importance of incorporating execution signals into agent discovery pipelines, offering new perspectives for academia and industry, especially when dealing with abstract and multi-step tasks.
Technical Contribution
AgentSearchBench's main technical contribution is formalizing agent search as retrieval and reranking over agents whose capabilities are execution-dependent and therefore uncertain from descriptions alone. The benchmark supports both executable task queries and high-level task descriptions and defines relevance through execution outcomes, a fundamental distinction from description-based approaches.
Novelty
AgentSearchBench is the first to formalize the agent search problem as an execution-dependent retrieval and reranking task, highlighting the gap between semantic similarity and actual performance. Unlike existing benchmarks, this study provides a more realistic agent search scenario through execution signals.
Limitations
- Evaluated methods show a substantial performance drop on AgentSearchBench when handling high-level task descriptions, indicating the difficulty of retrieving agents without explicit executable demands.
- Existing retrieval and reranking methods remain limited in capturing execution-dependent capabilities, especially for abstract and multi-step tasks.
Future Work
Future research directions include developing more robust execution-aware signals to further improve agent search ranking quality. Additionally, exploring how to apply AgentSearchBench in larger-scale and more complex task environments is a promising direction.
AI Executive Summary
The rapid development of AI agent systems is transforming how humans accomplish complex tasks, with people increasingly delegating work to autonomous agents. However, selecting a suitable agent for a specific task has become a key challenge. Traditional tools are typically scoped to specific operations, while agent capabilities are often compositional and execution-dependent, making them difficult to assess from textual descriptions alone.
Existing research and benchmarks often assume well-specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent search scenarios insufficiently studied. To address this, we introduce AgentSearchBench, a large-scale benchmark built from nearly 10,000 real-world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, evaluating relevance using execution-grounded performance signals.
Experiments reveal a consistent gap between semantic similarity and actual agent performance, exposing the limitations of description-based retrieval and reranking methods. We further show that lightweight behavioral signals, including execution-aware probing, can substantially improve ranking quality, highlighting the importance of incorporating execution signals into agent discovery.
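As a rough illustration of how a reranker could use such probing, the sketch below blends a description-based similarity score with a lightweight execution-aware probe score. This is our own minimal sketch, not the paper's implementation; the function names (probe_score, run_agent, judge), the 0-4 judge scale, and the mixing weight alpha are all illustrative assumptions.

```python
from typing import Callable, Dict, List

def probe_score(agent_id: str,
                probe_queries: List[str],
                run_agent: Callable[[str, str], str],
                judge: Callable[[str, str], int]) -> float:
    """Average judge scores (assumed 0-4 scale) over a few cheap probe queries."""
    if not probe_queries:
        return 0.0
    scores = [judge(q, run_agent(agent_id, q)) for q in probe_queries]
    return sum(scores) / (4.0 * len(scores))  # normalize to [0, 1]

def rerank(candidates: Dict[str, float],          # agent_id -> semantic similarity in [0, 1]
           probe_queries: List[str],
           run_agent: Callable[[str, str], str],
           judge: Callable[[str, str], int],
           alpha: float = 0.5) -> List[str]:
    """Rerank retrieved agents by mixing semantic similarity with probe evidence."""
    blended = {
        agent: (1 - alpha) * sim
               + alpha * probe_score(agent, probe_queries, run_agent, judge)
        for agent, sim in candidates.items()
    }
    return sorted(blended, key=blended.get, reverse=True)
```

The design intent is that a handful of cheap executions per candidate can correct rankings where descriptions are misleading, without paying for full task execution on every agent.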
AgentSearchBench provides a large-scale benchmark for agent search in open ecosystems, revealing a significant semantic-performance gap. This study emphasizes the importance of incorporating execution signals into agent discovery pipelines, offering new perspectives for academia and industry, especially when dealing with abstract and multi-step tasks.
Despite this progress, all evaluated methods show a substantial performance drop when handling high-level task descriptions, indicating the difficulty of retrieving agents without explicit executable demands. Future research directions include developing more robust execution-aware signals to further improve agent search ranking quality, as well as applying AgentSearchBench to larger-scale and more complex task environments.
Deep Analysis
Background
With the rapid advancement of artificial intelligence technology, AI agent systems are increasingly being applied across various fields. These agents can not only reason and plan but also interact with external tools and services to complete multi-step objectives. The progress of modern agent systems has led to a rapidly expanding ecosystem of agentic components, ranging from general-purpose assistants to highly specialized task-oriented modules. As humans increasingly rely on agents developed by diverse third-party providers, a fundamental challenge arises: how to select suitable agents for a given task. Traditional tools are typically scoped to specific operations, while agent capabilities are often more compositional and execution-dependent, making them difficult to assess without observing task outcomes. Textual descriptions provide only a partial signal of real competence, as agents with similar descriptions may perform differently in practice, while semantically dissimilar agents can achieve comparable results. This semantic-performance misalignment is further amplified in large and open agent ecosystems, where overlapping functionalities and non-uniform description formats make capability comparison difficult. Consequently, agent search is fundamentally more complex than conventional tool retrieval or model selection.
Core Problem
The core problem of agent search lies in how to retrieve and rank suitable agents from a large candidate repository given a user task. Traditional information retrieval typically determines relevance through static content matching, whereas agent search requires assessing functional capability through task execution. Agent search operates under different levels of task specification, including executable task queries and high-level task descriptions. Executable task queries are concrete and can be directly evaluated by running an agent, while high-level task descriptions are not directly executable. To evaluate agent capability under these settings, each task description is associated with a set of executable task queries, which instantiate the high-level goal under different concrete scenarios. Agent relevance is then determined based on consistent performance across these task instances, rather than relying on textual similarity or single-task outcomes.
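One plausible way to write down this relevance definition is given below; the notation is ours and may differ from the paper's.

```latex
% Notation is ours, not necessarily the paper's.
% A high-level task description d is instantiated into executable queries Q(d) = {q_1, ..., q_m}.
% s(a, q) \in [0, 1] is the execution-grounded score of agent a on query q
% (e.g., a normalized LLM-as-judge rating of the execution outcome).
\[
  \mathrm{rel}(a, d) \;=\; \frac{1}{|Q(d)|} \sum_{q \in Q(d)} s(a, q)
\]
% An agent is relevant to d only if it performs consistently well across the
% concrete task instances, not merely if its description matches d's wording.
```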
Innovation
The core innovations of AgentSearchBench include:
1. Formalizing agent search as an execution-dependent retrieval and reranking problem, highlighting the gap between semantic similarity and actual performance.
2. Constructing a large-scale benchmark with nearly 10,000 real-world agents, supporting both executable task queries and high-level task descriptions, and defining relevance through execution signals.
3. Providing a scalable evaluation pipeline that generates task instances and converts execution outcomes into fine-grained relevance annotations for retrieval and ranking assessment.
4. Demonstrating that lightweight behavioral signals, including execution-aware probing, can significantly improve ranking quality, emphasizing the importance of incorporating execution signals into agent discovery.
Methodology
- Constructing AgentSearchBench: Collecting nearly 10,000 real-world agents from multiple providers, forming a large-scale agent repository.
- Task Query Construction: Synthesizing executable task queries from agent documentation using document-grounded task generation methods.
- Relevance Annotation: Generating fine-grained relevance annotations through execution signals, evaluated using a 5-point LLM-as-judge (a minimal sketch follows this list).
- Task Description Construction: Constructing task descriptions by abstracting high-level objectives from clusters of semantically related queries.
- Retrieval and Reranking Evaluation: Using execution signals for retrieval and reranking evaluation, reporting precision, recall, NDCG, and completeness.
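The following is a minimal sketch of the relevance-annotation step, under the assumptions that the judge returns an integer on a 5-point scale and that graded labels are derived directly from that rating; execute and llm_judge are illustrative names, not the benchmark's API.

```python
from typing import Callable, Dict, List

def annotate_relevance(query: str,
                       agent_ids: List[str],
                       execute: Callable[[str, str], str],    # (agent_id, query) -> execution outcome
                       llm_judge: Callable[[str, str], int],  # (query, outcome) -> rating in 1..5
                       ) -> Dict[str, int]:
    """Turn execution outcomes into graded relevance labels for one query."""
    labels: Dict[str, int] = {}
    for agent in agent_ids:
        outcome = execute(agent, query)             # actually run the agent on the query
        rating = llm_judge(query, outcome)          # 5-point LLM-as-judge rating
        labels[agent] = max(1, min(5, rating)) - 1  # clamp, then map 1..5 to graded 0..4
    return labels
```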
Experiments
The experimental design includes extensive benchmarking using AgentSearchBench to evaluate the performance of different retrieval and reranking methods on both executable task queries and high-level task descriptions. Baselines used include sparse, dense, tool-aware, and decoder embedding models. Experiments evaluate an average of 20 agents per query, with a total of 66,740 executions. Evaluation metrics include precision, recall, NDCG, and completeness. Experiments also explore the impact of lightweight behavioral signals on ranking quality.
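For reference, NDCG over graded relevance labels can be computed as follows. This is a standard linear-gain implementation (reusing the hypothetical 0-4 labels sketched above), not necessarily the paper's exact evaluation code.

```python
import math
from typing import Dict, List

def ndcg_at_k(ranking: List[str], relevance: Dict[str, int], k: int = 10) -> float:
    """NDCG@k with linear gains over graded relevance labels (higher is better)."""
    def dcg(gains: List[int]) -> float:
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    gains = [relevance.get(agent, 0) for agent in ranking[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if any(ideal) else 0.0
```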
Results
Experimental results show that on task queries, tool-aware retrievers outperform sparse and dense baselines, while on task descriptions, dense retrievers become more competitive, with BGE achieving the strongest overall performance. However, performance drops significantly when moving from executable queries to high-level task descriptions, and completeness remains low across all methods, highlighting the difficulty of retrieving agents that can fully satisfy abstract requirements. Results indicate that while retrieval can capture coarse relevance, it struggles to identify agents with comprehensive task-solving capability, especially under high-level task specifications without explicit executable demands.
Applications
Application scenarios for AgentSearchBench include:
1. Conducting agent search in open ecosystems, supporting both executable task queries and high-level task descriptions.
2. Providing a large-scale benchmark for academic research to evaluate the performance of different retrieval and reranking methods.
3. Offering a tool for industry to select suitable agents in complex task environments, especially when dealing with abstract and multi-step tasks.
Limitations & Outlook
Limitations of AgentSearchBench include:
1. A significant performance drop when handling high-level task descriptions, indicating the difficulty of retrieving agents without explicit executable demands.
2. Existing retrieval and reranking methods remain limited in capturing execution-dependent capabilities, especially for abstract and multi-step tasks.
3. Applying AgentSearchBench in larger-scale and more complex task environments remains a challenge, and future research needs to explore how to improve agent search ranking quality in these environments.
Plain Language (accessible to non-experts)
Imagine you are in a gigantic supermarket with shelves full of various products. You need to find a specific item, like a rare spice. The traditional method is to look for the product based on its label and description, but sometimes these descriptions do not accurately reflect the product's actual effect. AgentSearchBench is like a smart shopping assistant that doesn't just rely on labels but evaluates these products by actually using them to see if they meet your needs.
In this supermarket, some products may have similar labels but perform very differently in practice. AgentSearchBench tests these products in real cooking scenarios to evaluate their actual performance. It's like having each spice participate in a cooking competition to see which one performs best in different dishes.
This way, AgentSearchBench can help you find the most suitable spice, not just based on the product's label and description. It considers not only the description but also the actual usage effect, providing you with more reliable choices.
The benefit of this approach is that it can identify products with similar labels but different effects and discover products with different labels but similar effects, offering you a more comprehensive shopping experience.
ELI14 (explained like you're 14)
Imagine you're in a huge game store, looking for the coolest gear for your game character. The store has thousands of pieces of equipment, each with its own description and label. You might think, why not just choose based on the label? But the problem is, sometimes these labels don't accurately reflect the gear's actual effect.
AgentSearchBench is like a super smart game assistant. It doesn't just rely on the gear's label but helps you choose by actually testing how these pieces of equipment perform in the game. For example, it will test each piece of gear in different game scenarios to see which one performs best in battle.
This way, AgentSearchBench can help you find the most suitable gear, not just based on the label. It's like a 'gear judge' in the game, providing you with more reliable choices.
The benefit of this approach is that it can identify gear with similar labels but different effects and discover gear with different labels but similar effects, offering you a more comprehensive gaming experience.
Glossary
AgentSearchBench
AgentSearchBench is a large-scale benchmark for agent search in open ecosystems, supporting both executable task queries and high-level task descriptions.
In the paper, AgentSearchBench is used to evaluate the performance of different retrieval and reranking methods in agent search.
Execution Signals
Execution signals evaluate agent capabilities through their performance in actual tasks, rather than relying solely on textual descriptions.
In AgentSearchBench, execution signals are used to generate fine-grained relevance annotations.
Semantic Similarity
Semantic similarity refers to the degree of similarity between agent descriptions and task descriptions, but it does not always reflect actual agent performance.
Experiments reveal a consistent gap between semantic similarity and actual agent performance.
Lightweight Behavioral Signals
Lightweight behavioral signals enhance description-based ranking by incorporating execution performance, significantly improving ranking quality.
The study shows that lightweight behavioral signals can substantially improve ranking quality.
Executable Task Queries
Executable task queries are concrete instructions that can be directly evaluated by running an agent.
AgentSearchBench supports both executable task queries and high-level task descriptions.
High-Level Task Descriptions
High-level task descriptions are inputs that are not directly executable and require associated executable task queries to evaluate agent capabilities.
Retrieving agents under high-level task descriptions is more challenging.
Retrieval and Reranking
Retrieval and reranking involve retrieving and ranking suitable agents from a large candidate repository for a specific task.
AgentSearchBench formalizes agent search as retrieval and reranking problems.
Tool-Aware Retrievers
Tool-aware retrievers are retrieval methods that incorporate tool usage information, typically outperforming other baselines on executable task queries.
On task queries, tool-aware retrievers outperform sparse and dense baselines.
BGE
BGE is a dense retriever that performs more competitively on task descriptions, achieving the strongest overall performance.
On task descriptions, BGE achieves the strongest overall performance.
NDCG
NDCG (normalized discounted cumulative gain) is a ranking-quality metric that rewards placing highly relevant items near the top of the retrieved list.
Experiments use NDCG as one of the evaluation metrics.
Open Questions (unanswered questions from this research)
1. How to improve agent search performance under high-level task descriptions without explicit executable demands. Existing methods are limited in capturing execution-dependent capabilities, especially for abstract and multi-step tasks.
2. How to apply AgentSearchBench in larger-scale and more complex task environments, and how to improve agent search ranking quality in those settings.
3. How to develop more robust execution-aware signals to further improve agent search ranking quality. Existing lightweight behavioral signals, while effective, may still fall short on complex tasks.
4. How to handle overlapping functionalities and non-uniform description formats in open ecosystems, which can degrade existing methods.
5. How to better combine textual descriptions with execution signals in agent search; existing methods may not fully integrate these two sources of evidence into a comprehensive capability assessment.
Applications
Immediate Applications
Agent Search in Open Ecosystems
AgentSearchBench can be used for agent search in open ecosystems, supporting both executable task queries and high-level task descriptions, providing support for academic research and industrial applications.
Agent Capability Assessment
Assess agent capabilities through execution signals, providing developers and users with more accurate capability assessments to help select suitable agents.
Ranking Quality Optimization
Optimize agent search ranking quality through lightweight behavioral signals, improving the performance of retrieval and reranking methods, especially when handling complex tasks.
Long-term Vision
Application in Complex Task Environments
Explore how to apply AgentSearchBench in larger-scale and more complex task environments to improve agent search ranking quality.
Development of Execution-Aware Signals
Develop more robust execution-aware signals to further improve agent search ranking quality, especially when handling abstract and multi-step tasks.
Abstract
The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agent capabilities are often compositional and execution-dependent, making them difficult to assess from textual descriptions alone. However, existing research and benchmarks typically assume well-specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent search scenarios insufficiently studied. We introduce AgentSearchBench, a large-scale benchmark for agent search in the wild, built from nearly 10,000 real-world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, and evaluates relevance using execution-grounded performance signals. Experiments reveal a consistent gap between semantic similarity and actual agent performance, exposing the limitations of description-based retrieval and reranking methods. We further show that lightweight behavioral signals, including execution-aware probing, can substantially improve ranking quality, highlighting the importance of incorporating execution signals into agent discovery. Our code is available at https://github.com/Bingo-W/AgentSearchBench.