AgentSearchBench: A Benchmark for AI Agent Search in the Wild
AgentSearchBench evaluates agent search with execution signals, exposing the gap between semantic similarity and actual performance and showing that execution-aware signals improve ranking quality.
Key Findings
Methodology
AgentSearchBench formalizes the agent search problem as retrieval and reranking tasks, using execution signals rather than textual similarity to evaluate relevance. Built from nearly 10,000 real-world agents, the benchmark supports both executable task queries and high-level task descriptions. By generating fine-grained relevance annotations through execution signals, AgentSearchBench provides a scalable evaluation pipeline.
Key Results
- Experiments reveal a consistent gap between semantic similarity and actual agent performance, highlighting the limitations of description-based retrieval and reranking methods.
- Lightweight behavioral signals, including execution-aware probing, can substantially improve ranking quality, emphasizing the importance of incorporating execution signals into agent discovery.
- On task queries, tool-aware retrievers outperform sparse and dense baselines, while on task descriptions, dense retrievers become more competitive, with BGE achieving the strongest overall performance.
Significance
AgentSearchBench provides a large-scale benchmark for agent search in open ecosystems, revealing a significant semantic-performance gap. This study emphasizes the importance of incorporating execution signals into agent discovery pipelines, offering new perspectives for academia and industry, especially when dealing with abstract and multi-step tasks.
Technical Contribution
AgentSearchBench's main technical contribution is formalizing agent search as retrieval and reranking over agents whose capabilities are execution-dependent and therefore uncertain from descriptions alone. The benchmark supports both executable task queries and high-level task descriptions and defines relevance through execution outcomes, a fundamental distinction from description-based approaches.
Novelty
AgentSearchBench is the first to formalize the agent search problem as an execution-dependent retrieval and reranking task, highlighting the gap between semantic similarity and actual performance. Unlike existing benchmarks, this study provides a more realistic agent search scenario through execution signals.
Limitations
- Evaluated methods show a substantial performance drop on AgentSearchBench when handling high-level task descriptions, indicating the difficulty of retrieving agents without explicit executable demands.
- Existing retrieval and reranking methods remain limited in capturing execution-dependent capabilities, especially for abstract and multi-step tasks.
Future Work
Future research directions include developing more robust execution-aware signals to further improve agent search ranking quality. Additionally, exploring how to apply AgentSearchBench in larger-scale and more complex task environments is a promising direction.
AI Executive Summary
The rapid development of AI agent systems is transforming how humans accomplish complex tasks, with people increasingly delegating work to autonomous agents. However, selecting a suitable agent for a specific task has become a key challenge. Traditional tools are typically scoped to specific operations, while agent capabilities are often compositional and execution-dependent, making them difficult to assess from textual descriptions alone.
Existing research and benchmarks often assume well-specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent search scenarios insufficiently studied. To address this, we introduce AgentSearchBench, a large-scale benchmark built from nearly 10,000 real-world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, evaluating relevance using execution-grounded performance signals.
Experiments reveal a consistent gap between semantic similarity and actual agent performance, exposing the limitations of description-based retrieval and reranking methods. We further show that lightweight behavioral signals, including execution-aware probing, can substantially improve ranking quality, highlighting the importance of incorporating execution signals into agent discovery.
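As a rough illustration of how a reranker could use such probing, the sketch below blends a description-based similarity score with a lightweight execution-aware probe score. This is our own minimal sketch, not the paper's implementation; the function names (probe_score, run_agent, judge), the 0-4 judge scale, and the mixing weight alpha are all illustrative assumptions.

```python
from typing import Callable, Dict, List

def probe_score(agent_id: str,
                probe_queries: List[str],
                run_agent: Callable[[str, str], str],
                judge: Callable[[str, str], int]) -> float:
    """Average judge scores (assumed 0-4 scale) over a few cheap probe queries."""
    if not probe_queries:
        return 0.0
    scores = [judge(q, run_agent(agent_id, q)) for q in probe_queries]
    return sum(scores) / (4.0 * len(scores))  # normalize to [0, 1]

def rerank(candidates: Dict[str, float],          # agent_id -> semantic similarity in [0, 1]
           probe_queries: List[str],
           run_agent: Callable[[str, str], str],
           judge: Callable[[str, str], int],
           alpha: float = 0.5) -> List[str]:
    """Rerank retrieved agents by mixing semantic similarity with probe evidence."""
    blended = {
        agent: (1 - alpha) * sim
               + alpha * probe_score(agent, probe_queries, run_agent, judge)
        for agent, sim in candidates.items()
    }
    return sorted(blended, key=blended.get, reverse=True)
```

The design intent is that a handful of cheap executions per candidate can correct rankings where descriptions are misleading, without paying for full task execution on every agent.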
AgentSearchBench provides a large-scale benchmark for agent search in open ecosystems, revealing a significant semantic-performance gap. This study emphasizes the importance of incorporating execution signals into agent discovery pipelines, offering new perspectives for academia and industry, especially when dealing with abstract and multi-step tasks.
Despite this progress, all evaluated methods show a substantial performance drop when handling high-level task descriptions, indicating the difficulty of retrieving agents without explicit executable demands. Future research directions include developing more robust execution-aware signals to further improve agent search ranking quality, as well as applying AgentSearchBench to larger-scale and more complex task environments.
Deep Analysis
Background
With the rapid advancement of artificial intelligence technology, AI agent systems are increasingly being applied across various fields. These agents can not only reason and plan but also interact with external tools and services to complete multi-step objectives. The progress of modern agent systems has led to a rapidly expanding ecosystem of agentic components, ranging from general-purpose assistants to highly specialized task-oriented modules. As humans increasingly rely on agents developed by diverse third-party providers, a fundamental challenge arises: how to select suitable agents for a given task. Traditional tools are typically scoped to specific operations, while agent capabilities are often more compositional and execution-dependent, making them difficult to assess without observing task outcomes. Textual descriptions provide only a partial signal of real competence, as agents with similar descriptions may perform differently in practice, while semantically dissimilar agents can achieve comparable results. This semantic-performance misalignment is further amplified in large and open agent ecosystems, where overlapping functionalities and non-uniform description formats make capability comparison difficult. Consequently, agent search is fundamentally more complex than conventional tool retrieval or model selection.
Core Problem
The core problem of agent search lies in how to retrieve and rank suitable agents from a large candidate repository given a user task. Traditional information retrieval typically determines relevance through static content matching, whereas agent search requires assessing functional capability through task execution. Agent search operates under different levels of task specification, including executable task queries and high-level task descriptions. Executable task queries are concrete and can be directly evaluated by running an agent, while high-level task descriptions are not directly executable. To evaluate agent capability under these settings, each task description is associated with a set of executable task queries, which instantiate the high-level goal under different concrete scenarios. Agent relevance is then determined based on consistent performance across these task instances, rather than relying on textual similarity or single-task outcomes.
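One plausible way to write down this relevance definition is given below; the notation is ours and may differ from the paper's.

```latex
% Notation is ours, not necessarily the paper's.
% A high-level task description d is instantiated into executable queries Q(d) = {q_1, ..., q_m}.
% s(a, q) \in [0, 1] is the execution-grounded score of agent a on query q
% (e.g., a normalized LLM-as-judge rating of the execution outcome).
\[
  \mathrm{rel}(a, d) \;=\; \frac{1}{|Q(d)|} \sum_{q \in Q(d)} s(a, q)
\]
% An agent is relevant to d only if it performs consistently well across the
% concrete task instances, not merely if its description matches d's wording.
```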
Innovation
The core innovations of AgentSearchBench include:
1. Formalizing agent search as an execution-dependent retrieval and reranking problem, highlighting the gap between semantic similarity and actual performance.
2. Constructing a large-scale benchmark with nearly 10,000 real-world agents, supporting both executable task queries and high-level task descriptions, and defining relevance through execution signals.
3. Providing a scalable evaluation pipeline that generates task instances and converts execution outcomes into fine-grained relevance annotations for retrieval and ranking assessment.
4. Demonstrating that lightweight behavioral signals, including execution-aware probing, can significantly improve ranking quality, emphasizing the importance of incorporating execution signals into agent discovery.
Methodology
- Constructing AgentSearchBench: Collecting nearly 10,000 real-world agents from multiple providers, forming a large-scale agent repository.
- Task Query Construction: Synthesizing executable task queries from agent documentation using document-grounded task generation methods.
- Relevance Annotation: Generating fine-grained relevance annotations through execution signals, evaluated using a 5-point LLM-as-judge (a minimal sketch follows this list).
- Task Description Construction: Constructing task descriptions by abstracting high-level objectives from clusters of semantically related queries.
- Retrieval and Reranking Evaluation: Using execution signals for retrieval and reranking evaluation, reporting precision, recall, NDCG, and completeness.
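The following is a minimal sketch of the relevance-annotation step, under the assumptions that the judge returns an integer on a 5-point scale and that graded labels are derived directly from that rating; execute and llm_judge are illustrative names, not the benchmark's API.

```python
from typing import Callable, Dict, List

def annotate_relevance(query: str,
                       agent_ids: List[str],
                       execute: Callable[[str, str], str],    # (agent_id, query) -> execution outcome
                       llm_judge: Callable[[str, str], int],  # (query, outcome) -> rating in 1..5
                       ) -> Dict[str, int]:
    """Turn execution outcomes into graded relevance labels for one query."""
    labels: Dict[str, int] = {}
    for agent in agent_ids:
        outcome = execute(agent, query)             # actually run the agent on the query
        rating = llm_judge(query, outcome)          # 5-point LLM-as-judge rating
        labels[agent] = max(1, min(5, rating)) - 1  # clamp, then map 1..5 to graded 0..4
    return labels
```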
Experiments
The experimental design includes extensive benchmarking using AgentSearchBench to evaluate the performance of different retrieval and reranking methods on both executable task queries and high-level task descriptions. Baselines used include sparse, dense, tool-aware, and decoder embedding models. Experiments evaluate an average of 20 agents per query, with a total of 66,740 executions. Evaluation metrics include precision, recall, NDCG, and completeness. Experiments also explore the impact of lightweight behavioral signals on ranking quality.
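For reference, NDCG over graded relevance labels can be computed as follows. This is a standard linear-gain implementation (reusing the hypothetical 0-4 labels sketched above), not necessarily the paper's exact evaluation code.

```python
import math
from typing import Dict, List

def ndcg_at_k(ranking: List[str], relevance: Dict[str, int], k: int = 10) -> float:
    """NDCG@k with linear gains over graded relevance labels (higher is better)."""
    def dcg(gains: List[int]) -> float:
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    gains = [relevance.get(agent, 0) for agent in ranking[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if any(ideal) else 0.0
```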
Results
Experimental results show that on task queries, tool-aware retrievers outperform sparse and dense baselines, while on task descriptions, dense retrievers become more competitive, with BGE achieving the strongest overall performance. However, performance drops significantly when moving from executable queries to high-level task descriptions, and completeness remains low across all methods, highlighting the difficulty of retrieving agents that can fully satisfy abstract requirements. Results indicate that while retrieval can capture coarse relevance, it struggles to identify agents with comprehensive task-solving capability, especially under high-level task specifications without explicit executable demands.
Applications
Application scenarios for AgentSearchBench include:
1. Conducting agent search in open ecosystems, supporting both executable task queries and high-level task descriptions.
2. Providing a large-scale benchmark for academic research to evaluate the performance of different retrieval and reranking methods.
3. Offering a tool for industry to select suitable agents in complex task environments, especially when dealing with abstract and multi-step tasks.
Limitations & Outlook
Limitations of AgentSearchBench include:
1. A significant performance drop when handling high-level task descriptions, indicating the difficulty of retrieving agents without explicit executable demands.
2. Existing retrieval and reranking methods remain limited in capturing execution-dependent capabilities, especially for abstract and multi-step tasks.
3. Applying AgentSearchBench in larger-scale and more complex task environments remains a challenge, and future research needs to explore how to improve agent search ranking quality in these environments.
Plain Language (accessible to non-experts)
Imagine you are in a gigantic supermarket with shelves full of various products. You need to find a specific item, like a rare spice. The traditional method is to look for the product based on its label and description, but sometimes these descriptions do not accurately reflect the product's actual effect. AgentSearchBench is like a smart shopping assistant that doesn't just rely on labels but evaluates these products by actually using them to see if they meet your needs.
In this supermarket, some products may have similar labels but perform very differently in practice. AgentSearchBench tests these products in real cooking scenarios to evaluate their actual performance. It's like having each spice participate in a cooking competition to see which one performs best in different dishes.
This way, AgentSearchBench can help you find the most suitable spice, not just based on the product's label and description. It considers not only the description but also the actual usage effect, providing you with more reliable choices.
The benefit of this approach is that it can identify products with similar labels but different effects and discover products with different labels but similar effects, offering you a more comprehensive shopping experience.
ELI14 (explained like you're 14)
Imagine you're in a huge game store, looking for the coolest gear for your game character. The store has thousands of pieces of equipment, each with its own description and label. You might think, why not just choose based on the label? But the problem is, sometimes these labels don't accurately reflect the gear's actual effect.
AgentSearchBench is like a super smart game assistant. It doesn't just rely on the gear's label but helps you choose by actually testing how these pieces of equipment perform in the game. For example, it will test each piece of gear in different game scenarios to see which one performs best in battle.
This way, AgentSearchBench can help you find the most suitable gear, not just based on the label. It's like a 'gear judge' in the game, providing you with more reliable choices.
The benefit of this approach is that it can identify gear with similar labels but different effects and discover gear with different labels but similar effects, offering you a more comprehensive gaming experience.
Glossary
AgentSearchBench
AgentSearchBench is a large-scale benchmark for agent search in open ecosystems, supporting both executable task queries and high-level task descriptions.
In the paper, AgentSearchBench is used to evaluate the performance of different retrieval and reranking methods in agent search.
Execution Signals
Execution signals evaluate agent capabilities through their performance in actual tasks, rather than relying solely on textual descriptions.
In AgentSearchBench, execution signals are used to generate fine-grained relevance annotations.
Semantic Similarity
Semantic similarity refers to the degree of similarity between agent descriptions and task descriptions, but it does not always reflect actual agent performance.
Experiments reveal a consistent gap between semantic similarity and actual agent performance.
Lightweight Behavioral Signals
Lightweight behavioral signals enhance description-based ranking by incorporating execution performance, significantly improving ranking quality.
The study shows that lightweight behavioral signals can substantially improve ranking quality.
Executable Task Queries
Executable task queries are concrete instructions that can be directly evaluated by running an agent.
AgentSearchBench supports both executable task queries and high-level task descriptions.
High-Level Task Descriptions
High-level task descriptions are inputs that are not directly executable and require associated executable task queries to evaluate agent capabilities.
Retrieving agents under high-level task descriptions is more challenging.
Retrieval and Reranking
Retrieval and reranking involve retrieving and ranking suitable agents from a large candidate repository for a specific task.
AgentSearchBench formalizes agent search as retrieval and reranking problems.
Tool-Aware Retrievers
Tool-aware retrievers are retrieval methods that incorporate tool usage information, typically outperforming other baselines on executable task queries.
On task queries, tool-aware retrievers outperform sparse and dense baselines.
BGE
BGE is a dense retriever that performs more competitively on task descriptions, achieving the strongest overall performance.
On task descriptions, BGE achieves the strongest overall performance.
NDCG
NDCG (normalized discounted cumulative gain) is a ranking-quality metric that rewards placing highly relevant items near the top of the retrieved list.
Experiments use NDCG as one of the evaluation metrics.
Open Questions (unanswered questions from this research)
1. How to improve agent search performance under high-level task descriptions without explicit executable demands. Existing methods are limited in capturing execution-dependent capabilities, especially for abstract and multi-step tasks.
2. How to apply AgentSearchBench in larger-scale and more complex task environments, and how to improve agent search ranking quality in those settings.
3. How to develop more robust execution-aware signals to further improve agent search ranking quality. Existing lightweight behavioral signals, while effective, may still fall short on complex tasks.
4. How to handle overlapping functionalities and non-uniform description formats in open ecosystems, which can degrade existing methods.
5. How to better combine textual descriptions with execution signals in agent search; existing methods may not fully integrate these two sources of evidence into a comprehensive capability assessment.
Applications
Immediate Applications
Agent Search in Open Ecosystems
AgentSearchBench can be used for agent search in open ecosystems, supporting both executable task queries and high-level task descriptions, providing support for academic research and industrial applications.
Agent Capability Assessment
Assess agent capabilities through execution signals, providing developers and users with more accurate capability assessments to help select suitable agents.
Ranking Quality Optimization
Optimize agent search ranking quality through lightweight behavioral signals, improving the performance of retrieval and reranking methods, especially when handling complex tasks.
Long-term Vision
Application in Complex Task Environments
Explore how to apply AgentSearchBench in larger-scale and more complex task environments to improve agent search ranking quality.
Development of Execution-Aware Signals
Develop more robust execution-aware signals to further improve agent search ranking quality, especially when handling abstract and multi-step tasks.
Abstract
The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agent capabilities are often compositional and execution-dependent, making them difficult to assess from textual descriptions alone. However, existing research and benchmarks typically assume well-specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent search scenarios insufficiently studied. We introduce AgentSearchBench, a large-scale benchmark for agent search in the wild, built from nearly 10,000 real-world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, and evaluates relevance using execution-grounded performance signals. Experiments reveal a consistent gap between semantic similarity and actual agent performance, exposing the limitations of description-based retrieval and reranking methods. We further show that lightweight behavioral signals, including execution-aware probing, can substantially improve ranking quality, highlighting the importance of incorporating execution signals into agent discovery. Our code is available at https://github.com/Bingo-W/AgentSearchBench.