ScholarQuest: A Taxonomy-Guided Benchmark for Agentic Academic Paper Search in Open Literature Environments

TL;DR

ScholarQuest introduces a taxonomy-guided benchmark with over 1000 CS topics, four research intents, and automated answer construction, advancing systematic evaluation of academic search agents.

cs.IR 2026-06-18
Tingyue Pan, Mingyue Cheng, Daoyu Wang, Yitong Zhou, Jie Ouyang, Qi Liu, Enhong Chen
academic retrieval, benchmark, large-scale dataset, taxonomy-guided, multi-turn search

Key Findings

Methodology

This work presents ScholarQuest, a benchmark for evaluating agentic academic paper search. Queries are generated from a hierarchical taxonomy covering over 1,000 computer science topics and from four core research intents (method-oriented, setting-anchored, comparison-based, and scope-controlled), which keeps query generation both diverse and controlled. An automated pipeline integrates multi-source retrieval (BM25, dense retrieval with BGE-M3 embeddings, and hybrid fusion with RRF), citation graph expansion, relevance filtering, and quality verification. It is paired with ScholarBase, a million-scale, standardized literature environment that serves as the shared retrieval backend. The evaluation framework treats search as multi-round decision-making, combining retrieval algorithms with relevance judgment modules to simulate agentic search behavior. Extensive experiments show that agentic methods outperform single-shot baselines, but the best method reaches a Recall@100 of only 0.314, so significant room for improvement remains, especially on complex queries.
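The hybrid step mentioned above can be illustrated with reciprocal rank fusion (RRF). The following is a minimal sketch, not the benchmark's implementation: the function name, the toy paper IDs, and the smoothing constant k=60 (a commonly used default) are assumptions, since the summary does not state how RRF is configured.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of paper IDs with reciprocal rank fusion.

    Each paper receives 1 / (k + rank) from every list it appears in.
    k=60 is an assumed default, not a value taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, paper_id in enumerate(ranking, start=1):
            scores[paper_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Toy rankings standing in for a sparse (BM25) and a dense (embedding) retriever.
bm25_ranking = ["p3", "p1", "p7", "p2"]
dense_ranking = ["p1", "p5", "p3", "p9"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['p1', 'p3', 'p5', 'p7', 'p2', 'p9']
```

Papers ranked highly by both retrievers (here p1 and p3) rise to the top because RRF uses only rank positions, so the sparse and dense scores never need to be put on a common scale.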

Key Results

  • The benchmark results show that the top agentic method, PaperScout, achieves Recall@100 of 0.314 and Recall@All of 0.355, substantially higher than traditional retrieval systems like Google Scholar (Recall@100 of 0.010). Despite this, the overall performance indicates that current models still struggle with complex, constrained queries, highlighting the need for further advancements in multi-turn reasoning and knowledge integration.
  • Analysis across different query types reveals that method-oriented and setting-anchored queries perform relatively well (Recall@100 > 0.3), whereas scope-controlled queries remain challenging (Recall@100 ≈ 0.19). This underscores the difficulty in maintaining precise scope boundaries during iterative exploration.
  • Efficiency analysis indicates that adaptive tool-use policies, as exemplified by PaperScout, yield higher recall efficiency (up to 0.120 per 100 candidates). The numbers of interaction rounds, tool calls, and candidate observations all directly influence retrieval success, underscoring the importance of intelligent exploration strategies.
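For readers unfamiliar with the metrics in these results, the sketch below shows one way to compute them. Recall@k follows the standard definition, and passing no cutoff gives Recall@All over everything the system returned. The efficiency function assumes that "recall per 100 candidates" means overall recall divided by the number of observed candidates in units of 100; the summary does not give the exact formula, so this reading and all names below are assumptions.

```python
def recall_at_k(ranked_ids, relevant_ids, k=None):
    """Fraction of relevant papers found in the top-k results (all results if k is None)."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    considered = ranked_ids if k is None else ranked_ids[:k]
    return len(relevant & set(considered)) / len(relevant)


def recall_per_100_candidates(ranked_ids, relevant_ids, num_observed):
    """Assumed efficiency measure: overall recall normalized per 100 observed candidates."""
    if num_observed <= 0:
        return 0.0
    return recall_at_k(ranked_ids, relevant_ids) * 100.0 / num_observed


# Toy example: 4 relevant papers, 2 of which appear in the returned list.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "d", "y", "z"}

print(recall_at_k(ranked, relevant, k=2))  # 0.25
print(recall_at_k(ranked, relevant))       # 0.5
print(recall_per_100_candidates(ranked, relevant, num_observed=250))
```

Under this reading, an agent that reaches the same recall while inspecting fewer candidates scores higher on efficiency.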

Significance

This work addresses a critical gap in the evaluation of intelligent academic search systems by providing a standardized, multi-dimensional benchmark capable of assessing retrieval quality, efficiency, robustness, and exploration behavior. It facilitates fair comparison across diverse models and algorithms, fostering innovation in multi-turn, intent-aware literature exploration. The benchmark's comprehensive design supports future research in integrating knowledge graphs, reasoning modules, and cross-disciplinary retrieval, ultimately accelerating scientific discovery and reducing information overload. By enabling reproducible experiments and detailed diagnostics, ScholarQuest paves the way for developing more effective, autonomous academic search agents that can adapt to complex research needs.

Technical Contribution

The primary technical innovation lies in the systematic integration of a hierarchical taxonomy-guided query generation process with an automated, multi-source retrieval pipeline that includes citation graph expansion and relevance filtering. The framework builds on established components (BM25 for sparse retrieval, BGE-M3 embeddings for dense retrieval, and RRF for hybrid scoring) and combines them within a multi-round decision framework that mimics agentic exploration. The automated answer construction pipeline reduces manual annotation costs while maintaining high coverage and precision. Additionally, the benchmark introduces a multi-dimensional evaluation suite covering recall, efficiency, robustness, and failure analysis, providing a comprehensive assessment platform for academic search systems.
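To make the idea of a multi-round decision framework concrete, here is a toy loop that alternates between searching, judging relevance, and expanding citations. It is only a sketch under stated assumptions: the corpus, the tool functions (search_tool, expand_tool), and the word-overlap judge are hypothetical stand-ins for a real retrieval backend and an LLM-based relevance module, and none of them come from the paper.

```python
# Toy corpus: paper id -> (title, ids of cited papers). Entirely made up.
CORPUS = {
    "p1": ("graph neural networks for recommendation", ["p2", "p3"]),
    "p2": ("neural collaborative filtering for recommendation", []),
    "p3": ("graph convolution for semi-supervised learning", ["p4"]),
    "p4": ("spectral graph theory", []),
    "p5": ("image classification with residual networks", []),
}


def _overlap(query, paper_id):
    """Number of words shared by the query and the paper title."""
    return len(set(query.lower().split()) & set(CORPUS[paper_id][0].split()))


def search_tool(query, top_k=2):
    """Hypothetical search tool: rank papers by word overlap with the query."""
    scored = [(_overlap(query, pid), pid) for pid in CORPUS]
    scored = [(s, pid) for s, pid in scored if s > 0]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [pid for _, pid in scored[:top_k]]


def expand_tool(paper_id):
    """Hypothetical citation-expansion tool: return the papers this one cites."""
    return list(CORPUS[paper_id][1])


def judge_relevant(query, paper_id):
    """Stand-in for an LLM relevance judgment: at least two shared words."""
    return _overlap(query, paper_id) >= 2


def run_agent(query, max_rounds=3):
    """Search, judge, then expand citations of accepted papers, for a few rounds."""
    seen, accepted = set(), []
    frontier = search_tool(query)
    for _ in range(max_rounds):
        # Deduplicate the frontier and skip papers already observed.
        new = [p for p in dict.fromkeys(frontier) if p not in seen]
        if not new:
            break
        seen.update(new)
        kept = [p for p in new if judge_relevant(query, p)]
        accepted.extend(kept)
        frontier = [cited for p in kept for cited in expand_tool(p)]
    return accepted, len(seen)


accepted, num_observed = run_agent("graph neural recommendation")
print(accepted, num_observed)  # ['p1', 'p2'] 3
```

The loop stops early once expansion yields no unseen papers, and the count of observed candidates is returned alongside the accepted set because efficiency metrics depend on it.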

Novelty

This study is the first to systematically incorporate a taxonomy-guided, multi-intent query generation framework into a large-scale, automated benchmark for academic paper search. Unlike prior datasets (e.g., AutoScholar, RealScholar), ScholarQuest emphasizes controlled query intents, broad topic coverage, and reproducible environment setup. Its automated pipeline for answer set construction and multi-source retrieval, combined with detailed diagnostic metrics, offers a novel, holistic approach to evaluating and advancing agentic literature exploration. This comprehensive design enables nuanced analysis of search behaviors and failure modes, setting a new standard for academic search benchmarking.

Limitations

  • Despite its scale, the automated answer construction relies heavily on citation relations and relevance filtering, which may introduce biases or miss relevant papers lacking explicit citation links, especially in emerging or interdisciplinary fields.
  • Current retrieval algorithms primarily focus on lexical and embedding similarity, with limited incorporation of graph-structured knowledge or semantic reasoning, constraining performance in highly complex or scope-sensitive queries.
  • Multi-round exploration models still face challenges in maintaining scope boundaries and avoiding off-target retrieval, indicating the need for more sophisticated constraint-aware reasoning mechanisms.

Future Work

Future efforts will focus on integrating knowledge graphs and reasoning modules to better capture complex relationships among papers. Enhancing the scope-awareness of multi-round models through constraint-aware algorithms and reinforcement learning strategies is also planned. Expanding the benchmark to include other disciplines beyond computer science will facilitate cross-domain research. Additionally, deploying these systems in real-world academic environments and conducting user studies will validate their practical utility and guide further improvements in autonomous literature exploration.

AI Executive Summary

The process of scientific discovery heavily relies on effective literature search. Traditional keyword-based retrieval methods, while efficient at handling large-scale datasets, often fall short when dealing with complex, nuanced research queries that require multi-faceted understanding and reasoning. As the volume of scholarly publications continues to grow exponentially, the need for intelligent, multi-turn search agents becomes increasingly urgent.

Recent advances in large language models (LLMs) and autonomous search agents have opened new horizons for literature exploration. These systems aim to simulate human-like inquiry, iteratively refining search results based on accumulated evidence and navigating scholarly networks through citation relations and contextual cues. However, evaluating such agentic systems presents unique challenges. Existing benchmarks are limited in scope, often relying on manually curated queries or small datasets, which hinders comprehensive assessment of the agents' robustness, efficiency, and generalization.

Addressing this gap, ScholarQuest introduces a large-scale, taxonomy-guided benchmark designed specifically for agentic academic paper search. It encompasses over 1000 computer science topics derived from hierarchical classifications, and incorporates four research intent categories—method-oriented, setting-anchored, comparison-based, and scope-controlled queries—reflecting real-world search scenarios. The benchmark employs an automated pipeline that integrates multi-source retrieval (including BM25, dense embedding, and hybrid methods), citation graph expansion, relevance filtering, and quality verification, resulting in a high-quality, scalable answer set stored in ScholarBase.
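One way to picture taxonomy-guided query generation is as enumerating (topic, intent) slots that a downstream generator then turns into natural-language queries. The sketch below is illustrative only: the toy taxonomy is invented, and pairing every leaf topic with every intent is an assumption, since the summary does not say how topics and intents are actually combined. Only the four intent names come from the source.

```python
from itertools import product

# The four research intents named in the benchmark description.
INTENTS = ["method-oriented", "setting-anchored", "comparison-based", "scope-controlled"]

# Invented toy taxonomy; the real one covers over 1,000 computer science topics.
TOY_TAXONOMY = {
    "Information Retrieval": {
        "Dense Retrieval": {},
        "Recommender Systems": {"Sequential Recommendation": {}},
    },
    "Machine Learning": {"Reinforcement Learning": {}},
}


def leaf_paths(tree, prefix=()):
    """Yield the root-to-leaf path of every leaf topic in a nested-dict taxonomy."""
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            yield from leaf_paths(children, path)
        else:
            yield path


def query_slots(tree, intents):
    """Pair each leaf topic with each intent (assumed full cross product)."""
    return [
        {"topic_path": path, "intent": intent}
        for path, intent in product(leaf_paths(tree), intents)
    ]


slots = query_slots(TOY_TAXONOMY, INTENTS)
print(len(slots))  # 12
print(slots[0])
```

Keeping the full topic path in each slot preserves the hierarchy, which is what would let a generator anchor a query in its parent area rather than in the leaf name alone.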

Experimental evaluations demonstrate that agentic methods, such as PaperScout, outperform traditional single-shot retrieval baselines like Google Scholar and Semantic Scholar. For instance, PaperScout achieves a Recall@100 of 0.314 and Recall@All of 0.355, significantly higher than baseline systems. Nonetheless, the overall performance indicates substantial room for improvement, especially on complex scope-controlled queries, where Recall@100 remains at roughly 0.19. Analysis reveals that adaptive tool-use strategies, multi-round decision-making, and citation expansion are crucial for enhancing retrieval effectiveness.
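An intent-level breakdown like the one described here can be produced by grouping per-query recall by intent and averaging. The snippet is a minimal sketch: the record layout and the recall values are made up for illustration and are not results from the paper.

```python
from collections import defaultdict


def recall_by_intent(records):
    """Average per-query recall within each research intent."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["intent"]].append(record["recall"])
    return {intent: sum(values) / len(values) for intent, values in grouped.items()}


# Illustrative per-query results (invented numbers, not benchmark data).
records = [
    {"intent": "method-oriented", "recall": 0.5},
    {"intent": "method-oriented", "recall": 0.25},
    {"intent": "scope-controlled", "recall": 0.0},
    {"intent": "scope-controlled", "recall": 0.25},
]

per_intent = recall_by_intent(records)
print(per_intent)  # {'method-oriented': 0.375, 'scope-controlled': 0.125}
```

Reporting the per-intent averages next to the overall score is what exposes gaps such as the weaker scope-controlled results, which a single aggregate number would hide.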

The significance of ScholarQuest lies in its comprehensive, multi-dimensional evaluation framework. It not only measures retrieval accuracy but also assesses search efficiency, robustness across different query intents, and failure modes. This holistic approach provides valuable insights into the strengths and limitations of current systems, guiding future research toward more intelligent, constraint-aware, and knowledge-integrated search agents. The benchmark’s open-source nature encourages community participation, fostering innovation and standardization in academic information retrieval.

Looking ahead, integrating knowledge graphs, reasoning modules, and cross-disciplinary datasets will further advance the capabilities of autonomous search agents. The ultimate goal is to develop systems that can seamlessly assist researchers in navigating the ever-expanding scientific literature, accelerating discovery, and reducing information overload. Despite current limitations, ScholarQuest marks a significant step toward realizing intelligent, scalable, and reliable academic search solutions for the future of science.

Deep Dive

Abstract

Academic paper search is a core step in scientific research, and LLM-based search agents are emerging as a promising paradigm for iterative, intent-driven literature exploration. However, existing benchmarks are insufficient for systematically evaluating agentic academic search under realistic open literature environments. We propose ScholarQuest, a large-scale, taxonomy-guided benchmark for agentic academic paper search. ScholarQuest is constructed from over 1,000 computer science topics and four representative research intents, including method-oriented, setting-anchored, comparison-based, and scope-controlled queries. It further provides scalable answer construction and a shared retrieval backend ScholarBase for reproducible evaluation. Benchmarking results show that agentic methods outperform single-shot retrieval baselines, yet the best-performing agent only achieves 0.314 Recall@100 and 0.355 Recall@All, indicating substantial room for improvement. In addition, analyses of search efficiency, intent-level robustness, and failure cases further highlight the benchmark's ability to provide multi-dimensional evaluation signals for academic paper search agents.

cs.IR cs.AI