Natural Language Query to Configuration for Retrieval Agents
BRANE uses LLM-extracted query features to select a retrieval configuration per query, matching the best static configuration's accuracy at up to 89% lower cost across MuSiQue, BrowseComp-Plus, and FinanceBench.
Key Findings
Methodology
This paper introduces the BRANE framework, which leverages a large language model (LLM) to extract workload-specific binary features from natural language queries, serving as a semantic bridge to the retrieval pipeline configuration space. For each predefined pipeline configuration, a lightweight binary classifier is trained to estimate the probability that the pipeline answers the query correctly. At inference, BRANE selects the configuration maximizing a Lagrangian score balancing predicted accuracy and cost, enabling a tunable cost-quality tradeoff without retraining. The approach involves offline profiling of configurations, query characterization, predictor training, and Lagrangian-based inference selection, covering multiple retrievers, LLMs, document counts, and synthesis strategies, significantly outperforming static and LLM-routing baselines.
Key Results
- On MuSiQue, BrowseComp-Plus, and FinanceBench benchmarks, BRANE achieves up to 89.4% cost savings while matching the best static configuration's accuracy, averaging 59.7% cost reduction.
- BRANE consistently pushes the cost-accuracy Pareto frontier beyond state-of-the-art baselines including Carrot (LLM routing), METIS (rule-based routing), Adaptive-RAG (retrieval strategy selection), and fine-tuned BERT and Qwen3-4B models.
- Ablation studies demonstrate that LLM-proposed workload-specific binary features outperform generic semantic embeddings as predictor inputs, and BRANE is robust to the choice of characterization LLM and feature set size.
Significance
This work pioneers systematic per-query dynamic configuration of the full retrieval pipeline, overcoming limitations of static workload-level tuning. By significantly reducing inference costs while maintaining or improving answer quality, BRANE addresses the combinatorial complexity and interaction effects in multi-knob retrieval systems. It offers a practical, scalable solution for industrial-scale knowledge retrieval, enabling more efficient, personalized, and cost-effective intelligent QA and search services.
Technical Contribution
BRANE introduces an innovative mechanism for automatic discovery of workload-specific binary query features via LLMs, effectively capturing complex, non-linear relationships between queries and pipeline configurations. Training independent lightweight predictors per configuration avoids the complexity and data demands of joint modeling. The use of Lagrangian relaxation enables flexible cost-accuracy tradeoffs at inference. Additionally, fuzzy Pareto pruning reduces training and inference overhead, enhancing scalability and practical deployment.
Novelty
BRANE is the first end-to-end framework mapping natural language queries to full retrieval pipeline configurations, extending beyond prior work limited to LLM model selection. Its key innovation lies in leveraging LLMs to automatically generate workload-specific binary query features, enabling fine-grained query differentiation and personalized configuration selection, which substantially improves efficiency and accuracy over fixed or generic feature approaches.
Limitations
- BRANE relies on offline profiling of configuration performance, which can be expensive and time-consuming, especially for very large configuration spaces.
- The approach assumes stable relationships between query features and configuration accuracy; dynamic or evolving workloads may require frequent retraining to maintain effectiveness.
- While evaluated on multiple benchmarks, BRANE's adaptability to extremely complex queries or cross-domain generalization remains to be fully validated.
Future Work
Future directions include reducing offline profiling costs, developing online learning and adaptive update mechanisms to handle dynamic workloads, enriching query feature representations with deeper semantic and multimodal signals, and extending BRANE to cross-domain and multilingual settings to enhance robustness and universality.
AI Executive Summary
Modern knowledge retrieval systems have grown increasingly complex, involving multiple interacting components such as large language models (LLMs), retrievers, document counts, multi-hop reasoning steps, and synthesis strategies. Configuring these pipelines involves a high-dimensional combinatorial space, where each choice impacts both answer quality and serving cost. Traditional approaches typically rely on static, workload-level tuning, applying a fixed configuration for all queries. This neglects the substantial variability in query complexity and structure, leading to suboptimal resource utilization and performance.
To address this challenge, the authors propose BRANE, a novel framework that dynamically selects the optimal retrieval pipeline configuration on a per-query basis. BRANE first employs a powerful LLM to extract workload-specific binary features from queries, capturing fine-grained semantic and structural characteristics relevant to configuration choice. For each candidate pipeline configuration, a lightweight binary classifier is trained offline to predict the probability of producing a correct answer for a given query feature vector.
At inference, BRANE computes a Lagrangian score combining predicted accuracy and estimated cost for each configuration, selecting the one that maximizes this tradeoff according to a tunable parameter λ. This enables flexible balancing of cost and quality without retraining. The framework incorporates fuzzy Pareto pruning to focus training on near-optimal configurations, reducing computational overhead.
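The selection rule can be illustrated with a minimal sketch, assuming the per-configuration correctness probabilities for one query are already available. The configuration names, costs, and probabilities below are illustrative placeholders, not values from the paper, and `select_config` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    name: str    # hypothetical label for a pipeline configuration
    cost: float  # estimated serving cost per query, in dollars


def select_config(pred_correct, configs, lam):
    """Return the config maximizing predicted correctness minus lam * cost."""
    return max(configs, key=lambda c: pred_correct[c.name] - lam * c.cost)


configs = [
    Config("small_llm_k2", 0.001),
    Config("mid_llm_k5", 0.010),
    Config("large_llm_k10_iterative", 0.050),
]
# Illustrative predictor outputs for a single query (not results from the paper).
pred = {
    "small_llm_k2": 0.55,
    "mid_llm_k5": 0.80,
    "large_llm_k10_iterative": 0.90,
}

# Sweeping lam moves the choice from the most accurate to the cheapest config.
for lam in (0.0, 5.0, 50.0):
    print(lam, select_config(pred, configs, lam).name)
```

With λ = 0 the rule ignores cost entirely. Larger values make the penalty dominate, so the same trained predictors can serve different budgets without retraining.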
Extensive experiments on three public benchmarks—MuSiQue, BrowseComp-Plus, and FinanceBench—demonstrate that BRANE consistently outperforms static configurations and state-of-the-art baselines including LLM routing (Carrot), rule-based routing (METIS), retrieval strategy selection (Adaptive-RAG), and fine-tuned end-to-end models (BERT, Qwen3-4B). BRANE achieves up to 89.4% cost savings at matched accuracy, pushing the cost-quality Pareto frontier significantly.
This work represents a significant advance in knowledge retrieval system design, enabling per-query adaptive configuration that optimizes resource use and answer quality. While promising, BRANE currently depends on costly offline profiling and assumes workload stability. Future work will explore online adaptation, richer query representations, and broader applicability across domains and languages.
Deep Analysis
Background
The field of knowledge retrieval has evolved rapidly with the advent of large language models (LLMs) and sophisticated information retrieval techniques. Modern systems integrate multi-step retrieval and generation pipelines to answer complex queries by grounding responses in external evidence. Representative systems such as Perplexity, ChatGPT, Gemini, and Claude combine web-scale corpora retrieval with LLM generation to enhance answer accuracy. However, these pipelines involve numerous configuration choices—selecting the LLM variant, retriever type, number of retrieved documents, retrieval depth (hops), and synthesis strategy—that jointly influence both answer quality and serving cost. Existing solutions predominantly apply static configurations tuned per workload, failing to exploit the heterogeneity among individual queries. Recent LLM routing approaches dynamically select models based on query semantics but focus narrowly on model choice, ignoring the broader configuration space. This leaves a large optimization opportunity unaddressed, motivating the need for a systematic, per-query configuration framework.
Core Problem
The core problem addressed is how to dynamically select the optimal retrieval pipeline configuration for each natural language query, from a large predefined configuration space, to minimize expected serving cost while meeting an accuracy target (or vice versa). Challenges include the combinatorial explosion of configuration options and their complex, sometimes non-monotonic interactions; the noisy and high-dimensional nature of natural language queries that complicates direct mapping to configurations; and the high cost of exhaustive offline profiling across configurations and queries. Static workload-level tuning ignores per-query variability, resulting in either wasted resources or degraded accuracy. A principled, scalable method to map queries to configurations that balances cost and quality is thus critical for efficient knowledge retrieval.
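One way to write the budget-constrained variant of this problem is sketched below. The notation is an assumption made for illustration: $\pi$ is a policy mapping a query $q$ to a configuration in the catalog $\mathcal{C}$, $B$ is the budget, $F_q$ is the query's feature vector, and $\hat{p}_c(F_q)$ is the predicted probability that configuration $c$ answers $q$ correctly. The accuracy-target variant is symmetric, with cost minimized subject to an accuracy constraint.

```latex
\begin{aligned}
\max_{\pi}\;& \mathbb{E}_q\big[\mathrm{acc}(\pi(q), q)\big]
\quad \text{s.t.} \quad \mathbb{E}_q\big[\mathrm{cost}(\pi(q))\big] \le B \\
\mathcal{L}(\pi, \lambda) &= \mathbb{E}_q\big[\mathrm{acc}(\pi(q), q) - \lambda\,\mathrm{cost}(\pi(q))\big] + \lambda B,
\qquad \lambda \ge 0 \\
\pi_\lambda(q) &= \arg\max_{c \in \mathcal{C}} \; \hat{p}_c(F_q) - \lambda\,\mathrm{cost}(c)
\end{aligned}
```

For a fixed $\lambda$, the relaxed objective is an expectation of per-query terms, so it can be maximized one query at a time. The last line is that per-query rule, with the unknown true correctness replaced by the predictor's estimate.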
Innovation
This work introduces three key innovations:
1. Workload-Specific Query Feature Discovery: Utilizing a state-of-the-art LLM (gpt-4o) to automatically generate a set of binary query features tailored to the workload, capturing nuanced semantic and structural aspects relevant for configuration decisions. This contrasts with prior fixed or generic features that collapse query diversity.
2. Per-Configuration Lightweight Predictors: Training independent binary classifiers per configuration to estimate the probability of correct answers given query features, avoiding the complexity and data demands of joint modeling over the entire configuration space. This modular approach supports scalable training and easy extension.
3. Lagrangian Relaxation-Based Inference: Employing a Lagrangian formulation to combine predicted accuracy and cost into a single score, enabling flexible tradeoffs controlled by a tunable parameter λ. This allows BRANE to trace the cost-quality Pareto frontier and adapt to user-specified accuracy or budget targets dynamically.
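The second idea, one independent predictor per configuration, can be sketched with a tiny pure-Python logistic regression. This is a sketch under stated assumptions rather than the authors' implementation: the feature `needs_numeric_lookup`, the configuration names, and the labels are made up, and the paper's predictors may be tree-based models instead.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for x, t in zip(X, y):
            err = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            for j in range(d):
                gw[j] += err * x[j]
            gb += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b


def predict_proba(model, x):
    """Predicted probability that the configuration answers the query correctly."""
    w, b = model
    return _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy profiling data: features are [requires_multi_hop, needs_numeric_lookup].
X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 5
# Hypothetical correctness labels per configuration (1 = answered correctly).
labels = {
    "cheap": [1, 1, 0, 0] * 5,      # fails on multi-hop queries
    "expensive": [1, 1, 1, 1] * 5,  # succeeds everywhere in this toy set
}
# One independent model per configuration; no parameters are shared.
models = {name: train_logreg(X, y) for name, y in labels.items()}

multi_hop_query = [1, 0]
for name, model in models.items():
    print(name, round(predict_proba(model, multi_hop_query), 2))
```

Because each model sees only its own configuration's correctness column, adding a configuration to the catalog means training one more small model and leaves the existing ones untouched.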
Methodology
- Offline Configuration Profiling: Sample N queries from the workload and execute each against all C configurations, recording binary correctness and cost metrics to build accuracy and cost matrices.
- Workload-Specific Feature Generation: Prompt gpt-4o with example queries to propose d binary features (e.g., requires_multi_hop, involves_regional_cuisine), each answerable by yes/no from query text.
- Query Feature Annotation: Use a smaller, cheaper LLM to label all queries with these binary features, producing feature vectors Fq.
- Predictor Training: For each configuration surviving fuzzy Pareto pruning, train a lightweight binary classifier (logistic regression, decision tree, random forest, XGBoost, LightGBM) on (Fq, correctness) pairs, selecting the best model via cross-validated negative log-loss.
- Inference-Time Selection: For a new query q, compute Fq, evaluate the predicted correctness ĉ(Fq) for all Pareto configurations, and select the configuration c maximizing ĉ(Fq) - λ·cost(c), where λ balances accuracy and cost.
- Fuzzy Pareto Pruning: Retain configurations near the cost-accuracy Pareto frontier within tolerances τacc and τcost to reduce training and inference overhead while maintaining performance.
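The profiling and pruning steps above can be sketched as follows. The summary does not spell out the exact pruning rule, so the dominance test here is an assumption: a configuration is dropped only when another one beats it by more than `tau_acc` in accuracy while costing less than `(1 - tau_cost)` times as much. All names and numbers are illustrative.

```python
def summarize(correct, cost):
    """Collapse per-query profiling matrices into (mean accuracy, mean cost)."""
    return {
        name: (
            sum(correct[name]) / len(correct[name]),
            sum(cost[name]) / len(cost[name]),
        )
        for name in correct
    }


def fuzzy_pareto(stats, tau_acc=0.02, tau_cost=0.10):
    """Keep configs that are not dominated by a clear margin (assumed rule)."""
    kept = []
    for name, (acc, c) in stats.items():
        dominated = any(
            o_acc > acc + tau_acc and o_cost < c * (1.0 - tau_cost)
            for other, (o_acc, o_cost) in stats.items()
            if other != name
        )
        if not dominated:
            kept.append(name)
    return kept


# Toy profiling matrices: one row per config, one column per sampled query.
correct = {
    "A": [1, 0, 0, 1],  # accuracy 0.50, cheapest
    "B": [1, 1, 0, 1],  # accuracy 0.75
    "C": [1, 1, 0, 0],  # accuracy 0.50 but costlier than B
    "D": [1, 1, 1, 1],  # accuracy 1.00, most expensive
    "E": [1, 1, 0, 1],  # accuracy 0.75, slightly costlier than B
}
cost = {
    "A": [0.001] * 4,
    "B": [0.010] * 4,
    "C": [0.020] * 4,
    "D": [0.050] * 4,
    "E": [0.0105] * 4,
}

print(fuzzy_pareto(summarize(correct, cost)))
```

In this toy catalog, C is pruned because B is clearly more accurate and clearly cheaper. E survives even though B strictly dominates it, because the gap falls inside the tolerances; keeping such near-frontier configurations is what distinguishes the fuzzy variant from a strict Pareto filter.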
Experiments
Experiments were conducted on three public knowledge search benchmarks: MuSiQue, BrowseComp-Plus, and FinanceBench, covering diverse domains and query types. Each benchmark includes 150 to 600 queries. The configuration space comprises multiple LLM variants (including gpt-4o and GPT-5.4), retrievers, document counts, retrieval depths, and synthesis strategies (LLM-only, per-chunk summary, iterative retrieval). Baselines include static optimal configurations, LLM routing (Carrot), rule-based routing (METIS), retrieval strategy selection (Adaptive-RAG), and fine-tuned end-to-end models (BERT, Qwen3-4B). The evaluation metrics are accuracy (answers judged by gpt-4o against reference answers) and cost (the monetary cost of token generation plus characterization overhead). Predictors were evaluated with five-fold cross-validation and tuned with Optuna. Ablations examined the impact of feature set size, characterization LLM choice, and predictor model type.
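Model selection by cross-validated log-loss can be sketched in pure Python. The two candidates below (a base-rate predictor and a per-pattern lookup table) are simple stand-ins chosen for brevity, not the classifier families used in the paper, and the data is synthetic.

```python
import math


def log_loss(y_true, probs, eps=1e-6):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)


def fit_base_rate(X, y):
    """Candidate 1: ignore features and predict the smoothed training base rate."""
    rate = (sum(y) + 1) / (len(y) + 2)
    return lambda x: rate


def fit_pattern_table(X, y):
    """Candidate 2: smoothed correctness rate per exact feature pattern."""
    counts = {}
    for x, t in zip(X, y):
        pos, n = counts.get(tuple(x), (0, 0))
        counts[tuple(x)] = (pos + t, n + 1)

    def predict(x):
        pos, n = counts.get(tuple(x), (0, 0))
        return (pos + 1) / (n + 2)

    return predict


def cv_log_loss(fit, X, y, k=5):
    """Average held-out log-loss over k interleaved folds."""
    losses = []
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        model = fit([X[i] for i in train], [y[i] for i in train])
        losses.append(log_loss([y[i] for i in test], [model(X[i]) for i in test]))
    return sum(losses) / k


# Synthetic data for one configuration: correct unless the query is multi-hop.
X = [[i % 2, (i // 2) % 2] for i in range(40)]
y = [1 - x[0] for x in X]

candidates = {"base_rate": fit_base_rate, "pattern_table": fit_pattern_table}
scores = {name: cv_log_loss(fit, X, y) for name, fit in candidates.items()}
best = min(scores, key=scores.get)  # lowest log-loss = highest negative log-loss
print(best, scores)
```

Log-loss rewards calibrated probabilities rather than just correct labels, which matters here because the predicted probability is compared directly against a cost penalty at inference time.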
Results
BRANE outperforms all baselines across the three benchmarks, achieving up to 89.4% cost savings at matched accuracy and a 59.7% cost reduction on average. Its operating points dominate the baselines' cost-accuracy Pareto frontier over a broad cost range. LLM-proposed workload-specific binary features outperform generic semantic embeddings as predictor inputs, and fuzzy Pareto pruning reduces the number of predictors trained without degrading performance. BRANE is also the only method that meets strict accuracy targets while still saving cost on all benchmarks.
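A cost saving at matched accuracy can be computed as in the sketch below, assuming it means the relative cost of the cheapest adaptive operating point that reaches the best static configuration's accuracy. The frontier points and the static baseline are made-up numbers, not results from the paper.

```python
def savings_at_matched_accuracy(frontier, static_acc, static_cost):
    """Fractional cost savings of the cheapest frontier point that reaches
    at least static_acc, or None if no point does."""
    feasible = [cost for acc, cost in frontier if acc >= static_acc]
    if not feasible:
        return None
    return 1.0 - min(feasible) / static_cost


# Illustrative (accuracy, mean cost per query) points from a lambda sweep.
frontier = [(0.60, 0.002), (0.72, 0.006), (0.78, 0.020), (0.80, 0.045)]
best_static = (0.78, 0.050)  # hypothetical best fixed configuration

s = savings_at_matched_accuracy(frontier, *best_static)
print(f"{s:.0%}")
```

The function returns `None` when no adaptive operating point reaches the static accuracy, since a saving is only meaningful when accuracy is actually matched.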
Applications
BRANE is applicable to intelligent customer support systems, scientific literature retrieval assistants, and financial information analysis platforms requiring efficient and accurate knowledge retrieval. By dynamically adapting retrieval configurations per query, it enables systems to allocate resources effectively according to query complexity and budget constraints, improving response quality and reducing operational costs. Its modular design supports integration with various LLMs and retrievers, facilitating deployment in industrial-scale search architectures and promoting personalized, on-demand retrieval services.
Limitations & Outlook
BRANE depends on offline profiling of configuration-query pairs, incurring substantial upfront computational and monetary costs, which may hinder rapid deployment and iteration. The approach presumes stable query-feature to configuration-performance relationships; shifts in query distribution or corpus content necessitate retraining to maintain effectiveness. Current evaluations focus on English-language benchmarks; cross-lingual and cross-domain robustness require further study. Handling extremely complex or ambiguous queries remains an open challenge.
Plain Language (accessible to non-experts)
Imagine a large restaurant kitchen with many different cooking tools and ingredient combinations. Traditionally, the kitchen uses a fixed recipe for every dish, regardless of the customer's order. This means some dishes are overcooked and waste ingredients, while others are underprepared and taste bland. BRANE acts like a smart ordering assistant that, based on the customer's dish request (natural language query), quickly figures out the best combination of tools and ingredients (retrieval pipeline configuration) to cook the dish perfectly while saving time and resources. It learns from past orders and cooking outcomes, extracting key features of each dish to predict which cooking method will succeed. This way, the restaurant can satisfy diverse customer tastes efficiently without wasting resources.
ELI14 (explained like you're 14)
Hey! Imagine you're playing a super complex video game where each mission needs different gear and skills. Before, you always used the same gear for every mission, which sometimes made things too hard or wasted your game coins. BRANE is like your awesome game buddy who looks at each mission and tells you exactly which gear and skills to use so you can win easily and save coins! It learns what makes each mission unique and guesses which gear combo will work best. That way, you get better at the game and don’t waste your resources. Cool, right?
Glossary
Large Language Model (LLM)
A deep learning model trained on vast text data to understand and generate human language.
Used in BRANE to extract query features and judge answer correctness.
Retrieval Pipeline Configuration
A specific combination of LLM, retriever, document count, retrieval depth, and synthesis strategy in a knowledge retrieval system.
BRANE dynamically selects among these configurations per query.
Workload-Specific Query Characteristics
Binary features automatically generated by an LLM that capture semantic and structural aspects of queries within a workload.
Serve as inputs to predictors estimating configuration accuracy.
Lagrangian Relaxation
An optimization technique that transforms constrained problems into unconstrained ones using Lagrange multipliers.
Used in BRANE to balance cost and accuracy during configuration selection.
Fuzzy Pareto Pruning
A method to retain configurations near the Pareto frontier within tolerance thresholds to reduce training overhead.
Helps BRANE focus on promising configurations.
Accuracy
The proportion of queries for which the system produces a correct answer.
Predicted per configuration to guide selection.
Cost
The monetary expense associated with executing a query under a given configuration, typically measured in dollars.
Considered alongside accuracy in BRANE’s selection.
Multi-hop Reasoning
The process of combining information from multiple sources or steps to answer a query.
Included as a binary query feature in BRANE.
Knowledge Search Benchmark
Standard datasets used to evaluate knowledge retrieval systems, e.g., MuSiQue, BrowseComp-Plus, FinanceBench.
Used to validate BRANE’s performance.
Predictor
A lightweight binary classifier estimating the probability that a given configuration answers a query correctly.
Trained per configuration in BRANE.
Open Questions
1. How can BRANE adapt online to dynamic workloads and evolving corpora without costly retraining?
2. How can the expensive offline profiling phase be reduced so that BRANE scales to extremely large configuration spaces?
3. How should workload-specific query feature discovery be designed for cross-lingual and cross-domain settings?
4. How can BRANE's configuration predictions be kept reliable for extremely complex or ambiguous queries?
5. How can multimodal information (images, videos) be incorporated into BRANE's query characterization and configuration selection?
Applications
Immediate Applications
Intelligent Customer Support
Dynamically selects retrieval configurations per query to improve answer accuracy and response speed while reducing cloud costs.
Scientific Literature Retrieval
Optimizes retrieval strategies based on query complexity to enhance relevance and efficiency in research assistants.
Financial Information Analysis
Adjusts retrieval depth and model strength dynamically to improve accuracy and timeliness in financial data platforms.
Long-term Vision
Personalized Knowledge Search Engines
Leverages user history and preferences to tailor retrieval configurations per query, enabling personalized and efficient search experiences.
Cross-Domain Multilingual QA Systems
Extends BRANE to support multiple languages and domains, facilitating global access to intelligent question answering.
Abstract
Modern retrieval agents expose many configuration choices -- LLM, retriever, number of documents, number of hops, and synthesis strategy -- each shaping both answer quality and serving cost. Today, these pipelines are typically hand-tuned once per workload, leaving substantial per-query optimization untapped. We formulate the problem: given a natural-language query and either an accuracy or a budget target, select from a predefined pipeline catalog the configuration that minimizes cost or maximizes accuracy at inference time. We propose **BRANE**, which uses an LLM to convert each query into workload-specific characteristics, then trains a lightweight per-configuration predictor that estimates whether the pipeline will answer the query correctly. At inference time, **BRANE** selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining. Across MuSiQue, BrowseComp-Plus, and FinanceBench, **BRANE** consistently pushes the cost-quality Pareto frontier, matches the best fixed configuration's accuracy at up to 89% lower cost, and outperforms LLM-routing, rule-based, and fine-tuned Qwen3-4B baselines. These results show that per-query configuration of the full retrieval pipeline is a practical alternative to static workload-level tuning.
References (19)
Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity
Soyeong Jeong, Jinheon Baek, Sukmin Cho et al.
CARROT: A Cost Aware Rate Optimal Router
Seamus Somerstep, Felipe Maia Polo, Allysson Flavio Melo de Oliveira et al.
Murakkab: Resource-Efficient Agentic Workflow Orchestration in Cloud Platforms
G. Chaudhry, Esha Choukse, Haoran Qiu et al.
METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation
Siddhant Ray, Rui Pan, Zhuohan Gu et al.
FinanceBench: A New Benchmark for Financial Question Answering
Pranab Islam, Anand Kannappan, Douwe Kiela et al.
♫ MuSiQue: Multihop Questions via Single-hop Question Composition
H. Trivedi, Niranjan Balasubramanian, Tushar Khot et al.
OmniRouter: Budget and Performance Controllable Multi-LLM Routing
K. Mei, Wujiang Xu, Shuhang Lin et al.
RAG over Thinking Traces Can Improve Reasoning Tasks
Negar Arabzadeh, Wenjie Ma, Sewon Min et al.
RouterBench: A Benchmark for Multi-LLM Routing System
Qi Hu, J. Bieker, Xiuyu Li et al.
BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent
Zijian Chen, Xueguang Ma, Shengyao Zhuang et al.
The Distracting Effect: Understanding Irrelevant Passages in RAG
Chen Amiraz, Florin Cuconasu, Simone Filice et al.
RouteLLM: Learning to Route LLMs with Preference Data
Isaac Ong, Amjad Almahairi, Vincent Wu et al.
Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics
Akshara Prabhakar, Roshan Ram, Zixiang Chen et al.
R2-Router: A New Paradigm for LLM Routing with Reasoning
Jiaqi Xue, Qian Lou, Jiarong Xing et al.
The Power of Noise: Redefining Retrieval for RAG Systems
Florin Cuconasu, Giovanni Trappolini, F. Siciliano et al.
LightGBM: A Highly Efficient Gradient Boosting Decision Tree
Guolin Ke, Qi Meng, Thomas Finley et al.
DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments
Yuxiang Zheng, Dayuan Fu, Xiangkun Hu et al.
XGBoost: A Scalable Tree Boosting System
Tianqi Chen, Carlos Guestrin
HedraRAG: Coordinating LLM Generation and Database Retrieval in Heterogeneous RAG Serving
Zhengding Hu, Vibha Murthy, Zaifeng Pan et al.