Query-aware Routing for Filtered Approximate Nearest Neighbors Search - Paper Insights

Key Findings

Methodology

This paper introduces a machine learning-based routing system employing lightweight multi-layer perceptrons (MLPs) to predict the recall of candidate algorithms on a per-query basis. The approach involves constructing an offline benchmark table by exhaustively testing candidate algorithms across multiple datasets, predicate types, and parameter settings, recording their recall and QPS. During online query processing, features such as query selectivity, dataset local intrinsic dimensionality (LID), and predicate type are extracted and fed into trained regression models to predict each algorithm’s recall. The system then filters algorithms based on a user-defined recall threshold, consulting the benchmark table to select the method and parameter setting with the highest QPS that meets the threshold. This dynamic selection allows the system to adaptively choose the best algorithm per query, balancing recall and throughput effectively.

Key Results

Experimental evaluations on six real-world datasets, including YFCC and LAION-1M, demonstrate that the proposed router consistently outperforms static baseline algorithms such as UNG, SIEVE, and Post-filter in terms of recall and QPS. The model achieves an average recall improvement of over 5%, with QPS gains around 10%, while incurring negligible latency overhead. The system adapts to different predicate types (containment, overlap, equality) and data complexities, maintaining robust performance across diverse scenarios.
A feature ablation study confirms that only three features—query selectivity, dataset LID, and predicate type—are sufficient to preserve high prediction accuracy, reducing overfitting and model complexity. The regression approach, predicting recall directly, provides more stable and nuanced guidance than classification models, especially when multiple algorithms have similar performance levels.
Validation on unseen datasets shows that the system generalizes well, dynamically selecting algorithms that achieve near-optimal recall with minimal latency. The approach effectively handles high-dimensional, label-sparse, and complex data environments, demonstrating its practical utility for large-scale vector search systems.

Significance

This research addresses a critical bottleneck in modern vector databases: the challenge of selecting the most suitable approximate nearest neighbor (ANN) algorithm for a given query. Traditional methods rely on static rules or single algorithms, which often fail to adapt to query-specific variations, leading to suboptimal performance. By integrating machine learning predictions into the algorithm selection process, this work enables real-time, query-aware scheduling that optimizes both recall and throughput. The approach significantly enhances the efficiency and flexibility of vector search systems, making them more suitable for real-world applications such as recommendation engines, semantic search, and retrieval-augmented generation. Its ability to generalize across datasets and predicate types marks a substantial advancement in the field.

Technical Contribution

The core technical innovation lies in designing a lightweight regression-based ML router that predicts the recall of candidate algorithms based on minimal query and dataset features. The system employs a minimal set of three features—query selectivity, dataset LID, and predicate type—selected through systematic ablation to maximize generalization and reduce overfitting. Each candidate algorithm has an independent MLP regressor trained to output predicted recall, enabling flexible, query-specific algorithm scheduling. The offline benchmark table records comprehensive performance metrics across multiple datasets, predicate types, and parameter settings, facilitating informed selection during online inference. The system balances accuracy and latency by keeping models shallow and efficient, ensuring real-time responsiveness. This integrated approach surpasses existing static or rule-based methods, providing a scalable, adaptive solution for filtered ANN search.

Novelty

This work is the first to employ a regression-based machine learning model for per-query algorithm performance prediction in filtered ANN search, moving beyond traditional static rules or classification models. The combination of minimal feature selection, offline benchmarking, and lightweight inference enables dynamic, query-specific algorithm scheduling with high accuracy and low latency. Unlike prior approaches that rely on dataset-level heuristics or fixed algorithms, this framework adapts to query-specific characteristics, significantly improving recall and QPS trade-offs. Its ability to generalize across unseen datasets and predicate types demonstrates a novel integration of offline benchmarking and real-time prediction, setting a new standard for intelligent, adaptive vector search systems.

Limitations

The system depends heavily on the accuracy of the offline benchmark table; significant distribution shifts or unseen data characteristics may reduce prediction reliability, necessitating periodic re-benchmarking.
While the three selected features are effective, they may not capture all nuances of complex query scenarios, especially in highly dynamic or evolving datasets, potentially limiting performance.
Despite the lightweight design, the inference process adds some latency, which could become critical in ultra-low latency applications or extremely high concurrency environments. Further optimization and hardware acceleration may be required for deployment at scale.

Future Work

Future research could focus on developing adaptive benchmark updating mechanisms to maintain prediction accuracy in dynamic environments. Incorporating additional features or employing deep learning models might further improve prediction fidelity. Exploring reinforcement learning for continuous policy optimization and extending the framework to multi-objective scenarios, such as balancing latency, energy consumption, and recall, are promising directions. Additionally, integrating this approach with distributed systems and hardware accelerators could facilitate deployment in real-time, large-scale vector search platforms.

AI Executive Summary

In the era of big data and deep learning, high-dimensional vector representations have become foundational for numerous AI applications, including semantic search, recommendation systems, and knowledge retrieval. As datasets grow exponentially, efficient and accurate approximate nearest neighbor (ANN) search algorithms are essential to enable real-time retrieval. Traditional ANN methods like HNSW, Vamana, and others have achieved remarkable success in high-dimensional spaces; however, their performance often depends heavily on dataset characteristics and query types. Moreover, in practical scenarios, queries are rarely solely about vector similarity—they often include attribute filters such as categories, time ranges, or geographic constraints. This attribute filtering, known as filtered ANN, introduces additional complexity, making the choice of the most suitable algorithm a non-trivial problem.

Existing solutions typically rely on static rules or single algorithms, which cannot adapt to the diverse and dynamic nature of real-world queries. This leads to suboptimal performance, with some queries achieving high recall at the expense of throughput, and vice versa. Recognizing this challenge, the authors propose a novel query-aware routing framework that leverages machine learning to dynamically select the best algorithm for each query. The core idea is to predict the recall performance of candidate algorithms based on minimal query and dataset features, enabling real-time decision-making that balances recall and QPS.

The system employs lightweight multi-layer perceptrons (MLPs) trained via regression to forecast each algorithm’s recall, using just three key features: query selectivity, dataset local intrinsic dimensionality (LID), and predicate type. An offline benchmark table records the performance of all candidate algorithms across multiple datasets and parameter settings, serving as a reference during online inference. When a query arrives, features are extracted, predicted recall values are obtained, and algorithms meeting a user-defined recall threshold are filtered. Among these, the one with the highest QPS, as per the benchmark, is selected for execution. This approach allows the system to adaptively tune algorithm choices on a per-query basis, significantly improving the recall-QPS trade-off.

Extensive experiments on six real-world datasets—including YFCC, LAION-1M, and TripClick—demonstrate the effectiveness of the proposed framework. Results show consistent improvements over static baselines, with an average recall increase of over 5% and QPS gains of approximately 10%. The model’s ability to generalize to unseen datasets and predicate types underscores its robustness and practical value. The minimal feature set reduces overfitting, while the lightweight inference ensures low latency, making the system suitable for large-scale, real-time applications.

This work marks a significant step forward in the field of vector search, introducing a flexible, intelligent, and scalable solution for filtered ANN. By combining offline benchmarking, real-time prediction, and adaptive algorithm scheduling, it addresses long-standing performance bottlenecks and opens new avenues for research in AI-driven data retrieval systems. Future directions include enhancing model adaptability, incorporating multi-objective optimization, and deploying in distributed environments to meet the demands of next-generation AI applications.

Deep Dive

Plain Language Accessible to non-experts

Imagine you’re in a huge kitchen with many different chefs, each specialized in making certain types of dishes. When a new order comes in, you need to quickly decide which chef to assign so that the dish is prepared fast and tastes great. In the past, you might have used simple rules like: if the order is for a sweet dessert, pick the dessert chef; if it’s spicy, pick the spicy chef. But these rules are too rigid—they don’t always work well because some chefs are better at certain dishes depending on the specific ingredients or complexity.

Now, imagine you have a smart assistant that has learned from many past orders. This assistant looks at the details of the new order—how complicated it is, what ingredients are involved, how many people want it—and then predicts how well each chef will do. Based on these predictions, the assistant picks the best chef for that particular order, balancing speed and quality. This way, the kitchen runs more smoothly, orders are completed faster, and customers are happier.

This smart assistant is like the machine learning model in the paper. Instead of fixed rules, it learns from past data to make better decisions for each new task. It looks at just a few key features—like how complex the order is, how many ingredients are involved, and what type of dish it is—and then predicts which chef will do the best job. This approach makes the whole kitchen more efficient and flexible, able to handle all kinds of orders without getting overwhelmed or making mistakes.

ELI14 Explained like you're 14

Imagine you’re in a giant school cafeteria with dozens of different chefs, each really good at certain kinds of food—some are great at sandwiches, others at salads, and some at desserts. When a new student places an order, you want to quickly pick the best chef so the food is ready fast and tastes awesome. In the past, you might have used simple rules like: if the student wants a sweet treat, pick the dessert chef; if they want a healthy salad, pick the salad chef. But these rules are too simple—they don’t always work because sometimes a chef who’s good at desserts might also be good at other things depending on the ingredients.

Now, imagine you have a smart helper who remembers all the past orders and how well each chef did. When a new order comes in, this helper looks at details like how complicated the order is, what ingredients are involved, and how many students want it. Then, it predicts which chef will do the best job for this specific order. Based on that prediction, it assigns the order to the most suitable chef. This way, the cafeteria runs more smoothly, orders are prepared faster, and everyone leaves happy.

This helper is like the machine learning model in the paper. Instead of fixed rules, it learns from past experience to make smarter choices. It looks at just a few simple things about each order—like how complex it is, what kind of food it is, and how many people want it—and then predicts which chef will do the best. This makes the whole process more flexible, faster, and better at handling all kinds of orders, just like a well-organized kitchen or cafeteria!

Abstract

Filtered ANN search, which combines vector similarity with attribute predicates, is a core primitive in modern vector databases and retrieval-augmented generation. We benchmark all major categorical filtered ANN methods across multiple datasets under three predicates and find that no single method dominates. Moreover, even within a single dataset and predicate type, the best method for a query can vary. Therefore, we propose a query-aware routing framework. A lightweight ML model predicts each candidate method's recall on the query, and the router consults an offline benchmark table that maps every method and parameter setting to its measured recall and QPS, then selects the method with the best recall--QPS trade-off. Our ablation study narrows 22 candidate features to a minimal set of three and we adopt regression rather than classification as the prediction target to sharpen accuracy. Our model is trained on six real-world datasets and applied to five unseen validation datasets. The final result shows that our router achieves state-of-the-art recall and QPS balance across all five validation datasets compared to existing filtered ANN baselines, while incurring negligible latency overhead.

cs.DB cs.IR

Related Papers

Implementation and Privacy Guarantees for Scalable Keyword Search on SOLID-based Decentralized Data with Granular Visibility Constraints

ESPRESSO framework implements scalable keyword search with granular visibility constraints and privacy guarantees in Solid-based environments.

cs.DB 2026-04-24