Constrained Dominant Sets for Multimodal Document Question Answering
Proposes a Constrained Dominant Set (CDS) method for multimodal long-document QA, achieving 66.99 on VisDoMBench, surpassing previous SOTA by 37.1 points.
Key Findings
Methodology
This paper introduces a novel evidence retrieval approach based on Constrained Dominant Sets (CDS) applied to a query-augmented affinity graph. The method constructs a multimodal graph where nodes represent document elements such as text chunks, figures, and tables, with edges encoding structural and semantic relations. The query is incorporated as a hard constraint node, ensuring the selected evidence set is directly relevant to the question. The affinity matrix combines query–node similarity and node–node dissimilarity, forming a quadratic optimization problem. Using replicator dynamics, the method iteratively updates soft membership scores for nodes, converging to a global equilibrium that balances relevance and diversity. This approach is training-free, leveraging graph structure and spectral bounds to automatically tune the relevance-diversity trade-off, avoiding greedy heuristics. The process yields a compact, mutually coherent evidence subset, which is then passed to a vision-language model (VLM) for answer generation. The entire pipeline emphasizes global optimality, interpretability, and efficiency, making it suitable for long, complex multimodal documents.
Key Results
- On VisDoMBench, using Qwen3-VL-32B, the CDS method achieved an average accuracy of 66.99, outperforming previous top systems such as G2-Reader (66.21) and VisDoMRAG (65.01), with a +37.1 point improvement over no-retrieval baselines, demonstrating its effectiveness in complex multimodal QA tasks.
- In the MMLongBench-Doc dataset, the approach with Qwen3-VL-32B attained 45.01% accuracy, surpassing the no-retrieval baseline (40.19%) by 4.82 points, validating its robustness in single long-document scenarios.
- Ablation studies comparing different retrieval strategies (MMR, DPP, PPR) show that CDS consistently yields superior relevance-diversity balance, leading to higher answer accuracy across multiple datasets, especially in scenarios with high redundancy and complex multimodal relations.
Significance
This research addresses a fundamental bottleneck in long-document multimodal question answering: effective evidence selection amidst redundancy. By integrating graph-based global optimization with query constraints, the method significantly enhances the quality of retrieved evidence, directly translating into improved answer accuracy. It bridges the gap between structural graph theories and practical retrieval tasks, offering a scalable, training-free solution that can be readily integrated into existing systems. The approach’s ability to automatically balance relevance and diversity, without manual tuning, marks a substantial advancement in the field, promising broader applications in scientific literature review, legal document analysis, and enterprise knowledge management. Its emphasis on interpretability and efficiency aligns well with the demands of real-world deployment, paving the way for more intelligent, autonomous document understanding systems.
Technical Contribution
The core technical innovation lies in adapting the Constrained Dominant Set framework to the domain of multimodal long-document retrieval. Unlike traditional similarity ranking, the method formulates evidence selection as a quadratic optimization problem constrained by the query node, solved via replicator dynamics—an algorithm with theoretical guarantees of convergence to a global maximum. This approach inherently balances relevance and diversity without hyperparameter tuning, leveraging spectral bounds to automatically determine the trade-off. The graph construction integrates heterogeneous modalities into a unified affinity matrix, combining structural cues and semantic relations verified by vision-language models. The entire pipeline is training-free, relying solely on graph optimization, which enhances interpretability and computational efficiency. This represents a significant departure from prior heuristic or learned ranking methods, providing a theoretically grounded, scalable solution for complex multimodal retrieval tasks.
Novelty
This work is the first to apply the constrained dominant set approach to long-form multimodal document retrieval, integrating query constraints directly into a graph-based global optimization framework. Unlike previous methods that depend on greedy heuristics or learned similarity scores, this approach guarantees a balanced, optimal subset of evidence by solving a quadratic program with spectral bounds. The use of replicator dynamics as a solver for this problem is novel in the context of natural language processing and multimodal retrieval, offering a principled way to incorporate relevance and diversity simultaneously. The method’s training-free nature, combined with its ability to handle complex, heterogeneous graph structures, sets it apart from existing state-of-the-art systems, leading to superior performance on benchmarks like VisDoMBench.
Limitations
- The effectiveness of the approach heavily depends on the quality of graph construction, particularly the accuracy of structural and semantic links; errors in graph modeling could impair retrieval performance.
- Scalability may become an issue with extremely large or highly complex documents, as the quadratic optimization and replicator dynamics computations increase in complexity, potentially affecting real-time applicability.
- Currently, the method is not integrated into an end-to-end trainable pipeline, limiting its adaptability and potential for joint optimization with downstream tasks such as answer generation. Future work should focus on embedding this graph optimization within learnable frameworks to enhance robustness and generalization.
Future Work
Future directions include integrating the CDS framework into end-to-end trainable models, enabling joint learning of graph construction and optimization parameters. Extending the method to handle dynamic or streaming documents, where content evolves over time, is another promising avenue. Additionally, exploring more sophisticated graph neural network architectures to enhance relation modeling, and combining this approach with reinforcement learning for adaptive evidence selection, could further improve performance. Broader application to real-world scenarios, such as legal case analysis or scientific literature review, will also be pursued, aiming to develop scalable, interpretable, and robust multimodal retrieval systems.
AI Executive Summary
Long, complex documents—such as scientific papers, legal texts, and enterprise reports—pose a significant challenge for question answering systems. Existing retrieval methods primarily rely on similarity ranking, which often leads to redundant evidence selection, especially in lengthy documents where key information recurs across figures, captions, and paragraphs. This redundancy hampers the ability of models to access diverse, relevant evidence, ultimately limiting answer accuracy and interpretability.
To address this, the paper introduces a novel graph-based evidence retrieval framework grounded in Constrained Dominant Sets (CDS). The core idea is to model document elements as nodes in a multimodal graph, where edges encode structural and semantic relationships. The query is incorporated as a hard constraint node, ensuring the selected evidence set remains directly relevant to the question. The affinity matrix combines query–node similarity and node–node dissimilarity, capturing both relevance and diversity. The evidence selection problem is formulated as a quadratic optimization task, which is solved using replicator dynamics—an iterative algorithm inspired by evolutionary game theory. This approach guarantees convergence to a global maximum, automatically balancing relevance and diversity without manual parameter tuning.
The significance of this work lies in its ability to automatically produce a mutually coherent, diverse evidence subset that is tightly anchored to the query. Extensive experiments on benchmarks such as VisDoMBench demonstrate that the proposed method achieves a new state-of-the-art average accuracy of 66.99, outperforming previous best systems by a substantial margin. The approach also shows robustness in single long-document scenarios, with consistent improvements over baseline retrieval strategies.
This research advances the field by bridging graph optimization techniques with multimodal information retrieval, providing a training-free, interpretable, and scalable solution. Its implications extend to scientific literature review, legal analysis, and enterprise knowledge management, where efficient and accurate evidence retrieval is critical. Looking ahead, integrating this graph-based optimization into end-to-end trainable models and exploring more complex multimodal relations will further enhance the capabilities of intelligent document understanding systems. Overall, this work marks a significant step toward more autonomous, precise, and comprehensive long-form multimodal question answering.
Deep Dive
Abstract
Long multimodal document question answering is limited by which evidence reaches the reader, rather than by the quantity retrieved. In lengthy documents, findings often recur across figures, captions, and introductory sentences, causing similarity based retrievers in modern multimodal retrieval-augmented generation (RAG) systems to allocate resources to near-duplicates while overlooking complementary evidence. This work introduces a retriever that selects evidence as a Constrained Dominant Set (CDS) on a query-augmented affinity graph, offering three advantages that similarity ranking does not. First, the query is encoded as a hard structural constraint, ensuring that every selected element is directly connected to the question through the cluster anchor. Second, the relevance-redundancy balance is determined automatically by a spectral bound, eliminating the need for manually tuned trade offs required by diversity-aware selectors. Third, the selection process achieves a global equilibrium via replicator dynamics, thereby avoiding the distortions introduced by greedy heuristics. The method is inherently graph-based and does not require training. Using a Qwen3-VL-32B reader, CDS establishes a new state of the art on VisDoMBench ($66.99$ average) and improves over the no-retrieval baseline by $37.1$ points on VisDoMBench and $4.8$ on MMLongBench-Doc.
References (20)
VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
Manan Suri, Puneet Mathur, Franck Dernoncourt et al.
ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents
Qiuchen Wang, Ruixue Ding, Zehui Chen et al.
MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations
Yubo Ma, Yuhang Zang, Liangyu Chen et al.
DocVQA: A Dataset for VQA on Document Images
Minesh Mathew, Dimosthenis Karatzas, R. Manmatha et al.
ColPali: Efficient Document Retrieval with Vision Language Models
Manuel Faysse, Hugues Sibille, Tony Wu et al.
k-DPPs: Fixed-Size Determinantal Point Processes
Alex Kulesza, B. Taskar
MinerU: An Open-Source Solution for Precise Document Content Extraction
Bin Wang, Chaochao Xu, Xiaomeng Zhao et al.
Dominant-set clustering: A review
S. R. Bulò, M. Pelillo
Ieee Transactions on Pattern Analysis and Machine Intelligence Large-scale Image Geo-localization Using Dominant Sets
Eyasu Zemene
Qwen2.5-VL Technical Report
Shuai Bai, Ke-qin Chen, Xuejing Liu et al.
SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
Shraman Pramanick, R. Chellappa, Subhashini Venugopalan
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
Darren Edge, Ha Trinh, Newman Cheng et al.
MemoryBank: Enhancing Large Language Models with Long-Term Memory
Wanjun Zhong, Lianghong Guo, Qi-Fei Gao et al.
LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating
Chao Deng, Jiale Yuan, Pi Bu et al.
Evolution towards the Maximum Clique
I. Bomze
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng et al.
MemGPT: Towards LLMs as Operating Systems
Charles Packer, Vivian Fang, Shishir G. Patil et al.
Multi-target Tracking in Multiple Non-overlapping Cameras Using Fast-Constrained Dominant Sets
Yonatan Tariku Tesfaye, Eyasu Zemene, A. Prati et al.
Efficient Memory Management for Large Language Model Serving with PagedAttention
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang et al.
M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework
Yew Ken Chia, Liying Cheng, Hou Pong Chan et al.