Constrained Dominant Sets for Multimodal Document Question Answering

TL;DR

Proposes a Constrained Dominant Set (CDS) method for multimodal long-document QA, achieving 66.99 on VisDoMBench, surpassing previous SOTA by 37.1 points.

cs.IR 🔴 Advanced 2026-06-05 61 views

Ambuj Mehrish Sebatiano Vascon

AI Reader Arxiv Page Download PDF

multimodal retrieval graph algorithms dominant sets question answering long document understanding

Key Findings

Methodology

This paper introduces a novel evidence retrieval approach based on Constrained Dominant Sets (CDS) applied to a query-augmented affinity graph. The method constructs a multimodal graph where nodes represent document elements such as text chunks, figures, and tables, with edges encoding structural and semantic relations. The query is incorporated as a hard constraint node, ensuring the selected evidence set is directly relevant to the question. The affinity matrix combines query–node similarity and node–node dissimilarity, forming a quadratic optimization problem. Using replicator dynamics, the method iteratively updates soft membership scores for nodes, converging to a global equilibrium that balances relevance and diversity. This approach is training-free, leveraging graph structure and spectral bounds to automatically tune the relevance-diversity trade-off, avoiding greedy heuristics. The process yields a compact, mutually coherent evidence subset, which is then passed to a vision-language model (VLM) for answer generation. The entire pipeline emphasizes global optimality, interpretability, and efficiency, making it suitable for long, complex multimodal documents.

Key Results

On VisDoMBench, using Qwen3-VL-32B, the CDS method achieved an average accuracy of 66.99, outperforming previous top systems such as G2-Reader (66.21) and VisDoMRAG (65.01), with a +37.1 point improvement over no-retrieval baselines, demonstrating its effectiveness in complex multimodal QA tasks.
In the MMLongBench-Doc dataset, the approach with Qwen3-VL-32B attained 45.01% accuracy, surpassing the no-retrieval baseline (40.19%) by 4.82 points, validating its robustness in single long-document scenarios.
Ablation studies comparing different retrieval strategies (MMR, DPP, PPR) show that CDS consistently yields superior relevance-diversity balance, leading to higher answer accuracy across multiple datasets, especially in scenarios with high redundancy and complex multimodal relations.

Significance

This research addresses a fundamental bottleneck in long-document multimodal question answering: effective evidence selection amidst redundancy. By integrating graph-based global optimization with query constraints, the method significantly enhances the quality of retrieved evidence, directly translating into improved answer accuracy. It bridges the gap between structural graph theories and practical retrieval tasks, offering a scalable, training-free solution that can be readily integrated into existing systems. The approach’s ability to automatically balance relevance and diversity, without manual tuning, marks a substantial advancement in the field, promising broader applications in scientific literature review, legal document analysis, and enterprise knowledge management. Its emphasis on interpretability and efficiency aligns well with the demands of real-world deployment, paving the way for more intelligent, autonomous document understanding systems.

Technical Contribution

The core technical innovation lies in adapting the Constrained Dominant Set framework to the domain of multimodal long-document retrieval. Unlike traditional similarity ranking, the method formulates evidence selection as a quadratic optimization problem constrained by the query node, solved via replicator dynamics—an algorithm with theoretical guarantees of convergence to a global maximum. This approach inherently balances relevance and diversity without hyperparameter tuning, leveraging spectral bounds to automatically determine the trade-off. The graph construction integrates heterogeneous modalities into a unified affinity matrix, combining structural cues and semantic relations verified by vision-language models. The entire pipeline is training-free, relying solely on graph optimization, which enhances interpretability and computational efficiency. This represents a significant departure from prior heuristic or learned ranking methods, providing a theoretically grounded, scalable solution for complex multimodal retrieval tasks.

Novelty

This work is the first to apply the constrained dominant set approach to long-form multimodal document retrieval, integrating query constraints directly into a graph-based global optimization framework. Unlike previous methods that depend on greedy heuristics or learned similarity scores, this approach guarantees a balanced, optimal subset of evidence by solving a quadratic program with spectral bounds. The use of replicator dynamics as a solver for this problem is novel in the context of natural language processing and multimodal retrieval, offering a principled way to incorporate relevance and diversity simultaneously. The method’s training-free nature, combined with its ability to handle complex, heterogeneous graph structures, sets it apart from existing state-of-the-art systems, leading to superior performance on benchmarks like VisDoMBench.

Limitations

The effectiveness of the approach heavily depends on the quality of graph construction, particularly the accuracy of structural and semantic links; errors in graph modeling could impair retrieval performance.
Scalability may become an issue with extremely large or highly complex documents, as the quadratic optimization and replicator dynamics computations increase in complexity, potentially affecting real-time applicability.
Currently, the method is not integrated into an end-to-end trainable pipeline, limiting its adaptability and potential for joint optimization with downstream tasks such as answer generation. Future work should focus on embedding this graph optimization within learnable frameworks to enhance robustness and generalization.

Future Work

Future directions include integrating the CDS framework into end-to-end trainable models, enabling joint learning of graph construction and optimization parameters. Extending the method to handle dynamic or streaming documents, where content evolves over time, is another promising avenue. Additionally, exploring more sophisticated graph neural network architectures to enhance relation modeling, and combining this approach with reinforcement learning for adaptive evidence selection, could further improve performance. Broader application to real-world scenarios, such as legal case analysis or scientific literature review, will also be pursued, aiming to develop scalable, interpretable, and robust multimodal retrieval systems.

AI Executive Summary

Long, complex documents—such as scientific papers, legal texts, and enterprise reports—pose a significant challenge for question answering systems. Existing retrieval methods primarily rely on similarity ranking, which often leads to redundant evidence selection, especially in lengthy documents where key information recurs across figures, captions, and paragraphs. This redundancy hampers the ability of models to access diverse, relevant evidence, ultimately limiting answer accuracy and interpretability.

To address this, the paper introduces a novel graph-based evidence retrieval framework grounded in Constrained Dominant Sets (CDS). The core idea is to model document elements as nodes in a multimodal graph, where edges encode structural and semantic relationships. The query is incorporated as a hard constraint node, ensuring the selected evidence set remains directly relevant to the question. The affinity matrix combines query–node similarity and node–node dissimilarity, capturing both relevance and diversity. The evidence selection problem is formulated as a quadratic optimization task, which is solved using replicator dynamics—an iterative algorithm inspired by evolutionary game theory. This approach guarantees convergence to a global maximum, automatically balancing relevance and diversity without manual parameter tuning.

The significance of this work lies in its ability to automatically produce a mutually coherent, diverse evidence subset that is tightly anchored to the query. Extensive experiments on benchmarks such as VisDoMBench demonstrate that the proposed method achieves a new state-of-the-art average accuracy of 66.99, outperforming previous best systems by a substantial margin. The approach also shows robustness in single long-document scenarios, with consistent improvements over baseline retrieval strategies.

This research advances the field by bridging graph optimization techniques with multimodal information retrieval, providing a training-free, interpretable, and scalable solution. Its implications extend to scientific literature review, legal analysis, and enterprise knowledge management, where efficient and accurate evidence retrieval is critical. Looking ahead, integrating this graph-based optimization into end-to-end trainable models and exploring more complex multimodal relations will further enhance the capabilities of intelligent document understanding systems. Overall, this work marks a significant step toward more autonomous, precise, and comprehensive long-form multimodal question answering.

Deep Dive

Abstract

Long multimodal document question answering is limited by which evidence reaches the reader, rather than by the quantity retrieved. In lengthy documents, findings often recur across figures, captions, and introductory sentences, causing similarity based retrievers in modern multimodal retrieval-augmented generation (RAG) systems to allocate resources to near-duplicates while overlooking complementary evidence. This work introduces a retriever that selects evidence as a Constrained Dominant Set (CDS) on a query-augmented affinity graph, offering three advantages that similarity ranking does not. First, the query is encoded as a hard structural constraint, ensuring that every selected element is directly connected to the question through the cluster anchor. Second, the relevance-redundancy balance is determined automatically by a spectral bound, eliminating the need for manually tuned trade offs required by diversity-aware selectors. Third, the selection process achieves a global equilibrium via replicator dynamics, thereby avoiding the distortions introduced by greedy heuristics. The method is inherently graph-based and does not require training. Using a Qwen3-VL-32B reader, CDS establishes a new state of the art on VisDoMBench ($66.99$ average) and improves over the no-retrieval baseline by $37.1$ points on VisDoMBench and $4.8$ on MMLongBench-Doc.

cs.IR

References (20)

VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation

Manan Suri, Puneet Mathur, Franck Dernoncourt et al.

2024 35 citations ⭐ Influential View Analysis →

ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents

Qiuchen Wang, Ruixue Ding, Zehui Chen et al.

2025 56 citations ⭐ Influential View Analysis →

MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

Yubo Ma, Yuhang Zang, Liangyu Chen et al.

2024 139 citations ⭐ Influential View Analysis →

DocVQA: A Dataset for VQA on Document Images

Minesh Mathew, Dimosthenis Karatzas, R. Manmatha et al.

2020 1425 citations View Analysis →

ColPali: Efficient Document Retrieval with Vision Language Models

Manuel Faysse, Hugues Sibille, Tony Wu et al.

2024 150 citations View Analysis →

k-DPPs: Fixed-Size Determinantal Point Processes

Alex Kulesza, B. Taskar

2011 315 citations

MinerU: An Open-Source Solution for Precise Document Content Extraction

Bin Wang, Chaochao Xu, Xiaomeng Zhao et al.

2024 243 citations View Analysis →

Dominant-set clustering: A review

S. R. Bulò, M. Pelillo

2017 48 citations

Ieee Transactions on Pattern Analysis and Machine Intelligence Large-scale Image Geo-localization Using Dominant Sets

Eyasu Zemene

47 citations

Qwen2.5-VL Technical Report

Shuai Bai, Ke-qin Chen, Xuejing Liu et al.

2025 4830 citations View Analysis →

SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers

Shraman Pramanick, R. Chellappa, Subhashini Venugopalan

2024 88 citations View Analysis →

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Darren Edge, Ha Trinh, Newman Cheng et al.

2024 1549 citations View Analysis →

MemoryBank: Enhancing Large Language Models with Long-Term Memory

Wanjun Zhong, Lianghong Guo, Qi-Fei Gao et al.

2023 483 citations View Analysis →

LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating

Chao Deng, Jiale Yuan, Pi Bu et al.

2024 51 citations View Analysis →

Evolution towards the Maximum Clique

I. Bomze

1997 178 citations

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng et al.

2023 8986 citations View Analysis →

MemGPT: Towards LLMs as Operating Systems

Charles Packer, Vivian Fang, Shishir G. Patil et al.

2023 769 citations View Analysis →

Multi-target Tracking in Multiple Non-overlapping Cameras Using Fast-Constrained Dominant Sets

Yonatan Tariku Tesfaye, Eyasu Zemene, A. Prati et al.

2019 59 citations

Efficient Memory Management for Large Language Model Serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang et al.

2023 6386 citations View Analysis →

M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework

Yew Ken Chia, Liying Cheng, Hou Pong Chan et al.

2024 32 citations View Analysis →

Constrained Dominant Sets for Multimodal Document Question Answering

Key Findings

Methodology

Key Results

Significance

Technical Contribution

Novelty

Limitations

Future Work

AI Executive Summary

Deep Dive

Abstract

References (20)

Related Papers

ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval

ScholarQuest: A Taxonomy-Guided Benchmark for Agentic Academic Paper Search in Open Literature Environments

Do Generative Recommenders Deepen the Information Cocoon? A Closed-Loop Simulation with LLM-powered User Simulators

A Theoretical Framework for Risk Analysis of Stochastic Rankers

CQC-RAG: Robust Retrieval-Augmented Generation via Cross-Query Consistency

miniReranker: Efficient Multimodal Reranking through Visual Cache Reuse and Interaction Sparsity