Document-as-Image Representations Fall Short for Scientific Retrieval
Document-as-image representations underperform in scientific retrieval; interleaved text+image representations are more effective.
Key Findings
Methodology
This paper introduces a new benchmark, ArXivDoc, to analyze the effectiveness of different representation methods in scientific document retrieval. By constructing documents from the LaTeX sources of scientific papers, the study examines the performance of text, image, and multimodal representations in both single-vector and multi-vector retrieval models. A systematic comparison reveals that interleaved text+image representations outperform document-as-image representations without requiring specialized training.
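As a concrete illustration of what LaTeX-based construction enables, the sketch below pulls figure environments and their captions out of raw LaTeX source using Python's standard re module. This is a minimal, hypothetical example, not the authors' actual pipeline; the regular expressions and function names are assumptions.

```python
import re

# Hypothetical sketch (not the paper's pipeline): extract figure
# environments and their captions from raw LaTeX source, so figures
# can be indexed together with their textual context.
FIGURE_RE = re.compile(r"\\begin\{figure\*?\}(.*?)\\end\{figure\*?\}", re.DOTALL)
CAPTION_RE = re.compile(r"\\caption\{([^{}]*(?:\{[^{}]*\}[^{}]*)*)\}", re.DOTALL)
GRAPHIC_RE = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]+)\}")

def extract_figures(latex_source: str):
    """Return (image_path, caption) pairs found in a LaTeX document."""
    figures = []
    for body in FIGURE_RE.findall(latex_source):
        paths = GRAPHIC_RE.findall(body)       # referenced image files
        captions = CAPTION_RE.findall(body)    # caption text (one brace level)
        figures.append((paths[0] if paths else None,
                        captions[0].strip() if captions else ""))
    return figures

sample = r"""
\begin{figure}
  \includegraphics[width=\linewidth]{arch.pdf}
  \caption{Overview of the retrieval pipeline.}
\end{figure}
"""
print(extract_figures(sample))  # [('arch.pdf', 'Overview of the retrieval pipeline.')]
```

Because the source is structured, no OCR or layout inference is needed to recover which caption belongs to which figure.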
Key Results
- Result 1: Document-as-image representations consistently underperform across all query types, especially as document length increases. Even for image-based queries, text representations with VLM-generated captions perform better.
- Result 2: Text representations excel in image-based queries by leveraging captions and surrounding text context.
- Result 3: Interleaved text+image representations outperform document-as-image approaches without requiring specialized training, indicating that combining modalities is more robust than relying solely on rendered pages.
Significance
This study challenges the prevailing paradigm of document-as-image representations in scientific document retrieval, highlighting the advantages of text and multimodal representations in handling structured scientific documents. The findings have significant implications for both academia and industry, particularly in applications requiring precise retrieval and analysis of complex document content.
Technical Contribution
Technical contributions include the introduction of a new benchmark, ArXivDoc, providing a systematic comparison of text, image, and multimodal representations. The study demonstrates that interleaved text+image representations outperform document-as-image representations without requiring specialized training, revealing the potential of multimodal integration.
Novelty
This research is the first to systematically compare text, image, and multimodal representations in scientific document retrieval. Unlike existing work, it emphasizes the advantages of text and multimodal representations in handling structured scientific documents.
Limitations
- Limitation 1: The study focuses primarily on scientific documents, so its findings may not transfer to other document types.
- Limitation 2: The models and datasets used in the experiments may limit the generalizability of the results.
- Limitation 3: The specific needs of documents from different fields were not considered.
Future Work
Future directions include expanding the ArXivDoc benchmark to cover more domains and document types, and developing more advanced multimodal models to further enhance the performance of scientific document retrieval.
AI Executive Summary
In the field of scientific document retrieval, the traditional approach of using document-as-image representations faces challenges. Existing benchmarks, such as ArXivQA and ViDoRe, treat documents as images of pages, implicitly favoring representations that handle text-rich multimodal scientific documents poorly. This paper introduces a new benchmark, ArXivDoc, constructed from the LaTeX sources of scientific papers, providing a systematic comparison of text, image, and multimodal representations.
The introduction of the ArXivDoc benchmark allows researchers to directly access the structured elements of scientific documents, such as sections, tables, figures, and equations, enabling precise query construction based on specific evidence types. By systematically comparing text, image, and multimodal representations, the study finds that interleaved text+image representations outperform document-as-image representations without requiring specialized training.
Experimental results show that document-as-image representations consistently underperform across all query types, especially as document length increases. Even for image-based queries, text representations excel by leveraging captions and surrounding text context. Furthermore, interleaved text+image representations outperform document-as-image approaches without requiring specialized training, indicating that combining modalities is more robust than relying solely on rendered pages.
This research has significant implications for both academia and industry, particularly in applications requiring precise retrieval and analysis of complex document content. It challenges the prevailing paradigm of document-as-image representations in scientific document retrieval, highlighting the advantages of text and multimodal representations in handling structured scientific documents.
Future research directions include expanding the ArXivDoc benchmark to cover more domains and document types, and developing more advanced multimodal models to further enhance the performance of scientific document retrieval.
Deep Analysis
Background
Scientific document retrieval is a crucial area of research in information retrieval, aiming to locate evidence relevant to a query from a large collection of documents. Traditional retrieval systems often represent documents as plain text or images. However, scientific documents typically contain rich multimodal information, such as text, tables, figures, and equations, distributed in a structured manner. Recently, with the advancement of vision-language models (VLMs), there has been a growing interest in representing documents as images. However, this approach performs poorly when handling text-rich multimodal scientific documents, as it obscures the structural information of the document.
Core Problem
The core problem is how to effectively represent and retrieve multimodal information in scientific documents. Existing methods often treat documents as images, an approach that performs poorly on long documents and text-rich content. Moreover, it forces models to infer content boundaries and relationships, increasing the complexity of retrieval. Therefore, a representation method that preserves the structural information of documents is needed to improve retrieval performance.
Innovation
The core innovation of this paper lies in introducing a new benchmark, ArXivDoc, to analyze the effectiveness of different representation methods in scientific document retrieval. Specific innovations include:
1. Constructing documents from LaTeX sources, directly accessing structured elements such as sections, tables, figures, and equations.
2. Systematically comparing text, image, and multimodal representations in both single-vector and multi-vector retrieval models.
3. Demonstrating that interleaved text+image representations outperform document-as-image representations without requiring specialized training; a sketch of such an interleaved representation follows this list.
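To make point 3 concrete, here is a hypothetical sketch of an interleaved text+image document representation as a data structure: the document stays an ordered sequence of typed segments rather than a stack of rendered page images. The Segment class, its fields, and the example values are illustrative assumptions, not the paper's schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical segment type: each document element keeps its kind,
# its text (body, caption, or linearized table), and, for figures,
# a pointer to the image itself.
@dataclass
class Segment:
    kind: str                         # "text", "figure", or "table"
    text: str                         # body text, caption, or flattened table
    image_path: Optional[str] = None  # set only for figure segments

doc = [
    Segment("text", "We evaluate retrieval over structured documents."),
    Segment("figure", "Figure 1: nDCG@10 by document length.", "fig1.png"),
    Segment("table", "Model | nDCG@10\nColPali | 0.61"),
]

# Flatten into the interleaved input an embedding model would consume:
# text and table segments contribute text only, figures contribute
# both their caption and their pixels.
model_input = [(s.text, s.image_path) for s in doc]
print(model_input)
```

The point is only that segment order and element boundaries survive, which is exactly the structure that rendering pages to images discards.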
Methodology
The methodology of this paper includes the following steps:
- Dataset Construction: Build the ArXivDoc benchmark from the LaTeX sources of scientific papers, comprising 8,210 documents and 547 manually verified queries.
- Representation Comparison: Systematically compare text, image, and multimodal representations in both single-vector and multi-vector retrieval models.
- Experimental Design: Evaluate the retrieval performance of each representation with multiple embedding models, using nDCG@10 as the metric; a minimal sketch of nDCG@10 follows this list.
- Results Analysis: Analyze how each representation performs on text, table, and figure queries, revealing the advantages of interleaved text+image representations.
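The nDCG@10 metric named in the evaluation step is standard; the sketch below is a textbook NumPy formulation, not code from the paper.

```python
import numpy as np

def dcg(relevances, k: int = 10) -> float:
    """Discounted cumulative gain over the top-k ranked items."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(rank + 1)
    return float(np.sum(rel / discounts))

def ndcg_at_k(retrieved_rels, all_rels, k: int = 10) -> float:
    """nDCG@k: DCG of the system ranking divided by the ideal DCG."""
    ideal = np.sort(np.asarray(all_rels, dtype=float))[::-1]
    ideal_dcg = dcg(ideal, k)
    return dcg(retrieved_rels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: the single relevant document (grade 1) is ranked third.
print(round(ndcg_at_k([0, 0, 1, 0], [1]), 3))  # 0.5
```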
Experiments
The experimental design includes evaluating the retrieval performance of different representation methods using the ArXivDoc benchmark. The experiments use multiple embedding models, including text and image embedding models. Retrieval performance is evaluated using nDCG@10, and the results show that interleaved text+image representations outperform document-as-image representations without requiring specialized training. Additionally, the experiments analyze the performance of different representations in text, table, and figure queries.
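For intuition on the single-vector versus multi-vector distinction, the sketch below contrasts the two scoring schemes using random stand-in embeddings. The multi-vector score assumes a ColBERT-style MaxSim late interaction, which matches models like ColPali, but the choice is illustrative; a real system would use trained encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    """L2-normalize vectors along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Single-vector: the whole query and whole document each become one
# d-dimensional embedding; the score is a single cosine similarity.
q_vec = normalize(rng.normal(size=8))
d_vec = normalize(rng.normal(size=8))
single_score = float(q_vec @ d_vec)

# Multi-vector (late interaction): m query token vectors vs n document
# token vectors; each query token takes its best match in the document
# and the per-token maxima are summed (MaxSim).
Q = normalize(rng.normal(size=(4, 8)))    # 4 query tokens
D = normalize(rng.normal(size=(32, 8)))   # 32 document tokens
maxsim_score = float((Q @ D.T).max(axis=1).sum())

print(f"single-vector cosine: {single_score:.3f}")
print(f"multi-vector MaxSim:  {maxsim_score:.3f}")
```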
Results
Across all query types, document-as-image representations consistently underperform, and the gap widens as documents grow longer. Even for image-based queries, text representations lead by exploiting captions and surrounding context. Interleaved text+image representations likewise beat document-as-image approaches with no specialized training, showing that combining modalities is more robust than relying on rendered pages alone.
Applications
This research applies to the precise retrieval and analysis of scientific documents, particularly where complex document content must be handled. By preserving document structure, interleaved text+image representations improve retrieval performance in both academic research and industrial settings.
Limitations & Outlook
The limitations of this paper include:
1. The study focuses primarily on scientific documents, so its findings may not transfer to other document types.
2. The models and datasets used in the experiments may limit the generalizability of the results.
3. The specific needs of documents from different fields were not considered. Future research can expand the ArXivDoc benchmark to cover more domains and document types.
Plain Language (Accessible to non-experts)
Imagine you're in a library trying to find a book on a specific topic. The traditional method is to judge the content by the book's cover, which is like representing documents as images. While the cover can give you some information, you can't know the book's specific content. Now, suppose you can directly look at the book's table of contents and chapter titles, which is like using interleaved text+image representations. You can more accurately find the information you need because you can see the structure and content of the book. This is the core of this study: improving the accuracy of scientific document retrieval by preserving the structural information of documents.
ELI14 (Explained like you're 14)
Hey there! Imagine you're playing a treasure hunt game, and you need to find the treasure hidden in a big house. The traditional way is to guess the treasure's location by the house's appearance, just like treating documents as images. While you can see the house's exterior, you don't know the layout and details inside. Now, imagine you have a map of the house with each room and its contents labeled, which is like using interleaved text+image representations. You can find the treasure faster because you know what's in each room. That's the core of this study: improving the accuracy of scientific document retrieval by preserving the structural information of documents.
Glossary
Document-as-Image
A method of embedding rendered pages of documents as images. This approach performs poorly when handling text-rich scientific documents.
Used as input for vision-language models in scientific document retrieval.
Multimodal Representation
A representation method that combines information from multiple modalities, such as text and images.
Used to improve the accuracy of scientific document retrieval.
Vision-Language Model
A model capable of processing both images and text, typically used for multimodal tasks.
Used in this paper to generate descriptions of images.
LaTeX
A markup language used for typesetting scientific documents, capable of preserving the structural information of documents.
Used to construct the ArXivDoc benchmark.
nDCG@10
A metric used to evaluate retrieval performance, measuring the ranking of relevant documents in the retrieval results.
Used to evaluate the retrieval performance of different representation methods.
Single-Vector Model
A model that represents an entire document, or a part of one, as a single vector.
Used to compare the retrieval performance of different representation methods.
Multi-Vector Model
A model that represents a document as multiple vectors, typically used to capture more fine-grained information.
Used to compare the retrieval performance of different representation methods.
ArXivDoc
A new benchmark for analyzing the effectiveness of different representation methods in scientific document retrieval.
Proposed in this paper to evaluate text, image, and multimodal representations.
Interleaved Text+Image Representation
A representation method that combines text and images, preserving the structural information of documents.
Shown in this paper to outperform document-as-image representations.
OCR (Optical Character Recognition)
A technology that converts text in images into editable text.
Used when processing documents without structured sources.
Open Questions (Unanswered questions from this research)
- Open question 1: How can the retrieval performance of multimodal representations be improved without increasing computational cost? Existing methods struggle on long documents and call for more efficient models.
- Open question 2: How can the ArXivDoc benchmark be extended to more domains and document types? The current benchmark focuses on scientific documents, and its findings may not transfer elsewhere.
- Open question 3: How can text and image information be integrated more effectively in multimodal representations? Existing methods struggle with complex document content and call for more capable models.
Applications
Immediate Applications
Scientific Document Retrieval
Preserving the structural information of documents improves the accuracy of scientific document retrieval, benefiting both academic research and industrial applications.
Multimodal Information Processing
Combining information from text and images improves the efficiency of multimodal information processing, useful wherever complex document content must be handled.
Vision-Language Model Applications
Vision-language models can generate image descriptions, making image content usable and searchable as text.
Long-term Vision
Cross-Domain Document Retrieval
Expanding the ArXivDoc benchmark to cover more domains and document types would improve cross-domain document retrieval.
Advanced Multimodal Model Development
Developing more advanced multimodal models would further enhance scientific document retrieval, especially for applications that must handle complex document content.
Abstract
Many recent document embedding models are trained on document-as-image representations, embedding rendered pages as images rather than the underlying source. Meanwhile, existing benchmarks for scientific document retrieval, such as ArXivQA and ViDoRe, treat documents as images of pages, implicitly favoring such representations. In this work, we argue that this paradigm is not well-suited for text-rich multimodal scientific documents, where critical evidence is distributed across structured sources, including text, tables, and figures. To study this setting, we introduce ArXivDoc, a new benchmark constructed from the underlying LaTeX sources of scientific papers. Unlike PDF or image-based representations, LaTeX provides direct access to structured elements (e.g., sections, tables, figures, equations), enabling controlled query construction grounded in specific evidence types. We systematically compare text-only, image-based, and multimodal representations across both single-vector and multi-vector retrieval models. Our results show that: (1) document-as-image representations are consistently suboptimal, especially as document length increases; (2) text-based representations are most effective, even for figure-based queries, by leveraging captions and surrounding context; and (3) interleaved text+image representations outperform document-as-image approaches without requiring specialized training.
References (20)
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
Kuicai Dong, Yujing Chang, Derrick-Goh-Xin Deik et al.
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
Mingxin Li, Yanzhao Zhang, Dingkun Long et al.
ColPali: Efficient Document Retrieval with Vision Language Models
Manuel Faysse, Hugues Sibille, Tony Wu et al.
E5-V: Universal Embeddings with Multimodal Large Language Models
Ting Jiang, Minghui Song, Zihan Zhang et al.
Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark
Hao Guo, Xugong Qin, Jun Jie Ou Yang et al.
An Overview of the Tesseract OCR Engine
Raymond W. Smith
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan et al.
Llama-Embed-Nemotron-8B: A Universal Text Embedding Model for Multilingual and Cross-Lingual Tasks
Yauhen Babakhin, Radek Osmulski, Ronay Ak et al.
MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding
Siwei Han, Peng Xia, Ruiyi Zhang et al.
Glyph: Scaling Context Windows via Visual-Text Compression
Jiale Cheng, Yusen Liu, Xinyu Zhang et al.
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy et al.
MultiModalQA: Complex Question Answering over Text, Tables and Images
Alon Talmor, Ori Yoran, Amnon Catav et al.
ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval
Quentin Macé, António Loison, Manuel Faysse
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
Shi Yu, Chaoyue Tang, Bokai Xu et al.
Zero-shot Multimodal Document Retrieval via Cross-modal Question Generation
Yejin Choi, Jaewoo Park, Janghan Yoon et al.
Bridging Modalities: Improving Universal Multimodal Retrieval by Multimodal Large Language Models
Xin Zhang, Yanzhao Zhang, Wen Xie et al.
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
Cheng Cui, Ting Sun, Suyin Liang et al.
jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval
Michael Günther, Saba Sturua, Mohammad Kalim Akram et al.
PixelWorld: Towards Perceiving Everything as Pixels
Z. Lyu, Xueguang Ma, Wenhu Chen
Mitigating the Impact of False Negative in Dense Retrieval with Contrastive Confidence Regularization
Shiqi Wang, Yeqin Zhang, Cam-Tu Nguyen