Document-as-Image Representations Fall Short for Scientific Retrieval
Document-as-image representations underperform in scientific retrieval; interleaved text+image representations are more effective.
Key Findings
Methodology
This paper introduces a new benchmark, ArXivDoc, to analyze the effectiveness of different representation methods in scientific document retrieval. By constructing documents from the LaTeX sources of scientific papers, the study examines the performance of text, image, and multimodal representations in both single-vector and multi-vector retrieval models. A systematic comparison reveals that interleaved text+image representations outperform document-as-image representations without requiring specialized training.
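As a concrete illustration of what LaTeX-based construction enables, the sketch below pulls figure environments and their captions out of raw LaTeX source using Python's standard re module. This is a minimal, hypothetical example, not the authors' actual pipeline; the regular expressions and function names are assumptions.

```python
import re

# Hypothetical sketch (not the paper's pipeline): extract figure
# environments and their captions from raw LaTeX source, so figures
# can be indexed together with their textual context.
FIGURE_RE = re.compile(r"\\begin\{figure\*?\}(.*?)\\end\{figure\*?\}", re.DOTALL)
CAPTION_RE = re.compile(r"\\caption\{([^{}]*(?:\{[^{}]*\}[^{}]*)*)\}", re.DOTALL)
GRAPHIC_RE = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]+)\}")

def extract_figures(latex_source: str):
    """Return (image_path, caption) pairs found in a LaTeX document."""
    figures = []
    for body in FIGURE_RE.findall(latex_source):
        paths = GRAPHIC_RE.findall(body)       # referenced image files
        captions = CAPTION_RE.findall(body)    # caption text (one brace level)
        figures.append((paths[0] if paths else None,
                        captions[0].strip() if captions else ""))
    return figures

sample = r"""
\begin{figure}
  \includegraphics[width=\linewidth]{arch.pdf}
  \caption{Overview of the retrieval pipeline.}
\end{figure}
"""
print(extract_figures(sample))  # [('arch.pdf', 'Overview of the retrieval pipeline.')]
```

Because the source is structured, no OCR or layout inference is needed to recover which caption belongs to which figure.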
Key Results
- Result 1: Document-as-image representations consistently underperform across all query types, especially as document length increases. Even for image-based queries, text representations with VLM-generated captions perform better.
- Result 2: Text representations excel in image-based queries by leveraging captions and surrounding text context.
- Result 3: Interleaved text+image representations outperform document-as-image approaches without requiring specialized training, indicating that combining modalities is more robust than relying solely on rendered pages.
Significance
This study challenges the prevailing paradigm of document-as-image representations in scientific document retrieval, highlighting the advantages of text and multimodal representations in handling structured scientific documents. The findings have significant implications for both academia and industry, particularly in applications requiring precise retrieval and analysis of complex document content.
Technical Contribution
Technical contributions include the introduction of a new benchmark, ArXivDoc, providing a systematic comparison of text, image, and multimodal representations. The study demonstrates that interleaved text+image representations outperform document-as-image representations without requiring specialized training, revealing the potential of multimodal integration.
Novelty
This research is the first to systematically compare text, image, and multimodal representations in scientific document retrieval. Unlike existing work, it emphasizes the advantages of text and multimodal representations in handling structured scientific documents.
Limitations
- Limitation 1: The study focuses primarily on scientific documents, so its findings may not transfer to other document types.
- Limitation 2: The models and datasets used in the experiments may limit the generalizability of the results.
- Limitation 3: The specific needs of documents from different fields were not considered.
Future Work
Future directions include expanding the ArXivDoc benchmark to cover more domains and document types, and developing more advanced multimodal models to further enhance the performance of scientific document retrieval.
AI Executive Summary
In the field of scientific document retrieval, the traditional approach of using document-as-image representations faces challenges. Existing benchmarks, such as ArXivQA and ViDoRe, treat documents as images of pages, implicitly favoring representations that handle text-rich multimodal scientific documents poorly. This paper introduces a new benchmark, ArXivDoc, constructed from the LaTeX sources of scientific papers, providing a systematic comparison of text, image, and multimodal representations.
The introduction of the ArXivDoc benchmark allows researchers to directly access the structured elements of scientific documents, such as sections, tables, figures, and equations, enabling precise query construction based on specific evidence types. By systematically comparing text, image, and multimodal representations, the study finds that interleaved text+image representations outperform document-as-image representations without requiring specialized training.
Experimental results show that document-as-image representations consistently underperform across all query types, especially as document length increases. Even for image-based queries, text representations excel by leveraging captions and surrounding text context. Furthermore, interleaved text+image representations outperform document-as-image approaches without requiring specialized training, indicating that combining modalities is more robust than relying solely on rendered pages.
This research has significant implications for both academia and industry, particularly in applications requiring precise retrieval and analysis of complex document content. It challenges the prevailing paradigm of document-as-image representations in scientific document retrieval, highlighting the advantages of text and multimodal representations in handling structured scientific documents.
Future research directions include expanding the ArXivDoc benchmark to cover more domains and document types, and developing more advanced multimodal models to further enhance the performance of scientific document retrieval.
Deep Analysis
Background
Scientific document retrieval is a crucial area of research in information retrieval, aiming to locate evidence relevant to a query from a large collection of documents. Traditional retrieval systems often represent documents as plain text or images. However, scientific documents typically contain rich multimodal information, such as text, tables, figures, and equations, distributed in a structured manner. Recently, with the advancement of vision-language models (VLMs), there has been a growing interest in representing documents as images. However, this approach performs poorly when handling text-rich multimodal scientific documents, as it obscures the structural information of the document.
Core Problem
The core problem is how to effectively represent and retrieve multimodal information in scientific documents. Existing methods often treat documents as images, an approach that performs poorly on long documents and text-rich content. Moreover, it forces models to infer content boundaries and relationships, increasing the complexity of retrieval. Therefore, a representation method that preserves the structural information of documents is needed to improve retrieval performance.
Innovation
The core innovation of this paper lies in introducing a new benchmark, ArXivDoc, to analyze the effectiveness of different representation methods in scientific document retrieval. Specific innovations include:
1. Constructing documents from LaTeX sources, directly accessing structured elements such as sections, tables, figures, and equations.
2. Systematically comparing text, image, and multimodal representations in both single-vector and multi-vector retrieval models.
3. Demonstrating that interleaved text+image representations outperform document-as-image representations without requiring specialized training; a sketch of such an interleaved representation follows this list.
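To make point 3 concrete, here is a hypothetical sketch of an interleaved text+image document representation as a data structure: the document stays an ordered sequence of typed segments rather than a stack of rendered page images. The Segment class, its fields, and the example values are illustrative assumptions, not the paper's schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical segment type: each document element keeps its kind,
# its text (body, caption, or linearized table), and, for figures,
# a pointer to the image itself.
@dataclass
class Segment:
    kind: str                         # "text", "figure", or "table"
    text: str                         # body text, caption, or flattened table
    image_path: Optional[str] = None  # set only for figure segments

doc = [
    Segment("text", "We evaluate retrieval over structured documents."),
    Segment("figure", "Figure 1: nDCG@10 by document length.", "fig1.png"),
    Segment("table", "Model | nDCG@10\nColPali | 0.61"),
]

# Flatten into the interleaved input an embedding model would consume:
# text and table segments contribute text only, figures contribute
# both their caption and their pixels.
model_input = [(s.text, s.image_path) for s in doc]
print(model_input)
```

The point is only that segment order and element boundaries survive, which is exactly the structure that rendering pages to images discards.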
Methodology
The methodology of this paper includes the following steps:
- Dataset Construction: Build the ArXivDoc benchmark from the LaTeX sources of scientific papers, comprising 8,210 documents and 547 manually verified queries.
- Representation Comparison: Systematically compare text, image, and multimodal representations in both single-vector and multi-vector retrieval models.
- Experimental Design: Evaluate the retrieval performance of each representation with multiple embedding models, using nDCG@10 as the metric; a minimal sketch of nDCG@10 follows this list.
- Results Analysis: Analyze how each representation performs on text, table, and figure queries, revealing the advantages of interleaved text+image representations.
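The nDCG@10 metric named in the evaluation step is standard; the sketch below is a textbook NumPy formulation, not code from the paper.

```python
import numpy as np

def dcg(relevances, k: int = 10) -> float:
    """Discounted cumulative gain over the top-k ranked items."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(rank + 1)
    return float(np.sum(rel / discounts))

def ndcg_at_k(retrieved_rels, all_rels, k: int = 10) -> float:
    """nDCG@k: DCG of the system ranking divided by the ideal DCG."""
    ideal = np.sort(np.asarray(all_rels, dtype=float))[::-1]
    ideal_dcg = dcg(ideal, k)
    return dcg(retrieved_rels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: the single relevant document (grade 1) is ranked third.
print(round(ndcg_at_k([0, 0, 1, 0], [1]), 3))  # 0.5
```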
Experiments
The experimental design includes evaluating the retrieval performance of different representation methods using the ArXivDoc benchmark. The experiments use multiple embedding models, including text and image embedding models. Retrieval performance is evaluated using nDCG@10, and the results show that interleaved text+image representations outperform document-as-image representations without requiring specialized training. Additionally, the experiments analyze the performance of different representations in text, table, and figure queries.
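For intuition on the single-vector versus multi-vector distinction, the sketch below contrasts the two scoring schemes using random stand-in embeddings. The multi-vector score assumes a ColBERT-style MaxSim late interaction, which matches models like ColPali, but the choice is illustrative; a real system would use trained encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    """L2-normalize vectors along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Single-vector: the whole query and whole document each become one
# d-dimensional embedding; the score is a single cosine similarity.
q_vec = normalize(rng.normal(size=8))
d_vec = normalize(rng.normal(size=8))
single_score = float(q_vec @ d_vec)

# Multi-vector (late interaction): m query token vectors vs n document
# token vectors; each query token takes its best match in the document
# and the per-token maxima are summed (MaxSim).
Q = normalize(rng.normal(size=(4, 8)))    # 4 query tokens
D = normalize(rng.normal(size=(32, 8)))   # 32 document tokens
maxsim_score = float((Q @ D.T).max(axis=1).sum())

print(f"single-vector cosine: {single_score:.3f}")
print(f"multi-vector MaxSim:  {maxsim_score:.3f}")
```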
Results
Across all query types, document-as-image representations consistently underperform, and the gap widens as documents grow longer. Even for image-based queries, text representations lead by exploiting captions and surrounding context. Interleaved text+image representations likewise beat document-as-image approaches with no specialized training, showing that combining modalities is more robust than relying on rendered pages alone.
Applications
This research applies to the precise retrieval and analysis of scientific documents, particularly where complex document content must be handled. By preserving document structure, interleaved text+image representations improve retrieval performance in both academic research and industrial settings.
Limitations & Outlook
The limitations of this paper include:
1. The study focuses primarily on scientific documents, so its findings may not transfer to other document types.
2. The models and datasets used in the experiments may limit the generalizability of the results.
3. The specific needs of documents from different fields were not considered. Future research can expand the ArXivDoc benchmark to cover more domains and document types.
Plain Language (Accessible to non-experts)
Imagine you're in a library trying to find a book on a specific topic. The traditional method is to judge the content by the book's cover, which is like representing documents as images. While the cover can give you some information, you can't know the book's specific content. Now, suppose you can directly look at the book's table of contents and chapter titles, which is like using interleaved text+image representations. You can more accurately find the information you need because you can see the structure and content of the book. This is the core of this study: improving the accuracy of scientific document retrieval by preserving the structural information of documents.
ELI14 (Explained like you're 14)
Hey there! Imagine you're playing a treasure hunt game, and you need to find the treasure hidden in a big house. The traditional way is to guess the treasure's location by the house's appearance, just like treating documents as images. While you can see the house's exterior, you don't know the layout and details inside. Now, imagine you have a map of the house with each room and its contents labeled, which is like using interleaved text+image representations. You can find the treasure faster because you know what's in each room. That's the core of this study: improving the accuracy of scientific document retrieval by preserving the structural information of documents.
Glossary
Document-as-Image
A method of embedding rendered pages of documents as images. This approach performs poorly when handling text-rich scientific documents.
Used as input for vision-language models in scientific document retrieval.
Multimodal Representation
A representation method that combines information from multiple modalities, such as text and images.
Used to improve the accuracy of scientific document retrieval.
Vision-Language Model
A model capable of processing both images and text, typically used for multimodal tasks.
Used in this paper to generate descriptions of images.
LaTeX
A markup language used for typesetting scientific documents, capable of preserving the structural information of documents.
Used to construct the ArXivDoc benchmark.
nDCG@10
A metric used to evaluate retrieval performance, measuring the ranking of relevant documents in the retrieval results.
Used to evaluate the retrieval performance of different representation methods.
Single-Vector Model
A model that represents an entire document, or a part of one, as a single vector.
Used to compare the retrieval performance of different representation methods.
Multi-Vector Model
A model that represents a document as multiple vectors, typically used to capture more fine-grained information.
Used to compare the retrieval performance of different representation methods.
ArXivDoc
A new benchmark for analyzing the effectiveness of different representation methods in scientific document retrieval.
Proposed in this paper to evaluate text, image, and multimodal representations.
Interleaved Text+Image Representation
A representation method that combines text and images, preserving the structural information of documents.
Shown in this paper to outperform document-as-image representations.
OCR (Optical Character Recognition)
A technology that converts text in images into editable text.
Used when processing documents without structured sources.
Open Questions (Unanswered questions from this research)
- Open question 1: How can the retrieval performance of multimodal representations be improved without increasing computational cost? Existing methods struggle on long documents and call for more efficient models.
- Open question 2: How can the ArXivDoc benchmark be extended to more domains and document types? The current benchmark focuses on scientific documents, and its findings may not transfer elsewhere.
- Open question 3: How can text and image information be integrated more effectively in multimodal representations? Existing methods struggle with complex document content and call for more capable models.
Applications
Immediate Applications
Scientific Document Retrieval
Preserving the structural information of documents improves the accuracy of scientific document retrieval, benefiting both academic research and industrial applications.
Multimodal Information Processing
Combining information from text and images improves the efficiency of multimodal information processing, useful wherever complex document content must be handled.
Vision-Language Model Applications
Vision-language models can generate image descriptions, making image content usable and searchable as text.
Long-term Vision
Cross-Domain Document Retrieval
Expanding the ArXivDoc benchmark to cover more domains and document types would improve cross-domain document retrieval.
Advanced Multimodal Model Development
Developing more advanced multimodal models would further enhance scientific document retrieval, especially for applications that must handle complex document content.
Abstract
Many recent document embedding models are trained on document-as-image representations, embedding rendered pages as images rather than the underlying source. Meanwhile, existing benchmarks for scientific document retrieval, such as ArXivQA and ViDoRe, treat documents as images of pages, implicitly favoring such representations. In this work, we argue that this paradigm is not well-suited for text-rich multimodal scientific documents, where critical evidence is distributed across structured sources, including text, tables, and figures. To study this setting, we introduce ArXivDoc, a new benchmark constructed from the underlying LaTeX sources of scientific papers. Unlike PDF or image-based representations, LaTeX provides direct access to structured elements (e.g., sections, tables, figures, equations), enabling controlled query construction grounded in specific evidence types. We systematically compare text-only, image-based, and multimodal representations across both single-vector and multi-vector retrieval models. Our results show that: (1) document-as-image representations are consistently suboptimal, especially as document length increases; (2) text-based representations are most effective, even for figure-based queries, by leveraging captions and surrounding context; and (3) interleaved text+image representations outperform document-as-image approaches without requiring specialized training.
References (20)
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
Kuicai Dong, Yujing Chang, Derrick-Goh-Xin Deik et al.
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
Mingxin Li, Yanzhao Zhang, Dingkun Long et al.
ColPali: Efficient Document Retrieval with Vision Language Models
Manuel Faysse, Hugues Sibille, Tony Wu et al.
E5-V: Universal Embeddings with Multimodal Large Language Models
Ting Jiang, Minghui Song, Zihan Zhang et al.
Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark
Hao Guo, Xugong Qin, Jun Jie Ou Yang et al.
An Overview of the Tesseract OCR Engine
Raymond W. Smith
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan et al.
Llama-Embed-Nemotron-8B: A Universal Text Embedding Model for Multilingual and Cross-Lingual Tasks
Yauhen Babakhin, Radek Osmulski, Ronay Ak et al.
MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding
Siwei Han, Peng Xia, Ruiyi Zhang et al.
Glyph: Scaling Context Windows via Visual-Text Compression
Jiale Cheng, Yusen Liu, Xinyu Zhang et al.
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy et al.
MultiModalQA: Complex Question Answering over Text, Tables and Images
Alon Talmor, Ori Yoran, Amnon Catav et al.
ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval
Quentin Macé, António Loison, Manuel Faysse
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
Shi Yu, Chaoyue Tang, Bokai Xu et al.
Zero-shot Multimodal Document Retrieval via Cross-modal Question Generation
Yejin Choi, Jaewoo Park, Janghan Yoon et al.
Bridging Modalities: Improving Universal Multimodal Retrieval by Multimodal Large Language Models
Xin Zhang, Yanzhao Zhang, Wen Xie et al.
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
Cheng Cui, Ting Sun, Suyin Liang et al.
jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval
Michael Günther, Saba Sturua, Mohammad Kalim Akram et al.
PixelWorld: Towards Perceiving Everything as Pixels
Z. Lyu, Xueguang Ma, Wenhu Chen
Mitigating the Impact of False Negative in Dense Retrieval with Contrastive Confidence Regularization
Shiqi Wang, Yeqin Zhang, Cam-Tu Nguyen