NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval
NanoVDR distills a 2B vision-language retriever into a 70M text-only encoder for visual document retrieval, retaining 95.1% of teacher quality.
Key Findings
Methodology
NanoVDR employs an asymmetric knowledge-distillation framework, distilling a frozen 2B vision-language model (VLM) teacher into a 69M-parameter text-only student. Pointwise cosine alignment trains the student to reproduce the teacher's query embeddings, so the student represents queries directly in the teacher's visual embedding space. Training requires only pre-cached teacher query embeddings and no document-image processing. Cross-lingual transfer, identified as the main performance bottleneck, is addressed by augmenting the training data with machine-translated queries.
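A minimal sketch of the pointwise cosine-alignment objective described above. Function and variable names are illustrative, not from the paper, and a real training setup would use a deep-learning framework rather than NumPy:

```python
import numpy as np

def cosine_alignment_loss(student_q: np.ndarray, teacher_q: np.ndarray) -> float:
    """Pointwise cosine-alignment loss: 1 - cos(student, teacher),
    averaged over a batch of query embeddings. teacher_q holds
    pre-cached teacher query embeddings; no document images are
    touched during training."""
    s = student_q / np.linalg.norm(student_q, axis=1, keepdims=True)
    t = teacher_q / np.linalg.norm(teacher_q, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

# Perfectly aligned embeddings give zero loss; orthogonal ones give 1.
aligned = cosine_alignment_loss(np.eye(3), np.eye(3))
orthogonal = cosine_alignment_loss(np.eye(2), np.eye(2)[::-1])
```

Because the loss is pointwise (one student embedding against one cached teacher embedding), no in-batch negatives or document representations are needed, which is what keeps training cheap.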
Key Results
- NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1% of teacher quality and outperforms DSE-Qwen2 (2B) on ViDoRe v2 and v3, with 32× fewer parameters and 50× lower CPU query latency.
- Across 22 ViDoRe benchmark datasets, NanoVDR's pointwise cosine alignment method consistently outperforms ranking-based and contrastive alternatives on query text.
- With multilingual augmentation, NanoVDR-S-Multi improves performance on multilingual queries, notably increasing NDCG@5 by 9.3 points on Portuguese queries.
Significance
NanoVDR matters for visual document retrieval because it decouples visually complex documents from short text queries, sharply reducing query-time compute and latency. Beyond its academic contribution, this makes high-quality retrieval practical in industry settings that demand rapid response or lack GPUs, and the multilingual augmentation extends its reach to non-English queries.
Technical Contribution
NanoVDR's technical contribution is its asymmetric distillation framework. Unlike existing multi-vector VLM methods, it retrieves with single-vector cosine similarity, gaining efficiency and storage savings. The student is a pure text model with no vision module, yet the distillation crosses modalities into the teacher's visual embedding space, and multilingual augmentation addresses the cross-lingual transfer bottleneck.
Novelty
NanoVDR is the first to separate the visual and text encoding paths in visual document retrieval through an asymmetric distillation framework. This markedly improves query-time efficiency over symmetric VLM designs, and its pointwise cosine alignment objective outperforms ranking-based and contrastive distillation alternatives. Unlike the most closely related distillation work, NanoVDR's student requires no vision module at all.
Limitations
- NanoVDR's performance ceiling is determined by the teacher model's document embedding quality, so the student model cannot surpass the teacher's performance.
- While NanoVDR excels in text queries, it still relies on the teacher model's high-quality embeddings for complex visual content.
- The study does not explore reducing the offline indexing cost, which still requires the full 2B VLM teacher model to encode each document image.
Future Work
Future research directions include reducing the computational cost of offline indexing, for example through teacher-model compression or progressive indexing; testing whether the NanoVDR framework generalizes to other retrieval settings; and refining the multilingual augmentation strategy to cover more languages.
AI Executive Summary
Visual document retrieval (VDR) has achieved remarkable effectiveness in extracting information from visually rich documents. However, state-of-the-art systems often rely on large vision-language models (VLMs), which require high computational overhead at query time, especially for plain-text queries. NanoVDR addresses this issue by decoupling complex visual documents from simple text queries through an asymmetric distillation framework.
At the core of NanoVDR is its distillation method. A frozen 2B VLM teacher model handles offline document indexing, while a lightweight text student model handles online query encoding. Through pointwise cosine alignment, NanoVDR achieves efficient query encoding without processing document images. This reduces query-time compute and latency enough that query encoding runs on a CPU, with roughly 50× lower latency than the 2B teacher.
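The inference-time split can be sketched as follows. Names are illustrative; the document index here stands in for the teacher's single-vector page embeddings computed offline:

```python
import numpy as np

def top_k(query_emb: np.ndarray, doc_index: np.ndarray, k: int = 5):
    """Cosine-similarity search of a student query embedding against a
    precomputed teacher document index (one row per document page)."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_index / np.linalg.norm(doc_index, axis=1, keepdims=True)
    scores = d @ q                    # cosine similarity to every page
    order = np.argsort(-scores)[:k]   # best-scoring pages first
    return order, scores[order]

# Toy index of four "pages"; the query points in the same direction as page 2.
index = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])
ranks, scores = top_k(np.array([0.7, 0.7]), index, k=2)
```

Because both sides are single vectors, scoring is one matrix-vector product, which is what allows the whole online path to stay on CPU.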
In experiments, NanoVDR performs excellently across 22 ViDoRe benchmark datasets. NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1% of teacher quality while reducing parameters by 32× and CPU query latency by 50× compared to DSE-Qwen2 (2B). Additionally, with multilingual augmentation, NanoVDR improves performance on multilingual queries, notably increasing NDCG@5 by 9.3 points on Portuguese queries.
NanoVDR's research is significant in both academia and industry. It not only provides an efficient solution but also offers potential applications in scenarios requiring rapid response. With multilingual augmentation, NanoVDR demonstrates its potential for global applications, especially in multilingual environments.
However, NanoVDR has its limitations. Its performance ceiling is determined by the teacher model's document embedding quality, so the student model cannot surpass the teacher's performance. Additionally, the study does not explore reducing the offline indexing cost, which still requires the full 2B VLM teacher model to encode each document image. Future research directions include exploring ways to reduce the computational cost of offline indexing and whether the NanoVDR framework can be generalized to other retrieval settings.
Deep Analysis
Background
Visual document retrieval (VDR) has made significant progress in extracting information from visually rich documents in recent years. Traditional optical character recognition (OCR)-based text extraction often struggles with complex document structures, whereas vision-language models (VLMs) have significantly improved retrieval quality by treating document pages as images. Representative works include ColPali, which uses multi-vector late-interaction embeddings, and DSE, which uses single-vector dense embeddings. However, these systems typically require large VLMs for both query and document encoding, leading to high computational overhead and GPU dependence at query time, especially for plain-text queries.
Core Problem
Existing VLM retrievers are symmetric by design: the same multi-billion-parameter encoder serves both document indexing and query encoding. This imposes high latency and GPU dependence even on simple plain-text queries. Cross-lingual transfer is a further major bottleneck, particularly in multilingual environments.
Innovation
NanoVDR addresses these issues through an asymmetric distillation framework. Its core innovations include:
1. Asymmetric encoding paths: separating complex visual documents from simple text queries, using a frozen 2B VLM teacher model for offline document indexing and a lightweight text student model for online query encoding.
2. Pointwise cosine alignment: training the student model through this method to accurately represent queries in the teacher's visual space, significantly improving efficiency.
3. Multilingual augmentation: addressing cross-lingual transfer performance bottlenecks by augmenting training data with machine-translated queries, particularly in multilingual environments.
Methodology
NanoVDR's methodology includes the following key steps:
- A frozen VLM teacher model is used for offline document indexing, generating single-vector visual embeddings.
- A lightweight text student model is trained through pointwise cosine alignment to accurately represent queries in the teacher's visual space.
- Multilingual augmentation is implemented by adding machine-translated query data to address cross-lingual transfer performance bottlenecks.
- During training, only pre-cached teacher query embeddings are required, with no need to process document images.
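The augmentation step above could assemble training pairs as in this sketch. It rests on an assumption not spelled out in the summary, namely that a machine-translated query reuses the teacher embedding cached for its source query, so augmentation adds no extra teacher inference; all names are illustrative:

```python
def build_training_pairs(queries, cached_teacher_embs, translations):
    """Pair each query, and its machine-translated variants, with the
    pre-cached teacher embedding of the original query.
    (Assumption: translated queries share the source query's target
    embedding; the paper summary does not specify this detail.)"""
    pairs = []
    for q, emb in zip(queries, cached_teacher_embs):
        pairs.append((q, emb))
        for translated in translations.get(q, []):
            pairs.append((translated, emb))
    return pairs

# One English query with Portuguese and French machine translations
# yields three (text, embedding) training pairs.
pairs = build_training_pairs(
    ["quarterly revenue chart"],
    [[0.1, 0.9]],
    {"quarterly revenue chart": ["gráfico de receita trimestral",
                                 "graphique des revenus trimestriels"]},
)
```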
Experiments
NanoVDR's experimental design includes evaluation on 22 ViDoRe benchmark datasets, covering various document types and languages. Baseline models include multi-vector and single-vector VLM methods such as ColPali and DSE. The primary evaluation metric is NDCG@5, with ablation studies conducted to validate the effects of different distillation objectives. Key hyperparameters include the capacity of the student model and the amount of multilingual augmentation data.
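For reference, NDCG@5, the primary metric above, can be computed as in this sketch. This is a standard formulation of the metric, not code from the paper:

```python
import numpy as np

def ndcg_at_k(relevances, k: int = 5) -> float:
    """NDCG@k over a ranked list of graded relevance labels:
    DCG of the ranking divided by DCG of the ideal ranking."""
    rel = np.asarray(relevances, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float(np.sum((2.0 ** rel[:k] - 1.0) * discounts[:k]))
    ideal = np.sort(rel)[::-1]           # best possible ordering
    idcg = float(np.sum((2.0 ** ideal[:k] - 1.0) * discounts[:k]))
    return dcg / idcg if idcg > 0 else 0.0

perfect = ndcg_at_k([1, 1, 0, 0, 0])   # relevant pages ranked first
worst = ndcg_at_k([0, 0, 0, 1, 1])     # relevant pages ranked last
```

The log-based discount rewards placing relevant documents near the top, which is why NDCG@5 captures both relevance and ranking order.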
Results
Experimental results show that NanoVDR performs excellently across 22 ViDoRe benchmark datasets. NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1% of teacher quality while reducing parameters by 32× and CPU query latency by 50× compared to DSE-Qwen2 (2B). Additionally, with multilingual augmentation, NanoVDR improves performance on multilingual queries, notably increasing NDCG@5 by 9.3 points on Portuguese queries. Ablation studies indicate that pointwise cosine alignment consistently outperforms ranking-based and contrastive alternatives on query text.
Applications
NanoVDR's application scenarios include rapid-response visual document retrieval tasks such as retrieving key information from financial reports, scientific papers, and industrial manuals. Its efficient query encoding and low latency make it suitable for resource-constrained environments such as mobile devices and edge computing. Additionally, with multilingual augmentation, NanoVDR has broad application potential in multilingual environments.
Limitations & Outlook
NanoVDR's limitations include its performance ceiling being determined by the teacher model's document embedding quality, so the student model cannot surpass the teacher's performance. Additionally, the study does not explore reducing the offline indexing cost, which still requires the full 2B VLM teacher model to encode each document image. Future research directions include exploring ways to reduce the computational cost of offline indexing and whether the NanoVDR framework can be generalized to other retrieval settings.
Plain Language (Accessible to non-experts)
Imagine you're searching for a book in a massive library holding all sorts of books, some with very complex covers and others quite simple. Traditional methods are like reading every cover carefully to find the one you want. NanoVDR is like a librarian who has already memorized all the covers in advance: when you say which book you're looking for, they can point you to it in seconds, because at search time they only need to understand your request, not look at the covers again. This librarian also speaks multiple languages, so you can ask in different languages and still get a quick answer. That's what NanoVDR does in visual document retrieval: by indexing complex visual information ahead of time, it can find the needed document by processing only a simple text query.
ELI14 (Explained like you're 14)
Hey there! Have you ever wondered how your computer knows what you're looking for when you search for something? It's like being in a giant library looking for a book. Traditional methods are like opening every book to find the one you want. Way too slow, right? NanoVDR is like a super-smart library assistant who has memorized all the book information in advance. When you say which book you're looking for, they can tell you where it is in seconds! Plus, they speak multiple languages, so you can ask in different languages and they'll still find it quickly. That's the trick behind NanoVDR: it makes finding information in visual documents super fast and accurate.
Glossary
Visual Document Retrieval
A method for extracting information from visually rich documents, typically using vision-language models to encode document pages and queries.
Used in the paper to describe NanoVDR's application scenarios.
Knowledge Distillation
A method of transferring knowledge from a large model to a smaller one, often used to reduce computational overhead.
NanoVDR distills a 2B VLM teacher model into a 69M text student model.
Cross-modal
Involving interaction or conversion between different modalities, such as vision and text.
NanoVDR achieves cross-modal distillation by separating visual and text processing paths.
Multilingual Augmentation
A method of improving model performance in different languages by adding multilingual data.
NanoVDR addresses cross-lingual transfer bottlenecks through multilingual augmentation.
Pointwise Cosine Alignment
A training method that aligns embeddings by minimizing the cosine distance between student and teacher models.
NanoVDR uses pointwise cosine alignment to train the text student model.
Vision-Language Model
A model that processes both visual and text information, commonly used for tasks like visual document retrieval.
NanoVDR uses a frozen 2B VLM teacher model for offline document indexing.
Single-vector Embedding
A representation method that encodes a document or query as a single vector, often used to improve retrieval efficiency.
NanoVDR achieves efficient query encoding through single-vector embedding.
NDCG@5
A metric for evaluating the accuracy of information retrieval systems, considering both relevance and ranking order.
Used in the paper to evaluate NanoVDR's performance across different datasets.
GPU Dependence
The requirement for using graphics processing units (GPUs) for computation, typically for handling large models.
Traditional VLM methods require high computational overhead and GPU dependence at query time.
Ablation Study
A method of evaluating the impact of certain components on overall performance by removing or replacing them.
NanoVDR conducts ablation studies to validate the effects of different distillation objectives.
Open Questions (Unanswered questions from this research)
1. How can the computational cost of offline indexing be further reduced? Currently, NanoVDR still requires the full 2B VLM teacher model to encode each document image, which may limit its application in resource-constrained environments.
2. Can the NanoVDR framework be generalized to other retrieval settings? While it performs excellently in visual document retrieval, its applicability in other domains has yet to be verified.
3. How can multilingual augmentation strategies be further optimized? While NanoVDR improves performance through multilingual augmentation, there is still room for improvement in certain languages.
4. How can the parameter count of the student model be further reduced without affecting performance? The current NanoVDR-S-Multi is highly efficient, but there may be further room for optimization.
5. How can the performance of the student model be improved when handling complex visual content? While NanoVDR excels on text queries, it still relies on the teacher model's high-quality embeddings for complex visual content.
Applications
Immediate Applications
Financial Report Retrieval
NanoVDR can be used to quickly retrieve key information from financial reports, especially in scenarios requiring rapid response.
Scientific Paper Retrieval
With efficient query encoding, NanoVDR can quickly find relevant literature in scientific research.
Industrial Manual Retrieval
In industrial environments, NanoVDR can help engineers quickly find the necessary technical documents and operation manuals.
Long-term Vision
Global Information Retrieval
With multilingual augmentation, NanoVDR has the potential to play a significant role in global information retrieval, especially in multilingual environments.
Applications in Resource-constrained Environments
NanoVDR's efficiency makes it suitable for resource-constrained environments such as mobile devices and edge computing, where it may see widespread application in the future.
Abstract
Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. They require the same multi-billion parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query–document asymmetry by decoupling the two encoding paths: a frozen 2B VLM teacher indexes documents offline, while a distilled text-only student as small as 69M parameters encodes queries at inference. The key design choice is the distillation objective. Through systematic comparison of six objectives across three backbones and 22 ViDoRe benchmark datasets, we find that pointwise cosine alignment on query text consistently outperforms ranking-based and contrastive alternatives, while requiring only pre-cached teacher query embeddings and no document processing during training. Furthermore, we identify cross-lingual transfer as the primary performance bottleneck, and resolve it cheaply by augmenting training data with machine-translated queries. The resulting NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1% of teacher quality and outperforms DSE-Qwen2 (2B) on v2 and v3 with 32× fewer parameters and 50× lower CPU query latency, at a total training cost under 13 GPU-hours.
References (20)
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
Shi Yu, Chaoyue Tang, Bokai Xu et al.
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
Mingxin Li, Yanzhao Zhang, Dingkun Long et al.
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
Michael Tschannen, Alexey Gritsenko, Xiao Wang et al.
ModernVBERT: Towards Smaller Visual Document Retrievers
Paul Teiletche, Quentin Macé, Max Conti et al.
Cumulated gain-based evaluation of IR techniques
K. Järvelin, Jaana Kekäläinen
OPUS-MT – Building open translation services for the World
J. Tiedemann, Santhosh Thottingal
Distilling the Knowledge in a Neural Network
Geoffrey E. Hinton, O. Vinyals, J. Dean
Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval
Hao Sun, Yingyan Hou, Jiayan Guo et al.
ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval
Quentin Macé, António Loison, Manuel Faysse
CLIP-KD: An Empirical Study of CLIP Model Distillation
Chuanguang Yang, Zhulin An, Libo Huang et al.
Dense Passage Retrieval for Open-Domain Question Answering
Vladimir Karpukhin, Barlas Oğuz, Sewon Min et al.
Representation Learning with Contrastive Predictive Coding
Aäron van den Oord, Yazhe Li, O. Vinyals
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Victor Sanh, Lysandre Debut, Julien Chaumond et al.
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan et al.
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Nils Reimers, Iryna Gurevych
RankDistil: Knowledge Distillation for Ranking
Sashank J. Reddi, Rama Kumar Pasumarthi, A. Menon et al.
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
O. Khattab, M. Zaharia
Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling
Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang et al.
TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance
Kan Wu, Houwen Peng, Zhenghong Zhou et al.