NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval
NanoVDR distills a 2B vision-language retriever into a 70M text-only encoder for visual document retrieval, retaining 95.1% of teacher quality.
Key Findings
Methodology
NanoVDR employs an asymmetric knowledge-distillation framework, distilling a frozen 2B vision-language model (VLM) teacher into a 69M-parameter text-only student. Pointwise cosine alignment trains the student to reproduce the teacher's query embeddings, so the student represents queries directly in the teacher's visual embedding space. Training requires only pre-cached teacher query embeddings and no document-image processing. Cross-lingual transfer, identified as the main performance bottleneck, is addressed by augmenting the training data with machine-translated queries.
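A minimal sketch of the pointwise cosine-alignment objective described above. Function and variable names are illustrative, not from the paper, and a real training setup would use a deep-learning framework rather than NumPy:

```python
import numpy as np

def cosine_alignment_loss(student_q: np.ndarray, teacher_q: np.ndarray) -> float:
    """Pointwise cosine-alignment loss: 1 - cos(student, teacher),
    averaged over a batch of query embeddings. teacher_q holds
    pre-cached teacher query embeddings; no document images are
    touched during training."""
    s = student_q / np.linalg.norm(student_q, axis=1, keepdims=True)
    t = teacher_q / np.linalg.norm(teacher_q, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

# Perfectly aligned embeddings give zero loss; orthogonal ones give 1.
aligned = cosine_alignment_loss(np.eye(3), np.eye(3))
orthogonal = cosine_alignment_loss(np.eye(2), np.eye(2)[::-1])
```

Because the loss is pointwise (one student embedding against one cached teacher embedding), no in-batch negatives or document representations are needed, which is what keeps training cheap.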
Key Results
- NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1% of teacher quality and outperforms DSE-Qwen2 (2B) on ViDoRe v2 and v3, with 32× fewer parameters and 50× lower CPU query latency.
- Across 22 ViDoRe benchmark datasets, NanoVDR's pointwise cosine alignment method consistently outperforms ranking-based and contrastive alternatives on query text.
- With multilingual augmentation, NanoVDR-S-Multi improves performance on multilingual queries, notably increasing NDCG@5 by 9.3 points on Portuguese queries.
Significance
NanoVDR matters for visual document retrieval because it decouples visually complex documents from short text queries, sharply reducing query-time compute and latency. Beyond its academic contribution, this makes high-quality retrieval practical in industry settings that demand rapid response or lack GPUs, and the multilingual augmentation extends its reach to non-English queries.
Technical Contribution
NanoVDR's technical contribution is its asymmetric distillation framework. Unlike existing multi-vector VLM methods, it retrieves with single-vector cosine similarity, gaining efficiency and storage savings. The student is a pure text model with no vision module, yet the distillation crosses modalities into the teacher's visual embedding space, and multilingual augmentation addresses the cross-lingual transfer bottleneck.
Novelty
NanoVDR is the first to separate the visual and text encoding paths in visual document retrieval through an asymmetric distillation framework. This markedly improves query-time efficiency over symmetric VLM designs, and its pointwise cosine alignment objective outperforms ranking-based and contrastive distillation alternatives. Unlike the most closely related distillation work, NanoVDR's student requires no vision module at all.
Limitations
- NanoVDR's performance ceiling is determined by the teacher model's document embedding quality, so the student model cannot surpass the teacher's performance.
- While NanoVDR excels in text queries, it still relies on the teacher model's high-quality embeddings for complex visual content.
- The study does not explore reducing the offline indexing cost, which still requires the full 2B VLM teacher model to encode each document image.
Future Work
Future research directions include reducing the computational cost of offline indexing, for example through teacher-model compression or progressive indexing; testing whether the NanoVDR framework generalizes to other retrieval settings; and refining the multilingual augmentation strategy to cover more languages.
AI Executive Summary
Visual document retrieval (VDR) has achieved remarkable effectiveness in extracting information from visually rich documents. However, state-of-the-art systems often rely on large vision-language models (VLMs), which require high computational overhead at query time, especially for plain-text queries. NanoVDR addresses this issue by decoupling complex visual documents from simple text queries through an asymmetric distillation framework.
At the core of NanoVDR is its distillation method. A frozen 2B VLM teacher model handles offline document indexing, while a lightweight text student model handles online query encoding. Through pointwise cosine alignment, NanoVDR achieves efficient query encoding without processing document images. This reduces query-time compute and latency enough that query encoding runs on a CPU, with roughly 50× lower latency than the 2B teacher.
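The inference-time split can be sketched as follows. Names are illustrative; the document index here stands in for the teacher's single-vector page embeddings computed offline:

```python
import numpy as np

def top_k(query_emb: np.ndarray, doc_index: np.ndarray, k: int = 5):
    """Cosine-similarity search of a student query embedding against a
    precomputed teacher document index (one row per document page)."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_index / np.linalg.norm(doc_index, axis=1, keepdims=True)
    scores = d @ q                    # cosine similarity to every page
    order = np.argsort(-scores)[:k]   # best-scoring pages first
    return order, scores[order]

# Toy index of four "pages"; the query points in the same direction as page 2.
index = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])
ranks, scores = top_k(np.array([0.7, 0.7]), index, k=2)
```

Because both sides are single vectors, scoring is one matrix-vector product, which is what allows the whole online path to stay on CPU.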
In experiments, NanoVDR performs excellently across 22 ViDoRe benchmark datasets. NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1% of teacher quality while reducing parameters by 32× and CPU query latency by 50× compared to DSE-Qwen2 (2B). Additionally, with multilingual augmentation, NanoVDR improves performance on multilingual queries, notably increasing NDCG@5 by 9.3 points on Portuguese queries.
NanoVDR's research is significant in both academia and industry. It not only provides an efficient solution but also offers potential applications in scenarios requiring rapid response. With multilingual augmentation, NanoVDR demonstrates its potential for global applications, especially in multilingual environments.
However, NanoVDR has its limitations. Its performance ceiling is determined by the teacher model's document embedding quality, so the student model cannot surpass the teacher's performance. Additionally, the study does not explore reducing the offline indexing cost, which still requires the full 2B VLM teacher model to encode each document image. Future research directions include exploring ways to reduce the computational cost of offline indexing and whether the NanoVDR framework can be generalized to other retrieval settings.
Deep Analysis
Background
Visual document retrieval (VDR) has made significant progress in extracting information from visually rich documents in recent years. Traditional optical character recognition (OCR)-based text extraction often struggles with complex document structures, whereas vision-language models (VLMs) have significantly improved retrieval quality by treating document pages as images. Representative works include ColPali, which uses multi-vector late-interaction embeddings, and DSE, which uses single-vector dense embeddings. However, these systems typically require large VLMs for both query and document encoding, leading to high computational overhead and GPU dependence at query time, especially for plain-text queries.
Core Problem
Existing VLM retrievers are symmetric by design: the same multi-billion-parameter encoder serves both document indexing and query encoding. This imposes high latency and GPU dependence even on simple plain-text queries. Cross-lingual transfer is a further major bottleneck, particularly in multilingual environments.
Innovation
NanoVDR addresses these issues through an asymmetric distillation framework. Its core innovations include:
1. Asymmetric encoding paths: separating complex visual documents from simple text queries, using a frozen 2B VLM teacher model for offline document indexing and a lightweight text student model for online query encoding.
2. Pointwise cosine alignment: training the student model through this method to accurately represent queries in the teacher's visual space, significantly improving efficiency.
3. Multilingual augmentation: addressing cross-lingual transfer performance bottlenecks by augmenting training data with machine-translated queries, particularly in multilingual environments.
Methodology
NanoVDR's methodology includes the following key steps:
- A frozen VLM teacher model is used for offline document indexing, generating single-vector visual embeddings.
- A lightweight text student model is trained through pointwise cosine alignment to accurately represent queries in the teacher's visual space.
- Multilingual augmentation is implemented by adding machine-translated query data to address cross-lingual transfer performance bottlenecks.
- During training, only pre-cached teacher query embeddings are required, with no need to process document images.
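The augmentation step above could assemble training pairs as in this sketch. It rests on an assumption not spelled out in the summary, namely that a machine-translated query reuses the teacher embedding cached for its source query, so augmentation adds no extra teacher inference; all names are illustrative:

```python
def build_training_pairs(queries, cached_teacher_embs, translations):
    """Pair each query, and its machine-translated variants, with the
    pre-cached teacher embedding of the original query.
    (Assumption: translated queries share the source query's target
    embedding; the paper summary does not specify this detail.)"""
    pairs = []
    for q, emb in zip(queries, cached_teacher_embs):
        pairs.append((q, emb))
        for translated in translations.get(q, []):
            pairs.append((translated, emb))
    return pairs

# One English query with Portuguese and French machine translations
# yields three (text, embedding) training pairs.
pairs = build_training_pairs(
    ["quarterly revenue chart"],
    [[0.1, 0.9]],
    {"quarterly revenue chart": ["gráfico de receita trimestral",
                                 "graphique des revenus trimestriels"]},
)
```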
Experiments
NanoVDR's experimental design includes evaluation on 22 ViDoRe benchmark datasets, covering various document types and languages. Baseline models include multi-vector and single-vector VLM methods such as ColPali and DSE. The primary evaluation metric is NDCG@5, with ablation studies conducted to validate the effects of different distillation objectives. Key hyperparameters include the capacity of the student model and the amount of multilingual augmentation data.
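For reference, NDCG@5, the primary metric above, can be computed as in this sketch. This is a standard formulation of the metric, not code from the paper:

```python
import numpy as np

def ndcg_at_k(relevances, k: int = 5) -> float:
    """NDCG@k over a ranked list of graded relevance labels:
    DCG of the ranking divided by DCG of the ideal ranking."""
    rel = np.asarray(relevances, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float(np.sum((2.0 ** rel[:k] - 1.0) * discounts[:k]))
    ideal = np.sort(rel)[::-1]           # best possible ordering
    idcg = float(np.sum((2.0 ** ideal[:k] - 1.0) * discounts[:k]))
    return dcg / idcg if idcg > 0 else 0.0

perfect = ndcg_at_k([1, 1, 0, 0, 0])   # relevant pages ranked first
worst = ndcg_at_k([0, 0, 0, 1, 1])     # relevant pages ranked last
```

The log-based discount rewards placing relevant documents near the top, which is why NDCG@5 captures both relevance and ranking order.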
Results
Experimental results show that NanoVDR performs excellently across 22 ViDoRe benchmark datasets. NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1% of teacher quality while reducing parameters by 32× and CPU query latency by 50× compared to DSE-Qwen2 (2B). Additionally, with multilingual augmentation, NanoVDR improves performance on multilingual queries, notably increasing NDCG@5 by 9.3 points on Portuguese queries. Ablation studies indicate that pointwise cosine alignment consistently outperforms ranking-based and contrastive alternatives on query text.
Applications
NanoVDR's application scenarios include rapid-response visual document retrieval tasks such as retrieving key information from financial reports, scientific papers, and industrial manuals. Its efficient query encoding and low latency make it suitable for resource-constrained environments such as mobile devices and edge computing. Additionally, with multilingual augmentation, NanoVDR has broad application potential in multilingual environments.
Limitations & Outlook
NanoVDR's limitations include its performance ceiling being determined by the teacher model's document embedding quality, so the student model cannot surpass the teacher's performance. Additionally, the study does not explore reducing the offline indexing cost, which still requires the full 2B VLM teacher model to encode each document image. Future research directions include exploring ways to reduce the computational cost of offline indexing and whether the NanoVDR framework can be generalized to other retrieval settings.
Plain Language (Accessible to non-experts)
Imagine you're searching for a book in a massive library holding all sorts of books, some with very complex covers and others quite simple. Traditional methods are like reading every cover carefully to find the one you want. NanoVDR is like a librarian who has already memorized all the covers in advance: when you say which book you're looking for, they can point you to it in seconds, because at search time they only need to understand your request, not look at the covers again. This librarian also speaks multiple languages, so you can ask in different languages and still get a quick answer. That's what NanoVDR does in visual document retrieval: by indexing complex visual information ahead of time, it can find the needed document by processing only a simple text query.
ELI14 (Explained like you're 14)
Hey there! Have you ever wondered how your computer knows what you're looking for when you search for something? It's like being in a giant library looking for a book. Traditional methods are like opening every book to find the one you want. Way too slow, right? NanoVDR is like a super-smart library assistant who has memorized all the book information in advance. When you say which book you're looking for, they can tell you where it is in seconds! Plus, they speak multiple languages, so you can ask in different languages and they'll still find it quickly. That's the trick behind NanoVDR: it makes finding information in visual documents super fast and accurate.
Glossary
Visual Document Retrieval
A method for extracting information from visually rich documents, typically using vision-language models to encode document pages and queries.
Used in the paper to describe NanoVDR's application scenarios.
Knowledge Distillation
A method of transferring knowledge from a large model to a smaller one, often used to reduce computational overhead.
NanoVDR distills a 2B VLM teacher model into a 69M text student model.
Cross-modal
Involving interaction or conversion between different modalities, such as vision and text.
NanoVDR achieves cross-modal distillation by separating visual and text processing paths.
Multilingual Augmentation
A method of improving model performance in different languages by adding multilingual data.
NanoVDR addresses cross-lingual transfer bottlenecks through multilingual augmentation.
Pointwise Cosine Alignment
A training method that aligns embeddings by minimizing the cosine distance between student and teacher models.
NanoVDR uses pointwise cosine alignment to train the text student model.
Vision-Language Model
A model that processes both visual and text information, commonly used for tasks like visual document retrieval.
NanoVDR uses a frozen 2B VLM teacher model for offline document indexing.
Single-vector Embedding
A representation method that encodes a document or query as a single vector, often used to improve retrieval efficiency.
NanoVDR achieves efficient query encoding through single-vector embedding.
NDCG@5
A metric for evaluating the accuracy of information retrieval systems, considering both relevance and ranking order.
Used in the paper to evaluate NanoVDR's performance across different datasets.
GPU Dependence
The requirement for using graphics processing units (GPUs) for computation, typically for handling large models.
Traditional VLM methods require high computational overhead and GPU dependence at query time.
Ablation Study
A method of evaluating the impact of certain components on overall performance by removing or replacing them.
NanoVDR conducts ablation studies to validate the effects of different distillation objectives.
Open Questions (Unanswered questions from this research)
1. How can the computational cost of offline indexing be further reduced? Currently, NanoVDR still requires the full 2B VLM teacher model to encode each document image, which may limit its application in resource-constrained environments.
2. Can the NanoVDR framework be generalized to other retrieval settings? While it performs excellently in visual document retrieval, its applicability in other domains has yet to be verified.
3. How can multilingual augmentation strategies be further optimized? While NanoVDR improves performance through multilingual augmentation, there is still room for improvement in certain languages.
4. How can the parameter count of the student model be further reduced without affecting performance? The current NanoVDR-S-Multi is highly efficient, but there may be further room for optimization.
5. How can the performance of the student model be improved when handling complex visual content? While NanoVDR excels on text queries, it still relies on the teacher model's high-quality embeddings for complex visual content.
Applications
Immediate Applications
Financial Report Retrieval
NanoVDR can be used to quickly retrieve key information from financial reports, especially in scenarios requiring rapid response.
Scientific Paper Retrieval
With efficient query encoding, NanoVDR can quickly find relevant literature in scientific research.
Industrial Manual Retrieval
In industrial environments, NanoVDR can help engineers quickly find the necessary technical documents and operation manuals.
Long-term Vision
Global Information Retrieval
With multilingual augmentation, NanoVDR has the potential to play a significant role in global information retrieval, especially in multilingual environments.
Applications in Resource-constrained Environments
NanoVDR's efficiency makes it suitable for resource-constrained environments such as mobile devices and edge computing, where it may see widespread application in the future.
Abstract
Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. They require the same multi-billion parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query–document asymmetry by decoupling the two encoding paths: a frozen 2B VLM teacher indexes documents offline, while a distilled text-only student as small as 69M parameters encodes queries at inference. The key design choice is the distillation objective. Through systematic comparison of six objectives across three backbones and 22 ViDoRe benchmark datasets, we find that pointwise cosine alignment on query text consistently outperforms ranking-based and contrastive alternatives, while requiring only pre-cached teacher query embeddings and no document processing during training. Furthermore, we identify cross-lingual transfer as the primary performance bottleneck, and resolve it cheaply by augmenting training data with machine-translated queries. The resulting NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1% of teacher quality and outperforms DSE-Qwen2 (2B) on v2 and v3 with 32× fewer parameters and 50× lower CPU query latency, at a total training cost under 13 GPU-hours.
References (20)
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
Shi Yu, Chaoyue Tang, Bokai Xu et al.
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
Mingxin Li, Yanzhao Zhang, Dingkun Long et al.
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
Michael Tschannen, Alexey Gritsenko, Xiao Wang et al.
ModernVBERT: Towards Smaller Visual Document Retrievers
Paul Teiletche, Quentin Macé, Max Conti et al.
Cumulated gain-based evaluation of IR techniques
K. Järvelin, Jaana Kekäläinen
OPUS-MT – Building open translation services for the World
J. Tiedemann, Santhosh Thottingal
Distilling the Knowledge in a Neural Network
Geoffrey E. Hinton, O. Vinyals, J. Dean
Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval
Hao Sun, Yingyan Hou, Jiayan Guo et al.
ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval
Quentin Macé, António Loison, Manuel Faysse
CLIP-KD: An Empirical Study of CLIP Model Distillation
Chuanguang Yang, Zhulin An, Libo Huang et al.
Dense Passage Retrieval for Open-Domain Question Answering
Vladimir Karpukhin, Barlas Oğuz, Sewon Min et al.
Representation Learning with Contrastive Predictive Coding
Aäron van den Oord, Yazhe Li, O. Vinyals
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Victor Sanh, Lysandre Debut, Julien Chaumond et al.
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan et al.
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Nils Reimers, Iryna Gurevych
RankDistil: Knowledge Distillation for Ranking
Sashank J. Reddi, Rama Kumar Pasumarthi, A. Menon et al.
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
O. Khattab, M. Zaharia
Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling
Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang et al.
TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance
Kan Wu, Houwen Peng, Zhenghong Zhou et al.