POTATR: A Lightweight Image-to-Graph Model for Page-Level Table Extraction

TL;DR

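POTATR is a lightweight 29M-parameter image-to-graph model for page-level table extraction. On the PubTables-v2 Single Pages benchmark it reaches a $\textrm{GriTS}_\textrm{Con}$ of 0.964, outperforming all models tested (including frontier MLLMs) while running over 130 times faster at roughly 300 times lower cost.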

cs.CV | Advanced | 2026-06-09

Brandon Smock, Libin Liang, Max Sokolov, Amrit Ramesh, Valerie Faucon-Morin, Tayyibah Khanam, Maury Courtland


Document Understanding, Table Extraction, Deep Learning, Model Compression, Graph Neural Networks

Key Findings

Methodology

POTATR extends TATR, a Transformer encoder-decoder detector for table structure recognition (TSR), to full-page table extraction. It is initialized from pre-trained TSR weights, then adds page-specific object classes (such as captions, footers, and rotated tables) and a relation head that predicts directed edges between detected elements. Given a page image, the model outputs a spatially grounded bounding box for each element together with the hierarchical relationships among elements, which form a directed graph. Multi-head self-attention refines the image features, object queries detect the elements, and an MLP-based relation head predicts parent-child links. Training uses the PubTables-v2 dataset with a multi-task loss that combines classification, bounding box regression, and relation prediction terms, optimized in a multi-stage fine-tuning process.
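To make the relation head concrete, here is a minimal sketch of how an MLP could score directed parent-child edges between detected elements. It is an illustration only, not the authors' implementation: the function names (`make_mlp`, `edge_score`, `relation_scores`), the layer sizes, the random weights, and the choice to concatenate the two element embeddings are all assumptions.

```python
import math
import random


def make_mlp(in_dim, hidden_dim, seed=0):
    """Random weights for a two-layer MLP (a stand-in for trained parameters)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hidden_dim)]
    b1 = [0.0] * hidden_dim
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
    b2 = 0.0
    return w1, b1, w2, b2


def edge_score(mlp, parent_vec, child_vec):
    """Probability-like score for the directed edge parent -> child."""
    w1, b1, w2, b2 = mlp
    # Concatenation order encodes direction: (parent, child) != (child, parent).
    x = parent_vec + child_vec
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)  # ReLU layer
              for row, b in zip(w1, b1)]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid


def relation_scores(mlp, embeddings):
    """scores[i][j] is the score that element i is the parent of element j."""
    n = len(embeddings)
    return [[0.0 if i == j else edge_score(mlp, embeddings[i], embeddings[j])
             for j in range(n)]
            for i in range(n)]


# Hypothetical decoder outputs for three detected elements (4-dim embeddings).
embeddings = [[0.1, 0.2, 0.3, 0.4], [0.9, -0.3, 0.5, 0.0], [-0.2, 0.7, 0.1, 0.6]]
scores = relation_scores(make_mlp(in_dim=8, hidden_dim=6, seed=1), embeddings)
```

Every ordered pair is scored independently of the others, so all edges can be computed in one pass without autoregressive decoding.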

Key Results

On the PubTables-v2 Single Pages benchmark, POTATR achieves a $\textrm{GriTS}_\textrm{Con}$ score of 0.964, surpassing all tested models, including state-of-the-art multi-modal large language models (MLLMs), while running over 130 times faster at roughly 300 times lower cost.
When combined with external OCR systems like EasyOCR, PaddleOCR, and docTR, the model attains a caption text F1 score of 0.979, indicating high accuracy in both structure recognition and text extraction.
The spatial grounding of each element allows for visual verification and geometric text assignment, facilitating downstream tasks such as multi-page table merging and document scanning workflows.
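Geometric text assignment can be sketched as follows: each OCR word is assigned to the detected element whose bounding box covers most of the word's area. This is a minimal illustration under assumptions, not the paper's procedure. The function names, the `(x0, y0, x1, y1)` box format, the element names, and the 0.5 coverage threshold are all hypothetical.

```python
def overlap_area(a, b):
    """Area of intersection of two boxes given as (x0, y0, x1, y1)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def assign_words(words, elements, min_fraction=0.5):
    """Assign each OCR word to the element that covers most of its area.

    words: list of (text, bbox) pairs from an external OCR system.
    elements: dict mapping an element name to its predicted bbox.
    min_fraction: assumed minimum share of the word box that must be covered.
    """
    assigned = {name: [] for name in elements}
    for text, word_box in words:
        word_area = (word_box[2] - word_box[0]) * (word_box[3] - word_box[1])
        if word_area <= 0:
            continue  # skip degenerate OCR boxes
        best_name, best_fraction = None, 0.0
        for name, element_box in elements.items():
            fraction = overlap_area(word_box, element_box) / word_area
            if fraction > best_fraction:
                best_name, best_fraction = name, fraction
        if best_name is not None and best_fraction >= min_fraction:
            assigned[best_name].append(text)
    return assigned


# Hypothetical predicted boxes and OCR output for one page.
elements = {"caption": (0, 0, 100, 20), "cell_0_0": (0, 30, 50, 50)}
words = [("Table", (5, 5, 30, 15)), ("1", (32, 5, 38, 15)),
         ("42", (10, 35, 20, 45)), ("stray", (200, 200, 220, 210))]
result = assign_words(words, elements)
```

Words that fall outside every predicted box, like `"stray"` here, are left unassigned, which is also what makes the output easy to verify visually.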

Significance

This work addresses the high computational cost and limited scalability of existing large document understanding models. A small, non-autoregressive model with high accuracy makes large-scale deployment of document processing systems more practical. Its spatially grounded, hierarchical graph output improves interpretability and robustness, particularly for scanned documents where OCR errors are common. Because the design is modular, POTATR composes with other models, such as external OCR, to build flexible document analysis pipelines, making high-precision table extraction more accessible for applications in industry and research.

Technical Contribution

The key technical contributions include:
• Extending the TATR architecture from table structure recognition to full-page, page-level extraction by adding page-specific object classes and a relation head;
• Incorporating spatial bounding box predictions to ground elements in the image, enabling visual verification and geometric reasoning;
• Utilizing a Transformer encoder-decoder framework with multi-head self-attention for robust feature modeling;
• Leveraging pre-trained TSR weights for initialization, significantly boosting performance and training efficiency;
• Designing a relation head based on an MLP to predict directed parent-child edges, forming a hierarchical graph that captures complex relationships among page elements.
Together, these contributions enable a lightweight model capable of high-accuracy, real-time page-level table extraction.
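The kind of output these contributions describe, grounded elements plus directed parent-child edges, can be pictured with a small data structure. The sketch below is illustrative only: the class names (`PageElement`, `PageGraph`), the labels, and the choice of a table as the parent of its caption and footer are assumptions, not the model's actual output schema.

```python
from dataclasses import dataclass, field


@dataclass
class PageElement:
    label: str    # e.g. "table", "caption", "footer" (illustrative labels)
    bbox: tuple   # (x0, y0, x1, y1) in page-image coordinates


@dataclass
class PageGraph:
    elements: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (parent_index, child_index)

    def children(self, parent):
        return [c for p, c in self.edges if p == parent]

    def roots(self):
        has_parent = {c for _, c in self.edges}
        return [i for i in range(len(self.elements)) if i not in has_parent]

    def outline(self):
        """Indented listing of the hierarchy; assumes the edges are acyclic."""
        lines = []

        def visit(index, depth):
            lines.append("  " * depth + self.elements[index].label)
            for child in self.children(index):
                visit(child, depth + 1)

        for root in self.roots():
            visit(root, 0)
        return lines


# Hypothetical page with one table that owns a caption and a footer.
graph = PageGraph(
    elements=[PageElement("table", (50, 120, 550, 400)),
              PageElement("caption", (50, 90, 550, 115)),
              PageElement("footer", (50, 405, 550, 430))],
    edges=[(0, 1), (0, 2)],
)
print("\n".join(graph.outline()))
```

Keeping the bounding box on every node is what allows each element in the hierarchy to be drawn back onto the page image for verification.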

Novelty

This work is novel in that it introduces a lightweight, non-autoregressive Transformer-based model specifically designed for full-page table extraction, integrating spatial grounding with hierarchical relation prediction. Unlike prior models such as Relationformer and EGTR, which rely on deformable DETR architectures and symmetric adjacency matrices, POTATR leverages the original DETR architecture with a simplified relation head, enabling the use of pre-trained TSR weights and achieving superior performance with fewer parameters. Its ability to predict directed hierarchical relations and incorporate page-level classes like captions and footers distinguishes it from existing approaches, offering a comprehensive and efficient solution for complex document layouts.
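A toy example shows why directed relation scores matter for hierarchy. The sketch below decodes one parent per child from a score matrix and compares the result with a symmetrized version of the same scores. The decoding rule (best-scoring parent above a threshold), the threshold value, and the scores themselves are assumptions made for illustration and are not taken from the paper.

```python
def decode_parents(scores, threshold=0.5):
    """Pick at most one parent per child.

    scores[i][j] is the score that element i is the parent of element j.
    Returns a list of (parent, child) edges.
    """
    n = len(scores)
    edges = []
    for child in range(n):
        best_parent, best_score = None, threshold
        for parent in range(n):
            if parent != child and scores[parent][child] > best_score:
                best_parent, best_score = parent, scores[parent][child]
        if best_parent is not None:
            edges.append((best_parent, child))
    return edges


def symmetrize(scores):
    """Collapse directed scores into an undirected (symmetric) adjacency."""
    n = len(scores)
    return [[max(scores[i][j], scores[j][i]) for j in range(n)] for i in range(n)]


# Hypothetical scores: element 0 (say, a table) is the parent of elements 1 and 2.
scores = [
    [0.00, 0.90, 0.80],
    [0.10, 0.00, 0.20],
    [0.05, 0.30, 0.00],
]
directed_edges = decode_parents(scores)                # [(0, 1), (0, 2)]
undirected_edges = decode_parents(symmetrize(scores))  # also contains (1, 0)
```

With the directed scores the hierarchy is unambiguous. After symmetrization the same rule also yields the edge (1, 0), so parent and child can no longer be told apart without extra information.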

Limitations

The model is primarily trained and evaluated on English scientific articles from PubMed, and its generalization to other languages, document types, or non-scientific content remains unverified, potentially limiting its applicability in diverse real-world scenarios.
Handling multi-page tables and cross-page relationships is limited; the current architecture does not explicitly model inter-page connections, which are common in complex documents.
In extremely complex or degraded scanned documents, the spatial and relational predictions may degrade in accuracy, especially under poor image quality or severe layout distortions.
While lightweight, the model still requires GPU resources for real-time inference, which may be a constraint in resource-limited environments.

Future Work

Future research will focus on extending the model’s capabilities to multi-page and multi-modal documents, incorporating cross-page merging strategies, and improving robustness against degraded scans. Additionally, exploring domain adaptation techniques to generalize beyond scientific articles, as well as integrating more sophisticated OCR and language understanding modules, will be key directions. Further optimization for edge deployment and real-time processing in large-scale systems is also planned, aiming to make the technology more accessible and scalable across various industries.

AI Executive Summary

In today’s digital era, the rapid and accurate extraction of structured information from vast amounts of unstructured documents remains a significant challenge. Traditional rule-based or heuristic methods often falter when faced with complex layouts, multi-modal content, or scanned images with poor quality. Recent advances in deep learning, particularly Transformer-based models, have shown promising results in document understanding tasks, including table detection and structure recognition. However, these models tend to be prohibitively large, computationally intensive, and costly to deploy at scale, limiting their practical utility in real-world applications.

Addressing these limitations, Brandon Smock and colleagues introduce POTATR (Page-Object Table Transformer), a novel lightweight model designed explicitly for page-level table extraction. With only 29 million parameters, POTATR leverages a Transformer encoder-decoder architecture, extending the capabilities of the existing TATR model to full-page scenarios. Its core innovation lies in transforming the traditional sequential text generation paradigm into a parallelized spatial graph prediction task, enabling efficient and accurate recognition of tables, captions, footers, and their hierarchical relationships.

The model’s architecture integrates a pre-trained TSR backbone, which provides a strong initialization for table structure recognition. It adds page-specific object classes, including rotated tables and page elements like captions and footers, and employs a relation head to predict directed edges between elements. This design results in a spatially grounded, hierarchical graph representation of page content, facilitating visual verification and geometric text assignment. The use of self-attention mechanisms enhances feature modeling, while the relation head captures complex parent-child relationships, forming a comprehensive understanding of page layout.

Experimental results on the PubTables-v2 Single Pages benchmark show strong performance. POTATR achieves a $\textrm{GriTS}_\textrm{Con}$ score of 0.964, outperforming all models tested, including large multimodal language models, while running over 130 times faster at roughly 300 times lower inference cost. When combined with external OCR systems, it attains a caption text F1 score of 0.979. Its spatial grounding capability allows for visual validation, making it suitable for downstream tasks such as multi-page table merging and document digitization workflows.

This research signifies a major step forward in scalable, efficient document understanding. By reducing model size and computational demands without sacrificing accuracy, POTATR opens new avenues for deploying intelligent document processing systems across industries. Its modular design enables seamless integration with other models, fostering flexible and extensible pipelines. Future work will explore multi-page and multi-modal extensions, cross-page relationship modeling, and domain adaptation to broaden applicability. Overall, POTATR exemplifies how innovative model design can bridge the gap between academic research and industrial needs, bringing high-precision, low-cost document analysis within reach for large-scale applications.

Deep Dive

Abstract

Large-scale document processing requires contextually aware table extraction (TE) that is both accurate and efficient. Yet current approaches require billions of parameters, hundreds of autoregressive steps, or costly API inference. Motivated by this, we introduce the Page-Object Table Transformer (POTATR), a lightweight 29M parameter image-to-graph model that extends the Table Transformer (TATR) for contextualized page-level TE. POTATR outperforms all models tested on the PubTables-v2 Single Pages benchmark -- including frontier MLLMs -- achieving $\textrm{GriTS}_\textrm{Con}$ of 0.964 while running over 130$\times$ faster at roughly 300$\times$ lower cost. Further, POTATR's output is spatially grounded: every recognized element has a bounding box, enabling visual verification and geometric text assignment. As a result, POTATR performs unified page-level TE while composing with other models, enabling extension to scanned documents via external OCR and to full-document TE via techniques like cross-page merging. Code and models will be released.
