GLM-OCR Technical Report
GLM-OCR combines a CogViT visual encoder with a GLM language decoder to improve both the efficiency and the accuracy of document understanding.
Key Findings
Methodology
GLM-OCR integrates a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, forming a compact multimodal model. The model employs a Multi-Token Prediction (MTP) mechanism, predicting multiple tokens per step, significantly improving decoding throughput while maintaining low memory overhead through parameter sharing. At the system level, a two-stage pipeline is adopted: PP-DocLayout-V3 performs layout analysis, followed by parallel region-level recognition.
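The two-stage pipeline can be sketched as follows. This is a minimal illustration, not the actual system: `detect_layout` and `recognize_region` are hypothetical stand-ins for PP-DocLayout-V3 and the GLM-OCR recognizer, shown only to make concrete how layout analysis feeds parallel region-level recognition.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Region:
    kind: str    # e.g. "text", "table", "formula"
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates

def detect_layout(page):
    """Hypothetical stand-in for PP-DocLayout-V3: returns structured regions."""
    return [Region("text", (0, 0, 100, 40)), Region("table", (0, 50, 100, 120))]

def recognize_region(page, region):
    """Hypothetical stand-in for the region-level recognizer."""
    return f"<{region.kind} at {region.bbox}>"

def parse_page(page):
    # Stage 1: layout analysis produces independent regions.
    regions = detect_layout(page)
    # Stage 2: regions are recognized in parallel, since they do not depend
    # on each other once the layout is known.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda r: recognize_region(page, r), regions))
```

Because stage 1 decouples the regions, stage 2 parallelizes trivially; the real system exploits the same independence at batch level.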
Key Results
- On OmniDocBench v1.5, GLM-OCR achieved an overall score of 94.6, surpassing many large multimodal models.
- It scored 94.0 on OCRBench (Text) and 96.5 on UniMERNet, demonstrating its excellence in text recognition and formula transcription.
- It achieved 85.2 on PubTabNet, showing its competitiveness in table structure recovery.
Significance
GLM-OCR excels in document parsing, text and formula transcription, table structure recovery, and key information extraction, making it suitable for resource-constrained edge deployment and large-scale production systems. Its compact architecture and structured generation address the performance bottlenecks of traditional OCR systems in complex layouts and diverse document formats.
Technical Contribution
GLM-OCR significantly enhances the efficiency and performance of document understanding tasks by introducing the Multi-Token Prediction (MTP) mechanism and a two-stage pipeline architecture. Compared to existing large multimodal models, GLM-OCR achieves high recognition performance while greatly reducing computational costs and memory consumption.
Novelty
GLM-OCR is the first to introduce the Multi-Token Prediction (MTP) mechanism in OCR tasks, addressing the inefficiency of traditional autoregressive generation in deterministic OCR tasks. Its performance improvement is particularly notable in long structured outputs like tables compared to existing methods.
Limitations
- GLM-OCR may still face challenges when dealing with extremely complex document layouts, especially during the layout analysis phase.
- In multilingual environments, the model may require further fine-tuning to ensure high accuracy.
Future Work
Future work can focus on further optimizing the model's performance on multilingual and multi-format documents, and exploring more efficient parameter-sharing mechanisms to further reduce computational costs.
AI Executive Summary
GLM-OCR is an efficient 0.9B-parameter compact multimodal model designed for real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, achieving a strong balance between computational efficiency and recognition performance. To address the inefficiency of standard autoregressive decoding in deterministic OCR tasks, GLM-OCR introduces a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step, significantly improving decoding throughput while keeping memory overhead low through shared parameters. At the system level, a two-stage pipeline is adopted: PP-DocLayout-V3 first performs layout analysis, followed by parallel region-level recognition. Extensive evaluations on public benchmarks and industrial scenarios show that GLM-OCR achieves competitive or state-of-the-art performance in document parsing, text and formula transcription, table structure recovery, and key information extraction. Its compact architecture and structured generation make it suitable for both resource-constrained edge deployment and large-scale production systems.
On OmniDocBench v1.5, GLM-OCR achieved an overall score of 94.6, surpassing many large multimodal models. It scored 94.0 on OCRBench (Text) and 96.5 on UniMERNet, demonstrating its excellence in text recognition and formula transcription. It achieved 85.2 on PubTabNet, showing its competitiveness in table structure recovery. Additionally, on information extraction benchmarks such as Nanonets-KIE and Handwritten-Forms, GLM-OCR's performance is comparable to significantly larger general multimodal models.
Beyond public benchmarks, GLM-OCR was evaluated on six high-frequency real-world scenarios, including code document parsing, natural-scene table recognition, handwritten text recognition, multilingual OCR, seal recognition, and receipt KIE. GLM-OCR consistently delivers strong results across all settings, achieving 91.5 on real-world table recognition, 90.5 on seal recognition, and 94.5 on receipt KIE. These results indicate that GLM-OCR generalizes beyond curated benchmarks and remains effective under practical production conditions.
GLM-OCR's compact parameter scale makes it highly optimized for localized inference and resource-constrained environments. The model supports efficient deployment across mainstream frameworks, including vLLM, SGLang, and Ollama. To facilitate seamless integration, a comprehensive SDK is provided for end-to-end document parsing workflows.
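Since vLLM and SGLang expose an OpenAI-compatible HTTP endpoint, a document image can be sent inline as a base64 data URL. The sketch below only builds such a request payload; the served model id `glm-ocr` and the prompt are placeholder assumptions, and the actual POST to `/v1/chat/completions` is left out.

```python
import base64
import json

def build_ocr_request(image_bytes, model="glm-ocr", prompt="Transcribe this document."):
    """Build an OpenAI-compatible /v1/chat/completions payload with an inline image.

    The model name and prompt are placeholders; check the model id your
    server actually registers.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

payload = build_ocr_request(b"\x89PNG...")  # truncated bytes, illustration only
body = json.dumps(payload)                  # POST this body to /v1/chat/completions
```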
For cloud-based deployments, GLM-OCR is accessible via a MaaS API. The service employs a highly cost-effective, unified pricing model, significantly reducing operational overhead and decreasing processing costs to approximately one-tenth of those associated with traditional OCR solutions. GLM-OCR also supports direct fine-tuning using the LLaMA-Factory framework to meet specific domain adaptation or enhanced task performance needs.
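Fine-tuning with LLaMA-Factory is driven by a YAML config passed to `llamafactory-cli train`. The fragment below is a minimal sketch in that style, assuming a LoRA SFT run; the checkpoint path, dataset name, and chat template are placeholders to be replaced with your own.

```python
from pathlib import Path

# Hypothetical SFT config following LLaMA-Factory's YAML conventions.
config = """\
stage: sft
model_name_or_path: ./glm-ocr          # placeholder local checkpoint
finetuning_type: lora
dataset: my_receipts_kie               # placeholder custom dataset
template: default                      # replace with the model's chat template
output_dir: ./glm-ocr-lora
num_train_epochs: 3
"""

Path("glm_ocr_sft.yaml").write_text(config)
# then: llamafactory-cli train glm_ocr_sft.yaml
```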
Deep Analysis
Background
Document understanding is a core capability in modern information systems, supporting the extraction and structuring of knowledge from visually rich and layout-intensive documents such as financial reports, scientific articles, contracts, and invoices. Traditional OCR systems mainly focus on plain text transcription and rely on multi-stage pipelines with handcrafted rules for layout parsing and downstream information extraction. While effective for simple scenarios, these approaches often struggle with complex layouts, diverse document formats, and real-world production requirements. Recent multimodal large language models (MLLMs) unify visual perception and language understanding within a single framework and significantly improve document understanding performance. However, their large model size and autoregressive decoding paradigm lead to high computational cost, slow inference, and substantial memory consumption, which makes large-scale deployment under high-concurrency or edge environments challenging.
Core Problem
In practical production systems, document intelligence solutions must simultaneously provide: strong performance on complex content such as tables, formulas, code, and seals; high-throughput and low-latency inference; and flexible integration and domain adaptation. GLM-OCR is developed to address these system-level requirements within a unified multimodal framework.
Innovation
GLM-OCR is built on the GLM-V encoder-decoder framework, combining a 0.4B-scale CogViT visual encoder trained on large-scale image-text data, a lightweight cross-modal connector, and a 0.5B-scale GLM language decoder. The entire model contains only 0.9B parameters, enabling high-throughput and low-latency inference while maintaining strong recognition performance. Beyond architectural optimization, GLM-OCR also considers the mismatch between conventional autoregressive generation and the characteristics of OCR tasks. OCR is inherently a deterministic task with strong local dependencies and explicit structural supervision, where strictly autoregressive token-by-token decoding is inefficient. Therefore, we introduce Multi-Token Prediction (MTP) into both training and inference. MTP enables the simultaneous prediction of multiple tokens, substantially improving training efficiency and decoding throughput while preserving recognition accuracy, and is particularly advantageous for long structured outputs such as tables.
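The decoding loop behind MTP can be illustrated with a toy draft-and-verify sketch. Here `propose` and `verify` are hypothetical stand-ins for the MTP heads and the main decoder (which share parameters in the real model); this is a generic illustration of multi-token decoding, not GLM-OCR's actual implementation.

```python
def mtp_decode(propose, verify, prompt, max_len, k=10):
    """Toy multi-token prediction loop: propose k tokens, keep the verified prefix.

    propose(out, k) -> up to k candidate next tokens in a single step.
    verify(out, draft) -> the longest prefix of draft the main model agrees with.
    """
    out = list(prompt)
    while len(out) < max_len:
        draft = propose(out, k)
        if not draft:
            break
        accepted = verify(out, draft)
        out.extend(accepted or draft[:1])  # always advance by at least one token
        if out[-1] == "<eos>":
            break
    return out
```

Each iteration emits several tokens instead of one, which is why the gain is largest on long, highly predictable outputs such as HTML table markup.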
Methodology
- GLM-OCR combines a 0.4B-parameter CogViT visual encoder and a 0.5B-parameter GLM language decoder.
- It employs a Multi-Token Prediction (MTP) mechanism, predicting multiple tokens per step.
- It maintains low memory overhead through parameter sharing.
- At the system level, a two-stage pipeline is adopted: PP-DocLayout-V3 performs layout analysis, followed by parallel region-level recognition.
- During training, GLM-OCR is trained to predict ten tokens per step, and it generates 5.2 tokens per decoding step on average at inference time.
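A back-of-the-envelope consequence of the figures above: at 5.2 tokens per decoding step, the number of decoding steps falls to roughly 1/5.2 ≈ 19% of token-by-token decoding. This is an upper bound on the wall-clock gain, since it ignores per-step overhead; the output length of 1000 tokens below is a hypothetical example.

```python
avg_tokens_per_step = 5.2  # reported average at inference time
tokens = 1000              # hypothetical output length

ar_steps = tokens                         # autoregressive: one token per step
mtp_steps = tokens / avg_tokens_per_step  # MTP: ~192 steps
speedup = ar_steps / mtp_steps            # equals avg tokens per step
```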
Experiments
GLM-OCR was extensively evaluated on public benchmarks, including OmniDocBench v1.5, OCRBench, UniMERNet, PubTabNet, and information extraction suites such as Nanonets-KIE and Handwritten-Forms, as well as on six high-frequency industrial scenarios: code document parsing, natural-scene table recognition, handwritten text recognition, multilingual OCR, seal recognition, and receipt KIE.
Results
GLM-OCR achieved an overall score of 94.6 on OmniDocBench v1.5, surpassing many large multimodal models. It scored 94.0 on OCRBench (Text) and 96.5 on UniMERNet, demonstrating its excellence in text recognition and formula transcription. It achieved 85.2 on PubTabNet, showing its competitiveness in table structure recovery. Additionally, on information extraction benchmarks such as Nanonets-KIE and Handwritten-Forms, GLM-OCR's performance is comparable to significantly larger general multimodal models.
Applications
Its strengths in document parsing, text and formula transcription, table structure recovery, and key information extraction make GLM-OCR suitable for both resource-constrained edge deployment and large-scale production systems, addressing the bottlenecks traditional OCR pipelines face with complex layouts and diverse document formats.
Limitations & Outlook
GLM-OCR may still face challenges when dealing with extremely complex document layouts, especially during the layout analysis phase. In multilingual environments, the model may require further fine-tuning to ensure high accuracy. Additionally, while the Multi-Token Prediction (MTP) mechanism significantly improves decoding efficiency, there may still be performance bottlenecks in certain specific scenarios.
Plain Language (Accessible to non-experts)
Imagine you're in a large library with various books and documents. Traditional librarians need to flip through each book page by page, manually recording every word and sentence. This is like traditional OCR systems, which need to recognize text word by word, sentence by sentence, making it inefficient. GLM-OCR is like a super-smart librarian who can quickly recognize the content of each book and handle multiple books simultaneously, extracting key information swiftly. It uses a technique called Multi-Token Prediction, which is like flipping through pages with multiple hands at once, greatly improving efficiency. Additionally, it can recognize complex structures in books, like tables and formulas, just like understanding charts and mathematical formulas in books. In short, GLM-OCR is like an efficient librarian who can quickly and accurately process a large number of complex documents.
ELI14 (Explained like you're 14)
Hey there! Imagine you're playing a super cool game where you need to quickly find hidden treasures on a map. The traditional way is like slowly walking through every corner, carefully looking for every clue. But GLM-OCR is like a super detective who can look at multiple areas of the map at once and quickly find all the treasures! It uses a technique called Multi-Token Prediction, which is like using multiple hands to look at the map at once, greatly improving efficiency. Plus, it can recognize complex structures on the map, like mazes and traps, just like understanding charts and mathematical formulas. In short, GLM-OCR is like an efficient super detective who can quickly and accurately find all the hidden treasures!
Glossary
Multimodal Model
A model that combines multiple data modalities (e.g., images and text) to process and understand multiple types of information simultaneously.
GLM-OCR combines a visual encoder and a language decoder, forming a multimodal model.
CogViT Visual Encoder
A visual encoder used for image processing, capable of converting image information into features that can be processed by the model.
GLM-OCR uses the CogViT visual encoder to process document images.
GLM Language Decoder
A language decoder used for text generation, capable of converting the model's internal representations into natural language text.
GLM-OCR uses the GLM language decoder to generate text output.
Multi-Token Prediction (MTP)
A mechanism that predicts multiple tokens per step, improving decoding efficiency and throughput.
GLM-OCR introduces the Multi-Token Prediction mechanism to enhance decoding efficiency.
Layout Analysis
The process of identifying different structured regions in a document to enable more precise content recognition.
GLM-OCR uses PP-DocLayout-V3 for layout analysis.
PP-DocLayout-V3
A tool for document layout analysis, capable of detecting structured regions in documents.
The layout analysis stage of GLM-OCR is powered by PP-DocLayout-V3.
Parameter Sharing
A technique to reduce memory overhead by sharing model parameters.
GLM-OCR reduces memory overhead from Multi-Token Prediction through parameter sharing.
Information Extraction
The process of extracting key information from documents, typically used for generating structured data.
GLM-OCR excels in information extraction tasks.
Edge Deployment
The process of deploying models in resource-constrained devices or environments.
GLM-OCR is suitable for resource-constrained edge deployment.
Large-scale Production Systems
Systems capable of handling large amounts of data and high-concurrency requests.
GLM-OCR is suitable for large-scale production systems.
Open Questions (Unanswered questions from this research)
1. Although GLM-OCR performs well on various benchmarks, it may still face challenges in handling extremely complex document layouts. Future research could explore more advanced layout analysis techniques to further enhance model robustness.
2. In multilingual environments, GLM-OCR may require further fine-tuning to ensure high accuracy. Research could focus on developing more general multilingual models to improve cross-language performance.
3. While the Multi-Token Prediction (MTP) mechanism significantly improves decoding efficiency, there may still be performance bottlenecks in certain specific scenarios. Future research could explore more efficient parameter-sharing mechanisms to further reduce computational costs.
4. In practical applications, GLM-OCR may need to handle more diverse document formats and content. Research could focus on developing more flexible model architectures to adapt to ever-changing document demands.
5. GLM-OCR may face challenges in handling handwritten text. Future research could explore more advanced handwriting recognition techniques to improve model accuracy.
Applications
Immediate Applications
Document Parsing
GLM-OCR can be used to parse complex document layouts and extract key information, suitable for scenarios like financial reports, contracts, and scientific articles.
Text Recognition
GLM-OCR excels in multilingual text recognition, suitable for enterprises and organizations that need to process multilingual content.
Table Structure Recovery
GLM-OCR can accurately recover table structures in documents, suitable for industries that need to process large amounts of tabular data, such as finance and market analysis.
Long-term Vision
Intelligent Document Management Systems
GLM-OCR can serve as a core component of intelligent document management systems, helping enterprises automate document processing workflows and improve efficiency.
Multimodal Information Retrieval
GLM-OCR can be used in multimodal information retrieval systems, combining visual and textual information to improve the accuracy and efficiency of information retrieval.