GLM-OCR Technical Report
GLM-OCR combines a CogViT visual encoder with a GLM language decoder to improve both the efficiency and the accuracy of document understanding.
Key Findings
Methodology
GLM-OCR integrates a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, forming a compact multimodal model. The model employs a Multi-Token Prediction (MTP) mechanism, predicting multiple tokens per step, significantly improving decoding throughput while maintaining low memory overhead through parameter sharing. At the system level, a two-stage pipeline is adopted: PP-DocLayout-V3 performs layout analysis, followed by parallel region-level recognition.
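The two-stage pipeline can be sketched as follows. This is a minimal illustration, not the actual system: `detect_layout` and `recognize_region` are hypothetical stand-ins for PP-DocLayout-V3 and the GLM-OCR recognizer, shown only to make concrete how layout analysis feeds parallel region-level recognition.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Region:
    kind: str    # e.g. "text", "table", "formula"
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates

def detect_layout(page):
    """Hypothetical stand-in for PP-DocLayout-V3: returns structured regions."""
    return [Region("text", (0, 0, 100, 40)), Region("table", (0, 50, 100, 120))]

def recognize_region(page, region):
    """Hypothetical stand-in for the region-level recognizer."""
    return f"<{region.kind} at {region.bbox}>"

def parse_page(page):
    # Stage 1: layout analysis produces independent regions.
    regions = detect_layout(page)
    # Stage 2: regions are recognized in parallel, since they do not depend
    # on each other once the layout is known.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda r: recognize_region(page, r), regions))
```

Because stage 1 decouples the regions, stage 2 parallelizes trivially; the real system exploits the same independence at batch level.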
Key Results
- On OmniDocBench v1.5, GLM-OCR achieved an overall score of 94.6, surpassing many large multimodal models.
- It scored 94.0 on OCRBench (Text) and 96.5 on UniMERNet, demonstrating its excellence in text recognition and formula transcription.
- It achieved 85.2 on PubTabNet, showing its competitiveness in table structure recovery.
Significance
GLM-OCR excels in document parsing, text and formula transcription, table structure recovery, and key information extraction, making it suitable for resource-constrained edge deployment and large-scale production systems. Its compact architecture and structured generation address the performance bottlenecks of traditional OCR systems in complex layouts and diverse document formats.
Technical Contribution
GLM-OCR significantly enhances the efficiency and performance of document understanding tasks by introducing the Multi-Token Prediction (MTP) mechanism and a two-stage pipeline architecture. Compared to existing large multimodal models, GLM-OCR achieves high recognition performance while greatly reducing computational costs and memory consumption.
Novelty
GLM-OCR is the first to introduce the Multi-Token Prediction (MTP) mechanism in OCR tasks, addressing the inefficiency of traditional autoregressive generation in deterministic OCR tasks. Its performance improvement is particularly notable in long structured outputs like tables compared to existing methods.
Limitations
- GLM-OCR may still face challenges when dealing with extremely complex document layouts, especially during the layout analysis phase.
- In multilingual environments, the model may require further fine-tuning to ensure high accuracy.
Future Work
Future work can focus on further optimizing the model's performance on multilingual and multi-format documents, and exploring more efficient parameter-sharing mechanisms to further reduce computational costs.
AI Executive Summary
GLM-OCR is an efficient 0.9B-parameter compact multimodal model designed for real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, achieving a strong balance between computational efficiency and recognition performance. To address the inefficiency of standard autoregressive decoding in deterministic OCR tasks, GLM-OCR introduces a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step, significantly improving decoding throughput while keeping memory overhead low through shared parameters. At the system level, a two-stage pipeline is adopted: PP-DocLayout-V3 first performs layout analysis, followed by parallel region-level recognition. Extensive evaluations on public benchmarks and industrial scenarios show that GLM-OCR achieves competitive or state-of-the-art performance in document parsing, text and formula transcription, table structure recovery, and key information extraction. Its compact architecture and structured generation make it suitable for both resource-constrained edge deployment and large-scale production systems.
On OmniDocBench v1.5, GLM-OCR achieved an overall score of 94.6, surpassing many large multimodal models. It scored 94.0 on OCRBench (Text) and 96.5 on UniMERNet, demonstrating its excellence in text recognition and formula transcription. It achieved 85.2 on PubTabNet, showing its competitiveness in table structure recovery. Additionally, on information extraction benchmarks such as Nanonets-KIE and Handwritten-Forms, GLM-OCR's performance is comparable to significantly larger general multimodal models.
Beyond public benchmarks, GLM-OCR was evaluated on six high-frequency real-world scenarios, including code document parsing, natural-scene table recognition, handwritten text recognition, multilingual OCR, seal recognition, and receipt KIE. GLM-OCR consistently delivers strong results across all settings, achieving 91.5 on real-world table recognition, 90.5 on seal recognition, and 94.5 on receipt KIE. These results indicate that GLM-OCR generalizes beyond curated benchmarks and remains effective under practical production conditions.
GLM-OCR's compact parameter scale makes it highly optimized for localized inference and resource-constrained environments. The model supports efficient deployment across mainstream frameworks, including vLLM, SGLang, and Ollama. To facilitate seamless integration, a comprehensive SDK is provided for end-to-end document parsing workflows.
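Since vLLM and SGLang expose an OpenAI-compatible HTTP endpoint, a document image can be sent inline as a base64 data URL. The sketch below only builds such a request payload; the served model id `glm-ocr` and the prompt are placeholder assumptions, and the actual POST to `/v1/chat/completions` is left out.

```python
import base64
import json

def build_ocr_request(image_bytes, model="glm-ocr", prompt="Transcribe this document."):
    """Build an OpenAI-compatible /v1/chat/completions payload with an inline image.

    The model name and prompt are placeholders; check the model id your
    server actually registers.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

payload = build_ocr_request(b"\x89PNG...")  # truncated bytes, illustration only
body = json.dumps(payload)                  # POST this body to /v1/chat/completions
```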
For cloud-based deployments, GLM-OCR is accessible via a MaaS API. The service employs a highly cost-effective, unified pricing model, significantly reducing operational overhead and decreasing processing costs to approximately one-tenth of those associated with traditional OCR solutions. GLM-OCR also supports direct fine-tuning using the LLaMA-Factory framework to meet specific domain adaptation or enhanced task performance needs.
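Fine-tuning with LLaMA-Factory is driven by a YAML config passed to `llamafactory-cli train`. The fragment below is a minimal sketch in that style, assuming a LoRA SFT run; the checkpoint path, dataset name, and chat template are placeholders to be replaced with your own.

```python
from pathlib import Path

# Hypothetical SFT config following LLaMA-Factory's YAML conventions.
config = """\
stage: sft
model_name_or_path: ./glm-ocr          # placeholder local checkpoint
finetuning_type: lora
dataset: my_receipts_kie               # placeholder custom dataset
template: default                      # replace with the model's chat template
output_dir: ./glm-ocr-lora
num_train_epochs: 3
"""

Path("glm_ocr_sft.yaml").write_text(config)
# then: llamafactory-cli train glm_ocr_sft.yaml
```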
Deep Analysis
Background
Document understanding is a core capability in modern information systems, supporting the extraction and structuring of knowledge from visually rich and layout-intensive documents such as financial reports, scientific articles, contracts, and invoices. Traditional OCR systems mainly focus on plain text transcription and rely on multi-stage pipelines with handcrafted rules for layout parsing and downstream information extraction. While effective for simple scenarios, these approaches often struggle with complex layouts, diverse document formats, and real-world production requirements. Recent multimodal large language models (MLLMs) unify visual perception and language understanding within a single framework and significantly improve document understanding performance. However, their large model size and autoregressive decoding paradigm lead to high computational cost, slow inference, and substantial memory consumption, which makes large-scale deployment under high-concurrency or edge environments challenging.
Core Problem
In practical production systems, document intelligence solutions must simultaneously provide: strong performance on complex content such as tables, formulas, code, and seals; high-throughput and low-latency inference; and flexible integration and domain adaptation. GLM-OCR is developed to address these system-level requirements within a unified multimodal framework.
Innovation
GLM-OCR is built on the GLM-V encoder-decoder framework, combining a 0.4B-scale CogViT visual encoder trained on large-scale image-text data, a lightweight cross-modal connector, and a 0.5B-scale GLM language decoder. The entire model contains only 0.9B parameters, enabling high-throughput and low-latency inference while maintaining strong recognition performance. Beyond architectural optimization, GLM-OCR also considers the mismatch between conventional autoregressive generation and the characteristics of OCR tasks. OCR is inherently a deterministic task with strong local dependencies and explicit structural supervision, where strictly autoregressive token-by-token decoding is inefficient. Therefore, we introduce Multi-Token Prediction (MTP) into both training and inference. MTP enables the simultaneous prediction of multiple tokens, substantially improving training efficiency and decoding throughput while preserving recognition accuracy, and is particularly advantageous for long structured outputs such as tables.
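The decoding loop behind MTP can be illustrated with a toy draft-and-verify sketch. Here `propose` and `verify` are hypothetical stand-ins for the MTP heads and the main decoder (which share parameters in the real model); this is a generic illustration of multi-token decoding, not GLM-OCR's actual implementation.

```python
def mtp_decode(propose, verify, prompt, max_len, k=10):
    """Toy multi-token prediction loop: propose k tokens, keep the verified prefix.

    propose(out, k) -> up to k candidate next tokens in a single step.
    verify(out, draft) -> the longest prefix of draft the main model agrees with.
    """
    out = list(prompt)
    while len(out) < max_len:
        draft = propose(out, k)
        if not draft:
            break
        accepted = verify(out, draft)
        out.extend(accepted or draft[:1])  # always advance by at least one token
        if out[-1] == "<eos>":
            break
    return out
```

Each iteration emits several tokens instead of one, which is why the gain is largest on long, highly predictable outputs such as HTML table markup.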
Methodology
- GLM-OCR combines a 0.4B-parameter CogViT visual encoder and a 0.5B-parameter GLM language decoder.
- It employs a Multi-Token Prediction (MTP) mechanism, predicting multiple tokens per step.
- It maintains low memory overhead through parameter sharing.
- At the system level, a two-stage pipeline is adopted: PP-DocLayout-V3 performs layout analysis, followed by parallel region-level recognition.
- During training, GLM-OCR is trained to predict ten tokens per step, and it generates 5.2 tokens per decoding step on average at inference time.
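A back-of-the-envelope consequence of the figures above: at 5.2 tokens per decoding step, the number of decoding steps falls to roughly 1/5.2 ≈ 19% of token-by-token decoding. This is an upper bound on the wall-clock gain, since it ignores per-step overhead; the output length of 1000 tokens below is a hypothetical example.

```python
avg_tokens_per_step = 5.2  # reported average at inference time
tokens = 1000              # hypothetical output length

ar_steps = tokens                         # autoregressive: one token per step
mtp_steps = tokens / avg_tokens_per_step  # MTP: ~192 steps
speedup = ar_steps / mtp_steps            # equals avg tokens per step
```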
Experiments
GLM-OCR was extensively evaluated on public benchmarks, including OmniDocBench v1.5, OCRBench, UniMERNet, PubTabNet, and information extraction suites such as Nanonets-KIE and Handwritten-Forms, as well as on six high-frequency industrial scenarios: code document parsing, natural-scene table recognition, handwritten text recognition, multilingual OCR, seal recognition, and receipt KIE.
Results
GLM-OCR achieved an overall score of 94.6 on OmniDocBench v1.5, surpassing many large multimodal models. It scored 94.0 on OCRBench (Text) and 96.5 on UniMERNet, demonstrating its excellence in text recognition and formula transcription. It achieved 85.2 on PubTabNet, showing its competitiveness in table structure recovery. Additionally, on information extraction benchmarks such as Nanonets-KIE and Handwritten-Forms, GLM-OCR's performance is comparable to significantly larger general multimodal models.
Applications
Its strengths in document parsing, text and formula transcription, table structure recovery, and key information extraction make GLM-OCR suitable for both resource-constrained edge deployment and large-scale production systems, addressing the bottlenecks traditional OCR pipelines face with complex layouts and diverse document formats.
Limitations & Outlook
GLM-OCR may still face challenges when dealing with extremely complex document layouts, especially during the layout analysis phase. In multilingual environments, the model may require further fine-tuning to ensure high accuracy. Additionally, while the Multi-Token Prediction (MTP) mechanism significantly improves decoding efficiency, there may still be performance bottlenecks in certain specific scenarios.
Plain Language (Accessible to non-experts)
Imagine you're in a large library with various books and documents. Traditional librarians need to flip through each book page by page, manually recording every word and sentence. This is like traditional OCR systems, which need to recognize text word by word, sentence by sentence, making it inefficient. GLM-OCR is like a super-smart librarian who can quickly recognize the content of each book and handle multiple books simultaneously, extracting key information swiftly. It uses a technique called Multi-Token Prediction, which is like flipping through pages with multiple hands at once, greatly improving efficiency. Additionally, it can recognize complex structures in books, like tables and formulas, just like understanding charts and mathematical formulas in books. In short, GLM-OCR is like an efficient librarian who can quickly and accurately process a large number of complex documents.
ELI14 (Explained like you're 14)
Hey there! Imagine you're playing a super cool game where you need to quickly find hidden treasures on a map. The traditional way is like slowly walking through every corner, carefully looking for every clue. But GLM-OCR is like a super detective who can look at multiple areas of the map at once and quickly find all the treasures! It uses a technique called Multi-Token Prediction, which is like using multiple hands to look at the map at once, greatly improving efficiency. Plus, it can recognize complex structures on the map, like mazes and traps, just like understanding charts and mathematical formulas. In short, GLM-OCR is like an efficient super detective who can quickly and accurately find all the hidden treasures!
Glossary
Multimodal Model
A model that combines multiple data modalities (e.g., images and text) to process and understand multiple types of information simultaneously.
GLM-OCR combines a visual encoder and a language decoder, forming a multimodal model.
CogViT Visual Encoder
A visual encoder used for image processing, capable of converting image information into features that can be processed by the model.
GLM-OCR uses the CogViT visual encoder to process document images.
GLM Language Decoder
A language decoder used for text generation, capable of converting the model's internal representations into natural language text.
GLM-OCR uses the GLM language decoder to generate text output.
Multi-Token Prediction (MTP)
A mechanism that predicts multiple tokens per step, improving decoding efficiency and throughput.
GLM-OCR introduces the Multi-Token Prediction mechanism to enhance decoding efficiency.
Layout Analysis
The process of identifying different structured regions in a document to enable more precise content recognition.
GLM-OCR uses PP-DocLayout-V3 for layout analysis.
PP-DocLayout-V3
A tool for document layout analysis, capable of detecting structured regions in documents.
The layout analysis stage of GLM-OCR is powered by PP-DocLayout-V3.
Parameter Sharing
A technique to reduce memory overhead by sharing model parameters.
GLM-OCR reduces memory overhead from Multi-Token Prediction through parameter sharing.
Information Extraction
The process of extracting key information from documents, typically used for generating structured data.
GLM-OCR excels in information extraction tasks.
Edge Deployment
The process of deploying models in resource-constrained devices or environments.
GLM-OCR is suitable for resource-constrained edge deployment.
Large-scale Production Systems
Systems capable of handling large amounts of data and high-concurrency requests.
GLM-OCR is suitable for large-scale production systems.
Open Questions (Unanswered questions from this research)
1. Although GLM-OCR performs well on various benchmarks, it may still face challenges in handling extremely complex document layouts. Future research could explore more advanced layout analysis techniques to further enhance model robustness.
2. In multilingual environments, GLM-OCR may require further fine-tuning to ensure high accuracy. Research could focus on developing more general multilingual models to improve cross-language performance.
3. While the Multi-Token Prediction (MTP) mechanism significantly improves decoding efficiency, there may still be performance bottlenecks in certain specific scenarios. Future research could explore more efficient parameter-sharing mechanisms to further reduce computational costs.
4. In practical applications, GLM-OCR may need to handle more diverse document formats and content. Research could focus on developing more flexible model architectures to adapt to ever-changing document demands.
5. GLM-OCR may face challenges in handling handwritten text. Future research could explore more advanced handwriting recognition techniques to improve model accuracy.
Applications
Immediate Applications
Document Parsing
GLM-OCR can be used to parse complex document layouts and extract key information, suitable for scenarios like financial reports, contracts, and scientific articles.
Text Recognition
GLM-OCR excels in multilingual text recognition, suitable for enterprises and organizations that need to process multilingual content.
Table Structure Recovery
GLM-OCR can accurately recover table structures in documents, suitable for industries that need to process large amounts of tabular data, such as finance and market analysis.
Long-term Vision
Intelligent Document Management Systems
GLM-OCR can serve as a core component of intelligent document management systems, helping enterprises automate document processing workflows and improve efficiency.
Multimodal Information Retrieval
GLM-OCR can be used in multimodal information retrieval systems, combining visual and textual information to improve the accuracy and efficiency of information retrieval.