Why Does Document OCR Still Remain a Hard Engineering Problem? What does it take to make OCR useful for real documents instead of clean demo images? And can a compact multimodal model handle parsing, tables, formulas, and structured extraction without turning inference into a resource bonfire?
That is the problem targeted by GLM-OCR, released by researchers from Zhipu AI and Tsinghua University. The research team presents GLM-OCR as a 0.9B-parameter compact multimodal model for document understanding. It combines a 0.4B CogViT visual encoder, a lightweight cross-modal connector, and a 0.5B GLM language decoder. The stated goal is to balance document recognition quality with lower latency and lower computational cost than larger multimodal systems.
Traditional OCR systems are often good at plain-text transcription, but they struggle when documents contain mixed layouts, tables, formulas, code blocks, seals, and structured fields. Recent multimodal large language models improve document understanding, but the research team argues that their size and standard autoregressive decoding make them expensive for edge deployment and large-scale production. GLM-OCR is positioned as a smaller system built for these deployment constraints, rather than as a general-purpose vision-language model adapted to OCR as an afterthought.
A Compact Architecture Built for OCR Workloads
A key technical point of this work is the use of Multi-Token Prediction (MTP). Standard autoregressive decoding predicts one token at a time, which is not ideal for OCR-style tasks where outputs are often deterministic and locally structured. GLM-OCR instead predicts multiple tokens per step. The model is trained to predict 10 tokens per step and generates 5.2 tokens per decoding step on average at inference time, yielding roughly a 50% throughput improvement. To keep memory overhead manageable, the implementation uses a parameter-sharing scheme across the draft models.
Two-Stage Layout Parsing Instead of Flat Page Reading
At the system level, GLM-OCR adopts a two-stage pipeline. The first stage uses PP-DocLayout-V3 for layout analysis, which detects structured regions on the page. The second stage performs parallel region-level recognition over those detected regions. This matters because the model is not simply reading an entire page left-to-right as a generic vision-language model might. It first breaks the page into semantically meaningful regions, which improves efficiency and makes the system more robust on documents with complicated layouts.
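The two stages can be sketched as a small orchestration function. This is a hedged illustration, assuming callables `detect_layout` (standing in for PP-DocLayout-V3) and `recognize_region` (standing in for the GLM-OCR recognizer); neither name comes from the released code.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_document(page_image, detect_layout, recognize_region):
    # Stage 1: layout analysis -> list of (region_type, bbox) tuples,
    # e.g. ("table", (x0, y0, x1, y1)).
    regions = detect_layout(page_image)
    # Stage 2: recognize each region in parallel instead of one
    # left-to-right autoregressive pass over the whole page.
    with ThreadPoolExecutor() as pool:
        texts = list(pool.map(lambda r: recognize_region(page_image, r),
                              regions))
    # Reassemble in layout order (here: top-to-bottom by bbox y0).
    order = sorted(range(len(regions)), key=lambda i: regions[i][1][1])
    return "\n\n".join(texts[i] for i in order)

# Toy run with stub detector/recognizer:
fake_regions = [("title", (0, 0, 100, 20)), ("table", (0, 30, 100, 80))]
result = parse_document(
    "page.png",
    detect_layout=lambda img: fake_regions,
    recognize_region=lambda img, r: f"[{r[0]}]")
# result == "[title]\n\n[table]"
```

The real system's region merge logic is surely richer (reading order, multi-column flow); top-to-bottom sorting is just the simplest stand-in.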
Document Parsing and KIE Use Different Output Paths
The architecture also separates two related document tasks. For document parsing, the pipeline uses layout detection and region processing to produce structured outputs such as Markdown and JSON. For Key Information Extraction (KIE), the research team describes a different path: the full document image is fed to the model with a task prompt, and the model directly generates JSON containing the extracted fields. That distinction matters because GLM-OCR is not presented as a single monolithic page-to-text model. It is a structured generation system with different operating modes depending on the task.
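The two output paths amount to a mode switch at the API level. The sketch below is an assumption about the interface, not the released API: the prompt strings and the `model` callable are illustrative.

```python
import json

def run(model, image, mode, fields=None):
    """Dispatch between the two working modes described in the paper."""
    if mode == "parse":
        # Parsing path: layout-driven, region-level outputs merged into
        # structured Markdown (a JSON output is produced analogously).
        return model(image, prompt="Parse this document to Markdown.")
    elif mode == "kie":
        # KIE path: full page image plus a task prompt, decoded
        # directly as a JSON object of extracted fields.
        raw = model(image, prompt=f"Extract fields {fields} as JSON.")
        return json.loads(raw)
    raise ValueError(f"unknown mode: {mode}")

# Stub model standing in for actual inference:
def stub_model(image, prompt):
    return '{"total": "42"}' if "JSON" in prompt else "# Title"

doc_md = run(stub_model, "page.png", "parse")          # Markdown string
extracted = run(stub_model, "page.png", "kie",
                fields=["total"])                      # dict of fields
```

The `json.loads` call is the practical reason the training recipe (next section) includes JSON validation constraints in its reward: downstream consumers parse the output directly.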
A 4-Stage Training Pipeline with Task-Specific Rewards
The training recipe is split into four stages. Stage 1 trains the vision encoder on image-text pairs and grounding or retrieval data. Stage 2.1 performs multimodal pretraining on image-text, document parsing, grounding, and VQA data. Stage 2.2 adds the MTP objective. Stage 3 is supervised fine-tuning on OCR-specific tasks including text recognition, formula transcription, table structure recovery, and KIE. Stage 4 applies reinforcement learning using GRPO. The reward design is task-specific: Normalized Edit Distance for text recognition, CDM score for formula recognition, TEDS score for table recognition, and field-level F1 for KIE, along with structural penalties such as repetition penalties, malformed-structure penalties, and JSON validation constraints.
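Two of these rewards are simple enough to sketch directly: the Normalized Edit Distance reward for text recognition, and a field-level F1 for KIE with a hard penalty for malformed JSON. The weighting and penalty value below are assumptions for illustration, not figures from the paper.

```python
import json

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def text_reward(pred, ref):
    # Reward = 1 - Normalized Edit Distance, clipped at 0.
    ned = edit_distance(pred, ref) / max(len(pred), len(ref), 1)
    return max(0.0, 1.0 - ned)

def kie_reward(pred_json, ref_fields, malformed_penalty=-1.0):
    # JSON validation constraint: malformed output gets a hard penalty.
    try:
        pred = json.loads(pred_json)
    except json.JSONDecodeError:
        return malformed_penalty
    # Field-level F1 over (key, value) pairs.
    tp = sum(1 for k, v in ref_fields.items() if pred.get(k) == v)
    precision = tp / max(len(pred), 1)
    recall = tp / max(len(ref_fields), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```

In GRPO, these scalar rewards score sampled completions per group; the structural penalties (repetition, malformed structure) would be added on top in the same scalar.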
Benchmark Results Show Strong Performance, With Important Caveats
On public benchmarks, GLM-OCR reports strong results across multiple document tasks. It scores 94.6 on OmniDocBench v1.5, 94.0 on OCRBench (Text), 96.5 on UniMERNet, 85.2 on PubTabNet, and 86.0 on TEDS_TEST. For KIE, it reports 93.7 on Nanonets-KIE and 86.1 on Handwritten-KIE. The research team notes that results for Gemini-3-Pro and GPT-5.2-2025-12-11 are shown only for reference and are excluded from the best-score ranking, which is an important detail when interpreting claims about model leadership.
https://arxiv.org/pdf/2603.10910
The benchmark story is strong, but it needs careful phrasing. GLM-OCR achieves the highest reported scores among the evaluated non-reference models on OmniDocBench v1.5, OCRBench (Text), UniMERNet, and TEDS_TEST. On PubTabNet, however, it does not lead overall: MinerU 2.5 reports 88.4 versus GLM-OCR's 85.2. For KIE, GLM-OCR outperforms the listed open-source competitors, but Gemini-3-Pro scores higher on both Nanonets-KIE and Handwritten-KIE in the reference column. So the evidence supports a strong competitive claim, but not a blanket "best at everything" claim.
Deployment Details
The research team states that GLM-OCR supports vLLM, SGLang, and Ollama, and can be fine-tuned via LLaMA-Factory. They also report throughput of 0.67 images/s and 1.86 PDF pages/s under their evaluation setup. In addition, they describe a MaaS API priced at 0.2 RMB per million tokens, with example cost estimates for scanned images and simple-layout PDFs. These details suggest that GLM-OCR is being framed as both a research model and a deployable system.
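At that price point, batch costs stay small. A back-of-the-envelope estimate, where the per-page token count is an assumption for illustration rather than a figure from the paper:

```python
# Cost estimate at the stated MaaS price of 0.2 RMB per million tokens.
PRICE_RMB_PER_MTOK = 0.2

def batch_cost_rmb(pages, tokens_per_page):
    return pages * tokens_per_page * PRICE_RMB_PER_MTOK / 1_000_000

# e.g. 10,000 simple-layout PDF pages at an assumed ~1,500 tokens/page:
cost = batch_cost_rmb(10_000, 1_500)   # -> 3.0 RMB

# And at the reported 1.86 PDF pages/s, wall-clock time for the batch:
hours = 10_000 / 1.86 / 3600           # roughly 1.5 hours on one setup
```

The token-per-page figure varies widely with layout density, so treat the cost as an order-of-magnitude sketch.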
Key Takeaways
- GLM-OCR is a compact 0.9B multimodal OCR model built from a 0.4B CogViT encoder and a 0.5B GLM decoder.
- It uses Multi-Token Prediction (MTP) to improve decoding efficiency, reaching 5.2 tokens per step on average and about 50% higher throughput.
- The model uses a two-stage pipeline: PP-DocLayout-V3 handles layout analysis, then GLM-OCR performs parallel region-level recognition.
- It supports both document parsing and KIE: parsing outputs Markdown/JSON, while KIE directly generates JSON from the full document image.
- Benchmark results are strong but not universal wins: GLM-OCR leads several reported non-reference benchmarks, but MinerU 2.5 is higher on PubTabNet, and Gemini-3-Pro is higher on the reference-only KIE scores.
Check out the Paper, Repo and Model Page.

