The Baidu Qianfan team has launched Qianfan-OCR, a 4B-parameter end-to-end model designed to unify document parsing, layout analysis, and document understanding within a single vision-language architecture. Unlike traditional multi-stage OCR pipelines that chain separate modules for layout detection and text recognition, Qianfan-OCR performs direct image-to-Markdown conversion and supports prompt-driven tasks such as table extraction and document question answering.

https://arxiv.org/pdf/2603.13398

Architecture and Technical Specifications

Qianfan-OCR uses the multimodal bridging architecture from the Qianfan-VL framework. The system consists of three main components:

  • Vision Encoder (Qianfan-ViT): Employs an Any-Resolution design that tiles images into 448 x 448 patches. It supports variable-resolution inputs up to 4K, producing up to 4,096 visual tokens per image to preserve spatial resolution for small fonts and dense text.
  • Cross-Modal Adapter: A lightweight two-layer MLP with GELU activation that projects visual features into the language model's embedding space.
  • Language Model Backbone (Qwen3-4B): A 4.0B-parameter model with 36 layers and a native 32K context window. It uses Grouped-Query Attention (GQA) to reduce KV cache memory usage by 4x.
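The token budget implied by the encoder and backbone specs above can be sketched in a few lines. The per-tile token count (256) and the GQA head counts (32 query heads, 8 KV heads) are illustrative assumptions; the article only states the 448 x 448 tile size, the 4,096-token cap, and the 4x KV-cache reduction:

```python
import math

def visual_token_budget(width, height, tile=448, tokens_per_tile=256, cap=4096):
    """Any-Resolution tiling sketch: the image is split into 448x448 tiles,
    each tile contributes a fixed number of visual tokens, and the total is
    capped at 4,096. tokens_per_tile=256 is an assumption, not a spec value."""
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return min(tiles * tokens_per_tile, cap)

def gqa_kv_cache_ratio(num_query_heads, num_kv_heads):
    """Under Grouped-Query Attention the KV cache shrinks by the ratio of
    query heads to KV heads; a 4x reduction implies 4 query heads per KV head."""
    return num_query_heads / num_kv_heads

# A 4K-class page (3840 x 2160) saturates the cap under these assumptions:
print(visual_token_budget(3840, 2160))  # 4096 (45 tiles would give 11,520 tokens uncapped)
print(gqa_kv_cache_ratio(32, 8))        # 4.0 (illustrative head counts)
```

The cap matters in practice: a dense 4K page would otherwise emit tens of thousands of visual tokens, which the 32K context window of the backbone would quickly exhaust.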

‘Layout-as-Thought’ Mechanism

The model's headline feature is Layout-as-Thought, an optional thinking phase triggered by <think> tokens. During this phase, the model generates structured layout representations (bounding boxes, element types, and reading order) before producing the final output.

  • Functional role: This phase recovers the explicit layout-analysis capabilities (element localization and type classification) that are usually lost in end-to-end paradigms.
  • Performance characteristics: Evaluation on OmniDocBench v1.5 indicates that enabling the thinking phase gives a consistent advantage on documents with high “layout label entropy”, i.e. those containing heterogeneous elements such as mixed text, formulas, and diagrams.
  • Efficiency: Bounding box coordinates are represented as dedicated special tokens (<COORD_0> to <COORD_999>), reducing thinking output length by roughly 50% compared to plain digit sequences.
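The coordinate-token saving can be illustrated with a toy comparison. The digit-level tokenization and the comma separator below are assumptions made for the sketch, not details from the paper:

```python
def coords_as_special_tokens(box):
    """Each coordinate, quantized to the 0-999 range, becomes exactly one
    dedicated vocabulary token, so a bounding box always costs 4 tokens."""
    assert all(0 <= c <= 999 for c in box)
    return [f"<COORD_{c}>" for c in box]

def coords_as_digit_tokens(box):
    """Plain digit sequences: roughly one token per digit plus separators
    (a rough assumption about the tokenizer, for comparison only)."""
    tokens = []
    for c in box:
        tokens.extend(str(c))
        tokens.append(",")
    return tokens[:-1]  # drop the trailing separator

box = (12, 407, 865, 991)
print(len(coords_as_special_tokens(box)))  # 4
print(len(coords_as_digit_tokens(box)))    # 14
```

Per box the saving is well over 50%; the roughly 50% figure reported for the thinking phase overall is lower because the layout trace also contains element types and reading-order text that the special tokens do not compress.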

Empirical Performance and Benchmarks

Qianfan-OCR was evaluated against both specialized OCR systems and general vision-language models (VLMs).

Document Parsing and General OCR

The model ranks first among end-to-end models on several key benchmarks:

  • OmniDocBench v1.5: Achieved a score of 93.12, surpassing DeepSeek-OCR-v2 (91.09) and Gemini-3 Pro (90.33).
  • OlmOCR Bench: Scored 79.8, leading the end-to-end category.
  • OCRBench: Achieved a score of 880, ranking first among all tested models.

On public key information extraction (KIE) benchmarks, Qianfan-OCR achieved the highest average score (87.9), outperforming significantly larger models.

Model              | Total Mean (KIE) | OCRBench KIE | Nanonets KIE (F1)
Qianfan-OCR (4B)   | 87.9             | 95.0         | 86.5
Qwen3-4B-VL        | 83.5             | 89.0         | 83.3
Qwen3-VL-235B-A22B | 84.2             | 94.0         | 83.8
Gemini-3.1-Pro     | 79.2             | 96.0         | 76.1

Document Understanding

Comparative testing revealed that two-stage OCR+LLM pipelines often fail on tasks requiring spatial reasoning. For instance, all tested two-stage systems scored 0.0 on the CharXiv benchmark, because the text-extraction stage discards the visual context (axis relationships, data point positions) necessary for chart interpretation.


Deployment and Inference

Inference efficiency was measured in Pages Per Second (PPS) on a single NVIDIA A100 GPU.

  • Quantization: With W8A8 (AWQ) quantization, Qianfan-OCR achieved 1.024 PPS, a 2x speedup over the W16A16 baseline with negligible accuracy loss.
  • Architectural advantage: Unlike pipeline systems that rely on CPU-based layout analysis, which can become a bottleneck, Qianfan-OCR is GPU-centric. This avoids inter-stage processing delays and enables efficient large-batch inference.
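The reported figures translate into concrete throughput as follows; note that the W16A16 rate is inferred from the stated 2x speedup rather than reported directly:

```python
def pages_per_hour(pps):
    """Convert a Pages-Per-Second throughput figure to pages per hour."""
    return pps * 3600

w8a8_pps = 1.024            # reported: W8A8 (AWQ) on a single A100
w16a16_pps = w8a8_pps / 2   # inferred from the reported 2x speedup

print(f"W8A8:   {pages_per_hour(w8a8_pps):.0f} pages/hour")
print(f"W16A16: {pages_per_hour(w16a16_pps):.0f} pages/hour")
```

At roughly 3,686 pages per hour per quantized A100, a modest GPU pool can cover large document-digitization workloads, which is the regime where the GPU-centric, batch-friendly design matters most.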

Check out the Paper, Repo, and Model on Hugging Face.


Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
