PaddleOCR-VL-1.5: SOTA Multimodal Document Parsing

Introduction

In January 2026, Baidu released PaddleOCR-VL-1.5, a next-generation vision-language model (VLM) designed for robust, in-the-wild document parsing. It builds on the original PaddleOCR-VL, which established a 0.9B-parameter baseline for efficient document understanding. Despite its compact footprint, the model reaches 94.5% accuracy on the OmniDocBench v1.5 benchmark, setting a new state of the art (SOTA) for document AI.

While the first version focused on establishing high-performance multilingual support, PaddleOCR-VL-1.5 addresses the most common failures in document processing: physical distortions, irregular layouts, and the "messiness" of real-world captures.

Key Capabilities of PaddleOCR-VL-1.5

PaddleOCR-VL-1.5 is more than an incremental upgrade. It adds features aimed at closing the gap between academic benchmarks and real-world deployment:

Superior Recognition Accuracy: By refining the underlying vision-language integration, the model achieves 94.5% accuracy on OmniDocBench v1.5. The improvement is most visible in complex elements such as nested tables and dense mathematical formulas.
Multimodal Multi-Tasking: For the first time, PaddleOCR natively supports seal recognition and text spotting (simultaneous localization and recognition of text lines) within the same compact 0.9B architecture.
Irregular-Shape Localization: The model introduces support for polygonal detection, allowing it to accurately identify and parse text blocks that are not perfectly rectangular. This is critical for documents that are captured in "the wild" with significant skew or perspective distortion.
Coherent Long-Document Parsing: Dealing with multi-page documents often leads to fragmented data. PaddleOCR-VL-1.5 solves this with automatic cross-page table merging and intelligent recognition of paragraph headings that span across page breaks.
Broad Linguistic & Script Support: The model strengthens its already impressive multilingual capabilities by improving recognition for ancient texts, rare characters, and specialized layouts (like underlines and checkboxes). It also officially adds support for the Tibetan script and Bengali language.
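To make the idea of cross-page table merging concrete, here is a minimal sketch of one possible heuristic: fragments on consecutive pages are joined when their headers match. The `TableFragment` structure and the `merge_cross_page_tables` function are hypothetical names introduced for illustration; the article does not describe how the model performs the merge internally.

```python
from dataclasses import dataclass, field


@dataclass
class TableFragment:
    """A table piece found on one page (hypothetical structure, for illustration)."""
    page: int
    header: list[str]
    rows: list[list[str]] = field(default_factory=list)


def merge_cross_page_tables(fragments: list[TableFragment]) -> list[TableFragment]:
    """Join fragments on consecutive pages that share the same header.

    Heuristic only: a production system would likely also compare column
    geometry and handle tables whose header is not repeated.
    """
    merged: list[TableFragment] = []
    last_page = None  # page number of the most recently visited fragment
    for frag in sorted(fragments, key=lambda f: f.page):
        prev = merged[-1] if merged else None
        if prev is not None and frag.header == prev.header and frag.page == last_page + 1:
            # Same header on the very next page: treat as a continuation.
            prev.rows.extend(frag.rows)
        else:
            # Copy so the caller's fragments are never mutated.
            merged.append(TableFragment(frag.page, list(frag.header), list(frag.rows)))
        last_page = frag.page
    return merged


if __name__ == "__main__":
    parts = [
        TableFragment(1, ["Item", "Qty"], [["bolt", "4"]]),
        TableFragment(2, ["Item", "Qty"], [["nut", "9"]]),
        TableFragment(3, ["Name", "Date"], [["x", "y"]]),
    ]
    for table in merge_cross_page_tables(parts):
        print(table.page, table.header, table.rows)
```

Matching on the header alone is deliberately simple; it shows the shape of the problem rather than a reliable rule.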

Real-World Robustness: The Real5-OmniDocBench

One of the most significant contributions of the PaddleOCR-VL-1.5 release is the introduction of the Real5-OmniDocBench benchmark. Most existing OCR benchmarks rely on tidy, digital-born PDFs. However, real-world data is often messy, noisy, and physically distorted.

To address this gap, the Baidu team curated Real5-OmniDocBench, focusing on five "in-the-wild" scenarios that typically cause OCR systems to fail:

Scanning Artifacts: Physical scanning often introduces "salt-and-pepper" noise, color bleeding, or low-contrast backgrounds. PaddleOCR-VL-1.5 is trained to filter this noise and focus on the underlying structure.
Document Skew: Whether it's a hand-held photo or a poorly aligned scan, skewed documents often break layout analysis. The model's support for irregular-shaped localization allows it to "straighten" the context internally for accurate parsing.
Warping and Curvature: Books captured without being flattened or documents that have been folded create curved text lines. Version 1.5 uses advanced polygonal detection to follow these curves precisely.
Screen Photography: Taking photos of computer screens or digital displays often results in moiré patterns and significant glare. The model's visual encoder is robust enough to differentiate between these artifacts and the actual text content.
Challenging Illumination: Strong shadows, over-exposure, or uneven lighting (common in warehouse or mobile environments) are no longer deal-breakers. The model maintains high recognition rates even when parts of the document are poorly lit.

In all five scenarios, PaddleOCR-VL-1.5 consistently sets new SOTA records, demonstrating superior performance compared to both its predecessor and mainstream open-source VLMs.
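The value of polygonal localization for skewed text can be illustrated with a little geometry. The sketch below is not taken from PaddleOCR; it assumes a text region is given as a list of (x, y) corner points, starting at the top-left and proceeding clockwise, and compares the polygon's area with that of its axis-aligned bounding box.

```python
import math

Point = tuple[float, float]


def polygon_area(points: list[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def bounding_box_area(points: list[Point]) -> float:
    """Area of the axis-aligned rectangle that encloses the polygon."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


def skew_degrees(points: list[Point]) -> float:
    """Angle of the first edge (assumed top-left to top-right) against the x-axis."""
    (x1, y1), (x2, y2) = points[0], points[1]
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


if __name__ == "__main__":
    # A text line tilted by roughly 11 degrees (made-up coordinates).
    quad = [(0.0, 0.0), (100.0, 20.0), (96.0, 40.0), (-4.0, 20.0)]
    print(f"skew: {skew_degrees(quad):.2f} degrees")
    print(f"polygon covers {polygon_area(quad) / bounding_box_area(quad):.0%} of its bounding box")
```

For this tilted line, the rectangle is twice as large as the text region it encloses, so a rectangular crop would pull in a lot of neighbouring content that a polygon avoids.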

Performance Benchmarks

The competitive landscape for Vision-Language Models is fierce, but PaddleOCR-VL-1.5 carves out a unique niche by balancing size and performance. On the comprehensive OmniDocBench v1.5 leaderboard, it sets the standard for 1B-class models.

Overall Performance

The model's overall score reflects its ability to synthesize visual information and linguistic context. By achieving 94.5% accuracy, it demonstrates a deep understanding of reading order and document structure.
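The article does not say how the 94.5% figure is computed. As a general illustration of how text recognition quality is often quantified, the sketch below computes a similarity score from normalized edit distance. It is a generic measure written for this example, with hypothetical function names, and should not be read as the benchmark's scoring code.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_similarity(pred: str, truth: str) -> float:
    """1.0 means identical strings; 0.0 means nothing matches."""
    longest = max(len(pred), len(truth))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(pred, truth) / longest


if __name__ == "__main__":
    # One character misread ("0" as "O") in a 12-character string.
    print(normalized_similarity("Invoice 2O24", "Invoice 2024"))
```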

Specialized Recognition Metrics

Mathematical Formulas: Many traditional OCR systems struggle with complex LaTeX-style formulas. PaddleOCR-VL-1.5 achieves SOTA performance here, making it an invaluable tool for digitizing academic and scientific literature.
Table Structure Awareness: The model doesn't just recognize text in cells; it understands the hierarchy. This enables it to handle merged cells, multi-line headers, and tables without borders with high fidelity.

Document Element	Model Type	SOTA Status	Performance Note
Overall Context	Compact VLM (0.9B)	NEW SOTA	Surpasses prior PaddleOCR-VL models
Mathematical Formulas	Compact VLM (0.9B)	NEW SOTA	High fidelity LaTeX extraction
Table Structure	Compact VLM (0.9B)	NEW SOTA	Handles irregular/borderless tables
Text Spotting	Compact VLM (0.9B)	NEW SOTA	Integrated localization-recognition
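To show what handling merged cells means for downstream code, here is a small sketch that expands cells with row and column spans into a dense grid. The `Cell` structure and `expand_to_grid` function are hypothetical and are not the model's output format; they only illustrate the kind of structure a table parser has to recover.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """One logical table cell (hypothetical structure, for illustration)."""
    row: int
    col: int
    text: str
    rowspan: int = 1
    colspan: int = 1


def expand_to_grid(cells: list[Cell]) -> list[list[str]]:
    """Copy each cell's text into every grid position it spans."""
    if not cells:
        return []
    n_rows = max(c.row + c.rowspan for c in cells)
    n_cols = max(c.col + c.colspan for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        for r in range(c.row, c.row + c.rowspan):
            for k in range(c.col, c.col + c.colspan):
                grid[r][k] = c.text
    return grid


if __name__ == "__main__":
    # A two-row header: "Region" spans two rows, "Sales" spans two columns.
    cells = [
        Cell(0, 0, "Region", rowspan=2),
        Cell(0, 1, "Sales", colspan=2),
        Cell(1, 1, "Q1"),
        Cell(1, 2, "Q2"),
        Cell(2, 0, "EU"),
        Cell(2, 1, "10"),
        Cell(2, 2, "12"),
    ]
    for row in expand_to_grid(cells):
        print(row)
```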

Real-World Use Cases

The robustness of PaddleOCR-VL-1.5 makes it suitable for a wide variety of industrial and personal applications:

Mobile Archive Digitization: Well suited to apps that let users "scan" documents with phone cameras, where lighting and angles are inconsistent.
Logistics and Supply Chain: Quickly reading invoices and shipping labels that may be crumpled or poorly lit in warehouse environments.
Ancient Text Preservation: Its improved support for rare characters and ancient scripts makes it a key tool for historians and digital librarians.
Financial Document Automation: Handling complex multi-page reports with tables that span across pages, ensuring the data remains linked correctly.

Installation and Usage

Getting started with PaddleOCR-VL-1.5 is straightforward. It requires PaddlePaddle version 3.2.1 or above.

Installation

# Install PaddlePaddle and PaddleOCR
python -m pip install paddlepaddle-gpu==3.2.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install -U "paddleocr[doc-parser]"

Note: For macOS users, using Docker is the recommended way to set up the environment.

Basic Usage

CLI Command:

paddleocr doc_parser -i https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png

Python API:

from paddleocr import PaddleOCRVL

# Initialize and predict
pipeline = PaddleOCRVL()
output = pipeline.predict("your_document.png")

# Output results in various formats
for res in output:
    res.print()
    res.save_to_json(save_path="output")
    res.save_to_markdown(save_path="output")
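
For more than one file, the same calls can be wrapped in a small helper. The sketch below is hypothetical: `parse_folder` and its default suffix list are assumptions made for this example, and it relies only on the `predict`, `save_to_json`, and `save_to_markdown` calls shown above. The pipeline is passed in as an argument, so the helper itself does not import PaddleOCR.

```python
from pathlib import Path


def parse_folder(pipeline, input_dir, output_dir, suffixes=(".png", ".jpg", ".jpeg")):
    """Run pipeline.predict on each matching file and save results per file.

    `pipeline` is any object exposing predict(path) whose results offer
    save_to_json(save_path=...) and save_to_markdown(save_path=...).
    The suffix list is an assumption; adjust it to the inputs you use.
    """
    processed = []
    for path in sorted(Path(input_dir).iterdir()):
        if path.suffix.lower() not in suffixes:
            continue
        # One output sub-folder per input file, named after the file stem.
        target = Path(output_dir) / path.stem
        target.mkdir(parents=True, exist_ok=True)
        for res in pipeline.predict(str(path)):
            res.save_to_json(save_path=str(target))
            res.save_to_markdown(save_path=str(target))
        processed.append(path.name)
    return processed
```

With the library installed, this could be called as parse_folder(PaddleOCRVL(), "scans", "output"), where "scans" and "output" are placeholder directory names.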

Conclusion

PaddleOCR-VL-1.5 demonstrates that "small but mighty" is the future of specialized AI. By maintaining an ultra-compact 0.9B footprint while pushing the boundaries of accuracy and real-world robustness, Baidu has made professional-grade document parsing accessible for edge deployment and large-scale industrial use.

Whether you are digitizing archives of ancient texts or automating invoice processing from mobile photos, PaddleOCR-VL-1.5 provides the reliability and versatility needed for robust document understanding.

Key Takeaways:

94.5% SOTA Accuracy on OmniDocBench v1.5.
Robustness First: Handles physical distortions like warping and skew better than most competing VLMs.
Multitasking capabilities including seal recognition and text spotting.
Ultra-compact 0.9B size ensures high efficiency and easy deployment.
