PaddleOCR-VL-1.5: SOTA Multimodal Document Parsing

Baidu announces PaddleOCR-VL-1.5, a 0.9B VLM achieving 94.5% on OmniDocBench v1.5 with breakthrough robustness in real-world scenarios.

by HowAIWorks Team
Tags: PaddleOCR, Baidu, Document Parsing, VLM, Open Source AI, OCR, Deep Learning, Vision-Language Model, Document AI, OmniDocBench, PaddlePaddle

Introduction

In January 2026, Baidu released PaddleOCR-VL-1.5, a next-generation vision-language model (VLM) designed for robust, in-the-wild document parsing. It builds on the original PaddleOCR-VL, which established a 0.9B-parameter baseline for efficient document understanding. Despite its compact footprint, the model achieves 94.5% accuracy on the OmniDocBench v1.5 benchmark, setting a new state of the art (SOTA) for document AI.

While the first version focused on establishing high-performance multilingual support, PaddleOCR-VL-1.5 addresses the most common failures in document processing: physical distortions, irregular layouts, and the "messiness" of real-world captures.

Key Capabilities of PaddleOCR-VL-1.5

PaddleOCR-VL-1.5 is far more than a simple incremental upgrade. It introduces a suite of advanced features designed to bridge the gap between academic benchmarks and real-world deployment challenges:

  • Superior Recognition Accuracy: By refining the underlying vision-language integration, the model achieves 94.5% accuracy on OmniDocBench v1.5. The improvement is most visible in complex elements such as nested tables and dense mathematical formulas.
  • Multimodal Multi-Tasking: For the first time, PaddleOCR natively supports seal recognition and text spotting (simultaneous localization and recognition of text lines) within the same compact 0.9B architecture.
  • Irregular-Shape Localization: The model introduces support for polygonal detection, allowing it to accurately identify and parse text blocks that are not perfectly rectangular. This is critical for documents that are captured in "the wild" with significant skew or perspective distortion.
  • Coherent Long-Document Parsing: Dealing with multi-page documents often leads to fragmented data. PaddleOCR-VL-1.5 solves this with automatic cross-page table merging and intelligent recognition of paragraph headings that span across page breaks.
  • Broad Linguistic & Script Support: The model strengthens its already impressive multilingual capabilities by improving recognition for ancient texts, rare characters, and specialized layouts (like underlines and checkboxes). It also officially adds support for the Tibetan script and Bengali language.
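
To make the value of polygonal localization concrete, the following minimal sketch compares the area of a skewed text line with the area of its axis-aligned bounding box. The coordinates are hypothetical and the code is plain geometry (the shoelace formula). It does not reproduce the model's detection head, which the article does not describe.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_area(points: List[Point]) -> float:
    """Area of a simple polygon, computed with the shoelace formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def bounding_box(points: List[Point]) -> Tuple[float, float, float, float]:
    """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def box_area(box: Tuple[float, float, float, float]) -> float:
    x_min, y_min, x_max, y_max = box
    return (x_max - x_min) * (y_max - y_min)


# Hypothetical skewed text line: 100 units wide, 20 units tall, sheared upward.
skewed_line = [(0.0, 0.0), (100.0, 30.0), (100.0, 50.0), (0.0, 20.0)]

text_area = polygon_area(skewed_line)
rect_area = box_area(bounding_box(skewed_line))
print(f"polygon area: {text_area}, bounding-box area: {rect_area}")
print(f"share of the box that is actual text: {text_area / rect_area:.0%}")
```

In this example the rectangle is more than twice the size of the text region, so a rectangular crop would pull in neighboring lines or background. A polygon that follows the skew avoids that.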

Real-World Robustness: The Real5-OmniDocBench

One of the most significant contributions of the PaddleOCR-VL-1.5 release is the introduction of the Real5-OmniDocBench benchmark. Most existing OCR benchmarks rely on tidy, digital-born PDFs. However, real-world data is often messy, noisy, and physically distorted.

To address this gap, the Baidu team curated Real5-OmniDocBench, focusing on five "in-the-wild" scenarios that typically cause OCR systems to fail:

  1. Scanning Artifacts: Physical scanning often introduces "salt-and-pepper" noise, color bleeding, or low-contrast backgrounds. PaddleOCR-VL-1.5 is trained to filter this noise and focus on the underlying structure.
  2. Document Skew: Whether it's a hand-held photo or a poorly aligned scan, skewed documents often break layout analysis. The model's support for irregular-shaped localization allows it to "straighten" the context internally for accurate parsing.
  3. Warping and Curvature: Books captured without being flattened or documents that have been folded create curved text lines. Version 1.5 uses advanced polygonal detection to follow these curves precisely.
  4. Screen Photography: Taking photos of computer screens or digital displays often results in moiré patterns and significant glare. The model's visual encoder is robust enough to differentiate between these artifacts and the actual text content.
  5. Challenging Illumination: Strong shadows, over-exposure, or uneven lighting (common in warehouse or mobile environments) are no longer deal-breakers. The model maintains high recognition rates even when parts of the document are poorly lit.

In all five scenarios, PaddleOCR-VL-1.5 consistently sets new SOTA records, demonstrating superior performance compared to both its predecessor and mainstream open-source VLMs.
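
For readers who want to probe this kind of robustness on their own data, one option is to degrade clean images synthetically before parsing them. The sketch below adds salt-and-pepper noise (the scanning artifact named in scenario 1) to a grayscale image stored as nested lists. The function name and the `amount` parameter are illustrative assumptions; this is not how Real5-OmniDocBench was built, which the article does not specify.

```python
import random
from typing import List

Image = List[List[int]]  # grayscale pixels in the range 0..255


def add_salt_and_pepper(image: Image, amount: float, seed: int = 0) -> Image:
    """Return a copy of `image` with roughly `amount` of its pixels
    replaced by pure black (pepper) or pure white (salt)."""
    rng = random.Random(seed)  # seeded so the degradation is reproducible
    noisy = [row[:] for row in image]
    for row in noisy:
        for x in range(len(row)):
            r = rng.random()
            if r < amount / 2:
                row[x] = 0      # pepper
            elif r < amount:
                row[x] = 255    # salt
    return noisy


# Hypothetical 32x32 mid-gray "page" with 10% of pixels corrupted.
clean = [[128] * 32 for _ in range(32)]
noisy = add_salt_and_pepper(clean, amount=0.1, seed=42)
changed = sum(
    1 for y in range(32) for x in range(32) if noisy[y][x] != clean[y][x]
)
print(f"{changed} of {32 * 32} pixels were corrupted")
```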

Performance Benchmarks

The competitive landscape for Vision-Language Models is fierce, but PaddleOCR-VL-1.5 carves out a unique niche by balancing size and performance. On the comprehensive OmniDocBench v1.5 leaderboard, it sets the standard for 1B-class models.

Overall Performance

The model's overall score reflects its ability to synthesize visual information and linguistic context. By achieving 94.5% accuracy, it demonstrates a deep understanding of reading order and document structure.
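
The article does not spell out how the 94.5% figure is computed. As general background, text recognition quality in OCR is commonly scored with a normalized edit distance between predicted and reference text. The sketch below shows that generic measure only; the function names are illustrative, and it should not be read as the benchmark's official scoring code.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def text_similarity(pred: str, ref: str) -> float:
    """1.0 for identical strings, 0.0 for completely different ones."""
    if not pred and not ref:
        return 1.0
    return 1.0 - edit_distance(pred, ref) / max(len(pred), len(ref))


# Hypothetical OCR output with one misread character ("0" instead of "O").
print(text_similarity("Invoice N0. 1234", "Invoice NO. 1234"))
```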

Specialized Recognition Metrics

  • Mathematical Formulas: Many traditional OCR systems struggle with complex LaTeX-style formulas. PaddleOCR-VL-1.5 achieves SOTA performance here, making it an invaluable tool for digitizing academic and scientific literature.
  • Table Structure Awareness: The model doesn't just recognize text in cells; it understands the hierarchy. This enables it to handle merged cells, multi-line headers, and tables without borders with high fidelity.

| Document Element      | Model Type         | SOTA Status | Performance Note                     |
|-----------------------|--------------------|-------------|--------------------------------------|
| Overall Context       | Compact VLM (0.9B) | NEW SOTA    | Surpasses prior PaddleOCR-VL models  |
| Mathematical Formulas | Compact VLM (0.9B) | NEW SOTA    | High fidelity LaTeX extraction       |
| Table Structure       | Compact VLM (0.9B) | NEW SOTA    | Handles irregular/borderless tables  |
| Text Spotting         | Compact VLM (0.9B) | NEW SOTA    | Integrated localization-recognition  |
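
To illustrate what "understanding the hierarchy" of a table means downstream, here is a minimal sketch that expands merged cells (row and column spans) into a regular grid, the kind of post-processing a consumer of structured table output might do. The `(text, rowspan, colspan)` cell format is a hypothetical representation chosen for this example, not the model's actual output schema.

```python
from typing import List, Optional, Tuple

# Hypothetical cell format: (text, rowspan, colspan)
Cell = Tuple[str, int, int]


def expand_spans(rows: List[List[Cell]]) -> List[List[str]]:
    """Expand merged cells so every grid position holds its cell's text."""
    grid: List[List[Optional[str]]] = []
    for r, row in enumerate(rows):
        while len(grid) <= r:
            grid.append([])
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already filled by a rowspan from a row above.
            while c < len(grid[r]) and grid[r][c] is not None:
                c += 1
            for dr in range(rowspan):
                while len(grid) <= r + dr:
                    grid.append([])
                target = grid[r + dr]
                while len(target) < c + colspan:
                    target.append(None)
                for dc in range(colspan):
                    target[c + dc] = text
            c += colspan
    width = max((len(row) for row in grid), default=0)
    # Pad short rows and replace unfilled positions with empty strings.
    return [
        [cell if cell is not None else "" for cell in row]
        + [""] * (width - len(row))
        for row in grid
    ]


# Two-level header: "Region" spans two rows, "Sales" spans two columns.
table = [
    [("Region", 2, 1), ("Sales", 1, 2)],
    [("Q1", 1, 1), ("Q2", 1, 1)],
    [("North", 1, 1), ("10", 1, 1), ("12", 1, 1)],
]
for line in expand_spans(table):
    print(line)
```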

Real-World Use Cases

The robustness of PaddleOCR-VL-1.5 makes it suitable for a wide variety of industrial and personal applications:

  • Mobile Archive Digitization: Well suited to apps that let users "scan" documents with phone cameras, where lighting and angles are inconsistent.
  • Logistics and Supply Chain: Quickly reading invoices and shipping labels that may be crumpled or poorly lit in warehouse environments.
  • Ancient Text Preservation: Its improved support for rare characters and ancient scripts makes it a key tool for historians and digital librarians.
  • Financial Document Automation: Handling complex multi-page reports with tables that span across pages, ensuring the data remains linked correctly.
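
The cross-page table case mentioned above can be pictured with a deliberately naive sketch: tables on consecutive pages are joined when their headers match. The `ParsedTable` structure and the matching rule are assumptions made for illustration only. The article does not describe how PaddleOCR-VL-1.5 actually decides that two tables belong together.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ParsedTable:
    """Hypothetical container for one table extracted from a document."""
    pages: List[int]       # page numbers this table covers
    header: List[str]
    rows: List[List[str]]


def merge_cross_page_tables(tables: List[ParsedTable]) -> List[ParsedTable]:
    """Join a table with the previous one when it starts on the very next
    page and repeats the same header (naive heuristic)."""
    merged: List[ParsedTable] = []
    for table in sorted(tables, key=lambda t: t.pages[0]):
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and table.pages[0] == prev.pages[-1] + 1
            and table.header == prev.header
        ):
            prev.pages.extend(table.pages)
            prev.rows.extend(table.rows)
        else:
            # Copy so the caller's input objects are never mutated.
            merged.append(ParsedTable(
                pages=list(table.pages),
                header=list(table.header),
                rows=[list(row) for row in table.rows],
            ))
    return merged


header = ["Account", "Balance"]
report = [
    ParsedTable([1], header, [["Cash", "100"]]),
    ParsedTable([2], header, [["Bonds", "250"]]),     # continues page 1
    ParsedTable([4], header, [["Equity", "900"]]),    # page 3 has no table
]
for t in merge_cross_page_tables(report):
    print(t.pages, t.rows)
```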

Installation and Usage

Getting started with PaddleOCR-VL-1.5 is straightforward. It requires PaddlePaddle version 3.2.1 or above.

Installation

# Install PaddlePaddle and PaddleOCR
python -m pip install paddlepaddle-gpu==3.2.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python -m pip install -U "paddleocr[doc-parser]"

Note: For macOS users, using Docker is the recommended way to set up the environment.

Basic Usage

CLI Command:

paddleocr doc_parser -i https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png

Python API:

from paddleocr import PaddleOCRVL

# Initialize and predict
pipeline = PaddleOCRVL()
output = pipeline.predict("your_document.png")

# Output results in various formats
for res in output:
    res.print()
    res.save_to_json(save_path="output")
    res.save_to_markdown(save_path="output")
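
Batch Processing (sketch):

For more than one file, the same calls can be wrapped in a small helper. The sketch below is an assumption-laden convenience wrapper, not part of PaddleOCR: `find_images`, `parse_folder`, and the set of file extensions are hypothetical. It relies only on the `predict`, `save_to_json`, and `save_to_markdown` calls shown above, and it takes the pipeline as an argument so it can be tested without loading the model.

```python
from pathlib import Path
from typing import List

# Assumed set of image extensions; adjust to the inputs you actually have.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def find_images(folder: str) -> List[Path]:
    """Return image files directly inside `folder`, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def parse_folder(pipeline, folder: str, save_path: str = "output") -> int:
    """Run `pipeline.predict` on every image in `folder` and save each
    result as JSON and Markdown. Returns the number of results saved."""
    saved = 0
    for image_path in find_images(folder):
        for res in pipeline.predict(str(image_path)):
            res.save_to_json(save_path=save_path)
            res.save_to_markdown(save_path=save_path)
            saved += 1
    return saved
```

With PaddleOCR installed, usage would look like `parse_folder(PaddleOCRVL(), "scans")`, where `scans` is a placeholder folder name.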

Conclusion

PaddleOCR-VL-1.5 demonstrates that "small but mighty" is the future of specialized AI. By maintaining an ultra-compact 0.9B footprint while pushing the boundaries of accuracy and real-world robustness, Baidu has made professional-grade document parsing accessible for edge deployment and large-scale industrial use.

Whether you are digitizing archives of ancient texts or automating invoice processing from mobile photos, PaddleOCR-VL-1.5 provides the reliability and versatility needed for robust document understanding.

Key Takeaways:

  • 94.5% SOTA Accuracy on OmniDocBench v1.5.
  • Robustness First: Handles physical distortions like warping and skew better than most competing VLMs.
  • Multitasking capabilities including seal recognition and text spotting.
  • Ultra-compact 0.9B size ensures high efficiency and easy deployment.


Frequently Asked Questions

PaddleOCR-VL-1.5 achieves a new state-of-the-art accuracy of 94.5% on OmniDocBench v1.5, introduces 'Real5-OmniDocBench' for evaluating real-world robustness, and adds support for seal recognition and text spotting.
The model supports irregular-shaped localization, enabling accurate polygonal detection under skewed, warped, or poorly illuminated conditions, outperforming both open-source and proprietary models in these scenarios.
PaddleOCR-VL-1.5 remains an ultra-compact model with 0.9B parameters, maintaining high efficiency for practical deployment while delivering SOTA performance.
In addition to general document parsing, the 1.5 version adds built-in support for seal recognition and text spotting (text-line localization and recognition).
For long documents, the model supports automatic cross-page table merging and cross-page paragraph heading recognition, which helps mitigate content fragmentation.
