UltraData-Math: Scaling High-Quality Mathematical Reasoning

OpenBMB releases UltraData-Math, a 290B+ token dataset with a unique tiered grading system to boost LLM performance in complex mathematical tasks.

by HowAIWorks Team
Tags: AI, LLM, Mathematics, OpenBMB, UltraData, Machine Learning, Datasets, MiniCPM

Introduction

High-quality pre-training data is the secret ingredient behind the reasoning capabilities of state-of-the-art Large Language Models (LLMs). While general text data is abundant, mathematical reasoning requires a level of precision and structure that typical web crawlers often fail to capture. Standard parsers frequently mangle complex formulas, and high-value mathematical discussions are often buried under layers of digital noise.

To bridge this gap, OpenBMB has introduced UltraData-Math, a large-scale, high-quality pre-training dataset specifically designed for mathematical reasoning tasks. With over 290 billion tokens, it represents a significant leap forward in how we construct and manage data for AI that thinks logically.

The UltraData-Math Framework

The dataset is built on the UltraData L0-L3 Tiered Data Management Framework, which organizes data into progressive levels of purity and instructional value:

  • L0: Raw Data Parsing: Uses a specialized mathematical parser to convert MathML, KaTeX, and AsciiMath into standardized LaTeX format, preserving the structural integrity of complex formulas.
  • L1: Heuristic Cleaning: Removes noise through document-level deduplication and rigorous heuristic rules to ensure readability.
  • L2: Quality Selection: Employs proprietary large models to score and distill the corpus into a high-value subset containing detailed problem-solving steps and academic discussions.
  • L3: Refined Data: The most advanced tier, featuring structured content rewritten as Q&A pairs, multi-turn dialogues, and knowledge-grounded textbooks to maximize "learnability."
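The tiering idea above can be sketched as successive filters over a corpus. The sketch below is a hypothetical illustration only: the function names, the keyword heuristic, and the threshold are ours, not OpenBMB's actual pipeline, which relies on a dedicated parser and proprietary scoring models.

```python
# Hypothetical sketch of the L0-L3 tiering idea: each stage narrows the corpus.
# Heuristics and thresholds here are illustrative, not OpenBMB's real pipeline.

def dedupe(docs):
    """L1-style document-level deduplication (exact-match only, for simplicity)."""
    seen, out = set(), []
    for d in docs:
        if d not in seen:
            seen.add(d)
            out.append(d)
    return out

def quality_score(doc):
    """Stand-in for L2 model-based scoring (the real pipeline uses an LLM grader)."""
    return 1.0 if "step" in doc.lower() else 0.2

docs = [
    "Solve step by step: 2 + 2 = 4.",
    "Solve step by step: 2 + 2 = 4.",  # duplicate, dropped at L1
    "buy cheap pills",                 # noise, dropped at L2
]
l1 = dedupe(docs)                                 # 2 unique documents survive
l2 = [d for d in l1 if quality_score(d) >= 0.5]   # 1 high-value document survives
print(len(l1), len(l2))  # 2 1
```

In the real framework the L2 filter is a learned quality model rather than a keyword check, and L3 goes further by rewriting surviving documents into Q&A and dialogue formats.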

Key Features

UltraData-Math stands out due to its systematic approach to data diversity and quality:

  • Specialized Parsing: Replaces general-purpose extraction tools such as readability with the UltraData-Math-Parser for faithful formula extraction.
  • Model-Based Grading: Uses model-driven quality assessment to separate high-value reasoning from simple lists and unrelated noise.
  • Diverse Formats: Includes synthetic data such as teacher-student dialogues and multiple rewriting styles, from rigorous competition-style logic to intuitive, popular-science explanations.
  • Instructional Density: Focuses on information density, ensuring that every token contributes to the model's logical growth.

The Core Challenge: Why Math is Hard for AI

Standard web datasets are chaotic. When a model reads a typical news article or a social media post, it deals with natural language patterns it has seen millions of times. However, mathematical reasoning is different. It relies on a precise sequence of logical steps, often represented by specialized notation that is easily corrupted during the data extraction process.

Most general-purpose HTML parsers treat mathematical formulas as plain text or, worse, ignore them entirely. This leads to "formula destruction," where a critical part of an equation is lost, rendering the entire document useless for training. Furthermore, mathematical knowledge is often hierarchical; you cannot understand calculus without mastering algebra. UltraData-Math addresses these challenges by treating mathematical data not just as text, but as a structured knowledge asset.
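A toy example makes this failure mode concrete: naively stripping tags from MathML markup flattens the formula's structure. The regex below is only an illustration of the problem, not the behavior of any specific production parser.

```python
import re

# An inline formula encoded as MathML: x squared, written with <msup>.
html = '<p>Solve <math><msup><mi>x</mi><mn>2</mn></msup></math> = 4</p>'

# A naive tag-stripping "parser" keeps the characters but destroys the structure:
naive = re.sub(r'<[^>]+>', '', html)
print(naive)  # "Solve x2 = 4" -- the superscript relationship is gone

# A math-aware parser would instead emit LaTeX, e.g. "Solve $x^{2}$ = 4",
# which is what UltraData-Math's L0 parsing stage aims to preserve.
```

Once the exponent is flattened into "x2", no downstream cleaning step can recover the original equation, which is why parsing quality has to be addressed at the very first tier.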

Impact on Model Development: Beyond Basic Arithmetic

The release of UltraData-Math has significant implications for the development of smaller, more efficient models. By providing high-density, "textbook-quality" data, it allows models like MiniCPM-1.2B to punch far above their weight class in reasoning tasks.

Instead of requiring trillions of general tokens to "infer" mathematical rules, models can now learn from clear, step-by-step reasoning chains (Chain-of-Thought) and interactive dialogues. This "instructional efficiency" is the key to creating capable AI that can run locally on mobile devices or edge hardware without sacrificing logical depth.
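To make "step-by-step reasoning chains" concrete, here is a hypothetical shape for a chain-of-thought training record; the field names are illustrative and are not the dataset's actual schema.

```python
# Hypothetical chain-of-thought record; field names are illustrative only,
# not the actual UltraData-Math L3 schema.
record = {
    "question": "If 3x + 5 = 20, what is x?",
    "chain_of_thought": [
        "Subtract 5 from both sides: 3x = 15.",
        "Divide both sides by 3: x = 5.",
    ],
    "answer": "5",
}

# A model trained on such records sees the intermediate steps explicitly,
# instead of having to infer them from unstructured web text.
for step in record["chain_of_thought"]:
    print(step)
```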

Experimental Results

The impact of UltraData-Math is evident in the performance of models trained on its various tiers. When tested on the MiniCPM-1.2B architecture, the dataset yielded impressive gains compared to standard baselines like Nemotron-CC:

| Benchmark | Baseline (%) | UltraData-Math (%) | Improvement |
|-----------|--------------|--------------------|-------------|
| MATH500   | 33.40        | 37.02              | +3.62 pp    |
| GSM8K     | 58.45        | 61.79              | +3.34 pp    |

These results demonstrate that the L3 refined tier significantly boosts both mathematical reasoning and general instruction-following capabilities.

How to Use

UltraData-Math is integrated into the Hugging Face ecosystem, making it easy to incorporate into your own training pipelines.

from datasets import load_dataset

# Load the L1 scale web corpus
ds_l1 = load_dataset("openbmb/UltraData-Math", "UltraData-Math-L1")

# Load the selected high-quality L2 preview
ds_l2 = load_dataset("openbmb/UltraData-Math", "UltraData-Math-L2-preview")

# Load the L3 refined dialogue data
ds_l3 = load_dataset("openbmb/UltraData-Math", "UltraData-Math-L3-Conversation-Synthetic")

Conclusion

UltraData-Math is more than just a collection of numbers and symbols; it is a systematic approach to teaching AI how to solve problems. By moving from raw web scrapings to "textbook-quality" refined data, OpenBMB is setting a new standard for open-source mathematical datasets. Whether you are fine-tuning a small model or pre-training the next giant, UltraData-Math provides the structural foundation needed for true reasoning capabilities.

Frequently Asked Questions

How does UltraData-Math differ from datasets built with general-purpose parsers?
Unlike general parsers, UltraData-Math uses a specialized 'UltraData-Math-Parser' that preserves LaTeX formatting and employs a four-tier (L0-L3) quality grading system.

How large is the dataset?
The total dataset contains over 290 billion tokens across its three accessible levels: L1 (Web Corpus), L2 (Selected Data), and L3 (Refined Data).

Is the dataset publicly available?
Yes, the dataset is available on Hugging Face under the OpenBMB organization for the research and development community.
