Introduction
High-quality pre-training data is the secret ingredient behind the reasoning capabilities of state-of-the-art Large Language Models (LLMs). While general text data is abundant, mathematical reasoning requires a level of precision and structure that typical web crawlers often fail to capture. Standard parsers frequently mangle complex formulas, and high-value mathematical discussions are often buried under layers of digital noise.
To bridge this gap, OpenBMB has introduced UltraData-Math, a large-scale, high-quality pre-training dataset specifically designed for mathematical reasoning tasks. With over 290 billion tokens, it represents a significant leap forward in how we construct and manage data for AI that thinks logically.
The UltraData-Math Framework
The dataset is built on the UltraData L0-L4 Tiered Data Management Framework, which organizes data into progressive levels of purity and instructional value:
- L0: Raw Data Parsing: Uses a specialized mathematical parser to convert MathML, KaTeX, and AsciiMath into standardized LaTeX format, preserving the structural integrity of complex formulas.
- L1: Heuristic Cleaning: Removes noise through document-level deduplication and rigorous heuristic rules to ensure readability.
- L2: Quality Selection: Employs proprietary large models to score and distill the corpus into a high-value subset containing detailed problem-solving steps and academic discussions.
- L3: Refined Data: The most advanced tier, featuring structured content rewritten as Q&A pairs, multi-turn dialogues, and knowledge-grounded textbooks to maximize "learnability."
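The L1 cleaning stage above can be sketched in a few lines. This is a toy illustration, not OpenBMB's actual pipeline: the real system's deduplication and heuristic rules are not public, so the thresholds and the link-ratio rule here are assumptions chosen only to show the shape of document-level dedup plus heuristic filtering.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially re-encoded copies hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()

def heuristic_ok(text: str, min_chars: int = 200) -> bool:
    """Toy heuristic rules (assumed, not the real ones): drop very short
    documents and pages that are mostly links."""
    if len(text) < min_chars:
        return False
    link_ratio = text.count("http") / max(len(text.split()), 1)
    return link_ratio < 0.2

def clean_corpus(docs):
    """Document-level exact deduplication plus heuristic filtering, L1-style."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest in seen or not heuristic_ok(doc):
            continue
        seen.add(digest)
        kept.append(doc)
    return kept
```

Production pipelines typically add fuzzy (MinHash-style) deduplication on top of the exact hashing shown here.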
Key Features
UltraData-Math stands out due to its systematic approach to data diversity and quality:
- Specialized Parsing: Replaces general-purpose tools like readability with the UltraData-Math-Parser for accurate formula extraction.
- Model-Based Grading: Uses model-driven quality assessment to separate high-value reasoning from simple lists or unrelated noise.
- Diverse Formats: Includes diverse synthetic data like teacher-student dialogues and multiple rewriting styles (from rigorous competition logic to intuitive science).
- Instructional Density: Focuses on "information density," ensuring that every token contributes to the model's logical growth.
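The model-based grading feature above relies on proprietary LLM scorers, which are not available to reproduce here. As a stand-in, the sketch below scores documents on crude surface signals of mathematical instructional value (formula markers and reasoning cues); the cue list and threshold are illustrative assumptions, not the dataset's actual criteria.

```python
import re

# Crude proxies for "detailed problem-solving steps"; the real pipeline
# uses large-model judgments rather than keyword counts.
REASONING_CUES = ("therefore", "hence", "step", "proof", "let ")

def quality_score(text: str) -> float:
    """Score a document on rough signals of mathematical instructional value."""
    lower = text.lower()
    formula_hits = len(re.findall(r"\\frac|\\sum|\$[^$]+\$|=", text))
    cue_hits = sum(lower.count(cue) for cue in REASONING_CUES)
    words = max(len(text.split()), 1)
    return (formula_hits + 2 * cue_hits) / words  # density, not raw count

def select_high_value(docs, threshold: float = 0.05):
    """Keep only documents whose reasoning density clears the threshold."""
    return [d for d in docs if quality_score(d) >= threshold]
```

Normalizing by word count matters: it rewards "information density" rather than sheer length, which mirrors the selection goal described above.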
The Core Challenge: Why Math is Hard for AI
Standard web datasets are chaotic. When a model reads a typical news article or a social media post, it deals with natural language patterns it has seen millions of times. However, mathematical reasoning is different. It relies on a precise sequence of logical steps, often represented by specialized notation that is easily corrupted during the data extraction process.
Most general-purpose HTML parsers treat mathematical formulas as plain text or, worse, ignore them entirely. This leads to "formula destruction," where a critical part of an equation is lost, rendering the entire document useless for training. Furthermore, mathematical knowledge is often hierarchical; you cannot understand calculus without mastering algebra. UltraData-Math addresses these challenges by treating mathematical data not just as text, but as a structured knowledge asset.
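To make "formula destruction" concrete: a naive text extractor reading `<mfrac><mi>a</mi><mi>b</mi></mfrac>` emits just "a b", losing the fraction. The UltraData-Math-Parser itself is not shown in this article, so the sketch below is a minimal illustration of the underlying idea, walking a tiny subset of presentation MathML as a tree so structure survives into LaTeX.

```python
import xml.etree.ElementTree as ET

def mathml_to_latex(node) -> str:
    """Recursively convert a tiny subset of presentation MathML to LaTeX.
    Keeping the tree lets us emit \\frac{a}{b} where flat text extraction
    would produce 'a b' and destroy the formula."""
    tag = node.tag.split("}")[-1]  # strip any XML namespace prefix
    children = [mathml_to_latex(child) for child in node]
    if tag in ("mi", "mn", "mo"):   # identifiers, numbers, operators
        return node.text or ""
    if tag == "mfrac":              # fraction: two children, numerator/denominator
        return r"\frac{%s}{%s}" % (children[0], children[1])
    if tag == "msup":               # superscript: base and exponent
        return "%s^{%s}" % (children[0], children[1])
    return "".join(children)        # <math>, <mrow>, and unknown containers

xml = "<math><mfrac><mi>a</mi><msup><mi>x</mi><mn>2</mn></msup></mfrac></math>"
latex = mathml_to_latex(ET.fromstring(xml))
```

A real parser must also handle roots, matrices, multi-script elements, and malformed markup, which is where most of the engineering effort goes.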
Impact on Model Development: Beyond Basic Arithmetic
The release of UltraData-Math has significant implications for the development of smaller, more efficient models. By providing high-density, "textbook-quality" data, it allows models like MiniCPM-1.2B to punch far above their weight class in reasoning tasks.
Instead of requiring trillions of general tokens to "infer" mathematical rules, models can now learn from clear, step-by-step reasoning chains (Chain-of-Thought) and interactive dialogues. This "instructional efficiency" is the key to creating capable AI that can run locally on mobile devices or edge hardware without sacrificing logical depth.
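As a rough picture of what such training examples look like, the record below uses a hypothetical schema; the actual field names in UltraData-Math's L3 dialogue data may differ.

```python
# Hypothetical shape of an L3-style refined record (field names assumed,
# not taken from the real UltraData-Math schema).
record = {
    "messages": [
        {"role": "teacher", "content": "What is the derivative of x^2?"},
        {"role": "student", "content": "Using the power rule, d/dx x^2 = 2x."},
    ],
    "source_tier": "L3",
}

def to_training_text(rec) -> str:
    """Flatten a multi-turn dialogue into one training string."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in rec["messages"])
```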
Experimental Results
The impact of UltraData-Math is evident in the performance of models trained on its various tiers. When tested on the MiniCPM-1.2B architecture, the dataset yielded impressive gains compared to standard baselines like Nemotron-CC:
| Benchmark | Baseline (Nemotron-CC) | UltraData-Math | Improvement |
|---|---|---|---|
| MATH500 | 33.40% | 37.02% | +3.62pp |
| GSM8K | 58.45% | 61.79% | +3.34pp |
These results demonstrate that the L3 refined tier significantly boosts both mathematical reasoning and general instruction-following capabilities.
How to Use
UltraData-Math is integrated into the Hugging Face ecosystem, making it easy to incorporate into your own training pipelines.
```python
from datasets import load_dataset

# Load the large-scale L1 web corpus
ds_l1 = load_dataset("openbmb/UltraData-Math", "UltraData-Math-L1")

# Load the selected high-quality L2 preview
ds_l2 = load_dataset("openbmb/UltraData-Math", "UltraData-Math-L2-preview")

# Load the L3 refined dialogue data
ds_l3 = load_dataset("openbmb/UltraData-Math", "UltraData-Math-L3-Conversation-Synthetic")
```
Conclusion
UltraData-Math is more than just a collection of numbers and symbols; it is a systematic approach to teaching AI how to solve problems. By moving from raw web scrapings to "textbook-quality" refined data, OpenBMB is setting a new standard for open-source mathematical datasets. Whether you are fine-tuning a small model or pre-training the next giant, UltraData-Math provides the structural foundation needed for true reasoning capabilities.