NVIDIA RLP: Reinforcement Learning Pretraining for AI Models

NVIDIA introduces RLP, integrating reinforcement learning into pretraining to teach models to think before predicting, achieving +35% gains with 200B fewer tokens.

by HowAIWorks Team
Tags: nvidia, reinforcement-learning, pretraining, ai-models, reasoning, chain-of-thought, research, machine-learning, nemotron, qwen

Introduction

NVIDIA has introduced RLP (Reinforcement Learning Pretraining), a groundbreaking approach that fundamentally changes how large language models learn to reason. Instead of treating reasoning as an afterthought added during post-training, RLP integrates reinforcement learning directly into the pretraining stage, teaching models to "think before they predict" from the very beginning.

Published on September 30, 2025, by NVIDIA's Advanced Deep Learning Research (ADLR) team, RLP represents a paradigm shift in AI model development. The method rewards models for generating useful chains-of-thought (CoT) that actually improve next-token prediction, creating a verifier-free, dense, and scalable approach to teaching reasoning at the foundation level.

Understanding RLP: How It Works

The Core Concept

RLP treats chain-of-thought generation as an explicit action taken before predicting each next token. Rather than simply predicting the next word in a sequence, the model first generates an internal thought process, then uses that reasoning to make better predictions.

The RLP Process (a code sketch appears after the list):

  • Step 1: Sample internal thought - The model generates a chain-of-thought about what might come next
  • Step 2: Predict with context - The model predicts the observed token using both the original context and the CoT
  • Step 3: Calculate reward - The model receives a reward based on how much the CoT improved prediction accuracy
  • Step 4: Learn from feedback - The model learns to generate more useful thoughts through reinforcement learning
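
As a rough illustration, here is a minimal, self-contained sketch of one RLP step in Python. The stub functions (sample_thought, next_token_logprob), the toy vocabulary, and all numbers are our own placeholders standing in for a real language model; this is not NVIDIA's implementation.

```python
import math

# Toy vocabulary and "model": in real RLP the policy is a large language
# model; here we stub it so the sketch runs end to end.
VOCAB = ["the", "cat", "sat", "mat", "<eot>"]

def sample_thought(context):
    """Step 1: sample an internal chain-of-thought about what comes next.
    A real model would decode CoT tokens; the stub returns a canned thought."""
    return f"(thinking about what follows '{context[-1]}')"

def next_token_logprob(context, target, thought=None):
    """Return log p(target | context[, thought]). The stub gives a small
    bonus when a thought is present, standing in for the LLM's likelihood."""
    base = 1.0 / len(VOCAB)
    boost = 0.15 if thought is not None else 0.0
    p = min(base + boost, 0.99) if target in VOCAB else 1e-6
    return math.log(p)

def rlp_step(context, observed_token, ema_logprob_fn):
    # Step 1: sample an internal thought (the "action").
    thought = sample_thought(context)
    # Step 2: score the observed token with the thought in context.
    logp_with_cot = next_token_logprob(context, observed_token, thought)
    # The "no-think" baseline comes from a frozen EMA copy of the model.
    logp_no_think = ema_logprob_fn(context, observed_token)
    # Step 3: information-gain reward = how much the thought helped.
    reward = logp_with_cot - logp_no_think
    # Step 4: this reward would drive a policy-gradient update on the
    # thought tokens (omitted here).
    return thought, reward

context = ["the", "cat"]
thought, reward = rlp_step(context, "sat",
                           ema_logprob_fn=lambda c, t: next_token_logprob(c, t))
print(f"thought={thought!r}, reward={reward:+.3f}")
```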

Verifier-Free Information Gain Reward

Unlike traditional methods that require external verifiers or labeled data, RLP uses a verifier-free information gain reward system:

  • Dense signal - Rewards are assigned at every position where thinking improves prediction
  • Self-supervised - No external verifiers or human annotations needed
  • Scalable - Works on any text corpus, from academic papers to web content
  • Dynamic baseline - Uses an EMA (Exponential Moving Average) baseline for stable training

The reward is calculated as the increase in log-likelihood of the observed token when the chain-of-thought is present compared to a "no-think" baseline. This creates a natural, self-supervised signal that teaches the model when and how to reason effectively.
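
In symbols (our notation, not necessarily the paper's): with x_t the observed token, x_{<t} the preceding context, c_t the sampled chain-of-thought, p_θ the current model, and p_EMA the frozen no-think baseline, the per-position reward is

```latex
r_t = \log p_{\theta}\left(x_t \mid x_{<t}, c_t\right) - \log p_{\mathrm{EMA}}\left(x_t \mid x_{<t}\right)
```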

Performance Results: Qwen3-1.7B-Base

Experimental Setup

NVIDIA tested RLP on Qwen3-1.7B-Base, comparing three models through identical post-training:

  • BASE - Original base model
  • CPT - Compute-matched continuous pretraining baseline
  • RLP - Model trained with Reinforcement Learning Pretraining

All three models underwent the same post-training pipeline with Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to ensure a fair comparison.

Pretraining Phase Results

During pretraining alone, RLP demonstrated superior performance:

  • +19% improvement over the original base model on average benchmarks
  • +17% improvement over compute-matched CPT baseline
  • Strong generalization across math, science, and reasoning tasks
  • Consistent gains before any post-training was applied

Post-Training Performance

The benefits of RLP compound rather than disappear after post-training:

  • +8% relative advantage maintained after full post-training
  • +3 absolute points on science benchmarks over CPT after alignment
  • Durable reasoning foundations that persist through SFT and RLVR
  • Broad generalization beyond math to multiple domains

Key Takeaway

RLP establishes a decisive pretraining advantage that compounds with traditional post-training methods, proving that foundational reasoning capabilities built during pretraining create lasting improvements.

Scaling to Larger Models: Nemotron-Nano-12B-V2

Impressive Efficiency Gains

NVIDIA applied RLP to an intermediate checkpoint of Nemotron-Nano-12B-V2 (trained on 19.8 trillion tokens) for just 250 million additional tokens:

  • Overall average increased from 42.81% to 61.32%
  • +35% relative improvement on average across all benchmarks
  • 200 billion fewer tokens used compared to the base model
  • Cross-architecture generalization demonstrated on a different model family

Domain-Specific Improvements

RLP achieved particularly strong results across multiple domains:

Science Reasoning:

  • +23% absolute improvement - Most striking gain across all domains
  • Enhanced multi-step reasoning capabilities
  • Better handling of complex scientific concepts

Math Performance:

  • Moderate improvements in mathematical reasoning
  • Consistent gains across different math benchmarks
  • Improved problem-solving approaches

General Reasoning:

  • Broad improvements across diverse reasoning tasks
  • Better logical inference capabilities
  • Enhanced context understanding

Scaling Insights

The Nemotron results demonstrate that:

  • RLP benefits amplify at scale - Larger models see even stronger improvements
  • Architecture agnostic - Works across different model families (Qwen, Nemotron)
  • Token efficient - Achieves better results with significantly fewer training tokens
  • Production ready - Practical for real-world deployment scenarios

Generalization Across Diverse Corpora

Testing on Six Dataset Types

NVIDIA tested RLP on Qwen3-1.7B-Base across six different corpus families:

  • Academic papers - Scientific and research publications
  • Textbooks - Educational materials across subjects
  • Web crawl - Diverse internet content
  • SFT-style data - Supervised fine-tuning datasets
  • Mixed corpora - Combined dataset types
  • General-purpose - Broad domain coverage

Consistent Performance Gains

RLP demonstrated remarkable consistency:

  • 7-9% average improvements across all corpus types
  • Strongest gains on SFT-style and general-purpose data
  • True cross-domain transfer - Simultaneous improvements across all benchmarks
  • No domain-specific tuning required

Finding Reasoning Everywhere

One of RLP's most impressive characteristics is its ability to find reasoning signals in unexpected places:

  • Web crawl data - Even non-curated internet content provides reasoning opportunities
  • No curation needed - Eliminates costly dataset preparation
  • Data efficiency - Leverages existing pretraining corpora
  • Universal applicability - Works with the same data streams as standard pretraining

This demonstrates that RLP can enhance reasoning ability using ordinary pretraining data, making it truly scalable without requiring expensive, specialized datasets.

Key Advantages of RLP

Scalability

  • Works at pretraining scale - Operates on massive text streams
  • No special datasets required - Uses standard pretraining corpora
  • Architecture agnostic - Generalizes across different model families
  • Size scalable - Benefits increase with larger models

Efficiency

  • Token efficient - Achieves better results with fewer tokens
  • Compute effective - Integrates seamlessly into existing pretraining
  • Time efficient - Single unified training phase instead of multi-stage pipelines
  • Cost effective - Reduces need for expensive post-training data curation

Performance

  • Strong baseline improvements - Significant gains before post-training
  • Compounding benefits - Advantages persist and strengthen through alignment
  • Broad generalization - Improvements across math, science, reasoning, and more
  • Robust gains - Consistent performance across diverse benchmarks

Practical Benefits

  • Verifier-free - No external verification systems needed
  • Dense rewards - Learning signal at every position
  • Self-supervised - No human annotations required
  • Production ready - Practical for real-world deployment

Technical Implementation

Reward Mechanism

RLP calculates rewards by contrasting predictions (a short sketch appears after the list):

  • With CoT - Model prediction conditioned on chain-of-thought
  • Without CoT - Baseline prediction using EMA model without thinking
  • Information gain - Reward equals improvement in next-token prediction
  • Position-wise credit - Assigns credit wherever thinking helps
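
As a hedged illustration of the position-wise credit assignment, the sketch below assumes the two scoring passes expose per-token log-probabilities; the function name and example numbers are ours, not from the paper.

```python
from typing import List

def information_gain_rewards(logprobs_with_cot: List[float],
                             logprobs_no_think: List[float]) -> List[float]:
    """Per-position reward: how much the chain-of-thought improved the
    log-likelihood of each observed token versus the no-think EMA baseline.
    Positive values mean thinking helped at that position."""
    assert len(logprobs_with_cot) == len(logprobs_no_think)
    return [with_cot - baseline
            for with_cot, baseline in zip(logprobs_with_cot, logprobs_no_think)]

# Example: thinking helps at positions 0 and 2, is neutral at position 1.
rewards = information_gain_rewards([-1.2, -2.0, -0.4], [-1.9, -2.0, -1.1])
print(rewards)  # approximately [0.7, 0.0, 0.7]
```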

Dynamic EMA Baseline

The Exponential Moving Average baseline provides several advantages (a minimal update sketch appears after the list):

  • Stable training - Smooths out reward variance
  • Meaningful comparison - Compares current model to its recent past
  • Adaptive learning - Baseline evolves with model capabilities
  • Credit assignment - Helps identify truly useful reasoning
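
A minimal sketch of how such a baseline can be maintained; the decay value and the dictionary representation of weights are illustrative assumptions, not details from the paper.

```python
def ema_update(ema_weights: dict, model_weights: dict, decay: float = 0.99) -> dict:
    """Move the baseline's weights a small step toward the current model.
    The resulting EMA model produces the 'no-think' log-probabilities that
    the reward is measured against."""
    return {name: decay * ema_weights[name] + (1.0 - decay) * model_weights[name]
            for name in ema_weights}

# Toy example with scalar "weights".
ema = {"w": 1.0}
current = {"w": 2.0}
ema = ema_update(ema, current)  # ema["w"] is now approximately 1.01
```

Because the baseline tracks the model's own recent past, the reward measures the genuine benefit of thinking rather than general training progress.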

Integration with Pretraining

RLP augments standard next-token prediction (a conceptual sketch of the combined objective appears after the list):

  • Seamless integration - Works alongside maximum likelihood training
  • Unified objective - Single training phase combines prediction and reasoning
  • Scalable infrastructure - Uses existing pretraining pipelines
  • Minimal overhead - Efficient implementation for large-scale training
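
As a conceptual sketch only: the snippet below combines a standard cross-entropy term with a REINFORCE-style term weighted by the information-gain reward. This simplification is ours for illustration and is not NVIDIA's published objective; all names and values are assumptions.

```python
def rlp_augmented_loss(nll_loss: float,
                       thought_logprob: float,
                       reward: float,
                       rl_weight: float = 1.0) -> float:
    """Combine standard next-token prediction with a policy-gradient-style
    term that reinforces thoughts in proportion to how much they improved
    prediction (positive reward -> thought made more likely).
    In real training these would be differentiable tensors, not floats."""
    # Minimizing -reward * log p(thought) pushes up the probability of
    # thoughts that earned positive information-gain rewards.
    reinforce_term = -reward * thought_logprob
    return nll_loss + rl_weight * reinforce_term

print(rlp_augmented_loss(nll_loss=2.3, thought_logprob=-5.0, reward=0.6))
# 2.3 + 1.0 * (-(0.6) * (-5.0)) = 5.3
```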

Implications for AI Development

Paradigm Shift in Model Training

RLP challenges the traditional approach to AI model development:

Traditional Approach:

  1. Pretrain on next-token prediction
  2. Fine-tune on supervised data
  3. Add reasoning through post-training RL

RLP Approach:

  1. Pretrain with integrated reasoning from day one
  2. Build foundational reasoning capabilities
  3. Compound benefits through post-training

Future of AI Reasoning

RLP suggests several important directions for AI development:

  • Reasoning as foundation - Treating reasoning as core capability, not add-on
  • Unified training - Single-phase training that combines prediction and reasoning
  • Scalable methods - Approaches that work with ordinary pretraining data
  • Efficient learning - Better results with fewer tokens and less curation

Practical Applications

Models trained with RLP could excel at:

  • Complex problem-solving - Enhanced multi-step reasoning capabilities
  • Scientific reasoning - Improved understanding of scientific concepts
  • Mathematical tasks - Better mathematical problem-solving
  • General reasoning - Stronger logical inference across domains
  • Autonomous agents - More reliable reasoning for agentic AI systems

Research Contributions

Novel Methodology

RLP introduces several innovative concepts:

  • Reinforcement learning as pretraining - Novel method to integrate RL directly into pretraining at scale
  • Verifier-free rewards - Dense, self-supervised signal without external verification
  • Thinking as action - Treats CoT generation as exploratory action in RL framework
  • Information gain objectives - Uses prediction improvement as natural reward signal

Comprehensive Evaluation

The research includes extensive validation:

  • Multiple model sizes - From 1.7B to 12B parameters
  • Multiple architectures - Qwen and Nemotron families
  • Diverse datasets - Six different corpus types tested
  • Ablation studies - Systematic analysis of key components
  • Post-training analysis - Shows benefits persist through alignment

Open Research

NVIDIA has made the research accessible:

  • Published paper - Detailed methodology and results
  • Code release - Implementation available on GitHub
  • Reproducible results - Clear experimental setup and benchmarks
  • Community contribution - Advancing the field of AI reasoning

Conclusion

NVIDIA's RLP (Reinforcement Learning Pretraining) represents a fundamental rethinking of how we build reasoning capabilities into AI models. By integrating reinforcement learning directly into pretraining, RLP teaches models to think before they predict, creating foundational reasoning abilities that persist and compound through subsequent training stages.

Key Achievements:

  • +19% improvement over base models and +17% over continuous pretraining on Qwen3-1.7B
  • +35% average gain on Nemotron-Nano-12B-V2 using 200B fewer tokens
  • +23% absolute improvement in science reasoning on larger models
  • Consistent generalization across diverse corpora and model architectures
  • Compounding benefits that persist and strengthen through post-training

RLP's verifier-free, dense, and scalable approach makes it practical for real-world deployment while achieving state-of-the-art results. By finding reasoning signals in ordinary pretraining data, RLP eliminates the need for costly dataset curation and establishes a new paradigm where reasoning is a core capability built from the foundation up.

This research opens exciting possibilities for building AI models that naturally integrate reasoning into their prediction processes, potentially leading to more capable, reliable, and efficient AI systems across all domains.

Ready to dive deeper into AI concepts? Explore our AI Fundamentals course to understand the building blocks of modern AI, check out our glossary for key terms like chain-of-thought and reinforcement learning, or visit our models catalog to learn about the latest AI models.

This article covers groundbreaking research in AI model pretraining. For more cutting-edge AI news and analysis, check out our blog or explore related topics in our prompt engineering guide.

Frequently Asked Questions

What is RLP (Reinforcement Learning Pretraining)?
RLP is a method that integrates reinforcement learning directly into the pretraining stage, rewarding models for generating useful chains-of-thought that help predict future tokens. It is verifier-free, dense, and scalable.

What performance gains does RLP deliver?
RLP achieves a +19% improvement over base models and +17% over continuous pretraining on Qwen3-1.7B. On Nemotron-Nano-12B-V2, it achieves a +35% average gain with 200B fewer tokens, including a +23% absolute improvement in science reasoning.

How does RLP differ from traditional post-training RL?
Unlike traditional RL that is added during post-training, RLP weaves reasoning directly into pretraining by rewarding chains-of-thought based on their value for next-token prediction. This creates foundational reasoning abilities that persist through alignment.

Does RLP generalize across architectures and datasets?
Yes. RLP demonstrates strong generalization across different model architectures (Qwen, Nemotron), model sizes (1.7B to 12B parameters), and diverse corpora including academic papers, textbooks, web crawl, and SFT-style data.

Do RLP's benefits survive post-training?
Yes. RLP establishes foundational reasoning capabilities that compound with traditional post-training methods like SFT and RLVR. Models trained with RLP maintain their advantages after post-training, showing a +8% relative improvement over models without RLP.

Why is RLP scalable?
RLP is scalable because it uses verifier-free rewards, requires no special dataset curation, works with ordinary pretraining data, and integrates seamlessly into existing pretraining pipelines without needing external verification systems.

Continue Your AI Journey

Explore our lessons and glossary to deepen your understanding.