Simple Continual Pretraining (SCP)

Straightforward method for converting autoregressive models to diffusion models through continued training with bidirectional attention.

Tags: model conversion, pretraining, diffusion models, A2D, continual learning, transformers, training techniques

Definition

Simple Continual Pretraining (SCP) is a straightforward methodology for converting autoregressive (AR) language models into diffusion language models through continued training. Introduced by Radical Numerics in their RND1 research, SCP demonstrates that simple techniques can effectively transform pretrained AR models into diffusion models without complex multi-stage pipelines.

The approach preserves the vast knowledge acquired during autoregressive pretraining on trillions of tokens while introducing bidirectional context processing and diffusion-based generation capabilities. SCP represents a practical solution to the challenge of training large-scale diffusion language models (DLMs) by leveraging existing AR infrastructure and expertise.

How It Works

Simple Continual Pretraining follows a remarkably straightforward three-step process that contrasts with more complex conversion approaches (a minimal training-step sketch follows the steps below):

The SCP Process

1. Start from Strong AR Checkpoint:

  • Begin with a well-trained autoregressive model (e.g., Qwen3-30B-A3B)
  • Leverage knowledge from trillions of tokens of pretraining
  • Utilize mature AR training infrastructure and stability
  • Preserve factual associations and linguistic understanding

2. Replace Attention Mask:

  • Swap causal (left-to-right) attention mask with bidirectional mask
  • Enable the model to attend to tokens in both directions
  • Initialize with bidirectional attention at the start of conversion
  • No gradual transition or annealing required

3. Continue Pretraining:

  • Train with masked diffusion objective
  • Use learning rate warmup for training stability
  • Continue for substantial token count (e.g., 500B tokens for RND1)
  • Monitor convergence to full diffusion behavior
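
The snippet below is a minimal sketch of what a single SCP training step could look like in PyTorch, assuming a Hugging Face-style model whose attention has already been switched to bidirectional and whose tokenizer includes a [MASK] token (MASK_ID here is a placeholder). It only illustrates the masked diffusion objective; the actual RND1 recipe involves more (loss weighting, large-scale data pipelines, distributed training).

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # placeholder mask-token id; the real id depends on the tokenizer

def masked_diffusion_step(model, input_ids, optimizer):
    """One continued-pretraining step with a masked diffusion objective."""
    # Sample a masking rate per sequence, then corrupt tokens at that rate
    # (clamped away from zero so every step has some supervision).
    rate = torch.rand(input_ids.size(0), 1, device=input_ids.device).clamp_(0.05, 1.0)
    is_masked = torch.rand(input_ids.shape, device=input_ids.device) < rate
    noisy_ids = torch.where(is_masked, torch.full_like(input_ids, MASK_ID), input_ids)

    # Bidirectional forward pass over the corrupted sequence (no causal mask).
    logits = model(input_ids=noisy_ids).logits        # (batch, seq_len, vocab)

    # Only the masked positions contribute to the loss -- the diffusion signal.
    loss = F.cross_entropy(logits[is_masked], input_ids[is_masked])
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.detach()
```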

Key Innovation: Simplicity

Traditional A2D (autoregressive-to-diffusion) conversion methods employ complex approaches:

Complex Methods (Not Required):

  • Attention mask annealing: Gradually relaxing causal mask over multiple stages
  • Grafting: Systematically editing model architectures layer by layer
  • Multi-stage pipelines: Multiple conversion phases with complex scheduling
  • Transition policies: Sophisticated rules for mask progression

SCP Approach (Proven Effective):

  • Single-stage conversion with immediate bidirectional attention
  • No complex scheduling or transition rules
  • Straightforward implementation and reproducibility
  • Matches or exceeds performance of sophisticated methods

Key Concepts

Layer-Specific Learning Rates

A critical component of SCP is using different learning rates for different parameter groups to prevent catastrophic forgetting (a minimal optimizer sketch follows this subsection):

Attention Layers:

  • Higher learning rates (e.g., 2-5x base rate)
  • Need to adapt to bidirectional context processing
  • Must learn new attention patterns
  • Critical for diffusion behavior

MLP/FFN Layers:

  • Lower learning rates (e.g., 0.2-0.5x base rate)
  • Preserve factual knowledge encoded during AR pretraining
  • Maintain linguistic associations and world knowledge
  • Based on research showing factual knowledge resides in feed-forward layers

Rationale:

  • Research shows factual associations in Transformer models are primarily encoded in FFN/MLP layers
  • Attention layers need more plasticity to adapt to bidirectional processing
  • This balance preserves AR knowledge while enabling diffusion capabilities
  • Prevents catastrophic forgetting of pretrained knowledge
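
A minimal sketch of how layer-specific learning rates can be wired up with PyTorch optimizer parameter groups is shown below. The name-matching heuristics and the 2x / 0.2x multipliers are illustrative assumptions, not the values used for RND1.

```python
import torch

def build_optimizer(model, base_lr=1e-4):
    """Group parameters so attention adapts quickly while MLPs change slowly."""
    attn_params, mlp_params, other_params = [], [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if "attn" in name or "attention" in name:
            attn_params.append(param)   # must learn bidirectional attention patterns
        elif "mlp" in name or "ffn" in name:
            mlp_params.append(param)    # carries much of the factual knowledge
        else:
            other_params.append(param)  # embeddings, norms, router, etc.
    return torch.optim.AdamW([
        {"params": attn_params,  "lr": base_lr * 2.0},   # illustrative multiplier
        {"params": mlp_params,   "lr": base_lr * 0.2},   # illustrative multiplier
        {"params": other_params, "lr": base_lr},
    ])
```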

Batch Size Optimization

SCP research revealed important insights about batch sizing for diffusion training (a back-of-the-envelope comparison follows the lists below):

The Challenge:

  • AR models: Every token contributes to the loss
  • Diffusion models: Only masked positions (~50% on average) contribute to learning
  • Traditional AR batch size heuristics don't transfer to diffusion

SCP Discovery:

  • Diffusion models benefit from larger batch sizes than AR models
  • Critical batch size (CBS) for DLMs is higher than for AR training
  • Empirical testing at the 4B parameter scale showed loss continuing to decrease as batch size grew to 8M tokens
  • Larger batches provide more learning signal from masked tokens

Practical Implications:

  • Use larger effective batch sizes than typical AR training
  • Scale batch size with model size for optimal efficiency
  • Consider the reduced supervision signal from masking
  • Important for large-scale diffusion model training
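
The back-of-the-envelope comparison below shows why diffusion batches need to be larger to deliver a comparable amount of supervision per step; all numbers are illustrative, not measured values.

```python
# Rough comparison of supervised positions per optimizer step (illustrative numbers).
ar_batch_tokens = 4_000_000        # AR training: every token is supervised
diffusion_batch_tokens = 8_000_000
avg_mask_rate = 0.5                # ~50% of positions are masked on average

ar_supervised = ar_batch_tokens                                  # 4.0M positions
diffusion_supervised = diffusion_batch_tokens * avg_mask_rate    # 4.0M positions

# Roughly doubling the diffusion batch restores the per-step learning signal,
# which is consistent with DLMs having a higher critical batch size.
print(ar_supervised, int(diffusion_supervised))
```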

Knowledge Preservation

SCP explicitly addresses the challenge of maintaining AR pretraining knowledge:

Catastrophic Forgetting Problem:

  • Changing the architecture and training objective can overwrite knowledge learned during pretraining
  • Factual associations may be lost during adaptation
  • General capabilities can degrade with aggressive fine-tuning

SCP Solutions:

  • Layer-specific learning rates protect knowledge-bearing parameters
  • Starting from strong checkpoint provides robust foundation
  • Gradual learning rate warmup prevents sudden weight changes (see the warmup sketch after this list)
  • Continued training on large token counts ensures stable adaptation
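
As a concrete example of the warmup mentioned above, the snippet below builds a simple linear warmup schedule on top of a PyTorch optimizer; the 2,000-step warmup length is an illustrative assumption, not the RND1 setting.

```python
import torch

def linear_warmup(optimizer, warmup_steps=2000):
    """Scale every parameter group's learning rate from ~0 up to its target."""
    def lr_lambda(step):
        return min(1.0, (step + 1) / warmup_steps)
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Usage: call scheduler.step() once after each optimizer.step().
```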

Real-World Applications

Model Conversion Research

SCP provides a practical framework for A2D conversion:

Research Benefits:

  • Reproducible methodology for converting AR to diffusion models
  • Lower barrier to entry for diffusion language model research
  • Enables systematic study of conversion factors
  • Foundation for understanding diffusion model training

Practical Implementation:

  • Clear, documented procedure for conversion
  • Open-source code and training recipes available
  • Successful demonstration at 30B parameter scale (RND1)
  • Applicable to various AR model architectures

Large-Scale DLM Development

SCP enables creation of large diffusion language models:

RND1 Success:

  • 30B parameter Mixture-of-Experts model (~3B active parameters per token)
  • Largest and most capable open-source diffusion language model
  • State-of-the-art performance among open DLMs
  • Demonstrates A2D viability at scale

Efficiency Advantages:

  • Leverages existing AR pretraining compute
  • Avoids training diffusion models from scratch
  • Utilizes mature AR training infrastructure
  • More efficient use of computational resources

Alternative Generation Paradigms

SCP-converted models offer different generation capabilities (a simplified sampler sketch follows the list below):

Diffusion Benefits:

  • Parallel token generation instead of sequential
  • Bidirectional context understanding
  • Iterative refinement capabilities
  • Alternative text generation dynamics for specific use cases
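
For intuition on what parallel, iterative generation can look like, here is a heavily simplified confidence-based unmasking loop. It is a sketch only: real DLM samplers use more careful remasking schedules, and the Hugging Face-style `.logits` interface is an assumption about the model wrapper.

```python
import torch

@torch.no_grad()
def iterative_unmask(model, length, steps, mask_id, device="cuda"):
    """Generate a sequence by progressively filling in high-confidence tokens."""
    seq = torch.full((1, length), mask_id, dtype=torch.long, device=device)
    for s in range(steps):
        still_masked = seq == mask_id
        if not still_masked.any():
            break
        logits = model(input_ids=seq).logits           # bidirectional pass, (1, L, V)
        conf, pred = logits.softmax(-1).max(-1)        # per-position confidence, (1, L)
        # Unmask roughly an equal share of the remaining masked positions per step.
        k = max(1, int(still_masked.sum().item()) // (steps - s))
        conf = conf.masked_fill(~still_masked, -1.0)   # never re-pick filled slots
        top = conf.topk(k, dim=-1).indices             # (1, k)
        seq[0, top[0]] = pred[0, top[0]]
    return seq
```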

Challenges and Considerations

Training Complexity

Hyperparameter Tuning:

  • Determining optimal layer-specific learning rate ratios
  • Finding appropriate batch sizes for different model scales
  • Balancing training stability with adaptation speed
  • Monitoring convergence to full diffusion behavior during optimization

Resource Requirements:

  • Substantial continued pretraining (e.g., 500B tokens)
  • Large batch sizes require significant GPU memory
  • Extended training time for full conversion
  • Computational costs for large-scale models

Conversion Quality

Preservation vs. Adaptation Trade-off:

  • Balancing knowledge retention with new capabilities
  • Ensuring bidirectional attention fully develops
  • Maintaining performance on downstream tasks
  • Avoiding degradation of general capabilities

Validation Challenges:

  • Measuring successful conversion to diffusion behavior
  • Comparing with original AR model performance
  • Evaluating diffusion-specific capabilities
  • Benchmarking against other DLMs

Related Techniques

Fine-tuning Comparison

While related to fine-tuning, SCP has distinct characteristics:

Similarities:

  • Both continue training from pretrained checkpoints
  • Both use adapted learning rates
  • Both preserve some original model knowledge
  • Both require careful hyperparameter selection

Differences:

  • SCP: Changes model architecture (attention masks) and training objective
  • Fine-tuning: Maintains architecture and adapts to new tasks/data
  • SCP: Requires massive continued pretraining (500B+ tokens)
  • Fine-tuning: Often uses much smaller datasets

Continual Learning

SCP shares concepts with continual learning:

Shared Principles:

  • Preventing catastrophic forgetting
  • Balancing plasticity and stability
  • Incremental knowledge acquisition
  • Learning rate adaptation strategies

Distinctions:

  • SCP: One-time architectural conversion process
  • Continual learning: Ongoing adaptation to streaming data
  • SCP: Focus on architecture transformation
  • Continual learning: Focus on continuous task adaptation

Recent Developments (2025)

RND1 Demonstration

Radical Numerics' RND1 release demonstrated SCP effectiveness:

Key Results:

  • Successfully converted Qwen3-30B-A3B to a diffusion model
  • Achieved state-of-the-art performance among open DLMs
  • Outperformed previous diffusion baselines (Dream-7B, LLaDA-8B)
  • Proved simple methods can match complex conversion pipelines

Research Contributions:

  • Systematic study of A2D conversion at scale
  • Identification of critical factors for stability and scaling
  • Complete transparency with open-source release
  • Comprehensive documentation of methodology

Industry Impact

Open Research Advancement:

  • Full code and training recipes publicly available
  • Model weights released on Hugging Face
  • Reproducible methodology for community validation
  • Foundation for further diffusion language model research

Practical Implications:

  • Lower barrier to creating large diffusion models
  • Alternative approach to AR-only language modeling
  • Enables exploration of the benefits of diffusion-based generation
  • Platform for studying parallel text generation

Future Directions

Methodology Refinements

Optimization Opportunities:

  • Automated tuning of layer-specific learning rates
  • Dynamic batch size scheduling during conversion
  • Improved stopping criteria for conversion completion
  • Faster conversion with fewer training tokens

Scaling Beyond 30B

Larger Model Conversion:

  • Application to 100B+ parameter models
  • Conversion of multimodal AR models to diffusion
  • Efficient conversion of sparse MoE architectures
  • Scaling strategies for frontier model sizes

Hybrid Approaches

Combined Techniques:

  • SCP with progressive architecture modification
  • Integration with instruction tuning and RLHF
  • Combining with other parameter-efficient methods
  • Hybrid AR-diffusion architectures

Automated Conversion

AI-Assisted Optimization:

  • Machine learning for hyperparameter selection
  • Automated monitoring and adaptation during conversion
  • Neural architecture search for optimal conversion strategies
  • Self-improving conversion systems

Academic Sources

Foundational Papers

  • "RND1: Simple, Scalable AR-to-Diffusion Conversion" - Chandrasegaran et al., Radical Numerics (2025) - Introduction of SCP methodology
  • "Locating and Editing Factual Associations in GPT" - Meng et al. (2022) - Knowledge localization in feed-forward layers
  • "Knowledge Neurons in Pretrained Transformers" - Dai et al. (2022) - Understanding knowledge encoding

Related Conversion Methods

  • "Scaling Diffusion Language Models via Adaptation from Autoregressive Models" - Gong et al. (2024) - Attention mask annealing approach
  • "Dream 7B: Diffusion Large Language Models" - Ye et al. (2025) - Alternative conversion methodology
  • "Exploring Diffusion Transformer Designs via Grafting" - Chandrasegaran et al. (2025) - Grafting-based conversion

Conclusion

Simple Continual Pretraining represents a pragmatic breakthrough in converting autoregressive models to diffusion language models. By demonstrating that straightforward methods can match or exceed complex multi-stage pipelines, SCP lowers the barrier to diffusion model research and enables more efficient exploration of alternative generation paradigms.

The success of RND1—the largest open-source diffusion language model—validates SCP as a viable approach for large-scale model conversion. The methodology's transparency, reproducibility, and proven effectiveness make it a valuable contribution to the AI research community's toolkit for exploring beyond traditional autoregressive generation.

As diffusion language models continue to evolve, SCP provides a practical foundation for researchers and practitioners seeking to leverage existing autoregressive investments while exploring the unique capabilities of diffusion-based generation. The simplicity of the approach belies its effectiveness, embodying the principle that sophisticated results don't always require sophisticated methods.


Learn more about related concepts: Diffusion Language Models, Fine-tuning, Continuous Learning, Pre-trained Models, and Transfer Learning.

Frequently Asked Questions

What is Simple Continual Pretraining (SCP)?

SCP is a straightforward method for converting autoregressive (AR) models to diffusion models by replacing the causal attention mask with bidirectional attention and continuing pretraining with the masked diffusion objective. It was introduced by Radical Numerics for creating diffusion language models.

How does SCP differ from other conversion approaches?

Unlike complex multi-stage approaches such as attention mask annealing or grafting, SCP uses a simple three-step process: start from an AR checkpoint, replace the causal mask with a bidirectional mask, and continue pretraining. This simplicity matches or exceeds the performance of more sophisticated methods.

What are the main components of SCP?

SCP has three main components: (1) starting from a strong autoregressive checkpoint, (2) replacing causal attention with bidirectional attention at initialization, and (3) continuing pretraining with the masked diffusion objective and learning rate warmup.

Why does SCP use layer-specific learning rates?

Layer-specific learning rates prevent catastrophic forgetting during conversion. Attention layers receive higher learning rates to adapt to bidirectional context, while MLP/FFN layers get lower rates to preserve factual knowledge from AR pretraining.

Which models have been built with SCP?

The most notable example is RND1-Base by Radical Numerics, a 30B parameter diffusion language model converted from Qwen3-30B-A3B. RND1 is the largest and most capable open-source diffusion language model to date.
