Definition
Simple Continual Pretraining (SCP) is a straightforward methodology for converting autoregressive (AR) language models into diffusion language models through continued training. Introduced by Radical Numerics in their RND1 research, SCP demonstrates that simple techniques can effectively transform pretrained AR models into diffusion models without complex multi-stage pipelines.
The approach preserves the vast knowledge acquired during autoregressive pretraining on trillions of tokens while introducing bidirectional context processing and diffusion-based generation capabilities. SCP represents a practical solution to the challenge of training large-scale diffusion language models (DLMs) by leveraging existing AR infrastructure and expertise.
How It Works
Simple Continual Pretraining follows a remarkably straightforward three-step process that contrasts with more complex conversion approaches (a minimal code sketch follows the steps below):
The SCP Process
1. Start from Strong AR Checkpoint:
- Begin with a well-trained autoregressive model (e.g., Qwen3-30B-A3B)
- Leverage knowledge from trillions of tokens of pretraining
- Utilize mature AR training infrastructure and stability
- Preserve factual associations and linguistic understanding
2. Replace Attention Mask:
- Swap causal (left-to-right) attention mask with bidirectional mask
- Enable the model to attend to tokens in both directions
- Initialize with bidirectional attention at the start of conversion
- No gradual transition or annealing required
3. Continue Pretraining:
- Train with masked diffusion objective
- Use learning rate warmup for training stability
- Continue for a substantial token count (e.g., 500B tokens for RND1)
- Monitor convergence to full diffusion behavior
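The sketch below illustrates steps 2 and 3 in PyTorch. It is a minimal, hypothetical example, not RND1's actual training code: it assumes the model's attention modules expose an `is_causal` flag and that the forward pass returns Hugging Face-style outputs with a `.logits` field.

```python
import torch
import torch.nn.functional as F

def make_bidirectional(model):
    """Step 2: swap the causal mask for full bidirectional attention.

    Assumes attention modules expose an `is_causal` flag; in practice the
    mask is patched inside the model's attention implementation.
    """
    for module in model.modules():
        if hasattr(module, "is_causal"):
            module.is_causal = False
    return model

def masked_diffusion_step(model, input_ids, mask_token_id, optimizer):
    """Step 3: one continued-pretraining step with a masked diffusion objective."""
    batch, seq_len = input_ids.shape
    # Sample a per-sequence masking rate, then mask tokens at that rate
    # (assumes at least one position ends up masked in the batch).
    rates = torch.rand(batch, 1, device=input_ids.device)
    mask = torch.rand(batch, seq_len, device=input_ids.device) < rates
    noisy_ids = input_ids.masked_fill(mask, mask_token_id)

    # Bidirectional forward pass; HF-style `.logits` output assumed.
    logits = model(noisy_ids).logits

    # Loss only on masked positions (unmasked tokens carry no learning signal).
    # Some masked-diffusion formulations also reweight by 1/rate; omitted here.
    loss = F.cross_entropy(logits[mask], input_ids[mask])
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```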
Key Innovation: Simplicity
Traditional A2D (autoregressive-to-diffusion) conversion methods employ complex approaches:
Complex Methods (Not Required):
- Attention mask annealing: Gradually relaxing causal mask over multiple stages
- Grafting: Systematically editing model architectures layer by layer
- Multi-stage pipelines: Multiple conversion phases with complex scheduling
- Transition policies: Sophisticated rules for mask progression
SCP Approach (Proven Effective):
- Single-stage conversion with immediate bidirectional attention
- No complex scheduling or transition rules
- Straightforward implementation and reproducibility
- Matches or exceeds performance of sophisticated methods
Key Concepts
Layer-Specific Learning Rates
A critical component of SCP is using different learning rates for different parameter groups to prevent catastrophic forgetting (an optimizer sketch follows the rationale below):
Attention Layers:
- Higher learning rates (e.g., 2-5x base rate)
- Need to adapt to bidirectional context processing
- Must learn new attention patterns
- Critical for diffusion behavior
MLP/FFN Layers:
- Lower learning rates (e.g., 0.2-0.5x base rate)
- Preserve factual knowledge encoded during AR pretraining
- Maintain linguistic associations and world knowledge
- Based on research showing factual knowledge resides in feed-forward layers
Rationale:
- Research shows factual associations in Transformer models are primarily encoded in FFN/MLP layers
- Attention layers need more plasticity to adapt to bidirectional processing
- This balance preserves AR knowledge while enabling diffusion capabilities
- Prevents catastrophic forgetting of pretrained knowledge
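A minimal sketch of how layer-specific learning rates can be wired up with PyTorch parameter groups. The substring matching and the exact multipliers are illustrative assumptions chosen within the ranges above, not RND1's published settings, and the name patterns depend on the target architecture.

```python
import torch

def build_scp_optimizer(model, base_lr=1e-4, attn_mult=3.0, mlp_mult=0.3):
    """Group parameters so attention adapts quickly while MLP/FFN layers move slowly."""
    attn, mlp, other = [], [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if "attn" in name or "attention" in name:
            attn.append(param)    # higher LR: must learn bidirectional attention patterns
        elif "mlp" in name or "ffn" in name:
            mlp.append(param)     # lower LR: protects factual knowledge stored in FFN weights
        else:
            other.append(param)   # embeddings, norms, routers, etc. at the base rate

    return torch.optim.AdamW([
        {"params": attn,  "lr": base_lr * attn_mult},   # e.g., 3x base
        {"params": mlp,   "lr": base_lr * mlp_mult},    # e.g., 0.3x base
        {"params": other, "lr": base_lr},
    ])
```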
Batch Size Optimization
SCP research revealed important insights about batch sizing for diffusion training (a toy calculation follows the lists below):
The Challenge:
- AR models: Every token contributes to the loss
- Diffusion models: Only masked positions (~50% on average) contribute to learning
- Traditional AR batch size heuristics don't transfer to diffusion
SCP Discovery:
- Diffusion models benefit from larger batch sizes than AR models
- Critical batch size (CBS) for DLMs is higher than for AR training
- Empirical testing showed the loss still improving as batch size grew to 8M tokens at the 4B-parameter scale
- Larger batches provide more learning signal from masked tokens
Practical Implications:
- Use larger effective batch sizes than typical AR training
- Scale batch size with model size for optimal efficiency
- Consider the reduced supervision signal from masking
- Important for large-scale diffusion model training
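The toy calculation below makes the supervision argument concrete. The scaling rule is a rough back-of-the-envelope heuristic, not the empirical critical-batch-size sweep used for RND1, so actual batch sizes should still be tuned per model scale.

```python
def effective_supervised_tokens(batch_tokens: int, avg_mask_rate: float = 0.5) -> int:
    """Tokens per batch that actually contribute to the masked diffusion loss."""
    return int(batch_tokens * avg_mask_rate)

def suggested_diffusion_batch(ar_batch_tokens: int, avg_mask_rate: float = 0.5) -> int:
    """Grow the batch so masked positions alone match an AR batch's supervision.

    Rough heuristic only; the critical batch size should still be measured
    empirically (for RND1, loss kept improving up to ~8M tokens at the 4B scale).
    """
    return int(ar_batch_tokens / avg_mask_rate)

# Example: an AR recipe using 4M tokens per batch suggests trying ~8M for diffusion.
print(effective_supervised_tokens(4_000_000))   # 2000000 tokens carry loss
print(suggested_diffusion_batch(4_000_000))     # 8000000 tokens per batch
```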
Knowledge Preservation
SCP explicitly addresses the challenge of maintaining AR pretraining knowledge:
Catastrophic Forgetting Problem:
- Converting the architecture and training objective can overwrite knowledge learned during pretraining
- Factual associations may be lost during adaptation
- General capabilities can degrade with aggressive fine-tuning
SCP Solutions:
- Layer-specific learning rates protect knowledge-bearing parameters
- Starting from strong checkpoint provides robust foundation
- Gradual learning rate warmup prevents sudden weight changes (see the warmup sketch after this list)
- Continued training on large token counts ensures stable adaptation
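A minimal warmup scheduler sketch in PyTorch, assuming an optimizer such as the one built in the learning-rate example above; the warmup length and any subsequent decay are assumptions, not RND1's published schedule.

```python
from torch.optim.lr_scheduler import LambdaLR

def linear_warmup(optimizer, warmup_steps: int = 2000):
    """Ramp every parameter group linearly from 0 to its target LR, then hold."""
    def lr_lambda(step: int) -> float:
        return min(1.0, (step + 1) / warmup_steps)
    return LambdaLR(optimizer, lr_lambda)

# Usage: scheduler = linear_warmup(optimizer); call scheduler.step() after each optimizer.step().
```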
Real-World Applications
Model Conversion Research
SCP provides a practical framework for A2D conversion:
Research Benefits:
- Reproducible methodology for converting AR to diffusion models
- Lower barrier to entry for diffusion language model research
- Enables systematic study of conversion factors
- Foundation for understanding diffusion model training
Practical Implementation:
- Clear, documented procedure for conversion
- Open-source code and training recipes available
- Successful demonstration at 30B parameter scale (RND1)
- Applicable to various AR model architectures
Large-Scale DLM Development
SCP enables creation of large diffusion language models:
RND1 Success:
- 30B-parameter Mixture-of-Experts model with roughly 3B active parameters per token
- Largest and most capable open-source diffusion language model at the time of release
- State-of-the-art performance among open DLMs
- Demonstrates A2D viability at scale
Efficiency Advantages:
- Leverages existing AR pretraining compute
- Avoids training diffusion models from scratch
- Utilizes mature AR training infrastructure
- More efficient use of computational resources
Alternative Generation Paradigms
SCP-converted models offer different generation capabilities (a minimal sampler sketch follows the list):
Diffusion Benefits:
- Parallel token generation instead of sequential
- Bidirectional context understanding
- Iterative refinement capabilities
- Alternative text generation dynamics for use cases where strictly left-to-right decoding is limiting
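The sketch below shows what parallel, iterative generation can look like for a masked-diffusion model (batch size 1). It is illustrative only: the confidence-based commit rule is one common choice among several, the HF-style `.logits` output is an assumption, and RND1's actual sampler may differ.

```python
import torch

@torch.no_grad()
def diffusion_generate(model, prompt_ids, gen_len, mask_token_id, steps=16):
    """Iteratively unmask a fully masked continuation, committing tokens in parallel."""
    device = prompt_ids.device
    prompt_len = prompt_ids.shape[1]
    seq = torch.cat(
        [prompt_ids, torch.full((1, gen_len), mask_token_id, device=device)], dim=1
    )
    gen = seq[:, prompt_len:]  # view into the continuation region

    for step in range(steps):
        still_masked = gen == mask_token_id
        remaining = int(still_masked.sum())
        if remaining == 0:
            break

        # One bidirectional pass over the whole sequence.
        logits = model(seq).logits[:, prompt_len:]
        confidence, prediction = logits.softmax(-1).max(-1)
        confidence = confidence.masked_fill(~still_masked, -1.0)

        # Commit an even share of the remaining masked positions, most confident first.
        k = max(1, remaining // (steps - step))
        chosen = confidence.topk(k, dim=-1).indices
        gen.scatter_(1, chosen, prediction.gather(1, chosen))

    return seq
```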
Challenges and Considerations
Training Complexity
Hyperparameter Tuning:
- Determining optimal layer-specific learning rate ratios
- Finding appropriate batch sizes for different model scales
- Balancing training stability with adaptation speed
- Monitoring convergence to full diffusion behavior during training
Resource Requirements:
- Substantial continued pretraining (e.g., 500B tokens)
- Large batch sizes require significant GPU memory
- Extended training time for full conversion
- Computational costs for large-scale models
Conversion Quality
Preservation vs. Adaptation Trade-off:
- Balancing knowledge retention with new capabilities
- Ensuring bidirectional attention fully develops
- Maintaining performance on downstream tasks
- Avoiding degradation of general capabilities
Validation Challenges:
- Measuring successful conversion to diffusion behavior
- Comparing with original AR model performance
- Evaluating diffusion-specific capabilities
- Benchmarking against other DLMs
Model-Specific Factors
Architecture Considerations:
- Some architectures may convert more easily than others
- Mixture-of-Experts models have unique considerations
- Model size affects optimal conversion parameters
- Attention mechanism design influences conversion success
Related Techniques
Fine-tuning Comparison
While related to fine-tuning, SCP has distinct characteristics:
Similarities:
- Both continue training from pretrained checkpoints
- Both use adapted learning rates
- Both preserve some original model knowledge
- Both require careful hyperparameter selection
Differences:
- SCP: Changes model architecture (attention masks) and training objective
- Fine-tuning: Maintains architecture and adapts to new tasks/data
- SCP: Requires massive continued pretraining (500B+ tokens)
- Fine-tuning: Often uses much smaller datasets
Continual Learning
SCP shares concepts with continual learning:
Shared Principles:
- Preventing catastrophic forgetting
- Balancing plasticity and stability
- Incremental knowledge acquisition
- Learning rate adaptation strategies
Distinctions:
- SCP: One-time architectural conversion process
- Continual learning: Ongoing adaptation to streaming data
- SCP: Focus on architecture transformation
- Continual learning: Focus on continuous task adaptation
Recent Developments (2025)
RND1 Demonstration
Radical Numerics' RND1 release demonstrated SCP effectiveness:
Key Results:
- Successfully converted Qwen3-30B-A3B into a diffusion model
- Achieved state-of-the-art performance among open DLMs
- Outperformed previous diffusion baselines (Dream-7B, LLaDA-8B)
- Proved simple methods can match complex conversion pipelines
Research Contributions:
- Systematic study of A2D conversion at scale
- Identification of critical factors for stability and scaling
- Complete transparency with open-source release
- Comprehensive documentation of methodology
Industry Impact
Open Research Advancement:
- Full code and training recipes publicly available
- Model weights released on Hugging Face
- Reproducible methodology for community validation
- Foundation for further diffusion language model research
Practical Implications:
- Lower barrier to creating large diffusion models
- Alternative approach to AR-only language modeling
- Enables exploration of the benefits of diffusion-based generation
- Platform for studying parallel text generation
Future Directions
Methodology Refinements
Optimization Opportunities:
- Automated tuning of layer-specific learning rates
- Dynamic batch size scheduling during conversion
- Improved stopping criteria for conversion completion
- Faster conversion with fewer training tokens
Scaling Beyond 30B
Larger Model Conversion:
- Application to 100B+ parameter models
- Conversion of multimodal AR models to diffusion
- Efficient conversion of sparse MoE architectures
- Scaling strategies for frontier model sizes
Hybrid Approaches
Combined Techniques:
- SCP with progressive architecture modification
- Integration with instruction tuning and RLHF
- Combining with other parameter-efficient methods
- Hybrid AR-diffusion architectures
Automated Conversion
AI-Assisted Optimization:
- Machine learning for hyperparameter selection
- Automated monitoring and adaptation during conversion
- Neural architecture search for optimal conversion strategies
- Self-improving conversion systems
Academic Sources
Foundational Papers
- "RND1: Simple, Scalable AR-to-Diffusion Conversion" - Chandrasegaran et al., Radical Numerics (2025) - Introduction of SCP methodology
- "Locating and Editing Factual Associations in GPT" - Meng et al. (2022) - Knowledge localization in feed-forward layers
- "Knowledge Neurons in Pretrained Transformers" - Dai et al. (2022) - Understanding knowledge encoding
Related Conversion Methods
- "Scaling Diffusion Language Models via Adaptation from Autoregressive Models" - Gong et al. (2024) - Attention mask annealing approach
- "Dream 7B: Diffusion Large Language Models" - Ye et al. (2025) - Alternative conversion methodology
- "Exploring Diffusion Transformer Designs via Grafting" - Chandrasegaran et al. (2025) - Grafting-based conversion
Training Optimization
- "Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training" - Merrill et al. (2025) - Batch size optimization
- "An Empirical Model of Large-Batch Training" - McCandlish et al. (2018) - Critical batch size concepts
Conclusion
Simple Continual Pretraining represents a pragmatic breakthrough in converting autoregressive models to diffusion language models. By demonstrating that straightforward methods can match or exceed complex multi-stage pipelines, SCP lowers the barrier to diffusion model research and enables more efficient exploration of alternative generation paradigms.
The success of RND1—the largest open-source diffusion language model—validates SCP as a viable approach for large-scale model conversion. The methodology's transparency, reproducibility, and proven effectiveness make it a valuable contribution to the AI research community's toolkit for exploring beyond traditional autoregressive generation.
As diffusion language models continue to evolve, SCP provides a practical foundation for researchers and practitioners seeking to leverage existing autoregressive investments while exploring the unique capabilities of diffusion-based generation. The simplicity of the approach belies its effectiveness, embodying the principle that sophisticated results don't always require sophisticated methods.
Learn more about related concepts: Diffusion Language Models, Fine-tuning, Continual Learning, Pre-trained Models, and Transfer Learning.