RND1: Largest Open Diffusion Language Model Released

Radical Numerics introduces RND1-Base, a 30B-parameter sparse Mixture-of-Experts diffusion language model with 3B active parameters, converted from a pretrained autoregressive model and released as the largest open diffusion LM to date.

by HowAIWorks Team
ai, diffusion-models, rnd1, radical-numerics, ai-models, open-source, language-models, machine-learning, model-conversion, artificial-intelligence

Introduction

In October 2025, Radical Numerics announced the release of RND1-Base (Radical Numerics Diffusion), marking a significant breakthrough in diffusion language model (DLM) development. This experimental 30B-parameter sparse Mixture-of-Experts model with 3B active parameters represents the largest and most capable open-source diffusion language model to date.

The model was created through a novel autoregressive-to-diffusion (A2D) conversion approach: starting from a pretrained AR model (Qwen3-30B-A3B), it was continually pretrained on 500B tokens to achieve full diffusion behavior. Radical Numerics has open-sourced the model alongside their training recipe, inference code, and sample outputs, advancing the field of AI research through transparency and collaboration.

What Are Diffusion Language Models?

Diffusion Language Models (DLMs) represent an alternative paradigm to traditional autoregressive language models. While autoregressive models like GPT generate text sequentially from left to right, DLMs use a diffusion process with several distinguishing characteristics (illustrated in the toy sketch after this list):

  • Parallel generation: Unlike sequential AR decoding, DLMs can generate multiple tokens simultaneously
  • Bidirectional context: Information is processed from both directions rather than only through causal (left-to-right) attention
  • Masked prediction: Training predicts masked tokens across entire sequences
  • Different generation dynamics: Alternative approaches to text generation with potentially unique capabilities
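
To make the idea concrete, here is a minimal, self-contained toy sketch of masked-diffusion-style generation. It is not RND1's actual decoder: a random stub stands in for a real bidirectional Transformer, and all names and values are illustrative. The point is the control flow, which unmasks several positions per step using context from both sides.

```python
import random

# Toy masked-diffusion-style generation (illustrative only, not RND1's decoder).
# Start from a fully masked sequence and unmask several positions per step,
# using bidirectional context. A real DLM would score tokens with a Transformer;
# here a random stub keeps the sketch self-contained.

MASK = "<mask>"
VOCAB = ["the", "cat", "sat", "on", "a", "mat"]

def toy_model(sequence):
    """Stand-in for a bidirectional model: propose (token, confidence) per masked slot."""
    return {i: (random.choice(VOCAB), random.random())
            for i, tok in enumerate(sequence) if tok == MASK}

def generate(length=6, num_steps=3):
    seq = [MASK] * length
    for step in range(num_steps):
        proposals = toy_model(seq)
        if not proposals:
            break
        # Unmask the most confident positions *in parallel*, unlike an AR model
        # that commits to exactly one next token per forward pass.
        k = max(1, len(proposals) // (num_steps - step))
        for i, (tok, _) in sorted(proposals.items(), key=lambda kv: -kv[1][1])[:k]:
            seq[i] = tok
        print(f"step {step}: {' '.join(seq)}")
    return seq

generate()
```

Each pass fills in a batch of masked slots at once, whereas an AR decoder would produce exactly one new token per forward pass.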

However, training DLMs from scratch has historically been challenging due to scaling inefficiencies compared to autoregressive models. DLMs require more passes over finite datasets to surpass AR training performance, and the mature infrastructure for AR models provides significant advantages.

RND1 Key Features

Large-Scale Architecture

RND1-Base introduces impressive technical specifications:

  • 30B total parameters: Sparse Mixture-of-Experts architecture for efficient scaling
  • 3B active parameters: Only about 10% of parameters are active per token (see the routing sketch after this list)
  • Qwen3-30B-A3B foundation: Built from a strong pretrained autoregressive checkpoint
  • 500B token training: Extensive continual pretraining for full diffusion behavior
  • Open weights: Fully available on Hugging Face for community use and research
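
To make the sparse-activation point concrete, below is a minimal NumPy sketch of top-k expert routing, the mechanism behind a Mixture-of-Experts layer. The sizes (8 experts, top-2 routing) and the random weights are illustrative assumptions, not RND1's published configuration; what matters is that only the selected experts' matrices participate in each token's forward pass.

```python
import numpy as np

# Illustrative top-2 routing in a sparse Mixture-of-Experts layer (toy sizes,
# not RND1's published configuration). Each token is sent to only 2 of the
# 8 experts, so only a fraction of the layer's parameters is used per token.

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

router_w = rng.normal(size=(d_model, n_experts))           # router projection
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x):
    """x: (d_model,) hidden state for one token."""
    logits = x @ router_w                                   # score every expert
    chosen = np.argsort(logits)[-top_k:]                    # keep the top-k experts
    weights = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()
    # Only the chosen experts' weight matrices are ever multiplied:
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

out = moe_layer(rng.normal(size=d_model))
print(f"output shape: {out.shape}, expert parameters touched: {top_k / n_experts:.0%}")
```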

Simple Continual Pretraining (SCP)

Radical Numerics developed a remarkably straightforward approach to A2D conversion that matches or exceeds complex multi-stage pipelines. Simple Continual Pretraining (SCP) represents a breakthrough in model conversion methodology:

Key Innovation: "Simple methods can match complex conversion pipelines" - This finding could accelerate diffusion model research across the industry by lowering technical barriers and reducing development time.

SCP vs. Traditional A2D Approaches

| Aspect | Traditional A2D Methods | Simple Continual Pretraining |
|---|---|---|
| Conversion stages | Multi-stage (3-5 phases) | Single-stage |
| Complexity | High (complex scheduling) | Low (straightforward) |
| Implementation time | Weeks to months | Days to weeks |
| Mask transition | Gradual annealing | Immediate replacement |
| Architecture editing | Systematic grafting | Minimal changes |
| Reproducibility | Difficult (many parameters) | Easy (simple recipe) |
| Performance | State-of-the-art | Matches or exceeds |

Traditional Approaches (Complex):

  • Attention mask annealing: Gradually relaxes causal mask to enable bidirectionality
  • Grafting: Systematically edits architectures to swap attention mechanisms
  • Multi-stage pipelines: Multiple conversion phases with complex scheduling

Simple Continual Pretraining (RND1's Approach):

  1. Start from a strong AR checkpoint
  2. Replace the causal mask with a bidirectional mask at initialization
  3. Continue pretraining under the masked diffusion objective with learning rate warmup

This simpler method proves that sophisticated conversion recipes aren't always necessary, making DLM development more accessible and scalable.
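
The recipe is simple enough to sketch end to end. Below is a toy, self-contained PyTorch illustration of the three steps, assuming a masked diffusion objective in which a random subset of tokens is hidden and predicted from bidirectional context. The tiny model, hyperparameters, and fake data are stand-ins for illustration, not Radical Numerics' released training code.

```python
import torch
import torch.nn as nn

# Schematic of Simple Continual Pretraining (SCP). The tiny model stands in for
# the pretrained AR checkpoint; all sizes and values are illustrative.

VOCAB, D_MODEL, MASK_ID = 100, 32, 0

class TinyLM(nn.Module):
    """Stand-in for the pretrained AR Transformer."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB)
        self.causal = True                      # AR models use a causal mask

    def forward(self, tokens):
        seq_len = tokens.size(1)
        # SCP step 2: when self.causal is False, no mask is applied and every
        # position attends to the full (bidirectional) context.
        attn_mask = (nn.Transformer.generate_square_subsequent_mask(seq_len)
                     if self.causal else None)
        return self.head(self.encoder(self.embed(tokens), mask=attn_mask))

def masked_diffusion_loss(model, tokens, mask_prob=0.5):
    """Mask a subset of positions and train the model to recover them.
    (The full objective samples the masking rate rather than fixing it.)"""
    mask = torch.rand(tokens.shape) < mask_prob
    corrupted = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    logits = model(corrupted)
    # Supervision comes only from the masked positions.
    return nn.functional.cross_entropy(logits[mask], tokens[mask])

# 1. Start from a strong AR checkpoint (here, just an untrained stand-in).
model = TinyLM()
# 2. Replace the causal mask with a bidirectional mask at initialization.
model.causal = False
# 3. Continue pretraining under the masked diffusion objective with LR warmup.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=100)
for step in range(3):                           # a few toy steps
    tokens = torch.randint(1, VOCAB, (8, 16))   # fake batch of token ids
    loss = masked_diffusion_loss(model, tokens)
    loss.backward(); optimizer.step(); warmup.step(); optimizer.zero_grad()
    print(f"step {step}: loss {loss.item():.3f}")
```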

Layer-Specific Learning Rates

To address catastrophic forgetting—where A2D conversion might overwrite factual knowledge from AR pretraining—RND1 employs variable learning rates across parameter groups:

  • Attention layers: Higher learning rates to promote adaptation to bidirectional context
  • MLP/FFN layers: Lower learning rates to retain factual knowledge encoded during AR pretraining
  • Knowledge preservation: Based on research showing that factual associations in Transformer-based LMs are primarily encoded in FFN/MLP layers

This targeted approach preserves the vast linguistic and factual knowledge from trillions of tokens of pretraining while enabling bidirectional understanding.
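
In PyTorch-style training code, this split is typically expressed through optimizer parameter groups. The sketch below shows one plausible way to set it up; the name-matching rule and the learning-rate values are illustrative assumptions, not RND1's exact recipe.

```python
import torch

# Illustrative layer-specific learning rates via optimizer parameter groups.
# The name-matching rule and the values are assumptions for this sketch.

def build_param_groups(model, attn_lr=3e-4, mlp_lr=3e-5):
    attn_params, mlp_params, other_params = [], [], []
    for name, param in model.named_parameters():
        if "attn" in name or "attention" in name:
            attn_params.append(param)    # adapt quickly to bidirectional context
        elif "mlp" in name or "ffn" in name:
            mlp_params.append(param)     # change slowly to preserve factual knowledge
        else:
            other_params.append(param)
    return [
        {"params": attn_params, "lr": attn_lr},
        {"params": mlp_params, "lr": mlp_lr},
        {"params": other_params, "lr": mlp_lr},
    ]

# Usage with any nn.Module, e.g. the TinyLM stand-in from the SCP sketch above:
# optimizer = torch.optim.AdamW(build_param_groups(model))
```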

Optimized for Large Batch Sizes

A critical discovery in RND1's development is that diffusion training thrives with larger batch sizes than traditional AR training:

The Challenge:

  • AR models: Every token contributes to the loss (100% supervision)
  • Diffusion models: Only masked positions contribute to learning (~50% supervision on average)
  • Result: Standard AR batch size heuristics don't transfer to diffusion training

Research Insight: Diffusion models require approximately 2x larger batch sizes than AR models to achieve equivalent learning signal density due to the reduced supervision from masking.
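
A quick back-of-the-envelope calculation shows where a figure like 2x comes from. Only the ~50% average masking rate is taken from the discussion above; the batch size and sequence length below are illustrative.

```python
# Back-of-the-envelope: supervised tokens per optimizer step.
# Numbers are illustrative; only the ~50% masking rate comes from the text above.

seq_len = 4096
mask_rate = 0.5                  # fraction of positions masked on average

ar_batch = 512                   # sequences per step for an AR model
ar_supervised = ar_batch * seq_len                    # every token contributes

diff_batch = 1024                # ~2x larger batch for the diffusion model
diff_supervised = diff_batch * seq_len * mask_rate    # only masked positions contribute

print(f"AR supervised tokens/step:        {ar_supervised:,}")
print(f"Diffusion supervised tokens/step: {int(diff_supervised):,}")
# Both come out to ~2.1M, which is why diffusion training needs roughly
# 1/mask_rate (~2x) the AR batch size for a comparable learning signal.
```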

RND1's Solution:

  • Empirical testing: Critical batch size (CBS) estimation through branched training experiments
  • Key finding: Diffusion loss keeps decreasing as batch size grows to 8M tokens at the 4B-parameter scale
  • Scaling benefit: Demonstrates that DLMs benefit from large batch sizes during continual pretraining
  • Efficiency signal: Encouraging indication for large-scale training viability

Performance Benchmarks

State-of-the-Art Among Open DLMs

RND1 establishes new performance standards across multiple evaluation frameworks:

Reasoning & Commonsense:

  • MMLU: Strong performance on massive multitask language understanding
  • ARC-C: Competitive results on challenging science questions
  • RACE: High accuracy on reading comprehension tasks
  • BBH: Solid performance on Big-Bench Hard reasoning tasks

STEM & Mathematics:

  • GSM8K: Competitive mathematical reasoning capabilities
  • AIME: Strong results on advanced mathematics problems

Code Generation:

  • MBPP: Effective code generation on Mostly Basic Python Problems

Overall Performance:

  • Consistently outperforms prior open diffusion baselines (Dream-7B and LLaDA-8B)
  • Preserves strong performance from autoregressive foundation
  • Demonstrates that scaling DLMs beyond 8B parameters is practical
  • First open-source effort to demonstrate diffusion model training at this scale

Comparison with Baseline Models

RND1's performance relative to existing open-source diffusion models:

| Model | Parameters | Type | Performance Level |
|---|---|---|---|
| RND1-Base | 30B (3B active) | Diffusion MoE | State-of-the-art |
| Dream-7B | 7B | Diffusion | Baseline |
| LLaDA-8B | 8B | Diffusion | Baseline |
| Qwen3-30B-A3B | 30B (3B active) | Autoregressive | Foundation model |

The results demonstrate that A2D conversion preserves the strong capabilities of the autoregressive foundation while introducing bidirectional context and parallel generation capabilities.

Technical Innovation: A2D Conversion

The Model Conversion Paradigm

Radical Numerics positions RND1 as part of their broader model conversion research direction:

Core Philosophy:

  • Optimize models at the level of architecture and training objectives
  • Avoid rebuilding entire systems from scratch
  • Enable faster iteration on models
  • Adapt models to specific workflows, hardware, and downstream tasks

Advantages Over Training From Scratch:

  • Leverages mature AR training infrastructure and expertise
  • Preserves knowledge from trillions of tokens of pretraining
  • More efficient use of computational resources
  • Faster development and experimentation cycles

Key Research Contributions

Radical Numerics' work on RND1 advances the field through:

  1. Systematic A2D study at scale: Comprehensive investigation of initialization, layer-specific learning rates, and critical batch size
  2. Scalability factors: Identification of elements enabling stability and scaling when combined with AR pretraining methodologies
  3. Largest base DLM: Demonstration that principled AR-to-diffusion conversion produces high performance across benchmarks
  4. Open research: Transparent sharing of training recipes, code, and model weights

Why Model Customization Matters

Efficient AI Research

RND1 exemplifies a new approach to AI development:

  • No starting from scratch: Building on existing models rather than complete retraining
  • Faster exploration: Testing new architectures and training paradigms efficiently
  • Resource efficiency: Better use of computational resources and existing knowledge
  • Bolder experimentation: Lower barriers to testing innovative ideas

Recursive Self-Improvement Vision

Radical Numerics positions RND1 within their larger mission:

Automated AI Research Platform:

  • Enable recursive self-improvement in AI systems
  • Allow AI systems to help design and optimize next-generation AI
  • Automate experimentation loops for faster search space traversal
  • Test more ambitious ideas through systematic exploration

Open Research Philosophy:

  • Sharing models, recipes, and insights multiplies progress
  • Collaborative development accelerates the entire field
  • Democratizes access to advanced AI capabilities
  • Makes customized intelligence for every domain accessible

Industry Impact and Applications

Research Implications

RND1's release has significant implications for AI research:

Diffusion Model Viability:

  • Proves that large-scale DLMs (30B+) are feasible and practical
  • Demonstrates A2D conversion as viable alternative to training from scratch
  • Shows that simple methods can match complex conversion pipelines
  • Validates model conversion as effective research strategy

Open Science Advancement:

  • Complete transparency with training recipes and code
  • All test trajectories published for reproducibility
  • Community access enables validation and further research
  • Sets standard for open AI research practices

Practical Applications

Potential use cases for RND1 and similar diffusion models:

Alternative Generation Paradigms:

  • Parallel text editing: Simultaneous refinement of multiple text segments in document editing applications
  • Multi-document synthesis: Generating summaries by processing multiple sources bidirectionally
  • Interactive writing: Real-time text refinement where users can edit any part and regenerate contextually
  • Iterative content improvement: Progressive refinement of drafts through multiple denoising passes
  • Constrained generation: Generating text that must satisfy multiple bidirectional constraints simultaneously

Research Platform:

  • DLM architecture studies: Foundation for exploring different diffusion model architectures and training approaches
  • Domain adaptation: Base model for fine-tuning on specialized domains (medical, legal, scientific)
  • Generation paradigm research: Comparing autoregressive vs. diffusion approaches for different task types
  • Model conversion techniques: Reference implementation demonstrating A2D conversion at scale
  • Hybrid model development: Starting point for creating models combining AR and diffusion capabilities

Efficient Inference:

  • Cost-effective deployment: Sparse MoE with only 3B of 30B parameters active per token cuts per-token compute by roughly 90% relative to a dense model of the same size
  • Edge device potential: Smaller active parameter count enables deployment on resource-constrained hardware
  • Batch processing optimization: Parallel generation allows efficient processing of multiple sequences
  • Flexible quality-speed tradeoffs: Adjust the number of denoising iterations to balance quality and latency (see the short sketch after this list)
  • Specialized hardware utilization: Bidirectional attention patterns map well to certain accelerator architectures
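
As a usage-style illustration of the quality-speed point, the toy generate() function from the DLM sketch earlier in this article exposes the tradeoff directly through its step count:

```python
# Fewer denoising steps unmask more positions per pass (faster, coarser output);
# more steps commit to fewer tokens at a time (slower, finer refinement).
fast_draft = generate(length=6, num_steps=2)      # two parallel passes
careful_output = generate(length=6, num_steps=6)  # close to one token per step
```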

Future Directions

Model Conversion as Research Paradigm

RND1 represents the future of efficient AI research:

Automated Experimentation:

  • Systematic exploration of architecture and training space
  • Faster iteration on model improvements
  • More ambitious experimental designs
  • AI-assisted AI development

Recursive Improvement:

  • Models helping to design better models
  • Automated optimization of training procedures
  • Self-improving research systems
  • Accelerated progress through automation

Scaling Beyond 30B

RND1's success at 30B parameters suggests further scaling possibilities:

  • Larger diffusion models: Path to 100B+ parameter DLMs
  • Improved efficiency: Better batch size utilization and training dynamics
  • Enhanced capabilities: Potential for frontier-level diffusion model performance
  • Practical deployment: MoE architecture enables efficient scaling

Accessing RND1

Open Source Availability

Radical Numerics has made RND1 fully accessible:

Available Resources:

  • Model weights: Available on the Hugging Face Hub (an illustrative loading snippet follows this list)
  • Training code: Complete training recipe and implementation
  • Inference code: Tools for using RND1 in applications
  • Sample outputs: Example generations for evaluation
  • Technical report: Comprehensive documentation of methodology
  • Test trajectories: All evaluation results for reproducibility
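
For orientation, loading the weights should look roughly like a standard Hugging Face workflow. The repository identifier below is a placeholder and trust_remote_code may or may not be required; check the official model card for the actual instructions.

```python
from transformers import AutoModel, AutoTokenizer

# Hypothetical repository id; consult the official RND1 model card for the real one.
repo_id = "radicalnumerics/RND1-Base"

# trust_remote_code is commonly needed for models with custom architectures.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)

print(f"Loaded {sum(p.numel() for p in model.parameters()) / 1e9:.1f}B parameters")
```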

Community Benefits:

  • Free access for research and experimentation
  • Reproducible results and transparent methodology
  • Foundation for further model development
  • Educational resource for understanding DLMs

Conclusion

Bottom Line: RND1-Base proves that simple, well-designed methods can achieve breakthrough results - sometimes the most elegant solution is the simplest one.

RND1-Base represents a significant milestone in artificial intelligence research, demonstrating that diffusion language models can scale effectively to 30B parameters and beyond. By combining simple continual pretraining with layer-specific learning rates and optimized batch sizing, Radical Numerics has created the most capable open-source diffusion language model to date.

Key Takeaways:

  • Scale breakthrough: The 30B-parameter model with 3B active parameters is the largest open DLM
  • Simple approach: SCP method matches complex multi-stage conversion pipelines
  • Knowledge preservation: Layer-specific learning rates retain AR pretraining knowledge
  • Batch size optimization: DLMs benefit from larger batches than AR models
  • State-of-the-art performance: Outperforms existing open diffusion baselines across benchmarks
  • Open research: Complete transparency with weights, code, and training recipes
  • Model conversion paradigm: Demonstrates efficiency of adapting existing models over training from scratch

This development highlights that alternative AI architectures like diffusion models are becoming increasingly viable for large-scale language modeling. The combination of model conversion techniques with transparent research practices positions RND1 as both a powerful research tool and a stepping stone toward more advanced AI systems.

Radical Numerics' vision of automated AI research platforms and recursive self-improvement represents an exciting future where AI systems help design and optimize the next generation of AI, accelerating progress across the field through collaborative, open development.

Sources

  • Radical Numerics - RND1 Technical Report - Official announcement and technical details
  • RND1 Code Repository - Training recipe and inference code (to be released on GitHub)
  • RND1 Hugging Face Model Weights - Model weights for research and experimentation (to be released on Hugging Face)

Want to learn more about AI models and their capabilities? Explore our AI models catalog, check out our AI fundamentals courses, or browse our glossary of AI terms for deeper understanding. Learn more about Diffusion Language Models, Simple Continual Pretraining, Mixture-of-Experts architecture, and Foundation Models.

Frequently Asked Questions

What is RND1-Base?
RND1-Base is a 30B-parameter sparse Mixture-of-Experts diffusion language model (DLM) with 3B active parameters. It is the largest and most capable open-source diffusion model to date, converted from Qwen3-30B-A3B through simple continual pretraining.

How do diffusion language models differ from autoregressive models?
Unlike autoregressive (AR) models that generate text sequentially left-to-right, diffusion language models support parallel generation and bidirectional context. This allows for potentially faster inference and different generation capabilities while maintaining quality.

What is A2D conversion?
A2D (autoregressive-to-diffusion) conversion is a process that transforms pretrained autoregressive models into diffusion models through continued training. This preserves the knowledge from trillions of tokens of AR pretraining while introducing bidirectional context.

What is Simple Continual Pretraining (SCP)?
SCP is Radical Numerics' straightforward approach to A2D conversion: start from a strong AR checkpoint, replace the causal mask with a bidirectional mask, and continue pretraining with the masked diffusion objective and learning rate warmup.

How does RND1 perform compared to other open diffusion models?
RND1 sets new state-of-the-art performance among open diffusion language models, consistently outperforming Dream-7B and LLaDA-8B while preserving strong performance across reasoning, STEM, and coding benchmarks.

Continue Your AI Journey

Explore our lessons and glossary to deepen your understanding.