RND1: Largest Open Diffusion Language Model Released

Radical Numerics introduces RND1-Base, a 30B-parameter sparse Mixture-of-Experts diffusion language model with 3B active parameters, converted from a pretrained autoregressive model and released as the largest open diffusion LM to date.

by HowAIWorks Team
ai, diffusion-models, rnd1, radical-numerics, ai-models, open-source, language-models, machine-learning, model-conversion, artificial-intelligence

Introduction

In October 2025, Radical Numerics announced the release of RND1-Base (Radical Numerics Diffusion), marking a significant breakthrough in diffusion language model (DLM) development. This experimental 30B-parameter sparse Mixture-of-Experts model with 3B active parameters represents the largest and most capable open-source diffusion language model to date.

The model was created through a novel autoregressive-to-diffusion (A2D) conversion approach: starting from a pretrained AR model (Qwen3-30B-A3B), it was continually pretrained on 500B tokens to achieve full diffusion behavior. Radical Numerics has open-sourced the model alongside their training recipe, inference code, and sample outputs, advancing the field of AI research through transparency and collaboration.

What Are Diffusion Language Models?

Diffusion Language Models (DLMs) represent an alternative paradigm to traditional autoregressive language models. While autoregressive models like GPT generate text sequentially from left to right, DLMs use a diffusion process with several distinguishing characteristics (illustrated in the toy sketch after this list):

  • Parallel generation: Unlike sequential AR decoding, DLMs can generate multiple tokens simultaneously
  • Bidirectional context: Information is processed from both directions rather than only through causal (left-to-right) attention
  • Masked prediction: Training predicts masked tokens across entire sequences
  • Different generation dynamics: Alternative approaches to text generation with potentially unique capabilities
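
To make the idea concrete, here is a minimal, self-contained toy sketch of masked-diffusion-style generation. It is not RND1's actual decoder: a random stub stands in for a real bidirectional Transformer, and all names and values are illustrative. The point is the control flow, which unmasks several positions per step using context from both sides.

```python
import random

# Toy masked-diffusion-style generation (illustrative only, not RND1's decoder).
# Start from a fully masked sequence and unmask several positions per step,
# using bidirectional context. A real DLM would score tokens with a Transformer;
# here a random stub keeps the sketch self-contained.

MASK = "<mask>"
VOCAB = ["the", "cat", "sat", "on", "a", "mat"]

def toy_model(sequence):
    """Stand-in for a bidirectional model: propose (token, confidence) per masked slot."""
    return {i: (random.choice(VOCAB), random.random())
            for i, tok in enumerate(sequence) if tok == MASK}

def generate(length=6, num_steps=3):
    seq = [MASK] * length
    for step in range(num_steps):
        proposals = toy_model(seq)
        if not proposals:
            break
        # Unmask the most confident positions *in parallel*, unlike an AR model
        # that commits to exactly one next token per forward pass.
        k = max(1, len(proposals) // (num_steps - step))
        for i, (tok, _) in sorted(proposals.items(), key=lambda kv: -kv[1][1])[:k]:
            seq[i] = tok
        print(f"step {step}: {' '.join(seq)}")
    return seq

generate()
```

Each pass fills in a batch of masked slots at once, whereas an AR decoder would produce exactly one new token per forward pass.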

However, training DLMs from scratch has historically been challenging due to scaling inefficiencies compared to autoregressive models. DLMs require more passes over finite datasets to surpass AR training performance, and the mature infrastructure for AR models provides significant advantages.

RND1 Key Features

Large-Scale Architecture

RND1-Base introduces impressive technical specifications:

  • 30B total parameters: Sparse Mixture-of-Experts architecture for efficient scaling
  • 3B active parameters: Only about 10% of parameters are active per token (see the routing sketch after this list)
  • Qwen3-30B-A3B foundation: Built from a strong pretrained autoregressive checkpoint
  • 500B token training: Extensive continual pretraining for full diffusion behavior
  • Open weights: Fully available on Hugging Face for community use and research
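
To make the sparse-activation point concrete, below is a minimal NumPy sketch of top-k expert routing, the mechanism behind a Mixture-of-Experts layer. The sizes (8 experts, top-2 routing) and the random weights are illustrative assumptions, not RND1's published configuration; what matters is that only the selected experts' matrices participate in each token's forward pass.

```python
import numpy as np

# Illustrative top-2 routing in a sparse Mixture-of-Experts layer (toy sizes,
# not RND1's published configuration). Each token is sent to only 2 of the
# 8 experts, so only a fraction of the layer's parameters is used per token.

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

router_w = rng.normal(size=(d_model, n_experts))           # router projection
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x):
    """x: (d_model,) hidden state for one token."""
    logits = x @ router_w                                   # score every expert
    chosen = np.argsort(logits)[-top_k:]                    # keep the top-k experts
    weights = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()
    # Only the chosen experts' weight matrices are ever multiplied:
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

out = moe_layer(rng.normal(size=d_model))
print(f"output shape: {out.shape}, expert parameters touched: {top_k / n_experts:.0%}")
```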

Simple Continual Pretraining (SCP)

Radical Numerics developed a remarkably straightforward approach to A2D conversion that matches or exceeds complex multi-stage pipelines. Simple Continual Pretraining (SCP) represents a breakthrough in model conversion methodology:

Key Innovation: "Simple methods can match complex conversion pipelines" - This finding could accelerate diffusion model research across the industry by lowering technical barriers and reducing development time.

SCP vs. Traditional A2D Approaches

| Aspect | Traditional A2D Methods | Simple Continual Pretraining |
|---|---|---|
| Conversion stages | Multi-stage (3-5 phases) | Single-stage |
| Complexity | High (complex scheduling) | Low (straightforward) |
| Implementation time | Weeks to months | Days to weeks |
| Mask transition | Gradual annealing | Immediate replacement |
| Architecture editing | Systematic grafting | Minimal changes |
| Reproducibility | Difficult (many parameters) | Easy (simple recipe) |
| Performance | State-of-the-art | Matches or exceeds |

Traditional Approaches (Complex):

  • Attention mask annealing: Gradually relaxes causal mask to enable bidirectionality
  • Grafting: Systematically edits architectures to swap attention mechanisms
  • Multi-stage pipelines: Multiple conversion phases with complex scheduling

Simple Continual Pretraining (RND1's Approach):

  1. Start from a strong AR checkpoint
  2. Replace the causal mask with a bidirectional mask at initialization
  3. Continue pretraining under the masked diffusion objective with learning rate warmup

This simpler method proves that sophisticated conversion recipes aren't always necessary, making DLM development more accessible and scalable.
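
The recipe is simple enough to sketch end to end. Below is a toy, self-contained PyTorch illustration of the three steps, assuming a masked diffusion objective in which a random subset of tokens is hidden and predicted from bidirectional context. The tiny model, hyperparameters, and fake data are stand-ins for illustration, not Radical Numerics' released training code.

```python
import torch
import torch.nn as nn

# Schematic of Simple Continual Pretraining (SCP). The tiny model stands in for
# the pretrained AR checkpoint; all sizes and values are illustrative.

VOCAB, D_MODEL, MASK_ID = 100, 32, 0

class TinyLM(nn.Module):
    """Stand-in for the pretrained AR Transformer."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB)
        self.causal = True                      # AR models use a causal mask

    def forward(self, tokens):
        seq_len = tokens.size(1)
        # SCP step 2: when self.causal is False, no mask is applied and every
        # position attends to the full (bidirectional) context.
        attn_mask = (nn.Transformer.generate_square_subsequent_mask(seq_len)
                     if self.causal else None)
        return self.head(self.encoder(self.embed(tokens), mask=attn_mask))

def masked_diffusion_loss(model, tokens, mask_prob=0.5):
    """Mask a subset of positions and train the model to recover them.
    (The full objective samples the masking rate rather than fixing it.)"""
    mask = torch.rand(tokens.shape) < mask_prob
    corrupted = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    logits = model(corrupted)
    # Supervision comes only from the masked positions.
    return nn.functional.cross_entropy(logits[mask], tokens[mask])

# 1. Start from a strong AR checkpoint (here, just an untrained stand-in).
model = TinyLM()
# 2. Replace the causal mask with a bidirectional mask at initialization.
model.causal = False
# 3. Continue pretraining under the masked diffusion objective with LR warmup.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=100)
for step in range(3):                           # a few toy steps
    tokens = torch.randint(1, VOCAB, (8, 16))   # fake batch of token ids
    loss = masked_diffusion_loss(model, tokens)
    loss.backward(); optimizer.step(); warmup.step(); optimizer.zero_grad()
    print(f"step {step}: loss {loss.item():.3f}")
```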

Layer-Specific Learning Rates

To address catastrophic forgetting—where A2D conversion might overwrite factual knowledge from AR pretraining—RND1 employs variable learning rates across parameter groups:

  • Attention layers: Higher learning rates to promote adaptation to bidirectional context
  • MLP/FFN layers: Lower learning rates to retain factual knowledge encoded during AR pretraining
  • Knowledge preservation: Based on research showing that factual associations in Transformer-based LMs are primarily encoded in FFN/MLP layers

This targeted approach preserves the vast linguistic and factual knowledge from trillions of tokens of pretraining while enabling bidirectional understanding.
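
In PyTorch-style training code, this split is typically expressed through optimizer parameter groups. The sketch below shows one plausible way to set it up; the name-matching rule and the learning-rate values are illustrative assumptions, not RND1's exact recipe.

```python
import torch

# Illustrative layer-specific learning rates via optimizer parameter groups.
# The name-matching rule and the values are assumptions for this sketch.

def build_param_groups(model, attn_lr=3e-4, mlp_lr=3e-5):
    attn_params, mlp_params, other_params = [], [], []
    for name, param in model.named_parameters():
        if "attn" in name or "attention" in name:
            attn_params.append(param)    # adapt quickly to bidirectional context
        elif "mlp" in name or "ffn" in name:
            mlp_params.append(param)     # change slowly to preserve factual knowledge
        else:
            other_params.append(param)
    return [
        {"params": attn_params, "lr": attn_lr},
        {"params": mlp_params, "lr": mlp_lr},
        {"params": other_params, "lr": mlp_lr},
    ]

# Usage with any nn.Module, e.g. the TinyLM stand-in from the SCP sketch above:
# optimizer = torch.optim.AdamW(build_param_groups(model))
```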

Optimized for Large Batch Sizes

A critical discovery in RND1's development is that diffusion training thrives with larger batch sizes than traditional AR training:

The Challenge:

  • AR models: Every token contributes to the loss (100% supervision)
  • Diffusion models: Only masked positions contribute to learning (~50% supervision on average)
  • Result: Standard AR batch size heuristics don't transfer to diffusion training

Research Insight: Diffusion models require approximately 2x larger batch sizes than AR models to achieve equivalent learning signal density due to the reduced supervision from masking.
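
A quick back-of-the-envelope calculation shows where a figure like 2x comes from. Only the ~50% average masking rate is taken from the discussion above; the batch size and sequence length below are illustrative.

```python
# Back-of-the-envelope: supervised tokens per optimizer step.
# Numbers are illustrative; only the ~50% masking rate comes from the text above.

seq_len = 4096
mask_rate = 0.5                  # fraction of positions masked on average

ar_batch = 512                   # sequences per step for an AR model
ar_supervised = ar_batch * seq_len                    # every token contributes

diff_batch = 1024                # ~2x larger batch for the diffusion model
diff_supervised = diff_batch * seq_len * mask_rate    # only masked positions contribute

print(f"AR supervised tokens/step:        {ar_supervised:,}")
print(f"Diffusion supervised tokens/step: {int(diff_supervised):,}")
# Both come out to ~2.1M, which is why diffusion training needs roughly
# 1/mask_rate (~2x) the AR batch size for a comparable learning signal.
```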

RND1's Solution:

  • Empirical testing: Critical batch size (CBS) estimation through branched training experiments
  • Key finding: Diffusion loss keeps decreasing as batch size grows to 8M tokens at the 4B-parameter scale
  • Scaling benefit: Demonstrates that DLMs benefit from large batch sizes during continual pretraining
  • Efficiency signal: Encouraging indication for large-scale training viability

Performance Benchmarks

State-of-the-Art Among Open DLMs

RND1 establishes new performance standards across multiple evaluation frameworks:

Reasoning & Commonsense:

  • MMLU: Strong performance on massive multitask language understanding
  • ARC-C: Competitive results on challenging science questions
  • RACE: High accuracy on reading comprehension tasks
  • BBH: Solid performance on Big-Bench Hard reasoning tasks

STEM & Mathematics:

  • GSM8K: Competitive mathematical reasoning capabilities
  • AIME: Strong results on advanced mathematics problems

Code Generation:

  • MBPP: Effective code generation on Mostly Basic Python Problems

Overall Performance:

  • Consistently outperforms prior open diffusion baselines (Dream-7B and LLaDA-8B)
  • Preserves strong performance from autoregressive foundation
  • Demonstrates that scaling DLMs beyond 8B parameters is practical
  • First open-source effort to demonstrate diffusion model training at this scale

Comparison with Baseline Models

RND1's performance relative to existing open-source diffusion models:

| Model | Parameters | Type | Performance Level |
|---|---|---|---|
| RND1-Base | 30B (3B active) | Diffusion MoE | State-of-the-art |
| Dream-7B | 7B | Diffusion | Baseline |
| LLaDA-8B | 8B | Diffusion | Baseline |
| Qwen3-30B-A3B | 30B (3B active) | Autoregressive | Foundation model |

The results demonstrate that A2D conversion preserves the strong capabilities of the autoregressive foundation while introducing bidirectional context and parallel generation capabilities.

Technical Innovation: A2D Conversion

The Model Conversion Paradigm

Radical Numerics positions RND1 as part of their broader model conversion research direction:

Core Philosophy:

  • Optimize models at the level of architecture and training objectives
  • Avoid rebuilding entire systems from scratch
  • Enable faster iteration on models
  • Adapt models to specific workflows, hardware, and downstream tasks

Advantages Over Training From Scratch:

  • Leverages mature AR training infrastructure and expertise
  • Preserves knowledge from trillions of tokens of pretraining
  • More efficient use of computational resources
  • Faster development and experimentation cycles

Key Research Contributions

Radical Numerics' work on RND1 advances the field through:

  1. Systematic A2D study at scale: Comprehensive investigation of initialization, layer-specific learning rates, and critical batch size
  2. Scalability factors: Identification of elements enabling stability and scaling when combined with AR pretraining methodologies
  3. Largest base DLM: Demonstration that principled AR-to-diffusion conversion produces high performance across benchmarks
  4. Open research: Transparent sharing of training recipes, code, and model weights

Why Model Customization Matters

Efficient AI Research

RND1 exemplifies a new approach to AI development:

  • No starting from scratch: Building on existing models rather than complete retraining
  • Faster exploration: Testing new architectures and training paradigms efficiently
  • Resource efficiency: Better use of computational resources and existing knowledge
  • Bolder experimentation: Lower barriers to testing innovative ideas

Recursive Self-Improvement Vision

Radical Numerics positions RND1 within their larger mission:

Automated AI Research Platform:

  • Enable recursive self-improvement in AI systems
  • Allow AI systems to help design and optimize next-generation AI
  • Automate experimentation loops for faster search space traversal
  • Test more ambitious ideas through systematic exploration

Open Research Philosophy:

  • Sharing models, recipes, and insights multiplies progress
  • Collaborative development accelerates the entire field
  • Democratizes access to advanced AI capabilities
  • Makes customized intelligence for every domain accessible

Industry Impact and Applications

Research Implications

RND1's release has significant implications for AI research:

Diffusion Model Viability:

  • Proves that large-scale DLMs (30B+) are feasible and practical
  • Demonstrates A2D conversion as viable alternative to training from scratch
  • Shows that simple methods can match complex conversion pipelines
  • Validates model conversion as effective research strategy

Open Science Advancement:

  • Complete transparency with training recipes and code
  • All test trajectories published for reproducibility
  • Community access enables validation and further research
  • Sets standard for open AI research practices

Practical Applications

Potential use cases for RND1 and similar diffusion models:

Alternative Generation Paradigms:

  • Parallel text editing: Simultaneous refinement of multiple text segments in document editing applications
  • Multi-document synthesis: Generating summaries by processing multiple sources bidirectionally
  • Interactive writing: Real-time text refinement where users can edit any part and regenerate contextually
  • Iterative content improvement: Progressive refinement of drafts through multiple denoising passes
  • Constrained generation: Generating text that must satisfy multiple bidirectional constraints simultaneously

Research Platform:

  • DLM architecture studies: Foundation for exploring different diffusion model architectures and training approaches
  • Domain adaptation: Base model for fine-tuning on specialized domains (medical, legal, scientific)
  • Generation paradigm research: Comparing autoregressive vs. diffusion approaches for different task types
  • Model conversion techniques: Reference implementation demonstrating A2D conversion at scale
  • Hybrid model development: Starting point for creating models combining AR and diffusion capabilities

Efficient Inference:

  • Cost-effective deployment: Sparse MoE with only 3B of 30B parameters active per token cuts per-token compute by roughly 90% relative to a dense model of the same size
  • Edge device potential: Smaller active parameter count enables deployment on resource-constrained hardware
  • Batch processing optimization: Parallel generation allows efficient processing of multiple sequences
  • Flexible quality-speed tradeoffs: Adjust the number of denoising iterations to balance quality and latency (see the short sketch after this list)
  • Specialized hardware utilization: Bidirectional attention patterns map well to certain accelerator architectures
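
As a usage-style illustration of the quality-speed point, the toy generate() function from the DLM sketch earlier in this article exposes the tradeoff directly through its step count:

```python
# Fewer denoising steps unmask more positions per pass (faster, coarser output);
# more steps commit to fewer tokens at a time (slower, finer refinement).
fast_draft = generate(length=6, num_steps=2)      # two parallel passes
careful_output = generate(length=6, num_steps=6)  # close to one token per step
```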

Future Directions

Model Conversion as Research Paradigm

RND1 represents the future of efficient AI research:

Automated Experimentation:

  • Systematic exploration of architecture and training space
  • Faster iteration on model improvements
  • More ambitious experimental designs
  • AI-assisted AI development

Recursive Improvement:

  • Models helping to design better models
  • Automated optimization of training procedures
  • Self-improving research systems
  • Accelerated progress through automation

Scaling Beyond 30B

RND1's success at 30B parameters suggests further scaling possibilities:

  • Larger diffusion models: Path to 100B+ parameter DLMs
  • Improved efficiency: Better batch size utilization and training dynamics
  • Enhanced capabilities: Potential for frontier-level diffusion model performance
  • Practical deployment: MoE architecture enables efficient scaling

Accessing RND1

Open Source Availability

Radical Numerics has made RND1 fully accessible:

Available Resources:

  • Model weights: Available on the Hugging Face Hub (an illustrative loading snippet follows this list)
  • Training code: Complete training recipe and implementation
  • Inference code: Tools for using RND1 in applications
  • Sample outputs: Example generations for evaluation
  • Technical report: Comprehensive documentation of methodology
  • Test trajectories: All evaluation results for reproducibility
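
For orientation, loading the weights should look roughly like a standard Hugging Face workflow. The repository identifier below is a placeholder and trust_remote_code may or may not be required; check the official model card for the actual instructions.

```python
from transformers import AutoModel, AutoTokenizer

# Hypothetical repository id; consult the official RND1 model card for the real one.
repo_id = "radicalnumerics/RND1-Base"

# trust_remote_code is commonly needed for models with custom architectures.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)

print(f"Loaded {sum(p.numel() for p in model.parameters()) / 1e9:.1f}B parameters")
```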

Community Benefits:

  • Free access for research and experimentation
  • Reproducible results and transparent methodology
  • Foundation for further model development
  • Educational resource for understanding DLMs

Conclusion

Bottom Line: RND1-Base proves that simple, well-designed methods can achieve breakthrough results - sometimes the most elegant solution is the simplest one.

RND1-Base represents a significant milestone in artificial intelligence research, demonstrating that diffusion language models can scale effectively to 30B parameters and beyond. By combining simple continual pretraining with layer-specific learning rates and optimized batch sizing, Radical Numerics has created the most capable open-source diffusion language model to date.

Key Takeaways:

  • Scale breakthrough: The 30B-parameter model with 3B active parameters is the largest open DLM
  • Simple approach: SCP method matches complex multi-stage conversion pipelines
  • Knowledge preservation: Layer-specific learning rates retain AR pretraining knowledge
  • Batch size optimization: DLMs benefit from larger batches than AR models
  • State-of-the-art performance: Outperforms existing open diffusion baselines across benchmarks
  • Open research: Complete transparency with weights, code, and training recipes
  • Model conversion paradigm: Demonstrates efficiency of adapting existing models over training from scratch

This development highlights that alternative AI architectures like diffusion models are becoming increasingly viable for large-scale language modeling. The combination of model conversion techniques with transparent research practices positions RND1 as both a powerful research tool and a stepping stone toward more advanced AI systems.

Radical Numerics' vision of automated AI research platforms and recursive self-improvement represents an exciting future where AI systems help design and optimize the next generation of AI, accelerating progress across the field through collaborative, open development.

Sources

  • Radical Numerics - RND1 Technical Report - Official announcement and technical details
  • RND1 Code Repository - Training recipe and inference code (to be released on GitHub)
  • RND1 Hugging Face Model Weights - Model weights for research and experimentation (to be released on Hugging Face)

Want to learn more about AI models and their capabilities? Explore our AI models catalog, check out our AI fundamentals courses, or browse our glossary of AI terms for deeper understanding. Learn more about Diffusion Language Models, Simple Continual Pretraining, Mixture-of-Experts architecture, and Foundation Models.

Frequently Asked Questions

What is RND1-Base?
RND1-Base is a 30B-parameter sparse Mixture-of-Experts diffusion language model (DLM) with 3B active parameters. It is the largest and most capable open-source diffusion model to date, converted from Qwen3-30B-A3B through simple continual pretraining.

How do diffusion language models differ from autoregressive models?
Unlike autoregressive (AR) models that generate text sequentially left-to-right, diffusion language models support parallel generation and bidirectional context. This allows for potentially faster inference and different generation capabilities while maintaining quality.

What is A2D conversion?
A2D (autoregressive-to-diffusion) conversion is a process that transforms pretrained autoregressive models into diffusion models through continued training. This preserves the knowledge from trillions of tokens of AR pretraining while introducing bidirectional context.

What is Simple Continual Pretraining (SCP)?
SCP is Radical Numerics' straightforward approach to A2D conversion: start from a strong AR checkpoint, replace the causal mask with a bidirectional mask, and continue pretraining with the masked diffusion objective and learning rate warmup.

How does RND1 perform compared to other open diffusion models?
RND1 sets new state-of-the-art performance among open diffusion language models, consistently outperforming Dream-7B and LLaDA-8B while preserving strong performance across reasoning, STEM, and coding benchmarks.

Continue Your AI Journey

Explore our lessons and glossary to deepen your understanding.