Introduction
The NeurIPS 2025 conference has announced its Best Paper Awards, recognizing seven groundbreaking papers that represent significant advances across multiple areas of machine learning and artificial intelligence. The awards include four best paper recipients (one from the Datasets & Benchmarks Track) and three runner-up papers, highlighting research spanning self-supervised reinforcement learning, attention mechanisms for large language models, reasoning capabilities, online learning theory, neural scaling laws, and benchmarking methodologies for language model diversity.
These awards recognize papers that demonstrate exceptional scientific rigor, practical impact, and contributions to the broader machine learning community. The selected papers address some of the most pressing questions in modern AI research, from understanding the diversity limitations of language models to improving the fundamental architectures that power today's AI systems.
The diversity of topics among the awarded papers—from theoretical breakthroughs to practical architectural improvements—demonstrates the vibrant and multifaceted nature of machine learning research. Each paper represents substantial work that advances our understanding of AI systems and their capabilities.
Best Paper Award Winners
Artificial Hivemind: Language Model Diversity and Homogenization
Authors: Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Yejin Choi
This paper addresses a critical concern in modern AI: the homogenization of language model outputs and its potential long-term impact on human creativity and thought diversity. The research introduces Infinity-Chat, a large-scale dataset of 26,000 diverse, real-world, open-ended user queries that admit a wide range of plausible answers with no single ground truth.
Key contributions:
- First comprehensive taxonomy for characterizing open-ended prompts, comprising 6 top-level categories (including creative content generation and brainstorming) that break down into 17 subcategories
- Large-scale study of mode collapse revealing the "Artificial Hivemind" effect: pronounced intra-model repetition and inter-model homogeneity where different models produce strikingly similar outputs
- 31,250 human annotations across absolute ratings and pairwise preferences, with 25 independent annotations per example
- Critical finding: State-of-the-art LMs, reward models, and LM judges are poorly calibrated to human ratings on generations where annotators' idiosyncratic preferences diverge
The paper reveals that despite maintaining comparable overall quality, current AI systems struggle with diversity in open-ended generation, raising serious concerns about long-term risks to human creativity, value plurality, and independent thinking. This work establishes a foundation for future research on preserving heterogeneity in AI systems.
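As a toy illustration (not the paper's methodology), the kind of intra- and inter-model homogeneity described above can be probed by averaging pairwise similarity over sampled outputs. The sketch below uses a crude lexical Jaccard measure; all strings and function names are hypothetical:

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over word sets (a crude lexical proxy)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def mean_pairwise_similarity(outputs: list[str]) -> float:
    """Average similarity across all output pairs; higher = more homogeneous."""
    pairs = list(combinations(outputs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Two near-identical completions plus one distinct one: the mean sits
# well above what a fully diverse sample set would produce.
samples = [
    "a quiet storm of ideas swept the room",
    "a quiet storm of ideas filled the room",
    "the recipe calls for patience and salt",
]
score = mean_pairwise_similarity(samples)
```

In practice one would use embedding-based or annotator-grounded similarity rather than word overlap, but the aggregate statistic — mean pairwise similarity across sampled generations — is the general shape of a homogenization probe.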
Gated Attention for Large Language Models
Authors: Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, Junyang Lin
This paper presents a simple yet powerful modification to the standard attention mechanism used in transformer models: applying head-specific sigmoid gates after the Scaled Dot-Product Attention (SDPA) operation.
Key findings:
- Comprehensive evaluation: Tested over 30 variants of gated attention on 15B Mixture-of-Experts (MoE) models and 1.7B dense models trained on a 3.5 trillion token dataset
- Consistent improvements: The simple gating modification consistently improves performance, training stability, and scaling properties
- Enhanced capabilities: The approach tolerates larger learning rates, mitigates attention sink phenomena, and improves long-context extrapolation performance
- Mechanism understanding: The effectiveness stems from two factors: introducing non-linearity on top of the low-rank mapping in softmax attention, and applying query-dependent sparse gating scores to modulate the SDPA output
This modification has been implemented in the Qwen3-Next models, demonstrating its practical value.
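A minimal NumPy sketch of the idea described above, with one simplifying assumption: the gate here is computed from the query vectors, whereas the paper derives it from the pre-attention hidden states. Shapes and names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gated_sdpa(q, k, v, w_gate):
    """Scaled dot-product attention followed by a head-specific,
    query-dependent sigmoid gate.
    q, k, v: (heads, seq, d_head); w_gate: (heads, d_head, d_head)."""
    d = q.shape[-1]
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))  # (heads, seq, seq)
    out = attn @ v                                          # standard SDPA output
    gate = 1.0 / (1.0 + np.exp(-(q @ w_gate)))              # sigmoid in (0, 1)
    return gate * out                                       # elementwise modulation

h, s, d = 2, 4, 8
q, k, v = (rng.standard_normal((h, s, d)) for _ in range(3))
w_gate = rng.standard_normal((h, d, d))
y = gated_sdpa(q, k, v, w_gate)
```

Because each head learns its own gate projection, the sigmoid introduces the non-linearity and the query-dependence the bullets above attribute the gains to.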
The paper represents substantial work possible only with access to industrial-scale computing resources, and the authors' open sharing of results, code, and models advances the community's understanding of attention in large language models.
1000 Layer Networks for Self-Supervised RL
Authors: Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Tomasz Trzcinski, Benjamin Eysenbach
This paper challenges conventional assumptions about network depth in reinforcement learning, demonstrating that increasing depth up to 1024 layers can significantly boost performance in self-supervised RL settings.
Key contributions:
- Depth scaling breakthrough: While most RL papers rely on shallow architectures (2-5 layers), this work shows that increasing depth to 1024 layers significantly improves performance
- Unsupervised goal-conditioned setting: Experiments conducted without demonstrations or rewards, requiring agents to explore from scratch and learn to maximize likelihood of reaching commanded goals
- Performance improvements: Boosts the performance of a self-supervised contrastive RL algorithm, outperforming other goal-conditioned baselines
- Qualitative behavior changes: Increasing model depth not only increases success rates but also qualitatively changes the behaviors learned
The research addresses the question of whether RL provides sufficient information to guide numerous parameters in deep networks. By demonstrating that depth scaling works in self-supervised RL, the paper opens new possibilities for scaling RL architectures similar to the breakthroughs seen in language and vision through self-supervised learning.
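The paper's exact architecture is not reproduced here, but residual connections of the kind that make very deep networks trainable can be sketched in a few lines. Depth, width, and the initialization scale below are arbitrary placeholders, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_residual_mlp(depth: int, width: int, scale: float = 0.01):
    """Weights for `depth` residual blocks: x <- x + W2 @ relu(W1 @ x)."""
    return [(rng.standard_normal((width, width)) * scale,
             rng.standard_normal((width, width)) * scale)
            for _ in range(depth)]

def forward(blocks, x):
    for w1, w2 in blocks:
        # The skip connection keeps the signal from vanishing or exploding
        # even when many blocks are stacked.
        x = x + w2 @ np.maximum(w1 @ x, 0.0)
    return x

x = rng.standard_normal(64)
deep = make_residual_mlp(depth=1024, width=64)
y = forward(deep, x)  # 1024 blocks, yet activations stay well-behaved
```

The point of the sketch is structural: with residual blocks, scaling depth to 1024 is numerically viable at initialization, which is a precondition for the depth scaling the paper demonstrates.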
Runner-Up Papers
RL Does Not Expand Reasoning Capabilities in LLMs
One runner-up paper delivers an important negative result, challenging a widely accepted assumption: that Reinforcement Learning with Verifiable Rewards (RLVR) elicits genuinely new reasoning capabilities in large language models.
Key findings:
- RLVR enhances sampling efficiency but doesn't expand reasoning capacity beyond what's present in base models
- RL narrows exploration: Rewarded trajectories are amplified, but the broader solution space shrinks
- Optimization within distribution: RLVR optimizes within, rather than beyond, the base distribution
- Distillation insights: Unlike RL, distillation can introduce new reasoning patterns and genuinely expand model capabilities
This finding highlights the need for fundamentally new RL paradigms that can navigate vast action spaces and genuinely expand LLM reasoning capabilities, rather than just improving efficiency within existing capabilities.
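Studies of this kind typically compare base and RL-tuned models with the pass@k metric: RL tuning often wins at small k (better sampling efficiency) while the base model catches up or wins at large k (broader solution coverage). The standard unbiased estimator, given n samples of which c are correct, can be sketched as follows; the toy numbers are hypothetical:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n (c of them correct), is correct."""
    if n - c < k:
        return 1.0  # not enough incorrect samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical: a tuned model with higher pass@1 than the base model
# (0.45 vs 0.30) can still trail it once many samples are allowed.
tuned_pass1 = pass_at_k(n=100, c=45, k=1)
base_pass50 = pass_at_k(n=100, c=30, k=50)
```

Under this lens, "enhances sampling efficiency but doesn't expand capacity" translates to: RLVR raises pass@1 without raising the pass@k ceiling.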
Optimal Mistake Bounds for Transductive Online Learning
Authors: Zachary Chase, Steve Hanneke, Shay Moran, Jonathan Shafer
This paper resolves a 30-year-old open problem in learning theory by precisely quantifying the power of unlabeled data in online learning.
Breakthrough results:
- Tight characterization: Proved that for every concept class with Littlestone dimension d, the transductive mistake bound is Ω(√d), and showed this is tight via a matching O(√d) upper bound
- Exponential improvement: The Ω(√d) lower bound is an exponential improvement over previously known lower bounds
- Quadratic gap: Establishes a quadratic gap between transductive and standard online learning, quantifying the benefit of advance access to the unlabeled instance sequence
The proof techniques are remarkable, employing sophisticated adversary strategies and innovative hypothesis class constructions that integrate multiple advanced techniques including "Danger Zone Minimization" and "Splitting Experts" via multiplicative weights.
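The contrast can be stated compactly. Writing d for the Littlestone dimension of a class $\mathcal{H}$, the optimal mistake bound in standard online learning is exactly d (Littlestone's classic result), while this paper pins the transductive bound to order √d:

```latex
% Optimal mistake bounds for a concept class H with Littlestone dimension d
\[
  M_{\text{standard}}(\mathcal{H}) = d,
  \qquad
  M_{\text{transductive}}(\mathcal{H}) = \Theta\!\left(\sqrt{d}\right),
\]
```

so seeing the unlabeled instance sequence in advance yields a quadratic saving in mistakes, and no more.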
Superposition Yields Robust Neural Scaling
Authors: Yizhou Liu, Ziming Liu, Jeff Gore
This paper moves beyond observing neural scaling laws to demonstrate that representation superposition constitutes the primary mechanism governing these laws.
Key insights:
- Mechanism identification: Representation superposition—where LLMs represent more features than they have dimensions—is a key contributor to the loss and drives how it scales
- Controlled experiments: Using Anthropic's toy model with weight decay to control the degree of superposition, the authors systematically studied how loss scales with model size
- Strong superposition regime: Under strong superposition, loss generically scales inversely with model dimension across broad frequency distributions
- Real-world confirmation: Confirmed that open-sourced LLMs operate in the strong superposition regime with loss scaling inversely with model dimension
The results identify representation superposition as a central driver of neural scaling laws, providing insights into when neural scaling laws can be improved and when they will break down.
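A minimal numerical illustration (not the paper's toy model, and a deliberate simplification) of where 1/dimension behavior can come from: when more features than dimensions are packed in as random unit vectors, their mean squared pairwise interference scales as 1/dim, so doubling the model dimension roughly halves it:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_interference(n_features: int, dim: int, trials: int = 5) -> float:
    """Mean squared overlap between distinct random unit feature vectors.
    With n_features > dim the features cannot be orthogonal; for random
    directions the expected squared overlap is 1/dim."""
    total = 0.0
    for _ in range(trials):
        w = rng.standard_normal((n_features, dim))
        w /= np.linalg.norm(w, axis=1, keepdims=True)  # unit feature vectors
        g = w @ w.T                                    # pairwise overlaps
        np.fill_diagonal(g, 0.0)                       # ignore self-overlap
        total += (g ** 2).sum() / (n_features * (n_features - 1))
    return total / trials

i64 = mean_interference(1000, 64)    # ~ 1/64
i128 = mean_interference(1000, 128)  # ~ 1/128, i.e. about half of i64
```

This interference-per-pair picture is only a cartoon of the superposition mechanism; the paper's actual analysis ties the loss scaling to trained toy models and open LLMs, but the 1/dim dependence is the shared thread.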
Implications and Future Directions
Diversity and Pluralism in AI Systems
The Artificial Hivemind paper raises critical questions about the long-term societal impact of language model homogenization. As AI systems become more prevalent in content generation, the lack of diversity could:
- Limit creative expression: Reduced diversity in AI-generated content may constrain human creativity
- Affect value plurality: Homogenized outputs may not reflect diverse human values and perspectives
- Impact independent thinking: Repeated exposure to similar outputs could influence human thought patterns
Future research must address these concerns by developing methods to preserve and enhance diversity in AI systems while maintaining quality and alignment with human preferences.
Architectural Improvements for LLMs
The gated attention paper demonstrates that simple architectural modifications can yield significant improvements. This finding:
- Encourages architectural exploration: Simple, well-motivated changes can outperform complex alternatives
- Highlights importance of systematic evaluation: Comprehensive experiments across many variants reveal optimal solutions
- Supports open research: Sharing results, code, and models advances the entire community
The implementation in Qwen3-Next models shows immediate practical value, and we can expect this modification to be widely adopted across the LLM community.
Scaling RL Architectures
The 1000-layer network paper opens new possibilities for scaling reinforcement learning:
- Depth as a scaling dimension: Similar to language and vision, depth can be a key factor in RL performance
- Self-supervised RL potential: Unsupervised goal-conditioned settings may enable new capabilities
- Architecture exploration: RL architectures may benefit from similar depth scaling seen in other domains
This research suggests that RL may be ready for the kind of scaling breakthroughs that transformed language and vision models.
Conclusion
The NeurIPS 2025 Best Paper Awards recognize exceptional contributions that advance our understanding of machine learning and artificial intelligence. From theoretical breakthroughs resolving decades-old problems to practical architectural improvements with immediate applications, these papers demonstrate the breadth and depth of current AI research.
The awarded papers address critical questions about AI system diversity, fundamental architecture improvements, scaling laws, and the limitations of current training paradigms. They provide both insights into how current systems work and directions for future improvements.
As the AI field continues to evolve rapidly, these contributions will influence research directions, architectural choices, and our understanding of what's possible with machine learning. The open sharing of results, code, and models—especially in the gated attention paper—supports the broader research community and enables faster progress.
For researchers, practitioners, and anyone interested in the state of AI research, these papers represent essential reading that will shape the field's development in the coming years. The diversity of topics—from theoretical learning theory to practical model improvements—reflects the multifaceted nature of modern AI research and its broad impact on technology and society.