Introduction
The NeurIPS 2025 conference has announced its Best Paper Awards, recognizing seven groundbreaking papers that represent significant advances across multiple areas of machine learning and artificial intelligence. The awards include four best paper recipients (one from the Datasets & Benchmarks Track) and three runner-up papers, highlighting research spanning self-supervised reinforcement learning, attention mechanisms for large language models, reasoning capabilities, online learning theory, neural scaling laws, and benchmarking methodologies for language model diversity.
These awards recognize papers that demonstrate exceptional scientific rigor, practical impact, and contributions to the broader machine learning community. The selected papers address some of the most pressing questions in modern AI research, from understanding the diversity limitations of language models to improving the fundamental architectures that power today's AI systems.
The diversity of topics among the awarded papers—from theoretical breakthroughs to practical architectural improvements—demonstrates the vibrant and multifaceted nature of machine learning research. Each paper represents substantial work that advances our understanding of AI systems and their capabilities.
Best Paper Award Winners
Artificial Hivemind: Language Model Diversity and Homogenization
Authors: Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Yejin Choi
This paper addresses a critical concern in modern AI: the homogenization of language model outputs and its potential long-term impact on human creativity and thought diversity. The research introduces Infinity-Chat, a large-scale dataset of 26,000 diverse, real-world, open-ended user queries that admit a wide range of plausible answers with no single ground truth.
Key contributions:
- First comprehensive taxonomy for characterizing open-ended prompts, comprising 6 top-level categories (including creative content generation and brainstorming) that break down into 17 subcategories
- Large-scale study of mode collapse revealing the "Artificial Hivemind" effect: pronounced intra-model repetition and inter-model homogeneity where different models produce strikingly similar outputs
- 31,250 human annotations across absolute ratings and pairwise preferences, with 25 independent annotations per example
- Critical finding: State-of-the-art LMs, reward models, and LM judges are poorly calibrated to human ratings on generations where annotators' idiosyncratic preferences diverge
The paper reveals that despite maintaining comparable overall quality, current AI systems struggle with diversity in open-ended generation, raising serious concerns about long-term risks to human creativity, value plurality, and independent thinking. This work establishes a foundation for future research on preserving heterogeneity in AI systems.
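As a toy illustration (not the paper's methodology), the kind of intra- and inter-model homogeneity described above can be probed by averaging pairwise similarity over sampled outputs. The sketch below uses a crude lexical Jaccard measure; all strings and function names are hypothetical:

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over word sets (a crude lexical proxy)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def mean_pairwise_similarity(outputs: list[str]) -> float:
    """Average similarity across all output pairs; higher = more homogeneous."""
    pairs = list(combinations(outputs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Two near-identical completions plus one distinct one: the mean sits
# well above what a fully diverse sample set would produce.
samples = [
    "a quiet storm of ideas swept the room",
    "a quiet storm of ideas filled the room",
    "the recipe calls for patience and salt",
]
score = mean_pairwise_similarity(samples)
```

In practice one would use embedding-based or annotator-grounded similarity rather than word overlap, but the aggregate statistic — mean pairwise similarity across sampled generations — is the general shape of a homogenization probe.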
Gated Attention for Large Language Models
Authors: Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, Junyang Lin
This paper presents a simple yet powerful modification to the standard attention mechanism used in transformer models: applying head-specific sigmoid gates after the Scaled Dot-Product Attention (SDPA) operation.
Key findings:
- Comprehensive evaluation: Tested over 30 variants of gated attention on 15B Mixture-of-Experts (MoE) models and 1.7B dense models trained on a 3.5 trillion token dataset
- Consistent improvements: The simple gating modification consistently improves performance, training stability, and scaling properties
- Enhanced capabilities: The approach tolerates larger learning rates, mitigates attention sink phenomena, and improves long-context extrapolation performance
- Mechanism understanding: The effectiveness stems from two factors: introducing non-linearity on top of the low-rank mapping in softmax attention, and applying query-dependent sparse gating scores to modulate the SDPA output
This modification has been implemented in the Qwen3-Next models, demonstrating its practical value.
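A minimal NumPy sketch of the idea described above, with one simplifying assumption: the gate here is computed from the query vectors, whereas the paper derives it from the pre-attention hidden states. Shapes and names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gated_sdpa(q, k, v, w_gate):
    """Scaled dot-product attention followed by a head-specific,
    query-dependent sigmoid gate.
    q, k, v: (heads, seq, d_head); w_gate: (heads, d_head, d_head)."""
    d = q.shape[-1]
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))  # (heads, seq, seq)
    out = attn @ v                                          # standard SDPA output
    gate = 1.0 / (1.0 + np.exp(-(q @ w_gate)))              # sigmoid in (0, 1)
    return gate * out                                       # elementwise modulation

h, s, d = 2, 4, 8
q, k, v = (rng.standard_normal((h, s, d)) for _ in range(3))
w_gate = rng.standard_normal((h, d, d))
y = gated_sdpa(q, k, v, w_gate)
```

Because each head learns its own gate projection, the sigmoid introduces the non-linearity and the query-dependence the bullets above attribute the gains to.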
The paper represents substantial work possible only with access to industrial-scale computing resources, and the authors' open sharing of results, code, and models advances the community's understanding of attention in large language models.
1000 Layer Networks for Self-Supervised RL
Authors: Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Tomasz Trzcinski, Benjamin Eysenbach
This paper challenges conventional assumptions about network depth in reinforcement learning, demonstrating that increasing depth up to 1024 layers can significantly boost performance in self-supervised RL settings.
Key contributions:
- Depth scaling breakthrough: While most RL papers rely on shallow architectures (2-5 layers), this work shows that increasing depth to 1024 layers significantly improves performance
- Unsupervised goal-conditioned setting: Experiments conducted without demonstrations or rewards, requiring agents to explore from scratch and learn to maximize likelihood of reaching commanded goals
- Performance improvements: Boosts the performance of a self-supervised contrastive RL algorithm, outperforming other goal-conditioned baselines
- Qualitative behavior changes: Increasing model depth not only increases success rates but also qualitatively changes the behaviors learned
The research addresses the question of whether RL provides sufficient information to guide numerous parameters in deep networks. By demonstrating that depth scaling works in self-supervised RL, the paper opens new possibilities for scaling RL architectures similar to the breakthroughs seen in language and vision through self-supervised learning.
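The paper's exact architecture is not reproduced here, but residual connections of the kind that make very deep networks trainable can be sketched in a few lines. Depth, width, and the initialization scale below are arbitrary placeholders, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_residual_mlp(depth: int, width: int, scale: float = 0.01):
    """Weights for `depth` residual blocks: x <- x + W2 @ relu(W1 @ x)."""
    return [(rng.standard_normal((width, width)) * scale,
             rng.standard_normal((width, width)) * scale)
            for _ in range(depth)]

def forward(blocks, x):
    for w1, w2 in blocks:
        # The skip connection keeps the signal from vanishing or exploding
        # even when many blocks are stacked.
        x = x + w2 @ np.maximum(w1 @ x, 0.0)
    return x

x = rng.standard_normal(64)
deep = make_residual_mlp(depth=1024, width=64)
y = forward(deep, x)  # 1024 blocks, yet activations stay well-behaved
```

The point of the sketch is structural: with residual blocks, scaling depth to 1024 is numerically viable at initialization, which is a precondition for the depth scaling the paper demonstrates.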
Runner-Up Papers
RL Does Not Expand Reasoning Capabilities in LLMs
One runner-up paper delivers an important negative result, challenging a widely accepted assumption: that Reinforcement Learning with Verifiable Rewards (RLVR) elicits genuinely new reasoning capabilities in large language models.
Key findings:
- RLVR enhances sampling efficiency but doesn't expand reasoning capacity beyond what's present in base models
- RL narrows exploration: Rewarded trajectories are amplified, but the broader solution space shrinks
- Optimization within distribution: RLVR optimizes within, rather than beyond, the base distribution
- Distillation insights: Unlike RL, distillation can introduce new reasoning patterns and genuinely expand model capabilities
This finding highlights the need for fundamentally new RL paradigms that can navigate vast action spaces and genuinely expand LLM reasoning capabilities, rather than just improving efficiency within existing capabilities.
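Studies of this kind typically compare base and RL-tuned models with the pass@k metric: RL tuning often wins at small k (better sampling efficiency) while the base model catches up or wins at large k (broader solution coverage). The standard unbiased estimator, given n samples of which c are correct, can be sketched as follows; the toy numbers are hypothetical:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n (c of them correct), is correct."""
    if n - c < k:
        return 1.0  # not enough incorrect samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical: a tuned model with higher pass@1 than the base model
# (0.45 vs 0.30) can still trail it once many samples are allowed.
tuned_pass1 = pass_at_k(n=100, c=45, k=1)
base_pass50 = pass_at_k(n=100, c=30, k=50)
```

Under this lens, "enhances sampling efficiency but doesn't expand capacity" translates to: RLVR raises pass@1 without raising the pass@k ceiling.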
Optimal Mistake Bounds for Transductive Online Learning
Authors: Zachary Chase, Steve Hanneke, Shay Moran, Jonathan Shafer
This paper resolves a 30-year-old open problem in learning theory by precisely quantifying the power of unlabeled data in online learning.
Breakthrough results:
- Tight characterization: Proved that for every concept class with Littlestone dimension d, the transductive mistake bound is Ω(√d), and showed this is tight via a matching O(√d) upper bound
- Exponential improvement: The Ω(√d) lower bound is an exponential improvement over previously known lower bounds
- Quadratic gap: Establishes a quadratic gap between transductive and standard online learning, quantifying the benefit of advance access to the unlabeled instance sequence
The proof techniques are remarkable, employing sophisticated adversary strategies and innovative hypothesis class constructions that integrate multiple advanced techniques including "Danger Zone Minimization" and "Splitting Experts" via multiplicative weights.
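The contrast can be stated compactly. Writing d for the Littlestone dimension of a class $\mathcal{H}$, the optimal mistake bound in standard online learning is exactly d (Littlestone's classic result), while this paper pins the transductive bound to order √d:

```latex
% Optimal mistake bounds for a concept class H with Littlestone dimension d
\[
  M_{\text{standard}}(\mathcal{H}) = d,
  \qquad
  M_{\text{transductive}}(\mathcal{H}) = \Theta\!\left(\sqrt{d}\right),
\]
```

so seeing the unlabeled instance sequence in advance yields a quadratic saving in mistakes, and no more.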
Superposition Yields Robust Neural Scaling
Authors: Yizhou Liu, Ziming Liu, Jeff Gore
This paper moves beyond observing neural scaling laws to demonstrate that representation superposition constitutes the primary mechanism governing these laws.
Key insights:
- Mechanism identification: Representation superposition—where LLMs represent more features than they have dimensions—is a key contributor to the loss and drives how it scales
- Controlled experiments: Using Anthropic's toy model with weight decay to control the degree of superposition, the authors systematically studied how loss scales with model size
- Strong superposition regime: Under strong superposition, loss generically scales inversely with model dimension across broad frequency distributions
- Real-world confirmation: Confirmed that open-sourced LLMs operate in the strong superposition regime with loss scaling inversely with model dimension
The results identify representation superposition as a central driver of neural scaling laws, providing insights into when neural scaling laws can be improved and when they will break down.
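A minimal numerical illustration (not the paper's toy model, and a deliberate simplification) of where 1/dimension behavior can come from: when more features than dimensions are packed in as random unit vectors, their mean squared pairwise interference scales as 1/dim, so doubling the model dimension roughly halves it:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_interference(n_features: int, dim: int, trials: int = 5) -> float:
    """Mean squared overlap between distinct random unit feature vectors.
    With n_features > dim the features cannot be orthogonal; for random
    directions the expected squared overlap is 1/dim."""
    total = 0.0
    for _ in range(trials):
        w = rng.standard_normal((n_features, dim))
        w /= np.linalg.norm(w, axis=1, keepdims=True)  # unit feature vectors
        g = w @ w.T                                    # pairwise overlaps
        np.fill_diagonal(g, 0.0)                       # ignore self-overlap
        total += (g ** 2).sum() / (n_features * (n_features - 1))
    return total / trials

i64 = mean_interference(1000, 64)    # ~ 1/64
i128 = mean_interference(1000, 128)  # ~ 1/128, i.e. about half of i64
```

This interference-per-pair picture is only a cartoon of the superposition mechanism; the paper's actual analysis ties the loss scaling to trained toy models and open LLMs, but the 1/dim dependence is the shared thread.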
Implications and Future Directions
Diversity and Pluralism in AI Systems
The Artificial Hivemind paper raises critical questions about the long-term societal impact of language model homogenization. As AI systems become more prevalent in content generation, the lack of diversity could:
- Limit creative expression: Reduced diversity in AI-generated content may constrain human creativity
- Affect value plurality: Homogenized outputs may not reflect diverse human values and perspectives
- Impact independent thinking: Repeated exposure to similar outputs could influence human thought patterns
Future research must address these concerns by developing methods to preserve and enhance diversity in AI systems while maintaining quality and alignment with human preferences.
Architectural Improvements for LLMs
The gated attention paper demonstrates that simple architectural modifications can yield significant improvements. This finding:
- Encourages architectural exploration: Simple, well-motivated changes can outperform complex alternatives
- Highlights importance of systematic evaluation: Comprehensive experiments across many variants reveal optimal solutions
- Supports open research: Sharing results, code, and models advances the entire community
The implementation in Qwen3-Next models shows immediate practical value, and we can expect this modification to be widely adopted across the LLM community.
Scaling RL Architectures
The 1000-layer network paper opens new possibilities for scaling reinforcement learning:
- Depth as a scaling dimension: Similar to language and vision, depth can be a key factor in RL performance
- Self-supervised RL potential: Unsupervised goal-conditioned settings may enable new capabilities
- Architecture exploration: RL architectures may benefit from similar depth scaling seen in other domains
This research suggests that RL may be ready for the kind of scaling breakthroughs that transformed language and vision models.
Conclusion
The NeurIPS 2025 Best Paper Awards recognize exceptional contributions that advance our understanding of machine learning and artificial intelligence. From theoretical breakthroughs resolving decades-old problems to practical architectural improvements with immediate applications, these papers demonstrate the breadth and depth of current AI research.
The awarded papers address critical questions about AI system diversity, fundamental architecture improvements, scaling laws, and the limitations of current training paradigms. They provide both insights into how current systems work and directions for future improvements.
As the AI field continues to evolve rapidly, these contributions will influence research directions, architectural choices, and our understanding of what's possible with machine learning. The open sharing of results, code, and models—especially in the gated attention paper—supports the broader research community and enables faster progress.
For researchers, practitioners, and anyone interested in the state of AI research, these papers represent essential reading that will shape the field's development in the coming years. The diversity of topics—from theoretical learning theory to practical model improvements—reflects the multifaceted nature of modern AI research and its broad impact on technology and society.