Zoom AI Hits 48.1% on Humanity's Last Exam: A New SOTA

Zoom's federated AI approach achieves 48.1% on the rigorous Humanity's Last Exam benchmark, surpassing Google Gemini 3 Pro.

by HowAIWorks Team
AI News · Zoom · Benchmarks · LLMs · HLE

Introduction

In a surprising turn of events in the AI landscape, Zoom has announced a major breakthrough in artificial intelligence performance. On December 10, 2025, Zoom revealed that its AI system achieved a score of 48.1% on "Humanity's Last Exam" (HLE), setting a new state-of-the-art (SOTA) benchmark.

This achievement places Zoom ahead of previous leaders, including Google's Gemini 3 Pro, which held the top spot with 45.8%. As AI models continue to evolve, HLE has emerged as the gold standard for measuring true expert-level reasoning capabilities, making Zoom's performance particularly significant for the industry.

What is Humanity's Last Exam (HLE)?

Humanity's Last Exam is widely regarded as one of the most difficult and comprehensive tests for artificial intelligence. Unlike standard benchmarks that often rely on rote memorization or simple pattern recognition, HLE is designed to evaluate deep understanding and multi-step reasoning.

Key characteristics of the HLE benchmark include:

  • Graduate-Level Difficulty: It consists of approximately 3,000 questions that require expert-level knowledge.
  • Broad Scope: The questions span over 100 distinct academic disciplines.
  • Reasoning First: The questions are crafted to be non-searchable, forcing models to derive answers through logic and synthesis rather than retrieving pre-existing text from the internet.

Historically, human experts score around 90% on this exam, while leading AI models have struggled to crack the 40% barrier until recently.
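
To make these percentages concrete, here is a minimal sketch of how a benchmark score is computed in principle: accuracy over a fixed question set. The `grade` and `benchmark_score` functions below are illustrative assumptions and do not reproduce HLE's official grading harness, which the article does not describe.

```python
# Toy scoring harness: shows in principle how a score like "48.1%" is
# computed as accuracy over a question set. Illustrative only; this is
# not HLE's official grading pipeline.

def grade(predicted: str, reference: str) -> bool:
    """Exact-match grading; real benchmarks often use rubrics or judge models."""
    return predicted.strip().lower() == reference.strip().lower()

def benchmark_score(model_fn, questions: list[tuple[str, str]]) -> float:
    """Percentage of (question, reference) pairs the model answers correctly."""
    correct = sum(grade(model_fn(q), ref) for q, ref in questions)
    return 100.0 * correct / len(questions)

# Toy usage with a stub "model" that always answers "4":
questions = [("2 + 2 = ?", "4"), ("Capital of France?", "Paris")]
print(benchmark_score(lambda q: "4", questions))  # -> 50.0
```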

The Winning Strategy: Federated AI

Zoom's success on the HLE wasn't achieved by simply training a larger model. Instead, the company employed a Federated AI approach. This strategy moves away from the "one model to rule them all" philosophy and instead leverages a collaborative system of multiple models.

Explore-Verify-Federate

At the core of this approach is an agentic workflow described as "explore-verify-federate".

  1. Explore: The system explores multiple reasoning paths to approach a problem.
  2. Verify: It rigorously checks these paths against known constraints and logic.
  3. Federate: The results are synthesized to produce a final, high-confidence answer.

This method allows Zoom's AI to tackle the complex, multi-layered problems found in HLE by effectively "thinking" through them in a way that single-shot inference often fails to do.
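
For a concrete picture, here is a minimal sketch of what such a loop could look like, assuming majority voting as the synthesis rule. The function names, the stubbed model calls, and the voting scheme are all assumptions; Zoom has not published its implementation.

```python
import random
from collections import Counter

# Hypothetical explore-verify-federate loop. The "model" is stubbed with
# random choices; a real system would sample reasoning paths from one or
# more LLMs.

def explore(question: str, n_paths: int = 5) -> list[str]:
    """Explore: sample several independent candidate answers (stubbed)."""
    return [random.choice(["A", "A", "B"]) for _ in range(n_paths)]

def verify(question: str, candidate: str) -> bool:
    """Verify: check a candidate against known constraints (stubbed).
    A real verifier might re-derive the result or test it against the
    question's stated conditions."""
    return candidate != "B"  # pretend "B" fails a consistency check

def federate(candidates: list[str]) -> str:
    """Federate: synthesize the surviving candidates into one answer,
    here by simple majority vote."""
    return Counter(candidates).most_common(1)[0][0]

def answer(question: str) -> str:
    paths = explore(question)
    verified = [c for c in paths if verify(question, c)]
    return federate(verified or paths)  # fall back if verification rejects all

print(answer("toy question"))  # typically "A" in this toy setup
```

The key design choice in this pattern is that verification filters candidates before synthesis, so a single bad reasoning path cannot dominate the final answer.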

Benchmark Results

The new leaderboard standings highlight how competitive the field has become. Zoom's 48.1% score is a notable leap over the previous best.

  • Zoom AI: 48.1%
  • Google Gemini 3 Pro (with tools): 45.8%
  • GPT-5.2 / Claude Opus 4.5: ~30-40% range

While other emerging models like xAI's Grok-4 Heavy are also showing impressive results on separate leaderboards, Zoom's performance on HLE underscores the effectiveness of agentic workflows and federated architectures over raw model size alone.

Real-World Impact: Solving Tomorrow's Challenges Today

For the average Zoom user, this might seem like abstract academic progress, but the implications are practical and immediate. The same underlying technology that powers Zoom's success on HLE is being integrated into Zoom AI Companion.

Updates to the platform include:

  • More Accurate Summaries: Meeting notes will capture nuance and context with greater precision.
  • Action Item Extraction: The AI can better identify complex tasks and assign ownership.
  • Complex Workflow Automation: Agentic capabilities allow for handling multi-step business processes that require reasoning, not just execution.
  • Cross-Platform Retrieval: Enhanced ability to synthesize information from various data sources (chats, emails, docs); a rough sketch of this idea follows below.
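
As an illustration of the retrieval item above, the sketch below pools hits from several sources and keeps the most relevant ones. The `search_chats` and `search_docs` functions, the `Hit` fields, and the scores are hypothetical; this is not Zoom's API.

```python
from dataclasses import dataclass

# Hypothetical cross-platform retrieval: query several sources, pool the
# hits, and rank by relevance. All sources and scores here are stand-ins.

@dataclass
class Hit:
    source: str
    text: str
    score: float  # relevance, higher is better

def search_chats(query: str) -> list[Hit]:
    return [Hit("chat", f"chat thread mentioning '{query}'", 0.7)]

def search_docs(query: str) -> list[Hit]:
    return [Hit("docs", f"document section about '{query}'", 0.9)]

def retrieve(query: str, k: int = 3) -> list[Hit]:
    """Pool hits from every source and keep the top-k by relevance."""
    hits = search_chats(query) + search_docs(query)
    return sorted(hits, key=lambda h: h.score, reverse=True)[:k]

for hit in retrieve("Q3 roadmap"):
    print(f"[{hit.source}] {hit.text}")
```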

Conclusion

Zoom's record-breaking performance on Humanity's Last Exam serves as a reminder that innovation in AI is not solely the domain of foundational model labs like OpenAI or Google DeepMind. By applying a federated, agentic approach to existing strong models, Zoom has demonstrated that how you use AI is just as important as the model itself.

As we move into 2026, we can expect this trend of specialized, reasoning-focused architectures to dominate the next wave of AI development, bringing us closer to bridging the gap between AI and human expert performance.

Frequently Asked Questions

What is Humanity's Last Exam?
HLE is a rigorous AI benchmark consisting of nearly 3,000 graduate-level questions across 100+ academic disciplines, designed to test true reasoning rather than simple pattern matching.

How did Zoom achieve this score?
Zoom used a "federated AI" approach, combining multiple models with an "explore-verify-federate" agentic workflow to reason through complex problems.

What does this mean for Zoom users?
This breakthrough translates to more accurate meeting summaries, better action item extraction, and more capable agentic workflows within Zoom's platform.
