Qwen3-Max-Thinking: A New Era for Reasoning Models

Alibaba Cloud introduces Qwen3-Max-Thinking, a flagship reasoning model with adaptive tool-use and test-time scaling, rivaling GPT-5.2 and Claude Opus 4.5.

by HowAIWorks Team
Qwen · LLM · Reasoning Models · AI Research · Proprietary Models · NVIDIA · Alibaba Cloud


Introduction

Alibaba Cloud has unveiled Qwen3-Max-Thinking, its latest flagship reasoning model designed to push the boundaries of artificial intelligence. By scaling up model parameters and leveraging substantial computational resources for reinforcement learning, Qwen3-Max-Thinking achieves significant performance improvements across multiple dimensions, including factual knowledge, complex reasoning, instruction following, alignment with human preferences, and agent capabilities.

On 19 established benchmarks, Qwen3-Max-Thinking demonstrates performance comparable to leading models such as GPT-5.2-Thinking, Claude-Opus-4.5, and Gemini 3 Pro. This release marks a significant milestone in the competitive landscape of large language models, offering a powerful alternative for developers and researchers.

Key Innovations

Qwen3-Max-Thinking introduces two major advancements that set it apart:

1. Adaptive Tool-Use Capabilities

Unlike earlier approaches that required users to manually select tools before each task, Qwen3-Max-Thinking autonomously selects and leverages its built-in Search, Memory, and Code Interpreter capabilities during conversations.

This capability emerges from a focused training process: after initial fine-tuning for tool use, the model underwent further training on diverse tasks using both rule-based and model-based feedback. Empirically, the Search and Memory tools effectively mitigate hallucinations, provide access to real-time information, and enable more personalized responses. The Code Interpreter lets the model execute code snippets and apply computational reasoning to complex problems.
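To make the pattern concrete, here is a minimal, self-contained sketch of autonomous tool routing. Note the heavy caveat: the real model selects tools through a learned policy, not hand-written rules; the `pick_tool` heuristic and the stub tool functions below are illustrative assumptions, mirroring only the article's named capabilities (Search, Memory, Code Interpreter).

```python
# Illustrative sketch only: the real model chooses tools via a learned
# policy. These stubs just show the dispatch pattern.

def search(query: str) -> str:
    """Stub for a web-search tool (a real backend would go here)."""
    return f"search results for: {query}"

def memory(key: str) -> str:
    """Stub for a user-memory lookup tool."""
    return f"stored preference for: {key}"

def code_interpreter(snippet: str) -> str:
    """Stub for sandboxed code execution (toy sandbox, not production-safe)."""
    return str(eval(snippet, {"__builtins__": {}}))

TOOLS = {"search": search, "memory": memory, "code_interpreter": code_interpreter}

def pick_tool(task: str) -> str:
    """Toy stand-in for the model's learned tool-selection policy."""
    if any(ch.isdigit() for ch in task) and any(op in task for op in "+-*/"):
        return "code_interpreter"
    if task.startswith("remember"):
        return "memory"
    return "search"

def run(task: str) -> str:
    tool = pick_tool(task)          # model decides; no manual selection
    return TOOLS[tool](task)

print(run("17 * 23"))               # routed to code_interpreter
print(run("latest Qwen release"))   # routed to search
```

The point of the sketch is the control flow: tool choice happens inside the loop, per task, rather than being fixed by the user up front.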

2. Advanced Test-Time Scaling Strategies

Test-time scaling refers to techniques that allocate additional computation during inference to improve model performance. Qwen3-Max-Thinking employs an experience-cumulative, multi-round test-time scaling strategy for heavy reasoning tasks.

Instead of simply increasing parallel trajectories (which often yields redundant reasoning), the model limits parallelism and redirects saved computation to iterative self-reflection guided by a "take-experience" mechanism. This mechanism distills key insights from past rounds, allowing the model to avoid re-deriving known conclusions and focus on unresolved uncertainties. This approach consistently outperforms standard parallel sampling, achieving higher context efficiency.
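The loop structure described above can be sketched in a few lines. Everything model-specific is stubbed out: `attempt` and `distill_insight` are hypothetical stand-ins for a reasoning trajectory and the learned "take-experience" distillation step, and the confidence numbers are invented purely to show the early-stopping behavior.

```python
# Toy sketch of an experience-cumulative, multi-round loop.
# attempt() and distill_insight() are stand-ins, not the real mechanism.

def attempt(question: str, experience: list[str]) -> tuple[str, float]:
    """One reasoning trajectory; confidence grows as accumulated
    experience lets the model skip re-deriving known conclusions."""
    confidence = min(1.0, 0.4 + 0.3 * len(experience))
    return f"answer after {len(experience)} insights", confidence

def distill_insight(answer: str, round_idx: int) -> str:
    """Distill a key conclusion from the round to carry forward."""
    return f"round {round_idx}: keep conclusion from '{answer}'"

def solve(question: str, max_rounds: int = 4, threshold: float = 0.9):
    experience: list[str] = []       # insights carried across rounds
    for i in range(max_rounds):
        answer, conf = attempt(question, experience)
        if conf >= threshold:        # stop early once confident
            return answer, i + 1
        experience.append(distill_insight(answer, i))
    return answer, max_rounds

answer, rounds = solve("hard proof")
print(rounds)  # a few sequential rounds instead of many parallel samples
```

The contrast with parallel sampling is in the state: parallel trajectories are independent and often redundant, whereas here each round starts from the distilled experience of the previous ones.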

Benchmark Performance

The model shows impressive results across various domains, often surpassing or matching state-of-the-art models.

| Capability | Benchmark | GPT-5.2-Thinking | Claude-Opus-4.5 | Gemini 3 Pro | Qwen3-Max-Thinking |
| --- | --- | --- | --- | --- | --- |
| Knowledge | MMLU-Pro | 87.4 | 89.5 | 89.8 | 85.7 |
| Knowledge | MMLU-Redux | 95.0 | 95.6 | 95.9 | 92.8 |
| STEM | GPQA | 92.4 | 87.0 | 91.9 | 87.4 |
| Reasoning | LiveCodeBench v6 | 87.7 | 84.8 | 90.7 | 85.9 |
| Reasoning | HMMT Feb 25 | 99.4 | - | 97.5 | 98.0 |
| Agentic Coding | SWE Verified | 80.0 | 80.9 | 76.2 | 75.3 |
| Agentic Search | HLE (w/ tools) | 45.5 | 43.2 | 45.8 | 49.8 |
| Tool Use | Tau² Bench | 80.9 | 85.7 | 85.4 | 82.1 |

Note: Selected benchmarks from the official report. Scaling strategies (like the "take-experience" mechanism) further boost scores on key reasoning benchmarks like GPQA and HLE.

Develop with Qwen3-Max-Thinking

Qwen3-Max-Thinking is available via the Qwen Chat interface and API. The API is OpenAI-compatible, making it easy to integrate into existing workflows.

Python Example

Here is how you can use Qwen3-Max-Thinking with the OpenAI Python client:

from openai import OpenAI
import os

# The endpoint is OpenAI-compatible, so the standard client works;
# only the base_url and API key change.
client = OpenAI(
    api_key=os.getenv("API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-max-2026-01-23",
    messages=[
      {'role': 'user', 'content': 'Explain the concept of test-time scaling.'}
    ],
    # enable_thinking turns on the model's reasoning mode
    extra_body={"enable_thinking": True}
)

print(completion.choices[0].message.content)

The model is also compatible with the Anthropic API protocol, allowing seamless integration with tools like Claude Code.

Conclusion

Qwen3-Max-Thinking represents a significant leap forward for Alibaba Cloud's proprietary models, challenging the performance dominance of other closed-source giants. Its focus on adaptive tool use and efficient test-time scaling offers a glimpse into the future of reasoning models—where AI not only generates text but actively thinks, plans, and utilizes tools to solve complex problems.

For developers and researchers, Qwen3-Max-Thinking provides a robust new option for building advanced AI applications, particularly those requiring strong reasoning and agentic capabilities.

