Qwen3-ASR: SOTA Multilingual Speech Recognition and Forced Alignment

Alibaba's Qwen team releases Qwen3-ASR and Qwen3-ForcedAligner, setting new benchmarks in multilingual speech-to-text and precise timestamping.

by HowAIWorks Team
Tags: Qwen3-ASR, Speech Recognition, Open Source AI, Multilingual AI, Forced Alignment, Alibaba Qwen, ASR Benchmarks, Audio Intelligence

Introduction

Alibaba’s Qwen team has officially unveiled Qwen3-ASR, a significant leap forward in open-source speech-to-text technology. This release includes the Qwen3-ASR-1.7B and 0.6B models, alongside the Qwen3-ForcedAligner, providing a comprehensive toolkit for everything from real-time transcription to hyper-precise audio alignment.

Built on the foundations of the Qwen3-Omni architecture, these models aren't just incremental updates. They represent a shift toward "all-in-one" speech intelligence, capable of handling 52 languages and dialects, identifying speakers, and maintaining high accuracy even in the presence of noise or background music.

Key Features of the Qwen3-ASR Family

Qwen3-ASR is designed to be a versatile powerhouse for modern AI applications. Its architecture focuses on several core pillars:

  • All-in-One Multilingual Support: A single model handles 30 major languages and 22 regional Chinese dialects, eliminating the need to route audio through a separate language-identification step or per-language models.
  • Robustness in Chaos: Whether it's heavy background noise, low speech quality, or complex linguistic patterns like heavy accents and singing, Qwen3-ASR maintains a low Word Error Rate (WER).
  • Extreme Efficiency: The 0.6B variant is built for scale, capable of transcribing 2,000 seconds of speech in just one second at high concurrency, with sub-100ms time-to-first-token.
  • Flexible Inference Options: Native support for vLLM-based batch inference, asynchronous serving, and streaming modes makes it ready for both production backends and edge devices.
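The Word Error Rate (WER) cited above is the standard ASR accuracy metric: the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the model's hypothesis, divided by the number of reference words. A minimal self-contained sketch:

```python
# Word Error Rate (WER): word-level edit distance between reference and
# hypothesis, divided by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))
# 1 substitution out of 6 reference words ≈ 0.167
```

Production evaluations typically also normalize casing and punctuation before scoring, which this sketch omits for brevity.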

Performance Benchmarks

In technical evaluations, the 1.7B model has established itself as the new state-of-the-art (SOTA) among open-source speech models.

Multilingual Recognition (WER %)

When compared against industry heavyweights like Whisper-large-v3, Qwen3-ASR-1.7B shows a clear advantage across various datasets:

Benchmark        Qwen3-ASR-1.7B   Whisper-large-v3   GLM-ASR-Nano
MLS (Average)    8.55             8.62               13.32
CommonVoice      9.18             10.77              19.40
Fleurs-12        4.90             5.27               16.08

Beyond raw numbers, Qwen3-ASR excels in English accented speech (covering accents from 16 countries) and Chinese regional dialects, where it significantly narrows the gap between standard Mandarin and local variants like Cantonese or Fujianese.

Precision Timing with Qwen3-ForcedAligner

For developers working on subtitles, karaoke apps, or video editing tools, the Qwen3-ForcedAligner-0.6B is perhaps the most exciting part of this release. Unlike autoregressive models that can struggle with long-form audio, this non-autoregressive (NAR) model provides rock-solid timestamp accuracy.

Evaluation results show that it outperforms established tools like WhisperX and NeMo Forced Aligner (NFA), particularly on long, concatenated audio files up to 5 minutes in length, mapping each word to millisecond-level timestamps.

Conclusion

The release of Qwen3-ASR and the Forced Aligner marks a new era for open-weights speech technology. By combining state-of-the-art accuracy with production-ready efficiency and a massive multilingual reach, Alibaba is providing the building blocks for the next generation of voice-enabled agents and accessibility tools.

Whether you are building a real-time translation service or a high-precision subtitle generator, the Qwen3-ASR family offers a level of performance that was previously only available via expensive, commercial cloud APIs.

Frequently Asked Questions

What is Qwen3-ASR?
Qwen3-ASR is a family of all-in-one multilingual speech recognition models (1.7B and 0.6B) developed by Alibaba's Qwen team, supporting 52 languages and dialects.

How does the 1.7B model perform against other models?
The 1.7B model achieves state-of-the-art performance, consistently outperforming Whisper-large-v3 and even commercial APIs like GPT-4o in multilingual and accented benchmarks.

What is Qwen3-ForcedAligner?
It is a non-autoregressive (NAR) model designed for high-precision speech-to-text timestamping, outperforming existing tools like WhisperX.

Do the models support real-time streaming?
Yes, both the 1.7B and 0.6B models support streaming mode for real-time applications, with the 0.6B model optimized for ultra-low latency.

Can Qwen3-ASR transcribe singing?
Yes, Qwen3-ASR features robust performance in transcribing songs even with heavy background music, achieving low Word Error Rates (WER).
