Qwen3-TTS Open-Sourced: Voice Design and Cloning

Alibaba open-sources Qwen3-TTS family with voice design, cloning, and ultra-high-quality speech generation across 10 languages.

by HowAIWorks Team
Qwen · Alibaba · Text-to-Speech · TTS · Voice Cloning · Speech Synthesis · Open Source · AI Models · Multilingual · Voice Design · Speech Generation · AI Audio

Introduction

Alibaba's Qwen team has open-sourced the Qwen3-TTS family, a comprehensive series of text-to-speech models that offer voice design, voice cloning, ultra-high-quality human-like speech generation, and natural language-based voice control. This release represents a significant advance in open-source speech synthesis, giving developers and users one of the most extensive sets of speech generation features available in an open release.

The Qwen3-TTS family is powered by the innovative Qwen3-TTS-Tokenizer-12Hz multi-codebook speech encoder, which achieves efficient compression and robust representation of speech signals. This technology fully preserves paralinguistic information and acoustic environmental features while enabling high-speed, high-fidelity speech reconstruction through a lightweight non-DiT architecture.

Using Dual-Track modeling, Qwen3-TTS achieves extremely fast bidirectional streaming generation: the first audio packet can be delivered after processing just a single character. This makes the model suitable for real-time interactive applications where low latency is critical.
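
To make the streaming flow concrete, here is a minimal consumption sketch. It is an illustration, not the actual Qwen3-TTS API: the `session` object and its `feed_text`, `audio_chunks`, and `end_of_text` methods are hypothetical stand-ins for whatever streaming interface the official release exposes.

```python
# Hypothetical sketch: feed text incrementally and write audio chunks as they
# arrive. The session object and its methods are illustrative assumptions,
# not the real Qwen3-TTS interface.
import wave

def synthesize_streaming(session, text, out_path="out.wav", sample_rate=24_000):
    with wave.open(out_path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit PCM
        wav.setframerate(sample_rate)
        for char in text:
            session.feed_text(char)               # text streams in...
            for chunk in session.audio_chunks():  # ...audio streams out; the first
                wav.writeframes(chunk)            # packet can arrive after one char
        session.end_of_text()
        for chunk in session.audio_chunks():      # flush the remaining audio
            wav.writeframes(chunk)
```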

Model Architecture and Technology

Qwen3-TTS-Tokenizer-12Hz

The foundation of Qwen3-TTS is the Qwen3-TTS-Tokenizer-12Hz multi-codebook speech encoder, which represents a significant innovation in speech representation (a back-of-the-envelope sketch follows the list):

  • Efficient compression: Achieves efficient acoustic compression and high-dimensional semantic modeling of speech signals
  • Information preservation: Fully preserves paralinguistic information and acoustic environmental features
  • High-fidelity reconstruction: Enables high-speed, high-fidelity speech reconstruction via a lightweight non-DiT architecture
  • SOTA performance: Achieves state-of-the-art performance across key metrics including PESQ (3.21 wideband, 3.68 narrowband), STOI (0.96), and UTMOS (4.16)
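
To make the 12 Hz multi-codebook idea concrete, the sketch below works out the token grid for a 10-second clip. The sample rate and the number and size of codebooks are illustrative assumptions; the released tokenizer's exact configuration may differ.

```python
# Back-of-the-envelope view of 12 Hz multi-codebook tokenization.
# SAMPLE_RATE and K_CODEBOOKS are assumptions for illustration only.
import numpy as np

SAMPLE_RATE = 24_000   # assumed waveform sample rate (Hz)
FRAME_RATE = 12        # token frame rate (Hz), per the tokenizer's name
K_CODEBOOKS = 8        # assumed number of codebooks per frame

duration_s = 10.0
n_samples = int(SAMPLE_RATE * duration_s)   # 240,000 raw samples
n_frames = int(FRAME_RATE * duration_s)     # 120 token frames

# Each frame holds K codebook indices, so the clip becomes a (120, 8) grid.
tokens = np.random.randint(0, 1024, size=(n_frames, K_CODEBOOKS))

print(f"raw samples: {n_samples}, token grid: {tokens.shape}")
print(f"each token frame summarizes {n_samples // n_frames} samples")  # 2,000
```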

Universal End-to-End Architecture

Qwen3-TTS uses a discrete multi-codebook language-model (LM) architecture that enables full-information, end-to-end speech modeling:

  • Eliminates bottlenecks: Completely bypasses information bottlenecks and cascading errors inherent in traditional LM+DiT schemes
  • Enhanced versatility: Significantly enhances the model's versatility, generation efficiency, and performance ceiling
  • Unified approach: Provides a unified architecture for all speech generation tasks

Dual-Track Streaming Generation

The Dual-Track hybrid streaming generation architecture enables:

  • Single model support: One model supports both streaming and non-streaming generation
  • Ultra-low latency: First audio packet output immediately after a single character input
  • End-to-end latency: As low as 97ms end-to-end synthesis latency
  • Real-time capability: Meets rigorous demands of real-time interactive scenarios

Model Variants and Capabilities

1.7B Models

The 1.7B parameter models deliver peak performance and powerful control capabilities:

Qwen3-TTS-12Hz-1.7B-VoiceDesign

  • Voice design: Generates new voices from user-provided descriptions (a usage sketch follows this list)
  • Natural language control: Supports free-form descriptions of acoustic attributes, persona, and background
  • Timbre reuse: Allows created timbres to be stored persistently and reused in later calls
  • Multi-character dialogues: Generates vivid and natural multi-turn, multi-character long-form dialogues
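
A rough usage sketch for voice design follows. The `design_voice` and `generate` calls are hypothetical placeholders for illustration; the actual loading and inference code lives in the official repository.

```python
# Hypothetical voice-design sketch; the method names are assumptions,
# not the actual Qwen3-TTS API.
DESCRIPTION = (
    "A warm, middle-aged female narrator with a calm pace, "
    "recorded in a quiet studio."
)

def design_and_speak(model, text):
    timbre = model.design_voice(DESCRIPTION)   # assumed: returns a reusable timbre
    audio = model.generate(text, timbre=timbre)
    return timbre, audio                       # store `timbre` to reuse it later
```

Persisting the returned timbre object (or its identifier) is what enables the timbre-reuse feature: later generations can invoke the same identity, which is also how consistent multi-character dialogue is assembled.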

Qwen3-TTS-12Hz-1.7B-CustomVoice

  • Style control: Provides style control over target timbres via user instructions (see the sketch after this list)
  • Premium timbres: Supports 9 premium timbres covering various combinations of gender, age, language, and dialect
  • Single-speaker multilingual: Maintains timbre while providing precise style control across languages
  • Instruction following: Achieves 75.4% score on InstructTTS-Eval
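
As a sketch of instruction-based style control (the speaker id and `generate` keyword arguments are assumptions for illustration, not the released API):

```python
# Hypothetical style-control sketch for the CustomVoice variant.
def speak_with_style(model, text):
    return model.generate(
        text,
        speaker="Serena",          # one of the 9 premium timbres
        instruction="Whisper urgently, as if sharing a secret.",
    )
```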

Qwen3-TTS-12Hz-1.7B-Base

  • Rapid voice cloning: Clones a voice from as little as 3 seconds of user-provided audio (see the sketch after this list)
  • Fine-tuning support: Serves as a base checkpoint for fine-tuning (FT)
  • Cross-lingual cloning: Supports cross-lingual voice clone capabilities reaching SOTA performance
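
A minimal cloning sketch, again with hypothetical method names standing in for the real inference code:

```python
# Hypothetical rapid voice-cloning sketch; clone_voice() and generate() are
# assumptions for illustration, not the released API.
def clone_and_speak(model, reference_wav, text):
    voice = model.clone_voice(reference_wav)   # ~3 seconds of audio suffices
    return model.generate(text, timbre=voice)
```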

0.6B Models

The 0.6B parameter models offer an ideal balance between performance and efficiency:

Qwen3-TTS-12Hz-0.6B-CustomVoice

  • Premium timbres: Supports 9 premium timbres covering various combinations of gender, age, language, and dialect
  • Efficient performance: Optimized for scenarios requiring lower computational resources

Qwen3-TTS-12Hz-0.6B-Base

  • Rapid voice cloning: Clones a voice from as little as 3 seconds of user-provided audio
  • Fine-tuning support: Serves as a base checkpoint for fine-tuning
  • Efficient deployment: Suitable for resource-constrained environments

Key Features

Powerful Speech Representation

The Qwen3-TTS-Tokenizer-12Hz achieves:

  • Efficient acoustic compression: Compresses speech signals while modeling their high-dimensional semantics
  • Information preservation: Full preservation of paralinguistic information and acoustic environmental features
  • High-fidelity reconstruction: High-speed, high-fidelity speech reconstruction through lightweight architecture
  • Superior quality: SOTA performance in speaker similarity (0.95), indicating near-lossless speaker information preservation

Intelligent Text Understanding and Voice Control

Qwen3-TTS supports speech generation driven by natural language instructions:

  • Flexible control: Supports control over multi-dimensional acoustic attributes (timbre, emotion, prosody)
  • Semantic integration: Deeply integrates text semantic understanding
  • Adaptive expression: Adaptively adjusts tone, rhythm, and emotional expression based on instructions
  • Natural output: Achieves lifelike "what you imagine is what you hear" output
  • Text robustness: Significantly improved robustness to input text noise

Language and Dialect Support

Qwen3-TTS provides comprehensive language support:

  • 10 mainstream languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian
  • Dialect support: Various dialects including Beijing Dialect, Sichuan Dialect, and others
  • Multilingual capabilities: Single-speaker multilingual generalization with average Word Error Rate (WER) of 2.34%

Performance Benchmarks

Voice Design Performance

In voice design tasks, Qwen3-TTS-VoiceDesign demonstrates:

  • Superior performance: Outperforms the closed-source MiniMax-Voice-Design model in both instruction following and generative expressiveness on the InstructTTS-Eval benchmark
  • Open-source leadership: Significantly leads other open-source models in voice design capabilities

Voice Control Performance

Qwen3-TTS-Instruct shows exceptional capabilities:

  • Multilingual generalization: Single-speaker multilingual generalization with average WER of 2.34%
  • Style control: Maintains timbre while providing precise style control, achieving 75.4% score on InstructTTS-Eval
  • Long-form generation: Exceptional long-form speech generation capabilities with WER of 2.36% (Chinese) and 2.81% (English) during continuous 10-minute synthesis

Voice Clone Performance

Qwen3-TTS-VoiceClone achieves state-of-the-art results:

  • Speech stability: Surpasses MiniMax and SeedTTS in speech stability for both Chinese and English cloning on Seed-tts-eval
  • Multilingual performance: Average WER of 1.835% and speaker similarity of 0.789 on a 10-language multilingual TTS test set, outperforming MiniMax and ElevenLabs
  • Cross-lingual cloning: Cross-lingual voice cloning reaches SOTA performance, surpassing CosyVoice3

Tokenizer Performance

The Qwen3-TTS-Tokenizer-12Hz demonstrates superior reconstruction quality (a sketch of how such metrics are computed follows the list):

  • PESQ scores: 3.21 (wideband) and 3.68 (narrowband), significantly leading similar tokenizers
  • STOI score: 0.96, demonstrating superior reconstruction quality
  • UTMOS score: 4.16, indicating high perceptual quality
  • Speaker similarity: 0.95, significantly surpassing comparison models, indicating near-lossless speaker information preservation
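
For readers who want to compute comparable reconstruction metrics on their own audio, PESQ and STOI have standard open-source implementations in the `pesq` and `pystoi` packages. The sketch below is a generic illustration, not the Qwen team's evaluation script; UTMOS additionally requires a pretrained neural MOS predictor and is omitted here.

```python
# Generic reconstruction-quality check using the pesq and pystoi packages
# (pip install pesq pystoi). Both inputs are 1-D float arrays at 16 kHz.
import numpy as np
from pesq import pesq
from pystoi import stoi

def reconstruction_metrics(reference: np.ndarray, reconstructed: np.ndarray):
    wb = pesq(16_000, reference, reconstructed, "wb")  # wideband PESQ at 16 kHz
    # Narrowband PESQ ("nb") expects 8 kHz audio; resample first to use it.
    st = stoi(reference, reconstructed, 16_000, extended=False)
    return {"pesq_wb": wb, "stoi": st}
```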

Use Cases and Applications

Voice Design Applications

Qwen3-TTS enables users to create custom voices through natural language:

  • Character voices: Generate voices for characters in games, animations, and interactive media
  • Brand voices: Create consistent brand voices for marketing and advertising
  • Accessibility: Develop voices for accessibility applications
  • Content creation: Produce diverse voices for podcasts, audiobooks, and multimedia content

Voice Cloning Applications

The rapid 3-second voice cloning capability enables:

  • Personalization: Create personalized voice assistants and avatars
  • Content localization: Clone voices for multilingual content production
  • Accessibility: Preserve voices for individuals with speech impairments
  • Entertainment: Enable voice cloning for entertainment and creative projects

Real-Time Interactive Applications

The ultra-low latency streaming generation supports:

  • Live conversations: Real-time voice interactions in chatbots and virtual assistants
  • Interactive media: Live voice generation for games and interactive experiences
  • Telecommunications: Low-latency voice synthesis for communication systems
  • Accessibility tools: Real-time text-to-speech for assistive technologies

Premium Timbres

Qwen3-TTS includes 9 premium timbres covering various combinations:

  • 苏瑶 Serena: Chinese female voice
  • 福伯 Uncle Fu: Chinese male voice
  • 十三 Vivian: Chinese female voice
  • 艾登 Aiden: English male voice
  • 甜茶 Ryan: English male voice
  • 小野杏 Ono Anna: Japanese female voice
  • 素熙 Sohee: Korean female voice
  • 晓东 Dylan: Chinese Dialect (Beijing Dialect) male voice
  • 程川 Eric: Chinese Dialect (Sichuan Dialect) male voice

These timbres provide diverse options for different use cases, languages, and cultural contexts.

Availability and Access

Open Source Release

The entire Qwen3-TTS multi-codebook model series is now open-sourced:

  • GitHub: Available on GitHub
  • HuggingFace: Available on HuggingFace
  • ModelScope: Available on ModelScope for Chinese users
  • Demos: Interactive demos available on HuggingFace and ModelScope

API Access

Qwen3-TTS is accessible via the Qwen API on Alibaba Cloud (a request sketch follows the list), providing:

  • Enterprise integration: Easy integration for enterprise applications
  • Scalable deployment: Cloud-based deployment for scalable applications
  • Managed service: Fully managed service with high availability
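
A hedged request sketch follows. The endpoint URL, payload shape, and model identifier are placeholders, not the documented contract; consult the official Alibaba Cloud (DashScope) API reference before use.

```python
# Placeholder HTTP sketch for a cloud TTS call; the URL, payload fields, and
# model name are assumptions for illustration only.
import os
import requests

API_URL = "https://dashscope.aliyuncs.com/api/v1/..."  # placeholder endpoint

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {os.environ['DASHSCOPE_API_KEY']}"},
    json={
        "model": "qwen3-tts",    # assumed model identifier
        "input": {"text": "Hello from Qwen3-TTS!"},
    },
    timeout=30,
)
resp.raise_for_status()
```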

Documentation and Resources

Comprehensive resources are available:

  • Research paper: Technical paper available on GitHub
  • Documentation: Detailed documentation for developers
  • Examples: Extensive examples and use cases
  • Community support: Active community support and contributions

Technical Innovations

Multi-Codebook Architecture

The multi-codebook approach enables:

  • Rich representation: Captures complex acoustic features and variations
  • Efficient encoding: Efficient compression while maintaining quality
  • Flexible control: Enables fine-grained control over speech attributes

Non-DiT Architecture

The lightweight non-DiT architecture provides:

  • Speed advantages: Faster generation compared to diffusion-based approaches
  • Quality maintenance: Maintains high fidelity without diffusion overhead
  • Resource efficiency: Lower computational requirements

Contextual Understanding

Strong contextual understanding allows:

  • Semantic adaptation: Adapts tone, rhythm, and emotion based on text semantics
  • Natural expression: Produces more natural and contextually appropriate speech
  • Robustness: Improved handling of complex and noisy input text

Comparison with Other TTS Systems

Qwen3-TTS demonstrates competitive advantages:

  • Versus closed-source models: Outperforms MiniMax-Voice-Design in voice design tasks
  • Versus open-source models: Significantly leads other open-source models across multiple benchmarks
  • Versus commercial solutions: Competitive performance with ElevenLabs and other commercial TTS systems
  • Cross-lingual capabilities: SOTA performance in cross-lingual voice cloning, surpassing CosyVoice3

Conclusion

The open-sourcing of the Qwen3-TTS family represents a significant milestone in text-to-speech technology, providing developers and users with state-of-the-art speech generation capabilities. The combination of voice design, voice cloning, ultra-high-quality generation, and natural language control makes Qwen3-TTS one of the most comprehensive open-source TTS solutions available.

The model's exceptional performance across multiple benchmarks, support for 10 languages, ultra-low latency streaming generation, and comprehensive feature set position it as a leading solution for various applications—from real-time interactive systems to content creation and accessibility tools.

With both 1.7B and 0.6B parameter models available, developers can choose the appropriate balance between performance and efficiency for their specific use cases. The open-source nature of the project, combined with comprehensive documentation and community support, makes Qwen3-TTS accessible to a wide range of developers and researchers.

As artificial intelligence continues to evolve, open-source contributions like Qwen3-TTS play a crucial role in democratizing advanced AI capabilities and enabling innovation across industries. The release of this comprehensive TTS family demonstrates Alibaba's commitment to advancing open-source AI technology and providing valuable tools for the global AI community.

Explore more about Qwen models and text-to-speech technology in our models catalog and glossary.

Frequently Asked Questions

What is Qwen3-TTS?
Qwen3-TTS is a series of powerful speech generation models that support voice design, voice cloning, ultra-high-quality human-like speech generation, and natural language-based voice control. It uses a novel multi-codebook speech encoder and Dual-Track modeling for extremely low-latency streaming generation.

Which languages does Qwen3-TTS support?
Qwen3-TTS supports 10 mainstream languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian, along with various dialects to meet global application demands.

What model sizes are available?
The Qwen3-TTS family includes two model sizes: 1.7B parameters for peak performance and powerful control capabilities, and 0.6B parameters for an ideal balance between performance and efficiency.

How low is the latency?
Qwen3-TTS achieves extremely low-latency streaming generation, with the first audio packet delivered after processing just a single character. End-to-end synthesis latency can be as low as 97ms, meeting the demands of real-time interactive scenarios.

What is voice design?
Voice design allows users to generate customized timbre identities through natural language descriptions. Users can input acoustic attributes, persona descriptions, background information, and other free-form descriptions to create the desired voice identity.

Is Qwen3-TTS open source?
Yes, the entire Qwen3-TTS multi-codebook model series is now open-sourced on GitHub and accessible via the Qwen API on Alibaba Cloud.
