Qwen3-TTS Open-Sourced: Voice Design and Cloning

Alibaba open-sources Qwen3-TTS family with voice design, cloning, and ultra-high-quality speech generation across 10 languages.

by HowAIWorks Team
Qwen · Alibaba · Text-to-Speech · TTS · Voice Cloning · Speech Synthesis · Open Source · AI Models · Multilingual · Voice Design · Speech Generation · AI Audio

Introduction

Alibaba's Qwen team has open-sourced the Qwen3-TTS family, a comprehensive series of text-to-speech models that offer voice design, voice cloning, ultra-high-quality human-like speech generation, and natural language-based voice control. This release represents a significant advance in open-source speech synthesis, giving developers and users one of the most extensive sets of speech generation features available in an open release.

The Qwen3-TTS family is powered by the innovative Qwen3-TTS-Tokenizer-12Hz multi-codebook speech encoder, which achieves efficient compression and robust representation of speech signals. This technology fully preserves paralinguistic information and acoustic environmental features while enabling high-speed, high-fidelity speech reconstruction through a lightweight non-DiT architecture.

Using Dual-Track modeling, Qwen3-TTS achieves extremely fast bidirectional streaming generation: the first audio packet can be delivered after processing just a single character. This makes the model suitable for real-time interactive applications where low latency is critical.
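
To make the streaming flow concrete, here is a minimal consumption sketch. It is an illustration, not the actual Qwen3-TTS API: the `session` object and its `feed_text`, `audio_chunks`, and `end_of_text` methods are hypothetical stand-ins for whatever streaming interface the official release exposes.

```python
# Hypothetical sketch: feed text incrementally and write audio chunks as they
# arrive. The session object and its methods are illustrative assumptions,
# not the real Qwen3-TTS interface.
import wave

def synthesize_streaming(session, text, out_path="out.wav", sample_rate=24_000):
    with wave.open(out_path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit PCM
        wav.setframerate(sample_rate)
        for char in text:
            session.feed_text(char)               # text streams in...
            for chunk in session.audio_chunks():  # ...audio streams out; the first
                wav.writeframes(chunk)            # packet can arrive after one char
        session.end_of_text()
        for chunk in session.audio_chunks():      # flush the remaining audio
            wav.writeframes(chunk)
```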

Model Architecture and Technology

Qwen3-TTS-Tokenizer-12Hz

The foundation of Qwen3-TTS is the Qwen3-TTS-Tokenizer-12Hz multi-codebook speech encoder, which represents a significant innovation in speech representation (a back-of-the-envelope sketch follows the list):

  • Efficient compression: Achieves efficient acoustic compression and high-dimensional semantic modeling of speech signals
  • Information preservation: Fully preserves paralinguistic information and acoustic environmental features
  • High-fidelity reconstruction: Enables high-speed, high-fidelity speech reconstruction via a lightweight non-DiT architecture
  • SOTA performance: Achieves state-of-the-art performance across key metrics including PESQ (3.21 wideband, 3.68 narrowband), STOI (0.96), and UTMOS (4.16)
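
To make the 12 Hz multi-codebook idea concrete, the sketch below works out the token grid for a 10-second clip. The sample rate and the number and size of codebooks are illustrative assumptions; the released tokenizer's exact configuration may differ.

```python
# Back-of-the-envelope view of 12 Hz multi-codebook tokenization.
# SAMPLE_RATE and K_CODEBOOKS are assumptions for illustration only.
import numpy as np

SAMPLE_RATE = 24_000   # assumed waveform sample rate (Hz)
FRAME_RATE = 12        # token frame rate (Hz), per the tokenizer's name
K_CODEBOOKS = 8        # assumed number of codebooks per frame

duration_s = 10.0
n_samples = int(SAMPLE_RATE * duration_s)   # 240,000 raw samples
n_frames = int(FRAME_RATE * duration_s)     # 120 token frames

# Each frame holds K codebook indices, so the clip becomes a (120, 8) grid.
tokens = np.random.randint(0, 1024, size=(n_frames, K_CODEBOOKS))

print(f"raw samples: {n_samples}, token grid: {tokens.shape}")
print(f"each token frame summarizes {n_samples // n_frames} samples")  # 2,000
```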

Universal End-to-End Architecture

Qwen3-TTS uses a discrete multi-codebook language-model (LM) architecture that enables full-information, end-to-end speech modeling:

  • Eliminates bottlenecks: Completely bypasses information bottlenecks and cascading errors inherent in traditional LM+DiT schemes
  • Enhanced versatility: Significantly enhances the model's versatility, generation efficiency, and performance ceiling
  • Unified approach: Provides a unified architecture for all speech generation tasks

Dual-Track Streaming Generation

The Dual-Track hybrid streaming generation architecture enables:

  • Single model support: One model supports both streaming and non-streaming generation
  • Ultra-low latency: First audio packet output immediately after a single character input
  • End-to-end latency: As low as 97ms end-to-end synthesis latency
  • Real-time capability: Meets rigorous demands of real-time interactive scenarios

Model Variants and Capabilities

1.7B Models

The 1.7B parameter models deliver peak performance and powerful control capabilities:

Qwen3-TTS-12Hz-1.7B-VoiceDesign

  • Voice design: Generates new voices from user-provided descriptions (a usage sketch follows this list)
  • Natural language control: Supports free-form descriptions of acoustic attributes, persona, and background
  • Timbre reuse: Allows created timbres to be stored persistently and reused in later calls
  • Multi-character dialogues: Generates vivid and natural multi-turn, multi-character long-form dialogues
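
A rough usage sketch for voice design follows. The `design_voice` and `generate` calls are hypothetical placeholders for illustration; the actual loading and inference code lives in the official repository.

```python
# Hypothetical voice-design sketch; the method names are assumptions,
# not the actual Qwen3-TTS API.
DESCRIPTION = (
    "A warm, middle-aged female narrator with a calm pace, "
    "recorded in a quiet studio."
)

def design_and_speak(model, text):
    timbre = model.design_voice(DESCRIPTION)   # assumed: returns a reusable timbre
    audio = model.generate(text, timbre=timbre)
    return timbre, audio                       # store `timbre` to reuse it later
```

Persisting the returned timbre object (or its identifier) is what enables the timbre-reuse feature: later generations can invoke the same identity, which is also how consistent multi-character dialogue is assembled.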

Qwen3-TTS-12Hz-1.7B-CustomVoice

  • Style control: Provides style control over target timbres via user instructions (see the sketch after this list)
  • Premium timbres: Supports 9 premium timbres covering various combinations of gender, age, language, and dialect
  • Single-speaker multilingual: Maintains timbre while providing precise style control across languages
  • Instruction following: Achieves 75.4% score on InstructTTS-Eval
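
As a sketch of instruction-based style control (the speaker id and `generate` keyword arguments are assumptions for illustration, not the released API):

```python
# Hypothetical style-control sketch for the CustomVoice variant.
def speak_with_style(model, text):
    return model.generate(
        text,
        speaker="Serena",          # one of the 9 premium timbres
        instruction="Whisper urgently, as if sharing a secret.",
    )
```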

Qwen3-TTS-12Hz-1.7B-Base

  • Rapid voice cloning: Clones a voice from as little as 3 seconds of user-provided audio (see the sketch after this list)
  • Fine-tuning support: Serves as a base checkpoint for fine-tuning (FT)
  • Cross-lingual cloning: Supports cross-lingual voice clone capabilities reaching SOTA performance
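
A minimal cloning sketch, again with hypothetical method names standing in for the real inference code:

```python
# Hypothetical rapid voice-cloning sketch; clone_voice() and generate() are
# assumptions for illustration, not the released API.
def clone_and_speak(model, reference_wav, text):
    voice = model.clone_voice(reference_wav)   # ~3 seconds of audio suffices
    return model.generate(text, timbre=voice)
```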

0.6B Models

The 0.6B parameter models offer an ideal balance between performance and efficiency:

Qwen3-TTS-12Hz-0.6B-CustomVoice

  • Premium timbres: Supports 9 premium timbres covering various combinations of gender, age, language, and dialect
  • Efficient performance: Optimized for scenarios requiring lower computational resources

Qwen3-TTS-12Hz-0.6B-Base

  • Rapid voice cloning: Clones a voice from as little as 3 seconds of user-provided audio
  • Fine-tuning support: Serves as a base checkpoint for fine-tuning
  • Efficient deployment: Suitable for resource-constrained environments

Key Features

Powerful Speech Representation

The Qwen3-TTS-Tokenizer-12Hz achieves:

  • Efficient acoustic compression: Compresses speech signals while modeling their high-dimensional semantics
  • Information preservation: Full preservation of paralinguistic information and acoustic environmental features
  • High-fidelity reconstruction: High-speed, high-fidelity speech reconstruction through lightweight architecture
  • Superior quality: SOTA performance in speaker similarity (0.95), indicating near-lossless speaker information preservation

Intelligent Text Understanding and Voice Control

Qwen3-TTS supports speech generation driven by natural language instructions:

  • Flexible control: Supports control over multi-dimensional acoustic attributes (timbre, emotion, prosody)
  • Semantic integration: Deeply integrates text semantic understanding
  • Adaptive expression: Adaptively adjusts tone, rhythm, and emotional expression based on instructions
  • Natural output: Achieves lifelike "what you imagine is what you hear" output
  • Text robustness: Significantly improved robustness to input text noise

Language and Dialect Support

Qwen3-TTS provides comprehensive language support:

  • 10 mainstream languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian
  • Dialect support: Various dialects including Beijing Dialect, Sichuan Dialect, and others
  • Multilingual capabilities: Single-speaker multilingual generalization with average Word Error Rate (WER) of 2.34%

Performance Benchmarks

Voice Design Performance

In voice design tasks, Qwen3-TTS-VoiceDesign demonstrates:

  • Superior performance: Outperforms the closed-source MiniMax-Voice-Design model in both instruction following and generative expressiveness on the InstructTTS-Eval benchmark
  • Open-source leadership: Significantly leads other open-source models in voice design capabilities

Voice Control Performance

Qwen3-TTS-Instruct shows exceptional capabilities:

  • Multilingual generalization: Single-speaker multilingual generalization with average WER of 2.34%
  • Style control: Maintains timbre while providing precise style control, achieving 75.4% score on InstructTTS-Eval
  • Long-form generation: Exceptional long-form speech generation capabilities with WER of 2.36% (Chinese) and 2.81% (English) during continuous 10-minute synthesis

Voice Clone Performance

Qwen3-TTS-VoiceClone achieves state-of-the-art results:

  • Speech stability: Surpasses MiniMax and SeedTTS in speech stability for both Chinese and English cloning on Seed-tts-eval
  • Multilingual performance: Average WER of 1.835% and speaker similarity of 0.789 on a 10-language multilingual TTS test set, outperforming MiniMax and ElevenLabs
  • Cross-lingual cloning: Cross-lingual voice cloning reaches SOTA performance, surpassing CosyVoice3

Tokenizer Performance

The Qwen3-TTS-Tokenizer-12Hz demonstrates superior reconstruction quality (a sketch of how such metrics are computed follows the list):

  • PESQ scores: 3.21 (wideband) and 3.68 (narrowband), significantly leading similar tokenizers
  • STOI score: 0.96, demonstrating superior reconstruction quality
  • UTMOS score: 4.16, indicating high perceptual quality
  • Speaker similarity: 0.95, significantly surpassing comparison models, indicating near-lossless speaker information preservation
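
For readers who want to compute comparable reconstruction metrics on their own audio, PESQ and STOI have standard open-source implementations in the `pesq` and `pystoi` packages. The sketch below is a generic illustration, not the Qwen team's evaluation script; UTMOS additionally requires a pretrained neural MOS predictor and is omitted here.

```python
# Generic reconstruction-quality check using the pesq and pystoi packages
# (pip install pesq pystoi). Both inputs are 1-D float arrays at 16 kHz.
import numpy as np
from pesq import pesq
from pystoi import stoi

def reconstruction_metrics(reference: np.ndarray, reconstructed: np.ndarray):
    wb = pesq(16_000, reference, reconstructed, "wb")  # wideband PESQ at 16 kHz
    # Narrowband PESQ ("nb") expects 8 kHz audio; resample first to use it.
    st = stoi(reference, reconstructed, 16_000, extended=False)
    return {"pesq_wb": wb, "stoi": st}
```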

Use Cases and Applications

Voice Design Applications

Qwen3-TTS enables users to create custom voices through natural language:

  • Character voices: Generate voices for characters in games, animations, and interactive media
  • Brand voices: Create consistent brand voices for marketing and advertising
  • Accessibility: Develop voices for accessibility applications
  • Content creation: Produce diverse voices for podcasts, audiobooks, and multimedia content

Voice Cloning Applications

The rapid 3-second voice cloning capability enables:

  • Personalization: Create personalized voice assistants and avatars
  • Content localization: Clone voices for multilingual content production
  • Accessibility: Preserve voices for individuals with speech impairments
  • Entertainment: Enable voice cloning for entertainment and creative projects

Real-Time Interactive Applications

The ultra-low latency streaming generation supports:

  • Live conversations: Real-time voice interactions in chatbots and virtual assistants
  • Interactive media: Live voice generation for games and interactive experiences
  • Telecommunications: Low-latency voice synthesis for communication systems
  • Accessibility tools: Real-time text-to-speech for assistive technologies

Premium Timbres

Qwen3-TTS includes 9 premium timbres covering various combinations:

  • 苏瑶 Serena: Chinese female voice
  • 福伯 Uncle Fu: Chinese male voice
  • 十三 Vivian: Chinese female voice
  • 艾登 Aiden: English male voice
  • 甜茶 Ryan: English male voice
  • 小野杏 Ono Anna: Japanese female voice
  • 素熙 Sohee: Korean female voice
  • 晓东 Dylan: Chinese Dialect (Beijing Dialect) male voice
  • 程川 Eric: Chinese Dialect (Sichuan Dialect) male voice

These timbres provide diverse options for different use cases, languages, and cultural contexts.

Availability and Access

Open Source Release

The entire Qwen3-TTS multi-codebook model series is now open-sourced:

  • GitHub: Available on GitHub
  • HuggingFace: Available on HuggingFace
  • ModelScope: Available on ModelScope for Chinese users
  • Demos: Interactive demos available on HuggingFace and ModelScope

API Access

Qwen3-TTS is accessible via the Qwen API on Alibaba Cloud (a request sketch follows the list), providing:

  • Enterprise integration: Easy integration for enterprise applications
  • Scalable deployment: Cloud-based deployment for scalable applications
  • Managed service: Fully managed service with high availability
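
A hedged request sketch follows. The endpoint URL, payload shape, and model identifier are placeholders, not the documented contract; consult the official Alibaba Cloud (DashScope) API reference before use.

```python
# Placeholder HTTP sketch for a cloud TTS call; the URL, payload fields, and
# model name are assumptions for illustration only.
import os
import requests

API_URL = "https://dashscope.aliyuncs.com/api/v1/..."  # placeholder endpoint

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {os.environ['DASHSCOPE_API_KEY']}"},
    json={
        "model": "qwen3-tts",    # assumed model identifier
        "input": {"text": "Hello from Qwen3-TTS!"},
    },
    timeout=30,
)
resp.raise_for_status()
```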

Documentation and Resources

Comprehensive resources are available:

  • Research paper: Technical paper available on GitHub
  • Documentation: Detailed documentation for developers
  • Examples: Extensive examples and use cases
  • Community support: Active community support and contributions

Technical Innovations

Multi-Codebook Architecture

The multi-codebook approach enables:

  • Rich representation: Captures complex acoustic features and variations
  • Efficient encoding: Efficient compression while maintaining quality
  • Flexible control: Enables fine-grained control over speech attributes

Non-DiT Architecture

The lightweight non-DiT architecture provides:

  • Speed advantages: Faster generation compared to diffusion-based approaches
  • Quality maintenance: Maintains high fidelity without diffusion overhead
  • Resource efficiency: Lower computational requirements

Contextual Understanding

Strong contextual understanding allows:

  • Semantic adaptation: Adapts tone, rhythm, and emotion based on text semantics
  • Natural expression: Produces more natural and contextually appropriate speech
  • Robustness: Improved handling of complex and noisy input text

Comparison with Other TTS Systems

Qwen3-TTS demonstrates competitive advantages:

  • Versus closed-source models: Outperforms MiniMax-Voice-Design in voice design tasks
  • Versus open-source models: Significantly leads other open-source models across multiple benchmarks
  • Versus commercial solutions: Competitive performance with ElevenLabs and other commercial TTS systems
  • Cross-lingual capabilities: SOTA performance in cross-lingual voice cloning, surpassing CosyVoice3

Conclusion

The open-sourcing of the Qwen3-TTS family represents a significant milestone in text-to-speech technology, providing developers and users with state-of-the-art speech generation capabilities. The combination of voice design, voice cloning, ultra-high-quality generation, and natural language control makes Qwen3-TTS one of the most comprehensive open-source TTS solutions available.

The model's exceptional performance across multiple benchmarks, support for 10 languages, ultra-low latency streaming generation, and comprehensive feature set position it as a leading solution for various applications—from real-time interactive systems to content creation and accessibility tools.

With both 1.7B and 0.6B parameter models available, developers can choose the appropriate balance between performance and efficiency for their specific use cases. The open-source nature of the project, combined with comprehensive documentation and community support, makes Qwen3-TTS accessible to a wide range of developers and researchers.

As artificial intelligence continues to evolve, open-source contributions like Qwen3-TTS play a crucial role in democratizing advanced AI capabilities and enabling innovation across industries. The release of this comprehensive TTS family demonstrates Alibaba's commitment to advancing open-source AI technology and providing valuable tools for the global AI community.

Explore more about Qwen models and text-to-speech technology in our models catalog and glossary.

Frequently Asked Questions

What is Qwen3-TTS?
Qwen3-TTS is a series of powerful speech generation models that support voice design, voice cloning, ultra-high-quality human-like speech generation, and natural language-based voice control. It uses a novel multi-codebook speech encoder and Dual-Track modeling for extremely low-latency streaming generation.

Which languages does Qwen3-TTS support?
Qwen3-TTS supports 10 mainstream languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian, along with various dialects to meet global application demands.

What model sizes are available?
The Qwen3-TTS family includes two model sizes: 1.7B parameters for peak performance and powerful control capabilities, and 0.6B parameters for an ideal balance between performance and efficiency.

How low is the latency?
Qwen3-TTS achieves extremely low-latency streaming generation, with the first audio packet delivered after processing just a single character. End-to-end synthesis latency can be as low as 97ms, meeting the demands of real-time interactive scenarios.

What is voice design?
Voice design allows users to generate customized timbre identities through natural language descriptions. Users can input acoustic attributes, persona descriptions, background information, and other free-form descriptions to create the desired voice identity.

Is Qwen3-TTS open source?
Yes, the entire Qwen3-TTS multi-codebook model series is now open-sourced on GitHub and accessible via the Qwen API on Alibaba Cloud.
