2026.01.22 Tech Release · Apache 2.0 Open Source

Qwen3-TTS
Next-Gen Open Source Audio Architecture

More than voice cloning: Qwen3-TTS rethinks the TTS interaction experience. This page walks through the Qwen team's latest release, from 97ms first-packet latency to the Dual-Track streaming architecture.

3s Instant Clone (VoiceClone)

Extract voice characteristics with just 3 seconds of reference audio. Supports cross-language cloning (e.g., generating English speech from Chinese audio) and even fun capabilities like animal mimicry.

Natural Language Voice Design

Create voices from a text prompt such as "A husky elderly male, sorrowful tone". The model interprets the emotion, prosody, and rhythm requirements expressed in the prompt.

97ms Ultra-Low Latency

Emits the first audio packet after receiving just a single character of streamed input. Built for latency-sensitive scenarios such as real-time translation and AI customer service.

Interactive Preview

Live Demo Environment

Experience VoiceDesign and VoiceClone features directly below.

Core Architecture Deep Dive

How does Qwen3-TTS achieve such low latency and high quality?

Dual-Track Streaming Architecture

Traditional TTS models often force a choice between streaming (low latency) and non-streaming (high quality) generation. Qwen3-TTS addresses this with its Dual-Track Hybrid Architecture.

This lets a single model serve both modes. In streaming mode, it achieves 97ms first-packet latency.
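First-packet latency is the time from sending text to receiving the first chunk of audio. The following is a minimal sketch of how that figure could be measured for any streaming TTS client. The `fake_streaming_tts` generator is a hypothetical stand-in that sleeps and yields dummy packets; it is not the Qwen3-TTS API, and the numbers it produces say nothing about the real model.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def first_packet_latency_ms(
    packets: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, bytes]:
    """Return (milliseconds until the first packet arrives, that packet)."""
    start = clock()
    first = next(iter(packets))  # blocks until the stream yields its first packet
    return (clock() - start) * 1000.0, first


def fake_streaming_tts(text: str, delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS client (assumption, not a real API)."""
    for _ in text:
        time.sleep(delay_s)   # simulated per-packet compute time
        yield b"\x00" * 320   # dummy audio bytes


latency_ms, packet = first_packet_latency_ms(fake_streaming_tts("Hello"))
print(f"first packet after {latency_ms:.1f} ms ({len(packet)} bytes)")
```

To measure a real deployment, the stand-in would be replaced by whatever iterator the actual client exposes for streamed audio.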

12Hz SOTA Tokenizer

Audio generation quality depends heavily on the tokenizer. Qwen3-TTS adopts a 12Hz Multi-Codebook Tokenizer.

Compared with the 25Hz or 50Hz frame rates of traditional tokenizers, 12Hz means far fewer frames per second of audio (roughly half of 25Hz and a quarter of 50Hz), which significantly reduces the number of tokens the model needs to predict.
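The effect of the frame rate can be checked with simple arithmetic. The sketch below counts how many tokenizer frames cover a 10-second clip at each rate. It counts frames only: with a multi-codebook tokenizer each frame carries several codebook entries, and the number of codebooks is not stated here, so total token counts are left out.

```python
def frames_for(duration_s: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames needed to cover duration_s seconds of audio."""
    return round(duration_s * frame_rate_hz)


DURATION_S = 10.0  # illustrative clip length, chosen for this example
frame_counts = {rate: frames_for(DURATION_S, rate) for rate in (12, 25, 50)}

for rate, frames in frame_counts.items():
    share = frames / frame_counts[50]
    print(f"{rate:>2} Hz: {frames:>3} frames ({share:.0%} of the 50 Hz count)")
```

For the 10-second clip this gives 120 frames at 12Hz, against 250 at 25Hz and 500 at 50Hz.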

Model Sizes & Scenarios

0.6B Version (Base)

Designed for edge deployment and low-compute environments. Maintains core TTS capabilities while greatly reducing VRAM usage.

1.7B Version (Pro)

Built for maximum quality and control. The larger parameter count brings stronger semantic understanding of complex prompt instructions.
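As a rough guide to the VRAM difference, the memory taken by the weights alone is the parameter count times the bytes per parameter. The sketch below applies that to the 0.6B and 1.7B sizes. It is a lower bound under stated assumptions: activations, caches, and the tokenizer/decoder are not included, and the precision options listed are generic, not a statement of which formats the released checkpoints ship in.

```python
# Bytes per parameter for common numeric formats (generic, not model-specific).
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1}


def weight_memory_gb(params_billion: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 10^9 bytes)."""
    # (params_billion * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billion * bytes_per_param


for size in (0.6, 1.7):
    row = ", ".join(
        f"{name}: {weight_memory_gb(size, nbytes):.1f} GB"
        for name, nbytes in BYTES_PER_PARAM.items()
    )
    print(f"{size}B weights -> {row}")
```

At 16-bit precision this works out to about 1.2 GB of weights for the 0.6B model and about 3.4 GB for the 1.7B model.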

Comparison: Advantages at a Glance

| Feature | Qwen3-TTS | GPT-4o-mini-tts (Analog) | Traditional Open Source |
| --- | --- | --- | --- |
| Instruction Following | SOTA (Precise Control) | Strong | Weak / Specific Format |
| Voice Customization | 1-sentence Prompt or 3s Audio | Audio Clone Only | Fine-tuning Required |
| Streaming Latency | 97ms (Dual-Track) | Approx 200-300ms | Usually >500ms |
| Deployment | Open Source (0.6B/1.7B) | Closed API | Variable |