2026.01.22 Tech Release · Apache 2.0 Open Source

Qwen3-TTS
Next-Gen Open Source Audio Architecture

More than voice cloning: Qwen3-TTS rethinks the TTS interaction experience. This page walks through the Qwen team's latest release, from its 97ms first-packet latency to its Dual-Track streaming architecture.

3s Instant Clone (VoiceClone)

Extract voice characteristics with just 3 seconds of reference audio. Supports cross-language cloning (e.g., generating English speech from Chinese audio) and even fun capabilities like animal mimicry.

Natural Language Voice Design

Create voices from a text prompt such as "A husky elderly male, sorrowful tone". The model interprets the emotion, prosody, and rhythm requirements expressed in the prompt.

97ms Ultra-Low Latency

Emits the first audio packet after receiving a single input character in streaming mode. This suits latency-sensitive scenarios such as real-time translation and AI customer service.

Core Architecture Deep Dive

How does Qwen3-TTS achieve such low latency and high quality?

Dual-Track Streaming Architecture

Traditional TTS models often force a choice between streaming (low latency) and non-streaming (high quality) generation. Qwen3-TTS addresses this trade-off with its Dual-Track hybrid streaming architecture.

The architecture lets a single model support both streaming and non-streaming generation. In streaming mode, it achieves a first-packet latency of 97ms.
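First-packet latency is the time from submitting text to receiving the first audio chunk. The sketch below shows one way to measure it for any chunked audio stream. It does not call Qwen3-TTS: `fake_streaming_tts` is a hypothetical stand-in that yields placeholder bytes, and both function names are assumptions made for illustration.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_streaming_tts(text: str, chunk_delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS engine (not the Qwen3-TTS API).

    Yields one placeholder audio chunk per input character.
    """
    for _ in text:
        time.sleep(chunk_delay_s)  # simulated synthesis time per chunk
        yield b"\x00" * 320        # placeholder bytes, not real speech


def measure_first_packet(stream: Iterable[bytes]) -> Tuple[float, float, int]:
    """Return (first_packet_latency_s, total_time_s, n_chunks) for a chunk stream."""
    start = time.perf_counter()
    first = None
    n_chunks = 0
    for _ in stream:
        if first is None:
            # Time until the very first chunk arrives.
            first = time.perf_counter() - start
        n_chunks += 1
    total = time.perf_counter() - start
    if first is None:
        raise ValueError("stream produced no audio chunks")
    return first, total, n_chunks


first, total, n = measure_first_packet(fake_streaming_tts("hello"))
print(f"first packet: {first * 1000:.1f} ms, total: {total * 1000:.1f} ms, chunks: {n}")
```

Swapping the stand-in generator for a real streaming client would give a comparable measurement, though network and audio playback delays would need to be counted separately.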

12Hz SOTA Tokenizer

Audio generation quality depends heavily on tokenizer efficiency. Qwen3-TTS adopts a new 12Hz multi-codebook tokenizer.

Compared with the 25Hz or 50Hz frame rates of traditional tokenizers, 12Hz produces far fewer frames per second of audio, which significantly reduces the number of tokens the model needs to predict.
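To make the frame-rate comparison concrete, here is a small back-of-the-envelope calculation. It counts tokenizer frames only. A multi-codebook tokenizer emits several tokens per frame, and the number of codebooks is not stated here, so `num_codebooks` is a hypothetical parameter left at 1 by default.

```python
def frames_for(duration_s: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames needed to cover `duration_s` seconds of audio."""
    return round(duration_s * frame_rate_hz)


def tokens_for(duration_s: float, frame_rate_hz: float, num_codebooks: int = 1) -> int:
    """Total tokens if every frame carries `num_codebooks` tokens (hypothetical value)."""
    return frames_for(duration_s, frame_rate_hz) * num_codebooks


DURATION_S = 10.0  # example clip length, chosen for illustration

for rate_hz in (50, 25, 12):
    frames = frames_for(DURATION_S, rate_hz)
    print(f"{rate_hz:>2} Hz -> {frames} frames for {DURATION_S:.0f} s of audio")
```

For a 10-second clip this gives 500, 250, and 120 frames respectively, so a 12Hz tokenizer needs roughly a quarter of the frames of a 50Hz one.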

Model Sizes & Scenarios

0.6B Version (Base)

Designed for edge deployment and low-compute environments. It keeps the core TTS capabilities while using far less VRAM than the larger model.

1.7B Version (Pro)

Aimed at maximum quality and control. The larger parameter count brings stronger semantic understanding of complex prompt instructions.
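As a rough guide to the VRAM difference between the two sizes, the weights alone can be estimated as parameter count times bytes per parameter. The sketch below assumes the nominal sizes in the model names (0.6B and 1.7B) and ignores activations, caches, and the audio tokenizer, so real usage will be higher. These are estimates, not measured figures.

```python
# Bytes per parameter for common numeric formats.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Nominal parameter counts taken from the model names (an assumption).
MODELS = {"0.6B": 0.6e9, "1.7B": 1.7e9}

for name, params in MODELS.items():
    sizes = ", ".join(
        f"{dtype}: {weight_memory_gb(params, dtype):.1f} GB" for dtype in BYTES_PER_PARAM
    )
    print(f"{name} -> {sizes}")
```

In 16-bit precision this works out to about 1.2 GB of weights for the 0.6B model and about 3.4 GB for the 1.7B model.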

Comparison: Advantages at a Glance

Feature	Qwen3-TTS	GPT-4o-mini-tts (Closed-Source Analogue)	Traditional Open Source
Instruction Following	SOTA (Precise Control)	Strong	Weak / Specific Format
Voice Customization	1-sentence Prompt or 3s Audio	Audio Clone Only	Fine-tuning Required
Streaming Latency	97ms (Dual-Track)	Approx 200-300ms	Usually >500ms
Deployment	Open Source (0.6B/1.7B)	Closed API	Variable

Get Models & Docs

GitHub Repo

Code & Fine-tuning Guide

Hugging Face

Model Weights

ModelScope

Fast Download (CN)

Tech Report

Read Official Blog