Alibaba Cloud’s Qwen group has open-sourced Qwen3-TTS, a household of multilingual text-to-speech fashions that concentrate on three core duties in a single stack, voice clone, voice design, and prime quality speech era.

Mannequin household and capabilities
Qwen3-TTS makes use of a 12Hz speech tokenizer and a pair of language mannequin sizes, 0.6B and 1.7B, packaged into 3 important duties. The open launch exposes 5 fashions, Qwen3-TTS-12Hz-0.6B-Base and Qwen3-TTS-12Hz-1.7B-Base for voice cloning and generic TTS, Qwen3-TTS-12Hz-0.6B-CustomVoice and Qwen3-TTS-12Hz-1.7B-CustomVoice for promptable preset audio system, and Qwen3-TTS-12Hz-1.7B-VoiceDesign free of charge type voice creation from pure language descriptions, together with the Qwen3-TTS-Tokenizer-12Hz codec.
All fashions help 10 languages, Chinese language, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian. CustomVoice variants ship with 9 curated timbres, akin to Vivian, a brilliant younger Chinese language feminine voice, Ryan, a dynamic English male voice, and Ono_Anna, a playful Japanese feminine voice, every with a brief description that encodes timbre and talking fashion.
The VoiceDesign mannequin maps textual content directions on to new voices, for instance ‘communicate in a nervous teenage male voice with rising intonation’ and may then be mixed with the Base mannequin by first producing a brief reference clip and reusing it through create_voice_clone_prompt.

Structure, tokenizer, and streaming path
Qwen3-TTS is a twin monitor language mannequin, one monitor predicts discrete acoustic tokens from textual content, the opposite handles alignment and management alerts. The system is educated on greater than 5 million hours of multilingual speech in 3 pre coaching phases that transfer from basic mapping, to prime quality knowledge, to lengthy context help as much as 32,768 tokens.
A key element is the Qwen3-TTS-Tokenizer-12Hz codec. It operates at 12.5 frames per second, about 80 ms per token, and makes use of 16 quantizers with a 2048 entry codebook. On LibriSpeech take a look at clear it reaches PESQ wideband 3.21, STOI 0.96, and UTMOS 4.16, outperforming SpeechTokenizer, XCodec, Mimi, FireredTTS 2 and different current semantic tokenizers, whereas utilizing an identical or decrease body fee.
The tokenizer is applied as a pure left context streaming decoder, so it may possibly emit waveforms as quickly as sufficient tokens can be found. With 4 tokens per packet, every streaming packet carries 320 ms of audio. The non-DiT decoder and BigVGAN free design reduces decode price and simplifies batching.
On the language mannequin facet, the analysis group stories finish to finish streaming measurements on a single vLLM backend with torch.compile and CUDA Graph optimizations. For Qwen3-TTS-12Hz-0.6B-Base and Qwen3-TTS-12Hz-1.7B-Base at concurrency 1, the primary packet latency is round 97 ms and 101 ms, with actual time elements of 0.288 and 0.313 respectively. Even at concurrency 6, first packet latency stays round 299 ms and 333 ms.

Alignment and management
Put up coaching makes use of a staged alignment pipeline. First, Direct Choice Optimization aligns generated speech with human preferences on multilingual knowledge. Then GSPO with rule primarily based rewards improves stability and prosody. A remaining speaker fantastic tuning stage on the Base mannequin yields goal speaker variants whereas preserving the core capabilities of the overall mannequin.
Instruction following is applied in a ChatML fashion format, the place textual content directions about fashion, emotion or tempo are prepended to the enter. This identical interface powers VoiceDesign, CustomVoice fashion prompts, and fantastic grained edits for cloned audio system.
Benchmarks, zero shot cloning, and multilingual speech
On the Seed-TTS take a look at set, Qwen3-TTS is evaluated as a zero-shot voice cloning system. The Qwen3-TTS-12Hz-1.7B-Base mannequin reaches a Phrase Error Fee of 0.77 on test-zh and 1.24 on test-en. The analysis group highlights the 1.24 WER on test-en as cutting-edge among the many in contrast programs, whereas the Chinese language WER is near, however not decrease than, the very best CosyVoice 3 rating.

On a multilingual TTS take a look at set overlaying 10 languages, Qwen3-TTS achieves the bottom WER in 6 languages, Chinese language, English, Italian, French, Korean, and Russian, and aggressive efficiency on the remaining 4 languages, whereas additionally acquiring the best speaker similarity in all 10 languages in comparison with MiniMax-Speech and ElevenLabs Multilingual v2.
Cross-lingual evaluations present that Qwen3-TTS-12Hz-1.7B-Base reduces combined error fee for a number of language pairs, akin to zh-to-ko, the place the error drops from 14.4 for CosyVoice3 to 4.82, a couple of 66 % relative discount.
On InstructTTSEval, the Qwen3TTS-12Hz-1.7B-VD VoiceDesign mannequin units new cutting-edge scores amongst open supply fashions on Description-Speech Consistency and Response Precision in each Chinese language and English, and is aggressive with business programs like Hume and Gemini on a number of metrics.
Key Takeaways
- Full open supply multilingual TTS stack: Qwen3-TTS is an Apache 2.0 licensed suite that covers 3 duties in a single stack, prime quality TTS, 3 second voice cloning, and instruction primarily based voice design throughout 10 languages utilizing the 12Hz tokenizer household.
- Environment friendly discrete codec and actual time streaming: The Qwen3-TTS-Tokenizer-12Hz makes use of 16 codebooks at 12.5 frames per second, reaches robust PESQ, STOI and UTMOS scores, and helps packetized streaming with about 320 ms of audio per packet and sub 120 ms first packet latency for the 0.6B and 1.7B fashions within the reported setup.
- Process particular mannequin variants: The discharge affords Base fashions for cloning and generic TTS, CustomVoice fashions with 9 predefined audio system and elegance prompts, and a VoiceDesign mannequin that generates new voices straight from pure language descriptions which might then be reused by the Base mannequin.
- Sturdy alignment and multilingual high quality: A multi stage alignment pipeline with DPO, GSPO and speaker fantastic tuning offers Qwen3-TTS low phrase error charges and excessive speaker similarity, with lowest WER in 6 of 10 languages and the very best speaker similarity in all 10 languages among the many evaluated programs, and cutting-edge zero shot English cloning on Seed TTS.
Try the Mannequin Weights, Repo and Playground. Additionally, be at liberty to observe us on Twitter and don’t neglect to hitch our 100k+ ML SubReddit and Subscribe to our E-newsletter. Wait! are you on telegram? now you may be a part of us on telegram as nicely.