This is a simplified guide to an AI model called Qwen3-TTS-12Hz-1.7B-CustomVoice maintained by Qwen. If you like these kinds of analyses, join AIModels.fyi or follow us on Twitter.
Model overview
Qwen3-TTS-12Hz-1.7B-CustomVoice is a text-to-speech model created by Qwen that converts written text into natural-sounding speech across 10 major languages. The model supports instruction-based voice control and offers 9 premium voice timbres covering various combinations of gender, age, language, and dialect. It marks a significant advance over earlier approaches, using a discrete multi-codebook language model architecture that eliminates the cascading errors found in traditional systems.
The architecture employs a Dual-Track hybrid streaming generation system, enabling both streaming and non-streaming audio production from a single model. The self-developed Qwen3-TTS-Tokenizer-12Hz powers the system, achieving efficient acoustic compression while maintaining high-fidelity speech reconstruction through a lightweight non-DiT architecture.
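If the "12Hz" in the tokenizer's name refers to its acoustic frame rate, as the naming suggests (an inference from the name, not a published spec), the token budget for a clip is easy to estimate. The sketch below is purely illustrative; the codebook count is a made-up placeholder.

```python
# Back-of-the-envelope token budget, assuming "12Hz" is the tokenizer's acoustic
# frame rate (an inference from the name, not a published spec).

FRAME_RATE_HZ = 12      # assumed discrete acoustic frames per second of audio
NUM_CODEBOOKS = 4       # hypothetical codebook count, for illustration only

def frames_for(duration_s: float) -> int:
    """Discrete frames per codebook needed to represent a clip of this length."""
    return round(duration_s * FRAME_RATE_HZ)

clip_seconds = 5.0
frames = frames_for(clip_seconds)
print(f"{clip_seconds:.0f} s of speech -> {frames} frames per codebook, "
      f"{frames * NUM_CODEBOOKS} tokens across {NUM_CODEBOOKS} hypothetical codebooks")
```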
Model inputs and outputs
Inputs
- Text content: Written text in any of the 10 supported languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian)
- Language specification: The target language for synthesis
- Speaker selection: Choice from 9 predefined premium voice timbres
- Optional instructions: Natural language directives for controlling tone, emotion, speaking rate, and prosody
Outputs
- Audio waveforms: High-fidelity speech audio generated in real time or in batch mode
- Streaming capability: Immediate audio output after as little as a single character of input, with end-to-end synthesis latency as low as 97 ms
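The exact loading and inference API depends on the released package, so the sketch below uses a hypothetical `synthesize` stand-in purely to show how the documented inputs and outputs fit together. The function, its parameters, and the placeholder speaker name are assumptions, not the model's actual interface.

```python
import numpy as np
import soundfile as sf  # pip install soundfile

def synthesize(text, language, speaker, instruction=None, sample_rate=24_000):
    """Hypothetical stand-in for the model's non-streaming inference call.

    Returns one second of silence so the script runs end to end; replace the
    body with the real Qwen3-TTS call described on the model card.
    """
    return np.zeros(sample_rate, dtype=np.float32), sample_rate

audio, sr = synthesize(
    text="It's a beautiful morning for a walk along the river.",
    language="English",                              # one of the 10 supported languages
    speaker="<premium-voice-name>",                  # one of the 9 predefined timbres
    instruction="Speak warmly, at a relaxed pace.",  # optional natural-language control
)

sf.write("output.wav", audio, sr)  # persist the generated waveform
```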
Capabilities
The model understands context within text and applies that understanding to generate emotionally appropriate speech. It preserves paralinguistic information and acoustic environmental features through powerful speech representation. Users can instruct the model to adjust timbre and emotional expression based on text semantics, enabling the synthesis of speech that matches the intended tone of written content.
Streaming generation produces the first audio packet immediately after receiving minimal input, making it suitable for real-time interactive applications. The model handles noisy or imperfect input text with marked robustness, recovering gracefully from typos and formatting irregularities.
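As a hedged sketch of what consuming the streaming mode might look like from the caller's side: `stream_synthesize` below is a hypothetical generator standing in for whatever streaming interface the released package exposes; the chunk-handling pattern around it is the point, not the names.

```python
import time
import numpy as np

def stream_synthesize(text, speaker, chunk_ms=80, sample_rate=24_000):
    """Hypothetical generator yielding audio chunks as they would arrive from the model."""
    samples_per_chunk = int(sample_rate * chunk_ms / 1000)
    for _ in range(10):  # a real stream ends when the text is fully synthesized
        yield np.zeros(samples_per_chunk, dtype=np.float32)

start = time.perf_counter()
chunks = []

for i, chunk in enumerate(stream_synthesize("Hello there!", speaker="<premium-voice-name>")):
    if i == 0:
        # With the real model, this first-packet delay is what the 97 ms figure refers to.
        print(f"First audio packet after {(time.perf_counter() - start) * 1000:.1f} ms")
    chunks.append(chunk)  # in practice, hand each chunk to an audio sink immediately

audio = np.concatenate(chunks)  # full waveform, if you also want to keep the result
```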
What can I use it for?
Create interactive voice assistants that respond to users with natural speech in their preferred language. Build audiobook production systems that generate multiple voice variations from manuscripts. Develop multilingual customer service platforms where text responses are converted to speech with controlled emotional tone. Generate personalized audio messages where businesses customize voice characteristics to match brand identity. Implement real-time translation systems that pair translation with synthetic speech output in the user's native language. Create accessibility tools that convert written content to speech for individuals with visual impairments.
Things to try
Experiment with instruction control by providing natural language descriptions of desired voice characteristics alongside your text. Test the streaming generation capability with very short text inputs to observe the rapid response latency. Compare outputs across different speaker timbres to understand the voice variety available within the model. Feed the model emotionally charged text and observe whether the synthesis reflects appropriate prosodic changes. Try multilingual inputs that mix languages within a single prompt to see how the model handles code-switching scenarios. Use the rapid voice cloning capability from the base model variant to understand the foundation this custom voice model builds upon.
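As a concrete starting point for the instruction-control experiment, the sketch below renders one sentence under several directives so the prosodic differences can be compared by ear. As before, `synthesize` is a hypothetical stand-in, not the model's actual API.

```python
import numpy as np
import soundfile as sf

def synthesize(text, language, speaker, instruction=None, sample_rate=24_000):
    """Hypothetical stand-in; swap in the real Qwen3-TTS inference call."""
    return np.zeros(sample_rate, dtype=np.float32), sample_rate

text = "The results of the experiment are finally in."
instructions = [
    "Speak cheerfully and quickly, as if sharing good news.",
    "Speak slowly and solemnly, with long pauses.",
    "Whisper, as if telling a secret.",
]

for i, instruction in enumerate(instructions):
    audio, sr = synthesize(text, language="English",
                           speaker="<premium-voice-name>", instruction=instruction)
    sf.write(f"variant_{i}.wav", audio, sr)  # save each variant for side-by-side listening
```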
