About six months ago I wrote a small rant on how open-source TTS models still sucked. Six months later, I'm happy to report that isn't the case anymore.

In January this year, Qwen, the famous Chinese AI lab, released Qwen3-TTS, an open-weights series of TTS models. The release included 2 CustomVoice models (pre-made voices + style control), 2 base models (zero-shot voice cloning + fine-tuning), and a VoiceDesign model (create voices from descriptions). With 0.6B and 1.7B variants, they're all quite small.

[Figure: Qwen3-TTS in a nutshell.]

There are a lot of things to like. First, it fully supports voice cloning, via fine-tuning or zero-shot conditioning. Voxtral, for example, doesn't. Second, the license is Apache 2.0, which means we can do whatever we want with it (beware). Third, it's supported by a strong inference engine, in this case vLLM-Omni. And more importantly, it avoids many of the small issues other open-source TTS models had when generating longer pieces of text: squeaks,…