OmniVoice: Open-Source TTS with 600+ Languages and Zero-Shot Voice Cloning
The TTS landscape just shifted. On March 31, 2026, the k2-fsa team (the same group behind Kaldi and k2, with Daniel Povey as a core contributor) released OmniVoice: an Apache 2.0 licensed TTS model that supports 600+ languages zero-shot and runs at 40x real-time inference speed. In just three weeks it hit 3,775 GitHub stars and 460,000+ HuggingFace downloads. Here's why developers are paying attention and how to run it locally.

## Key specs

| Metric | Value |
|---|---|
| Languages supported | 600+ (zero-shot) |
| RTF (Real-Time Factor) | 0.025 (40x faster than real-time) |
| License | Apache 2.0 (commercial use OK) |
| Base model | Qwen3-0.6B |
| Reference audio needed | 3–10 seconds (or none) |
| Hardware | Consumer GPUs; Apple Silicon MPS supported |

Compare this to commercial services:

- ElevenLabs Pro: $22/month, limited characters
- ElevenLabs Business: $99/month
- Azure TTS: $16/million characters
- Google Cloud TTS: $16/million characters
- OmniVoice: zero cost after deployment, unlimited usage on your own hardware

## Three inference modes

OmniVoice supports three inference modes through a single unified API.

### 1. Voice cloning

Clone a voice from a short reference audio clip. Whisper auto-transcribes the reference text if you don't provide it.

```python
from omnivoice import OmniVoice
import soundfile as sf
import torch

model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16,
)

audio = model.generate(
    text="This voice was cloned from a 3-second reference.",
    ref_audio="ref.wav",  # ref_text optional - Whisper auto-transcribes
)
sf.write("cloned.wav", audio[0], 24000)
```

### 2. Voice design

Design a voice from scratch using natural-language attributes. No reference audio required.

```python
audio = model.generate(
    text="This is a designed voice.",
    instruct="female, low pitch, british accent",
)
```

Combine attributes freely: gender, age, pitch, speaking speed, accent, dialect, emotional tone.

### 3. No prompt

No voice prompt at all. The fastest mode for quick prototyping.

```python
audio = model.generate(text="Quick test output.")
```

## Installation

PyTorch must be installed first, pinned to version 2.8.0.

```bash
# NVIDIA (CUDA 12.8)
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 \
    --extra-index-url https://download.pytorch.org/whl/cu128

# Apple Silicon (M1/M2/M3)
pip install torch==2.8.0 torchaudio==2.8.0

# OmniVoice
pip install omnivoice
```

Verify GPU/MPS detection:

```python
import torch
print("CUDA:", torch.cuda.is_available())
print("MPS:", torch.backends.mps.is_available())
```

## Gradio demo

The fastest way to validate your setup is the bundled Gradio demo:

```bash
omnivoice-demo --ip 0.0.0.0 --port 8001
```

Navigate to http://localhost:8001 and test all three modes through a UI.

## CLI tools

Besides the Python API, OmniVoice ships with two CLI tools:

```bash
# Single inference
omnivoice-infer --model k2-fsa/OmniVoice \
    --text "Hello world." \
    --ref_audio ref.wav \
    --output hello.wav

# Multi-GPU batch inference
omnivoice-infer-batch --model k2-fsa/OmniVoice \
    --test_list test.jsonl \
    --res_dir results/
```

Batch format (JSONL):

```jsonl
{"id": "clip_001", "text": "First clip.", "ref_audio": "ref.wav"}
{"id": "clip_002", "text": "Second clip.", "ref_audio": "ref.wav"}
```

Perfect for audiobook generation or large-scale narration pipelines.

## Non-verbal tags and pronunciation control

Drop non-verbal tokens anywhere in the text:

```python
text = "That's hilarious [laughter] but also a bit concerning [sigh]."
audio = model.generate(text=text, ref_audio="ref.wav")
```

Available tags include [laughter], [sigh], [question-ah], [surprise-wa].

For homophones and proper nouns, override pronunciation directly:

```python
# "bass" as musical instrument, not low-frequency sound
audio = model.generate(text="He plays [B EY1 S] guitar.")

# Force a specific tone
audio = model.generate(text="打ZHE2")
```

## Architecture

OmniVoice uses a hybrid Diffusion Language Model architecture. It is neither pure diffusion nor pure autoregressive; it combines the quality benefits of diffusion with the speed advantages of LLM-style generation. The base model is Qwen3-0.6B, light enough for consumer hardware while leveraging the language understanding of a modern LLM. This is a different direction from previous open-source TTS projects (Bark, XTTS, F5-TTS), and it seems to be paying off in both quality and inference speed.

## Community tips

The community has been active on GitHub Issue #44 discussing real-world usage:

- **Voice Design consistency.** Each call produces a slightly different timbre. Generate once, save the output, then reuse it as `ref_audio` to lock in a consistent voice for an entire project (see the first sketch after this list).
- **Prompt caching.** Use `create_voice_clone_prompt` to precompute reference audio encodings once, then reuse the cached prompt for repeated generation. Critical for throughput on long-form content (second sketch below).
- **Number normalization.** Raw digits like "123" can produce inconsistent output. Normalize to words ("one hundred twenty-three") using WeTextProcessing or similar before passing text to the model (third sketch below).
- **Cross-lingual accent bleed.** If you use a Korean reference to generate English, the output has a Korean accent. For a neutral accent in the target language, use native-speaker references.
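Here is what the consistency trick looks like in practice. This is a minimal sketch reusing the `model` object and the `generate` parameters shown in the examples above; the `instruct` string and file names are placeholders.

```python
import soundfile as sf

# Design the project voice once and persist the sample.
audio = model.generate(
    text="Reference line recorded once for the whole project.",
    instruct="female, low pitch, british accent",
)
sf.write("project_voice.wav", audio[0], 24000)

# Later calls clone the saved sample, so the timbre stays fixed
# across every clip in the project.
audio = model.generate(
    text="Chapter one. It was a quiet morning.",
    ref_audio="project_voice.wav",
)
sf.write("chapter_01.wav", audio[0], 24000)
```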
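Prompt caching might look like the following. `create_voice_clone_prompt` is the name from the issue thread, but I am assuming here that it is a method on the model and that `generate` accepts the cached result via a `prompt` argument; check the repo for the exact signature before relying on this.

```python
# Assumption: create_voice_clone_prompt is exposed as a model method.
prompt = model.create_voice_clone_prompt(ref_audio="ref.wav")

chapters = [
    "First chapter text goes here.",
    "Second chapter text goes here.",
]
for i, text in enumerate(chapters):
    # Assumption: generate() accepts the cached prompt directly,
    # skipping the reference-audio encoding on every call.
    audio = model.generate(text=text, prompt=prompt)
    sf.write(f"chapter_{i:02d}.wav", audio[0], 24000)
```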
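And for normalization, a sketch assuming WeTextProcessing's English normalizer; the module path (`tn.english.normalizer`) and the exact output wording may differ across versions, so treat this as illustrative rather than exact.

```python
# pip install WeTextProcessing
from tn.english.normalizer import Normalizer

normalizer = Normalizer()
clean = normalizer.normalize("The invoice total is 123 dollars.")
# Digits are expanded to words before synthesis
# (exact wording depends on the library version).

audio = model.generate(text=clean, ref_audio="ref.wav")
```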
# "bass" as musical instrument, not low-frequency sound audio = model.generate(text="He plays [B EY1 S] guitar.") # Force specific tone audio = model.generate(text="打ZHE2") OmniVoice uses a Diffusion Language Model hybrid architecture. It's neither pure diffusion nor pure autoregressive — it combines the quality benefits of diffusion with the speed advantages of LLM-style generation. The base model is Qwen3-0.6B, making it light enough for consumer hardware while leveraging the language understanding of a modern LLM. This is a different direction from previous open-source TTS projects (Bark, XTTS, F5-TTS), and it seems to be paying off in both quality and inference speed. The community has been active on GitHub Issue #44 discussing real-world usage. Voice Design consistency. Each call produces slightly different timbre. Generate once, save the output, then reuse as ref_audio to lock in a consistent voice for an entire project. Prompt caching. Use create_voice_clone_prompt to precompute reference audio encodings once, then reuse the cached prompt for repeated generation. Critical for throughput on long-form content. Number normalization. Raw digits like "123" can produce inconsistent output. Normalize to words ("one hundred twenty-three") using WeTextProcessing or similar before passing text to the model. Cross-lingual accent bleed. If you use a Korean reference to generate English, the output has a Korean accent. For neutral target language accent, use native-speaker references. OmniVoice fits several developer profiles well: Solo developers and indie hackers who were paying commercial TTS subscriptions just for hobby projects AI agent builders needing voice output without vendor lock-in Content creators doing multilingual localization (YouTube, podcasts) Voice cloning experiments where 3-second references unlock a lot of creative possibilities Low-resource language applications (the 600+ language coverage includes many languages with no good commercial option) The k2-fsa ecosystem has a strong track record of long-term maintenance (Kaldi is still actively used 15+ years after release). That matters when you're deciding whether to build production infrastructure on a new model. If you're evaluating TTS options, OmniVoice deserves a spot in the comparison. The combination of 600+ language support, 40x real-time inference, and Apache 2.0 licensing is genuinely rare in open source TTS today. GitHub repo HuggingFace model card Paper on arXiv Issue #44: community Q&A Have you tried OmniVoice yet? Would love to hear how it compares to your current TTS setup in the comments.
