Raon-OpenTTS-0.3B

Technical Report | Code | Dataset | Raon-OpenTTS-1B

Raon-OpenTTS is an open-data, open-weight zero-shot TTS system that performs on par with state-of-the-art closed-data models. This is the 0.3B variant.

Key Features

Fully Open: Both model weights and training data (615K hours, 11 English speech datasets) are publicly available for reproducible TTS research.
More Robust on Wild Speech: Achieves a far lower WER than F5-TTS on the Wild split of Raon-OpenTTS-Eval (5.83% vs. 136.03%), demonstrating better robustness to unscripted conversational speech prompts.
Large-Scale Curated Data: Trained on Raon-OpenTTS-Core (510K hours), quality-filtered from Raon-OpenTTS-Pool using combined DNSMOS, WER, and VAD rank-based filtering.
DiT Architecture: Based on the F5-TTS Diffusion Transformer (DiT) trained with flow matching, enabling efficient zero-shot speech synthesis.
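
The rank-based filtering mentioned in the feature list can be pictured with a small sketch. This is only one plausible reading, not the procedure from the technical report: the field names (dnsmos, wer, vad_ratio), the choice to sum the three ranks, the direction of the VAD score, and the keep fraction are all assumptions made for illustration.

```python
def rank_positions(values, higher_is_better):
    """Return each item's rank (0 = best) under a single metric."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=higher_is_better)
    ranks = [0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = position
    return ranks


def combined_rank_filter(samples, keep_fraction):
    """Keep the best `keep_fraction` of samples by summed per-metric rank.

    `samples` is a list of dicts with hypothetical keys "dnsmos", "wer"
    and "vad_ratio". Summing ranks is an assumption, not the documented rule.
    """
    dnsmos_rank = rank_positions([s["dnsmos"] for s in samples], True)
    wer_rank = rank_positions([s["wer"] for s in samples], False)
    # Assumption: a higher speech ratio from VAD counts as better.
    vad_rank = rank_positions([s["vad_ratio"] for s in samples], True)

    combined = [d + w + v for d, w, v in zip(dnsmos_rank, wer_rank, vad_rank)]
    keep_count = int(len(samples) * keep_fraction)
    best_first = sorted(range(len(samples)), key=lambda i: combined[i])
    kept_indices = sorted(best_first[:keep_count])  # preserve input order
    return [samples[i] for i in kept_indices]


toy_pool = [
    {"id": "a", "dnsmos": 3.8, "wer": 2.0, "vad_ratio": 0.9},
    {"id": "b", "dnsmos": 2.1, "wer": 30.0, "vad_ratio": 0.4},
    {"id": "c", "dnsmos": 3.5, "wer": 5.0, "vad_ratio": 0.8},
    {"id": "d", "dnsmos": 2.9, "wer": 12.0, "vad_ratio": 0.6},
]
toy_core = combined_rank_filter(toy_pool, keep_fraction=0.5)
```

Ranking each metric separately before combining avoids having to put DNSMOS, WER, and VAD scores on a common scale, which is one reason a rank-based scheme can be attractive for mixed quality signals.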

Model Details

Item	Value
Parameters	336M
Architecture	DiT (Diffusion Transformer), based on F5-TTS
Config	dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4
Training Data	Raon-OpenTTS-Core (510.1K hours)
Steps	225K updates
Hardware	48 NVIDIA B200 GPUs
Batch Size	672K frames (14K per GPU × 48 GPUs)
Optimizer	AdamW, peak LR 1e-4, 50K warmup steps, linear decay, gradient norm clipped at 1.0
Audio	80-channel mel-spectrogram, 16 kHz, hop=256
Vocoder	HiFi-GAN (speechbrain/tts-hifigan-libritts-16kHz)
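
The audio and batch-size rows above imply a frame rate and an amount of audio per update, which the following arithmetic sketch works out. It assumes that "K" means exactly 1,000 and that every batch is completely filled with speech frames (no padding); neither is stated in the table, so the results are rough estimates.

```python
# Values taken from the Model Details table; "K" is assumed to mean 1,000.
SAMPLE_RATE_HZ = 16_000
HOP_LENGTH = 256
FRAMES_PER_GPU = 14_000
NUM_GPUS = 48
UPDATE_STEPS = 225_000
CORE_HOURS = 510_100  # Raon-OpenTTS-Core size in hours


def frames_per_second(sample_rate_hz=SAMPLE_RATE_HZ, hop_length=HOP_LENGTH):
    """Mel frames produced per second of audio."""
    return sample_rate_hz / hop_length


def frames_to_hours(num_frames):
    """Convert a mel-frame count to hours of audio."""
    return num_frames / frames_per_second() / 3600


global_batch_frames = FRAMES_PER_GPU * NUM_GPUS            # frames per update
batch_hours = frames_to_hours(global_batch_frames)         # audio per update
total_hours = frames_to_hours(global_batch_frames * UPDATE_STEPS)
# Rough number of passes over the training set, ignoring padding.
passes_over_core = total_hours / CORE_HOURS
```

Under these assumptions each update covers close to three hours of audio, and the full run processes on the order of 1.3 passes over Raon-OpenTTS-Core.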

Benchmark Results (Seed-TTS-Eval)

WER measured via Whisper-large-v3; SIM via WavLM-large. All numbers are from the technical report.

Model	Params	WER (%) ↓	SIM ↑
Human	-	2.14	0.734
Seed-TTS	-	2.25	0.762
CosyVoice 3	1.5B	2.21	0.720
Qwen3-TTS	1.7B	1.46	0.715
F5-TTS	0.3B	2.04	0.671
Raon-OpenTTS-0.3B	0.3B	1.95	0.687
Raon-OpenTTS-1B	1.0B	1.78	0.749

See Raon-OpenTTS-1B for the larger model and full CV3-Eval / Raon-OpenTTS-Eval comparisons.
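
The SIM column is a speaker-similarity score. A common way to compute such a score is the cosine similarity between a speaker embedding of the prompt audio and one of the synthesized audio. The sketch below shows only that final step and assumes the embeddings have already been extracted by some WavLM-large-based speaker model; the extraction itself and the exact protocol of the technical report are not reproduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real speaker embeddings.
prompt_embedding = [0.2, 0.5, 0.1]
synth_embedding = [0.4, 1.0, 0.2]   # same direction, different scale
sim = cosine_similarity(prompt_embedding, synth_embedding)
```

Because cosine similarity ignores vector length, the score depends only on the direction of the embeddings, and values closer to 1 indicate a closer match to the prompt speaker.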

Benchmark Results (Raon-OpenTTS-Eval Wild)

WER measured via Whisper-large-v3; SIM via WavLM-large.

Model	Params	WER (%) ↓	SIM ↑
F5-TTS	0.3B	136.03	0.324
Raon-OpenTTS-0.3B	0.3B	5.83	0.571
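
A WER above 100%, as in the F5-TTS row, is possible because WER divides the number of word-level edits (substitutions, deletions, insertions) by the length of the reference, and a transcript with many extra words can need more edits than the reference has words. The sketch below computes WER with a plain word-level edit distance. The lowercase-and-split normalization is an assumption for illustration; the evaluation in the technical report transcribes audio with Whisper-large-v3 first and may normalize text differently.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate in percent, using word-level Levenshtein distance."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return 100.0 * previous[-1] / len(ref_words)


exact = word_error_rate("the cat sat", "the cat sat")
one_substitution = word_error_rate("the cat sat", "the dog sat")
# Three inserted words against a two-word reference give a WER above 100%.
repeated_words = word_error_rate("hello world", "hello hello hello world world")
```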

Inference

For inference code and usage instructions, see KRAFTON/Raon-OpenTTS.

Training Details

Raon-OpenTTS-0.3B was trained for 225K update steps on 48 NVIDIA B200 GPUs using the Raon-OpenTTS-Core dataset (510.1K hours of English speech). Training uses the AdamW optimizer with a peak learning rate of 1e-4, 50K warmup steps followed by linear decay, and gradient norm clipping at 1.0. Waveform synthesis uses a HiFi-GAN vocoder pretrained on LibriTTS at 16 kHz.
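
The learning-rate schedule described above can be sketched as a simple function of the update step. Two details are assumptions, since the text does not state them: the warmup is taken to be linear from zero, and the linear decay is taken to reach zero at the final step (225K). The function name and the final_lr parameter are hypothetical.

```python
PEAK_LR = 1e-4
WARMUP_STEPS = 50_000
TOTAL_STEPS = 225_000


def learning_rate(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS,
                  total_steps=TOTAL_STEPS, final_lr=0.0):
    """Linear warmup to `peak_lr`, then linear decay to `final_lr`.

    Assumptions: warmup starts at zero and decay ends at `final_lr`
    exactly at `total_steps`.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    step = min(step, total_steps)  # hold the final value after training ends
    decay_fraction = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr + (final_lr - peak_lr) * decay_fraction


# Learning rate at a few points of the run.
schedule_samples = {s: learning_rate(s) for s in (0, 25_000, 50_000, 137_500, 225_000)}
```

Under these assumptions the warmup takes up the first 50K of the 225K updates, so the learning rate is rising for roughly 22% of training and decaying for the rest.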

Citation

@article{kim2026raonopentts,
  title     = {Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech},
  author    = {Kim, Semin and Chung, Seungjun and Moon, Taehong and Lee, Sangheon and Ahn, Minyoung and Lee, Keon and Kim, Nam Soo and Cho, Jaewoong and Schmidt, Ludwig and Lee, Kangwook and Park, Dongmin},
  journal   = {arXiv preprint arXiv:2605.20830},
  year      = {2026},
  url       = {https://arxiv.org/abs/2605.20830}
}

License

This repository is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.
