Nile-XTTS Model 🇪🇬

Nile-XTTS is a fine-tuned version of XTTS v2 optimized for Egyptian Arabic (اللهجة المصرية) text-to-speech synthesis with zero-shot voice cloning capabilities.

Model Description

This model was fine-tuned on the NileTTS dataset, comprising 38 hours of Egyptian Arabic speech across medical, sales, and general conversation domains.

Key Features

  • Egyptian Arabic optimized: Trained specifically on the Egyptian dialect, not MSA or Gulf Arabic
  • Zero-shot voice cloning: Clone any voice from a reference clip as short as 6 seconds
  • Improved intelligibility: 29.9% relative reduction in word error rate (WER) compared to base XTTS v2
  • Better pronunciation: 49.4% relative reduction in character error rate (CER) for Egyptian Arabic

Performance

Metric             | XTTS v2 (Baseline) | Nile-XTTS-v2 (Ours) | Improvement
WER                | 26.8%              | 18.8%               | 29.9% reduction
CER                | 8.1%               | 4.1%                | 49.4% reduction
Speaker Similarity | 0.713              | 0.755               | +5.9%
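For intuition about what the table reports: WER and CER are word- and character-level edit distances normalized by reference length, and speaker similarity is typically the cosine similarity between speaker embeddings. The following is an illustrative sketch of these metrics, not the evaluation code used to produce the numbers above:

```python
import math

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution
        prev = cur
    return prev[-1]

def wer(ref_text, hyp_text):
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = ref_text.split(), hyp_text.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(ref_text, hyp_text):
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(list(ref_text), list(hyp_text)) / len(ref_text)

def cosine_similarity(a, b):
    """Speaker-similarity proxy: cosine of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

In practice, WER/CER are computed against ASR transcripts of the synthesized audio, and the embeddings come from a separate speaker-verification model.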

Usage

Installation

pip install TTS

Usage (Direct Model Loading)

import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# load config and model
config = XttsConfig()
config.load_json("config.json")

model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="model.pth",
    vocab_path="vocab.json",
    use_deepspeed=False
)
# move to GPU if one is available
if torch.cuda.is_available():
    model.cuda()
model.eval()

# get speaker latents from reference audio
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path="reference.wav",
    gpt_cond_len=6,
    max_ref_length=30,
    sound_norm_refs=False
)

# synthesize speech from text
out = model.inference(
    text="مرحبا، إزيك النهارده؟",  # "Hello, how are you today?"
    language="ar",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.7,
)

# save output
torchaudio.save("output.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
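torchaudio handles the save above; if you only need to write the 24 kHz mono output and want to avoid the torchaudio dependency, Python's standard-library `wave` module works too. A minimal sketch, assuming `out["wav"]` is a 1-D sequence of floats in [-1, 1]:

```python
import struct
import wave

def save_wav(path, samples, sample_rate=24000):
    """Write float samples in [-1, 1] as 16-bit PCM mono WAV."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 16-bit samples
        f.setframerate(sample_rate)  # XTTS v2 outputs 24 kHz audio
        clipped = (max(-1.0, min(1.0, s)) for s in samples)
        f.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in clipped))
```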

Training Details

  • Base model: XTTS v2
  • Training data: NileTTS dataset (38 hours, 2 speakers)
  • Epochs: 8 (early stopping)
  • Learning rate: 5e-6

Limitations

  • Limited to 2 speaker voices in training data
  • Optimized for Egyptian Arabic; may not perform as well on other Arabic dialects
  • Zero-shot cloning quality depends on reference audio quality
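Because cloning quality hinges on the reference audio, a quick pre-flight check on candidate clips can catch unusable input early. A minimal sketch using only the standard library; the 6 s / 30 s bounds mirror the `gpt_cond_len` and `max_ref_length` values in the usage example above and the 16 kHz floor is an illustrative threshold, not a hard model requirement:

```python
import wave

def check_reference(path, min_seconds=6.0, max_seconds=30.0):
    """Return (ok, reason) for a candidate reference WAV clip."""
    with wave.open(path, "rb") as f:
        rate = f.getframerate()
        duration = f.getnframes() / rate
        if rate < 16000:
            return False, f"sample rate too low ({rate} Hz < 16000 Hz)"
        if duration < min_seconds:
            return False, f"clip too short ({duration:.1f}s < {min_seconds}s)"
        if duration > max_seconds:
            return False, f"clip too long ({duration:.1f}s > {max_seconds}s)"
        return True, "ok"
```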

Citation

If you use this model, please cite: [TO BE ADDED]

License

This model is released under the Apache 2.0 license, following the original XTTS v2 license.

Acknowledgements

  • Coqui TTS for the XTTS v2 base model
  • The NileTTS team for the dataset creation