---
license: mit
language:
- en
- gu
base_model:
- Qwen/Qwen2.5-0.5B
pipeline_tag: text-to-speech
tags:
- tts
- indian-accent
---
# Ind-QwenTTS

A lightweight multilingual text-to-speech (TTS) model built on Qwen2.5-0.5B, with accent control for English and Gujarati.

## Features

- Multilingual: English + Gujarati
- Accent Control: Indian & Gujarati accents
- 4 voices (2 male, 2 female)
- Accent transfer capability
- Fast inference with 0.5B parameters

## Supported Voices

| Speaker ID | Language | Accent | Gender |
|-----------|----------|---------|---------|
| `SPK_EN_M_001` | English | Indian | Male |
| `SPK_EN_F_001` | English | Indian | Female |
| `SPK_GU_M_001` | Gujarati | Gujarati | Male |
| `SPK_GU_F_001` | Gujarati | Gujarati | Female |

## Installation

```bash
pip install transformers torch torchaudio snac torchcodec
```

## Usage

```python
import torch
import torchaudio
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("AryanNsc/IND-QWENTTS-V1", fix_mistral_regex=True)
model = AutoModelForCausalLM.from_pretrained("AryanNsc/IND-QWENTTS-V1", torch_dtype=torch.bfloat16).to(device).eval()
snac = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").to(device).eval()

def generate_speech(text, language="english", accent="indian", gender="M", speaker=None, output_file="output.wav"):
    if speaker is None:
        speaker_map = {
            ("english", "M"): "SPK_EN_M_001",
            ("english", "F"): "SPK_EN_F_001",
            ("gujarati", "M"): "SPK_GU_M_001",
            ("gujarati", "F"): "SPK_GU_F_001"
        }
        speaker = speaker_map.get((language, gender), "SPK_EN_M_001")
    
    prompt = f"<lang>{language}</lang><accent>{accent}</accent><gender>{gender}</gender><speaker>{speaker}</speaker> {text}"
    
    input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(device)
    
    start_tokens = torch.tensor([
        tokenizer.convert_tokens_to_ids("<|endoftext|>"),
        tokenizer.convert_tokens_to_ids("<soh>"),
        tokenizer.convert_tokens_to_ids("<soa>"),
        tokenizer.convert_tokens_to_ids("<sos>")
    ], device=device).unsqueeze(0)
    
    full_input = torch.cat([input_ids, start_tokens], dim=1)
    
    with torch.no_grad():
        output = model.generate(
            full_input,
            max_new_tokens=1500,
            temperature=0.7,        
            top_p=0.85,
            repetition_penalty=1.15,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.convert_tokens_to_ids("<eos>")
        )
    
    generated_ids = output[0, full_input.shape[1]:]
    
    eos_id = tokenizer.convert_tokens_to_ids("<eos>")
    if len(generated_ids) > 0 and generated_ids[-1] == eos_id:
        generated_ids = generated_ids[:-1]
    
    # Audio is generated in frames of 7 tokens; drop any trailing partial frame.
    if len(generated_ids) % 7 != 0:
        trunc_len = (len(generated_ids) // 7) * 7
        generated_ids = generated_ids[:trunc_len]
    
    if len(generated_ids) == 0:
        print("Error: No audio generated.")
        return

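    # Shape (7, num_frames): row k holds position k of every frame.
    codes = generated_ids.reshape(-1, 7).T

    # Audio codes occupy the last 4096 token IDs of the vocabulary;
    # shift them into the 0..4095 range and clamp any stray text tokens to 0.
    snac_offset = model.config.vocab_size - 4096
    codes = codes - snac_offset
    codes = torch.clamp(codes, min=0)

    # Per frame: 1 code for layer 1, 2 codes for layer 2, 4 codes for layer 3.
    l1 = codes[0, :]
    l2 = torch.stack([codes[1, :], codes[4, :]], dim=1).flatten()
    l3 = torch.stack([codes[2, :], codes[3, :], codes[5, :], codes[6, :]], dim=1).flatten()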
    
    with torch.inference_mode():
        audio = snac.decode([l1.unsqueeze(0), l2.unsqueeze(0), l3.unsqueeze(0)])
    
    audio_tensor = audio.squeeze(0).cpu()
    torchaudio.save(output_file, audio_tensor, 24000)
    print(f"Saved to {output_file}")

generate_speech(
    text="The competition results will be announced tomorrow morning.",
    language="english",
    accent="indian",
    gender="M",
    output_file="test_english.wav"
)
```
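
The frame layout used inside `generate_speech` may be easier to follow on plain integers. The sketch below mirrors the same indexing without torch: each 7-token frame contributes one code to the first SNAC layer, two to the second, and four to the third. The helper name `split_frames` and the toy values are illustrative only and are not part of the model's API.

```python
def split_frames(codes, frame_size=7):
    """Split a flat list of audio codes into three SNAC-style layers.

    Mirrors the indexing in generate_speech: per frame, position 0 goes to
    layer 1, positions 1 and 4 to layer 2, and positions 2, 3, 5, 6 to layer 3.
    """
    # Drop a trailing partial frame, as generate_speech does.
    usable = (len(codes) // frame_size) * frame_size
    l1, l2, l3 = [], [], []
    for start in range(0, usable, frame_size):
        frame = codes[start:start + frame_size]
        l1.append(frame[0])
        l2.extend([frame[1], frame[4]])
        l3.extend([frame[2], frame[3], frame[5], frame[6]])
    return l1, l2, l3


# Two full frames plus one leftover token (99), which is discarded.
toy = [0, 1, 2, 3, 4, 5, 6, 10, 11, 12, 13, 14, 15, 16, 99]
l1, l2, l3 = split_frames(toy)
print(l1)  # [0, 10]
print(l2)  # [1, 4, 11, 14]
print(l3)  # [2, 3, 5, 6, 12, 13, 15, 16]
```

For N complete frames the three layers have lengths N, 2N, and 4N, which is the shape `snac.decode` receives in the usage code above.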

## Examples

**Basic English synthesis:**
```python
generate_speech("Hello world, this is a test.", language="english", accent="indian", gender="M")
```

**Gujarati synthesis:**
```python
generate_speech("નમસ્તે, તમે કેમ છો?", language="gujarati", accent="gujarati", gender="F")
```

## Audio Samples

Here are some samples generated by the model.

| Description | Speaker | Audio |
|:--- |:--- |:--- |
| **Indian English**<br>Standard Generation | Male (`SPK_EN_M_001`) | <audio controls src="https://huggingface.co/AryanNsc/IND-QWENTTS-V1/resolve/main/samples/output_01_standard.wav"></audio> |
| **Indian English**<br>Long Narrative | Female (`SPK_EN_F_001`) | <audio controls src="https://huggingface.co/AryanNsc/IND-QWENTTS-V1/resolve/main/samples/output_02_long.wav"></audio> |
| **Gujarati**<br>Native Speech | Female (`SPK_GU_F_001`) | <audio controls src="https://huggingface.co/AryanNsc/IND-QWENTTS-V1/resolve/main/samples/output_03_gujarati.wav"></audio> |

## Parameters

- `text`: Text to synthesize
- `language`: `"english"` or `"gujarati"`
- `accent`: `"indian"` or `"gujarati"`
- `gender`: `"M"` (male) or `"F"` (female)
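- `speaker`: Optional speaker ID from the Supported Voices table; if omitted, it is selected from `language` and `gender`
- `output_file`: Path of the WAV file to write (default `"output.wav"`, saved at 24 kHz)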
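
To make the mapping from these parameters to the model prompt concrete, here is a small sketch that reproduces the prompt string and default speaker selection from `generate_speech`. The helper name `build_prompt` and the input validation are assumptions added for illustration: the original function does not validate its arguments and silently falls back to `SPK_EN_M_001` for unknown combinations.

```python
# Default speaker per (language, gender), copied from the Supported Voices table.
SPEAKERS = {
    ("english", "M"): "SPK_EN_M_001",
    ("english", "F"): "SPK_EN_F_001",
    ("gujarati", "M"): "SPK_GU_M_001",
    ("gujarati", "F"): "SPK_GU_F_001",
}
ACCENTS = {"indian", "gujarati"}


def build_prompt(text, language="english", accent="indian", gender="M", speaker=None):
    """Return the tagged prompt string in the format used by generate_speech.

    Hypothetical helper: raises ValueError on unsupported values instead of
    falling back to a default speaker.
    """
    if (language, gender) not in SPEAKERS:
        raise ValueError(f"unsupported language/gender: {language!r}, {gender!r}")
    if accent not in ACCENTS:
        raise ValueError(f"unsupported accent: {accent!r}")
    if speaker is None:
        speaker = SPEAKERS[(language, gender)]
    return (
        f"<lang>{language}</lang><accent>{accent}</accent>"
        f"<gender>{gender}</gender><speaker>{speaker}</speaker> {text}"
    )


print(build_prompt("Hello world", gender="F"))
# <lang>english</lang><accent>indian</accent><gender>F</gender><speaker>SPK_EN_F_001</speaker> Hello world
```

Failing early on a typo such as `language="English"` avoids generating audio with an unintended default voice.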

## Training Code

Training pipeline and scripts will be open-sourced soon.

## Citation

```bibtex
@misc{ind-qwentts-2025,
  title={Ind-QwenTTS: Multilingual Accent-Aware TTS},
  author={Aryan Purohit},
  year={2025}
}
```