Update: Simplify the AI engine to use only `ai4bharat/indicwav2vec-hindi` for ASR.
- MODEL_SUMMARY.md +130 -0
- TRANSCRIPT_DEBUG.md +213 -0
- diagnosis/ai_engine/detect_stuttering.py +87 -212
MODEL_SUMMARY.md
ADDED
@@ -0,0 +1,130 @@
# AI Engine Model Summary

## Simplified ASR-Only Configuration

This engine has been simplified to use **ONLY** the IndicWav2Vec Hindi model for Automatic Speech Recognition (ASR).

---

## Active Model

### 1. IndicWav2Vec Hindi (Primary & Only Model)
- **Model ID**: `ai4bharat/indicwav2vec-hindi`
- **Type**: `Wav2Vec2ForCTC`
- **Purpose**: Automatic Speech Recognition (ASR) for Hindi and Indian languages
- **Status**: ✅ Active - loaded at startup
- **Location**: `detect_stuttering.py` lines 26, 148-156
- **Authentication**: Requires the `HF_TOKEN` environment variable

**Features:**
- Speech-to-text transcription
- Confidence scoring from model predictions
- Text-based stutter analysis (simple repetition detection)

---

## Removed Models

The following models have been **removed** to simplify the engine:

1. ❌ **MMS Language Identification (LID)** - `facebook/mms-lid-126`
   - Previously used for language detection
   - No longer needed - IndicWav2Vec handles Hindi natively

2. ❌ **Isolation Forest** (sklearn)
   - Previously used for anomaly detection
   - Removed - replaced by simple text-based analysis

---

## Removed Libraries

The following signal processing libraries are no longer used:

- ❌ `parselmouth` (Praat) - voice quality analysis
- ❌ `fastdtw` - repetition detection via DTW
- ❌ `sklearn` - machine learning algorithms
- ❌ Complex acoustic feature extraction (MFCC, formants, etc.)

---

## Current Pipeline

```
Audio Input
    ↓
IndicWav2Vec Hindi ASR
    ↓
Text Transcription
    ↓
Basic Text Analysis
    ↓
Results (transcript + simple stutter detection)
```
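The ASR step of this pipeline boils down to a few lines of `transformers` code. The sketch below is illustrative, not the engine's actual implementation (which lives in `detect_stuttering.py`); note that TRANSCRIPT_DEBUG.md discusses a tokenizer fallback for cases where `processor.batch_decode()` misbehaves with this model.

```python
# Minimal, illustrative sketch of the ASR step (not the engine's exact code).
# Assumes HF_TOKEN is set if the model repo requires authenticated access.
import os

import librosa
import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

MODEL_ID = "ai4bharat/indicwav2vec-hindi"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(MODEL_ID, token=os.getenv("HF_TOKEN"))
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID, token=os.getenv("HF_TOKEN")).to(DEVICE)

def transcribe(path: str) -> str:
    audio, sr = librosa.load(path, sr=16000)  # resample to the 16 kHz the model expects
    inputs = processor(audio, sampling_rate=sr, return_tensors="pt").to(DEVICE)
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_ids = torch.argmax(logits, dim=-1)  # greedy CTC decoding
    return processor.batch_decode(predicted_ids)[0]
```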
---

## API Response Format

The simplified engine returns:

```json
{
  "actual_transcript": "transcribed text",
  "target_transcript": "expected text (if provided)",
  "mismatched_chars": ["timestamps of low confidence regions"],
  "mismatch_percentage": 0.0,
  "ctc_loss_score": 0.0,
  "stutter_timestamps": [{"type": "repetition", "start": 0.0, "end": 0.5, ...}],
  "total_stutter_duration": 0.0,
  "stutter_frequency": 0.0,
  "severity": "none|mild|moderate|severe",
  "confidence_score": 0.8,
  "speaking_rate_sps": 0.0,
  "analysis_duration_seconds": 0.0,
  "model_version": "indicwav2vec-hindi-asr-v1"
}
```
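A minimal client for consuming this response might look like the sketch below. The host, port, and form field names follow the curl example in TRANSCRIPT_DEBUG.md and may differ per deployment.

```python
# Illustrative client for the engine's /analyze endpoint (sketch, not shipped code).
import requests

with open("test.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:7860/analyze",
        files={"audio": ("test.wav", f, "audio/wav")},
        data={"transcript": "expected text", "language": "hin"},
        timeout=120,
    )
resp.raise_for_status()
result = resp.json()
print(result["actual_transcript"])           # ASR transcription
print(result["severity"], result["confidence_score"])
```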
---

## Dependencies

**Required:**
- `transformers` 4.35.0 - IndicWav2Vec model
- `torch` 2.0.1 - PyTorch backend
- `librosa` ≥0.10.0 - audio loading (16 kHz resampling)
- `numpy` - array operations

**Optional (for legacy methods, not used in ASR mode):**
- `parselmouth` - voice quality (not used)
- `fastdtw` - DTW algorithm (not used)
- `sklearn` - ML algorithms (not used)
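A one-line install matching the pins above (sketch; exact pins may differ per deployment):

```bash
pip install "transformers==4.35.0" "torch==2.0.1" "librosa>=0.10.0" numpy
```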
---

## Usage

```python
from diagnosis.ai_engine.detect_stuttering import get_stutter_detector

detector = get_stutter_detector()
result = detector.analyze_audio(
    audio_path="path/to/audio.wav",
    proper_transcript="expected text",  # optional
    language="hindi"                    # default: hindi
)

print(result['actual_transcript'])  # ASR transcription
```

---

## Notes

- The engine focuses **only** on ASR transcription
- Stutter detection is simplified to text-based repetition analysis
- No complex acoustic feature extraction
- Faster and lighter than the previous multi-model approach
- Optimized for Hindi but can handle other Indian languages
TRANSCRIPT_DEBUG.md
ADDED
@@ -0,0 +1,213 @@
# Transcript Debugging Guide

## Issue: Empty Transcripts ("No transcript available")

## Complete Flow Analysis

### 1. Django App → API Request (`slaq-version-c/diagnosis/ai_engine/detect_stuttering.py`)

**Location:** Lines 269-274
```python
response = requests.post(
    self.api_url,
    files=files,
    data={
        "transcript": proper_transcript if proper_transcript else "",
        "language": lang_code,
    },
    timeout=self.api_timeout
)
```

**Status:** ✅ Sends the transcript parameter correctly

---

### 2. API Receives Request (`slaq-version-c-ai-enginee/app.py`)

**Location:** Lines 70-73
```python
@app.post("/analyze")
async def analyze_audio(
    audio: UploadFile = File(...),
    transcript: str = Form("")  # ✅ Fixed: now uses Form() for multipart
):
```

**Status:** ✅ Fixed - now correctly receives the transcript via Form()

---

### 3. API Calls Model (`slaq-version-c-ai-enginee/app.py`)

**Location:** Line 106
```python
result = detector.analyze_audio(temp_file, transcript)
```

**Status:** ✅ Passes the transcript correctly

---

### 4. Model Transcribes Audio (`slaq-version-c-ai-enginee/diagnosis/ai_engine/detect_stuttering.py`)

**Location:** Lines 313-369 (`_transcribe_with_timestamps`)

**Potential Issues:**
- ❌ IndicWav2Vec decoding might not work with `processor.batch_decode()`
- ❌ May need to use the tokenizer directly
- ❌ Model might not be producing valid predictions

**Status:** ⚠️ **LIKELY ISSUE HERE** - the decoding method may be incorrect (see the fallback sketch under "Fix Applied" below)
| 62 |
+
|
| 63 |
+
---
|
| 64 |
+
|
| 65 |
+
### 5. Model Returns Result (`slaq-version-c-ai-enginee/diagnosis/ai_engine/detect_stuttering.py`)
|
| 66 |
+
|
| 67 |
+
**Location:** Line 787-794
|
| 68 |
+
```python
|
| 69 |
+
actual_transcript = transcript if transcript else ""
|
| 70 |
+
target_transcript = proper_transcript if proper_transcript else transcript if transcript else ""
|
| 71 |
+
|
| 72 |
+
return {
|
| 73 |
+
'actual_transcript': actual_transcript,
|
| 74 |
+
'target_transcript': target_transcript,
|
| 75 |
+
...
|
| 76 |
+
}
|
| 77 |
+
```
|
| 78 |
+
|
| 79 |
+
**Status:** โ
Returns transcripts correctly (if transcript is not empty)
|
| 80 |
+
|
| 81 |
+
---
|
| 82 |
+
|
| 83 |
+
### 6. API Returns Response (`slaq-version-c-ai-enginee/app.py`)
|
| 84 |
+
|
| 85 |
+
**Location:** Line 109-113
|
| 86 |
+
```python
|
| 87 |
+
actual = result.get('actual_transcript', '')
|
| 88 |
+
target = result.get('target_transcript', '')
|
| 89 |
+
logger.info(f"๐ Result transcripts - Actual: '{actual[:100]}' (len: {len(actual)}), Target: '{target[:100]}' (len: {len(target)})")
|
| 90 |
+
return result
|
| 91 |
+
```
|
| 92 |
+
|
| 93 |
+
**Status:** โ
Returns JSON with transcripts
|
| 94 |
+
|
| 95 |
+
---
|
| 96 |
+
|
| 97 |
+
### 7. Django Receives Response (`slaq-version-c/diagnosis/ai_engine/detect_stuttering.py`)
|
| 98 |
+
|
| 99 |
+
**Location:** Line 279-410
|
| 100 |
+
```python
|
| 101 |
+
result = response.json()
|
| 102 |
+
# ... formatting ...
|
| 103 |
+
actual_transcript = str(api_result.get('actual_transcript', '')).strip()
|
| 104 |
+
target_transcript = str(api_result.get('target_transcript', '')).strip()
|
| 105 |
+
```
|
| 106 |
+
|
| 107 |
+
**Status:** โ
Extracts transcripts correctly
|
| 108 |
+
|
| 109 |
+
---
|
| 110 |
+
|
| 111 |
+
### 8. Django Saves to Database (`slaq-version-c/diagnosis/tasks.py`)
|
| 112 |
+
|
| 113 |
+
**Location:** Line 141-142
|
| 114 |
+
```python
|
| 115 |
+
actual_transcript=actual_transcript,
|
| 116 |
+
target_transcript=target_transcript,
|
| 117 |
+
```
|
| 118 |
+
|
| 119 |
+
**Status:** โ
Saves correctly
|
| 120 |
+
|
| 121 |
+
---
|
| 122 |
+
|
| 123 |
+
## Root Cause Analysis
|
| 124 |
+
|
| 125 |
+
### Most Likely Issue: Transcription Decoding
|
| 126 |
+
|
| 127 |
+
The IndicWav2Vec model (`ai4bharat/indicwav2vec-hindi`) may require:
|
| 128 |
+
1. **Direct tokenizer access** instead of `processor.batch_decode()`
|
| 129 |
+
2. **CTC decoding** with proper tokenizer
|
| 130 |
+
3. **Special handling** for Indic scripts
|
| 131 |
+
|
| 132 |
+
### Fix Applied
|
| 133 |
+
|
| 134 |
+
Updated `_transcribe_with_timestamps()` to:
|
| 135 |
+
1. Try multiple decoding methods
|
| 136 |
+
2. Use tokenizer directly if available
|
| 137 |
+
3. Add comprehensive error logging
|
| 138 |
+
4. Log predicted IDs for debugging
|
| 139 |
+
|
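A minimal sketch of this fallback strategy (illustrative; `decode_predictions` is a hypothetical helper, not the exact code in `_transcribe_with_timestamps()`):

```python
# Illustrative decoding fallback: try processor.batch_decode first, then the
# underlying CTC tokenizer, and log the raw predicted IDs so failures are debuggable.
import torch

def decode_predictions(processor, logits, logger):
    predicted_ids = torch.argmax(logits, dim=-1)
    logger.info(f"Predicted IDs (first 20): {predicted_ids[0][:20].tolist()}")
    try:
        text = processor.batch_decode(predicted_ids)[0]
    except Exception as e:
        logger.warning(f"processor.batch_decode failed: {e}; falling back to tokenizer")
        text = processor.tokenizer.batch_decode(predicted_ids)[0]
    if not text.strip():
        logger.warning("Decoded transcript is empty - check model output and vocabulary")
    return text
```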
---

## Debugging Steps

### 1. Check API Logs

When processing audio, look for:
```
📝 Transcribed text: '...' (length: X)
📝 Final return - Actual: '...' (len: X), Target: '...' (len: Y)
🔍 Result transcripts - Actual: '...' (len: X), Target: '...' (len: Y)
```

### 2. Check Django Logs

Look for:
```
📝 Final transcripts - Actual: X chars, Target: Y chars
📝 Saving transcripts - Actual: X chars, Target: Y chars
```

### 3. Check Database

Query the `AnalysisResult` table:
```sql
SELECT actual_transcript, target_transcript, LENGTH(actual_transcript) AS actual_len, LENGTH(target_transcript) AS target_len
FROM diagnosis_analysisresult
ORDER BY created_at DESC LIMIT 5;
```

### 4. Test API Directly

```bash
curl -X POST "http://localhost:7860/analyze" \
  -F "[email protected]" \
  -F "transcript=test transcript" \
  -F "language=hin"
```

Check the response JSON for `actual_transcript` and `target_transcript`.

---

## Next Steps

1. **Rebuild the Docker image** with the latest changes
2. **Check logs** during audio processing
3. **Verify the processor structure** - the logs will show the processor attributes
4. **Test with Hindi audio** - the model is optimized for Hindi
5. **Check that the model loads correctly** - verify HF_TOKEN is working (see the sanity-check sketch below)
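A quick sanity check for step 5 (a sketch; assumes network access to the Hugging Face Hub):

```python
# Verify HF_TOKEN works and the model produces predictions on a second of silence.
import os

import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

token = os.getenv("HF_TOKEN")
assert token, "HF_TOKEN is not set"

processor = AutoProcessor.from_pretrained("ai4bharat/indicwav2vec-hindi", token=token)
model = Wav2Vec2ForCTC.from_pretrained("ai4bharat/indicwav2vec-hindi", token=token)
print("vocab size:", processor.tokenizer.vocab_size)

dummy = torch.zeros(16000).numpy()  # one second of silence at 16 kHz
inputs = processor(dummy, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    ids = torch.argmax(model(**inputs).logits, dim=-1)
print("predicted ids:", ids[0][:10].tolist())  # silence should decode to mostly blanks
```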
---

## Expected Log Output (Success)

```
🚀 Initializing Advanced AI Engine on cpu...
✅ HF_TOKEN found - using authenticated model access
🔍 Processor type: <class 'transformers.models.wav2vec2.processing_wav2vec2.Wav2Vec2Processor'>
🔍 Processor attributes: ['batch_decode', 'decode', 'feature_extractor', 'tokenizer', ...]
🔍 Tokenizer type: <class 'transformers.models.wav2vec2.tokenization_wav2vec2.Wav2Vec2CTCTokenizer'>
📝 Transcribed text: 'नमस्ते मैं हिंदी बोल रहा हूं' (length: 25)
📝 Final return - Actual: 'नमस्ते मैं हिंदी बोल रहा हूं' (len: 25), Target: '...' (len: X)
```

---

## If Still Empty

1. **Model may not be loaded correctly** - check HF_TOKEN
2. **Audio format issue** - ensure 16 kHz mono WAV (an ffmpeg one-liner follows below)
3. **Model not producing predictions** - check the predicted IDs in the logs
4. **Tokenizer mismatch** - IndicWav2Vec may need special tokenizer initialization
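For item 2, any input can be re-encoded to the expected format with ffmpeg:

```bash
# Re-encode any input as 16 kHz mono PCM WAV before sending it to the engine
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le test.wav
```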
diagnosis/ai_engine/detect_stuttering.py
CHANGED

@@ -2,29 +2,18 @@
 import os
 import librosa
 import torch
-import torchaudio
-import torch.nn as nn
 import logging
 import numpy as np
-import …
-from transformers import Wav2Vec2ForCTC, AutoProcessor, Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
+from transformers import Wav2Vec2ForCTC, AutoProcessor
 import time
-from collections import Counter
 from dataclasses import dataclass, field
 from typing import List, Dict, Any, Tuple
-
-from scipy.spatial.distance import euclidean, cosine
-from scipy.spatial import ConvexHull
-from scipy.stats import kurtosis, skew
-from fastdtw import fastdtw
-from sklearn.preprocessing import StandardScaler
-from sklearn.ensemble import IsolationForest
+# Simplified: Only using ASR transcription, removed complex signal processing libraries
 
 logger = logging.getLogger(__name__)
 
 # === CONFIGURATION ===
-MODEL_ID = "ai4bharat/indicwav2vec-hindi"
-LID_MODEL_ID = "facebook/mms-lid-126"
+MODEL_ID = "ai4bharat/indicwav2vec-hindi"  # Only model used - IndicWav2Vec Hindi for ASR
 DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
 HF_TOKEN = os.getenv("HF_TOKEN")  # Hugging Face token for authenticated model access

@@ -85,56 +74,18 @@ class StutterEvent:
 
 class AdvancedStutterDetector:
     """
-    …
-    • Energy entropy - signal chaos measurement
-
-    [2] VOICE QUALITY METRICS (Parselmouth/Praat):
-    • Jitter (>1% threshold) - pitch perturbation
-    • Shimmer (>3% threshold) - amplitude perturbation
-    • HNR (<15 dB threshold) - harmonics-to-noise ratio
-
-    [3] FORMANT ANALYSIS (Vowel Space):
-    • Untreated stutterers show 70% vowel space reduction
-    • F1-F2 centralization indicates restricted articulation
-    • Post-treatment: vowel space normalizes
-
-    [4] DETECTION ALGORITHMS:
-    • Prolongation: Spectral correlation >0.9 for >250ms
-    • Blocks: Silence gaps >350ms mid-utterance
-    • Repetitions: DTW distance <0.15 + text matching
-    • Dysfluency: Entropy >3.5 or confidence <0.4
-
-    [5] ENSEMBLE DECISION FUSION:
-    • Multi-layer cascade: Block > Repetition > Prolongation
-    • Anomaly detection (Isolation Forest) for outliers
-    • Speaking-rate normalization for adaptive thresholds
-
-    ───────────────────────────────────────────────────────────
-    KEY IMPROVEMENTS FROM ORIGINAL CODE:
-    ───────────────────────────────────────────────────────────
-
-    ✅ Praat-based voice quality analysis (jitter/shimmer/HNR)
-    ✅ Formant tracking with vowel space area calculation
-    ✅ Zero-crossing rate for phonation analysis
-    ✅ Spectral flux for rapid acoustic changes
-    ✅ Enhanced entropy calculation with frame-level detail
-    ✅ Isolation Forest anomaly detection
-    ✅ Multi-feature fusion with weighted scoring
-    ✅ Adaptive thresholds based on speaking rate
-    ✅ Comprehensive clinical severity mapping
-
-    ───────────────────────────────────────────────────────────
+    🎤 IndicWav2Vec Hindi ASR Engine
+
+    Simplified engine using ONLY ai4bharat/indicwav2vec-hindi for Automatic Speech Recognition.
+
+    Features:
+    - Speech-to-text transcription using IndicWav2Vec Hindi model
+    - Text-based stutter analysis from transcription
+    - Confidence scoring from model predictions
+    - Basic dysfluency detection from transcript patterns
+
+    Model: ai4bharat/indicwav2vec-hindi (Wav2Vec2ForCTC)
+    Purpose: Automatic Speech Recognition (ASR) for Hindi and Indian languages
     """
 
     def __init__(self):

@@ -158,63 +109,29 @@ class AdvancedStutterDetector:
 
             # Debug: Log processor structure
             logger.info(f"🔍 Processor type: {type(self.processor)}")
-            logger.info(f"🔍 Processor attributes: {[attr for attr in dir(self.processor) if not attr.startswith('_')]}")
             if hasattr(self.processor, 'tokenizer'):
                 logger.info(f"🔍 Tokenizer type: {type(self.processor.tokenizer)}")
             if hasattr(self.processor, 'feature_extractor'):
                 logger.info(f"🔍 Feature extractor type: {type(self.processor.feature_extractor)}")
-            self.loaded_adapters = set()  # Keep for backward compatibility but not used with indicwav2vec
-
-            # Anomaly Detection Model (for outlier stutter events)
-            self.anomaly_detector = IsolationForest(
-                contamination=0.1,  # Expect 10% of frames to be anomalous
-                random_state=42
-            )
 
-            logger.info("✅ …")
+            logger.info("✅ IndicWav2Vec Hindi ASR Engine Loaded")
         except Exception as e:
             logger.error(f"💥 Engine Failure: {e}")
             raise
 
     def _init_common_adapters(self):
-        """…"""
-        # IndicWav2Vec Hindi model is pre-trained for Hindi, no adapters needed
+        """Not applicable - IndicWav2Vec Hindi doesn't use adapters"""
         pass
 
-    def _detect_language_robust(self, audio_path: str) -> str:
-        """Detect language using MMS LID model"""
-        try:
-            from transformers import Wav2Vec2ForSequenceClassification
-            lid_model = Wav2Vec2ForSequenceClassification.from_pretrained(
-                LID_MODEL_ID,
-                token=HF_TOKEN
-            ).to(DEVICE)
-            lid_processor = AutoFeatureExtractor.from_pretrained(
-                LID_MODEL_ID,
-                token=HF_TOKEN
-            )
-
-            audio, sr = librosa.load(audio_path, sr=16000)
-            inputs = lid_processor(audio, sampling_rate=16000, return_tensors="pt").to(DEVICE)
-
-            with torch.no_grad():
-                outputs = lid_model(**inputs)
-                predicted_id = torch.argmax(outputs.logits, dim=-1).item()
-
-            # Map to language code (simplified - would need actual label mapping)
-            return 'eng'  # Default fallback
-        except Exception as e:
-            logger.warning(f"Language detection failed: {e}, defaulting to 'eng'")
-            return 'eng'
-
     def _activate_adapter(self, lang_code: str):
-        """…"""
-        # Log for debugging but no action required
-        if lang_code != 'hin':
-            logger.info(f"Note: Using Hindi-specific model (indicwav2vec-hindi), language code '{lang_code}' requested but model is optimized for Hindi")
+        """Not applicable - IndicWav2Vec Hindi doesn't use adapters"""
+        logger.info(f"Using IndicWav2Vec Hindi model (optimized for Hindi)")
         pass
 
+    # ===== LEGACY METHODS (NOT USED IN ASR-ONLY MODE) =====
+    # These methods are kept for reference but not called in the simplified ASR pipeline
+    # They require additional libraries (parselmouth, fastdtw, sklearn) that are not needed for ASR-only mode
+
     def _extract_comprehensive_features(self, audio: np.ndarray, sr: int, audio_path: str) -> Dict[str, Any]:
         """Extract multi-modal acoustic features"""
         features = {}

@@ -708,131 +625,89 @@ class AdvancedStutterDetector:
         }
 
 
-    def analyze_audio(self, audio_path: str, proper_transcript: str = "", language: str = 'auto') -> dict:
+    def analyze_audio(self, audio_path: str, proper_transcript: str = "", language: str = 'hindi') -> dict:
         """
-        Main analysis pipeline
+        Main ASR analysis pipeline using IndicWav2Vec Hindi model
+
+        Focus: Automatic Speech Recognition (ASR) transcription only
         """
         start_time = time.time()
 
-        # === STEP 1: …
-        # Note: indicwav2vec-hindi is optimized for Hindi, but can handle other languages
-        if language == 'auto':
-            lang_code = self._detect_language_robust(audio_path)
-        else:
-            lang_code = INDIAN_LANGUAGES.get(language.lower(), 'hin')  # Default to Hindi for indicwav2vec
-        self._activate_adapter(lang_code)
-
-        # === STEP 2: Audio Loading & Preprocessing ===
+        # === STEP 1: Audio Loading & Preprocessing ===
         audio, sr = librosa.load(audio_path, sr=16000)
         duration = librosa.get_duration(y=audio, sr=sr)
 
-        # === STEP 3: …
-        features = self._extract_comprehensive_features(audio, sr, audio_path)
-
-        # === STEP 4: Wav2Vec2 Transcription & Uncertainty ===
+        # === STEP 2: ASR Transcription using IndicWav2Vec Hindi ===
         transcript, word_timestamps, logits = self._transcribe_with_timestamps(audio)
-        logger.info(f"📝 …")
+        logger.info(f"📝 ASR Transcription: '{transcript}' (length: {len(transcript)}, words: {len(word_timestamps)})")
+
+        # === STEP 3: Calculate Confidence from Model Predictions ===
         entropy_score, low_conf_regions = self._calculate_uncertainty(logits)
+        avg_confidence = 1.0 - (entropy_score / 10.0) if entropy_score > 0 else 0.8
+        avg_confidence = max(0.0, min(1.0, avg_confidence))
 
-        # === STEP 5: …
-        …
-
-        # === STEP 6: Multi-Layer Stutter Detection ===
+        # === STEP 4: Basic Text-based Analysis ===
+        # Simple text-based stutter detection (repetitions, hesitations)
         events = []
+        if transcript:
+            words = transcript.split()
+            # Detect word repetitions
+            for i in range(len(words) - 1):
+                if words[i] == words[i+1] and i < len(word_timestamps) - 1:
+                    events.append(StutterEvent(
+                        type='repetition',
+                        start=word_timestamps[i]['start'] if i < len(word_timestamps) else 0,
+                        end=word_timestamps[i+1]['end'] if i+1 < len(word_timestamps) else 0,
+                        text=words[i],
+                        confidence=0.7
+                    ))
 
-        # Layer A: …
-        events.extend(self._detect_prolongations_advanced(
-            features['mfcc'],
-            features['spectral_flux'],
-            speaking_rate,
-            word_timestamps
-        ))
-
-        # Layer B: Silence Block Detection
-        events.extend(self._detect_blocks_enhanced(
-            audio, sr,
-            features['rms_energy'],
-            features['zcr'],
-            word_timestamps,
-            speaking_rate
-        ))
-
-        # Layer C: DTW-Based Repetition Detection
-        events.extend(self._detect_repetitions_advanced(
-            features['mfcc'],
-            features['formants'],
-            word_timestamps,
-            transcript,
-            speaking_rate
-        ))
-
-        # Layer D: Voice Quality Dysfluencies (Jitter/Shimmer)
-        events.extend(self._detect_voice_quality_issues(
-            audio_path,
-            word_timestamps,
-            features['voice_quality']
-        ))
-
-        # Layer E: Entropy-Based Uncertainty Events
-        for region in low_conf_regions:
-            if not self._is_overlapping(region['time'], events):
-                events.append(StutterEvent(
-                    type='dysfluency',
-                    start=region['time'],
-                    end=region['time'] + 0.3,
-                    text="<uncertainty>",
-                    confidence=0.4,
-                    acoustic_features={'entropy': entropy_score}
-                ))
-
-        # Layer F: Anomaly Detection (Isolation Forest)
-        events = self._detect_anomalies(events, features)
-
-        # === STEP 7: Event Fusion & Deduplication ===
-        cleaned_events = self._deduplicate_events_cascade(events)
-
-        # === STEP 8: Clinical Metrics & Severity Assessment ===
-        metrics = self._calculate_clinical_metrics(
-            cleaned_events,
-            duration,
-            speaking_rate,
-            features
-        )
-
-        # Severity upgrade if global confidence is very low
-        if metrics['confidence'] < 0.6 and metrics['severity_label'] == 'none':
-            metrics['severity_label'] = 'mild'
-            metrics['severity_score'] = max(metrics['severity_score'], 5.0)
-
-        # === STEP 9: Return Comprehensive Report ===
-        # Ensure transcripts are not None
+        # Add low confidence regions as potential dysfluencies
+        for region in low_conf_regions[:5]:  # Limit to first 5
+            events.append(StutterEvent(
+                type='dysfluency',
+                start=region['time'],
+                end=region['time'] + 0.3,
+                text="<uncertainty>",
+                confidence=0.4,
+                acoustic_features={'entropy': entropy_score}
+            ))
+
+        # === STEP 5: Calculate Basic Metrics ===
+        total_duration = sum(e.end - e.start for e in events)
+        frequency = (len(events) / duration * 60) if duration > 0 else 0
+        stutter_percentage = (total_duration / duration * 100) if duration > 0 else 0
+
+        # Simple severity assessment
+        if stutter_percentage < 5:
+            severity = 'none'
+        elif stutter_percentage < 15:
+            severity = 'mild'
+        elif stutter_percentage < 30:
+            severity = 'moderate'
+        else:
+            severity = 'severe'
+
+        # === STEP 6: Return ASR Results ===
         actual_transcript = transcript if transcript else ""
-        target_transcript = proper_transcript if proper_transcript else …
+        target_transcript = proper_transcript if proper_transcript else ""
 
-        logger.info(f"📝 Final …")
+        logger.info(f"📝 Final ASR result - Actual: '{actual_transcript}' (len: {len(actual_transcript)}), Target: '{target_transcript}' (len: {len(target_transcript)})")
 
         return {
             'actual_transcript': actual_transcript,
             'target_transcript': target_transcript,
-            'mismatched_chars': [f"{r['time']}s" for r in low_conf_regions],
-            'mismatch_percentage': …,
+            'mismatched_chars': [f"{r['time']:.2f}s" for r in low_conf_regions[:10]],
+            'mismatch_percentage': round(stutter_percentage, 2),
             'ctc_loss_score': round(entropy_score, 4),
-            'stutter_timestamps': [self._event_to_dict(e) for e in …],
-            'total_stutter_duration': …,
-            'stutter_frequency': …,
-            'severity': …,
-            'confidence_score': …,
-            'speaking_rate_sps': round(…),
-            'voice_quality_metrics': features['voice_quality'],
-            'formant_analysis': features['formant_summary'],
-            'acoustic_features': {
-                'avg_mfcc_variance': float(np.var(features['mfcc'])),
-                'avg_zcr': float(np.mean(features['zcr'])),
-                'spectral_flux_mean': float(np.mean(features['spectral_flux'])),
-                'energy_entropy': float(np.mean(features['energy_entropy']))
-            },
+            'stutter_timestamps': [self._event_to_dict(e) for e in events],
+            'total_stutter_duration': round(total_duration, 2),
+            'stutter_frequency': round(frequency, 2),
+            'severity': severity,
+            'confidence_score': round(avg_confidence, 2),
+            'speaking_rate_sps': round(len(word_timestamps) / duration if duration > 0 else 0, 2),
             'analysis_duration_seconds': round(time.time() - start_time, 2),
-            'model_version': …
+            'model_version': 'indicwav2vec-hindi-asr-v1'
         }