anfastech committed on
Commit
e7e9fa8
·
1 Parent(s): 439ae4d

Update: Simplify the AI engine to use only ai4bharat/indicwav2vec-hindi for ASR.

MODEL_SUMMARY.md ADDED
@@ -0,0 +1,130 @@
+ # AI Engine Model Summary
+
+ ## Simplified ASR-Only Configuration
+
+ This engine has been simplified to use **ONLY** the IndicWav2Vec Hindi model for Automatic Speech Recognition (ASR).
+
+ ---
+
+ ## Active Model
+
+ ### 1. IndicWav2Vec Hindi (Primary & Only Model)
+ - **Model ID**: `ai4bharat/indicwav2vec-hindi`
+ - **Type**: `Wav2Vec2ForCTC`
+ - **Purpose**: Automatic Speech Recognition (ASR) for Hindi and Indian languages
+ - **Status**: ✅ Active - Loaded at startup
+ - **Location**: `detect_stuttering.py` lines 26, 148-156
+ - **Authentication**: Requires `HF_TOKEN` environment variable
+
+ **Features:**
+ - Speech-to-text transcription
+ - Confidence scoring from model predictions (see the sketch below)
+ - Text-based stutter analysis (simple repetition detection)
+
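+ The confidence score is derived from the mean entropy of the model's CTC output distribution; a minimal sketch of the mapping (mirroring the logic in `analyze_audio`):
+
+ ```python
+ # Map mean CTC entropy to a 0-1 confidence value (higher entropy = lower confidence).
+ def confidence_from_entropy(entropy_score: float) -> float:
+     confidence = 1.0 - (entropy_score / 10.0) if entropy_score > 0 else 0.8
+     return max(0.0, min(1.0, confidence))  # clamp to [0, 1]
+ ```
+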
+ ---
+
+ ## Removed Models
+
+ The following models have been **removed** to simplify the engine:
+
+ 1. ❌ **MMS Language Identification (LID)** - `facebook/mms-lid-126`
+    - Previously used for language detection
+    - No longer needed - IndicWav2Vec handles Hindi natively
+
+ 2. ❌ **Isolation Forest** (sklearn)
+    - Previously used for anomaly detection
+    - Removed - using simple text-based analysis instead
+
+ ---
+
+ ## Removed Libraries
+
+ The following signal processing libraries are no longer used:
+
+ - ❌ `parselmouth` (Praat) - Voice quality analysis
+ - ❌ `fastdtw` - Repetition detection via DTW
+ - ❌ `sklearn` - Machine learning algorithms
+ - ❌ Complex acoustic feature extraction (MFCC, formants, etc.)
+
+ ---
+
+ ## Current Pipeline
+
+ ```
+ Audio Input
+     ↓
+ IndicWav2Vec Hindi ASR
+     ↓
+ Text Transcription
+     ↓
+ Basic Text Analysis
+     ↓
+ Results (transcript + simple stutter detection)
+ ```
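+
+ For reference, a minimal sketch of the ASR step using the standard `transformers` Wav2Vec2 CTC interface (greedy decoding; the file path is illustrative):
+
+ ```python
+ import os
+ import librosa
+ import torch
+ from transformers import AutoProcessor, Wav2Vec2ForCTC
+
+ MODEL_ID = "ai4bharat/indicwav2vec-hindi"
+ processor = AutoProcessor.from_pretrained(MODEL_ID, token=os.getenv("HF_TOKEN"))
+ model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID, token=os.getenv("HF_TOKEN"))
+
+ audio, sr = librosa.load("sample.wav", sr=16000)  # model expects 16 kHz input
+ inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
+ with torch.no_grad():
+     logits = model(**inputs).logits               # shape: (1, time, vocab)
+ predicted_ids = torch.argmax(logits, dim=-1)      # greedy CTC decoding
+ print(processor.batch_decode(predicted_ids)[0])
+ ```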
+
+ ---
+
+ ## API Response Format
+
+ The simplified engine returns:
+
+ ```json
+ {
+     "actual_transcript": "transcribed text",
+     "target_transcript": "expected text (if provided)",
+     "mismatched_chars": ["timestamps of low confidence regions"],
+     "mismatch_percentage": 0.0,
+     "ctc_loss_score": 0.0,
+     "stutter_timestamps": [{"type": "repetition", "start": 0.0, "end": 0.5, ...}],
+     "total_stutter_duration": 0.0,
+     "stutter_frequency": 0.0,
+     "severity": "none|mild|moderate|severe",
+     "confidence_score": 0.8,
+     "speaking_rate_sps": 0.0,
+     "analysis_duration_seconds": 0.0,
+     "model_version": "indicwav2vec-hindi-asr-v1"
+ }
+ ```
+
+ ---
+
+ ## Dependencies
+
+ **Required:**
+ - `transformers` 4.35.0 - For IndicWav2Vec model
+ - `torch` 2.0.1 - PyTorch backend
+ - `librosa` ≥0.10.0 - Audio loading (16kHz resampling)
+ - `numpy` - Array operations
+
+ **Optional (for legacy methods, not used in ASR mode):**
+ - `parselmouth` - Voice quality (not used)
+ - `fastdtw` - DTW algorithm (not used)
+ - `sklearn` - ML algorithms (not used)
+
+ ---
+
+ ## Usage
+
+ ```python
+ from diagnosis.ai_engine.detect_stuttering import get_stutter_detector
+
+ detector = get_stutter_detector()
+ result = detector.analyze_audio(
+     audio_path="path/to/audio.wav",
+     proper_transcript="expected text",  # optional
+     language="hindi"  # default: hindi
+ )
+
+ print(result['actual_transcript'])  # ASR transcription
+ ```
+
+ ---
+
+ ## Notes
+
+ - The engine focuses **only** on ASR transcription
+ - Stutter detection is simplified to text-based repetition analysis
+ - No complex acoustic feature extraction
+ - Faster and lighter than the previous multi-model approach
+ - Optimized for Hindi but can handle other Indian languages
+
TRANSCRIPT_DEBUG.md ADDED
@@ -0,0 +1,213 @@
+ # Transcript Debugging Guide
+
+ ## Issue: Empty Transcripts ("No transcript available")
+
+ ## Complete Flow Analysis
+
+ ### 1. Django App → API Request (`slaq-version-c/diagnosis/ai_engine/detect_stuttering.py`)
+
+ **Location:** Lines 269-274
+ ```python
+ response = requests.post(
+     self.api_url,
+     files=files,
+     data={
+         "transcript": proper_transcript if proper_transcript else "",
+         "language": lang_code,
+     },
+     timeout=self.api_timeout
+ )
+ ```
+
+ **Status:** ✅ Sending transcript parameter correctly
+
+ ---
+
+ ### 2. API Receives Request (`slaq-version-c-ai-enginee/app.py`)
+
+ **Location:** Lines 70-73
+ ```python
+ @app.post("/analyze")
+ async def analyze_audio(
+     audio: UploadFile = File(...),
+     transcript: str = Form("")  # ✅ Fixed: Now uses Form() for multipart
+ ):
+ ```
+
+ **Status:** ✅ Fixed - Now correctly receives transcript via Form() (see the note below)
+
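+ Why the `Form()` fix matters: in FastAPI, a plain `str` parameter alongside an `UploadFile` is treated as a *query* parameter, so a multipart form field named `transcript` never reaches it. A minimal, hypothetical reproduction (endpoint names are illustrative):
+
+ ```python
+ from fastapi import FastAPI, File, Form, UploadFile
+
+ app = FastAPI()
+
+ # Bug: `transcript` is parsed as a query parameter, so the multipart
+ # form field "transcript" is silently ignored (always "").
+ @app.post("/analyze-broken")
+ async def analyze_broken(audio: UploadFile = File(...), transcript: str = ""):
+     return {"transcript": transcript}
+
+ # Fix: Form() binds `transcript` to the multipart body.
+ @app.post("/analyze-fixed")
+ async def analyze_fixed(audio: UploadFile = File(...), transcript: str = Form("")):
+     return {"transcript": transcript}
+ ```
+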
+ ---
+
+ ### 3. API Calls Model (`slaq-version-c-ai-enginee/app.py`)
+
+ **Location:** Line 106
+ ```python
+ result = detector.analyze_audio(temp_file, transcript)
+ ```
+
+ **Status:** ✅ Passing transcript correctly
+
+ ---
+
+ ### 4. Model Transcribes Audio (`slaq-version-c-ai-enginee/diagnosis/ai_engine/detect_stuttering.py`)
+
+ **Location:** Lines 313-369 (`_transcribe_with_timestamps`)
+
+ **Potential Issues:**
+ - ❓ IndicWav2Vec decoding might not work with `processor.batch_decode()`
+ - ❓ May need to use the tokenizer directly
+ - ❓ Model might not be producing valid predictions
+
+ **Status:** ⚠️ **LIKELY ISSUE HERE** - Decoding method may be incorrect
+
+ ---
+
+ ### 5. Model Returns Result (`slaq-version-c-ai-enginee/diagnosis/ai_engine/detect_stuttering.py`)
+
+ **Location:** Lines 787-794
+ ```python
+ actual_transcript = transcript if transcript else ""
+ target_transcript = proper_transcript if proper_transcript else transcript if transcript else ""
+
+ return {
+     'actual_transcript': actual_transcript,
+     'target_transcript': target_transcript,
+     ...
+ }
+ ```
+
+ **Status:** ✅ Returns transcripts correctly (if transcript is not empty)
+
+ ---
+
+ ### 6. API Returns Response (`slaq-version-c-ai-enginee/app.py`)
+
+ **Location:** Lines 109-113
+ ```python
+ actual = result.get('actual_transcript', '')
+ target = result.get('target_transcript', '')
+ logger.info(f"📝 Result transcripts - Actual: '{actual[:100]}' (len: {len(actual)}), Target: '{target[:100]}' (len: {len(target)})")
+ return result
+ ```
+
+ **Status:** ✅ Returns JSON with transcripts
+
+ ---
+
+ ### 7. Django Receives Response (`slaq-version-c/diagnosis/ai_engine/detect_stuttering.py`)
+
+ **Location:** Lines 279-410
+ ```python
+ result = response.json()
+ # ... formatting ...
+ actual_transcript = str(api_result.get('actual_transcript', '')).strip()
+ target_transcript = str(api_result.get('target_transcript', '')).strip()
+ ```
+
+ **Status:** ✅ Extracts transcripts correctly
+
+ ---
+
+ ### 8. Django Saves to Database (`slaq-version-c/diagnosis/tasks.py`)
+
+ **Location:** Lines 141-142
+ ```python
+ actual_transcript=actual_transcript,
+ target_transcript=target_transcript,
+ ```
+
+ **Status:** ✅ Saves correctly
+
+ ---
+
+ ## Root Cause Analysis
+
+ ### Most Likely Issue: Transcription Decoding
+
+ The IndicWav2Vec model (`ai4bharat/indicwav2vec-hindi`) may require:
+ 1. **Direct tokenizer access** instead of `processor.batch_decode()`
+ 2. **CTC decoding** with a proper tokenizer
+ 3. **Special handling** for Indic scripts
+
+ ### Fix Applied
+
+ Updated `_transcribe_with_timestamps()` to (sketched below):
+ 1. Try multiple decoding methods
+ 2. Use the tokenizer directly if available
+ 3. Add comprehensive error logging
+ 4. Log predicted IDs for debugging
139
+
140
+ ---
141
+
142
+ ## Debugging Steps
143
+
144
+ ### 1. Check API Logs
145
+
146
+ When processing audio, look for:
147
+ ```
148
+ ๐Ÿ“ Transcribed text: '...' (length: X)
149
+ ๐Ÿ“ Final return - Actual: '...' (len: X), Target: '...' (len: Y)
150
+ ๐Ÿ“ Result transcripts - Actual: '...' (len: X), Target: '...' (len: Y)
151
+ ```
152
+
153
+ ### 2. Check Django Logs
154
+
155
+ Look for:
156
+ ```
157
+ ๐Ÿ“ Final transcripts - Actual: X chars, Target: Y chars
158
+ ๐Ÿ“ Saving transcripts - Actual: X chars, Target: Y chars
159
+ ```
160
+
161
+ ### 3. Check Database
162
+
163
+ Query the `AnalysisResult` table:
164
+ ```sql
165
+ SELECT actual_transcript, target_transcript, LENGTH(actual_transcript) as actual_len, LENGTH(target_transcript) as target_len
166
+ FROM diagnosis_analysisresult
167
+ ORDER BY created_at DESC LIMIT 5;
168
+ ```
169
+
170
+ ### 4. Test API Directly
171
+
172
+ ```bash
173
+ curl -X POST "http://localhost:7860/analyze" \
174
175
+ -F "transcript=test transcript" \
176
+ -F "language=hin"
177
+ ```
178
+
179
+ Check the response JSON for `actual_transcript` and `target_transcript`.
180
+
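+ The same request from Python, mirroring the Django client's `requests.post` call (the path is illustrative):
+
+ ```python
+ import requests
+
+ with open("path/to/audio.wav", "rb") as f:
+     resp = requests.post(
+         "http://localhost:7860/analyze",
+         files={"audio": f},
+         data={"transcript": "test transcript", "language": "hin"},
+         timeout=120,
+     )
+ result = resp.json()
+ print(result.get("actual_transcript"), result.get("target_transcript"))
+ ```
+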
+ ---
+
+ ## Next Steps
+
+ 1. **Rebuild Docker image** with latest changes
+ 2. **Check logs** during audio processing
+ 3. **Verify processor structure** - logs will show processor attributes
+ 4. **Test with Hindi audio** - model is optimized for Hindi
+ 5. **Check if model is loaded correctly** - verify HF_TOKEN is working (see the snippet below)
+
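+ For item 5, a quick standalone check that the token works and the model loads (assumes `HF_TOKEN` is set in the environment):
+
+ ```python
+ import os
+ from transformers import AutoProcessor, Wav2Vec2ForCTC
+
+ # Fails loudly if HF_TOKEN is missing/invalid or the model can't be fetched.
+ token = os.getenv("HF_TOKEN")
+ processor = AutoProcessor.from_pretrained("ai4bharat/indicwav2vec-hindi", token=token)
+ model = Wav2Vec2ForCTC.from_pretrained("ai4bharat/indicwav2vec-hindi", token=token)
+ print("✅ Model and processor loaded")
+ ```
+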
+ ---
+
+ ## Expected Log Output (Success)
+
+ ```
+ 🚀 Initializing Advanced AI Engine on cpu...
+ ✅ HF_TOKEN found - using authenticated model access
+ 📋 Processor type: <class 'transformers.models.wav2vec2.processing_wav2vec2.Wav2Vec2Processor'>
+ 📋 Processor attributes: ['batch_decode', 'decode', 'feature_extractor', 'tokenizer', ...]
+ 📋 Tokenizer type: <class 'transformers.models.wav2vec2.tokenization_wav2vec2.Wav2Vec2CTCTokenizer'>
+ 📝 Transcribed text: 'नमस्ते मैं हिंदी बोल रहा हूं' (length: 25)
+ 📝 Final return - Actual: 'नमस्ते मैं हिंदी बोल रहा हूं' (len: 25), Target: '...' (len: X)
+ ```
+
+ ---
+
+ ## If Still Empty
+
+ 1. **Model may not be loaded correctly** - check HF_TOKEN
+ 2. **Audio format issue** - ensure 16kHz mono WAV (see the check below)
+ 3. **Model not producing predictions** - check predicted_ids in logs
+ 4. **Tokenizer mismatch** - IndicWav2Vec may need special tokenizer initialization
+
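+ For item 2, a small sanity check (librosa resamples to 16 kHz mono on load, so this mainly verifies the file decodes at all; the path is illustrative):
+
+ ```python
+ import librosa
+
+ # Load exactly as the engine does; a broken or unsupported file raises here.
+ audio, sr = librosa.load("path/to/audio.wav", sr=16000, mono=True)
+ print(f"sr={sr}, samples={len(audio)}, duration={len(audio)/sr:.2f}s")
+ ```
+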
diagnosis/ai_engine/detect_stuttering.py CHANGED
@@ -2,29 +2,18 @@
  import os
  import librosa
  import torch
- import torchaudio
- import torch.nn as nn
  import logging
  import numpy as np
- import parselmouth
- from transformers import Wav2Vec2ForCTC, AutoProcessor, Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
+ from transformers import Wav2Vec2ForCTC, AutoProcessor
  import time
- from collections import Counter
  from dataclasses import dataclass, field
- from typing import List, Dict, Any, Tuple, Optional
- from scipy.signal import correlate, butter, filtfilt
- from scipy.spatial.distance import euclidean, cosine
- from scipy.spatial import ConvexHull
- from scipy.stats import kurtosis, skew
- from fastdtw import fastdtw
- from sklearn.preprocessing import StandardScaler
- from sklearn.ensemble import IsolationForest
+ from typing import List, Dict, Any, Tuple
+ # Simplified: Only using ASR transcription, removed complex signal processing libraries
 
  logger = logging.getLogger(__name__)
 
  # === CONFIGURATION ===
- MODEL_ID = "ai4bharat/indicwav2vec-hindi"
- LID_MODEL_ID = "facebook/mms-lid-126"
+ MODEL_ID = "ai4bharat/indicwav2vec-hindi"  # Only model used - IndicWav2Vec Hindi for ASR
  DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
  HF_TOKEN = os.getenv("HF_TOKEN")  # Hugging Face token for authenticated model access
 
@@ -85,56 +74,18 @@ class StutterEvent:
 
  class AdvancedStutterDetector:
      """
-     🧠 2024-2025 State-of-the-Art Stuttering Detection Engine
-
-     ═══════════════════════════════════════════════════════
-     RESEARCH FOUNDATION (Latest Publications):
-     ═══════════════════════════════════════════════════════
-
-     [1] ACOUSTIC FEATURES:
-         • MFCC (20 coefficients) - spectral envelope
-         • Formant tracking (F1-F4) - vowel space analysis
-         • Pitch contour (F0) - intonation patterns
-         • Zero-Crossing Rate - voiced/unvoiced classification
-         • Spectral flux - rapid spectral changes
-         • Energy entropy - signal chaos measurement
-
-     [2] VOICE QUALITY METRICS (Parselmouth/Praat):
-         • Jitter (>1% threshold) - pitch perturbation
-         • Shimmer (>3% threshold) - amplitude perturbation
-         • HNR (<15 dB threshold) - harmonics-to-noise ratio
-
-     [3] FORMANT ANALYSIS (Vowel Space):
-         • Untreated stutterers show 70% vowel space reduction
-         • F1-F2 centralization indicates restricted articulation
-         • Post-treatment: vowel space normalizes
-
-     [4] DETECTION ALGORITHMS:
-         • Prolongation: Spectral correlation >0.9 for >250ms
-         • Blocks: Silence gaps >350ms mid-utterance
-         • Repetitions: DTW distance <0.15 + text matching
-         • Dysfluency: Entropy >3.5 or confidence <0.4
-
-     [5] ENSEMBLE DECISION FUSION:
-         • Multi-layer cascade: Block > Repetition > Prolongation
-         • Anomaly detection (Isolation Forest) for outliers
-         • Speaking-rate normalization for adaptive thresholds
-
-     ═══════════════════════════════════════════════════════
-     KEY IMPROVEMENTS FROM ORIGINAL CODE:
-     ═══════════════════════════════════════════════════════
-
-     ✅ Praat-based voice quality analysis (jitter/shimmer/HNR)
-     ✅ Formant tracking with vowel space area calculation
-     ✅ Zero-crossing rate for phonation analysis
-     ✅ Spectral flux for rapid acoustic changes
-     ✅ Enhanced entropy calculation with frame-level detail
-     ✅ Isolation Forest anomaly detection
-     ✅ Multi-feature fusion with weighted scoring
-     ✅ Adaptive thresholds based on speaking rate
-     ✅ Comprehensive clinical severity mapping
-
-     ═══════════════════════════════════════════════════════
+     🎤 IndicWav2Vec Hindi ASR Engine
+
+     Simplified engine using ONLY ai4bharat/indicwav2vec-hindi for Automatic Speech Recognition.
+
+     Features:
+     - Speech-to-text transcription using IndicWav2Vec Hindi model
+     - Text-based stutter analysis from transcription
+     - Confidence scoring from model predictions
+     - Basic dysfluency detection from transcript patterns
+
+     Model: ai4bharat/indicwav2vec-hindi (Wav2Vec2ForCTC)
+     Purpose: Automatic Speech Recognition (ASR) for Hindi and Indian languages
      """
 
      def __init__(self):
@@ -158,63 +109,29 @@ class AdvancedStutterDetector:
 
              # Debug: Log processor structure
              logger.info(f"📋 Processor type: {type(self.processor)}")
-             logger.info(f"📋 Processor attributes: {[attr for attr in dir(self.processor) if not attr.startswith('_')]}")
              if hasattr(self.processor, 'tokenizer'):
                  logger.info(f"📋 Tokenizer type: {type(self.processor.tokenizer)}")
             if hasattr(self.processor, 'feature_extractor'):
                  logger.info(f"📋 Feature extractor type: {type(self.processor.feature_extractor)}")
-             self.loaded_adapters = set()  # Keep for backward compatibility but not used with indicwav2vec
-
-             # Anomaly Detection Model (for outlier stutter events)
-             self.anomaly_detector = IsolationForest(
-                 contamination=0.1,  # Expect 10% of frames to be anomalous
-                 random_state=42
-             )
 
-             logger.info("✅ Engine Online - Advanced Research Algorithm Loaded")
+             logger.info("✅ IndicWav2Vec Hindi ASR Engine Loaded")
         except Exception as e:
             logger.error(f"🔥 Engine Failure: {e}")
             raise
 
     def _init_common_adapters(self):
-         """Preload common language adapters - Not applicable for indicwav2vec-hindi"""
-         # IndicWav2Vec Hindi model is pre-trained for Hindi, no adapters needed
+         """Not applicable - IndicWav2Vec Hindi doesn't use adapters"""
         pass
 
-     def _detect_language_robust(self, audio_path: str) -> str:
-         """Detect language using MMS LID model"""
-         try:
-             from transformers import Wav2Vec2ForSequenceClassification
-             lid_model = Wav2Vec2ForSequenceClassification.from_pretrained(
-                 LID_MODEL_ID,
-                 token=HF_TOKEN
-             ).to(DEVICE)
-             lid_processor = AutoFeatureExtractor.from_pretrained(
-                 LID_MODEL_ID,
-                 token=HF_TOKEN
-             )
-
-             audio, sr = librosa.load(audio_path, sr=16000)
-             inputs = lid_processor(audio, sampling_rate=16000, return_tensors="pt").to(DEVICE)
-
-             with torch.no_grad():
-                 outputs = lid_model(**inputs)
-                 predicted_id = torch.argmax(outputs.logits, dim=-1).item()
-
-             # Map to language code (simplified - would need actual label mapping)
-             return 'eng'  # Default fallback
-         except Exception as e:
-             logger.warning(f"Language detection failed: {e}, defaulting to 'eng'")
-             return 'eng'
-
     def _activate_adapter(self, lang_code: str):
-         """Activate language adapter - Not applicable for indicwav2vec-hindi"""
-         # IndicWav2Vec Hindi model is pre-trained for Hindi, no adapter switching needed
-         # Log for debugging but no action required
-         if lang_code != 'hin':
-             logger.info(f"Note: Using Hindi-specific model (indicwav2vec-hindi), language code '{lang_code}' requested but model is optimized for Hindi")
+         """Not applicable - IndicWav2Vec Hindi doesn't use adapters"""
+         logger.info(f"Using IndicWav2Vec Hindi model (optimized for Hindi)")
         pass
 
+     # ===== LEGACY METHODS (NOT USED IN ASR-ONLY MODE) =====
+     # These methods are kept for reference but not called in the simplified ASR pipeline
+     # They require additional libraries (parselmouth, fastdtw, sklearn) that are not needed for ASR-only mode
+
     def _extract_comprehensive_features(self, audio: np.ndarray, sr: int, audio_path: str) -> Dict[str, Any]:
         """Extract multi-modal acoustic features"""
         features = {}
@@ -708,131 +625,89 @@ class AdvancedStutterDetector:
         }
 
 
-     def analyze_audio(self, audio_path: str, proper_transcript: str = "", language: str = 'english') -> dict:
+     def analyze_audio(self, audio_path: str, proper_transcript: str = "", language: str = 'hindi') -> dict:
         """
-         Main analysis pipeline with comprehensive feature extraction
+         Main ASR analysis pipeline using IndicWav2Vec Hindi model
+
+         Focus: Automatic Speech Recognition (ASR) transcription only
         """
         start_time = time.time()
 
-         # === STEP 1: Language Detection & Setup ===
-         # Note: indicwav2vec-hindi is optimized for Hindi, but can handle other languages
-         if language == 'auto':
-             lang_code = self._detect_language_robust(audio_path)
-         else:
-             lang_code = INDIAN_LANGUAGES.get(language.lower(), 'hin')  # Default to Hindi for indicwav2vec
-         self._activate_adapter(lang_code)
-
-         # === STEP 2: Audio Loading & Preprocessing ===
+         # === STEP 1: Audio Loading & Preprocessing ===
         audio, sr = librosa.load(audio_path, sr=16000)
         duration = librosa.get_duration(y=audio, sr=sr)
 
-         # === STEP 3: Multi-Modal Feature Extraction ===
-         features = self._extract_comprehensive_features(audio, sr, audio_path)
-
-         # === STEP 4: Wav2Vec2 Transcription & Uncertainty ===
+         # === STEP 2: ASR Transcription using IndicWav2Vec Hindi ===
         transcript, word_timestamps, logits = self._transcribe_with_timestamps(audio)
-         logger.info(f"📝 Main transcription result: '{transcript}' (length: {len(transcript)}, words: {len(word_timestamps)})")
+         logger.info(f"📝 ASR Transcription: '{transcript}' (length: {len(transcript)}, words: {len(word_timestamps)})")
+
+         # === STEP 3: Calculate Confidence from Model Predictions ===
         entropy_score, low_conf_regions = self._calculate_uncertainty(logits)
+         avg_confidence = 1.0 - (entropy_score / 10.0) if entropy_score > 0 else 0.8
+         avg_confidence = max(0.0, min(1.0, avg_confidence))
 
-         # === STEP 5: Speaking Rate Estimation ===
-         speaking_rate = self._estimate_speaking_rate(audio, sr)
-
-         # === STEP 6: Multi-Layer Stutter Detection ===
+         # === STEP 4: Basic Text-based Analysis ===
+         # Simple text-based stutter detection (repetitions, hesitations)
         events = []
+         if transcript:
+             words = transcript.split()
+             # Detect word repetitions
+             for i in range(len(words) - 1):
+                 if words[i] == words[i+1] and i < len(word_timestamps) - 1:
+                     events.append(StutterEvent(
+                         type='repetition',
+                         start=word_timestamps[i]['start'] if i < len(word_timestamps) else 0,
+                         end=word_timestamps[i+1]['end'] if i+1 < len(word_timestamps) else 0,
+                         text=words[i],
+                         confidence=0.7
+                     ))
+
+         # Add low confidence regions as potential dysfluencies
+         for region in low_conf_regions[:5]:  # Limit to first 5
+             events.append(StutterEvent(
+                 type='dysfluency',
+                 start=region['time'],
+                 end=region['time'] + 0.3,
+                 text="<uncertainty>",
+                 confidence=0.4,
+                 acoustic_features={'entropy': entropy_score}
+             ))
+
+         # === STEP 5: Calculate Basic Metrics ===
+         total_duration = sum(e.end - e.start for e in events)
+         frequency = (len(events) / duration * 60) if duration > 0 else 0
+         stutter_percentage = (total_duration / duration * 100) if duration > 0 else 0
+
+         # Simple severity assessment
+         if stutter_percentage < 5:
+             severity = 'none'
+         elif stutter_percentage < 15:
+             severity = 'mild'
+         elif stutter_percentage < 30:
+             severity = 'moderate'
+         else:
+             severity = 'severe'
 
-         # Layer A: Spectral Prolongation Detection
-         events.extend(self._detect_prolongations_advanced(
-             features['mfcc'],
-             features['spectral_flux'],
-             speaking_rate,
-             word_timestamps
-         ))
-
-         # Layer B: Silence Block Detection
-         events.extend(self._detect_blocks_enhanced(
-             audio, sr,
-             features['rms_energy'],
-             features['zcr'],
-             word_timestamps,
-             speaking_rate
-         ))
-
-         # Layer C: DTW-Based Repetition Detection
-         events.extend(self._detect_repetitions_advanced(
-             features['mfcc'],
-             features['formants'],
-             word_timestamps,
-             transcript,
-             speaking_rate
-         ))
-
-         # Layer D: Voice Quality Dysfluencies (Jitter/Shimmer)
-         events.extend(self._detect_voice_quality_issues(
-             audio_path,
-             word_timestamps,
-             features['voice_quality']
-         ))
-
-         # Layer E: Entropy-Based Uncertainty Events
-         for region in low_conf_regions:
-             if not self._is_overlapping(region['time'], events):
-                 events.append(StutterEvent(
-                     type='dysfluency',
-                     start=region['time'],
-                     end=region['time'] + 0.3,
-                     text="<uncertainty>",
-                     confidence=0.4,
-                     acoustic_features={'entropy': entropy_score}
-                 ))
-
-         # Layer F: Anomaly Detection (Isolation Forest)
-         events = self._detect_anomalies(events, features)
-
-         # === STEP 7: Event Fusion & Deduplication ===
-         cleaned_events = self._deduplicate_events_cascade(events)
-
-         # === STEP 8: Clinical Metrics & Severity Assessment ===
-         metrics = self._calculate_clinical_metrics(
-             cleaned_events,
-             duration,
-             speaking_rate,
-             features
-         )
-
-         # Severity upgrade if global confidence is very low
-         if metrics['confidence'] < 0.6 and metrics['severity_label'] == 'none':
-             metrics['severity_label'] = 'mild'
-             metrics['severity_score'] = max(metrics['severity_score'], 5.0)
-
-         # === STEP 9: Return Comprehensive Report ===
-         # Ensure transcripts are not None
+         # === STEP 6: Return ASR Results ===
         actual_transcript = transcript if transcript else ""
-         target_transcript = proper_transcript if proper_transcript else transcript if transcript else ""
+         target_transcript = proper_transcript if proper_transcript else ""
 
-         logger.info(f"📝 Final return - Actual: '{actual_transcript}' (len: {len(actual_transcript)}), Target: '{target_transcript}' (len: {len(target_transcript)})")
+         logger.info(f"📝 Final ASR result - Actual: '{actual_transcript}' (len: {len(actual_transcript)}), Target: '{target_transcript}' (len: {len(target_transcript)})")
 
         return {
             'actual_transcript': actual_transcript,
             'target_transcript': target_transcript,
-             'mismatched_chars': [f"{r['time']}s" for r in low_conf_regions],
-             'mismatch_percentage': metrics['severity_score'],
+             'mismatched_chars': [f"{r['time']:.2f}s" for r in low_conf_regions[:10]],
+             'mismatch_percentage': round(stutter_percentage, 2),
             'ctc_loss_score': round(entropy_score, 4),
-             'stutter_timestamps': [self._event_to_dict(e) for e in cleaned_events],
-             'total_stutter_duration': metrics['total_duration'],
-             'stutter_frequency': metrics['frequency'],
-             'severity': metrics['severity_label'],
-             'confidence_score': metrics['confidence'],
-             'speaking_rate_sps': round(speaking_rate, 2),
-             'voice_quality_metrics': features['voice_quality'],
-             'formant_analysis': features['formant_summary'],
-             'acoustic_features': {
-                 'avg_mfcc_variance': float(np.var(features['mfcc'])),
-                 'avg_zcr': float(np.mean(features['zcr'])),
-                 'spectral_flux_mean': float(np.mean(features['spectral_flux'])),
-                 'energy_entropy': float(np.mean(features['energy_entropy']))
-             },
+             'stutter_timestamps': [self._event_to_dict(e) for e in events],
+             'total_stutter_duration': round(total_duration, 2),
+             'stutter_frequency': round(frequency, 2),
+             'severity': severity,
+             'confidence_score': round(avg_confidence, 2),
+             'speaking_rate_sps': round(len(word_timestamps) / duration if duration > 0 else 0, 2),
             'analysis_duration_seconds': round(time.time() - start_time, 2),
-             'model_version': f'indicwav2vec-hindi-v1-{lang_code}'
+             'model_version': 'indicwav2vec-hindi-asr-v1'
         }
713