anfastech committed on
Commit
e7e9fa8
·
1 Parent(s): 439ae4d

Update: Simplify the AI engine to use only ai4bharat/indicwav2vec-hindi for ASR.

MODEL_SUMMARY.md ADDED
@@ -0,0 +1,130 @@
+ # AI Engine Model Summary
+
+ ## Simplified ASR-Only Configuration
+
+ This engine has been simplified to use **ONLY** the IndicWav2Vec Hindi model for Automatic Speech Recognition (ASR).
+
+ ---
+
+ ## Active Model
+
+ ### 1. IndicWav2Vec Hindi (Primary & Only Model)
+ - **Model ID**: `ai4bharat/indicwav2vec-hindi`
+ - **Type**: `Wav2Vec2ForCTC`
+ - **Purpose**: Automatic Speech Recognition (ASR) for Hindi and Indian languages
+ - **Status**: ✅ Active - Loaded at startup
+ - **Location**: `detect_stuttering.py` lines 26, 148-156
+ - **Authentication**: Requires `HF_TOKEN` environment variable
+
+ **Features:**
+ - Speech-to-text transcription
+ - Confidence scoring from model predictions (see the sketch below)
+ - Text-based stutter analysis (simple repetition detection)
+
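+ The confidence score is derived from the mean entropy of the model's CTC output distribution; a minimal sketch of the mapping (mirroring the logic in `analyze_audio`):
+
+ ```python
+ # Map mean CTC entropy to a 0-1 confidence value (higher entropy = lower confidence).
+ def confidence_from_entropy(entropy_score: float) -> float:
+     confidence = 1.0 - (entropy_score / 10.0) if entropy_score > 0 else 0.8
+     return max(0.0, min(1.0, confidence))  # clamp to [0, 1]
+ ```
+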
+ ---
+
+ ## Removed Models
+
+ The following models have been **removed** to simplify the engine:
+
+ 1. ❌ **MMS Language Identification (LID)** - `facebook/mms-lid-126`
+    - Previously used for language detection
+    - No longer needed - IndicWav2Vec handles Hindi natively
+
+ 2. ❌ **Isolation Forest** (sklearn)
+    - Previously used for anomaly detection
+    - Removed - using simple text-based analysis instead
+
+ ---
+
+ ## Removed Libraries
+
+ The following signal processing libraries are no longer used:
+
+ - ❌ `parselmouth` (Praat) - Voice quality analysis
+ - ❌ `fastdtw` - Repetition detection via DTW
+ - ❌ `sklearn` - Machine learning algorithms
+ - ❌ Complex acoustic feature extraction (MFCC, formants, etc.)
+
+ ---
+
+ ## Current Pipeline
+
+ ```
+ Audio Input
+     ↓
+ IndicWav2Vec Hindi ASR
+     ↓
+ Text Transcription
+     ↓
+ Basic Text Analysis
+     ↓
+ Results (transcript + simple stutter detection)
+ ```
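+
+ For reference, a minimal sketch of the ASR step using the standard `transformers` Wav2Vec2 CTC interface (greedy decoding; the file path is illustrative):
+
+ ```python
+ import os
+ import librosa
+ import torch
+ from transformers import AutoProcessor, Wav2Vec2ForCTC
+
+ MODEL_ID = "ai4bharat/indicwav2vec-hindi"
+ processor = AutoProcessor.from_pretrained(MODEL_ID, token=os.getenv("HF_TOKEN"))
+ model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID, token=os.getenv("HF_TOKEN"))
+
+ audio, sr = librosa.load("sample.wav", sr=16000)  # model expects 16 kHz input
+ inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
+ with torch.no_grad():
+     logits = model(**inputs).logits               # shape: (1, time, vocab)
+ predicted_ids = torch.argmax(logits, dim=-1)      # greedy CTC decoding
+ print(processor.batch_decode(predicted_ids)[0])
+ ```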
+
+ ---
+
+ ## API Response Format
+
+ The simplified engine returns:
+
+ ```json
+ {
+     "actual_transcript": "transcribed text",
+     "target_transcript": "expected text (if provided)",
+     "mismatched_chars": ["timestamps of low confidence regions"],
+     "mismatch_percentage": 0.0,
+     "ctc_loss_score": 0.0,
+     "stutter_timestamps": [{"type": "repetition", "start": 0.0, "end": 0.5, ...}],
+     "total_stutter_duration": 0.0,
+     "stutter_frequency": 0.0,
+     "severity": "none|mild|moderate|severe",
+     "confidence_score": 0.8,
+     "speaking_rate_sps": 0.0,
+     "analysis_duration_seconds": 0.0,
+     "model_version": "indicwav2vec-hindi-asr-v1"
+ }
+ ```
+
+ ---
+
+ ## Dependencies
+
+ **Required:**
+ - `transformers` 4.35.0 - For IndicWav2Vec model
+ - `torch` 2.0.1 - PyTorch backend
+ - `librosa` ≥0.10.0 - Audio loading (16kHz resampling)
+ - `numpy` - Array operations
+
+ **Optional (for legacy methods, not used in ASR mode):**
+ - `parselmouth` - Voice quality (not used)
+ - `fastdtw` - DTW algorithm (not used)
+ - `sklearn` - ML algorithms (not used)
+
+ ---
+
+ ## Usage
+
+ ```python
+ from diagnosis.ai_engine.detect_stuttering import get_stutter_detector
+
+ detector = get_stutter_detector()
+ result = detector.analyze_audio(
+     audio_path="path/to/audio.wav",
+     proper_transcript="expected text",  # optional
+     language="hindi"  # default: hindi
+ )
+
+ print(result['actual_transcript'])  # ASR transcription
+ ```
+
+ ---
+
+ ## Notes
+
+ - The engine focuses **only** on ASR transcription
+ - Stutter detection is simplified to text-based repetition analysis
+ - No complex acoustic feature extraction
+ - Faster and lighter than the previous multi-model approach
+ - Optimized for Hindi but can handle other Indian languages
+
TRANSCRIPT_DEBUG.md ADDED
@@ -0,0 +1,213 @@
+ # Transcript Debugging Guide
+
+ ## Issue: Empty Transcripts ("No transcript available")
+
+ ## Complete Flow Analysis
+
+ ### 1. Django App → API Request (`slaq-version-c/diagnosis/ai_engine/detect_stuttering.py`)
+
+ **Location:** Lines 269-274
+ ```python
+ response = requests.post(
+     self.api_url,
+     files=files,
+     data={
+         "transcript": proper_transcript if proper_transcript else "",
+         "language": lang_code,
+     },
+     timeout=self.api_timeout
+ )
+ ```
+
+ **Status:** ✅ Sending transcript parameter correctly
+
+ ---
+
+ ### 2. API Receives Request (`slaq-version-c-ai-enginee/app.py`)
+
+ **Location:** Lines 70-73
+ ```python
+ @app.post("/analyze")
+ async def analyze_audio(
+     audio: UploadFile = File(...),
+     transcript: str = Form("")  # ✅ Fixed: Now uses Form() for multipart
+ ):
+ ```
+
+ **Status:** ✅ Fixed - Now correctly receives transcript via Form() (see the note below)
+
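+ Why the `Form()` fix matters: in FastAPI, a plain `str` parameter alongside an `UploadFile` is treated as a *query* parameter, so a multipart form field named `transcript` never reaches it. A minimal, hypothetical reproduction (endpoint names are illustrative):
+
+ ```python
+ from fastapi import FastAPI, File, Form, UploadFile
+
+ app = FastAPI()
+
+ # Bug: `transcript` is parsed as a query parameter, so the multipart
+ # form field "transcript" is silently ignored (always "").
+ @app.post("/analyze-broken")
+ async def analyze_broken(audio: UploadFile = File(...), transcript: str = ""):
+     return {"transcript": transcript}
+
+ # Fix: Form() binds `transcript` to the multipart body.
+ @app.post("/analyze-fixed")
+ async def analyze_fixed(audio: UploadFile = File(...), transcript: str = Form("")):
+     return {"transcript": transcript}
+ ```
+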
+ ---
+
+ ### 3. API Calls Model (`slaq-version-c-ai-enginee/app.py`)
+
+ **Location:** Line 106
+ ```python
+ result = detector.analyze_audio(temp_file, transcript)
+ ```
+
+ **Status:** ✅ Passing transcript correctly
+
+ ---
+
+ ### 4. Model Transcribes Audio (`slaq-version-c-ai-enginee/diagnosis/ai_engine/detect_stuttering.py`)
+
+ **Location:** Lines 313-369 (`_transcribe_with_timestamps`)
+
+ **Potential Issues:**
+ - ❓ IndicWav2Vec decoding might not work with `processor.batch_decode()`
+ - ❓ May need to use the tokenizer directly
+ - ❓ Model might not be producing valid predictions
+
+ **Status:** ⚠️ **LIKELY ISSUE HERE** - Decoding method may be incorrect
+
+ ---
+
+ ### 5. Model Returns Result (`slaq-version-c-ai-enginee/diagnosis/ai_engine/detect_stuttering.py`)
+
+ **Location:** Lines 787-794
+ ```python
+ actual_transcript = transcript if transcript else ""
+ target_transcript = proper_transcript if proper_transcript else transcript if transcript else ""
+
+ return {
+     'actual_transcript': actual_transcript,
+     'target_transcript': target_transcript,
+     ...
+ }
+ ```
+
+ **Status:** ✅ Returns transcripts correctly (if transcript is not empty)
+
+ ---
+
+ ### 6. API Returns Response (`slaq-version-c-ai-enginee/app.py`)
+
+ **Location:** Lines 109-113
+ ```python
+ actual = result.get('actual_transcript', '')
+ target = result.get('target_transcript', '')
+ logger.info(f"📝 Result transcripts - Actual: '{actual[:100]}' (len: {len(actual)}), Target: '{target[:100]}' (len: {len(target)})")
+ return result
+ ```
+
+ **Status:** ✅ Returns JSON with transcripts
+
+ ---
+
+ ### 7. Django Receives Response (`slaq-version-c/diagnosis/ai_engine/detect_stuttering.py`)
+
+ **Location:** Lines 279-410
+ ```python
+ result = response.json()
+ # ... formatting ...
+ actual_transcript = str(api_result.get('actual_transcript', '')).strip()
+ target_transcript = str(api_result.get('target_transcript', '')).strip()
+ ```
+
+ **Status:** ✅ Extracts transcripts correctly
+
+ ---
+
+ ### 8. Django Saves to Database (`slaq-version-c/diagnosis/tasks.py`)
+
+ **Location:** Lines 141-142
+ ```python
+ actual_transcript=actual_transcript,
+ target_transcript=target_transcript,
+ ```
+
+ **Status:** ✅ Saves correctly
+
+ ---
+
+ ## Root Cause Analysis
+
+ ### Most Likely Issue: Transcription Decoding
+
+ The IndicWav2Vec model (`ai4bharat/indicwav2vec-hindi`) may require:
+ 1. **Direct tokenizer access** instead of `processor.batch_decode()`
+ 2. **CTC decoding** with a proper tokenizer
+ 3. **Special handling** for Indic scripts
+
+ ### Fix Applied
+
+ Updated `_transcribe_with_timestamps()` to (sketched below):
+ 1. Try multiple decoding methods
+ 2. Use the tokenizer directly if available
+ 3. Add comprehensive error logging
+ 4. Log predicted IDs for debugging
139
+
140
+ ---
141
+
142
+ ## Debugging Steps
143
+
144
+ ### 1. Check API Logs
145
+
146
+ When processing audio, look for:
147
+ ```
148
+ ๐Ÿ“ Transcribed text: '...' (length: X)
149
+ ๐Ÿ“ Final return - Actual: '...' (len: X), Target: '...' (len: Y)
150
+ ๐Ÿ“ Result transcripts - Actual: '...' (len: X), Target: '...' (len: Y)
151
+ ```
152
+
153
+ ### 2. Check Django Logs
154
+
155
+ Look for:
156
+ ```
157
+ ๐Ÿ“ Final transcripts - Actual: X chars, Target: Y chars
158
+ ๐Ÿ“ Saving transcripts - Actual: X chars, Target: Y chars
159
+ ```
160
+
161
+ ### 3. Check Database
162
+
163
+ Query the `AnalysisResult` table:
164
+ ```sql
165
+ SELECT actual_transcript, target_transcript, LENGTH(actual_transcript) as actual_len, LENGTH(target_transcript) as target_len
166
+ FROM diagnosis_analysisresult
167
+ ORDER BY created_at DESC LIMIT 5;
168
+ ```
169
+
170
+ ### 4. Test API Directly
171
+
172
+ ```bash
173
+ curl -X POST "http://localhost:7860/analyze" \
174
175
+ -F "transcript=test transcript" \
176
+ -F "language=hin"
177
+ ```
178
+
179
+ Check the response JSON for `actual_transcript` and `target_transcript`.
180
+
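+ The same request from Python, mirroring the Django client's `requests.post` call (the path is illustrative):
+
+ ```python
+ import requests
+
+ with open("path/to/audio.wav", "rb") as f:
+     resp = requests.post(
+         "http://localhost:7860/analyze",
+         files={"audio": f},
+         data={"transcript": "test transcript", "language": "hin"},
+         timeout=120,
+     )
+ result = resp.json()
+ print(result.get("actual_transcript"), result.get("target_transcript"))
+ ```
+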
+ ---
+
+ ## Next Steps
+
+ 1. **Rebuild Docker image** with latest changes
+ 2. **Check logs** during audio processing
+ 3. **Verify processor structure** - logs will show processor attributes
+ 4. **Test with Hindi audio** - model is optimized for Hindi
+ 5. **Check if model is loaded correctly** - verify HF_TOKEN is working (see the snippet below)
+
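+ For item 5, a quick standalone check that the token works and the model loads (assumes `HF_TOKEN` is set in the environment):
+
+ ```python
+ import os
+ from transformers import AutoProcessor, Wav2Vec2ForCTC
+
+ # Fails loudly if HF_TOKEN is missing/invalid or the model can't be fetched.
+ token = os.getenv("HF_TOKEN")
+ processor = AutoProcessor.from_pretrained("ai4bharat/indicwav2vec-hindi", token=token)
+ model = Wav2Vec2ForCTC.from_pretrained("ai4bharat/indicwav2vec-hindi", token=token)
+ print("✅ Model and processor loaded")
+ ```
+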
+ ---
+
+ ## Expected Log Output (Success)
+
+ ```
+ 🚀 Initializing Advanced AI Engine on cpu...
+ ✅ HF_TOKEN found - using authenticated model access
+ 📋 Processor type: <class 'transformers.models.wav2vec2.processing_wav2vec2.Wav2Vec2Processor'>
+ 📋 Processor attributes: ['batch_decode', 'decode', 'feature_extractor', 'tokenizer', ...]
+ 📋 Tokenizer type: <class 'transformers.models.wav2vec2.tokenization_wav2vec2.Wav2Vec2CTCTokenizer'>
+ 📝 Transcribed text: 'नमस्ते मैं हिंदी बोल रहा हूं' (length: 25)
+ 📝 Final return - Actual: 'नमस्ते मैं हिंदी बोल रहा हूं' (len: 25), Target: '...' (len: X)
+ ```
+
+ ---
+
+ ## If Still Empty
+
+ 1. **Model may not be loaded correctly** - check HF_TOKEN
+ 2. **Audio format issue** - ensure 16kHz mono WAV (see the check below)
+ 3. **Model not producing predictions** - check predicted_ids in logs
+ 4. **Tokenizer mismatch** - IndicWav2Vec may need special tokenizer initialization
+
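+ For item 2, a small sanity check (librosa resamples to 16 kHz mono on load, so this mainly verifies the file decodes at all; the path is illustrative):
+
+ ```python
+ import librosa
+
+ # Load exactly as the engine does; a broken or unsupported file raises here.
+ audio, sr = librosa.load("path/to/audio.wav", sr=16000, mono=True)
+ print(f"sr={sr}, samples={len(audio)}, duration={len(audio)/sr:.2f}s")
+ ```
+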
diagnosis/ai_engine/detect_stuttering.py CHANGED
@@ -2,29 +2,18 @@
  import os
  import librosa
  import torch
- import torchaudio
- import torch.nn as nn
  import logging
  import numpy as np
- import parselmouth
- from transformers import Wav2Vec2ForCTC, AutoProcessor, Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
+ from transformers import Wav2Vec2ForCTC, AutoProcessor
  import time
- from collections import Counter
  from dataclasses import dataclass, field
- from typing import List, Dict, Any, Tuple, Optional
- from scipy.signal import correlate, butter, filtfilt
- from scipy.spatial.distance import euclidean, cosine
- from scipy.spatial import ConvexHull
- from scipy.stats import kurtosis, skew
- from fastdtw import fastdtw
- from sklearn.preprocessing import StandardScaler
- from sklearn.ensemble import IsolationForest
+ from typing import List, Dict, Any, Tuple
+ # Simplified: Only using ASR transcription, removed complex signal processing libraries
 
  logger = logging.getLogger(__name__)
 
  # === CONFIGURATION ===
- MODEL_ID = "ai4bharat/indicwav2vec-hindi"
- LID_MODEL_ID = "facebook/mms-lid-126"
+ MODEL_ID = "ai4bharat/indicwav2vec-hindi"  # Only model used - IndicWav2Vec Hindi for ASR
  DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
  HF_TOKEN = os.getenv("HF_TOKEN")  # Hugging Face token for authenticated model access
 
@@ -85,56 +74,18 @@ class StutterEvent:
 
  class AdvancedStutterDetector:
      """
-     🧠 2024-2025 State-of-the-Art Stuttering Detection Engine
-
-     ═══════════════════════════════════════════════════════
-     RESEARCH FOUNDATION (Latest Publications):
-     ═══════════════════════════════════════════════════════
-
-     [1] ACOUSTIC FEATURES:
-         • MFCC (20 coefficients) - spectral envelope
-         • Formant tracking (F1-F4) - vowel space analysis
-         • Pitch contour (F0) - intonation patterns
-         • Zero-Crossing Rate - voiced/unvoiced classification
-         • Spectral flux - rapid spectral changes
-         • Energy entropy - signal chaos measurement
-
-     [2] VOICE QUALITY METRICS (Parselmouth/Praat):
-         • Jitter (>1% threshold) - pitch perturbation
-         • Shimmer (>3% threshold) - amplitude perturbation
-         • HNR (<15 dB threshold) - harmonics-to-noise ratio
-
-     [3] FORMANT ANALYSIS (Vowel Space):
-         • Untreated stutterers show 70% vowel space reduction
-         • F1-F2 centralization indicates restricted articulation
-         • Post-treatment: vowel space normalizes
-
-     [4] DETECTION ALGORITHMS:
-         • Prolongation: Spectral correlation >0.9 for >250ms
-         • Blocks: Silence gaps >350ms mid-utterance
-         • Repetitions: DTW distance <0.15 + text matching
-         • Dysfluency: Entropy >3.5 or confidence <0.4
-
-     [5] ENSEMBLE DECISION FUSION:
-         • Multi-layer cascade: Block > Repetition > Prolongation
-         • Anomaly detection (Isolation Forest) for outliers
-         • Speaking-rate normalization for adaptive thresholds
-
-     ═══════════════════════════════════════════════════════
-     KEY IMPROVEMENTS FROM ORIGINAL CODE:
-     ═══════════════════════════════════════════════════════
-
-     ✅ Praat-based voice quality analysis (jitter/shimmer/HNR)
-     ✅ Formant tracking with vowel space area calculation
-     ✅ Zero-crossing rate for phonation analysis
-     ✅ Spectral flux for rapid acoustic changes
-     ✅ Enhanced entropy calculation with frame-level detail
-     ✅ Isolation Forest anomaly detection
-     ✅ Multi-feature fusion with weighted scoring
-     ✅ Adaptive thresholds based on speaking rate
-     ✅ Comprehensive clinical severity mapping
-
-     ═══════════════════════════════════════════════════════
+     🎤 IndicWav2Vec Hindi ASR Engine
+
+     Simplified engine using ONLY ai4bharat/indicwav2vec-hindi for Automatic Speech Recognition.
+
+     Features:
+     - Speech-to-text transcription using IndicWav2Vec Hindi model
+     - Text-based stutter analysis from transcription
+     - Confidence scoring from model predictions
+     - Basic dysfluency detection from transcript patterns
+
+     Model: ai4bharat/indicwav2vec-hindi (Wav2Vec2ForCTC)
+     Purpose: Automatic Speech Recognition (ASR) for Hindi and Indian languages
      """
 
      def __init__(self):
@@ -158,63 +109,29 @@ class AdvancedStutterDetector:
 
              # Debug: Log processor structure
              logger.info(f"📋 Processor type: {type(self.processor)}")
-             logger.info(f"📋 Processor attributes: {[attr for attr in dir(self.processor) if not attr.startswith('_')]}")
              if hasattr(self.processor, 'tokenizer'):
                  logger.info(f"📋 Tokenizer type: {type(self.processor.tokenizer)}")
             if hasattr(self.processor, 'feature_extractor'):
                  logger.info(f"📋 Feature extractor type: {type(self.processor.feature_extractor)}")
-             self.loaded_adapters = set()  # Keep for backward compatibility but not used with indicwav2vec
-
-             # Anomaly Detection Model (for outlier stutter events)
-             self.anomaly_detector = IsolationForest(
-                 contamination=0.1,  # Expect 10% of frames to be anomalous
-                 random_state=42
-             )
 
-             logger.info("✅ Engine Online - Advanced Research Algorithm Loaded")
+             logger.info("✅ IndicWav2Vec Hindi ASR Engine Loaded")
         except Exception as e:
             logger.error(f"🔥 Engine Failure: {e}")
             raise
 
     def _init_common_adapters(self):
-         """Preload common language adapters - Not applicable for indicwav2vec-hindi"""
-         # IndicWav2Vec Hindi model is pre-trained for Hindi, no adapters needed
+         """Not applicable - IndicWav2Vec Hindi doesn't use adapters"""
         pass
 
-     def _detect_language_robust(self, audio_path: str) -> str:
-         """Detect language using MMS LID model"""
-         try:
-             from transformers import Wav2Vec2ForSequenceClassification
-             lid_model = Wav2Vec2ForSequenceClassification.from_pretrained(
-                 LID_MODEL_ID,
-                 token=HF_TOKEN
-             ).to(DEVICE)
-             lid_processor = AutoFeatureExtractor.from_pretrained(
-                 LID_MODEL_ID,
-                 token=HF_TOKEN
-             )
-
-             audio, sr = librosa.load(audio_path, sr=16000)
-             inputs = lid_processor(audio, sampling_rate=16000, return_tensors="pt").to(DEVICE)
-
-             with torch.no_grad():
-                 outputs = lid_model(**inputs)
-                 predicted_id = torch.argmax(outputs.logits, dim=-1).item()
-
-             # Map to language code (simplified - would need actual label mapping)
-             return 'eng'  # Default fallback
-         except Exception as e:
-             logger.warning(f"Language detection failed: {e}, defaulting to 'eng'")
-             return 'eng'
-
     def _activate_adapter(self, lang_code: str):
-         """Activate language adapter - Not applicable for indicwav2vec-hindi"""
-         # IndicWav2Vec Hindi model is pre-trained for Hindi, no adapter switching needed
-         # Log for debugging but no action required
-         if lang_code != 'hin':
-             logger.info(f"Note: Using Hindi-specific model (indicwav2vec-hindi), language code '{lang_code}' requested but model is optimized for Hindi")
+         """Not applicable - IndicWav2Vec Hindi doesn't use adapters"""
+         logger.info(f"Using IndicWav2Vec Hindi model (optimized for Hindi)")
         pass
 
+     # ===== LEGACY METHODS (NOT USED IN ASR-ONLY MODE) =====
+     # These methods are kept for reference but not called in the simplified ASR pipeline
+     # They require additional libraries (parselmouth, fastdtw, sklearn) that are not needed for ASR-only mode
+
     def _extract_comprehensive_features(self, audio: np.ndarray, sr: int, audio_path: str) -> Dict[str, Any]:
         """Extract multi-modal acoustic features"""
         features = {}
@@ -708,131 +625,89 @@ class AdvancedStutterDetector:
         }
 
 
-     def analyze_audio(self, audio_path: str, proper_transcript: str = "", language: str = 'english') -> dict:
+     def analyze_audio(self, audio_path: str, proper_transcript: str = "", language: str = 'hindi') -> dict:
         """
-         Main analysis pipeline with comprehensive feature extraction
+         Main ASR analysis pipeline using IndicWav2Vec Hindi model
+
+         Focus: Automatic Speech Recognition (ASR) transcription only
         """
         start_time = time.time()
 
-         # === STEP 1: Language Detection & Setup ===
-         # Note: indicwav2vec-hindi is optimized for Hindi, but can handle other languages
-         if language == 'auto':
-             lang_code = self._detect_language_robust(audio_path)
-         else:
-             lang_code = INDIAN_LANGUAGES.get(language.lower(), 'hin')  # Default to Hindi for indicwav2vec
-         self._activate_adapter(lang_code)
-
-         # === STEP 2: Audio Loading & Preprocessing ===
+         # === STEP 1: Audio Loading & Preprocessing ===
         audio, sr = librosa.load(audio_path, sr=16000)
         duration = librosa.get_duration(y=audio, sr=sr)
 
-         # === STEP 3: Multi-Modal Feature Extraction ===
-         features = self._extract_comprehensive_features(audio, sr, audio_path)
-
-         # === STEP 4: Wav2Vec2 Transcription & Uncertainty ===
+         # === STEP 2: ASR Transcription using IndicWav2Vec Hindi ===
         transcript, word_timestamps, logits = self._transcribe_with_timestamps(audio)
-         logger.info(f"📝 Main transcription result: '{transcript}' (length: {len(transcript)}, words: {len(word_timestamps)})")
+         logger.info(f"📝 ASR Transcription: '{transcript}' (length: {len(transcript)}, words: {len(word_timestamps)})")
+
+         # === STEP 3: Calculate Confidence from Model Predictions ===
         entropy_score, low_conf_regions = self._calculate_uncertainty(logits)
+         avg_confidence = 1.0 - (entropy_score / 10.0) if entropy_score > 0 else 0.8
+         avg_confidence = max(0.0, min(1.0, avg_confidence))
 
-         # === STEP 5: Speaking Rate Estimation ===
-         speaking_rate = self._estimate_speaking_rate(audio, sr)
-
-         # === STEP 6: Multi-Layer Stutter Detection ===
+         # === STEP 4: Basic Text-based Analysis ===
+         # Simple text-based stutter detection (repetitions, hesitations)
         events = []
+         if transcript:
+             words = transcript.split()
+             # Detect word repetitions
+             for i in range(len(words) - 1):
+                 if words[i] == words[i+1] and i < len(word_timestamps) - 1:
+                     events.append(StutterEvent(
+                         type='repetition',
+                         start=word_timestamps[i]['start'] if i < len(word_timestamps) else 0,
+                         end=word_timestamps[i+1]['end'] if i+1 < len(word_timestamps) else 0,
+                         text=words[i],
+                         confidence=0.7
+                     ))
+
+         # Add low confidence regions as potential dysfluencies
+         for region in low_conf_regions[:5]:  # Limit to first 5
+             events.append(StutterEvent(
+                 type='dysfluency',
+                 start=region['time'],
+                 end=region['time'] + 0.3,
+                 text="<uncertainty>",
+                 confidence=0.4,
+                 acoustic_features={'entropy': entropy_score}
+             ))
+
+         # === STEP 5: Calculate Basic Metrics ===
+         total_duration = sum(e.end - e.start for e in events)
+         frequency = (len(events) / duration * 60) if duration > 0 else 0
+         stutter_percentage = (total_duration / duration * 100) if duration > 0 else 0
+
+         # Simple severity assessment
+         if stutter_percentage < 5:
+             severity = 'none'
+         elif stutter_percentage < 15:
+             severity = 'mild'
+         elif stutter_percentage < 30:
+             severity = 'moderate'
+         else:
+             severity = 'severe'
 
-         # Layer A: Spectral Prolongation Detection
-         events.extend(self._detect_prolongations_advanced(
-             features['mfcc'],
-             features['spectral_flux'],
-             speaking_rate,
-             word_timestamps
-         ))
-
-         # Layer B: Silence Block Detection
-         events.extend(self._detect_blocks_enhanced(
-             audio, sr,
-             features['rms_energy'],
-             features['zcr'],
-             word_timestamps,
-             speaking_rate
-         ))
-
-         # Layer C: DTW-Based Repetition Detection
-         events.extend(self._detect_repetitions_advanced(
-             features['mfcc'],
-             features['formants'],
-             word_timestamps,
-             transcript,
-             speaking_rate
-         ))
-
-         # Layer D: Voice Quality Dysfluencies (Jitter/Shimmer)
-         events.extend(self._detect_voice_quality_issues(
-             audio_path,
-             word_timestamps,
-             features['voice_quality']
-         ))
-
-         # Layer E: Entropy-Based Uncertainty Events
-         for region in low_conf_regions:
-             if not self._is_overlapping(region['time'], events):
-                 events.append(StutterEvent(
-                     type='dysfluency',
-                     start=region['time'],
-                     end=region['time'] + 0.3,
-                     text="<uncertainty>",
-                     confidence=0.4,
-                     acoustic_features={'entropy': entropy_score}
-                 ))
-
-         # Layer F: Anomaly Detection (Isolation Forest)
-         events = self._detect_anomalies(events, features)
-
-         # === STEP 7: Event Fusion & Deduplication ===
-         cleaned_events = self._deduplicate_events_cascade(events)
-
-         # === STEP 8: Clinical Metrics & Severity Assessment ===
-         metrics = self._calculate_clinical_metrics(
-             cleaned_events,
-             duration,
-             speaking_rate,
-             features
-         )
-
-         # Severity upgrade if global confidence is very low
-         if metrics['confidence'] < 0.6 and metrics['severity_label'] == 'none':
-             metrics['severity_label'] = 'mild'
-             metrics['severity_score'] = max(metrics['severity_score'], 5.0)
-
-         # === STEP 9: Return Comprehensive Report ===
-         # Ensure transcripts are not None
+         # === STEP 6: Return ASR Results ===
         actual_transcript = transcript if transcript else ""
-         target_transcript = proper_transcript if proper_transcript else transcript if transcript else ""
+         target_transcript = proper_transcript if proper_transcript else ""
 
-         logger.info(f"📝 Final return - Actual: '{actual_transcript}' (len: {len(actual_transcript)}), Target: '{target_transcript}' (len: {len(target_transcript)})")
+         logger.info(f"📝 Final ASR result - Actual: '{actual_transcript}' (len: {len(actual_transcript)}), Target: '{target_transcript}' (len: {len(target_transcript)})")
 
         return {
             'actual_transcript': actual_transcript,
             'target_transcript': target_transcript,
-             'mismatched_chars': [f"{r['time']}s" for r in low_conf_regions],
-             'mismatch_percentage': metrics['severity_score'],
+             'mismatched_chars': [f"{r['time']:.2f}s" for r in low_conf_regions[:10]],
+             'mismatch_percentage': round(stutter_percentage, 2),
             'ctc_loss_score': round(entropy_score, 4),
-             'stutter_timestamps': [self._event_to_dict(e) for e in cleaned_events],
-             'total_stutter_duration': metrics['total_duration'],
-             'stutter_frequency': metrics['frequency'],
-             'severity': metrics['severity_label'],
-             'confidence_score': metrics['confidence'],
-             'speaking_rate_sps': round(speaking_rate, 2),
-             'voice_quality_metrics': features['voice_quality'],
-             'formant_analysis': features['formant_summary'],
-             'acoustic_features': {
-                 'avg_mfcc_variance': float(np.var(features['mfcc'])),
-                 'avg_zcr': float(np.mean(features['zcr'])),
-                 'spectral_flux_mean': float(np.mean(features['spectral_flux'])),
-                 'energy_entropy': float(np.mean(features['energy_entropy']))
-             },
+             'stutter_timestamps': [self._event_to_dict(e) for e in events],
+             'total_stutter_duration': round(total_duration, 2),
+             'stutter_frequency': round(frequency, 2),
+             'severity': severity,
+             'confidence_score': round(avg_confidence, 2),
+             'speaking_rate_sps': round(len(word_timestamps) / duration if duration > 0 else 0, 2),
             'analysis_duration_seconds': round(time.time() - start_time, 2),
-             'model_version': f'indicwav2vec-hindi-v1-{lang_code}'
+             'model_version': 'indicwav2vec-hindi-asr-v1'
         }
713