Spaces: WJ88 / Parakeet-TDT-v3-ASR-Demo_Real-Time_Mic-File_Transcription

metadata

title: Parakeet-TDT-v3-ASR-Demo Real-Time Mic-File Transcription
emoji: 🦀
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: true
license: cc-by-4.0
short_description: Transcribe Speech Real-Time from MIC - clone and use locally
inference: true
tags:
  - audio
  - speech-recognition
  - asr
  - real-time
  - cpu
  - nvidia
  - parakeet
  - microphone
  - voice
  - speech
  - browser
  - gradio
  - nemo
  - huggingface

Usage

Mic Tab: Click "RECORD", then speak into your mic; the text updates live. The "Flush" button does nothing; it's a feature :)
Files Tab: Upload audio files (WAV) and click "Run" for transcripts. (I have only tried WAV files. TODO: handle more types, like mp4.)

Limitations

Sessions are per browser tab (Gradio state). I don't know whether it will still work if many users launch this at the same time.
To be sure, duplicate this Space or clone it to your own PC. That also gives you full privacy, and no GPU is needed.

Why is this Space amazing? (This is for people looking for the low-level stuff of "AI". Yeah, I did it! BEAM! Streaming, no greedy_batch trash.)

Real-Time Mic Mode: Streams audio in 2s chunks, merging hypotheses for smooth, cumulative transcripts. Handles conversations with retained context.
Advanced Decoding: Uses modern MALSD batch beam search (beam=32) for accurate, error-resistant results, outperforming basic greedy methods in ambiguous audio.
CPU Efficiency: Runs fast on standard hardware (no GPU needed), with optimized configs like no timestamps and fused batching.
File Mode Bonus: Batch transcribes uploads for quick comparisons.
Quality Edge: Produces transcripts with few artifacts, which makes it well suited to developers and testing compared with static NVIDIA Spaces.
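The decoding setup described above (MALSD batch beam search with beam=32 instead of greedy_batch) could be configured roughly as in the sketch below. It assumes NeMo's `change_decoding_strategy` method, the `malsd_batch` strategy name, and a `beam.beam_size` field in the model's decoding config. The helper names `decoding_overrides` and `apply_decoding` are hypothetical, and the actual `app.py` may set this up differently.

```python
def decoding_overrides(beam_size: int = 32) -> dict:
    """Return the decoding settings named in this README as a plain dict."""
    return {"strategy": "malsd_batch", "beam": {"beam_size": beam_size}}


def apply_decoding(asr_model, overrides: dict) -> None:
    """Apply the overrides to a loaded NeMo transducer model (assumed API)."""
    # Imported lazily so decoding_overrides() works without NeMo installed.
    from omegaconf import open_dict

    cfg = asr_model.cfg.decoding
    with open_dict(cfg):
        # Switch from the default greedy strategy to batched beam search.
        cfg.strategy = overrides["strategy"]
        cfg.beam.beam_size = overrides["beam"]["beam_size"]
    asr_model.change_decoding_strategy(cfg)
```

With NeMo installed, usage would presumably be `apply_decoding(model, decoding_overrides())`, where `model` comes from `nemo.collections.asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3")`. A larger beam explores more hypotheses per step, at the cost of more CPU time per chunk.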

Parakeet-TDT v3 ASR Demo: Real-Time Mic & File Transcription on CPU

This Hugging Face Space demonstrates a lightweight, CPU-based Automatic Speech Recognition (ASR) application using NVIDIA's Parakeet-TDT-0.6b-v3 model from NeMo. Unlike NVIDIA's official demo (which only supports file uploads), this app shines with real-time microphone streaming: it transcribes live speech incrementally with high quality and context retention. It's perfect for interactive demos, voice notes, or testing multilingual ASR without a GPU.

Features Overview

Model Setup: Loads Parakeet-TDT-0.6b-v3 (RNNT-based) with MALSD decoding for beam exploration and loop labels for alignments.
Audio Handling: Resamples to 16kHz mono, supports various formats.
Streaming (Mic): Partial hypotheses for seamless updates, session-based for multi-chunk context.
UI: Gradio tabs—Mic for live input/output (flush to finalize), Files for batch results table.
Tech Stack: NeMo (ASR core), Gradio (web UI), Torchaudio/Soundfile (audio utils).
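To make the audio handling and chunking steps concrete, here is a minimal pure-Python sketch of downmixing to mono, resampling to 16 kHz, and splitting into 2 s chunks. The app itself uses Torchaudio/Soundfile for this; the linear interpolation below is a simple stand-in with no anti-aliasing filter, and the function names are hypothetical.

```python
def to_mono(frames):
    """Average the channels of each frame: [(left, right), ...] -> [mono, ...]."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples, src_rate, dst_rate=16_000):
    """Resample by linear interpolation (illustrative only, no low-pass filter)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate       # fractional index into the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)  # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def split_chunks(samples, rate=16_000, seconds=2.0):
    """Cut a mono signal into fixed-length chunks; the last one may be shorter."""
    size = int(rate * seconds)
    return [samples[i:i + size] for i in range(0, len(samples), size)]
```

For example, one second of stereo audio at 48 kHz becomes 16,000 mono samples after `to_mono` and `resample_linear`, which `split_chunks` returns as a single partial chunk.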

TODO:

Change hypothesis alignment from string level to token level (y_sequence). This is a quality improvement, advanced technical stuff ;)
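As an illustration of the difference, the sketch below merges two hypotheses by dropping the longest prefix of the new one that repeats the tail of the previous one. It assumes consecutive hypotheses share an overlapping region, and `merge_overlap` is a hypothetical helper, not the merging logic used in `app.py`. The same routine works on lists of words (string level) and on lists of token IDs (token level).

```python
def merge_overlap(prev, new):
    """Append `new` to `prev`, skipping the longest prefix of `new`
    that is identical to a suffix of `prev`."""
    for k in range(min(len(prev), len(new)), 0, -1):
        if prev[-k:] == new[:k]:
            return prev + new[k:]
    return prev + new  # no overlap found: plain concatenation


# String level: compare whitespace-separated words.
words = merge_overlap("the quick brown".split(), "quick brown fox".split())
print(" ".join(words))

# String-level matching breaks as soon as the surface text differs slightly:
# "brown." and "brown" do not match, so the word is duplicated.
print(" ".join(merge_overlap("the quick brown.".split(), "brown fox".split())))

# Token level: compare integer token IDs (made-up IDs for illustration).
print(merge_overlap([5, 17, 42], [17, 42, 8]))
```

How token IDs are read from a hypothesis (for example via `y_sequence`) depends on the NeMo version and is not shown here.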

Contributions welcome! Fork and PR improvements. Built with ❤️ using Grok's guidance.

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference