metadata
title: Parakeet-TDT-v3-ASR-Demo Real-Time Mic-File Transcription
emoji: 🦀
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: true
license: cc-by-4.0
short_description: Transcribe Speech Real-Time from MIC - clone and use locally
inference: true
tags:
- audio
- speech-recognition
- asr
- real-time
- cpu
- nvidia
- parakeet
- microphone
- voice
- speech
- browser
- gradio
- nemo
- huggingface
Usage
- Mic Tab: Click "RECORD", then speak into your mic; the text updates live. The "Flush" button does nothing, it's a feature :)
- Files Tab: Upload audio files (WAV); click "Run" for transcripts. (I have tried only WAV files, TODO: handle more types like mp4)
Limitations
- Sessions are per-browser-tab (Gradio state). I don't know whether this will still work if many users launch it at the same time.
- To be sure, duplicate this Space or clone it to your own PC. That also gives you full privacy, and no GPU is needed.
Why is this Space amazing? (this is for people looking for low-level stuff of "AI" - yeah, I did it! BEAM! Streaming, no greedy_batch trash)
- Real-Time Mic Mode: Streams audio in 2s chunks, merging hypotheses for smooth, cumulative transcripts. Handles conversations with retained context.
- Advanced Decoding: Uses modern MALSD batch beam search (beam=32) for accurate, error-resistant results, outperforming basic greedy methods in ambiguous audio.
- CPU Efficiency: Runs fast on standard hardware (no GPU needed), with optimized configs like no timestamps and fused batching.
- File Mode Bonus: Batch transcribes uploads for quick comparisons.
- Quality Edge: Produces transcripts with few artifacts, which makes it well suited to developers and testing compared with static NVIDIA spaces.
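The "merging hypotheses" step in Real-Time Mic Mode can be pictured with a small string-level sketch. This is not the code from app.py: the function name `merge_transcripts` and the `max_overlap` parameter are made up for illustration, and the rule used here (drop words that repeat at the seam between the running transcript and the newest chunk's text) is only one plausible way to do it.

```python
def merge_transcripts(previous: str, new: str, max_overlap: int = 20) -> str:
    """Append `new` to `previous`, dropping words duplicated at the seam.

    Hypothetical helper: compares whole words, case-insensitively, and
    only looks at the last `max_overlap` words of the running transcript.
    """
    prev_words = previous.split()
    new_words = new.split()
    limit = min(len(prev_words), len(new_words), max_overlap)

    # Try the longest candidate overlap first, then shrink.
    for size in range(limit, 0, -1):
        tail = [w.lower() for w in prev_words[-size:]]
        head = [w.lower() for w in new_words[:size]]
        if tail == head:
            return " ".join(prev_words + new_words[size:])

    # No overlap found: plain concatenation.
    return " ".join(prev_words + new_words)


if __name__ == "__main__":
    running = ""
    for chunk_text in ["hello how are", "how are you today", "today is sunny"]:
        running = merge_transcripts(running, chunk_text)
    print(running)
```

Word-level matching like this breaks when the model re-spells a word between chunks, which is the kind of case a token-level alignment would handle better.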
Parakeet-TDT v3 ASR Demo: Real-Time Mic & File Transcription on CPU
This Hugging Face Space demonstrates a lightweight, CPU-based Automatic Speech Recognition (ASR) application using NVIDIA's Parakeet-TDT-0.6b-v3 model from NeMo. Unlike NVIDIA's official demo (which only supports file uploads), this app adds real-time microphone streaming: it transcribes live speech incrementally, with high quality and context retention. It's well suited to interactive demos, voice notes, or testing multilingual ASR without a GPU.
Features Overview
- Model Setup: Loads Parakeet-TDT-0.6b-v3 (RNNT-based) with MALSD decoding for beam exploration and loop labels for alignments.
- Audio Handling: Resamples to 16 kHz mono. Other formats may load, but only WAV has been tested.
- Streaming (Mic): Partial hypotheses for seamless updates, session-based for multi-chunk context.
- UI: Gradio tabs—Mic for live input/output (flush to finalize), Files for batch results table.
- Tech Stack: NeMo (ASR core), Gradio (web UI), Torchaudio/Soundfile (audio utils).
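To make the audio-handling and streaming bullets concrete, here is a minimal pure-Python sketch of two of the steps: mixing channels down to mono and cutting a 16 kHz signal into 2 s chunks. The names `to_mono` and `iter_chunks` are hypothetical and not taken from app.py. Resampling is left out on purpose; in the real app that job would fall to the audio libraries listed in the tech stack.

```python
from typing import Iterator, List, Sequence

SAMPLE_RATE = 16_000  # target sample rate named in this README
CHUNK_SECONDS = 2     # streaming chunk length named in this README


def to_mono(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average the channels of each frame into a single mono sample."""
    return [sum(frame) / len(frame) for frame in frames]


def iter_chunks(
    samples: Sequence[float],
    sample_rate: int = SAMPLE_RATE,
    seconds: float = CHUNK_SECONDS,
) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-length chunks; the last one may be shorter."""
    size = int(sample_rate * seconds)
    for start in range(0, len(samples), size):
        yield samples[start:start + size]


if __name__ == "__main__":
    stereo = [(0.5, -0.5)] * (5 * SAMPLE_RATE)  # 5 s of dummy stereo audio
    mono = to_mono(stereo)
    print([len(chunk) for chunk in iter_chunks(mono)])
```

In a streaming setup each yielded chunk would be handed to the recognizer in turn, with the session state carrying context from one chunk to the next.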
TODO:
- change string-level to token level (y_sequence) hypothesis alignment (quality improvement, advanced technical stuff ;))
Contributions welcome! Fork and PR improvements. Built with ❤️ using Grok's guidance.
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference