Add streaming mode section to README

README.md +121 -2

@@ -14,6 +14,9 @@ tags:
   - tts
   - speech
   - whisper
 language:
   - en
   - zh
@@ -28,7 +31,7 @@ pipeline_tag: image-text-to-text
 4-bit quantized [MLX](https://github.com/ml-explore/mlx) conversion of [openbmb/MiniCPM-o-4_5](https://huggingface.co/openbmb/MiniCPM-o-4_5) for fast inference on Apple Silicon (M1/M2/M3/M4).
-Includes **all modalities**: vision, audio input (Whisper), and TTS output (CosyVoice2 Llama backbone).
 ## Model Details
@@ -68,6 +71,7 @@ Image --> SigLIP2 (27L) --> Perceiver Resampler (64 queries) -------------------
 - **Audio input**: Speech recognition, audio description, sound classification
 - **TTS output**: Text-to-speech via CosyVoice2 Llama backbone (requires Token2wav vocoder)
 - **Multilingual**: English, Chinese, Indonesian, French, German, etc.
 ## Requirements
@@ -83,8 +87,15 @@ Optional dependencies:
 ```bash
 pip install librosa                # Audio resampling (if input isn't 16kHz)
 pip install minicpmo-utils[all]    # Token2wav vocoder for TTS output
 ```
 ## Quick Start
 ### Chat Script
@@ -114,7 +125,115 @@ python chat_minicpmo.py --audio recording.wav
 python chat_minicpmo.py -p "Say hello" --tts --tts-output hello.wav
 ```
-Interactive commands: `/image <path>` | `/audio <path>` | `/clear` | `/quit`
 ### Python API

   - tts
   - speech
   - whisper
+  - streaming
+  - real-time
+  - screen-capture
 language:
   - en
   - zh
 4-bit quantized [MLX](https://github.com/ml-explore/mlx) conversion of [openbmb/MiniCPM-o-4_5](https://huggingface.co/openbmb/MiniCPM-o-4_5) for fast inference on Apple Silicon (M1/M2/M3/M4).
+Includes **all modalities**: vision, audio input (Whisper), TTS output (CosyVoice2 Llama backbone), and **full duplex streaming** (real-time screen + audio capture).
 ## Model Details
 - **Audio input**: Speech recognition, audio description, sound classification
 - **TTS output**: Text-to-speech via CosyVoice2 Llama backbone (requires Token2wav vocoder)
 - **Multilingual**: English, Chinese, Indonesian, French, German, etc.
+- **Full duplex streaming**: Real-time screen capture + system audio analysis with continuous LLM output
 ## Requirements
 ```bash
 pip install librosa                # Audio resampling (if input isn't 16kHz)
 pip install minicpmo-utils[all]    # Token2wav vocoder for TTS output
+pip install mss sounddevice        # For streaming mode (screen + audio capture)
 ```
+For system audio capture on macOS (streaming mode):
+```bash
+brew install blackhole-2ch
+```
+Then open **Audio MIDI Setup** > create a **Multi-Output Device** combining your speakers + BlackHole 2ch.
 ## Quick Start
 ### Chat Script
 python chat_minicpmo.py -p "Say hello" --tts --tts-output hello.wav
 ```
+Interactive commands: `/image <path>` | `/audio <path>` | `/live` | `/clear` | `/quit`
+## Streaming Mode (Full Duplex)
+Real-time streaming mode captures your screen (1 fps) and system audio (16kHz) simultaneously, feeding them to the model every second for continuous analysis. Think of it as a live AI commentator for whatever's on your screen.
+**Use cases**: real-time video translation, live captioning, accessibility narration, gameplay commentary, meeting summarization.
+### Architecture
+```
+[Screen Capture 1fps] ──┐
+                        ├──> ChunkSynchronizer ──> Streaming Whisper ──> LLM (KV cache) ──> Text Output
+[System Audio 16kHz] ───┘         ↑                      ↑                    ↑                  │
+                            MelProcessor          Whisper KV cache       LLM KV cache            │
+                                                                                                  ▼
+                                                                                          TTS Playback (optional)
+```
+### Quick Start
+```bash
+# Full duplex streaming (captures primary monitor + system audio)
+python chat_minicpmo.py --live
+# Capture specific screen region
+python chat_minicpmo.py --live --capture-region 0,0,1920,1080
+# Use mic instead of system audio
+python chat_minicpmo.py --live --audio-device "MacBook Pro Microphone"
+# With TTS output (speaks responses aloud)
+python chat_minicpmo.py --live --tts
+# Or start from interactive mode
+python chat_minicpmo.py
+> /live
+```
+Press **Ctrl+C** to stop streaming.
+### CLI Options
+| Flag | Default | Description |
+|------|---------|-------------|
+| `--live` | — | Enable full duplex streaming mode |
+| `--capture-region` | Primary monitor | Screen region as `x,y,w,h` |
+| `--audio-device` | `BlackHole` | Audio input device name |
+| `--tts` | Off | Enable TTS speech output |
+| `--temp` | `0.0` | Sampling temperature |
+| `--max-tokens` | `512` | Max tokens per chunk response |
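The `x,y,w,h` format for `--capture-region` can be validated at parse time. The sketch below is an illustration only: `parse_region` is a hypothetical helper, not a function confirmed to exist in `chat_minicpmo.py`, and the returned dict uses the `left`/`top`/`width`/`height` keys that `mss` accepts for a capture region.

```python
import argparse


def parse_region(text: str) -> dict:
    """Parse 'x,y,w,h' into an mss-style region dict (hypothetical helper)."""
    parts = text.split(",")
    if len(parts) != 4:
        raise argparse.ArgumentTypeError(f"expected x,y,w,h but got {text!r}")
    try:
        x, y, w, h = (int(p) for p in parts)
    except ValueError:
        raise argparse.ArgumentTypeError(f"non-integer value in {text!r}")
    if w <= 0 or h <= 0:
        raise argparse.ArgumentTypeError("width and height must be positive")
    return {"left": x, "top": y, "width": w, "height": h}


# Only the two flags needed for the example; the real CLI has more.
parser = argparse.ArgumentParser()
parser.add_argument("--live", action="store_true")
parser.add_argument("--capture-region", type=parse_region, default=None)

args = parser.parse_args(["--live", "--capture-region", "0,0,1920,1080"])
print(args.capture_region)
```

Using the helper as the argparse `type` means a malformed region is rejected with a usage error before any capture starts.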
+### How It Works
+1. **Screen capture** (`mss`): Grabs a screenshot at 1 fps, resizes to 448x448, feeds through SigLIP2 vision encoder + Perceiver Resampler (64 tokens).
+2. **Audio capture** (`sounddevice`): Records system audio via BlackHole virtual device at 16kHz. Accumulates 1-second chunks.
+3. **Streaming Whisper encoder**: Processes audio incrementally using a KV cache, so earlier audio is never re-encoded. Conv1d buffers maintain continuity across chunk boundaries. The encoder state resets automatically on reaching 1500 positions.
+4. **LLM with KV cache continuation**: Each chunk's vision + audio embeddings are prefilled into the running LLM cache. The model decides whether to listen or speak based on the input.
+5. **Text generation**: When the model has something to say, it generates text autoregressively from the cached state. Stops at `<|im_end|>` or mode-switch tokens.
+6. **TTS playback** (optional): Generated text is converted to audio tokens via the TTS Llama backbone and played back through speakers using Token2wav.
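Step 2 above (accumulating 1-second chunks at 16 kHz) can be sketched as a small buffer that collects variable-sized capture callbacks and emits fixed-size chunks. This is a minimal illustration, not the code in `streaming.py`: the class name `ChunkAccumulator` and the callback size of 6400 samples are assumptions, and plain lists stand in for audio arrays.

```python
SAMPLE_RATE = 16_000         # from the README: audio is captured at 16 kHz
CHUNK_SAMPLES = SAMPLE_RATE  # one chunk = 1 second of audio


class ChunkAccumulator:
    """Collects incoming samples and emits fixed-size chunks (hypothetical)."""

    def __init__(self, chunk_samples: int = CHUNK_SAMPLES):
        self.chunk_samples = chunk_samples
        self._buffer = []

    def push(self, samples):
        """Append samples; return every complete chunk now available."""
        self._buffer.extend(samples)
        chunks = []
        while len(self._buffer) >= self.chunk_samples:
            chunks.append(self._buffer[: self.chunk_samples])
            del self._buffer[: self.chunk_samples]
        return chunks

    @property
    def pending(self) -> int:
        """Number of samples waiting for the next chunk."""
        return len(self._buffer)


acc = ChunkAccumulator()
out = []
for _ in range(4):
    # Four callbacks of 6400 samples each: 25600 samples in total.
    out.extend(acc.push([0.0] * 6_400))

print(len(out), acc.pending)  # one full chunk, 9600 samples left over
```

Leftover samples stay in the buffer, so no audio is dropped when callback sizes do not divide evenly into one second.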
+### Output Format
+```
+[1] The video shows a person speaking in Indonesian about cooking techniques.
+  >> chunk=1 mode=listen cache=142tok latency=1850ms mem=8.2GB
+[2] They are now demonstrating how to prepare sambal with a mortar and pestle.
+  >> chunk=2 mode=listen cache=284tok latency=2100ms mem=8.4GB
+```
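The `>>` status lines shown above are regular enough to parse for logging or plotting. The following is a sketch that assumes the status line always has exactly the fields and units shown in the sample; `parse_status` is a hypothetical helper, not part of the repository.

```python
import re

# Matches lines like:
#   >> chunk=2 mode=listen cache=284tok latency=2100ms mem=8.4GB
STATUS_RE = re.compile(
    r">>\s*chunk=(?P<chunk>\d+)\s+mode=(?P<mode>\w+)\s+"
    r"cache=(?P<cache>\d+)tok\s+latency=(?P<latency>\d+)ms\s+"
    r"mem=(?P<mem>\d+(?:\.\d+)?)GB"
)


def parse_status(line: str):
    """Return the status fields as a dict, or None for non-status lines."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    return {
        "chunk": int(m["chunk"]),
        "mode": m["mode"],
        "cache_tokens": int(m["cache"]),
        "latency_ms": int(m["latency"]),
        "mem_gb": float(m["mem"]),
    }


status = parse_status(
    "  >> chunk=2 mode=listen cache=284tok latency=2100ms mem=8.4GB"
)
print(status)
```

Lines carrying generated text (those starting with `[n]`) return `None`, so the same function can filter a mixed output stream.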
+### System Audio Setup (macOS)
+To capture system audio (what's playing through your speakers), you need [BlackHole](https://github.com/ExistentialAudio/BlackHole):
+1. Install: `brew install blackhole-2ch`
+2. Open **Audio MIDI Setup** (Spotlight > "Audio MIDI Setup")
+3. Click **+** > **Create Multi-Output Device**
+4. Check both your speakers (for example, **MacBook Pro Speakers**) and **BlackHole 2ch**
+5. Set this Multi-Output Device as your system output (System Settings > Sound > Output; System Preferences on macOS 12 and earlier)
+6. Run streaming with default `--audio-device BlackHole`
+Without BlackHole, use your mic: `--audio-device "MacBook Pro Microphone"`
+### Memory & Latency Budget
+| Component | Memory | Latency |
+|-----------|--------|---------|
+| Model weights | ~7.0 GB | — |
+| LLM KV cache (4096 tok) | ~1.2 GB | — |
+| Whisper KV cache (1500 pos) | ~0.3 GB | — |
+| Screen capture | — | ~10ms |
+| Mel extraction | — | ~50ms |
+| Whisper streaming encode | — | ~200ms |
+| Vision encode | — | ~150ms |
+| LLM prefill (chunk) | — | ~300ms |
+| LLM generate (50 tok) | — | ~1s |
+| **Total peak** | **~9.0 GB** | **~2.2s/chunk** |
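As a quick check of the budget, the listed rows can be summed directly. The values below are copied from the table, converted to MB and ms so the arithmetic stays in integers; treating the latency stages as strictly sequential is an assumption. The listed rows come to about 8.5 GB and 1.71 s, so the stated totals of ~9.0 GB and ~2.2 s/chunk presumably include overhead the table does not itemize.

```python
# Memory rows from the table, in MB.
MEMORY_MB = {
    "Model weights": 7000,
    "LLM KV cache (4096 tok)": 1200,
    "Whisper KV cache (1500 pos)": 300,
}

# Latency rows from the table, in ms (assumed to run one after another).
LATENCY_MS = {
    "Screen capture": 10,
    "Mel extraction": 50,
    "Whisper streaming encode": 200,
    "Vision encode": 150,
    "LLM prefill (chunk)": 300,
    "LLM generate (50 tok)": 1000,
}

memory_total_mb = sum(MEMORY_MB.values())
latency_total_ms = sum(LATENCY_MS.values())

print(f"listed memory:  {memory_total_mb / 1000:.1f} GB")
print(f"listed latency: {latency_total_ms / 1000:.2f} s per chunk")
```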
+### Files
+| File | Description |
+|------|-------------|
+| [`streaming.py`](streaming.py) | ScreenCapture, AudioCapture, ChunkSynchronizer, DuplexGenerator, TTSPlayback |
+| [`chat_minicpmo.py`](chat_minicpmo.py) | CLI with `--live` flag and `/live` interactive command |
 ### Python API