andrevp committed on
Commit
2f69d71
·
verified ·
1 Parent(s): 2fe6f6e

Add streaming mode section to README

Files changed (1)
  1. README.md +121 -2
README.md CHANGED
@@ -14,6 +14,9 @@ tags:
14
  - tts
15
  - speech
16
 - whisper
17
  language:
18
  - en
19
  - zh
@@ -28,7 +31,7 @@ pipeline_tag: image-text-to-text
28
 
29
  4-bit quantized [MLX](https://github.com/ml-explore/mlx) conversion of [openbmb/MiniCPM-o-4_5](https://huggingface.co/openbmb/MiniCPM-o-4_5) for fast inference on Apple Silicon (M1/M2/M3/M4).
30
 
31
- Includes **all modalities**: vision, audio input (Whisper), and TTS output (CosyVoice2 Llama backbone).
32
 
33
  ## Model Details
34
 
@@ -68,6 +71,7 @@ Image --> SigLIP2 (27L) --> Perceiver Resampler (64 queries) -------------------
68
  - **Audio input**: Speech recognition, audio description, sound classification
69
  - **TTS output**: Text-to-speech via CosyVoice2 Llama backbone (requires Token2wav vocoder)
70
 - **Multilingual**: English, Chinese, Indonesian, French, German, etc.
71
 
72
  ## Requirements
73
 
@@ -83,8 +87,15 @@ Optional dependencies:
83
  ```bash
84
  pip install librosa # Audio resampling (if input isn't 16kHz)
85
 pip install minicpmo-utils[all] # Token2wav vocoder for TTS output
86
 ```
87

88
  ## Quick Start
89
 
90
  ### Chat Script
@@ -114,7 +125,115 @@ python chat_minicpmo.py --audio recording.wav
114
  python chat_minicpmo.py -p "Say hello" --tts --tts-output hello.wav
115
  ```
116
 
117
- Interactive commands: `/image <path>` | `/audio <path>` | `/clear` | `/quit`
118
 
119
  ### Python API
120
 
 
14
  - tts
15
  - speech
16
  - whisper
17
+ - streaming
18
+ - real-time
19
+ - screen-capture
20
  language:
21
  - en
22
  - zh
 
31
 
32
  4-bit quantized [MLX](https://github.com/ml-explore/mlx) conversion of [openbmb/MiniCPM-o-4_5](https://huggingface.co/openbmb/MiniCPM-o-4_5) for fast inference on Apple Silicon (M1/M2/M3/M4).
33
 
34
+ Includes **all modalities**: vision, audio input (Whisper), TTS output (CosyVoice2 Llama backbone), and **full-duplex streaming** (real-time screen + audio capture).
35
 
36
  ## Model Details
37
 
 
71
  - **Audio input**: Speech recognition, audio description, sound classification
72
  - **TTS output**: Text-to-speech via CosyVoice2 Llama backbone (requires Token2wav vocoder)
73
  - **Multilingual**: English, Chinese, Indonesian, French, German, etc.
74
+ - **Full-duplex streaming**: Real-time screen capture + system audio analysis with continuous LLM output
75
 
76
  ## Requirements
77
 
 
87
  ```bash
88
  pip install librosa # Audio resampling (if input isn't 16kHz)
89
  pip install minicpmo-utils[all] # Token2wav vocoder for TTS output
90
+ pip install mss sounddevice # For streaming mode (screen + audio capture)
91
  ```
92
 
93
+ For system audio capture on macOS (streaming mode):
94
+ ```bash
95
+ brew install blackhole-2ch
96
+ ```
97
+ Then open **Audio MIDI Setup** > create a **Multi-Output Device** combining your speakers + BlackHole 2ch.
98
+
99
  ## Quick Start
100
 
101
  ### Chat Script
 
125
  python chat_minicpmo.py -p "Say hello" --tts --tts-output hello.wav
126
  ```
127
 
128
+ Interactive commands: `/image <path>` | `/audio <path>` | `/live` | `/clear` | `/quit`
129
+
130
+ ## Streaming Mode (Full Duplex)
131
+
132
+ Real-time streaming mode captures your screen (1 fps) and system audio (16kHz) simultaneously, feeding them to the model every second for continuous analysis. Think of it as a live AI commentator for whatever's on your screen.
133
+
134
+ **Use cases**: real-time video translation, live captioning, accessibility narration, gameplay commentary, meeting summarization.
135
+
136
+ ### Architecture
137
+
138
+ ```
139
+ [Screen Capture 1fps] ──┐
140
+                         ├──> ChunkSynchronizer ──> Streaming Whisper ──> LLM (KV cache) ──> Text Output
141
+ [System Audio 16kHz] ───┘           ↑                      ↑                   ↑               │
142
+                                MelProcessor         Whisper KV cache      LLM KV cache         │
143
+                                                                                                ▼
144
+                                                                                   TTS Playback (optional)
145
+ ```
146
+
147
+ ### Quick Start
148
+
149
+ ```bash
150
+ # Full-duplex streaming (captures primary monitor + system audio)
151
+ python chat_minicpmo.py --live
152
+
153
+ # Capture specific screen region
154
+ python chat_minicpmo.py --live --capture-region 0,0,1920,1080
155
+
156
+ # Use mic instead of system audio
157
+ python chat_minicpmo.py --live --audio-device "MacBook Pro Microphone"
158
+
159
+ # With TTS output (speaks responses aloud)
160
+ python chat_minicpmo.py --live --tts
161
+
162
+ # Or start from interactive mode
163
+ python chat_minicpmo.py
164
+ > /live
165
+ ```
166
+
167
+ Press **Ctrl+C** to stop streaming.
168
+
169
+ ### CLI Options
170
+
171
+ | Flag | Default | Description |
172
+ |------|---------|-------------|
173
+ | `--live` | β€” | Enable full-duplex streaming mode |
174
+ | `--capture-region` | Primary monitor | Screen region as `x,y,w,h` |
175
+ | `--audio-device` | `BlackHole` | Audio input device name |
176
+ | `--tts` | Off | Enable TTS speech output |
177
+ | `--temp` | `0.0` | Sampling temperature |
178
+ | `--max-tokens` | `512` | Max tokens per chunk response |
179
+
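The `--capture-region` value is a plain `x,y,w,h` string. A minimal sketch of turning it into the `{left, top, width, height}` dict that `mss`'s `grab()` accepts (a hypothetical helper; the actual CLI parsing in `chat_minicpmo.py` may differ):

```python
# Hypothetical helper: parse "--capture-region 0,0,1920,1080" into the
# monitor dict format used by mss.mss().grab(). Not the actual
# chat_minicpmo.py implementation.
def parse_capture_region(spec: str) -> dict:
    try:
        x, y, w, h = (int(part) for part in spec.split(","))
    except ValueError:
        raise ValueError(f"expected x,y,w,h of integers, got {spec!r}")
    if w <= 0 or h <= 0:
        raise ValueError("width and height must be positive")
    return {"left": x, "top": y, "width": w, "height": h}
```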
180
+ ### How It Works
181
+
182
+ 1. **Screen capture** (`mss`): Grabs a screenshot at 1 fps, resizes to 448x448, feeds through SigLIP2 vision encoder + Perceiver Resampler (64 tokens).
183
+
184
+ 2. **Audio capture** (`sounddevice`): Records system audio via BlackHole virtual device at 16kHz. Accumulates 1-second chunks.
185
+
186
+ 3. **Streaming Whisper encoder**: Processes audio incrementally using KV cache β€” no need to re-encode previous audio. Conv1d buffers maintain continuity across chunk boundaries. Auto-resets when reaching 1500 positions.
187
+
188
+ 4. **LLM with KV cache continuation**: Each chunk's vision + audio embeddings are prefilled into the running LLM cache. The model decides whether to listen or speak based on the input.
189
+
190
+ 5. **Text generation**: When the model has something to say, it generates text autoregressively from the cached state. Stops at `<|im_end|>` or mode-switch tokens.
191
+
192
+ 6. **TTS playback** (optional): Generated text is converted to audio tokens via the TTS Llama backbone and played back through speakers using Token2wav.
193
+
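The chunk bookkeeping in steps 2 and 3 can be sketched as follows, with the model calls stubbed out. The class name and the 50-positions-per-second figure (Whisper encodes 30 s of audio into 1500 positions) are illustrative assumptions, not the `streaming.py` API:

```python
# Sketch only: audio buffering + Whisper KV-cache position tracking.
# ChunkState is a hypothetical name, not part of streaming.py.
SAMPLE_RATE = 16_000          # system audio sample rate
CHUNK_SAMPLES = SAMPLE_RATE   # 1-second chunks
WHISPER_MAX_POS = 1500        # encoder positions before auto-reset
POS_PER_CHUNK = 50            # assumed ~50 positions per second (1500 / 30 s)

class ChunkState:
    """Tracks audio accumulation and Whisper KV-cache positions across chunks."""

    def __init__(self):
        self.audio_buffer = []   # samples waiting to fill a 1-second chunk
        self.whisper_pos = 0     # positions currently held in the Whisper KV cache
        self.resets = 0          # how many times the cache was rebuilt

    def push_audio(self, samples):
        """Buffer incoming samples; return any complete 1-second chunks."""
        self.audio_buffer.extend(samples)
        chunks = []
        while len(self.audio_buffer) >= CHUNK_SAMPLES:
            chunks.append(self.audio_buffer[:CHUNK_SAMPLES])
            self.audio_buffer = self.audio_buffer[CHUNK_SAMPLES:]
        return chunks

    def advance_whisper(self):
        """Account for one encoded chunk; auto-reset at the position limit."""
        if self.whisper_pos + POS_PER_CHUNK > WHISPER_MAX_POS:
            self.whisper_pos = 0   # start a fresh encoder cache
            self.resets += 1
        self.whisper_pos += POS_PER_CHUNK
```

Pushing 2.5 s of audio yields two full chunks with half a second left buffered; after 30 encoded chunks the cache hits the 1500-position limit and the next chunk triggers a reset.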
194
+ ### Output Format
195
+
196
+ ```
197
+ [1] The video shows a person speaking in Indonesian about cooking techniques.
198
+ >> chunk=1 mode=listen cache=142tok latency=1850ms mem=8.2GB
199
+ [2] They are now demonstrating how to prepare sambal with a mortar and pestle.
200
+ >> chunk=2 mode=listen cache=284tok latency=2100ms mem=8.4GB
201
+ ```
202
+
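If you want to log or plot those `>>` status lines, they are easy to parse. A hypothetical helper (not part of `streaming.py`; the field layout is taken from the example above):

```python
import re

# Matches status lines like:
# ">> chunk=2 mode=listen cache=284tok latency=2100ms mem=8.4GB"
STATUS_RE = re.compile(
    r">> chunk=(?P<chunk>\d+) mode=(?P<mode>\w+) cache=(?P<cache>\d+)tok "
    r"latency=(?P<latency>\d+)ms mem=(?P<mem>[\d.]+)GB"
)

def parse_status(line: str) -> dict:
    """Parse one status line into typed fields for logging or plotting."""
    m = STATUS_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a status line: {line!r}")
    d = m.groupdict()
    return {
        "chunk": int(d["chunk"]),
        "mode": d["mode"],
        "cache_tokens": int(d["cache"]),
        "latency_ms": int(d["latency"]),
        "mem_gb": float(d["mem"]),
    }
```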
203
+ ### System Audio Setup (macOS)
204
+
205
+ To capture system audio (what's playing through your speakers), you need [BlackHole](https://github.com/ExistentialAudio/BlackHole):
206
+
207
+ 1. Install: `brew install blackhole-2ch`
208
+ 2. Open **Audio MIDI Setup** (Spotlight > "Audio MIDI Setup")
209
+ 3. Click **+** > **Create Multi-Output Device**
210
+ 4. Check both **MacBook Pro Speakers** and **BlackHole 2ch**
211
+ 5. Set this Multi-Output Device as your system output (System Preferences > Sound > Output)
212
+ 6. Run streaming with default `--audio-device BlackHole`
213
+
214
+ Without BlackHole, use your mic: `--audio-device "MacBook Pro Microphone"`
215
+
216
+ ### Memory & Latency Budget
217
+
218
+ | Component | Memory | Latency |
219
+ |-----------|--------|---------|
220
+ | Model weights | ~7.0 GB | β€” |
221
+ | LLM KV cache (4096 tok) | ~1.2 GB | β€” |
222
+ | Whisper KV cache (1500 pos) | ~0.3 GB | β€” |
223
+ | Screen capture | β€” | ~10ms |
224
+ | Mel extraction | β€” | ~50ms |
225
+ | Whisper streaming encode | β€” | ~200ms |
226
+ | Vision encode | β€” | ~150ms |
227
+ | LLM prefill (chunk) | β€” | ~300ms |
228
+ | LLM generate (50 tok) | β€” | ~1s |
229
+ | **Total peak** | **~9.0 GB** | **~2.2s/chunk** |
230
+
231
+ ### Files
232
+
233
+ | File | Description |
234
+ |------|-------------|
235
+ | [`streaming.py`](streaming.py) | ScreenCapture, AudioCapture, ChunkSynchronizer, DuplexGenerator, TTSPlayback |
236
+ | [`chat_minicpmo.py`](chat_minicpmo.py) | CLI with `--live` flag and `/live` interactive command |
237
 
238
  ### Python API
239