---
license: apache-2.0
license_link: https://github.com/OpenBMB/MiniCPM-V/blob/main/LICENSE
base_model: openbmb/MiniCPM-o-4_5
tags:
- mlx
- vision
- multimodal
- vlm
- minicpm
- apple-silicon
- quantized
- audio
- tts
- speech
- whisper
- streaming
- real-time
- screen-capture
language:
- en
- zh
- id
- fr
- de
library_name: mlx
pipeline_tag: image-text-to-text
---

# MiniCPM-o 4.5 — MLX 4-bit Quantized (Full Multimodal)

4-bit quantized [MLX](https://github.com/ml-explore/mlx) conversion of [openbmb/MiniCPM-o-4_5](https://huggingface.co/openbmb/MiniCPM-o-4_5) for fast inference on Apple Silicon (M1/M2/M3/M4). Includes **all modalities**: vision, audio input (Whisper), TTS output (CosyVoice2 Llama backbone), and **full duplex streaming** (real-time screen + audio capture).

## Model Details

| | |
|---|---|
| **Base model** | [openbmb/MiniCPM-o-4_5](https://huggingface.co/openbmb/MiniCPM-o-4_5) |
| **Architecture** | SigLIP2 (27L) + Perceiver Resampler + Whisper Encoder (24L) + Qwen3 LLM (36L) + TTS Llama (20L) |
| **Parameters** | ~8B |
| **Quantization** | 4-bit (6.031 effective bits) — LLM quantized, all encoders full precision |
| **Size on disk** | ~7.0 GB |
| **Weight keys** | 1925 total (LLM: 907, Vision: 437, Resampler: 17, Audio: 367, Audio Proj: 4, TTS: 193) |
| **Framework** | [MLX](https://github.com/ml-explore/mlx) via [mlx-vlm](https://github.com/Blaizzy/mlx-vlm) |

## Architecture

```
Audio (.wav) --> Mel Spectrogram --> WhisperEncoder (24L, 1024d) --> AudioProjection --> AvgPool(5) --\
                                                                                                       \
Text  --> Tokenizer ----------------------------------------------> Qwen3 LLM (36L) --> Text Output
                                                                   /         \
Image --> SigLIP2 (27L) --> Perceiver Resampler (64 queries) -----/           \
                                                                               LLM hidden states --> TTSProjector --> TTS Llama (20L) --> Audio Tokens
```

## Performance (M4 Pro, 24 GB RAM)

| Mode | Prompt Processing | Generation | Peak Memory |
|------|-------------------|------------|-------------|
| Text-only | ~60 tok/s | ~55 tok/s | ~7.1 GB |
| Image + Text | ~150 tok/s | ~49 tok/s | ~8.3 GB |
| Audio + Text | ~85 tok/s | ~55 tok/s | ~8.4 GB |

## Capabilities

- **Vision**: Image understanding, OCR, chart/diagram analysis, math solving, visual reasoning
- **Audio input**: Speech recognition, audio description, sound classification
- **TTS output**: Text-to-speech via CosyVoice2 Llama backbone (requires Token2wav vocoder)
- **Multilingual**: English, Chinese, Indonesian, French, German, etc.
- **Full duplex streaming**: Real-time screen capture + system audio analysis with continuous LLM output

## Requirements

- Apple Silicon Mac (M1 or later)
- Python 3.10+
- ~10 GB free RAM (for full multimodal)

```bash
pip install mlx-vlm torch transformers Pillow soundfile
```

Optional dependencies:

```bash
pip install librosa               # Audio resampling (if input isn't 16kHz)
pip install minicpmo-utils[all]   # Token2wav vocoder for TTS output
pip install mss sounddevice       # For streaming mode (screen + audio capture)
```

For system audio capture on macOS (streaming mode):

```bash
brew install blackhole-2ch
```

Then open **Audio MIDI Setup** > create a **Multi-Output Device** combining your speakers + BlackHole 2ch.

## Quick Start

### Chat Script

A standalone [`chat_minicpmo.py`](chat_minicpmo.py) script is included:

```bash
# Image input
python chat_minicpmo.py photo.jpg -p "What's in this image?"

# Audio input
python chat_minicpmo.py --audio speech.wav -p "What is being said?"

# Audio description
python chat_minicpmo.py --audio sound.wav -p "Describe this audio."

# Text-only
python chat_minicpmo.py -p "Explain quantum computing briefly."
```
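The `--audio` examples above expect 16kHz mono WAV input; installing `librosa` lets the script resample other rates automatically. If you'd rather avoid the extra dependency, a rough NumPy-only fallback is enough for experimentation. The helper below (`to_16k_mono`, a hypothetical name, not part of this repo) is a minimal sketch of that idea:

```python
import numpy as np

def to_16k_mono(audio: np.ndarray, sr: int, target_sr: int = 16000) -> np.ndarray:
    """Downmix to mono and linearly resample to target_sr.
    Sketch only -- librosa's polyphase resampler gives better quality."""
    if audio.ndim > 1:
        audio = audio.mean(axis=1)  # stereo -> mono
    if sr == target_sr:
        return audio.astype(np.float32)
    n_out = int(round(len(audio) * target_sr / sr))
    # Linear interpolation onto the new 16 kHz time grid
    t_in = np.arange(len(audio)) / sr
    t_out = np.arange(n_out) / target_sr
    return np.interp(t_out, t_in, audio).astype(np.float32)

# Example: 1 second of 44.1 kHz stereo becomes 16000 mono samples
stereo = np.zeros((44100, 2), dtype=np.float32)
mono16k = to_16k_mono(stereo, 44100)
print(mono16k.shape)  # (16000,)
```

Write the result back out with `soundfile.write("out.wav", mono16k, 16000)` before passing it to the chat script.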
```bash
# Interactive mode
python chat_minicpmo.py

# Interactive with pre-loaded audio
python chat_minicpmo.py --audio recording.wav

# TTS output (requires minicpmo-utils)
python chat_minicpmo.py -p "Say hello" --tts --tts-output hello.wav
```

Interactive commands: `/image <path>` | `/audio <path>` | `/live` | `/clear` | `/quit`

## Streaming Mode (Full Duplex)

Real-time streaming mode captures your screen (1 fps) and system audio (16kHz) simultaneously, feeding them to the model every second for continuous analysis. Think of it as a live AI commentator for whatever's on your screen.

**Use cases**: real-time video translation, live captioning, accessibility narration, gameplay commentary, meeting summarization.

### Architecture

```
[Screen Capture 1fps] ──┐
                        ├──> ChunkSynchronizer ──> Streaming Whisper ──> LLM (KV cache) ──> Text Output
[System Audio 16kHz] ───┘          ↑                      ↑                    ↑                 │
                              MelProcessor         Whisper KV cache       LLM KV cache          ▼
                                                                                   TTS Playback (optional)
```

### Quick Start

```bash
# Full duplex streaming (captures primary monitor + system audio)
python chat_minicpmo.py --live

# Capture specific screen region
python chat_minicpmo.py --live --capture-region 0,0,1920,1080

# Use mic instead of system audio
python chat_minicpmo.py --live --audio-device "MacBook Pro Microphone"

# With TTS output (speaks responses aloud)
python chat_minicpmo.py --live --tts

# Or start from interactive mode
python chat_minicpmo.py
> /live
```

Press **Ctrl+C** to stop streaming.

### CLI Options

| Flag | Default | Description |
|------|---------|-------------|
| `--live` | — | Enable full duplex streaming mode |
| `--capture-region` | Primary monitor | Screen region as `x,y,w,h` |
| `--audio-device` | `BlackHole` | Audio input device name |
| `--tts` | Off | Enable TTS speech output |
| `--temp` | `0.0` | Sampling temperature |
| `--max-tokens` | `512` | Max tokens per chunk response |

### How It Works
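Before the per-chunk pipeline runs, the two capture streams have to be paired up second by second. A minimal sketch of the pairing logic behind `ChunkSynchronizer` (hypothetical class shape and method names; the real implementation lives in [`streaming.py`](streaming.py)):

```python
import queue
from dataclasses import dataclass
from typing import Optional

@dataclass
class Chunk:
    chunk_id: int
    frame: object   # one screenshot (e.g. an mss grab)
    audio: object   # one second of 16 kHz samples

class ChunkSynchronizer:
    """Pairs one screen frame with the matching second of audio.
    Illustrative sketch of the idea only."""

    def __init__(self):
        self.frames = queue.Queue()
        self.audio = queue.Queue()
        self._next_id = 0

    def push_frame(self, frame):
        self.frames.put(frame)

    def push_audio(self, samples):
        self.audio.put(samples)

    def next_chunk(self) -> Optional[Chunk]:
        # Only emit a chunk once both modalities have data for this second
        if self.frames.empty() or self.audio.empty():
            return None
        chunk = Chunk(self._next_id, self.frames.get(), self.audio.get())
        self._next_id += 1
        return chunk

sync = ChunkSynchronizer()
sync.push_frame("frame-0")
sync.push_audio([0.0] * 16000)
chunk = sync.next_chunk()
print(chunk.chunk_id)  # 0
```

Each emitted chunk then flows through the encoders and the LLM prefill described in the numbered steps that follow.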
1. **Screen capture** (`mss`): Grabs a screenshot at 1 fps, resizes it to 448x448, and feeds it through the SigLIP2 vision encoder + Perceiver Resampler (64 tokens).
2. **Audio capture** (`sounddevice`): Records system audio via the BlackHole virtual device at 16kHz, accumulating 1-second chunks.
3. **Streaming Whisper encoder**: Processes audio incrementally using a KV cache — no need to re-encode previous audio. Conv1d buffers maintain continuity across chunk boundaries; the cache auto-resets when it reaches 1500 positions.
4. **LLM with KV cache continuation**: Each chunk's vision + audio embeddings are prefilled into the running LLM cache. The model decides whether to listen or speak based on the input.
5. **Text generation**: When the model has something to say, it generates text autoregressively from the cached state, stopping at `<|im_end|>` or mode-switch tokens.
6. **TTS playback** (optional): Generated text is converted to audio tokens via the TTS Llama backbone and played back through the speakers using Token2wav.

### Output Format

```
[1] The video shows a person speaking in Indonesian about cooking techniques.
    >> chunk=1 mode=listen cache=142tok latency=1850ms mem=8.2GB

[2] They are now demonstrating how to prepare sambal with a mortar and pestle.
    >> chunk=2 mode=listen cache=284tok latency=2100ms mem=8.4GB
```

### System Audio Setup (macOS)

To capture system audio (what's playing through your speakers), you need [BlackHole](https://github.com/ExistentialAudio/BlackHole):

1. Install: `brew install blackhole-2ch`
2. Open **Audio MIDI Setup** (Spotlight > "Audio MIDI Setup")
3. Click **+** > **Create Multi-Output Device**
4. Check both **MacBook Pro Speakers** and **BlackHole 2ch**
5. Set this Multi-Output Device as your system output (System Preferences > Sound > Output)
6. Run streaming with the default `--audio-device BlackHole`

Without BlackHole, use your mic: `--audio-device "MacBook Pro Microphone"`

### Memory & Latency Budget

| Component | Memory | Latency |
|-----------|--------|---------|
| Model weights | ~7.0 GB | — |
| LLM KV cache (4096 tok) | ~1.2 GB | — |
| Whisper KV cache (1500 pos) | ~0.3 GB | — |
| Screen capture | — | ~10ms |
| Mel extraction | — | ~50ms |
| Whisper streaming encode | — | ~200ms |
| Vision encode | — | ~150ms |
| LLM prefill (chunk) | — | ~300ms |
| LLM generate (50 tok) | — | ~1s |
| **Total peak** | **~9.0 GB** | **~2.2s/chunk** |

### Files

| File | Description |
|------|-------------|
| [`streaming.py`](streaming.py) | ScreenCapture, AudioCapture, ChunkSynchronizer, DuplexGenerator, TTSPlayback |
| [`chat_minicpmo.py`](chat_minicpmo.py) | CLI with `--live` flag and `/live` interactive command |

### Python API

```python
from mlx_vlm import load
from mlx_vlm.generate import generate_step
import mlx.core as mx

model, processor = load("andrevp/MiniCPM-o-4_5-MLX-4bit", trust_remote_code=True)

# Text-only
text = "<|im_start|>user\nWhat is machine learning?<|im_end|>\n<|im_start|>assistant\n"
input_ids = mx.array(processor.tokenizer(text, return_tensors="np")["input_ids"])

tokens = []
for token, _ in generate_step(input_ids, model, None, None, temp=0.0):
    tok_val = int(token)
    tokens.append(tok_val)
    if processor.tokenizer.decode([tok_val]) in ["<|im_end|>", "<|endoftext|>"]:
        break

print(processor.tokenizer.decode(tokens, skip_special_tokens=True))
```

### Audio Input (Python API)

```python
import soundfile as sf
import numpy as np
import mlx.core as mx
from transformers import WhisperFeatureExtractor

# Load and preprocess audio
audio, sr = sf.read("speech.wav", dtype="float32")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # stereo to mono

# Extract mel spectrogram
fe = WhisperFeatureExtractor(feature_size=80, sampling_rate=16000, n_fft=400, hop_length=160)
inputs = fe(audio, sampling_rate=16000, return_tensors="pt",
            padding="max_length", return_attention_mask=True)
mel = inputs["input_features"]
actual_len = inputs["attention_mask"].sum(dim=1)
mel_trimmed = mel[:, :, :int(actual_len[0])]

# Convert to MLX and run through the audio encoder
audio_features = mx.array(mel_trimmed.numpy())  # (1, 80, frames)

# Pass audio_features and audio_bound to generate_step via kwargs
# See chat_minicpmo.py for the full pipeline
```

## Component Details

### Audio Encoder (Whisper)

- 24-layer Whisper encoder (1024d, 16 heads, 4096 FFN)
- Conv1d feature extraction: mel (80 bins) -> conv1 (stride=1) -> conv2 (stride=2)
- Learned positional embeddings (max 1500 positions)
- Audio projection: 2-layer MLP (1024 -> 4096) with ReLU
- Average pooling with stride 5

### TTS Model (CosyVoice2 Llama)

- 20-layer Llama backbone (768d, 12 heads, 3072 FFN)
- Text embedding: 152064 tokens -> 768d
- Audio codebook: 6562 tokens (1 VQ codebook)
- Semantic projector: LLM hidden (4096d) -> TTS hidden (768d)
- Speaker projector: LLM hidden (4096d) -> speaker embedding (768d)
- Autoregressive generation with temperature + top-p sampling

### Audio Special Tokens

| Token | ID | Purpose |
|-------|-----|---------|
| `<\|audio_start\|>` | 151697 | Start of audio placeholder |
| `<\|audio\|>` | 151698 | Audio token |
| `<\|audio_end\|>` | 151699 | End of audio placeholder |
| `<\|spk_bos\|>` | 151700 | Speaker embedding start |
| `<\|spk_eos\|>` | 151702 | Speaker embedding end |
| `<\|tts_bos\|>` | 151703 | TTS generation start |
| `<\|tts_eos\|>` | 151704 | TTS generation end |

## Quantization Details

| Component | Keys | Precision | Notes |
|-----------|------|-----------|-------|
| Qwen3 LLM (36L) | 907 | 4-bit (group_size=64) | Main language model |
| SigLIP2 Vision (27L) | 437 | Full precision | Vision encoder |
| Perceiver Resampler | 17 | Full precision | Cross-attention resampler |
| Whisper Audio (24L) | 367 | Full precision | Audio encoder |
| Audio Projection | 4 | Full precision | 2-layer MLP |
| TTS Llama (20L) | 193 | Full precision | Speech synthesis backbone |

## Notes

- Audio input requires 16kHz mono WAV. Install `librosa` for automatic resampling from other sample rates.
- TTS output generates audio token IDs. Converting them to a waveform requires the `Token2wav` vocoder from `minicpmo-utils[all]`.
- The model processes one image per turn and one audio clip per turn.
- Quantization may slightly reduce output quality compared to the full-precision model.

## License

This model is released under the **Apache-2.0** license, following the original [openbmb/MiniCPM-o-4_5](https://huggingface.co/openbmb/MiniCPM-o-4_5) license. See the [original license](https://github.com/OpenBMB/MiniCPM-V/blob/main/LICENSE) for full terms.

## Disclaimer

> As an LMM, MiniCPM-o 4.5 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions, or make value judgments. Anything generated by MiniCPM-o 4.5 does not represent the views and positions of the model developers. We will not be liable for any problems arising from the use of the MiniCPM-o models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misdirection, misuse, or dissemination of the model.

## Credits

- **Original model**: [OpenBMB](https://github.com/OpenBMB) — [MiniCPM-o 4.5](https://huggingface.co/openbmb/MiniCPM-o-4_5)
- **MLX framework**: [Apple ML Explore](https://github.com/ml-explore/mlx)
- **mlx-vlm**: [Prince Canuma](https://github.com/Blaizzy/mlx-vlm)