---
license: apache-2.0
license_link: https://github.com/OpenBMB/MiniCPM-V/blob/main/LICENSE
base_model: openbmb/MiniCPM-o-4_5
tags:
- mlx
- vision
- multimodal
- vlm
- minicpm
- apple-silicon
- quantized
- audio
- tts
- speech
- whisper
- streaming
- real-time
- screen-capture
language:
- en
- zh
- id
- fr
- de
library_name: mlx
pipeline_tag: image-text-to-text
---

# MiniCPM-o 4.5 — MLX 4-bit Quantized (Full Multimodal)

4-bit quantized [MLX](https://github.com/ml-explore/mlx) conversion of [openbmb/MiniCPM-o-4_5](https://huggingface.co/openbmb/MiniCPM-o-4_5) for fast inference on Apple Silicon (M1/M2/M3/M4). Includes **all modalities**: vision, audio input (Whisper), TTS output (CosyVoice2 Llama backbone), and **full duplex streaming** (real-time screen + audio capture).

## Model Details

| | |
|---|---|
| **Base model** | [openbmb/MiniCPM-o-4_5](https://huggingface.co/openbmb/MiniCPM-o-4_5) |
| **Architecture** | SigLIP2 (27L) + Perceiver Resampler + Whisper Encoder (24L) + Qwen3 LLM (36L) + TTS Llama (20L) |
| **Parameters** | ~8B |
| **Quantization** | 4-bit (6.031 effective bits) — LLM quantized, all encoders full precision |
| **Size on disk** | ~7.0 GB |
| **Weight keys** | 1925 total (LLM: 907, Vision: 437, Resampler: 17, Audio: 367, Audio Proj: 4, TTS: 193) |
| **Framework** | [MLX](https://github.com/ml-explore/mlx) via [mlx-vlm](https://github.com/Blaizzy/mlx-vlm) |

## Architecture

```
Audio (.wav) --> Mel Spectrogram --> WhisperEncoder (24L, 1024d) --> AudioProjection --> AvgPool(5) --\
                                                                                                       \
Text  --> Tokenizer ----------------------------------------------> Qwen3 LLM (36L) --> Text Output
                                                                   /         \
Image --> SigLIP2 (27L) --> Perceiver Resampler (64 queries) -----/           \
                                                                               LLM hidden states --> TTSProjector --> TTS Llama (20L) --> Audio Tokens
```

## Performance (M4 Pro, 24 GB RAM)

| Mode | Prompt Processing | Generation | Peak Memory |
|------|-------------------|------------|-------------|
| Text-only | ~60 tok/s | ~55 tok/s | ~7.1 GB |
| Image + Text | ~150 tok/s | ~49 tok/s | ~8.3 GB |
| Audio + Text | ~85 tok/s | ~55 tok/s | ~8.4 GB |

## Capabilities

- **Vision**: Image understanding, OCR, chart/diagram analysis, math solving, visual reasoning
- **Audio input**: Speech recognition, audio description, sound classification
- **TTS output**: Text-to-speech via CosyVoice2 Llama backbone (requires Token2wav vocoder)
- **Multilingual**: English, Chinese, Indonesian, French, German, etc.
- **Full duplex streaming**: Real-time screen capture + system audio analysis with continuous LLM output

## Requirements

- Apple Silicon Mac (M1 or later)
- Python 3.10+
- ~10 GB free RAM (for full multimodal)

```bash
pip install mlx-vlm torch transformers Pillow soundfile
```

Optional dependencies:

```bash
pip install librosa               # Audio resampling (if input isn't 16kHz)
pip install minicpmo-utils[all]   # Token2wav vocoder for TTS output
pip install mss sounddevice       # For streaming mode (screen + audio capture)
```

For system audio capture on macOS (streaming mode):

```bash
brew install blackhole-2ch
```

Then open **Audio MIDI Setup** > create a **Multi-Output Device** combining your speakers + BlackHole 2ch.

## Quick Start

### Chat Script

A standalone [`chat_minicpmo.py`](chat_minicpmo.py) script is included:

```bash
# Image input
python chat_minicpmo.py photo.jpg -p "What's in this image?"

# Audio input
python chat_minicpmo.py --audio speech.wav -p "What is being said?"

# Audio description
python chat_minicpmo.py --audio sound.wav -p "Describe this audio."

# Text-only
python chat_minicpmo.py -p "Explain quantum computing briefly."
```
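The `--audio` examples above expect 16kHz mono WAV input; installing `librosa` lets the script resample other rates automatically. If you'd rather avoid the extra dependency, a rough NumPy-only fallback is enough for experimentation. The helper below (`to_16k_mono`, a hypothetical name, not part of this repo) is a minimal sketch of that idea:

```python
import numpy as np

def to_16k_mono(audio: np.ndarray, sr: int, target_sr: int = 16000) -> np.ndarray:
    """Downmix to mono and linearly resample to target_sr.
    Sketch only -- librosa's polyphase resampler gives better quality."""
    if audio.ndim > 1:
        audio = audio.mean(axis=1)  # stereo -> mono
    if sr == target_sr:
        return audio.astype(np.float32)
    n_out = int(round(len(audio) * target_sr / sr))
    # Linear interpolation onto the new 16 kHz time grid
    t_in = np.arange(len(audio)) / sr
    t_out = np.arange(n_out) / target_sr
    return np.interp(t_out, t_in, audio).astype(np.float32)

# Example: 1 second of 44.1 kHz stereo becomes 16000 mono samples
stereo = np.zeros((44100, 2), dtype=np.float32)
mono16k = to_16k_mono(stereo, 44100)
print(mono16k.shape)  # (16000,)
```

Write the result back out with `soundfile.write("out.wav", mono16k, 16000)` before passing it to the chat script.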
```bash
# Interactive mode
python chat_minicpmo.py

# Interactive with pre-loaded audio
python chat_minicpmo.py --audio recording.wav

# TTS output (requires minicpmo-utils)
python chat_minicpmo.py -p "Say hello" --tts --tts-output hello.wav
```

Interactive commands: `/image <path>` | `/audio <path>` | `/live` | `/clear` | `/quit`

## Streaming Mode (Full Duplex)

Real-time streaming mode captures your screen (1 fps) and system audio (16kHz) simultaneously, feeding them to the model every second for continuous analysis. Think of it as a live AI commentator for whatever's on your screen.

**Use cases**: real-time video translation, live captioning, accessibility narration, gameplay commentary, meeting summarization.

### Architecture

```
[Screen Capture 1fps] ──┐
                        ├──> ChunkSynchronizer ──> Streaming Whisper ──> LLM (KV cache) ──> Text Output
[System Audio 16kHz] ───┘          ↑                      ↑                    ↑                 │
                              MelProcessor         Whisper KV cache       LLM KV cache          ▼
                                                                                   TTS Playback (optional)
```

### Quick Start

```bash
# Full duplex streaming (captures primary monitor + system audio)
python chat_minicpmo.py --live

# Capture specific screen region
python chat_minicpmo.py --live --capture-region 0,0,1920,1080

# Use mic instead of system audio
python chat_minicpmo.py --live --audio-device "MacBook Pro Microphone"

# With TTS output (speaks responses aloud)
python chat_minicpmo.py --live --tts

# Or start from interactive mode
python chat_minicpmo.py
> /live
```

Press **Ctrl+C** to stop streaming.

### CLI Options

| Flag | Default | Description |
|------|---------|-------------|
| `--live` | — | Enable full duplex streaming mode |
| `--capture-region` | Primary monitor | Screen region as `x,y,w,h` |
| `--audio-device` | `BlackHole` | Audio input device name |
| `--tts` | Off | Enable TTS speech output |
| `--temp` | `0.0` | Sampling temperature |
| `--max-tokens` | `512` | Max tokens per chunk response |

### How It Works
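Before the per-chunk pipeline runs, the two capture streams have to be paired up second by second. A minimal sketch of the pairing logic behind `ChunkSynchronizer` (hypothetical class shape and method names; the real implementation lives in [`streaming.py`](streaming.py)):

```python
import queue
from dataclasses import dataclass
from typing import Optional

@dataclass
class Chunk:
    chunk_id: int
    frame: object   # one screenshot (e.g. an mss grab)
    audio: object   # one second of 16 kHz samples

class ChunkSynchronizer:
    """Pairs one screen frame with the matching second of audio.
    Illustrative sketch of the idea only."""

    def __init__(self):
        self.frames = queue.Queue()
        self.audio = queue.Queue()
        self._next_id = 0

    def push_frame(self, frame):
        self.frames.put(frame)

    def push_audio(self, samples):
        self.audio.put(samples)

    def next_chunk(self) -> Optional[Chunk]:
        # Only emit a chunk once both modalities have data for this second
        if self.frames.empty() or self.audio.empty():
            return None
        chunk = Chunk(self._next_id, self.frames.get(), self.audio.get())
        self._next_id += 1
        return chunk

sync = ChunkSynchronizer()
sync.push_frame("frame-0")
sync.push_audio([0.0] * 16000)
chunk = sync.next_chunk()
print(chunk.chunk_id)  # 0
```

Each emitted chunk then flows through the encoders and the LLM prefill described in the numbered steps that follow.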
1. **Screen capture** (`mss`): Grabs a screenshot at 1 fps, resizes it to 448x448, and feeds it through the SigLIP2 vision encoder + Perceiver Resampler (64 tokens).
2. **Audio capture** (`sounddevice`): Records system audio via the BlackHole virtual device at 16kHz, accumulating 1-second chunks.
3. **Streaming Whisper encoder**: Processes audio incrementally using a KV cache — no need to re-encode previous audio. Conv1d buffers maintain continuity across chunk boundaries; the cache auto-resets when it reaches 1500 positions.
4. **LLM with KV cache continuation**: Each chunk's vision + audio embeddings are prefilled into the running LLM cache. The model decides whether to listen or speak based on the input.
5. **Text generation**: When the model has something to say, it generates text autoregressively from the cached state, stopping at `<|im_end|>` or mode-switch tokens.
6. **TTS playback** (optional): Generated text is converted to audio tokens via the TTS Llama backbone and played back through the speakers using Token2wav.

### Output Format

```
[1] The video shows a person speaking in Indonesian about cooking techniques.
    >> chunk=1 mode=listen cache=142tok latency=1850ms mem=8.2GB

[2] They are now demonstrating how to prepare sambal with a mortar and pestle.
    >> chunk=2 mode=listen cache=284tok latency=2100ms mem=8.4GB
```

### System Audio Setup (macOS)

To capture system audio (what's playing through your speakers), you need [BlackHole](https://github.com/ExistentialAudio/BlackHole):

1. Install: `brew install blackhole-2ch`
2. Open **Audio MIDI Setup** (Spotlight > "Audio MIDI Setup")
3. Click **+** > **Create Multi-Output Device**
4. Check both **MacBook Pro Speakers** and **BlackHole 2ch**
5. Set this Multi-Output Device as your system output (System Preferences > Sound > Output)
6. Run streaming with the default `--audio-device BlackHole`

Without BlackHole, use your mic: `--audio-device "MacBook Pro Microphone"`

### Memory & Latency Budget

| Component | Memory | Latency |
|-----------|--------|---------|
| Model weights | ~7.0 GB | — |
| LLM KV cache (4096 tok) | ~1.2 GB | — |
| Whisper KV cache (1500 pos) | ~0.3 GB | — |
| Screen capture | — | ~10ms |
| Mel extraction | — | ~50ms |
| Whisper streaming encode | — | ~200ms |
| Vision encode | — | ~150ms |
| LLM prefill (chunk) | — | ~300ms |
| LLM generate (50 tok) | — | ~1s |
| **Total peak** | **~9.0 GB** | **~2.2s/chunk** |

### Files

| File | Description |
|------|-------------|
| [`streaming.py`](streaming.py) | ScreenCapture, AudioCapture, ChunkSynchronizer, DuplexGenerator, TTSPlayback |
| [`chat_minicpmo.py`](chat_minicpmo.py) | CLI with `--live` flag and `/live` interactive command |

### Python API

```python
from mlx_vlm import load
from mlx_vlm.generate import generate_step
import mlx.core as mx

model, processor = load("andrevp/MiniCPM-o-4_5-MLX-4bit", trust_remote_code=True)

# Text-only
text = "<|im_start|>user\nWhat is machine learning?<|im_end|>\n<|im_start|>assistant\n"
input_ids = mx.array(processor.tokenizer(text, return_tensors="np")["input_ids"])

tokens = []
for token, _ in generate_step(input_ids, model, None, None, temp=0.0):
    tok_val = int(token)
    tokens.append(tok_val)
    if processor.tokenizer.decode([tok_val]) in ["<|im_end|>", "<|endoftext|>"]:
        break

print(processor.tokenizer.decode(tokens, skip_special_tokens=True))
```

### Audio Input (Python API)

```python
import soundfile as sf
import numpy as np
import mlx.core as mx
from transformers import WhisperFeatureExtractor

# Load and preprocess audio
audio, sr = sf.read("speech.wav", dtype="float32")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # stereo to mono

# Extract mel spectrogram
fe = WhisperFeatureExtractor(feature_size=80, sampling_rate=16000, n_fft=400, hop_length=160)
inputs = fe(audio, sampling_rate=16000, return_tensors="pt",
            padding="max_length", return_attention_mask=True)
mel = inputs["input_features"]
actual_len = inputs["attention_mask"].sum(dim=1)
mel_trimmed = mel[:, :, :int(actual_len[0])]

# Convert to MLX and run through the audio encoder
audio_features = mx.array(mel_trimmed.numpy())  # (1, 80, frames)

# Pass audio_features and audio_bound to generate_step via kwargs
# See chat_minicpmo.py for the full pipeline
```

## Component Details

### Audio Encoder (Whisper)

- 24-layer Whisper encoder (1024d, 16 heads, 4096 FFN)
- Conv1d feature extraction: mel (80 bins) -> conv1 (stride=1) -> conv2 (stride=2)
- Learned positional embeddings (max 1500 positions)
- Audio projection: 2-layer MLP (1024 -> 4096) with ReLU
- Average pooling with stride 5

### TTS Model (CosyVoice2 Llama)

- 20-layer Llama backbone (768d, 12 heads, 3072 FFN)
- Text embedding: 152064 tokens -> 768d
- Audio codebook: 6562 tokens (1 VQ codebook)
- Semantic projector: LLM hidden (4096d) -> TTS hidden (768d)
- Speaker projector: LLM hidden (4096d) -> speaker embedding (768d)
- Autoregressive generation with temperature + top-p sampling

### Audio Special Tokens

| Token | ID | Purpose |
|-------|-----|---------|
| `<\|audio_start\|>` | 151697 | Start of audio placeholder |
| `<\|audio\|>` | 151698 | Audio token |
| `<\|audio_end\|>` | 151699 | End of audio placeholder |
| `<\|spk_bos\|>` | 151700 | Speaker embedding start |
| `<\|spk_eos\|>` | 151702 | Speaker embedding end |
| `<\|tts_bos\|>` | 151703 | TTS generation start |
| `<\|tts_eos\|>` | 151704 | TTS generation end |

## Quantization Details

| Component | Keys | Precision | Notes |
|-----------|------|-----------|-------|
| Qwen3 LLM (36L) | 907 | 4-bit (group_size=64) | Main language model |
| SigLIP2 Vision (27L) | 437 | Full precision | Vision encoder |
| Perceiver Resampler | 17 | Full precision | Cross-attention resampler |
| Whisper Audio (24L) | 367 | Full precision | Audio encoder |
| Audio Projection | 4 | Full precision | 2-layer MLP |
| TTS Llama (20L) | 193 | Full precision | Speech synthesis backbone |

## Notes

- Audio input requires 16kHz mono WAV. Install `librosa` for automatic resampling from other sample rates.
- TTS output generates audio token IDs. Converting them to a waveform requires the `Token2wav` vocoder from `minicpmo-utils[all]`.
- The model processes one image per turn and one audio clip per turn.
- Quantization may slightly reduce output quality compared to the full-precision model.

## License

This model is released under the **Apache-2.0** license, following the original [openbmb/MiniCPM-o-4_5](https://huggingface.co/openbmb/MiniCPM-o-4_5) license. See the [original license](https://github.com/OpenBMB/MiniCPM-V/blob/main/LICENSE) for full terms.

## Disclaimer

> As an LMM, MiniCPM-o 4.5 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions, or make value judgments. Anything generated by MiniCPM-o 4.5 does not represent the views and positions of the model developers. We will not be liable for any problems arising from the use of the MiniCPM-o models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misdirection, misuse, or dissemination of the model.

## Credits

- **Original model**: [OpenBMB](https://github.com/OpenBMB) — [MiniCPM-o 4.5](https://huggingface.co/openbmb/MiniCPM-o-4_5)
- **MLX framework**: [Apple ML Explore](https://github.com/ml-explore/mlx)
- **mlx-vlm**: [Prince Canuma](https://github.com/Blaizzy/mlx-vlm)