VibeVoice — AEmotion Studio Mirror
This repository is a MAESTRO-curated mirror of Microsoft's VibeVoice family, with the long-form-TTS inference code restored from the vibevoice-community/VibeVoice fork. All weights, code, and assets remain under the upstream MIT License.
It exists so MAESTRO's downloader can fetch each variant (and its dependencies) from a single, predictably laid-out repo with allow_patterns filtering, instead of pulling from three separate Microsoft HF repos plus GitHub.
Layout
```
vibevoice-models/
├── tts-1.5b/            ← microsoft/VibeVoice-1.5B (5.4 GB, 64K ctx, ~90 min max output)
│   ├── config.json
│   ├── preprocessor_config.json
│   ├── model-0000{1..3}-of-00003.safetensors
│   └── …
├── tts-large/           ← aoi-ot/VibeVoice-Large (17.6 GB, 32K ctx, ~45 min max output, premium 7B/9B backbone)
│   ├── config.json
│   ├── preprocessor_config.json
│   ├── configuration.json
│   ├── model-000{01..10}-of-00010.safetensors
│   └── …
├── asr-7b/              ← microsoft/VibeVoice-ASR (17.4 GB, legacy/research variant)
│   ├── config.json
│   ├── model-0000{1..8}-of-00008.safetensors
│   └── …
└── realtime-0.5b/       ← microsoft/VibeVoice-Realtime-0.5B (2.0 GB + 100 MB voices)
    ├── config.json
    ├── preprocessor_config.json
    ├── model.safetensors
    └── voices/          ← 25 baked-in voice presets (KV-cache .pt files, NOT model weights)
        ├── en-Carter_man.pt
        ├── en-Frank_man.pt
        └── … (23 more, grouped by language: de/en/fr/in/it/jp/kr/nl/pl/pt/sp)
```
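With this layout, MAESTRO's downloader can fetch any single variant from the one repo. A minimal sketch using huggingface_hub's snapshot_download; the repo_id below is a placeholder assumption, not necessarily the mirror's published id:

```python
from huggingface_hub import snapshot_download

# Pull only the 1.5B TTS variant; allow_patterns ensures the other
# ~37 GB of variants are never downloaded.
snapshot_download(
    repo_id="AEmotionStudio/vibevoice-models",  # placeholder id; substitute the real mirror
    allow_patterns=["tts-1.5b/*"],
    local_dir="vibevoice-models",
)
```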
About tts-large/
This is the premium 7B/9B variant of long-form multi-speaker TTS. Microsoft originally published it at microsoft/VibeVoice-Large under the MIT license, then removed the repo on 2025-09-05 along with the demo scripts (the same RAI cleanup that removed modeling_vibevoice_inference.py). The MIT license remains in force on the released weights — this mirror sources from aoi-ot/VibeVoice-Large, a community-preserved copy uploaded on 2025-09-04 (one day before Microsoft's pull) that retains the full original release.
Differences from tts-1.5b/:
- Same Speaker N: script format and the same voice-cloning workflow (pass per-speaker reference clips); see the example below
- Larger backbone → marginally better prosody and speaker consistency, at the cost of ~3× the weight size and ~3× the VRAM
- Shorter max output: 32K context = ~45 min vs the 1.5B's 64K context = ~90 min. The 1.5B is the better choice for long podcasts; the Large is the better choice for short, premium-quality dialog.
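A script for either TTS variant looks like the sketch below; the exact speaker-numbering convention comes from the upstream demo scripts, so treat these lines as illustrative rather than canonical:

```text
Speaker 1: Welcome back to the show. Today we're talking about long-form synthesis.
Speaker 2: Thanks for having me. Where should we start?
Speaker 1: Context length, naturally.
```

Each distinct Speaker N tag is paired with the Nth reference clip passed for voice cloning, up to the four-speaker limit noted in the capabilities table below.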
For users on smaller cards (12 GB / 16 GB), MAESTRO's runner auto-quantizes tts-large/ to NF4 via bitsandbytes at runtime (~8 GB working set, well-tested across the HF ecosystem). No separate pre-quantized variant is needed — the runtime path produces identical quality to a pre-quantized mirror, with fewer compatibility issues across transformers upgrades.
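The runtime path is standard bitsandbytes 4-bit loading through transformers. A minimal sketch; the inference class and import path follow the vibevoice-community fork's layout, and the local path assumes the tree above, so both are assumptions to adjust for your checkout:

```python
import torch
from transformers import BitsAndBytesConfig

# Inference class from the vibevoice-community fork (see Attribution); the
# import path assumes that fork's package layout.
from vibevoice.modular.modeling_vibevoice_inference import (
    VibeVoiceForConditionalGenerationInference,
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4, matching MAESTRO's runner
    bnb_4bit_compute_dtype=torch.bfloat16,  # weights stay 4-bit; matmuls run in bf16
)

model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "vibevoice-models/tts-large",           # local path from the layout above
    quantization_config=bnb_config,
    device_map="auto",
)
```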
About realtime-0.5b/voices/*.pt
These are pickled KV-cache dictionaries (output of running each voice prompt through the full model once, then torch.save'd). They are not flat tensor maps so they cannot be converted to safetensors — they must be loaded with torch.load(..., weights_only=False). Each is 2–7 MB. Microsoft does not publish the acoustic tokenizer that would let users generate new ones, so this set of 25 is the complete preset library.
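Loading a preset is then a single torch.load call. One caveat worth a comment: recent PyTorch releases default weights_only to True, which rejects pickled objects like these:

```python
import torch

# Each preset is a pickled KV-cache dict, not a flat tensor map, so it can't
# be stored as safetensors and weights_only=True will refuse to load it.
voice = torch.load(
    "vibevoice-models/realtime-0.5b/voices/en-Carter_man.pt",
    map_location="cpu",
    weights_only=False,  # unpickles arbitrary objects; only load trusted files
)
```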
About asr-7b/
This is the legacy research variant of VibeVoice ASR — the one that runs cleanly on transformers>=4.51.3,<5.0.0. Microsoft also publishes a microsoft/VibeVoice-ASR-HF repo with the cleaner apply_transcription_request API, but that variant requires transformers>=5.3.0, which is not yet compatible with the rest of MAESTRO's model stack.
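The two pins cannot be satisfied in one environment, which is why this mirror carries only the legacy variant. As a requirements.txt sketch:

```text
transformers>=4.51.3,<5.0.0   # asr-7b/ and the rest of MAESTRO's stack
# transformers>=5.3.0         # would be required by microsoft/VibeVoice-ASR-HF
```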
Variant capabilities
| Variant | Task | Languages | Max length | Notes |
|---|---|---|---|---|
| tts-1.5b | Text → speech | EN, ZH (multi-speaker) | ~90 min | Up to 4 speakers via Speaker N: script tags + voice cloning from per-speaker reference clips |
| tts-large | Text → speech | EN, ZH (multi-speaker) | ~45 min | Same workflow as 1.5B, premium 7B/9B backbone, higher prosody quality. Auto-quantizes to NF4 at runtime on smaller cards (~8 GB working set). |
| asr-7b | Speech → text | 50+, code-switching | ~60 min | Diarization, timestamps, hotword support via prompt |
| realtime-0.5b | Streaming text → speech | 11 languages (preset-only) | unbounded | ~300 ms first-chunk latency, single speaker |
Responsible use (verbatim from upstream model cards)
VibeVoice is limited to research-purpose use exploring highly realistic audio dialogue generation.
The following are explicitly out of scope:
- Voice impersonation without explicit, recorded consent
- Disinformation or impersonation
- Real-time or low-latency voice conversion for live deep-fakes
- Generation in unsupported languages (non-English, non-Chinese)
- Generation of background ambience, Foley, or music
- Circumventing the watermark or audible disclaimer
We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only.
To mitigate misuse, Microsoft has:
- Embedded an audible disclaimer ("This segment was generated by AI") in TTS outputs.
- Added an imperceptible perceptual watermark to all generated audio.
- Logged inference requests (hashed) for abuse-pattern detection.
These mitigations are baked into the released weights and are preserved in this mirror.
Attribution
| Component | Source | License |
|---|---|---|
| Model weights — 1.5B / ASR / Realtime | microsoft/VibeVoice-1.5B, microsoft/VibeVoice-ASR, microsoft/VibeVoice-Realtime-0.5B | MIT |
| Model weights — 7B Large | aoi-ot/VibeVoice-Large — community-preserved copy of the now-removed microsoft/VibeVoice-Large (uploaded 2025-09-04, one day before Microsoft's pull) | MIT (preserved) |
| Voice presets | microsoft/VibeVoice (GitHub) | MIT |
| Inference code (TTS variants + ASR + Realtime) | vibevoice-community/VibeVoice — Microsoft removed modeling_vibevoice_inference.py from the original repo on 2025-09-05 | MIT |
Citation
```bibtex
@misc{peng2025vibevoicetechnicalreport,
  title         = {VibeVoice Technical Report},
  author        = {Zhiliang Peng and Jianwei Yu and Wenhui Wang and Yaoyao Chang and
                   Yutao Sun and Li Dong and Yi Zhu and Weijiang Xu and Hangbo Bao and
                   Zehua Wang and Shaohan Huang and Yan Xia and Furu Wei},
  year          = {2025},
  eprint        = {2508.19205},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2508.19205}
}
```