VibeVoice — AEmotion Studio Mirror
This repository is a MAESTRO-curated mirror of Microsoft's VibeVoice family, with the long-form-TTS inference code restored from the vibevoice-community/VibeVoice fork. All weights, code, and assets remain under the upstream MIT License.
It exists so MAESTRO's downloader can fetch each variant (and its dependencies) from a single, predictably laid-out repo with allow_patterns filtering, instead of pulling from three separate Microsoft HF repos plus GitHub.
Layout
```
vibevoice-models/
├── tts-1.5b/            ← microsoft/VibeVoice-1.5B (5.4 GB, 64K ctx, ~90 min max output)
│   ├── config.json
│   ├── preprocessor_config.json
│   ├── model-0000{1..3}-of-00003.safetensors
│   └── …
├── tts-large/           ← aoi-ot/VibeVoice-Large (17.6 GB, 32K ctx, ~45 min max output, premium 7B/9B backbone)
│   ├── config.json
│   ├── preprocessor_config.json
│   ├── configuration.json
│   ├── model-000{01..10}-of-00010.safetensors
│   └── …
├── asr-7b/              ← microsoft/VibeVoice-ASR (17.4 GB, legacy/research variant)
│   ├── config.json
│   ├── model-0000{1..8}-of-00008.safetensors
│   └── …
└── realtime-0.5b/       ← microsoft/VibeVoice-Realtime-0.5B (2.0 GB + 100 MB voices)
    ├── config.json
    ├── preprocessor_config.json
    ├── model.safetensors
    └── voices/          ← 25 baked-in voice presets (KV-cache .pt files, NOT model weights)
        ├── en-Carter_man.pt
        ├── en-Frank_man.pt
        └── … (23 more, grouped by language: de/en/fr/in/it/jp/kr/nl/pl/pt/sp)
```
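With this layout, MAESTRO's downloader can fetch any single variant from the one repo. A minimal sketch using huggingface_hub's snapshot_download; the repo_id below is a placeholder assumption, not necessarily the mirror's published id:

```python
from huggingface_hub import snapshot_download

# Pull only the 1.5B TTS variant; allow_patterns ensures the other
# ~37 GB of variants are never downloaded.
snapshot_download(
    repo_id="AEmotionStudio/vibevoice-models",  # placeholder id; substitute the real mirror
    allow_patterns=["tts-1.5b/*"],
    local_dir="vibevoice-models",
)
```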
About tts-large/
This is the premium 7B/9B variant of long-form multi-speaker TTS. Microsoft originally published it at microsoft/VibeVoice-Large under the MIT license, then removed the repo on 2025-09-05 along with the demo scripts (the same RAI cleanup that removed modeling_vibevoice_inference.py). The MIT license remains in force on the released weights — this mirror sources from aoi-ot/VibeVoice-Large, a community-preserved copy uploaded on 2025-09-04 (one day before Microsoft's pull) that retains the full original release.
Differences from tts-1.5b/:
- Same Speaker N: script format and the same voice-cloning workflow (pass per-speaker reference clips); see the example below
- Larger backbone → marginally better prosody and speaker consistency, at the cost of ~3× the weight size and ~3× the VRAM
- Shorter max output: 32K context = ~45 min vs the 1.5B's 64K context = ~90 min. The 1.5B is the better choice for long podcasts; the Large is the better choice for short, premium-quality dialog.
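A script for either TTS variant looks like the sketch below; the exact speaker-numbering convention comes from the upstream demo scripts, so treat these lines as illustrative rather than canonical:

```text
Speaker 1: Welcome back to the show. Today we're talking about long-form synthesis.
Speaker 2: Thanks for having me. Where should we start?
Speaker 1: Context length, naturally.
```

Each distinct Speaker N tag is paired with the Nth reference clip passed for voice cloning, up to the four-speaker limit noted in the capabilities table below.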
For users on smaller cards (12 GB / 16 GB), MAESTRO's runner auto-quantizes tts-large/ to NF4 via bitsandbytes at runtime (~8 GB working set, well-tested across the HF ecosystem). No separate pre-quantized variant is needed — the runtime path produces identical quality to a pre-quantized mirror, with fewer compatibility issues across transformers upgrades.
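The runtime path is standard bitsandbytes 4-bit loading through transformers. A minimal sketch; the inference class and import path follow the vibevoice-community fork's layout, and the local path assumes the tree above, so both are assumptions to adjust for your checkout:

```python
import torch
from transformers import BitsAndBytesConfig

# Inference class from the vibevoice-community fork (see Attribution); the
# import path assumes that fork's package layout.
from vibevoice.modular.modeling_vibevoice_inference import (
    VibeVoiceForConditionalGenerationInference,
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4, matching MAESTRO's runner
    bnb_4bit_compute_dtype=torch.bfloat16,  # weights stay 4-bit; matmuls run in bf16
)

model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "vibevoice-models/tts-large",           # local path from the layout above
    quantization_config=bnb_config,
    device_map="auto",
)
```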
About realtime-0.5b/voices/*.pt
These are pickled KV-cache dictionaries (output of running each voice prompt through the full model once, then torch.save'd). They are not flat tensor maps so they cannot be converted to safetensors — they must be loaded with torch.load(..., weights_only=False). Each is 2–7 MB. Microsoft does not publish the acoustic tokenizer that would let users generate new ones, so this set of 25 is the complete preset library.
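Loading a preset is then a single torch.load call. One caveat worth a comment: recent PyTorch releases default weights_only to True, which rejects pickled objects like these:

```python
import torch

# Each preset is a pickled KV-cache dict, not a flat tensor map, so it can't
# be stored as safetensors and weights_only=True will refuse to load it.
voice = torch.load(
    "vibevoice-models/realtime-0.5b/voices/en-Carter_man.pt",
    map_location="cpu",
    weights_only=False,  # unpickles arbitrary objects; only load trusted files
)
```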
About asr-7b/
This is the legacy research variant of VibeVoice ASR — the one that runs cleanly on transformers>=4.51.3,<5.0.0. Microsoft also publishes a microsoft/VibeVoice-ASR-HF repo with the cleaner apply_transcription_request API, but that variant requires transformers>=5.3.0, which is not yet compatible with the rest of MAESTRO's model stack.
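The two pins cannot be satisfied in one environment, which is why this mirror carries only the legacy variant. As a requirements.txt sketch:

```text
transformers>=4.51.3,<5.0.0   # asr-7b/ and the rest of MAESTRO's stack
# transformers>=5.3.0         # would be required by microsoft/VibeVoice-ASR-HF
```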
Variant capabilities
| Variant | Task | Languages | Max length | Notes |
|---|---|---|---|---|
| tts-1.5b | Text → speech | EN, ZH (multi-speaker) | ~90 min | Up to 4 speakers via Speaker N: script tags + voice cloning from per-speaker reference clips |
| tts-large | Text → speech | EN, ZH (multi-speaker) | ~45 min | Same workflow as 1.5B, premium 7B/9B backbone, higher prosody quality. Auto-quantizes to NF4 at runtime on smaller cards (~8 GB working set). |
| asr-7b | Speech → text | 50+, code-switching | ~60 min | Diarization, timestamps, hotword support via prompt |
| realtime-0.5b | Streaming text → speech | 11 languages (preset-only) | unbounded | ~300 ms first-chunk latency, single speaker |
Responsible use (verbatim from upstream model cards)
VibeVoice is limited to research-purpose use exploring highly realistic audio dialogue generation.
The following are explicitly out of scope:
- Voice impersonation without explicit, recorded consent
- Disinformation or impersonation
- Real-time or low-latency voice conversion for live deep-fakes
- Generation in unsupported languages (non-English, non-Chinese)
- Generation of background ambience, Foley, or music
- Circumventing the watermark or audible disclaimer
We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only.
To mitigate misuse, Microsoft has:
- Embedded an audible disclaimer ("This segment was generated by AI") in TTS outputs.
- Added an imperceptible perceptual watermark to all generated audio.
- Logged inference requests (hashed) for abuse-pattern detection.
These mitigations are baked into the released weights and are preserved in this mirror.
Attribution
| Component | Source | License |
|---|---|---|
| Model weights — 1.5B / ASR / Realtime | microsoft/VibeVoice-1.5B, microsoft/VibeVoice-ASR, microsoft/VibeVoice-Realtime-0.5B | MIT |
| Model weights — 7B Large | aoi-ot/VibeVoice-Large — community-preserved copy of the now-removed microsoft/VibeVoice-Large (uploaded 2025-09-04, one day before Microsoft's pull) | MIT (preserved) |
| Voice presets | microsoft/VibeVoice (GitHub) | MIT |
| Inference code (TTS variants + ASR + Realtime) | vibevoice-community/VibeVoice — Microsoft removed modeling_vibevoice_inference.py from the original repo on 2025-09-05 | MIT |
Citation
```bibtex
@misc{peng2025vibevoicetechnicalreport,
  title         = {VibeVoice Technical Report},
  author        = {Zhiliang Peng and Jianwei Yu and Wenhui Wang and Yaoyao Chang and
                   Yutao Sun and Li Dong and Yi Zhu and Weijiang Xu and Hangbo Bao and
                   Zehua Wang and Shaohan Huang and Yan Xia and Furu Wei},
  year          = {2025},
  eprint        = {2508.19205},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2508.19205}
}
```