AEmotionStudio committed
Commit ecca29f · verified · 1 parent: f5fb24f

docs: README — remove tts-large-fp8/ section (variant retired; runtime NF4 on tts-large/ covers the smaller-card use case)

Files changed (1): README.md (+2, −21)
README.md CHANGED
@@ -61,24 +61,7 @@ This is the premium 7B/9B variant of long-form multi-speaker TTS. Microsoft orig
  - Larger backbone → marginally better prosody and speaker consistency at the cost of ~3× weight size and ~3× VRAM
  - **Shorter** max output: 32K context = ~45 min vs the 1.5B's 64K context = ~90 min. The 1.5B is the better choice for long podcasts; the Large is the better choice for short premium-quality dialog.
 
- ### About `tts-large-fp8/` — pre-quantized FP8 build
-
- This is a derivative of `tts-large/` with the LM backbone (Qwen2.5-7B layers — the bulk of the parameter mass at ~14 GB bf16) pre-quantized to **Float8WeightOnly** via `torchao.quantization`. The diffusion head and acoustic/semantic tokenizers (~1.5 GB combined) are **left at bf16** because they're numerically sensitive and the savings on those small modules wouldn't matter anyway.
-
- | | `tts-large/` (bf16) | `tts-large-fp8/` (this) |
- |---|---|---|
- | Disk size | ~17.6 GB | **~11 GB** |
- | Working set on GPU | ~20 GB | **~13 GB** (10 GB FP8 + 1.5 GB bf16 + 1.5 GB activations/KV) |
- | Min GPU | 12 GB (with NF4 runtime cast) | **14 GB** |
- | Recommended GPU | 24 GB | **16 GB** |
- | Quality | Native | Near-native (FP8 LM weights round-trip cleanly through bf16 compute) |
- | Load time | Faster (no per-tensor cast) | Slightly slower (saved-config reconstruction) |
-
- **How it was made:** loaded `aoi-ot/VibeVoice-Large` via `from_pretrained` with `TorchAoConfig(quant_type=Float8WeightOnlyConfig(), modules_to_not_convert=["diffusion_head", "acoustic_tokenizer", "semantic_tokenizer", "acoustic_connector", "semantic_connector"])`, then `save_pretrained()`. The saved `config.json` carries the quantization spec, so loaders just call `from_pretrained(...)` and the FP8 layers are reconstructed automatically — no special path required.
-
- **Important — no CPU/disk spillover:** `transformers + torchao` cannot reload this checkpoint via `device_map="auto"` (the meta-init path applies the state_dict before reconstructing the FP8 tensor subclass, hitting `AttributeError: ..._weight_qdata is neither a parameter, buffer, nor extra state`). Loaders MUST use `device_map="cuda"` (eager init), so accelerate-style spillover to CPU/disk is unavailable — the full ~13 GB working set has to fit on the GPU.
-
- **When to pick this over `tts-large/`:** if your GPU has **16+ GB VRAM** and you want a smaller download that loads at native speed. **On 12 GB cards, use `tts-large/` instead** — it auto-quantizes to NF4 at runtime (~8 GB working set, well-tested), accepts CPU/disk offload if needed, and produces the same quality.
 
  ### About `realtime-0.5b/voices/*.pt`
 
@@ -93,8 +76,7 @@ This is the legacy research variant of VibeVoice ASR — the one that runs clean
  | Variant | Task | Languages | Max length | Notes |
  |---|---|---|---|---|
  | `tts-1.5b` | Text → speech | EN, ZH (multi-speaker) | ~90 min | Up to 4 speakers via `Speaker N:` script tags + voice cloning from per-speaker reference clips |
- | `tts-large` | Text → speech | EN, ZH (multi-speaker) | ~45 min | Same workflow as 1.5B, premium 7B/9B backbone, higher prosody quality |
- | `tts-large-fp8` | Text → speech | EN, ZH (multi-speaker) | ~45 min | Pre-quantized FP8 build of `tts-large` — same workflow, fits 12 GB cards |
  | `asr-7b` | Speech → text | 50+, code-switching | ~60 min | Diarization, timestamps, hotword support via prompt |
  | `realtime-0.5b` | Streaming text → speech | 11 languages (preset-only) | unbounded | ~300 ms first-chunk latency, single speaker |
 
@@ -125,7 +107,6 @@ These mitigations are baked into the released weights and are preserved in this
  |---|---|---|
  | Model weights — 1.5B / ASR / Realtime | [microsoft/VibeVoice-1.5B](https://huggingface.co/microsoft/VibeVoice-1.5B), [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR), [microsoft/VibeVoice-Realtime-0.5B](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B) | MIT |
  | Model weights — 7B Large | [aoi-ot/VibeVoice-Large](https://huggingface.co/aoi-ot/VibeVoice-Large) — community preserve of the now-removed `microsoft/VibeVoice-Large` (uploaded 2025-09-04, one day before Microsoft's pull) | MIT (preserved) |
- | Model weights — 7B Large FP8 | Derivative of `aoi-ot/VibeVoice-Large` — pre-quantized via `torchao.quantization.Float8WeightOnlyConfig` (LM backbone only) | MIT (derivative) |
 | Voice presets | [microsoft/VibeVoice (GitHub)](https://github.com/microsoft/VibeVoice/tree/main/demo/voices/streaming_model) | MIT |
  | Inference code (TTS variants + ASR + Realtime) | [vibevoice-community/VibeVoice](https://github.com/vibevoice-community/VibeVoice) — Microsoft removed `modeling_vibevoice_inference.py` from the original repo on 2025-09-05 | MIT |
 
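For reference, the build recipe described in the removed section amounts to the following sketch. This is a minimal sketch, not the author's exact script: it assumes a recent `transformers` with TorchAo support and `torchao` ≥ 0.7, and `AutoModelForCausalLM` is illustrative only, since VibeVoice loads through custom modeling code. The repo names and the `modules_to_not_convert` list are quoted from the removed text.

```python
# Modules the removed section says were left at bf16: the numerically
# sensitive audio stack, which is small (~1.5 GB) anyway.
BF16_MODULES = [
    "diffusion_head",
    "acoustic_tokenizer",
    "semantic_tokenizer",
    "acoustic_connector",
    "semantic_connector",
]


def build_fp8_checkpoint(src="aoi-ot/VibeVoice-Large", dst="tts-large-fp8"):
    """Sketch of the retired FP8 pre-quantization pass (needs a GPU)."""
    # Heavy imports kept local so the module can be read without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, TorchAoConfig
    from torchao.quantization import Float8WeightOnlyConfig

    quant = TorchAoConfig(
        quant_type=Float8WeightOnlyConfig(),
        modules_to_not_convert=BF16_MODULES,  # LM backbone only gets FP8
    )
    model = AutoModelForCausalLM.from_pretrained(
        src,
        torch_dtype=torch.bfloat16,
        device_map="cuda",  # not "auto": the removed text notes meta-init breaks FP8 reload
        quantization_config=quant,
    )
    # save_pretrained() writes the quant spec into config.json, so a plain
    # from_pretrained() reconstructs the FP8 layers automatically.
    model.save_pretrained(dst)
```

The `device_map="cuda"` choice mirrors the removed section's caveat: the FP8 tensor subclass could not be reconstructed under accelerate's meta-init path, so CPU/disk spillover was unavailable for that variant.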
Lines added in the same hunks:

@@ -61,24 +61,7 @@
+ For users on smaller cards (12 GB / 16 GB), MAESTRO's runner auto-quantizes `tts-large/` to **NF4** via bitsandbytes at runtime (~8 GB working set, well-tested across the HF ecosystem). No separate pre-quantized variant is needed; the runtime path produces identical quality to a pre-quantized mirror, with fewer compatibility issues across `transformers` upgrades.

@@ -93,8 +76,7 @@
+ | `tts-large` | Text → speech | EN, ZH (multi-speaker) | ~45 min | Same workflow as 1.5B, premium 7B/9B backbone, higher prosody quality. Auto-quantizes to NF4 at runtime on smaller cards (~8 GB working set). |
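The new guidance reduces to a simple VRAM fit check. A hypothetical sketch of that decision, using the working-set figures quoted in this diff (~20 GB for native bf16, ~8 GB for the runtime NF4 cast); the function name, thresholds, and headroom margin are illustrative, not MAESTRO's actual runner code:

```python
# Working-set figures from the README text above (approximate).
BF16_WORKING_SET_GB = 20  # ~17.6 GB weights + activations/KV
NF4_WORKING_SET_GB = 8    # runtime bitsandbytes NF4 cast


def pick_precision(vram_gb: float, headroom_gb: float = 1.0) -> str:
    """Pick a load precision for tts-large/ on a card with vram_gb of VRAM."""
    if vram_gb >= BF16_WORKING_SET_GB + headroom_gb:
        return "bf16"         # native quality, no cast needed
    if vram_gb >= NF4_WORKING_SET_GB + headroom_gb:
        return "nf4"          # runtime NF4 cast, ~8 GB working set
    return "nf4+offload"      # NF4 plus CPU/disk offload as a last resort
```

The NF4 path itself corresponds to loading with `transformers`' `BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")`; the exact flags MAESTRO's runner passes are not shown in this diff.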