Wan2.1-Fun-V1.1-1.3B-Control (Diffusers format)
alibaba-pai/Wan2.1-Fun-V1.1-1.3B-Control converted to a self-contained Hugging Face Diffusers repository. It includes all components needed for inference: transformer, VAE, text encoder, image encoder, tokenizer, and scheduler.
Quick start
import torch
from diffusers import WanTransformer3DModel, AutoencoderKLWan
# Load the transformer
transformer = WanTransformer3DModel.from_pretrained(
"the-sweater-cat/Wan2.1-Fun-V1.1-1.3B-Control-Diffusers",
subfolder="transformer",
torch_dtype=torch.bfloat16,
)
# Load the VAE
vae = AutoencoderKLWan.from_pretrained(
"the-sweater-cat/Wan2.1-Fun-V1.1-1.3B-Control-Diffusers",
subfolder="vae",
torch_dtype=torch.float32,
)
What is this model?
Wan2.1-Fun-V1.1-1.3B-Control is a 1.56B parameter video generation model from VideoX-Fun that supports control-signal-guided generation (depth, canny, pose, etc.). The control signal is concatenated with noise and image latents along the channel dimension (in_channels=48 = 16 noise + 16 image + 16 control).
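As a minimal sketch of that channel layout (tensor names, shapes, and the exact concatenation order are illustrative assumptions, not the pipeline's actual API), the 48-channel transformer input can be assembled like this:

```python
import torch

# Illustrative latent shapes: (batch, channels, frames, height, width).
# 16 latent channels each for noise, image conditioning, and the control signal.
noise_latents = torch.randn(1, 16, 13, 60, 104)
image_latents = torch.randn(1, 16, 13, 60, 104)
control_latents = torch.randn(1, 16, 13, 60, 104)

# Concatenate along the channel dimension (dim=1) to match in_channels=48.
transformer_input = torch.cat(
    [noise_latents, image_latents, control_latents], dim=1
)
print(transformer_input.shape)  # torch.Size([1, 48, 13, 60, 104])
```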
Included components
| Component | Class | Size |
|---|---|---|
| Transformer | WanTransformer3DModel | 3.0 GB |
| Text encoder | UMT5EncoderModel | 22 GB |
| Image encoder | CLIPVisionModel | 1.2 GB |
| VAE | AutoencoderKLWan | 485 MB |
| Tokenizer | AutoTokenizer (google/umt5-xxl) | 21 MB |
| Scheduler | UniPCMultistepScheduler | config only |
Shared components (VAE, text encoder, image encoder, tokenizer, scheduler) are from Wan-AI/Wan2.1-T2V-1.3B-Diffusers and Wan-AI/Wan2.1-I2V-14B-480P-Diffusers.
Conversion details
- Transformer weights converted from VideoX-Fun format to diffusers' WanTransformer3DModel using key remapping (983/985 tensors).
- 2 tensors dropped: ref_conv.weight and ref_conv.bias (99,840 params) -- these implement reference-frame token injection, which diffusers' WanTransformer3DModel does not support. The model still works for control-to-video tasks.
- Forward-pass verified against the original VideoX-Fun model: max absolute difference 1.7e-6 in fp32 (numerical noise from different attention backends).
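A conversion of this kind can be sketched roughly as below; the rename rules shown are hypothetical examples for illustration, not the exact table used for this repo:

```python
# Hypothetical prefix-rewrite rules illustrating a VideoX-Fun -> diffusers
# key remapping. The real conversion maps 983/985 tensors and drops the
# two ref_conv.* tensors, which have no counterpart in diffusers.
RENAME_RULES = [
    ("head.head.", "proj_out."),  # example rename, not the actual mapping
    ("text_embedding.", "condition_embedder.text_embedder."),  # example
]
DROPPED_KEYS = {"ref_conv.weight", "ref_conv.bias"}

def remap_state_dict(state_dict):
    """Rename keys by prefix and drop unsupported tensors."""
    remapped = {}
    for key, tensor in state_dict.items():
        if key in DROPPED_KEYS:
            continue  # diffusers' WanTransformer3DModel has no ref_conv module
        new_key = key
        for old_prefix, new_prefix in RENAME_RULES:
            if new_key.startswith(old_prefix):
                new_key = new_prefix + new_key[len(old_prefix):]
                break
        remapped[new_key] = tensor
    return remapped
```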
Limitations
- ref_conv dropped: no reference-frame token injection (the model can still do depth/canny/pose control)
- No official WanFunControlPipeline in diffusers yet (huggingface/diffusers#12235) -- custom inference code is needed to handle the 48-channel input (concatenating noise + image + control latents)
- CLIP image encoder: the original model uses OpenCLIP (xlm-roberta-large-vit-huge-14), which may differ from the HuggingFace CLIPVisionModel included here
Model tree for the-sweater-cat/Wan2.1-Fun-V1.1-1.3B-Control-Diffusers
Base model
alibaba-pai/Wan2.1-Fun-V1.1-1.3B-Control