Wan2.1-Fun-V1.1-1.3B-Control (Diffusers format)

A conversion of alibaba-pai/Wan2.1-Fun-V1.1-1.3B-Control into a self-contained Hugging Face Diffusers repository. It includes every component needed for inference: transformer, VAE, text encoder, image encoder, tokenizer, and scheduler.

Quick start

import torch
from diffusers import WanTransformer3DModel, AutoencoderKLWan

# Load the transformer
transformer = WanTransformer3DModel.from_pretrained(
    "the-sweater-cat/Wan2.1-Fun-V1.1-1.3B-Control-Diffusers",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)

# Load the VAE (kept in float32 for numerical stability)
vae = AutoencoderKLWan.from_pretrained(
    "the-sweater-cat/Wan2.1-Fun-V1.1-1.3B-Control-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)

What is this model?

Wan2.1-Fun-V1.1-1.3B-Control is a 1.56B-parameter video generation model from VideoX-Fun that supports control-signal-guided generation (depth, Canny, pose, etc.). The control signal is concatenated with the noise and image latents along the channel dimension (in_channels = 48: 16 noise + 16 image + 16 control channels).
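As a sketch of that channel layout (shapes below are illustrative, not the model's actual resolution), the transformer input is a single concatenation along the channel dimension:

```python
import torch

# Illustrative shapes: (batch, channels, frames, height, width).
# Each of the three latent streams carries 16 channels.
noise = torch.randn(1, 16, 5, 8, 8)
image_latents = torch.randn(1, 16, 5, 8, 8)
control_latents = torch.randn(1, 16, 5, 8, 8)

# Concatenate along dim=1 to form the 48-channel input the
# transformer expects (in_channels=48).
transformer_input = torch.cat([noise, image_latents, control_latents], dim=1)
print(transformer_input.shape)  # torch.Size([1, 48, 5, 8, 8])
```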

Included components

Component      Class                            Size
Transformer    WanTransformer3DModel            3.0 GB
Text encoder   UMT5EncoderModel                 22 GB
Image encoder  CLIPVisionModel                  1.2 GB
VAE            AutoencoderKLWan                 485 MB
Tokenizer      AutoTokenizer (google/umt5-xxl)  21 MB
Scheduler      UniPCMultistepScheduler          config only

Shared components are taken from existing Diffusers releases: the VAE, text encoder, tokenizer, and scheduler from Wan-AI/Wan2.1-T2V-1.3B-Diffusers, and the image encoder from Wan-AI/Wan2.1-I2V-14B-480P-Diffusers.

Conversion details

  • Transformer weights converted from VideoX-Fun format to diffusers WanTransformer3DModel using key remapping (983/985 tensors).
  • 2 tensors dropped: ref_conv.weight and ref_conv.bias (99,840 params) -- these implement reference-frame token injection which diffusers' WanTransformer3DModel does not support. The model still works for control-to-video tasks.
  • Forward-pass verified against the original VideoX-Fun model: max absolute difference 1.7e-6 in fp32 (numerical noise from different attention backends).
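The key remapping itself is an ordinary state-dict rename pass. A minimal sketch follows; the mapping entries are hypothetical stand-ins, not the actual VideoX-Fun key names, and the real conversion covers 983 of the 985 tensors:

```python
import torch

# Hypothetical renames for illustration only; the real conversion script
# uses a much fuller mapping of VideoX-Fun -> diffusers key names.
KEY_MAP = {
    "self_attn.q.": "attn1.to_q.",
    "self_attn.k.": "attn1.to_k.",
    "self_attn.v.": "attn1.to_v.",
}
DROPPED = {"ref_conv.weight", "ref_conv.bias"}  # no diffusers equivalent

def remap_state_dict(src):
    out = {}
    for key, tensor in src.items():
        if key in DROPPED:
            continue  # reference-frame injection weights are dropped
        for old, new in KEY_MAP.items():
            key = key.replace(old, new)
        out[key] = tensor
    return out

src = {
    "blocks.0.self_attn.q.weight": torch.zeros(2, 2),
    "ref_conv.weight": torch.zeros(2, 2),
}
dst = remap_state_dict(src)
print(sorted(dst))  # ['blocks.0.attn1.to_q.weight']
```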

Limitations

  • ref_conv dropped: no reference-frame token injection (the model can still do depth/canny/pose control)
  • No official WanFunControlPipeline in diffusers yet (huggingface/diffusers#12235) -- custom inference code is needed to handle the 48-channel input (concatenating noise + image + control latents)
  • CLIP image encoder: the original model uses an OpenCLIP checkpoint (xlm-roberta-large-vit-huge-14); the CLIPVisionModel included here may not be numerically equivalent
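Until an official pipeline lands, a custom denoising loop has to rebuild the 48-channel input at every step. A heavily simplified, torch-only sketch is below: the stand-in denoiser, the Euler update, and all shapes are placeholders, and real code would call WanTransformer3DModel with text/image embeddings and step the UniPCMultistepScheduler instead:

```python
import torch

# Stand-in for the denoiser: the real transformer consumes the 48-channel
# input (plus text and image embeddings) and predicts an update for the
# 16 latent channels being denoised.
def fake_transformer(x_48ch, t):
    return torch.zeros_like(x_48ch[:, :16])

latents = torch.randn(1, 16, 5, 8, 8)          # noise being denoised
image_latents = torch.randn(1, 16, 5, 8, 8)    # VAE-encoded conditioning frame
control_latents = torch.randn(1, 16, 5, 8, 8)  # VAE-encoded control video

timesteps = torch.linspace(1.0, 0.0, 5)
for i in range(len(timesteps) - 1):
    # Rebuild the 48-channel input at every step: only `latents` changes.
    model_in = torch.cat([latents, image_latents, control_latents], dim=1)
    pred = fake_transformer(model_in, timesteps[i])
    dt = timesteps[i + 1] - timesteps[i]
    latents = latents + dt * pred  # toy Euler update; real code uses the scheduler

print(latents.shape)  # torch.Size([1, 16, 5, 8, 8])
```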