Wan2.1-Fun-V1.1-1.3B-Control (Diffusers format)

A conversion of alibaba-pai/Wan2.1-Fun-V1.1-1.3B-Control into a self-contained Hugging Face Diffusers repository. It includes every component needed for inference: transformer, VAE, text encoder, image encoder, tokenizer, and scheduler.

Quick start

import torch
from diffusers import WanTransformer3DModel, AutoencoderKLWan

# Load the transformer
transformer = WanTransformer3DModel.from_pretrained(
    "the-sweater-cat/Wan2.1-Fun-V1.1-1.3B-Control-Diffusers",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)

# Load the VAE (kept in float32 for numerical stability)
vae = AutoencoderKLWan.from_pretrained(
    "the-sweater-cat/Wan2.1-Fun-V1.1-1.3B-Control-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)

What is this model?

Wan2.1-Fun-V1.1-1.3B-Control is a 1.56B-parameter video generation model from VideoX-Fun that supports control-signal-guided generation (depth, Canny, pose, etc.). The control signal is concatenated with the noise and image latents along the channel dimension (in_channels = 48: 16 noise + 16 image + 16 control channels).
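As a sketch of that channel layout (shapes below are illustrative, not the model's actual resolution), the transformer input is a single concatenation along the channel dimension:

```python
import torch

# Illustrative shapes: (batch, channels, frames, height, width).
# Each of the three latent streams carries 16 channels.
noise = torch.randn(1, 16, 5, 8, 8)
image_latents = torch.randn(1, 16, 5, 8, 8)
control_latents = torch.randn(1, 16, 5, 8, 8)

# Concatenate along dim=1 to form the 48-channel input the
# transformer expects (in_channels=48).
transformer_input = torch.cat([noise, image_latents, control_latents], dim=1)
print(transformer_input.shape)  # torch.Size([1, 48, 5, 8, 8])
```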

Included components

Component      Class                            Size
Transformer    WanTransformer3DModel            3.0 GB
Text encoder   UMT5EncoderModel                 22 GB
Image encoder  CLIPVisionModel                  1.2 GB
VAE            AutoencoderKLWan                 485 MB
Tokenizer      AutoTokenizer (google/umt5-xxl)  21 MB
Scheduler      UniPCMultistepScheduler          config only

Shared components are taken from existing Diffusers releases: the VAE, text encoder, tokenizer, and scheduler from Wan-AI/Wan2.1-T2V-1.3B-Diffusers, and the image encoder from Wan-AI/Wan2.1-I2V-14B-480P-Diffusers.

Conversion details

  • Transformer weights converted from VideoX-Fun format to diffusers WanTransformer3DModel using key remapping (983/985 tensors).
  • 2 tensors dropped: ref_conv.weight and ref_conv.bias (99,840 params) -- these implement reference-frame token injection which diffusers' WanTransformer3DModel does not support. The model still works for control-to-video tasks.
  • Forward-pass verified against the original VideoX-Fun model: max absolute difference 1.7e-6 in fp32 (numerical noise from different attention backends).
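The key remapping itself is an ordinary state-dict rename pass. A minimal sketch follows; the mapping entries are hypothetical stand-ins, not the actual VideoX-Fun key names, and the real conversion covers 983 of the 985 tensors:

```python
import torch

# Hypothetical renames for illustration only; the real conversion script
# uses a much fuller mapping of VideoX-Fun -> diffusers key names.
KEY_MAP = {
    "self_attn.q.": "attn1.to_q.",
    "self_attn.k.": "attn1.to_k.",
    "self_attn.v.": "attn1.to_v.",
}
DROPPED = {"ref_conv.weight", "ref_conv.bias"}  # no diffusers equivalent

def remap_state_dict(src):
    out = {}
    for key, tensor in src.items():
        if key in DROPPED:
            continue  # reference-frame injection weights are dropped
        for old, new in KEY_MAP.items():
            key = key.replace(old, new)
        out[key] = tensor
    return out

src = {
    "blocks.0.self_attn.q.weight": torch.zeros(2, 2),
    "ref_conv.weight": torch.zeros(2, 2),
}
dst = remap_state_dict(src)
print(sorted(dst))  # ['blocks.0.attn1.to_q.weight']
```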

Limitations

  • ref_conv dropped: no reference-frame token injection (the model can still do depth/canny/pose control)
  • No official WanFunControlPipeline in diffusers yet (huggingface/diffusers#12235) -- custom inference code is needed to handle the 48-channel input (concatenating noise + image + control latents)
  • CLIP image encoder: the original model uses an OpenCLIP checkpoint (xlm-roberta-large-vit-huge-14); the CLIPVisionModel included here may not be numerically equivalent
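Until an official pipeline lands, a custom denoising loop has to rebuild the 48-channel input at every step. A heavily simplified, torch-only sketch is below: the stand-in denoiser, the Euler update, and all shapes are placeholders, and real code would call WanTransformer3DModel with text/image embeddings and step the UniPCMultistepScheduler instead:

```python
import torch

# Stand-in for the denoiser: the real transformer consumes the 48-channel
# input (plus text and image embeddings) and predicts an update for the
# 16 latent channels being denoised.
def fake_transformer(x_48ch, t):
    return torch.zeros_like(x_48ch[:, :16])

latents = torch.randn(1, 16, 5, 8, 8)          # noise being denoised
image_latents = torch.randn(1, 16, 5, 8, 8)    # VAE-encoded conditioning frame
control_latents = torch.randn(1, 16, 5, 8, 8)  # VAE-encoded control video

timesteps = torch.linspace(1.0, 0.0, 5)
for i in range(len(timesteps) - 1):
    # Rebuild the 48-channel input at every step: only `latents` changes.
    model_in = torch.cat([latents, image_latents, control_latents], dim=1)
    pred = fake_transformer(model_in, timesteps[i])
    dt = timesteps[i + 1] - timesteps[i]
    latents = latents + dt * pred  # toy Euler update; real code uses the scheduler

print(latents.shape)  # torch.Size([1, 16, 5, 8, 8])
```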