Next-gen video generation model based on Diffusion Transformer (DiT), integrating Flow Matching and Mixture-of-Experts (MoE).
Wan2.2 is an end-to-end video generation system. It abandons the traditional U-Net in favor of a DiT architecture suited to long token sequences. The model has 27B parameters in total, but thanks to MoE sparse activation only about 14B are active per denoising step, so inference VRAM usage stays close to that of a 14B model.
It encodes text, frames, and masks uniformly.
Wan-VAE applies 4×8×8 compression (time × height × width), a 256× reduction.
The umT5 text encoder provides native bilingual (English/Chinese) prompt support.
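As a quick illustration of the bilingual text conditioning, the sketch below encodes an English and a Chinese prompt with a umT5 encoder via `transformers`. The `google/umt5-xxl` checkpoint is an assumed stand-in, not the exact encoder packaged with Wan2.2 (the `diffusers` pipeline loads its own text encoder).

```python
# Hedged sketch: encode bilingual prompts with a umT5 encoder.
# "google/umt5-xxl" is an assumed stand-in checkpoint, not the encoder
# weights shipped with Wan2.2.
from transformers import AutoTokenizer, UMT5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("google/umt5-xxl")
text_encoder = UMT5EncoderModel.from_pretrained("google/umt5-xxl")

prompts = ["A corgi running on the beach", "一只柯基在海滩上奔跑"]  # English / Chinese
inputs = tokenizer(prompts, padding=True, return_tensors="pt")
embeddings = text_encoder(**inputs).last_hidden_state  # (batch, seq_len, hidden_dim)
```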
Video data is highly redundant; Wan-VAE compresses space and time simultaneously via 3D convolutions.
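To make that compression concrete, here is a small arithmetic sketch mapping a pixel-space clip to its latent grid. The 81-frame 720p clip size and the causal "first frame kept" temporal rule are assumptions for illustration, not values read from the released VAE.

```python
# Back-of-the-envelope check of Wan-VAE's 4x8x8 (time x height x width) compression.
# The "first frame kept, then stride 4" temporal rule is an assumption for illustration.
frames, height, width = 81, 720, 1280            # pixel-space clip

latent_t = 1 + (frames - 1) // 4                 # 21 latent frames
latent_h = height // 8                           # 90
latent_w = width // 8                            # 160

reduction = (frames * height * width) / (latent_t * latent_h * latent_w)
print(latent_t, latent_h, latent_w, round(reduction))   # 21 90 160 247 (~ the nominal 4*8*8 = 256x)
```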
Wan2.2 uses Flow Matching: its generation trajectory is a straight line (the optimal-transport path), more efficient and stable than the "random walk" of traditional diffusion.
The model directly predicts the velocity vector along this path.
Traditional diffusion (DDPM) simulates a stochastic differential equation with curved sampling paths; Flow Matching instead learns a deterministic ODE from noise to data. Straighter paths tolerate larger step sizes, so roughly 20-50 sampler steps suffice.
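To make the velocity-prediction objective concrete, below is a minimal sketch of one Flow Matching training step on the straight (optimal-transport) path. `model` and its call signature are placeholders, not the actual Wan2.2 training code.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, text_embeds):
    """One Flow Matching step on the linear path x_t = (1 - t) * x0 + t * x1.

    x1 are clean video latents, x0 is Gaussian noise; the direction convention
    and uniform timestep sampling are simplifying assumptions for illustration.
    """
    x0 = torch.randn_like(x1)                         # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)     # one timestep per sample
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))          # broadcast over latent dims

    x_t = (1.0 - t_) * x0 + t_ * x1                   # point on the straight path
    v_target = x1 - x0                                # constant velocity of that path
    v_pred = model(x_t, t, text_embeds)               # the model predicts velocity directly

    return F.mse_loss(v_pred, v_target)
```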
Wan2.2's MoE is specialized along the time dimension of the denoising process: early steps (global composition) and late steps (fine detail) demand distinct capabilities, so each is handled by its own expert.
| Aspect | High Noise Expert | Low Noise Expert |
|---|---|---|
| Condition | Low SNR (Early) | High SNR (Late) |
| Role | Large motion, layout, outlines | Texture, lighting, denoising |
| Training | From Scratch | Fine-tuned from Wan2.1 |
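Conceptually, this MoE reduces to a timestep-based switch at inference time. The sketch below illustrates that routing; the normalized-timestep convention and the `boundary` value are illustrative assumptions, and in practice the `diffusers` pipeline below performs the switch internally.

```python
def select_expert(t_normalized, high_noise_expert, low_noise_expert, boundary=0.875):
    """Pick the active expert for one denoising step.

    t_normalized is the timestep scaled to [0, 1] with 1 = pure noise;
    boundary is an illustrative threshold, not Wan2.2's actual value.
    """
    if t_normalized >= boundary:
        return high_noise_expert    # early steps: layout, outlines, large motion
    return low_noise_expert         # late steps: texture, lighting, final denoising
```

Because only one 14B expert runs per step, per-step compute matches a dense 14B model even though the full system holds 27B parameters.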
The following code shows how to use the Hugging Face `diffusers` library to load Wan2.2 and generate a video; switching between the two MoE experts is handled internally by the pipeline.
```python
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

# 1. Load model (auto-loads both MoE expert weights)
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
    torch_dtype=torch.bfloat16
)

# 2. Enable CPU offload to save VRAM
pipe.enable_model_cpu_offload()

# 3. Generate video (Flow Matching needs ~50 steps)
prompt = "A cinematic drone shot of a futuristic city with flying cars, neon lights, 4k, high quality."
output = pipe(
    prompt=prompt,
    height=720,
    width=1280,
    num_inference_steps=50,
    guidance_scale=5.0
).frames[0]

# 4. Save result
export_to_video(output, "wan_futuristic_city.mp4")
```