Video Generation SOTA

Wan2.2 Technical Deep Dive

Next-gen video generation model based on Diffusion Transformer (DiT), integrating Flow Matching and Mixture-of-Experts (MoE).

Architecture Overview

Wan2.2 is an end-to-end video generation system. It abandons the traditional U-Net backbone and adopts the DiT architecture to handle long token sequences. The model has 27B total parameters, but thanks to MoE sparse activation only about 14B are active per denoising step, so inference VRAM usage is equivalent to a 14B model.

Unified Input (VCU)

Encodes text, frames, and masks uniformly.

Efficient Spatiotemporal Compression

Wan-VAE achieves a 4×8×8 (time × height × width) compression ratio, a 256× overall reduction.
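
As a rough sanity check on these numbers, the sketch below computes the latent grid size for a given clip. It assumes the common causal-VAE convention that the first frame is kept and each subsequent group of 4 frames maps to one latent frame; the helper name is hypothetical.

Python Sketch (Latent Shape)
# A minimal sketch of the 4x8x8 compression arithmetic. The causal handling of the
# first frame is an assumption about the convention, not a documented Wan-VAE detail.
def wan_vae_latent_shape(num_frames: int, height: int, width: int,
                         t_stride: int = 4, s_stride: int = 8):
    latent_t = 1 + (num_frames - 1) // t_stride   # causal temporal compression
    latent_h = height // s_stride                 # 8x spatial compression
    latent_w = width // s_stride                  # 8x spatial compression
    return latent_t, latent_h, latent_w

# Example: an 81-frame 720x1280 clip maps to a 21 x 90 x 160 latent grid.
print(wan_vae_latent_shape(81, 720, 1280))        # (21, 90, 160)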

Multilingual Understanding

Uses the umT5 encoder, with native bilingual (English/Chinese) support.
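
For illustration, text conditioning can be produced with a umT5 encoder via Hugging Face `transformers` as sketched below. The `google/umt5-xxl` checkpoint and the 512-token length are assumptions for the example; the Diffusers pipeline ships its own text-encoder weights.

Python Sketch (umT5 Text Encoding)
import torch
from transformers import AutoTokenizer, UMT5EncoderModel

# Assumed checkpoint for illustration; the Diffusers pipeline bundles its own encoder.
tokenizer = AutoTokenizer.from_pretrained("google/umt5-xxl")
text_encoder = UMT5EncoderModel.from_pretrained(
    "google/umt5-xxl", torch_dtype=torch.bfloat16
)

prompt = "A cat surfing on a wave at sunset"   # Chinese prompts are handled the same way
tokens = tokenizer(prompt, return_tensors="pt", padding="max_length",
                   max_length=512, truncation=True)
with torch.no_grad():
    text_embeds = text_encoder(**tokens).last_hidden_state   # (1, 512, hidden_dim)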

Data Flow Pipeline

  • Video Input → Wan-VAE Encoder (3D Causal Conv) → latent tokens
  • Text Prompt → umT5 Encoder → Text Embedding
  • Both streams feed the DiT Core (Flow Matching + MoE Switching)
  • DiT output → Wan-VAE Decoder → Reconstructed video

Spatiotemporal VAE: The Art of Compression

Video data contains high redundancy. Wan-VAE achieves simultaneous compression in space and time via 3D convolution.

Key Technical Details:

  • Causal 3D Conv: Temporal convolutions are padded only on the past side, so each frame's encoding depends only on the current and preceding frames (see the sketch after this list).
  • Hybrid Loss: A combination of L1, KL, LPIPS, and GAN losses.
  • Feature Cache: Features from previously processed chunks are cached, enabling arbitrarily long ("infinite") generation.
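
The sketch below illustrates the causal-convolution and feature-cache ideas in PyTorch; it is a minimal stand-in, not the actual Wan-VAE layer.

Python Sketch (Causal 3D Conv + Cache)
import torch
import torch.nn as nn

class CausalConv3d(nn.Module):
    """3D convolution that is causal along the time axis: output frame t
    only depends on input frames <= t. Spatial padding stays symmetric."""
    def __init__(self, in_ch, out_ch, kernel_size=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel_size
        self.time_pad = kt - 1                      # pad only on the past side
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size,
                              padding=(0, kh // 2, kw // 2))

    def forward(self, x, cache=None):
        # x: (B, C, T, H, W). `cache` holds the last `time_pad` frames of the
        # previous chunk, which is what enables streaming / "infinite" encoding.
        if cache is None:
            cache = x[:, :, :1].repeat(1, 1, self.time_pad, 1, 1)  # replicate first frame
        x = torch.cat([cache, x], dim=2)
        new_cache = x[:, :, -self.time_pad:]        # keep for the next chunk
        return self.conv(x), new_cache

# Usage: feed a long video chunk by chunk, carrying the cache forward.
layer = CausalConv3d(3, 16)
chunk1, chunk2 = torch.randn(1, 3, 8, 64, 64), torch.randn(1, 3, 8, 64, 64)
y1, cache = layer(chunk1)
y2, _ = layer(chunk2, cache)   # matches encoding all 16 frames in a single pass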

Flow Matching Principles

Visualization: Flow Matching vs. Traditional Diffusion

Wan2.2 uses Flow Matching. Its generation trajectory is a straight line (Optimal Transport Path), more efficient and stable than the "random walk" of traditional diffusion.


Training Objective:

The model directly predicts the velocity vector, rather than the noise as in DDPM.

Why Flow Matching?

Traditional diffusion (DDPM) simulates an SDE whose sampling trajectories are curved. Flow Matching instead builds a deterministic ODE from noise to data. The nearly straight paths tolerate larger step sizes, so 20-50 sampling steps are typically enough.
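
As a minimal sketch of this point, a plain Euler integrator over the learned velocity field is enough to sample. The `model` callable, the uniform schedule, and the t = 1 (noise) to t = 0 (data) convention are illustrative assumptions rather than Wan2.2's actual scheduler.

Python Sketch (Euler ODE Sampling)
import torch

@torch.no_grad()
def euler_flow_sampler(model, text_embeds, latent_shape, num_steps=50, device="cuda"):
    """Integrate the learned ODE dx/dt = v_theta(x, t) from noise (t = 1) to data (t = 0)."""
    x = torch.randn(latent_shape, device=device)              # start at pure noise
    timesteps = torch.linspace(1.0, 0.0, num_steps + 1)       # uniform schedule (illustrative)
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        t_batch = torch.full((x.shape[0],), float(t), device=device)
        v = model(x, t_batch, text_embeds)                    # predicted velocity field
        x = x + float(t_next - t) * v                         # one Euler step (dt < 0)
    return x                                                  # final latents -> Wan-VAE decoder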

Mathematical Definition

  • 1. Interpolation Path: x_t = (1 - t) · x_0 + t · x_1, where x_0 is a data sample, x_1 ~ N(0, I) is Gaussian noise, and t ∈ [0, 1] (t = 1 is pure noise).
  • 2. Vector Field: v_t = dx_t/dt = x_1 - x_0, a constant velocity along the straight-line path.
  • 3. Loss Function: L = E_{t, x_0, x_1} || v_θ(x_t, t) - (x_1 - x_0) ||².
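
A minimal sketch of this objective under the convention above; the `dit` module and its call signature are illustrative assumptions.

Python Sketch (Flow Matching Loss)
import torch
import torch.nn.functional as F

def flow_matching_loss(dit, x0, text_embeds):
    """x0: clean video latents of shape (B, C, T, H, W); returns the velocity-matching loss."""
    b = x0.shape[0]
    x1 = torch.randn_like(x0)                 # noise endpoint of the straight path
    t = torch.rand(b, device=x0.device)       # t ~ U[0, 1]
    t_exp = t.view(b, 1, 1, 1, 1)
    xt = (1.0 - t_exp) * x0 + t_exp * x1      # interpolation path x_t
    target_v = x1 - x0                        # constant velocity along the path
    pred_v = dit(xt, t, text_embeds)          # model predicts the velocity directly
    return F.mse_loss(pred_v, target_v)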

Mixture-of-Experts (MoE) Architecture

Wan2.2's MoE is specialized for the time dimension of the denoising process. Early stages (composition) and late stages (details) require distinct capabilities.

MoE Dynamic Switching

Total Params: 27B | Active per step: 14B

The denoising timeline runs from pure noise (input, t = 1.0) to the finished video (output). During the early, high-noise stage the High Noise Expert is active, focusing on global layout and structure; once the timestep drops below the switch threshold t, the Low Noise Expert takes over for refinement.

Expert Division of Labor

Component  | High Noise Expert              | Low Noise Expert
Condition  | Low SNR (early timesteps)      | High SNR (late timesteps)
Role       | Large motion, layout, outlines | Texture, lighting, denoising
Training   | From scratch                   | Fine-tuned from Wan2.1
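
A minimal sketch of the timestep-based routing described above; the boundary value is an illustrative assumption, not a documented Wan2.2 threshold (in the Diffusers pipeline the switch is handled internally).

Python Sketch (Expert Routing)
def select_expert(t: float, high_noise_expert, low_noise_expert, boundary: float = 0.875):
    """Route one denoising step to one of the two 14B experts.

    t is normalized to [0, 1] with t = 1 meaning pure noise; `boundary` is an
    illustrative value, not Wan2.2's documented threshold."""
    if t >= boundary:
        return high_noise_expert   # early steps (low SNR): layout, motion, structure
    return low_noise_expert        # late steps (high SNR): texture, lighting, detail

# Only one 14B expert runs per step, so activated parameters stay at 14B
# even though the full model holds 27B.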

Code Implementation

The following code shows how to use the Hugging Face `diffusers` library to load Wan2.2 and generate video. MoE switching is handled internally.

Python Inference (Diffusers)
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

# 1. Load model (Auto loads MoE weights)
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
    torch_dtype=torch.bfloat16
)

# 2. Enable CPU offload for VRAM saving
pipe.enable_model_cpu_offload()

# 3. Generate Video (Flow Matching needs ~50 steps)
prompt = "A cinematic drone shot of a futuristic city with flying cars, neon lights, 4k, high quality."
output = pipe(
    prompt=prompt,
    height=720,
    width=1280,
    num_inference_steps=50,
    guidance_scale=5.0
).frames[0]

# 4. Save Result
export_to_video(output, "wan_futuristic_city.mp4")