Video Generation SOTA

Wan2.2 Technical Deep Dive

Next-gen video generation model based on Diffusion Transformer (DiT), integrating Flow Matching and Mixture-of-Experts (MoE).

Architecture Overview

Wan2.2 is an end-to-end video generation system. It abandons the traditional U-Net backbone and adopts the DiT architecture to handle long token sequences. The model has 27B total parameters, but thanks to MoE sparse activation only about 14B are active per denoising step, so inference VRAM usage is equivalent to a 14B model.

Unified Input (VCU)

Encodes text, frames, and masks uniformly.

Efficient Spatiotemporal Compression

Wan-VAE achieves a 4×8×8 (time × height × width) compression ratio, a 256× overall reduction.
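
As a rough sanity check on these numbers, the sketch below computes the latent grid size for a given clip. It assumes the common causal-VAE convention that the first frame is kept and each subsequent group of 4 frames maps to one latent frame; the helper name is hypothetical.

Python Sketch (Latent Shape)
# A minimal sketch of the 4x8x8 compression arithmetic. The causal handling of the
# first frame is an assumption about the convention, not a documented Wan-VAE detail.
def wan_vae_latent_shape(num_frames: int, height: int, width: int,
                         t_stride: int = 4, s_stride: int = 8):
    latent_t = 1 + (num_frames - 1) // t_stride   # causal temporal compression
    latent_h = height // s_stride                 # 8x spatial compression
    latent_w = width // s_stride                  # 8x spatial compression
    return latent_t, latent_h, latent_w

# Example: an 81-frame 720x1280 clip maps to a 21 x 90 x 160 latent grid.
print(wan_vae_latent_shape(81, 720, 1280))        # (21, 90, 160)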

Multilingual Understanding

Uses the umT5 encoder, with native bilingual (English/Chinese) support.
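
For illustration, text conditioning can be produced with a umT5 encoder via Hugging Face `transformers` as sketched below. The `google/umt5-xxl` checkpoint and the 512-token length are assumptions for the example; the Diffusers pipeline ships its own text-encoder weights.

Python Sketch (umT5 Text Encoding)
import torch
from transformers import AutoTokenizer, UMT5EncoderModel

# Assumed checkpoint for illustration; the Diffusers pipeline bundles its own encoder.
tokenizer = AutoTokenizer.from_pretrained("google/umt5-xxl")
text_encoder = UMT5EncoderModel.from_pretrained(
    "google/umt5-xxl", torch_dtype=torch.bfloat16
)

prompt = "A cat surfing on a wave at sunset"   # Chinese prompts are handled the same way
tokens = tokenizer(prompt, return_tensors="pt", padding="max_length",
                   max_length=512, truncation=True)
with torch.no_grad():
    text_embeds = text_encoder(**tokens).last_hidden_state   # (1, 512, hidden_dim)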

Data Flow Pipeline

  • Video Input → Wan-VAE Encoder (3D Causal Conv) → latent tokens
  • Text Prompt → umT5 Encoder → Text Embedding
  • Both streams feed the DiT Core (Flow Matching + MoE Switching)
  • DiT output → Wan-VAE Decoder → Reconstructed video

Spatiotemporal VAE: The Art of Compression

Video data contains high redundancy. Wan-VAE achieves simultaneous compression in space and time via 3D convolution.

Key Technical Details:

  • Causal 3D Conv: Temporal convolutions are padded only on the past side, so each frame's encoding depends only on the current and preceding frames (see the sketch after this list).
  • Hybrid Loss: A combination of L1, KL, LPIPS, and GAN losses.
  • Feature Cache: Features from previously processed chunks are cached, enabling arbitrarily long ("infinite") generation.
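
The sketch below illustrates the causal-convolution and feature-cache ideas in PyTorch; it is a minimal stand-in, not the actual Wan-VAE layer.

Python Sketch (Causal 3D Conv + Cache)
import torch
import torch.nn as nn

class CausalConv3d(nn.Module):
    """3D convolution that is causal along the time axis: output frame t
    only depends on input frames <= t. Spatial padding stays symmetric."""
    def __init__(self, in_ch, out_ch, kernel_size=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel_size
        self.time_pad = kt - 1                      # pad only on the past side
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size,
                              padding=(0, kh // 2, kw // 2))

    def forward(self, x, cache=None):
        # x: (B, C, T, H, W). `cache` holds the last `time_pad` frames of the
        # previous chunk, which is what enables streaming / "infinite" encoding.
        if cache is None:
            cache = x[:, :, :1].repeat(1, 1, self.time_pad, 1, 1)  # replicate first frame
        x = torch.cat([cache, x], dim=2)
        new_cache = x[:, :, -self.time_pad:]        # keep for the next chunk
        return self.conv(x), new_cache

# Usage: feed a long video chunk by chunk, carrying the cache forward.
layer = CausalConv3d(3, 16)
chunk1, chunk2 = torch.randn(1, 3, 8, 64, 64), torch.randn(1, 3, 8, 64, 64)
y1, cache = layer(chunk1)
y2, _ = layer(chunk2, cache)   # matches encoding all 16 frames in a single pass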

Flow Matching Principles

Visualization: Flow Matching vs. Traditional Diffusion

Wan2.2 uses Flow Matching. Its generation trajectory is a straight line (Optimal Transport Path), more efficient and stable than the "random walk" of traditional diffusion.


Training Objective:

The model directly predicts the velocity vector, rather than the noise as in DDPM.

Why Flow Matching?

Traditional diffusion (DDPM) simulates an SDE whose sampling trajectories are curved. Flow Matching instead builds a deterministic ODE from noise to data. The nearly straight paths tolerate larger step sizes, so 20-50 sampling steps are typically enough.
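
As a minimal sketch of this point, a plain Euler integrator over the learned velocity field is enough to sample. The `model` callable, the uniform schedule, and the t = 1 (noise) to t = 0 (data) convention are illustrative assumptions rather than Wan2.2's actual scheduler.

Python Sketch (Euler ODE Sampling)
import torch

@torch.no_grad()
def euler_flow_sampler(model, text_embeds, latent_shape, num_steps=50, device="cuda"):
    """Integrate the learned ODE dx/dt = v_theta(x, t) from noise (t = 1) to data (t = 0)."""
    x = torch.randn(latent_shape, device=device)              # start at pure noise
    timesteps = torch.linspace(1.0, 0.0, num_steps + 1)       # uniform schedule (illustrative)
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        t_batch = torch.full((x.shape[0],), float(t), device=device)
        v = model(x, t_batch, text_embeds)                    # predicted velocity field
        x = x + float(t_next - t) * v                         # one Euler step (dt < 0)
    return x                                                  # final latents -> Wan-VAE decoder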

Mathematical Definition

  • 1. Interpolation Path: x_t = (1 - t) · x_0 + t · x_1, where x_0 is a data sample, x_1 ~ N(0, I) is Gaussian noise, and t ∈ [0, 1] (t = 1 is pure noise).
  • 2. Vector Field: v_t = dx_t/dt = x_1 - x_0, a constant velocity along the straight-line path.
  • 3. Loss Function: L = E_{t, x_0, x_1} || v_θ(x_t, t) - (x_1 - x_0) ||².
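
A minimal sketch of this objective under the convention above; the `dit` module and its call signature are illustrative assumptions.

Python Sketch (Flow Matching Loss)
import torch
import torch.nn.functional as F

def flow_matching_loss(dit, x0, text_embeds):
    """x0: clean video latents of shape (B, C, T, H, W); returns the velocity-matching loss."""
    b = x0.shape[0]
    x1 = torch.randn_like(x0)                 # noise endpoint of the straight path
    t = torch.rand(b, device=x0.device)       # t ~ U[0, 1]
    t_exp = t.view(b, 1, 1, 1, 1)
    xt = (1.0 - t_exp) * x0 + t_exp * x1      # interpolation path x_t
    target_v = x1 - x0                        # constant velocity along the path
    pred_v = dit(xt, t, text_embeds)          # model predicts the velocity directly
    return F.mse_loss(pred_v, target_v)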

Mixture-of-Experts (MoE) Architecture

Wan2.2's MoE is specialized for the time dimension of the denoising process. Early stages (composition) and late stages (details) require distinct capabilities.

MoE Dynamic Switching

Total Params: 27B | Active per step: 14B

The denoising timeline runs from pure noise (input, t = 1.0) to the finished video (output). During the early, high-noise stage the High Noise Expert is active, focusing on global layout and structure; once the timestep drops below the switch threshold t, the Low Noise Expert takes over for refinement.

Expert Division of Labor

Component  | High Noise Expert              | Low Noise Expert
Condition  | Low SNR (early timesteps)      | High SNR (late timesteps)
Role       | Large motion, layout, outlines | Texture, lighting, denoising
Training   | From scratch                   | Fine-tuned from Wan2.1
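
A minimal sketch of the timestep-based routing described above; the boundary value is an illustrative assumption, not a documented Wan2.2 threshold (in the Diffusers pipeline the switch is handled internally).

Python Sketch (Expert Routing)
def select_expert(t: float, high_noise_expert, low_noise_expert, boundary: float = 0.875):
    """Route one denoising step to one of the two 14B experts.

    t is normalized to [0, 1] with t = 1 meaning pure noise; `boundary` is an
    illustrative value, not Wan2.2's documented threshold."""
    if t >= boundary:
        return high_noise_expert   # early steps (low SNR): layout, motion, structure
    return low_noise_expert        # late steps (high SNR): texture, lighting, detail

# Only one 14B expert runs per step, so activated parameters stay at 14B
# even though the full model holds 27B.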

Code Implementation

The following code shows how to use the Hugging Face `diffusers` library to load Wan2.2 and generate video. MoE switching is handled internally.

Python Inference (Diffusers)
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

# 1. Load model (Auto loads MoE weights)
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
    torch_dtype=torch.bfloat16
)

# 2. Enable CPU offload for VRAM saving
pipe.enable_model_cpu_offload()

# 3. Generate Video (Flow Matching needs ~50 steps)
prompt = "A cinematic drone shot of a futuristic city with flying cars, neon lights, 4k, high quality."
output = pipe(
    prompt=prompt,
    height=720,
    width=1280,
    num_inference_steps=50,
    guidance_scale=5.0
).frames[0]

# 4. Save Result
export_to_video(output, "wan_futuristic_city.mp4")