ComfyUI Now Supports Wan2.1

🧠 What is Wan2.1?

Wan2.1 is a powerful series of open-source video generation models from Alibaba.

The series includes:

| Model Type | Resolution | VRAM (approx.) |
| Text-to-Video 14B (T2V) | 480P / 720P | ~40GB |
| Text-to-Video 1.3B (T2V) | 480P | ~8–15GB |
| Image-to-Video 14B (I2V) | 480P / 720P | ~40GB |
| Visual Text Generation | Multilingual (Chinese/English) | Variable |

🔧 Main Features

  • Consumer-grade Friendly: The T2V 1.3B model can run on GPUs with approximately 8.19 GB of VRAM.
  • Multi-task Support: Supports T2V (Text-to-Video), I2V (Image-to-Video), V2V (Video-to-Video), T2I (Text-to-Image), and V2A (Video-to-Audio).
  • High Efficiency: Wan-VAE can encode and decode 1080p video while preserving temporal consistency.
  • Language Support: Described as the first video model able to generate both Chinese and English text inside the video.

📂 Setup Guide

  1. Update ComfyUI to the latest version.
  2. Download the required files and place them in the specified ComfyUI subdirectories:
| File Description | Filename | Target Folder |
| Text Encoder | umt5_xxl_fp8_e4m3fn_scaled.safetensors | ComfyUI/models/text_encoders/ |
| VAE | wan_2.1_vae.safetensors | ComfyUI/models/vae/ |
| CLIP Vision (for Image-to-Video) | clip_vision_h.safetensors | ComfyUI/models/clip_vision/ |
| Video Model (Diffusion Model) | Choose one variant (see the recommendation below) | ComfyUI/models/diffusion_models/ |
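
Before loading a workflow, it can help to confirm that the files ended up in the right folders. The following is a minimal sketch, not part of ComfyUI: the helper names `find_missing_files` and `has_diffusion_model` are hypothetical, and the filenames and folders are the ones from the table above. The diffusion model is only checked by extension, since its filename depends on the variant you choose.

```python
from pathlib import Path

# Filenames and target folders taken from the setup table.
# The diffusion model is handled separately because its name varies by variant.
REQUIRED_FILES = {
    "umt5_xxl_fp8_e4m3fn_scaled.safetensors": "models/text_encoders",
    "wan_2.1_vae.safetensors": "models/vae",
    "clip_vision_h.safetensors": "models/clip_vision",  # only needed for Image-to-Video
}


def find_missing_files(comfyui_root):
    """Return the expected paths (relative to comfyui_root) that do not exist yet."""
    root = Path(comfyui_root)
    missing = []
    for filename, folder in REQUIRED_FILES.items():
        relative = Path(folder) / filename
        if not (root / relative).is_file():
            missing.append(relative.as_posix())
    return missing


def has_diffusion_model(comfyui_root):
    """Return True if at least one .safetensors file sits in models/diffusion_models."""
    folder = Path(comfyui_root) / "models" / "diffusion_models"
    return folder.is_dir() and any(folder.glob("*.safetensors"))


if __name__ == "__main__":
    # "ComfyUI" is assumed to be the install folder relative to the current directory.
    for path in find_missing_files("ComfyUI"):
        print("missing:", path)
    if not has_diffusion_model("ComfyUI"):
        print("no diffusion model found in models/diffusion_models")
```

If you only plan to run Text-to-Video, a missing `clip_vision_h.safetensors` can be ignored.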

Video Model Recommendation:

  • For best quality, the fp16 version is recommended.
  • Quality ranking (high to low): fp16 > bf16 > fp8_scaled > fp8_e4m3fn
  • If VRAM is insufficient, consider using the fp8 version.
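
The ranking above can be turned into a small selection rule. This is an illustrative sketch only: it assumes each model filename ends in its precision tag (for example `_fp16.safetensors`), and the filenames in the usage example are hypothetical placeholders rather than real download names.

```python
# Quality order from the recommendation list, best first.
PRECISION_RANKING = ["fp16", "bf16", "fp8_scaled", "fp8_e4m3fn"]
SUFFIX = ".safetensors"


def precision_of(filename):
    """Return the precision tag a filename ends with, or None if there is none."""
    stem = filename.lower()
    if stem.endswith(SUFFIX):
        stem = stem[: -len(SUFFIX)]
    for tag in PRECISION_RANKING:
        if stem.endswith("_" + tag):
            return tag
    return None


def best_available(filenames, allow_16bit=True):
    """Pick the highest-ranked file; allow_16bit=False keeps only the fp8 variants."""
    candidates = []
    for name in filenames:
        tag = precision_of(name)
        if tag is None:
            continue
        if not allow_16bit and tag in ("fp16", "bf16"):
            continue
        candidates.append((PRECISION_RANKING.index(tag), name))
    return min(candidates)[1] if candidates else None


if __name__ == "__main__":
    # Hypothetical filenames, used only to show the ranking in action.
    files = ["model_fp8_e4m3fn.safetensors", "model_bf16.safetensors"]
    print(best_available(files))                     # model_bf16.safetensors
    print(best_available(files, allow_16bit=False))  # model_fp8_e4m3fn.safetensors
```

Passing `allow_16bit=False` mirrors the advice to fall back to an fp8 file when VRAM is tight.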

📜 Example Workflows

ComfyUI workflows are stored as JSON files, which you can find in the official ComfyUI examples or documentation. The examples below cover text-to-video and image-to-video.

Text to Video

This workflow can be used with the 1.3B or 14B T2V models. Place the diffusion model you chose in ComfyUI/models/diffusion_models/.

Output: 480p / 720p (depends on the selected model and settings)

Runtime: Generating a 5-second 480p video with an RTX 4090 takes about 4 minutes.

Workflow Example (1.3B 480p): [image: Text to Video 1.3B 480P Workflow Example]

Workflow Example (14B 720p): [image: Text to Video 14B 720P Workflow Example]

JSON Workflow File: text_to_video_wan.json

Image to Video

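This workflow requires an I2V diffusion model in ComfyUI/models/diffusion_models/ and clip_vision_h.safetensors in ComfyUI/models/clip_vision/, in addition to the text encoder and VAE listed in the Setup Guide.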

Output: 480p (default example: 33 frames @ 512x512) or 720p (if VRAM and hardware allow).
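
To relate a frame count such as the 33-frame default to clip length, divide by the playback frame rate. The sketch below assumes 16 frames per second, which is an assumption made here for illustration and not a value stated in this guide; substitute the frame rate your workflow actually saves at.

```python
ASSUMED_FPS = 16  # assumption for illustration; check your workflow's save settings


def duration_seconds(num_frames, fps=ASSUMED_FPS):
    """Clip length in seconds for a given frame count and frame rate."""
    return num_frames / fps


def frames_for_duration(seconds, fps=ASSUMED_FPS):
    """Approximate number of frames needed for a clip of the given length."""
    return round(seconds * fps)


if __name__ == "__main__":
    print(duration_seconds(33))    # 2.0625 (seconds, at the assumed 16 fps)
    print(frames_for_duration(5))  # 80 (frames, at the assumed 16 fps)
```

Longer clips need proportionally more frames, which raises both runtime and VRAM use.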

Workflow Example (14B 480p): [image: Image to Video 14B 480P Workflow Example]

Workflow Example (14B 720p): [image: Image to Video 14B 720P Workflow Example]

JSON Workflow File: image_to_video_wan_example.json
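
Because a workflow is plain JSON, you can inspect it before loading it into ComfyUI, for example to see which node types it uses. The sketch below is a hedged illustration: it assumes the two layouts ComfyUI commonly exports (a UI export with a top-level "nodes" list whose entries carry a "type", and an API-format export that maps node ids to objects with a "class_type"). The helper names are hypothetical.

```python
import json
from collections import Counter


def count_node_types(workflow):
    """Count node types in a workflow dict (UI export or API-format export)."""
    if isinstance(workflow.get("nodes"), list):
        # Assumed UI export layout: {"nodes": [{"type": "..."}, ...], ...}
        types = [node.get("type", "unknown") for node in workflow["nodes"]]
    else:
        # Assumed API layout: {"<node id>": {"class_type": "...", "inputs": {...}}}
        types = [
            node.get("class_type", "unknown")
            for node in workflow.values()
            if isinstance(node, dict)
        ]
    return Counter(types)


def summarize_workflow_file(path):
    """Load a workflow JSON file and return a Counter of its node types."""
    with open(path, encoding="utf-8") as handle:
        return count_node_types(json.load(handle))


if __name__ == "__main__":
    # Assumes the file has been downloaded to the current directory.
    for node_type, count in summarize_workflow_file("image_to_video_wan_example.json").items():
        print(f"{count:3d}  {node_type}")
```

A node type you do not recognize in the summary usually means ComfyUI needs to be updated, as in step 1 of the Setup Guide.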

📝 Notes

  • Text Encoder: Required (umt5_xxl_fp8_e4m3fn_scaled.safetensors).
  • VRAM Requirement: To run the 480p/720p Image-to-Video model (e.g., 14B I2V) with umt5_xxl_fp8_e4m3fn_scaled.safetensors, you need about 40GB of VRAM.
  • 1.3B T2V Model VRAM: The 1.3B Text-to-Video model requires up to approximately 15GB of VRAM (the table above lists a range of ~8–15GB).
  • Saving VRAM: Examples typically use 16-bit (fp16) files, but if you are low on VRAM, you can use fp8 versions instead.
  • 720p Models: 720p models work well but require more capable hardware and more patience to run.
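
A rough way to see why fp8 files save memory is to multiply the parameter count by the bytes stored per parameter. The sketch below is a back-of-the-envelope estimate under stated assumptions: it uses the nominal 14B and 1.3B parameter counts, counts model weights only, and ignores activations, the text encoder, the VAE, and any per-tensor scale data, so real VRAM use is higher than these numbers.

```python
# Bytes stored per parameter for each precision family.
BYTES_PER_PARAM = {"fp16": 2, "bf16": 2, "fp8": 1}


def weight_size_gb(num_params, precision):
    """Approximate size of the model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for label, params in (("14B", 14e9), ("1.3B", 1.3e9)):
        for precision in ("fp16", "fp8"):
            size = weight_size_gb(params, precision)
            print(f"{label} {precision}: ~{size:.1f} GB of weights")
```

Under these assumptions, switching the 14B model from fp16 to fp8 roughly halves the weight footprint, from about 28 GB to about 14 GB.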