ComfyUI Now Supports Wan2.1

🧠 What is Wan2.1?

Wan2.1 is a powerful series of open-source video generation models from Alibaba.

The series includes:

Model Type	Resolution	VRAM (approx.)
Text-to-Video 14B (T2V)	480P / 720P	~40GB
Text-to-Video 1.3B (T2V)	480P	~8–15GB
Image-to-Video 14B (I2V)	480P / 720P	~40GB
Visual Text Generation	Multilingual (Chinese/English)	Variable

🔧 Main Features

Consumer-grade Friendly: The T2V 1.3B model can run on GPUs with approximately 8.19 GB of VRAM.
Multi-task Support: Supports T2V (Text-to-Video), I2V (Image-to-Video), V2V (Video-to-Video), T2I (Text-to-Image), and V2A (Video-to-Audio).
High Efficiency: The Wan-VAE can encode and decode 1080p videos while preserving temporal consistency.
Language Support: The first video model to support generating both Chinese and English text within videos.

📂 Setup Guide

Update ComfyUI to the latest version.
Download the required files and place them in the specified ComfyUI subdirectories:

File Description	Filename	Target Folder
Text Encoder	`umt5_xxl_fp8_e4m3fn_scaled.safetensors`	`ComfyUI/models/text_encoders/`
VAE	`wan_2.1_vae.safetensors`	`ComfyUI/models/vae/`
CLIP Vision (for Image-to-Video)	`clip_vision_h.safetensors`	`ComfyUI/models/clip_vision/`
Video Model (Diffusion Model)	Select one file (see the recommendation below)	`ComfyUI/models/diffusion_models/`
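To confirm that everything landed in the right place, a small script along these lines can help. It is only a sketch: the function name `find_missing` and the `ComfyUI` install path are assumptions for illustration, and the diffusion model entry is just one example filename that you would swap for the variant you actually downloaded.

```python
from pathlib import Path

# Filenames and target folders as listed in the table above.
# The diffusion model entry is one example; substitute your chosen variant.
REQUIRED_FILES = {
    "models/text_encoders": ["umt5_xxl_fp8_e4m3fn_scaled.safetensors"],
    "models/vae": ["wan_2.1_vae.safetensors"],
    "models/clip_vision": ["clip_vision_h.safetensors"],  # Image-to-Video only
    "models/diffusion_models": ["wan2.1_t2v_1.3B_fp16.safetensors"],
}


def find_missing(comfy_root, required=REQUIRED_FILES):
    """Return relative paths of required files not found under comfy_root."""
    root = Path(comfy_root)
    missing = []
    for folder, names in required.items():
        for name in names:
            if not (root / folder / name).is_file():
                missing.append(f"{folder}/{name}")
    return missing


if __name__ == "__main__":
    # "ComfyUI" is an assumed install location; adjust it to your setup.
    for path in find_missing("ComfyUI"):
        print("missing:", path)
```

If the script prints nothing, every listed file was found in its expected folder.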

Video Model Recommendation:

For best quality, the fp16 version is recommended.
Quality ranking (high to low): fp16 > bf16 > fp8_scaled > fp8_e4m3fn.
If VRAM is insufficient, consider using the fp8 version.
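The recommendation above boils down to "pick the highest-ranked precision that your GPU can actually run". A minimal sketch of that rule follows; the helper name `best_variant` is hypothetical, and deciding which variants fit in your VRAM is left to you, since the article gives no per-variant figures.

```python
# Quality ranking from the recommendation above, best first.
QUALITY_ORDER = ["fp16", "bf16", "fp8_scaled", "fp8_e4m3fn"]


def best_variant(usable):
    """Return the highest-quality precision among those you can run, or None."""
    for precision in QUALITY_ORDER:
        if precision in usable:
            return precision
    return None


# Example: only the fp8 files fit on this (hypothetical) GPU.
print(best_variant({"fp8_scaled", "fp8_e4m3fn"}))  # fp8_scaled
```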

📜 Example Workflows

ComfyUI provides JSON-based workflows. You can find these JSON files in the official ComfyUI examples or documentation.

Text to Video

This workflow can be used with the 1.3B or 14B models. For example, use:

Model file: wan2.1_t2v_1.3B_fp16.safetensors (place in ComfyUI/models/diffusion_models/)

Output: 480p / 720p (depends on the selected model and settings)

Runtime: Generating a 5-second 480p video with an RTX 4090 takes about 4 minutes.

Workflow Example (1.3B 480p):

[Figure: Text to Video 1.3B 480P workflow example]

Workflow Example (14B 720p):

JSON Workflow File: text_to_video_wan.json

Image to Video

This workflow requires the following files:

Model file (480p): wan2.1_i2v_480p_14B_fp16.safetensors (place in ComfyUI/models/diffusion_models/)
Model file (720p, optional): wan2.1_i2v_720p_14B_fp16.safetensors (place in ComfyUI/models/diffusion_models/)
CLIP Vision: clip_vision_h.safetensors (place in ComfyUI/models/clip_vision/)

Output: 480p (default example: 33 frames @ 512x512) or 720p (if VRAM and hardware allow).

Workflow Example (14B 480p):

[Figure: Image to Video 14B 480P workflow example]

Workflow Example (14B 720p):

[Figure: Image to Video 14B 720P workflow example]

JSON Workflow File: image_to_video_wan_example.json
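Before loading a workflow JSON file, it can be useful to see which model files it expects. The sketch below walks the parsed JSON and collects every string ending in `.safetensors`. It assumes only that the workflow stores model filenames as plain strings somewhere in the file, not any particular ComfyUI schema; the function names are illustrative.

```python
import json


def referenced_models(node):
    """Recursively collect every string ending in .safetensors from parsed JSON."""
    found = set()
    if isinstance(node, str):
        if node.endswith(".safetensors"):
            found.add(node)
    elif isinstance(node, dict):
        for value in node.values():
            found |= referenced_models(value)
    elif isinstance(node, list):
        for item in node:
            found |= referenced_models(item)
    return found


def models_in_workflow(path):
    """Load a workflow JSON file and return its model filenames, sorted."""
    with open(path, encoding="utf-8") as f:
        return sorted(referenced_models(json.load(f)))
```

For example, `models_in_workflow("image_to_video_wan_example.json")` would return the filenames that file mentions, which you can then compare against the contents of your `ComfyUI/models/` subfolders.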

📝 Notes

Text Encoder: Required (umt5_xxl_fp8_e4m3fn_scaled.safetensors).
VRAM Requirement: Running the 14B Image-to-Video model (480p or 720p) with umt5_xxl_fp8_e4m3fn_scaled.safetensors requires about 40GB of VRAM.
1.3B T2V Model VRAM: The 1.3B Text-to-Video model requires approximately 15GB of VRAM.
Saving VRAM: The examples typically use 16-bit (fp16) files, but if you are low on VRAM, you can use the fp8 versions instead.
720p Models: The 720p models work well but require more capable hardware and more patience to run.
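As a rough planning aid, the VRAM figures quoted in this article can be turned into a quick lookup. This is a sketch only: the numbers are the approximate values given above (using 15 GB for the 1.3B model), real usage varies with resolution, frame count, and whether fp8 files are used, and `models_that_fit` is a hypothetical helper name.

```python
# Approximate VRAM figures (GB) quoted in this article; actual usage varies.
APPROX_VRAM_GB = {
    "T2V 1.3B": 15,  # upper end of the range quoted for the 1.3B model
    "T2V 14B": 40,
    "I2V 14B": 40,
}


def models_that_fit(vram_gb):
    """List the models whose quoted VRAM figure is within vram_gb."""
    return [name for name, need in APPROX_VRAM_GB.items() if vram_gb >= need]


print(models_that_fit(24))  # ['T2V 1.3B']
```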