Tracing the journey from early blurry, incoherent clips to the leap enabled by diffusion models and transformers that can plausibly simulate the physical world.
2014 - 2018
Initial attempts used Recurrent Neural Networks (RNNs) to predict pixels frame by frame, essentially 'guessing' each subsequent frame from the previous ones. Because every prediction is fed back in as input for the next step, errors accumulate and long-range dependencies are lost, so videos quickly become blurry and distorted (a minimal rollout sketch follows at the end of this section).
Diagram: a real Frame T yields a predicted Frame T+1, which in turn yields a Frame T+2 that is already blurry, showing how prediction errors compound from frame to frame.
Core Challenge: Error accumulation leads to rapid image degradation.
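As a rough illustration, the sketch below rolls out a toy next-frame predictor autoregressively. The frame size, GRU hidden size, and rollout length are illustrative assumptions, and real systems of this era used convolutional recurrent models rather than a plain GRU over flattened frames; the point is structural: once the observed frames run out, each prediction becomes the input to the next step, which is exactly where errors start to compound.

```python
import torch
import torch.nn as nn

H = W = 16                                  # toy frame resolution (assumption)
FRAME_DIM = H * W

class NextFramePredictor(nn.Module):
    """Predicts the next frame from the current one plus a recurrent state."""
    def __init__(self, hidden=256):
        super().__init__()
        self.rnn = nn.GRUCell(FRAME_DIM, hidden)
        self.head = nn.Linear(hidden, FRAME_DIM)

    def forward(self, frame, state):
        state = self.rnn(frame, state)
        return torch.sigmoid(self.head(state)), state

model = NextFramePredictor()
state = torch.zeros(1, 256)

# Condition on a few observed frames (random stand-ins here) ...
observed = torch.rand(4, 1, FRAME_DIM)
for frame in observed:
    pred, state = model(frame, state)

# ... then roll out autoregressively: each prediction is fed back as the next
# input, so any imperfection is consumed again and compounds over the rollout.
rollout = []
frame = pred
for _ in range(8):
    frame, state = model(frame, state)
    rollout.append(frame)
```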
2018 - 2022
GANs (Generative Adversarial Networks) improved image quality through a generator-discriminator competition (sketched below), but they struggled with temporal consistency and their training was unstable and prone to mode collapse. VAEs (Variational Autoencoders) learned compact latent representations of the data but tended to produce blurry results.
Diagram: the generator and the discriminator are optimized in alternation, each against the other.
Core Challenge: GAN training instability and lack of temporal coherence in videos.
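The alternating optimization can be sketched in a few lines. The toy MLP generator and discriminator, batch size, and learning rates below are illustrative assumptions (video GANs of this period were far larger and typically added temporal discriminators), but the two-step loop is the core of the adversarial game.

```python
import torch
import torch.nn as nn

FRAME_DIM, NOISE_DIM, BATCH = 16 * 16, 64, 32    # toy sizes (assumptions)

# Generator maps noise to a fake frame; discriminator scores real vs. fake.
G = nn.Sequential(nn.Linear(NOISE_DIM, 256), nn.ReLU(),
                  nn.Linear(256, FRAME_DIM), nn.Tanh())
D = nn.Sequential(nn.Linear(FRAME_DIM, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(BATCH, FRAME_DIM) * 2 - 1      # stand-in for real frames in [-1, 1]

for step in range(100):
    # Optimize discriminator: push real frames toward 1, generated frames toward 0.
    fake = G(torch.randn(BATCH, NOISE_DIM)).detach()
    loss_d = bce(D(real), torch.ones(BATCH, 1)) + bce(D(fake), torch.zeros(BATCH, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Optimize generator: produce frames the discriminator scores as real.
    fake = G(torch.randn(BATCH, NOISE_DIM))
    loss_g = bce(D(fake), torch.ones(BATCH, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```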
2022 - 2023
To tame the computational cost of denoising directly in pixel space, Latent Diffusion Models (LDMs) emerged. A VAE first compresses the video into a low-dimensional latent space, a diffusion model performs efficient iterative denoising there, and the VAE decoder then maps the result back to pixel space (see the sketch below). This architecture greatly improved efficiency and practicality.
Significance: Struck a practical balance between computational efficiency and generation quality.
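A minimal sketch of that encode → denoise → decode pipeline is shown below. The linear encoder/decoder, the untrained noise predictor, and the 50-step DDPM-style schedule are all illustrative assumptions; a real LDM uses a convolutional VAE and a trained U-Net or transformer denoiser. What matters is that the expensive iterative loop runs entirely in the small latent space, and pixels are produced only once, at the very end.

```python
import torch
import torch.nn as nn

PIXEL_DIM, LATENT_DIM, STEPS = 64 * 64, 256, 50   # toy sizes (assumptions)

encoder = nn.Linear(PIXEL_DIM, LATENT_DIM)        # stand-in VAE encoder (used in training to compress real frames)
decoder = nn.Linear(LATENT_DIM, PIXEL_DIM)        # stand-in VAE decoder
denoiser = nn.Linear(LATENT_DIM + 1, LATENT_DIM)  # stand-in noise predictor on (latent, timestep)

betas = torch.linspace(1e-4, 0.02, STEPS)         # DDPM-style noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

# Generation starts from pure noise in the *latent* space, not in pixel space.
z = torch.randn(1, LATENT_DIM)
for t in reversed(range(STEPS)):
    t_embed = torch.full((1, 1), t / STEPS)                     # crude timestep conditioning
    eps_hat = denoiser(torch.cat([z, t_embed], dim=1))          # predicted noise
    z = (z - betas[t] / torch.sqrt(1.0 - alphas_bar[t]) * eps_hat) / torch.sqrt(1.0 - betas[t])
    if t > 0:
        z = z + torch.sqrt(betas[t]) * torch.randn_like(z)      # re-inject noise, except at the last step

frame = decoder(z)                                # decode back to pixel space only once
```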
Early 2024
Represented by Sora, this architecture replaced the U-Net with a Transformer as the diffusion backbone. By decomposing videos into spatio-temporal patches (see the sketch below), the Transformer's self-attention captures long-range dependencies across space and time, dramatically improving temporal coherence.
Diagram: the latent representation is decomposed into spatio-temporal patches, and the patch sequences are then processed like language tokens.
Significance: Treats video generation as sequence modeling over a 'visual language', enabling a qualitative leap.
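The sketch below shows the patchification step and a vanilla self-attention encoder over the resulting tokens. The tensor sizes, patch size (one frame thick in time), and two-layer encoder are illustrative assumptions, and DiT-style models additionally condition each block on the diffusion timestep and the text prompt; the key property is that every spatio-temporal patch can attend to every other patch, across frames as well as within them.

```python
import torch
import torch.nn as nn

T, C, H, W = 8, 3, 32, 32      # toy latent video: frames, channels, height, width (assumptions)
P, D = 8, 96                   # spatial patch size and token embedding dimension (assumptions)

video = torch.randn(1, C, T, H, W)

# 1. Cut the spatio-temporal volume into P x P tiles per frame:
#    (1, C, T, H, W) -> (1, N, C*P*P) with N = T * (H/P) * (W/P) patch tokens.
patches = video.unfold(3, P, P).unfold(4, P, P)
patches = patches.permute(0, 2, 3, 4, 1, 5, 6).reshape(1, -1, C * P * P)

# 2. Embed each patch as a token, much like a word embedding in a language model.
embed = nn.Linear(C * P * P, D)
tokens = embed(patches)                            # (1, N, D)

# 3. Self-attention lets every patch attend to every other patch, across space
#    and time alike, which is what enforces long-range temporal coherence.
layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
out = encoder(tokens)
print(out.shape)                                   # torch.Size([1, 128, 96])
```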
2024 - Present
The technology race has entered new dimensions. Models not only pursue longer clip durations but are also beginning to generate synchronized audio and video for more immersive content. Models such as Google Veo 3 integrate native audio generation, marking progress toward complete, immersive multimodal content generation.
Trend: from single-modality generation toward the joint creation of visual, auditory, and other multi-sensory content.
Uses Diffusion Transformer (DiT) architecture, aiming to be a 'world simulator' and setting new industry standards in physical realism, long-term coherence, and multimodal capabilities.
Diffusion Transformer
Its core architecture is the Spatio-Temporal U-Net (STUNet), which generates the entire spatio-temporal volume in one pass, pursuing maximum smoothness and global motion consistency, and it is deeply integrated with Gemini for powerful semantic control.
Spatio-Temporal U-Net
As an industry pioneer, its evolution reflects the shift from 'video transformation' to 'direct creation'. Gen-3 focuses on fine-grained camera control, motion control, and photorealistic human generation.
Multimodal Generation
Known for a user-friendly interface and rapid generation, which have greatly promoted the adoption of AI video. The model excels in efficiency, prompt adherence, and creative effects.
Efficient & User-friendly
Uses a Diffusion Transformer architecture combined with 3D spatio-temporal attention, drawing on the strengths of several approaches to more accurately simulate real-world physics and motion.
Hybrid Architecture
Focuses on high-quality output and a distinctive natural-language editing capability that lets users modify generated video content directly through instructions, improving controllability.
Natural Language Editing
A comprehensive and open video foundation model suite. Its highlights are the ability to run on consumer-grade GPUs and pioneering support for rendering bilingual Chinese-English text inside videos, greatly enhancing its practicality.
Open Source Contribution