StoryMem

Multi-shot Long Video Storytelling with Memory

From "random generation" to "director thinking". Through Explicit Visual Memory Banks and Negative RoPE Shift technology, StoryMem solves the "forgetting" problem in multi-shot storytelling, achieving cinematic cross-shot consistency.

Based on Wan2.2 (14B MoE) Architecture

Core Breakthroughs & Evolution

The "Third Way" of Video Generation Technology

Core Pain Point: Multi-shot "Forgetting"

Existing DiT models treat each generation as an independent random process. When cutting from a wide shot to a close-up, fine-grained features such as hairstyle and clothing texture often mutate, breaking immersion.

Explicit Memory Bank

A third way beyond compute-heavy joint modeling and naive auto-regression: StoryMem maintains a dynamic memory module that is independent of the generation process, preserving narrative context the way human working memory does.

Negative RoPE Shift

Key innovation: by assigning negative temporal indices to memory frames, the attention mechanism becomes "semantically connected but spatiotemporally disconnected". This mimics a film hard cut, retaining identity while resetting motion.
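To make the mechanism concrete, here is a minimal Python sketch of how negative-index assignment could look, assuming a standard 1D temporal RoPE; the function names and the `shift` offset are illustrative assumptions, not the paper's exact implementation:

```python
import torch

def temporal_indices(num_memory: int, num_current: int, shift: int = 1) -> torch.Tensor:
    """Place memory frames at negative positions, the current shot at 0..T-1.

    The rotary phase then marks the two segments as temporally disjoint
    (a hard cut), while attention can still read identity content from
    the memory frames. `shift` (assumed) controls the gap before index 0.
    """
    memory = torch.arange(num_memory) - num_memory - (shift - 1)  # e.g. [-4, -3, -2, -1]
    current = torch.arange(num_current)                           # [0, 1, ..., T-1]
    return torch.cat([memory, current])

def rope_angles(positions: torch.Tensor, dim: int = 64) -> torch.Tensor:
    """Standard RoPE angles: pos / 10000^(2i/dim); negative positions are valid."""
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
    return positions.float()[:, None] * inv_freq[None, :]

pos = temporal_indices(num_memory=4, num_current=16)
cos, sin = rope_angles(pos).cos(), rope_angles(pos).sin()  # used to rotate Q/K in attention
```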

M2V Architecture Deep Dive

StoryMem Pipeline

Memory Bank Construction

Built on Wan2.2-I2V with LoRA fine-tuning. Intelligent filtering keeps only keyframes from the last 10 active shots, mimicking human "working memory" and avoiding VRAM exhaustion; see the sketch below.
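As a rough illustration of the rolling-window idea, here is a minimal sketch assuming keyframes are stored per shot; the class and method names are hypothetical:

```python
from collections import deque

class MemoryBank:
    """Rolling "working memory": keyframes from the most recent N shots."""

    def __init__(self, max_shots: int = 10):
        # older shots fall out automatically once the window is full
        self.shots = deque(maxlen=max_shots)

    def add_shot(self, keyframes: list) -> None:
        """Store the filtered keyframes (e.g. latents) of one finished shot."""
        self.shots.append(keyframes)

    def context(self) -> list:
        """Flatten all retained keyframes to condition the next shot."""
        return [frame for shot in self.shots for frame in shot]
```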

Semantic Keyframe Selection

Uses a CLIP model to compute cosine similarity between candidate frames and the script. Only frames that best reflect the current shot description (e.g., "angry expression") are stored; blurry or mid-blink intermediate frames are discarded (sketch below).
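A minimal sketch of such CLIP-based selection using the Hugging Face transformers API; the checkpoint and top-k value are assumptions, not the paper's settings:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_keyframes(frames, prompt: str, k: int = 4):
    """Keep the k frames whose CLIP embedding best matches the shot prompt."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sims = (img @ txt.T).squeeze(-1)                  # cosine similarity per frame
    top = sims.topk(min(k, len(frames))).indices
    return [frames[i] for i in top.tolist()]
```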

Aesthetic Preference Filtering

Integrates the HPSv3 scoring model as an "AI art director", filtering out collapsed frames in real time so that only high-aesthetic-quality material remains in memory (sketch below).
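The filtering step itself reduces to a threshold test; the sketch below assumes a callable HPSv3-style scorer, since the real HPSv3 interface and a sensible threshold are not specified here:

```python
def aesthetic_filter(frames, score_fn, threshold: float = 0.5):
    """Keep only frames whose aesthetic score clears the bar.

    `score_fn` stands in for an HPSv3-style scorer (hypothetical
    interface); `threshold` is likewise an assumed value.
    """
    return [frame for frame in frames if score_fn(frame) >= threshold]
```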

Latent Space Concatenation

Memory frames are compressed into latents by a 3D VAE, concatenated with the current noisy latents along the channel dimension, and injected through a rank-128 LoRA adapter, as sketched below.
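A minimal sketch of this conditioning step, assuming (B, C, T, H, W) latents and a `vae.encode` call; the exact Wan2.2 layout and the length alignment are assumptions:

```python
import torch

def build_conditioned_input(noise: torch.Tensor, memory_frames: torch.Tensor, vae) -> torch.Tensor:
    """Concatenate 3D-VAE memory latents with the noisy latents on channels."""
    with torch.no_grad():
        mem = vae.encode(memory_frames)               # (B, C, T_mem, H, W), hypothetical call
    t = min(mem.shape[2], noise.shape[2])
    # align memory to the current clip length (crop, then zero-pad the tail)
    mem = torch.nn.functional.pad(mem[:, :, :t], (0, 0, 0, 0, 0, noise.shape[2] - t))
    return torch.cat([noise, mem], dim=1)             # widened channels feed the LoRA-adapted layers
```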

ST-Bench Evaluation

Quantitative assessment over 300 multi-shot narrative prompts

Cross-shot Consistency: 0.5065 (+9.4% over HoloCine)
ViCLIP cosine similarity, quantifying how well protagonist identity is retained across shots.

Aesthetic Quality: 0.6133 (SOTA level)
LAION aesthetic predictor score, showing the memory module does not degrade visual quality.

Prompt Adherence: 0.5337 (high fidelity)
The memory mechanism does not interfere with the model's understanding of new instructions; generated shots respond accurately to script changes.

Script-to-Video Storytelling

JSON-structured script input lets you control shots like a director; a possible JSON layout is sketched after the example below.

Dream of Red Mansions: Daiyu Enters

Atmospheric
Scene 1 - Wide Shot
Cut: True
Early spring, outside the Jia mansion. Lin Daiyu, in a light-green Hanfu and sheer cloak, steps down from the sedan.
Scene 2 - Medium Tracking Shot
Cut: True
Daiyu stops by a stone path lined with plum blossoms. She adjusts her cloak and takes a deep breath, calm and dignified.
Scene 3 - Close-up
Warm indoor light against cold outdoor tones. Close-up on her face and hands, shallow depth of field.
Scene 4 - Interaction
Baoyu approaches, young and handsome. He slows down and bows; Daiyu returns the bow. Light spills through the lattice.
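A hypothetical JSON layout for such a script is shown below; the field names (scene, shot, cut, prompt) are illustrative assumptions, not the paper's exact schema:

```json
[
  {"scene": 1, "shot": "wide", "cut": true,
   "prompt": "Early spring, outside the Jia mansion. Lin Daiyu steps down from the sedan."},
  {"scene": 2, "shot": "medium tracking", "cut": true,
   "prompt": "Daiyu stops by a stone path lined with plum blossoms, calm and dignified."},
  {"scene": 3, "shot": "close-up",
   "prompt": "Warm indoor light against cold outdoor tones; close-up on her face and hands."}
]
```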

Comparison & Positioning

Compared to models like Open-Sora, which focus on physics simulation, StoryMem focuses on cinematic narrative logic.

Base Model (Wan2.2)

Random process: character appearance mutates with shot changes; the model cannot maintain a consistent character image.

StoryMem (Ours)

Memory injection: even under different angles and lighting, structural details (eyes, color palette) remain highly consistent.

vs. Sora / Open-Sora

Sora tends to generate a single continuous long video. StoryMem instead pursues narrative sequences composed of 5-10 s shots, fitting professional film-editing pipelines.

vs. AnimateDiff

AnimateDiff excels at stylized loops but produces morphing transitions during hard cuts. StoryMem addresses this directly via the Negative RoPE Shift.

vs. IP-Adapter

IP-Adapter relies on a single reference image and is prone to collapse under complex dynamics. StoryMem's dynamic memory captures multi-angle features instead.

StoryMem shows that "memory" is key to advanced visual storytelling. Although currently limited by base-model quality (e.g., hand details) and VRAM cost (24 GB recommended), it opens the door to the "one-person film studio".

Future Outlook: introducing an audio memory bank for voice consistency, and exploring infinite-length generation for interactive movies and games with real-time rendering.

📄 Read the Full Paper (arXiv)