OmniShow
An all-in-one model for human-object interaction video generation.
It is currently the first model that supports text + reference image + audio + pose generation in one unified end-to-end framework.
OmniShow, named after its technical report 'OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation', is jointly developed by ByteDance, The Chinese University of Hong Kong, Monash University, and The University of Hong Kong. It is the first end-to-end framework to support full RAP2V in a single model.
The model is designed specifically for human-object interaction video generation (HOIVG) and aligns four modalities (text, reference image, audio, and pose) to produce realistic interaction videos.
It can directly generate continuous shots of up to 10 seconds, reducing the overhead of multi-stage stitching.
OmniShow is built on ByteDance's 12B multimodal diffusion transformer stack for high-fidelity conditional video generation.
Generated with OmniShow
Explore high-quality 9:16 portrait videos generated by OmniShow, tailored for modern e-commerce and social media platforms.
Release timeline, team, and focus
OmniShow was released as a major open research effort in April 2026, with a clear focus on practical human-object interaction generation under multimodal constraints.
Release timing
The technical report (arXiv:2604.11804) was released in mid-April 2026, and the open-source rollout began the same month.
Core contributors
Key authors include Donghao Zhou, Guisheng Liu, and Jiatong Li (project lead), with corresponding authors Shilei Wen and Pheng-Ann Heng.
What it targets
The model targets HOIVG use cases such as e-commerce demos, short-form content generation, avatar motion, and other interaction-heavy video workflows.
Four tasks in one model
A single OmniShow model handles R2V, RA2V, RP2V, and RAP2V in one coherent framework instead of fragmented task-specific pipelines; a minimal sketch of this single-entry-point design follows the task list below.
R2V: Reference-to-Video
Uses reference image(s) plus text to produce high-fidelity appearance and natural human-object interaction.
RA2V: Reference + Audio-to-Video
Adds audio conditioning to keep identity consistent while aligning motion and expression more tightly to speech or sound.
RP2V: Reference + Pose-to-Video
Uses pose trajectories for stronger motion control while preserving realistic object contact and interaction authenticity.
RAP2V: Reference + Audio + Pose-to-Video
Combines text, reference image, audio, and pose for the strongest multimodal control in complex interaction scenarios.
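To make the single-model coverage concrete, here is a minimal sketch of what a unified conditioning interface could look like. The HOIConditions and infer_task names are hypothetical and not the official OmniShow API; the point is simply that one entry point covers all four tasks by including or omitting the optional audio and pose conditions.

```python
# Hypothetical conditioning interface; names and types are illustrative only.
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class HOIConditions:
    """Conditioning bundle for one generation request."""
    text: str                     # prompt text, always required
    reference: Any                # reference image(s), required for all four tasks
    audio: Optional[Any] = None   # audio track; present -> RA2V / RAP2V
    pose: Optional[Any] = None    # pose trajectory; present -> RP2V / RAP2V

def infer_task(cond: HOIConditions) -> str:
    """Map the supplied optional conditions to the corresponding task name."""
    if cond.audio is not None and cond.pose is not None:
        return "RAP2V"
    if cond.audio is not None:
        return "RA2V"
    if cond.pose is not None:
        return "RP2V"
    return "R2V"

# The same model instance would serve all four request types below.
print(infer_task(HOIConditions("a chef plating a dish", "chef.png")))                        # R2V
print(infer_task(HOIConditions("a chef plating a dish", "chef.png", audio="speech.wav")))    # RA2V
print(infer_task(HOIConditions("a chef plating a dish", "chef.png", pose="poses.npz")))      # RP2V
print(infer_task(HOIConditions("a chef plating a dish", "chef.png", "speech.wav", "poses.npz")))  # RAP2V
```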
Three key design choices
OmniShow addresses condition fusion, audio-video synchronization, and the use of heterogeneous training data with three coordinated design strategies.
Unified Channel-wise Conditioning
Injects reference and pose cues through channel-wise pseudo-frame concatenation and reference reconstruction supervision to keep detail fidelity and control strength in balance.
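A rough sketch of the channel-wise pseudo-frame idea follows, assuming video latents shaped (B, C, T, H, W). The exact channel layout, encoders, and precise form of the reference reconstruction supervision are not specified here; the MSE term below is only one plausible formulation.

```python
# Illustrative sketch, not OmniShow's exact implementation.
import torch
import torch.nn.functional as F

def build_conditioned_input(noisy_latents, ref_latent, pose_latents):
    """
    noisy_latents: (B, C, T, H, W) noised video latents fed to the diffusion model
    ref_latent:    (B, C, 1, H, W) encoded reference image, broadcast over time
    pose_latents:  (B, C, T, H, W) encoded pose "pseudo-frames", one per video frame
    Returns latents with reference and pose stacked along the channel axis.
    """
    T = noisy_latents.shape[2]
    ref_rep = ref_latent.expand(-1, -1, T, -1, -1)                     # repeat reference over time
    return torch.cat([noisy_latents, ref_rep, pose_latents], dim=1)    # (B, 3C, T, H, W)

def reference_reconstruction_loss(pred_frames, ref_frame):
    """One plausible form of reference reconstruction supervision (assumed here):
    ask the model to reproduce the reference frame so appearance detail is preserved
    while the conditioning signal stays strong."""
    return F.mse_loss(pred_frames[:, :, 0], ref_frame)   # compare first frame to reference
```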
Gated Local-Context Attention
Injects audio with masked local-context attention and adaptive gates, enabling precise sync while reducing multimodal feature conflict.
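The sketch below illustrates one way such gated local-context audio attention could be wired, assuming per-frame video tokens and audio tokens of matching length. The window size, gating parameterization, and module name are assumptions for illustration, not the paper's exact design.

```python
# Illustrative module; shapes and hyperparameters are assumed, not taken from the paper.
import torch
import torch.nn as nn

class GatedLocalAudioAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, window: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(dim))   # zero-init gate: audio injection starts closed
        self.window = window                         # audio frames each video frame may attend to

    def forward(self, video_tokens, audio_tokens):
        # video_tokens: (B, Tv, dim), audio_tokens: (B, Ta, dim)
        Tv, Ta = video_tokens.shape[1], audio_tokens.shape[1]
        # Local mask: a video frame only attends to audio within +/- window frames.
        idx_v = torch.arange(Tv).unsqueeze(1)
        idx_a = torch.arange(Ta).unsqueeze(0)
        mask = ((idx_v - idx_a).abs() > self.window).to(video_tokens.device)  # True = masked out
        out, _ = self.attn(video_tokens, audio_tokens, audio_tokens, attn_mask=mask)
        # Adaptive gate scales how much audio context flows back into the video stream.
        return video_tokens + torch.tanh(self.gate) * out
```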
Decoupled-Then-Joint Training
Trains R2V and A2V specialists first, then fuses their weights and jointly fine-tunes the merged model to unify multimodal capabilities despite scarce fully paired data.
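The snippet below sketches the fusion step under the assumption that the two specialists share an identical architecture; linear interpolation of matching weights is one plausible fusion rule, not necessarily the one OmniShow uses.

```python
# Illustrative weight-fusion sketch; checkpoint names and the fusion rule are assumptions.
import torch

def fuse_state_dicts(r2v_ckpt: dict, a2v_ckpt: dict, alpha: float = 0.5) -> dict:
    """Interpolate matching parameters from the R2V and A2V specialist checkpoints."""
    fused = {}
    for name, w_r2v in r2v_ckpt.items():
        if name in a2v_ckpt and a2v_ckpt[name].shape == w_r2v.shape:
            fused[name] = alpha * w_r2v + (1.0 - alpha) * a2v_ckpt[name]
        else:
            fused[name] = w_r2v   # keep parameters unique to one specialist unchanged
    return fused

# After fusion, the unified model would be jointly fine-tuned on the scarcer
# fully paired text + reference + audio + pose data, e.g.:
# model.load_state_dict(fuse_state_dicts(torch.load("r2v.pt"), torch.load("a2v.pt")))
```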
HOIVG-Bench and practical quality
On HOIVG-Bench (135 curated samples), OmniShow reports strong SOTA-level results across multiple tasks and is the only model covering full RAP2V.
Benchmark scope
The benchmark evaluates text, human/object references, audio, and pose conditions using dedicated multimodal HOIVG protocols.
Metric coverage
Reported metrics include TA, FaceSim, NexusScore, AES, IQA, VQ, MQ, Sync-C, Sync-D, AKD, and PCK to measure fidelity, realism, and alignment.
Qualitative outcomes
Compared with HunyuanCustom, HuMo-17B, VACE, Phantom-14B, and AnchorCrafter, OmniShow shows stronger multimodal alignment and more stable human-object interaction.
Official links and latest status
The project page already provides rich demos. The repository indicates code is under internal review, with more complete release assets expected later.
Project Page
Gallery and side-by-side demos for R2V, RA2V, RP2V, and RAP2V.
GitHub Repository
Official repository and update feed. Code availability is still under internal review.
Paper PDF
Technical report: OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation.
HOIVG-Bench Dataset
Benchmark dataset for multimodal HOIVG evaluation with aligned text, reference, audio, and pose fields.
Where it can be used
OmniShow is designed for scenarios that require stable identity, realistic object contact, and multimodal controllability in one generation pipeline.
E-commerce and short video
Generates product demo videos with hand-object interaction and presenter motion without full studio shooting.
Content creation
Supports audio-driven talking or singing avatars, with pose guidance for controllable body movement.
Creative interaction
Enables object swapping, remixing, and richer multimodal storytelling for entertainment content.
Education and presentation
Useful for instructional demos, virtual explainers, and scenarios requiring precise human-object interaction.
Why this project matters
OmniShow is a notable open effort in AI video generation because it directly tackles multimodal unification, physical realism, and data-scarce training for HOIVG. If the open-source rollout continues, it could lower the production cost of interaction-heavy video creation.