April 2026 Open-Source Highlight

OmniShow

An all-in-one model for human-object interaction video generation.

OmniShow, introduced in the technical report 'OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation', is jointly developed by ByteDance, The Chinese University of Hong Kong, Monash University, and The University of Hong Kong. It is the first end-to-end framework to support full RAP2V (reference + audio + pose-to-video) generation in a single model.

Framework Status
First Full RAP2V

It is the first model to support text + reference image + audio + pose generation in one unified end-to-end framework.

Unified Inputs
Text + Ref + Audio + Pose

The model is designed specifically for human-object interaction video generation (HOIVG) and aligns the four modalities to produce realistic interaction videos.

Native Shot Length
Up to 10s

It can directly generate continuous long shots of up to 10 seconds, reducing the overhead of multi-stage stitching.

Base Backbone
12B Waver 1.0 (MMDiT)

OmniShow is built on ByteDance's 12B multimodal diffusion transformer stack for high-fidelity conditional video generation.

Background

Release timeline, team, and focus

OmniShow was released as a major open research effort in April 2026, with a clear focus on practical human-object interaction generation under multimodal constraints.

Release timing

The technical report (arXiv:2604.11804) was released in mid-April 2026, with the open-source rollout beginning the same month.

Core contributors

Key authors include Donghao Zhou, Guisheng Liu, and Jiatong Li (project lead), with corresponding authors Shilei Wen and Pheng-Ann Heng.

What it targets

The model targets HOIVG use cases such as e-commerce demos, short-form content generation, avatar motion, and other interaction-heavy video workflows.

Generation Modes

Four tasks in one model

A single OmniShow model handles R2V, RA2V, RP2V, and RAP2V in one coherent framework instead of fragmented task-specific pipelines; a minimal sketch of how the four modes map onto one entry point follows the mode descriptions below.

R2V: Reference-to-Video

Uses reference image(s) plus text to produce high-fidelity appearance and natural human-object interaction.

RA2V: Reference + Audio-to-Video

Adds audio conditioning to keep identity consistent while aligning motion and expression more tightly to speech or sound.

RP2V: Reference + Pose-to-Video

Uses pose trajectories for stronger motion control while preserving realistic object contact and interaction authenticity.

RAP2V: Ref + Audio + Pose-to-Video

Combines text, reference image, audio, and pose for the strongest multimodal control in complex interaction scenarios.
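
To make the four modes concrete, here is a minimal Python sketch of how one entry point can serve all of them: whichever optional conditions are present determine the task. All names below (Conditions, select_mode, the string fields) are illustrative placeholders, not OmniShow's actual API.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical condition bundle; field names are illustrative, not the real API.
@dataclass
class Conditions:
    text: str                       # prompt, required in every mode
    reference: str                  # path to reference image, required ("R")
    audio: Optional[str] = None     # audio file present -> RA2V / RAP2V
    pose: Optional[str] = None      # pose sequence present -> RP2V / RAP2V

def select_mode(c: Conditions) -> str:
    """Map the optional conditions that are present to the task name."""
    if c.audio and c.pose:
        return "RAP2V"
    if c.audio:
        return "RA2V"
    if c.pose:
        return "RP2V"
    return "R2V"

# Reference + text only -> "R2V"; adding audio and pose would yield "RAP2V".
print(select_mode(Conditions(text="a chef slicing bread", reference="chef.png")))
```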

Technical Innovations

Three key design choices

OmniShow addresses condition fusion, audio-video synchronization, and the use of heterogeneous training data through three coordinated design strategies.

Unified Channel-wise Conditioning

Injects reference and pose cues through channel-wise pseudo-frame concatenation and reference reconstruction supervision to keep detail fidelity and control strength in balance.
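
A minimal PyTorch sketch of this idea follows, assuming a [B, C, T, H, W] latent layout; it illustrates the concatenation pattern only, not the released code.

```python
import torch

# Assumed latent layout: [B, C, T, H, W] = (batch, channels, frames, height,
# width). The paper's exact shapes and padding scheme are not reproduced here.
B, C, T, H, W = 1, 16, 8, 32, 32
video_latent = torch.randn(B, C, T, H, W)      # noisy video latents
ref_latent   = torch.randn(B, C, 1, H, W)      # encoded reference image
pose_latent  = torch.randn(B, C, T, H, W)      # encoded pose frames

# Reference as a pseudo-frame: prepend along the time axis, so the backbone
# can also be supervised to reconstruct it (reference reconstruction).
x = torch.cat([ref_latent, video_latent], dim=2)           # [B, C, T+1, H, W]

# Pose as channel-wise condition: pad to T+1 frames, then concatenate along
# the channel axis so each frame sees its pose cue directly.
pose_padded = torch.cat([torch.zeros_like(ref_latent), pose_latent], dim=2)
x = torch.cat([x, pose_padded], dim=1)                      # [B, 2C, T+1, H, W]
```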

Gated Local-Context Attention

Injects audio with masked local-context attention and adaptive gates, enabling precise sync while reducing multimodal feature conflict.
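
Below is one plausible reading of this mechanism as a PyTorch module: attention masked to a local temporal window, plus a zero-initialized gate on the residual update. The class and parameter names are assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

class GatedLocalAudioAttention(nn.Module):
    """Illustrative sketch (not the official code): each video-frame token
    attends only to audio tokens inside a local temporal window, and the
    result is blended in through a zero-initialized per-channel gate."""

    def __init__(self, dim: int, window: int = 2):
        super().__init__()
        self.window = window
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.gate = nn.Parameter(torch.zeros(dim))  # gate starts closed

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # video: [B, T, D] frame tokens; audio: [B, T, D] audio tokens,
        # assumed temporally aligned one-to-one for simplicity.
        B, T, D = video.shape
        q = self.q(video)
        k, v = self.kv(audio).chunk(2, dim=-1)

        attn = q @ k.transpose(-2, -1) / D ** 0.5          # [B, T, T]
        # Local-context mask: frame i may only attend to steps |i - j| <= window.
        idx = torch.arange(T, device=video.device)
        local = (idx[:, None] - idx[None, :]).abs() <= self.window
        attn = attn.masked_fill(~local, float("-inf"))
        out = attn.softmax(dim=-1) @ v                     # [B, T, D]

        # Adaptive gating: residual update scaled per channel by tanh(gate),
        # zero at init, so audio influence grows only as training opens it.
        return video + torch.tanh(self.gate) * out
```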

Decoupled-Then-Joint Training

Trains R2V and A2V specialists first, then fuses their weights and jointly fine-tunes the merged model to unify multimodal capabilities despite scarce fully paired data.
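
A rough sketch of the fusion step, assuming simple linear interpolation of checkpoint weights (the paper's actual merging rule may differ):

```python
import torch

def fuse_specialists(r2v_state: dict, a2v_state: dict, alpha: float = 0.5) -> dict:
    """Merge two specialist checkpoints by linear interpolation of weights.
    Illustrative only; shows the decoupled-then-joint recipe, not the paper's
    exact fusion rule."""
    return {name: alpha * w + (1 - alpha) * a2v_state[name]
            for name, w in r2v_state.items()}

# Stage 1: train an R2V specialist and an A2V specialist separately, each on
#          its own (more abundant) single-condition data.
# Stage 2: fuse the weights, then jointly fine-tune the merged model on the
#          scarce fully paired text + reference + audio + pose data:
# model.load_state_dict(fuse_specialists(r2v_ckpt, a2v_ckpt, alpha=0.5))
```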

Performance

HOIVG-Bench and practical quality

On HOIVG-Bench (135 curated samples), OmniShow reports state-of-the-art results across multiple tasks and is the only model covering the full RAP2V setting.

Benchmark scope

The benchmark evaluates text, human/object references, audio, and pose conditions using dedicated multimodal HOIVG protocols.

Metric coverage

Reported metrics include TA, FaceSim, NexusScore, AES, IQA, VQ, MQ, Sync-C, Sync-D, AKD, and PCK, covering fidelity, realism, audio-visual synchronization, and pose alignment.
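
As a concrete example from this list, PCK (Percentage of Correct Keypoints) is a standard pose-accuracy metric; a minimal sketch follows, with the threshold convention chosen for illustration since the benchmark's exact normalization is not stated here.

```python
import torch

def pck(pred: torch.Tensor, gt: torch.Tensor, threshold: float = 0.05) -> float:
    """Percentage of Correct Keypoints: the fraction of predicted keypoints
    lying within `threshold` of the ground truth. Coordinates are assumed
    normalized to [0, 1]; normalization conventions vary across benchmarks.

    pred, gt: [N, K, 2] tensors of K keypoints over N frames."""
    dist = (pred - gt).norm(dim=-1)            # [N, K] Euclidean distances
    return (dist <= threshold).float().mean().item()
```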

Qualitative outcomes

Compared with HunyuanCustom, HuMo-17B, VACE, Phantom-14B, and AnchorCrafter, OmniShow shows stronger multimodal alignment and more stable human-object interaction.

Resources

Official links and latest status

The project page already provides rich demos. The repository indicates the code is under internal review, with more complete release assets expected later.

Project Page

Gallery and side-by-side demos for R2V, RA2V, RP2V, and RAP2V.

GitHub Repository

Official repository and update feed. The code is still under internal review.

Paper PDF

Technical report: 'OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation'.

HOIVG-Bench Dataset

Benchmark dataset for multimodal HOIVG evaluation with aligned text, reference, audio, and pose fields.

Applications

Where it can be used

OmniShow is designed for scenarios that require stable identity, realistic object contact, and multimodal controllability in one generation pipeline.

E-commerce and short video

Generates product demo videos with hand-object interaction and presenter motion without full studio shooting.

Content creation

Supports audio-driven talking or singing avatars, with pose guidance for controllable body movement.

Creative interaction

Enables object swapping, remixing, and richer multimodal storytelling for entertainment content.

Education and presentation

Useful for instructional demos, virtual explainers, and scenarios requiring precise human-object interaction.

Why this project matters

OmniShow is a notable open effort in AI video generation because it directly tackles multimodal unification, physical realism, and data-scarce training for HOIVG. If the open-source rollout continues as planned, it could meaningfully lower the production cost of interaction-heavy video creation.
