April 2026 Open-Source Breakout

HappyHorse-1.0

A new open video model that vaulted to the top of the leaderboard almost overnight.

Also written as Happy Horse 1.0, HappyHorse-1.0 is a 15B multimodal text/image-to-video model with native audio generation, strong portrait quality, and a product direction centered on real user preference rather than lab-only metrics.

Arena Rank
#1 without audio, #2 with audio

Strong performance in Artificial Analysis Video Arena, ahead of several mainstream closed models.

Core Model
15B single-stream Transformer

40 layers with modality-specific projections at both ends and a shared middle stack.

Generation Speed
5s of 256p video in about 2s

Distilled with DMD-2 to run in 8 denoising steps, with fast audio-video synthesis.

Release Style
Anonymous climb, then reveal

The project surfaced on rankings first and was identified by the community before the full technical drop.

Background

Team, lineage, and product intent

HappyHorse-1.0 is presented as a pragmatic open model effort tied to Alibaba's Taotian ecosystem, with a clear bias toward ecommerce, short-form video, and digital human use cases.

Core team

Led by Zhang Di at Taotian Group Future Life Lab. The lab is described as an evolution of the former ATH-AI innovation unit, with rapid paper output and a focus on multimodal production systems.

Collaborators and predecessor

The project is associated with Sand.ai and GAIR Lab in Shanghai, and iterates on the open daVinci-MagiHuman line released in March 2026.

Why it exists

The stated goal is to optimize for real viewer preference, validate the ceiling of open models, and support later commercial workflows rather than only benchmark demos.

Architecture

15B unified multimodal stack

The model uses a single-stream self-attention design instead of a split cross-attention pipeline, aiming to simplify conditioning and stabilize multimodal sequence modeling.

40-layer single-stream Transformer

A pure self-attention backbone, with no explicit cross-attention block, is used to model text, video, and audio in one token sequence.
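
As a rough illustration, the sketch below packs text, video, and audio tokens into one sequence and runs a single shared self-attention block over it, with no cross-attention module anywhere. The hidden size, sequence lengths, and layer details are assumptions for illustration, not values from the release.

```python
import torch
import torch.nn as nn

class JointBlock(nn.Module):
    """One shared self-attention block: all modalities attend to each other
    inside a single packed sequence, so no cross-attention block is needed."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

# Pack text, video, and audio tokens into one sequence (order and lengths are illustrative).
text, video, audio = torch.randn(1, 77, 512), torch.randn(1, 1024, 512), torch.randn(1, 256, 512)
tokens = torch.cat([text, video, audio], dim=1)   # (batch, 77 + 1024 + 256, dim)
out = JointBlock()(tokens)                        # joint self-attention, no cross-attention
```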

Sandwich modality layout

The first 4 and last 4 layers are modality-specific projections for text, video, and audio, while the middle 32 layers are shared.
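
A structural sketch of that 4 + 32 + 4 split might look like the following, with per-modality input and output stacks wrapped around one shared trunk. The standard Transformer layers and the small hidden size used here are stand-ins for the model's actual blocks, not a description of them.

```python
import torch
import torch.nn as nn

def stack(n: int, dim: int = 512, heads: int = 8) -> nn.Sequential:
    # A stack of standard pre-norm Transformer layers (stand-in for the real blocks).
    return nn.Sequential(*[
        nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim,
                                   batch_first=True, norm_first=True)
        for _ in range(n)
    ])

class SandwichModel(nn.Module):
    """4 modality-specific layers in, 32 shared layers, 4 modality-specific layers out."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.inp = nn.ModuleDict({m: stack(4, dim) for m in ("text", "video", "audio")})
        self.shared = stack(32, dim)
        self.out = nn.ModuleDict({m: stack(4, dim) for m in ("text", "video", "audio")})

    def forward(self, parts: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        # Per-modality input layers, then one packed sequence through the shared trunk.
        encoded = {m: self.inp[m](x) for m, x in parts.items()}
        lengths = {m: x.shape[1] for m, x in encoded.items()}
        h = self.shared(torch.cat(list(encoded.values()), dim=1))
        # Split the sequence back apart and apply per-modality output layers.
        outs, start = {}, 0
        for m, n in lengths.items():
            outs[m] = self.out[m](h[:, start:start + n])
            start += n
        return outs
```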

Fast inference path

Key efficiency pieces include timestep-free denoising state inference, per-head gating, DMD-2 distillation to 8 steps, and MagiCompiler for roughly 1.2x end-to-end acceleration.
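
For intuition, a generic few-step sampling loop with a distilled denoiser looks roughly like the sketch below. The sigma schedule, the re-noising rule, and the `denoiser` interface are illustrative assumptions, not the release's actual DMD-2 sampler.

```python
import torch

@torch.no_grad()
def sample_8_steps(denoiser, shape, device="cpu"):
    """Generic few-step sampling loop for a distilled denoiser (8 fixed steps).

    `denoiser(x_t, sigma)` is assumed to return a prediction of the clean latent;
    the linear sigma schedule and re-noising rule below are illustrative only.
    """
    sigmas = torch.linspace(1.0, 0.0, 9, device=device)   # 8 intervals -> 8 denoiser calls
    x = torch.randn(shape, device=device) * sigmas[0]
    for i in range(8):
        x0 = denoiser(x, sigmas[i])                        # one-shot clean estimate
        if sigmas[i + 1] > 0:
            # Re-noise the estimate down to the next, lower noise level.
            x = x0 + sigmas[i + 1] * torch.randn_like(x0)
        else:
            x = x0
    return x
```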

Capabilities

What makes HappyHorse-1.0 stand out

The strongest public reactions focus on synchronized audio-video generation, lip sync quality, portrait realism, and coherent multi-shot outputs.

Text-to-video and image-to-video

Supports prompt-only generation as well as reference image or latent-driven conditioning, with 5 to 12 second clips and multiple aspect ratios.
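
A hypothetical client call could look like the following. The `happyhorse` module, the `HappyHorsePipeline` class, and every parameter name are invented for illustration and are not the released API.

```python
# Hypothetical usage only: module, class, and parameter names are assumptions.
from happyhorse import HappyHorsePipeline

pipe = HappyHorsePipeline.from_pretrained("HappyHorse-1.0")

# Text-to-video: prompt-only generation.
clip = pipe(
    prompt="a rider brushing a horse at golden hour, gentle ambient sound",
    duration_s=5,            # clips reportedly range from 5 to 12 seconds
    aspect_ratio="9:16",
    audio=True,              # native audio generated in the same pass
)

# Image-to-video: the same call conditioned on a reference frame.
clip = pipe(
    prompt="the same scene, slow dolly-in",
    reference_image="rider.png",
    duration_s=8,
    aspect_ratio="16:9",
)
```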

Native audio generation

Dialogue, ambient sound, and Foley are generated in the same pipeline, reducing the need for post-produced dubbing.

Multi-shot storytelling

A single prompt can drive scene transitions, shot changes, and character continuity across face, clothing, and body shape, with style control hooks such as LoRA presets.
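
For the LoRA-style hooks, the usual mechanism is to fold a low-rank delta into a projection weight. The sketch below shows that merge in generic PyTorch; treating the style presets this way is an assumption rather than a documented mechanism of the release.

```python
import torch
import torch.nn as nn

def merge_lora(linear: nn.Linear, lora_A: torch.Tensor, lora_B: torch.Tensor, scale: float = 1.0):
    """Fold a low-rank style adapter into a linear layer in place.

    lora_A: (rank, in_features), lora_B: (out_features, rank). The update
    W <- W + scale * (B @ A) is the standard LoRA merge; applying style
    presets this way is an illustrative assumption.
    """
    with torch.no_grad():
        linear.weight += scale * (lora_B @ lora_A)

# Example: a rank-8 adapter on a 2048-dim projection (shapes are illustrative).
proj = nn.Linear(2048, 2048)
A, B = torch.randn(8, 2048) * 0.01, torch.randn(2048, 8) * 0.01
merge_lora(proj, A, B, scale=0.8)
```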

Multilingual lip sync

Public materials mention native lip-sync support for 7 languages: Mandarin, Cantonese, English, Japanese, Korean, German, and French.

Evaluation

Leaderboard momentum and measured strengths

Public discussion around HappyHorse-1.0 is driven by both ranking results and qualitative reactions from blind tests.

Artificial Analysis Video Arena

Reported as rank #1 for text-to-video without audio, rank #2 with audio, and rank #1 for image-to-video without audio, ahead of models such as Seedance 2.0, Kling 2.1, Ovi 1.1, and LTX 2.3.

Human preference

Blind voting reports a strong win rate versus Ovi 1.1 and LTX 2.3, reinforcing that the model performs well under user-facing comparisons rather than only internal metrics.

Objective indicators

Public comparisons emphasize visual quality, text alignment, physical consistency, and especially a much lower lip-sync word error rate than several competitors.

Known caveats

Portrait and single-subject videos appear especially strong, while more complex multi-character or high-chaos scenes are still described as a weaker area.

Access

How people are trying it

The model is positioned as both a cloud-first demo experience and an open self-hosted stack once the full repository lands.

Cloud demos

Public-facing pages such as happyhorse.video and happy-horse.art are presented as browser-based entry points with text/image input, HD export, and API-style integration.

Local deployment

The open release is expected to include the base model, distilled model, super-resolution module, and inference code. H100-class GPUs are the recommended starting point, with community quantization expected to lower the barrier.
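
Until official quantized checkpoints exist, community efforts typically start with weight-only int8 along these lines. The helper below is a generic memory-saving sketch in plain PyTorch, not part of the expected release artifacts.

```python
import torch
import torch.nn as nn

def quantize_linear_int8(linear: nn.Linear) -> tuple[torch.Tensor, torch.Tensor]:
    """Weight-only int8 quantization for one linear layer (per-output-channel scales).

    Generic sketch of what community quantization tends to do; not a description
    of any official HappyHorse-1.0 tooling.
    """
    w = linear.weight.data                       # (out_features, in_features)
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale                              # dequantize on the fly: w ~ q.float() * scale

# Roughly halves (vs fp16) or quarters (vs fp32) the weight memory of the 15B stack,
# which is what makes sub-H100 cards plausible once community tooling lands.
layer = nn.Linear(4096, 4096)
q_w, q_scale = quantize_linear_int8(layer)
approx = q_w.float() * q_scale                   # dequantized weights for matmul
```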

Next expected drop

The near-term roadmap mentions a technical report, watermark or provenance tooling, and auditing mechanisms, with broader community adaptation already starting.

Why this matters

HappyHorse-1.0 is notable because it pushes an open model into direct competition with top closed systems in a user-preference arena. If that momentum holds, it will pressure pricing, speed up fine-tuning and quantization work, and make vertical video production stacks cheaper to build.

© 2026 wan2.video