HappyHorse-1.0
A new open video model that vaulted to the top of the leaderboard almost overnight, with strong performance in the Artificial Analysis Video Arena ahead of several mainstream closed models.
Also written as Happy Horse 1.0, HappyHorse-1.0 is a 15B-parameter multimodal text- and image-to-video model with native audio generation, strong portrait quality, and a product direction centered on real user preference rather than lab-only metrics.
40 layers with modality-specific projections at both ends and a shared middle stack.
Distilled with DMD-2 to run in 8 denoising steps, with fast audio-video synthesis.
The project surfaced on public rankings first and was identified by the community before the full technical release.
Team, lineage, and product intent
HappyHorse-1.0 is presented as a pragmatic open model effort tied to Alibaba's Taotian ecosystem, with a clear bias toward ecommerce, short-form video, and digital human use cases.
Core team
Led by Zhang Di at Taotian Group Future Life Lab. The lab is described as an evolution of the former ATH-AI innovation unit, with rapid paper output and a focus on multimodal production systems.
Collaborators and predecessor
The project is associated with Sand.ai and GAIR Lab in Shanghai, and iterates on the open daVinci-MagiHuman line released in March 2026.
Why it exists
The stated goal is to optimize for real viewer preference, validate the ceiling of open models, and support later commercial workflows rather than only benchmark demos.
15B unified multimodal stack
The model uses a single-stream self-attention design instead of a split cross-attention pipeline, aiming to simplify conditioning and stabilize multimodal sequence modeling.
40-layer single-stream Transformer
A pure self-attention backbone, with no explicit cross-attention block, is used to model text, video, and audio in one token sequence.
Sandwich modality layout
The first 4 and last 4 layers are modality-specific projections for text, video, and audio, while the middle 32 layers are shared.
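The sandwich layout above can be sketched in a few lines of plain Python. This is an illustrative stand-in, not the real network: the layer counts (4 + 32 + 4 = 40) come from the description, but each "layer" here is a toy callable that merely tags its input, where the real model would run self-attention and MLP blocks over the shared token sequence.

```python
# Illustrative "sandwich" layout: modality-specific stacks at both ends,
# one shared middle stack. Layer counts follow the write-up; the toy
# layers just wrap their tokens in a label so the routing is visible.

MODALITIES = ("text", "video", "audio")

def make_stack(n_layers, tag):
    # Each "layer" is a placeholder callable, not a real Transformer block.
    return [lambda toks, tag=tag, i=i: [f"{tag}{i}({t})" for t in toks]
            for i in range(n_layers)]

# First 4 and last 4 layers are modality-specific projections...
entry = {m: make_stack(4, f"{m}_in") for m in MODALITIES}
exit_ = {m: make_stack(4, f"{m}_out") for m in MODALITIES}
# ...while the middle 32 layers are shared across all modalities.
shared = make_stack(32, "shared")

def forward(tokens, modality):
    # Route through: modality entry -> shared middle -> modality exit.
    for layer in entry[modality] + shared + exit_[modality]:
        tokens = layer(tokens)
    return tokens

out = forward(["tok"], "video")
```

Every modality passes through the same 32 shared layers, so cross-modal structure is learned once, while the thin entry and exit stacks adapt each modality's tokens in and out of that shared space.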
Fast inference path
Key efficiency pieces include timestep-free denoising state inference, per-head gating, DMD-2 distillation to 8 steps, and MagiCompiler for roughly 1.2x end-to-end acceleration.
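A fixed few-step sampler of the kind DMD-2 distillation produces can be sketched as follows. The `denoise` function below is a toy stand-in for the distilled network (the real model infers its denoising state without an explicit timestep input, per the "timestep-free" description above); only the 8-step count comes from the write-up.

```python
# Minimal sketch of a fixed 8-step denoising loop, standing in for the
# DMD-2-distilled sampler. No scheduler or timestep embedding: the loop
# just calls the (toy) denoiser a fixed number of times.

NUM_STEPS = 8  # distilled step count from the write-up

def denoise(latent, target=0.0):
    # Toy "network": pull the latent halfway toward the clean target.
    return [(x + target) / 2 for x in latent]

def sample(noise):
    latent = noise
    for _ in range(NUM_STEPS):  # 8 fixed steps, no adaptive schedule
        latent = denoise(latent)
    return latent

out = sample([1.0, -4.0, 8.0])
# After 8 halvings each value is scaled by 1/256.
```

The practical point is that a distilled model replaces a long, scheduled diffusion trajectory with a short fixed loop, which is where most of the wall-clock speedup comes from before compiler-level gains like the cited 1.2x from MagiCompiler.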
What makes HappyHorse-1.0 stand out
The strongest public reactions focus on synchronized audio-video generation, lip sync quality, portrait realism, and coherent multi-shot outputs.
Text-to-video and image-to-video
Supports prompt-only generation as well as reference image or latent-driven conditioning, with 5 to 12 second clips and multiple aspect ratios.
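The generation constraints stated above can be captured in a small validation helper. Everything beyond the 5-to-12-second range is an assumption for illustration: the parameter names, the aspect-ratio set, and the function itself are hypothetical, not a documented HappyHorse API.

```python
# Hypothetical request validator for the stated constraints: prompt-only
# or image-conditioned generation, 5-12 second clips, several aspect
# ratios. The aspect list and field names are assumed, not confirmed.

ALLOWED_ASPECTS = {"16:9", "9:16", "1:1"}  # assumed set for illustration

def validate_request(prompt=None, image=None, seconds=5, aspect="16:9"):
    if not prompt and image is None:
        raise ValueError("need a text prompt and/or a reference image")
    if not 5 <= seconds <= 12:
        raise ValueError("clip length must be 5-12 seconds")
    if aspect not in ALLOWED_ASPECTS:
        raise ValueError(f"unsupported aspect ratio: {aspect}")
    mode = "i2v" if image is not None else "t2v"
    return {"mode": mode, "seconds": seconds, "aspect": aspect}

req = validate_request(prompt="a horse on a beach", seconds=8, aspect="9:16")
```

A prompt alone selects text-to-video, while supplying a reference image switches the same entry point to image-to-video, mirroring the unified conditioning described above.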
Native audio generation
Dialogue, ambient sound, and Foley are generated in the same pipeline, reducing the need for post-produced dubbing.
Multi-shot storytelling
A single prompt can drive scene transitions, shot changes, and character continuity across face, clothing, and body shape, with style control hooks such as LoRA presets.
Multilingual lip sync
Public materials mention native support for 7 languages: Mandarin, Cantonese, English, Japanese, Korean, German, and French.
Leaderboard momentum and measured strengths
Public discussion around HappyHorse-1.0 is driven by both ranking results and qualitative reactions from blind tests.
Artificial Analysis Video Arena
Reported as rank #1 for text-to-video without audio, rank #2 with audio, and rank #1 for image-to-video without audio, surpassing models like Seedance 2.0, Kling 2.1, Ovi 1.1, and LTX 2.3.
Human preference
Blind voting reports a strong win rate versus Ovi 1.1 and LTX 2.3, reinforcing that the model performs well under user-facing comparisons rather than only internal metrics.
Objective indicators
Public comparisons emphasize visual quality, text alignment, physical consistency, and especially a much lower lip-sync word error rate than several competitors.
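Lip-sync quality is typically scored by transcribing the generated audio with ASR and computing word error rate (WER) against the intended script. The snippet below is the standard WER formula (word-level Levenshtein distance divided by reference length), shown for clarity; it is not HappyHorse's own evaluation code.

```python
# Standard word error rate: edit distance over words, normalized by the
# reference length. Lower is better; 0.0 means a perfect transcript.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution/match
    return dp[-1][-1] / max(len(ref), 1)

score = wer("happy horse runs fast", "happy horse ran fast")
# One substitution over four reference words -> 0.25.
```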
Known caveats
Portrait and single-subject videos appear especially strong, while more complex multi-character or high-chaos scenes are still described as a weaker area.
How people are trying it
The model is positioned as both a cloud-first demo experience and an open self-hosted stack once the full repository lands.
Cloud demos
Public-facing pages such as happyhorse.video and happy-horse.art are presented as browser-based entry points with text/image input, HD export, and API-style integration.
Local deployment
The open release is expected to include base model, distilled model, super-resolution module, and inference code. H100-class GPUs are recommended first, with community quantization expected to lower the barrier.
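A back-of-envelope weight-memory estimate shows why 80 GB H100-class cards are the first recommendation and why quantization lowers the barrier. Only the 15B parameter count comes from the model description; the byte widths are standard precisions, and activation and KV-cache overhead is deliberately ignored here.

```python
# Rough weight-memory estimate for a 15B-parameter model at common
# precisions. Ignores activations, KV cache, and framework overhead,
# so real requirements are higher than these figures.

PARAMS = 15e9  # 15B parameters, per the model description

def weight_gb(bytes_per_param: float) -> float:
    return PARAMS * bytes_per_param / 1024**3

for name, width in [("bf16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{name}: ~{weight_gb(width):.0f} GB of weights")
```

At bf16 the weights alone are roughly 28 GB, comfortable on an 80 GB card once activations are added; int8 and int4 quantization cut that to roughly 14 GB and 7 GB, which is what brings consumer GPUs into reach.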
Next expected drop
The near-term roadmap mentions a technical report, watermark or provenance tooling, and auditing mechanisms, with broader community adaptation already starting.
Why this matters
HappyHorse-1.0 is notable because it pushes an open model into direct competition with top closed systems in a user-preference arena. If that momentum holds, it will pressure pricing, speed up fine-tuning and quantization work, and make vertical video production stacks cheaper to build.