Generate cinematic 1080p videos with synchronized audio — directly in your browser. No GPU. No installation. No editing skills required. Powered by Happy Horse 1.0, the world's first open-source SOTA AI video model with native joint audio-video generation.
Try Free • What is HappyHor.se • Features • Architecture • Benchmarks • Use Cases • API • FAQ
🌐 happyhor.se · 简体中文: happyhor.se/zh/
HappyHor.se is a web-based AI video generation platform built on top of Happy Horse 1.0 — a 15-billion-parameter unified self-attention Transformer that produces 1080p cinematic video and synchronized audio in a single forward pass.
Think of it as the easiest way to use Happy Horse 1.0 without any local setup: write a text prompt (or upload an image), choose your aspect ratio and language, and get a production-ready MP4 in under a minute — with dialogue, ambient sound, and Foley effects already baked in.
Unlike other AI video tools that generate silent clips and rely on separate models for dubbing and lip-sync, Happy Horse 1.0 processes every modality — text, image, video, and audio — in one unified token sequence. The result is audio that is naturally aligned to mouth shapes at the phoneme level, footsteps that land on the right frames, and ambient noise that responds to camera cuts — all without a single line of post-production code.
HappyHor.se is the simplest entry point to this technology:
- No GPU, no Python, no CUDA — just a browser
- Free credits to try text-to-video and image-to-video
- Paid plans for commercial work, longer clips, and API access
- The same Happy Horse 1.0 model you can self-host via happyhor.se
Generate HD video at full 1080p resolution, 24 fps, in 5- to 8-second clips. Supported aspect ratios: 16:9 (landscape), 9:16 (portrait/short-form), and 1:1 (square). Output is tuned for cinematic motion: stable camera moves, coherent physics, and none of the "morphing" or "glitching" that plagues many diffusion video models.
The headline capability. A single unified Transformer denoises video tokens and audio tokens together in the same sequence — no separate audio branch, no post-production dubbing. Dialogue, ambient sound, and Foley effects are generated simultaneously with the visual content and naturally synchronized at the frame level.
Native lip-sync support for English, Mandarin, Cantonese, Japanese, Korean, German, and French, with an industry-leading Word Error Rate of just 14.60%. Speech timing, prosody, and mouth shapes are learned jointly with the video — not bolted on afterward.
Happy Horse 1.0 uses DMD-2 (Distribution Matching Distillation v2) to reduce sampling from the typical 25–50 steps down to just 8 steps, with no classifier-free guidance required. Combined with the MagiCompiler full-graph compilation runtime (~1.2× additional speedup), a 1080p clip is ready in roughly 38 seconds on an H100. On HappyHor.se, the same computation runs on cloud infrastructure — no wait, no queue.
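To make the step count concrete, here is a generic few-step denoising loop in the spirit of distilled samplers. Everything in it is a toy sketch: `toy_denoiser`, the linear sigma schedule, and the Euler update are illustrative stand-ins, not Happy Horse 1.0's actual sampler.

```python
import numpy as np

def toy_denoiser(x, sigma):
    # Stand-in for the distilled student: maps noisy latents straight to a
    # clean estimate in one forward pass (no classifier-free guidance pair).
    return x / (1.0 + sigma ** 2)

def sample(shape, steps=8, seed=0):
    rng = np.random.default_rng(seed)
    sigmas = np.linspace(1.0, 0.0, steps + 1)  # 8 noise levels down to clean
    x = rng.standard_normal(shape) * sigmas[0]
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x0 = toy_denoiser(x, sigma)             # one model call per step
        x = x0 + sigma_next * (x - x0) / sigma  # Euler step toward x0
    return x

latents = sample((4, 8))
print(latents.shape)  # (4, 8)
```

The point is structural: with a distilled student, each of the 8 steps is a single forward pass, whereas CFG-based samplers typically need two model evaluations per step.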
A single model handles both workflows:
- Text-to-video — describe any scene in natural language
- Image-to-video — upload a still photo or illustration and animate it
Style, subject identity, and physical realism remain consistent regardless of input type.
HappyHor.se is fully localized. Current interface languages: English (happyhor.se) and 简体中文 (happyhor.se/zh/).
Happy Horse 1.0 makes a deliberately minimalist architectural choice: instead of stacking multiple specialized networks for video, audio, and conditioning, everything goes into one unified Transformer. The simplicity is the point — fewer moving parts, fewer alignment failures.
| Component | Specification |
|---|---|
| Total parameters | ~15B |
| Architecture | Unified self-attention Transformer (no cross-attention) |
| Total layers | 40 |
| Layer layout | Sandwich: first 4 + last 4 layers use modality-specific projections; middle 32 layers share parameters across all modalities |
| Modalities | Text, image, video, and audio tokens — processed in a single token sequence |
| Multimodal fusion | Per-attention-head learned scalar gates with sigmoid activation (for training stability) |
| Conditioning | Reference image + denoising signal through one unified interface — no dedicated conditioning branches |
| Timestep handling | No explicit timestep embeddings — the model infers denoising state directly from input noise level |
| Distillation | DMD-2 (Distribution Matching Distillation v2) |
| Sampling steps | 8 |
| Classifier-free guidance | Not required |
| Inference runtime | MagiCompiler (full-graph compilation, ~1.2× end-to-end speedup) |
| Reference hardware | NVIDIA H100 80GB |
Most open-source video models — Wan 2.2, HunyuanVideo, LTX-2, CogVideoX — use a Diffusion Transformer (DiT) backbone where text conditioning is injected via cross-attention from a separate text encoder, and audio (when present) is produced by a completely separate model afterward.
This architecture works, but it creates an alignment problem: the audio model has no visibility into what the video model "saw" at each frame, so lip-sync becomes a downstream correction step rather than something the model learns intrinsically.
Happy Horse 1.0 takes the opposite approach. Text tokens, a reference-image latent, and noisy video and audio tokens are concatenated into a single sequence and jointly denoised by self-attention. Every attention layer can attend to every modality at every layer. The model learns alignment as part of denoising, not as a bolted-on post-process.
The first 4 and last 4 layers use modality-specific projections so each input type can be cleanly embedded and decoded. The middle 32 layers share parameters across all modalities — this is where cross-modal reasoning happens. The result: 32 of the 40 layers simultaneously act as the video Transformer, the audio Transformer, and the text-to-video aligner.
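The single-sequence idea can be sketched in a few lines of Python. All dimensions and token counts below are illustrative assumptions, and the "first 4 layers" of modality-specific projections are reduced to one linear map per modality for brevity; this is not the real model.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # shared hidden size of the trunk (hypothetical)

# One input projection per modality (stands in for the modality-specific
# early layers); each maps its raw token width into the shared width D.
proj_in = {m: rng.standard_normal((dim, D)) * 0.1
           for m, dim in {"text": 8, "video": 32, "audio": 12}.items()}

def embed(tokens_by_modality):
    # Project each modality, then concatenate everything into ONE sequence
    # that the shared middle layers would process with plain self-attention.
    return np.concatenate(
        [toks @ proj_in[m] for m, toks in tokens_by_modality.items()], axis=0)

seq = embed({
    "text":  rng.standard_normal((5, 8)),    # 5 text tokens
    "video": rng.standard_normal((20, 32)),  # 20 video tokens
    "audio": rng.standard_normal((10, 12)),  # 10 audio tokens
})
print(seq.shape)  # (35, 16)
```

After this concatenation, the shared layers see one sequence, so ordinary self-attention already lets every token attend to every other modality; that is where cross-modal alignment is learned.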
Joint multimodal training is notoriously unstable — gradients from the audio loss can dominate or be drowned out by gradients from the video loss. Happy Horse 1.0 uses a learned scalar gate with sigmoid activation on each attention head, allowing the model to effectively suppress heads producing destructive cross-modal gradients during training. It is a small architectural detail that pays off in a dramatically smoother loss curve.
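A minimal sketch of that gating mechanism, with hypothetical shapes; in a real implementation the gate would sit inside each attention layer and the logits would be learned parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_heads(head_outputs, gate_logits):
    # head_outputs: (num_heads, seq_len, head_dim) -- per-head attention output
    # gate_logits:  (num_heads,) learned scalars; sigmoid keeps gates in (0, 1)
    gates = sigmoid(gate_logits)[:, None, None]
    return head_outputs * gates  # a head with a very negative logit is muted

# With zero logits every gate is exactly 0.5: each head passes at half scale.
out = gated_heads(np.ones((2, 3, 4)), np.zeros(2))
print(out.max())  # 0.5
```

Because the sigmoid saturates smoothly, a gate can softly downweight a head during an unstable phase of training and reopen later, with no hard on/off switch in the computation graph.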
Diffusion models traditionally embed the current denoising timestep into every layer as an explicit signal. Happy Horse 1.0 omits this entirely. Because the noise level is already encoded in the noisy latents themselves, the model reads denoising state directly from its inputs. This simplification is one of the prerequisites for the aggressive DMD-2 8-step distillation.
Happy Horse 1.0 is evaluated on the Artificial Analysis Video Arena, the industry-standard public leaderboard for AI video models. The arena uses blind head-to-head comparisons — users see two videos from the same prompt without knowing which model produced which, then vote on their preference. Rankings are computed using Elo.
- 📈 Text-to-Video Leaderboard
- 🖼️ Image-to-Video Leaderboard
- 🥊 Video Arena — vote yourself
| Model | Visual Quality | Prompt Alignment | Physical Realism | WER ↓ | Elo |
|---|---|---|---|---|---|
| OVI 1.1 | 4.73 | 4.10 | 4.41 | 40.45% | — |
| LTX 2.3 | 4.76 | 4.12 | 4.56 | 19.23% | — |
| Happy Horse 1.0 | 4.80 | 4.18 | 4.52 | 14.60% | 1333 |
Win rates: 80.0% vs OVI 1.1 · 60.9% vs LTX 2.3
| Tier | Models |
|---|---|
| Frontier closed models (Elo 1,200–1,275) | Dreamina Seedance 2.0, SkyReels V4, Kling 3.0, PixVerse V6, Veo 3.1, Runway Gen-4.5 |
| Mid-tier closed models (Elo 1,150–1,200) | Sora 2 Pro, Hailuo 2.3, Wan 2.6, Vidu Q2 |
| Top open-weights models (Elo 1,100–1,135) | LTX-2 Pro, LTX-2 Fast, Wan 2.2 A14B |
| Happy Horse 1.0 (Elo 1,333) | 🏆 Above the open-source tier — competing with frontier closed models |
Happy Horse 1.0 first appeared on the leaderboard as an anonymous "mystery model" and drew community attention for its motion quality, audio synchronization, and prompt adherence before its identity was revealed. It is currently the highest-ranked open-source video generation model on the Artificial Analysis leaderboard.
No installation required. Visit happyhor.se and start generating immediately with free credits.
- Go to happyhor.se
- Type a prompt in the text box — e.g. "a lone astronaut walks across a red Martian landscape, cinematic wide shot, golden hour light"
- Select duration (5s or 8s), aspect ratio, and lip-sync language if needed
- Click Generate Video and wait ~38 seconds
- Download your 1080p MP4
- Switch to the Image to Video tab
- Upload any still image (portrait, product photo, illustration, etc.)
- Optionally add a motion prompt — e.g. "subject begins speaking warmly, gentle camera push-in"
- Select settings and click Generate
HappyHor.se exposes the Happy Horse 1.0 model via REST API for programmatic use. See happyhor.se/api for documentation.
```shell
curl -X POST https://api.happyhor.se/v1/generate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a robot dancing on the moon, cinematic, 4K",
    "duration": 5,
    "aspect_ratio": "16:9",
    "resolution": "1080p",
    "lip_sync_language": "en"
  }'
```

For teams who want full control — on-premise deployment, custom fine-tuning, or integration into an existing pipeline — the underlying model is available via the open-source release at happyhor.se.
```shell
# Clone & install
git clone https://github.com/happy-horse/happyhorse-1.git
cd happyhorse-1
pip install -r requirements.txt

# Download weights
bash download_weights.sh

# Generate
python demo_generate.py \
  --prompt "a robot dancing on the moon" \
  --duration 5 \
  --resolution 1080p
```

```python
from happyhorse import HappyHorseModel

model = HappyHorseModel.from_pretrained("happy-horse/happyhorse-1.0")
video, audio = model.generate(
    prompt="an elder on a mountain peak overlooking the valley at dawn",
    duration_seconds=5,
    fps=24,
    language="en",
)
video.save("output.mp4")
audio.save("output.wav")
```

| Tier | GPU | VRAM | Notes |
|---|---|---|---|
| Recommended | NVIDIA H100 80GB | 80 GB | ~38s per 1080p clip |
| Workable | NVIDIA A100 80GB | 80 GB | Full quality, slightly slower |
| Consumer (≥48GB) | RTX 6000 Ada / A6000 | 48 GB | Distilled model recommended |
| Consumer (24GB) | RTX 4090 | 24 GB | FP8 quantization + lower resolution |
| Feature | Happy Horse 1.0 | LTX-2 Pro | Wan 2.2 A14B | HunyuanVideo-1.5 | CogVideoX-5B |
|---|---|---|---|---|---|
| Parameters | 15B | ~13B | 14B | ~13B | 5B |
| Architecture | Unified self-attention | DiT | DiT | DiT | DiT |
| Native audio generation | ✅ Joint | ❌ | ❌ | ❌ | ❌ |
| Native lip-sync | 7 languages | None | None | None | None |
| Sampling steps | 8 (no CFG) | ~25 | ~50 | ~50 | ~50 |
| 1080p generation time | ~38s on H100 | Minutes | Minutes | Minutes | Minutes |
| Text-to-video | ✅ | ✅ | ✅ | ✅ | ✅ |
| Image-to-video | ✅ | ✅ | ✅ | ✅ | ✅ |
| Open weights | ✅ | ✅ | ✅ | ✅ | ✅ |
| Commercial use | ✅ | ✅ | ✅ | ✅ | ✅ |
| Feature | HappyHor.se / Happy Horse 1.0 | Sora 2 | Veo 3.1 | Kling 3.0 | Dreamina Seedance 2.0 |
|---|---|---|---|---|---|
| Open weights available | ✅ | ❌ | ❌ | ❌ | ❌ |
| Self-hostable | ✅ | ❌ | ❌ | ❌ | ❌ |
| Fine-tuning supported | ✅ | ❌ | ❌ | ❌ | ❌ |
| Native joint audio | ✅ | ❌ | ✅ | ❌ | ✅ |
| Native lip-sync | 7 languages | Limited | Limited | None | Limited |
| Pricing model | Free tier + subscription | API credits | API credits | API credits | API credits |
| Data stays on your infra | ✅ (self-hosted) | ❌ | ❌ | ❌ | ❌ |
Sora, Veo, Kling, and Seedance are closed, API-only services: you pay per minute, you cannot self-host or inspect the model, and your prompts and outputs pass through a third-party server. Happy Horse 1.0 is released as open weights — download once, run forever on your own infrastructure with no per-clip fee.
Generate scroll-stopping vertical video for TikTok, Instagram Reels, YouTube Shorts, Douyin, and Xiaohongshu — with native sound — in under a minute per clip. No separate dubbing, no lip-sync tooling, no third-party music licensing for ambient audio. The 9:16 output format is natively supported.
Build launch trailers, product teasers, and high-converting ad creatives with cinematic motion that looks directed rather than synthesized. Iterate quickly across multiple creative angles. The fast generation time (38s) enables real-time creative testing at a scale that no human production team can match.
Native lip-sync in 7 languages means you produce one creative concept and ship it across English, Mandarin, Cantonese, Japanese, Korean, German, and French markets simultaneously — without re-shooting or paying human dubbing studios.
Create B-roll, concept shots, and stylized sequences for film, TV, and YouTube production. Use as drop-in footage for editors or as previz animatics for directors before a physical shoot.
Turn product photos into photorealistic motion demos with image-to-video. Show packaging reveals, device demos, and lifestyle scenes with stable camera movement and accurate physics — before any physical shoot is scheduled.
Multi-shot scenes with dialogue, music, and Foley generated together. The unified architecture maintains better subject and lighting consistency across shots than pipelines stitching multiple models together.
Via the REST API, integrate Happy Horse 1.0 into internal content platforms, headless CMS workflows, or automated marketing systems. Contact happyhor.se for Enterprise plan details and SLA options.
The open Happy Horse 1.0 weights serve as a reference implementation for studying joint audio-video diffusion, DMD-2 distillation, unified multimodal Transformers, and timestep-free denoising. The architecture is unusually clean — 40 layers, no cross-attention, one token sequence — making it tractable to study and modify.
| Plan | Price | Videos / month | Resolution | Audio | Lip-Sync | API | Watermark |
|---|---|---|---|---|---|---|---|
| Free | $0 | 5 | 720p | ✅ | ❌ | ❌ | Yes |
| Pro | $19 / mo | 100 | 1080p | ✅ | ✅ 7 languages | ❌ | No |
| Enterprise | Custom | Unlimited | 1080p | ✅ | ✅ 7 languages | ✅ | No |
All plans generate with the full Happy Horse 1.0 model. No quality difference between tiers — just generation volume, resolution cap, and feature access. Visit happyhor.se/#pricing for full details.
The HappyHor.se API is available on Enterprise plans and provides programmatic access to the Happy Horse 1.0 model for text-to-video and image-to-video generation.
Base URL: https://api.happyhor.se/v1
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/generate` | Submit a generation job |
| `GET` | `/jobs/{id}` | Poll job status |
| `GET` | `/jobs/{id}/download` | Download completed video |
| `GET` | `/models` | List available model versions |
Full API documentation: happyhor.se/api
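Assuming a submitted job returns an `id` field and later reports a `status` of `completed` or `failed` (response field names are assumptions, not confirmed by this document; check happyhor.se/api), the submit-then-poll workflow against the endpoints above could look like this stdlib-only sketch:

```python
import json
import time
import urllib.request

API = "https://api.happyhor.se/v1"

def build_job(prompt, duration=5, aspect_ratio="16:9"):
    # Payload mirrors the curl example earlier in this document.
    return {"prompt": prompt, "duration": duration,
            "aspect_ratio": aspect_ratio, "resolution": "1080p"}

def call(method, path, api_key, body=None):
    # Small JSON-over-HTTP helper with the Bearer auth header (assumed shape).
    req = urllib.request.Request(
        f"{API}{path}",
        data=json.dumps(body).encode() if body is not None else None,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def generate(api_key, prompt, poll_every=5.0):
    # POST /generate, then poll GET /jobs/{id} until the job finishes.
    job = call("POST", "/generate", api_key, build_job(prompt))
    while True:
        status = call("GET", f"/jobs/{job['id']}", api_key)
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(poll_every)

print(build_job("a robot dancing on the moon")["resolution"])  # 1080p
```

A completed job's video would then be fetched from `/jobs/{id}/download`; a 5-second poll interval is a reasonable default given the ~38-second generation time quoted above.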
HappyHor.se is a web platform for generating AI video powered by the Happy Horse 1.0 model. You write a prompt (or upload an image), and get back a 1080p video with synchronized audio — no GPU, no installation, no editing skills needed. Free credits are available at launch.
Happy Horse 1.0 is the underlying open-source model — a 15B-parameter unified self-attention Transformer that jointly generates video and audio in a single forward pass. It is the world's first open-source SOTA video model with native audio generation. More technical detail: happyhor.se.
Roughly 38 seconds for a 5-second 1080p clip on a single NVIDIA H100. On HappyHor.se, generation runs on cloud infrastructure — no cold start, no queue for Pro users.
7 languages: English, Mandarin, Cantonese, Japanese, Korean, German, and French. Word Error Rate: 14.60% — the lowest among comparable open-source models. For text prompts in other languages, the model will still generate, but lip-sync accuracy is only guaranteed for the seven trained languages.
Yes. Videos generated on Pro and Enterprise plans can be used for commercial purposes — advertising, YouTube monetization, client work, product demos — in accordance with HappyHor.se's Terms of Service. The underlying Happy Horse 1.0 model is also open source with explicit commercial-use rights.
Those are closed API services with per-minute pricing and no self-hosting option. HappyHor.se gives you access to an open-source model that you can also run on your own hardware. Architecturally, Happy Horse 1.0 generates audio jointly with video in a single forward pass — most other models treat audio as a separate, optional feature.
The biggest difference is native joint audio generation: Wan, HunyuanVideo, and LTX-2 produce silent video and require entirely separate models for sound and lip-sync. Happy Horse 1.0 generates both in the same Transformer in one pass. It also uses pure self-attention (no cross-attention) and DMD-2 distillation for 8-step sampling without classifier-free guidance.
Yes — REST API access is available on Enterprise plans. See happyhor.se/api. For self-hosted API usage, deploy the open model via happyhor.se.
Yes. The Happy Horse 1.0 model is fully open source and available at happyhor.se. HappyHor.se is the managed hosted version — choose whichever fits your infrastructure.
On HappyHor.se, prompts and outputs are processed on our servers in accordance with our Privacy Policy. For complete data sovereignty, self-host the model on your own infrastructure via happyhor.se.
Happy Horse 1.0 is capable of generating photorealistic video and synchronized speech in seven languages. This capability comes with real responsibility. By using HappyHor.se or the underlying model, you agree not to:
- Generate non-consensual likenesses of real people — including public figures, private individuals, or deceased persons
- Create deceptive political content, fabricated news footage, or impersonations intended to mislead
- Generate content involving minors in any sexual or harmful context, or any content prohibited by applicable law
- Use the model to circumvent detection of AI-generated content in contexts where disclosure is legally or ethically required
You should:
- Clearly disclose AI-generated content in journalism, advertising, education, and any context where authenticity is expected
- Respect copyright, trademarks, and personal likeness rights in your prompts and outputs
A complete Acceptable Use Policy is available at happyhor.se/terms.
| Resource | URL |
|---|---|
| 🌐 HappyHor.se (English) | happyhor.se |
| 🇨🇳 HappyHor.se (简体中文) | happyhor.se/zh/ |
| 🐎 Happy Horse model site | happyhor.se |
| 📖 API Documentation | happyhor.se/api |
| 🤗 Hugging Face | huggingface.co/happy-horse |
| 💻 GitHub (model) | github.com/happy-horse/happyhorse-1 |
| 📊 Video Arena | artificialanalysis.ai/video/arena |
| 📈 T2V Leaderboard | artificialanalysis.ai/video/leaderboard/text-to-video |
If you use Happy Horse 1.0 in research or production, please cite the technical report (to be published with the model release):
```bibtex
@misc{happyhorse2026,
  title  = {Happy Horse 1.0: An Open-Source Unified Self-Attention Transformer for Joint Audio-Video Generation},
  author = {Happy Horse Team},
  year   = {2026},
  url    = {https://happyhor.se/},
  note   = {Web platform: https://happyhor.se/}
}
```
🐎 HappyHor.se — cinematic AI video, with sound, in seconds.
happyhor.se · 简体中文: happyhor.se/zh/