JD.com's Joy Future Academy released JoyAI-Echo around June 2, 2026. It is an inference framework that generates synchronized audio and video for narratives up to 5 minutes long. The project has accumulated 223 GitHub stars, and the team describes it as the first system to simultaneously deliver long-range cross-modal consistency, real-time inference for minute-long video, conversational interactivity, and high-resolution output.
The release shows Chinese tech companies pushing into an area where maintaining consistency across extended sequences has been a major technical challenge. Unlike typical AI video generation tools that produce short, isolated clips, JoyAI-Echo targets coherent stories with synchronized audio at 1280×736 pixel resolution. Its default setting is 241 frames at 25 fps, which works out to about 9.6 seconds of video, so the 5-minute figure presumably refers to stories assembled from many such segments.
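The frame count, frame rate, and story length quoted above can be related with simple arithmetic. The sketch below is illustrative only: it assumes that 241 frames describes a single generated clip and that a long story is built by chaining such clips back to back, neither of which the release notes cited here confirm. The function names are hypothetical and are not part of JoyAI-Echo.

```python
import math

FRAMES_PER_CLIP = 241      # default frame count quoted in the article
FPS = 25                   # default frame rate quoted in the article
WIDTH, HEIGHT = 1280, 736  # default resolution quoted in the article


def clip_seconds(frames: int = FRAMES_PER_CLIP, fps: float = FPS) -> float:
    """Playback duration of one clip, in seconds."""
    return frames / fps


def clips_needed(target_seconds: float,
                 frames: int = FRAMES_PER_CLIP,
                 fps: float = FPS) -> int:
    """Whole clips required to cover target_seconds.

    Assumption: clips are simply concatenated with no overlap.
    """
    return math.ceil(target_seconds / clip_seconds(frames, fps))


if __name__ == "__main__":
    print(f"One clip lasts {clip_seconds():.2f} s")
    print(f"Clips to cover 5 minutes: {clips_needed(5 * 60)}")
    print(f"Pixels per frame: {WIDTH * HEIGHT:,}")
```

Under these assumptions, a single 241-frame clip is far shorter than the 5-minute story length, which is why long-range consistency across clips matters.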
Cross-Modal Memory Bank Enables Character Consistency
The core innovation of JoyAI-Echo is its cross-modal audio-visual memory bank, which maintains character and narrative consistency across extended sequences. The system achieves a 7.5× speedup through distribution matching distillation (DMD) optimization, making long-form generation computationally feasible.
The joint audio-video pipeline generates synchronized outputs using a distilled few-step inference model. The system supports coherent stories up to 5 minutes long and runs on H100 or A100 GPUs with peak GPU memory usage of 46-50 GB, which keeps it within reach of research institutions that have high-end hardware.
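The memory figure above translates into a simple hardware check. The following is a minimal sketch, not part of JoyAI-Echo: the `fits` helper and the 2 GB headroom value are assumptions made for illustration, while the card capacities are the standard memory sizes for those GPU variants.

```python
# Peak GPU memory range reported in the article, in GB.
PEAK_USAGE_GB = (46, 50)


def fits(gpu_memory_gb: float,
         peak_gb: float = PEAK_USAGE_GB[1],
         headroom_gb: float = 2.0) -> bool:
    """Return True if a card can hold the reported peak usage.

    headroom_gb is an assumed safety margin for the CUDA context and
    memory fragmentation; it is not a number from the release.
    """
    return gpu_memory_gb >= peak_gb + headroom_gb


# Standard memory capacities for common variants of the named GPUs.
CARDS = {"A100 40GB": 40, "A100 80GB": 80, "H100 80GB": 80}

if __name__ == "__main__":
    for name, memory_gb in CARDS.items():
        verdict = "fits" if fits(memory_gb) else "does not fit"
        print(f"{name}: {verdict}")
```

The practical consequence is that the 40 GB variant of the A100 falls below the reported peak, so an 80 GB card is the realistic minimum. On a machine with PyTorch installed, the capacity of the local card can be read from `torch.cuda.get_device_properties(0).total_memory`, which reports bytes.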
Performance Benchmarks Show Strong Human Preference
JoyAI-Echo was evaluated on 100 benchmark stories comprising 3,000 evaluation shots (an average of 30 shots per story), and it outperformed HappyOyster on long-form generation tasks. In human evaluation, raters preferred JoyAI-Echo over HappyOyster at the following rates:
- Audio quality: 81.7% preference
- Prompt following: 80.6% preference
- Visual aesthetics: 63.6% preference
- Character consistency: 59.4% preference
The margins are widest for audio quality and prompt following. Visual aesthetics and character consistency also favor JoyAI-Echo, but by a smaller margin, with character consistency the closest of the four at 59.4%.
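To compare the four dimensions, each preference rate can be expressed as a margin over a 50% even split and as a win ratio. The sketch below is illustrative and assumes the percentages are pairwise win rates with no ties, so that the remainder went to HappyOyster. The article does not state how ties were handled, and the helper names are hypothetical.

```python
# Human preference rates for JoyAI-Echo over HappyOyster, from the article (%).
PREFERENCES = {
    "Audio quality": 81.7,
    "Prompt following": 80.6,
    "Visual aesthetics": 63.6,
    "Character consistency": 59.4,
}


def margin_over_parity(pct: float) -> float:
    """Percentage points above a 50/50 split."""
    return pct - 50.0


def win_ratio(pct: float) -> float:
    """Wins per loss, assuming no ties (losses = 100 - pct)."""
    return pct / (100.0 - pct)


# Dimensions sorted from strongest to weakest preference.
ranked = sorted(PREFERENCES.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    for name, pct in ranked:
        print(f"{name:<22} +{margin_over_parity(pct):.1f} pts, "
              f"{win_ratio(pct):.2f} wins per loss")
```

Read this way, the audio-quality result corresponds to more than four wins per loss, while character consistency is closer to three wins for every two losses.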
Current Release Status and Licensing
The code and model weights are available on Hugging Face under an academic/research-only license that restricts commercial use. Two features remain unreleased: Echo-SR for super-resolution and Director Agent for enhanced control. The default output configuration is 1280×736 pixels, 241 frames, at 25 fps.
JD.com, one of China's largest e-commerce platforms, operates the Joy Future Academy as its AI research division, led by Dr. Nan Duan, Vice President of JD.COM and Deputy Director of JD Future Academy. The release positions the company as a contributor to multimodal AI research, particularly in long-form generation, where temporal consistency and cross-modal alignment remain significant technical hurdles.
Key Takeaways
- JD.com's Joy Future Academy released JoyAI-Echo in June 2026; the team describes it as the first system to combine long-range cross-modal consistency, real-time inference, conversational interactivity, and high-resolution output
- The system generates up to 5-minute synchronized audio-visual narratives at 1280×736 resolution using a cross-modal memory bank and achieves 7.5× speedup through distribution matching distillation
- Human evaluation showed 81.7% preference for audio quality and 80.6% for prompt following compared to HappyOyster across 100 benchmark stories
- The framework requires 46-50 GB of GPU memory at peak (H100/A100) and is released under academic/research-only licensing with code and weights on Hugging Face
- Two features remain unreleased: Echo-SR super-resolution and Director Agent for enhanced control