JD.com Releases JoyAI-Echo for Long-Form Audio-Visual Generation

JD.com's Joy Future Academy released JoyAI-Echo around June 2, 2026. It is an inference framework that generates synchronized audio and video for narratives up to 5 minutes long, and the project has accumulated 223 GitHub stars. The team describes it as the first system to simultaneously deliver long-range cross-modal consistency, real-time inference for minute-long video, conversational interactivity, and high-resolution output.

The release shows Chinese tech companies pushing into an area where maintaining consistency across extended sequences has been a major technical challenge. Unlike typical AI video generation tools that produce short clips, JoyAI-Echo generates coherent multi-shot stories with synchronized audio, at defaults of 241 frames at 25 fps (about 9.6 seconds, presumably per shot) and 1280×736 pixel resolution.

Cross-Modal Memory Bank Enables Character Consistency

The core innovation of JoyAI-Echo is its cross-modal audio-visual memory bank, which maintains character and narrative consistency across extended sequences. The system achieves a 7.5× speedup through distribution matching distillation (DMD) optimization, making long-form generation computationally feasible.

The joint audio-video pipeline generates synchronized outputs using a distilled few-step inference model. The system supports 5-minute coherent stories and runs on H100 or A100 GPUs with peak usage of 46-50 GB, making it accessible to research institutions with high-end hardware.
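As a rough feasibility check on the memory figure above, the sketch below compares the reported 46-50 GB peak against the nominal memory of common A100 and H100 variants. The capacity table and the helper names `fits` and `compatible_gpus` are illustrative assumptions, not part of the JoyAI-Echo release, and real headroom also depends on driver overhead and other processes on the card.

```python
# Nominal memory capacities in GB for common data-center GPU variants.
# ASSUMPTION: this table is illustrative and not taken from the JoyAI-Echo release.
GPU_MEMORY_GB = {
    "A100 40GB": 40,
    "A100 80GB": 80,
    "H100 80GB": 80,
}

# Upper end of the peak usage range reported in the article (46-50 GB).
REPORTED_PEAK_GB = 50


def fits(capacity_gb: float, peak_gb: float = REPORTED_PEAK_GB) -> bool:
    """Return True if the nominal capacity covers the reported peak usage."""
    return capacity_gb >= peak_gb


def compatible_gpus(table: dict[str, int], peak_gb: float = REPORTED_PEAK_GB) -> list[str]:
    """List GPU variants whose nominal capacity covers the peak, sorted by name."""
    return sorted(name for name, cap in table.items() if fits(cap, peak_gb))


print(compatible_gpus(GPU_MEMORY_GB))
```

Under these assumptions only the 80 GB variants clear the reported peak; a 40 GB A100 would fall short.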

Performance Benchmarks Show Strong Human Preference

JoyAI-Echo was evaluated on 100 benchmark stories comprising 3,000 evaluation shots, where it outperformed HappyOyster on long-form generation tasks. Human evaluators preferred JoyAI-Echo across multiple dimensions:

Audio quality: 81.7% preference vs. HappyOyster
Prompt following: 80.6% preference
Visual aesthetics: 63.6% preference
Character consistency: 59.4% preference

These metrics demonstrate particular strength in audio generation and adherence to user prompts, while showing competitive but less dominant performance in character consistency.
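The article does not describe the evaluation protocol, so the following is only a sketch of how a pairwise preference percentage of this kind is commonly computed, assuming each rater picks one of two systems per comparison. The function name `preference_rate` and the vote data are hypothetical and do not come from the JoyAI-Echo evaluation.

```python
from collections import Counter


def preference_rate(votes: list[str], system: str) -> float:
    """Percentage of pairwise votes that prefer `system`."""
    if not votes:
        raise ValueError("no votes to score")
    counts = Counter(votes)
    return 100.0 * counts[system] / len(votes)


# HYPOTHETICAL votes for illustration only: 7 raters prefer "A", 3 prefer "B".
votes = ["A"] * 7 + ["B"] * 3

rate_a = preference_rate(votes, "A")
rate_b = preference_rate(votes, "B")
print(f"A preferred in {rate_a:.1f}% of comparisons, B in {rate_b:.1f}%")
```

In a two-way comparison like this, 50% means no preference, so a figure such as 59.4% is a much narrower margin than 81.7%.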

Current Release Status and Licensing

The code and model weights are available on HuggingFace under an academic/research-only license, restricting commercial use. Two features remain unreleased: Echo-SR for super-resolution and Director Agent for enhanced control. The system operates at a default resolution of 1280×736 pixels with support for 241 frames at 25 fps.
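The default frame count and frame rate can be related to the 5-minute story length with some simple arithmetic. The sketch below assumes that each benchmark shot is one default-length clip of 241 frames, which the article does not state explicitly; the function names are illustrative.

```python
# Figures reported in the article.
FRAMES = 241             # default frame count
FPS = 25                 # default frame rate
BENCHMARK_STORIES = 100
BENCHMARK_SHOTS = 3000


def clip_seconds(frames: int = FRAMES, fps: int = FPS) -> float:
    """Duration of one generated clip in seconds."""
    return frames / fps


def story_seconds(shots: int, frames: int = FRAMES, fps: int = FPS) -> float:
    """Duration of a story, ASSUMING each shot is one default-length clip."""
    return shots * clip_seconds(frames, fps)


# Average number of shots per story in the reported benchmark.
shots_per_story = BENCHMARK_SHOTS // BENCHMARK_STORIES

print(f"one clip: {clip_seconds():.2f} s")
print(f"shots per story: {shots_per_story}")
print(f"story length: {story_seconds(shots_per_story) / 60:.2f} min")
```

Under that assumption, 30 shots of about 9.64 seconds each come to roughly 4.8 minutes, which lines up with the reported 5-minute story length.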

JD.com, one of China's largest e-commerce platforms, operates the Joy Future Academy as its AI research division, led by Dr. Nan Duan, Vice President of JD.com and Deputy Director of JD Future Academy. The release positions the company as a contributor to multimodal AI research, particularly in long-form generation, where temporal consistency and cross-modal alignment pose significant technical hurdles.

Key Takeaways

JD.com's Joy Future Academy released JoyAI-Echo in June 2026, which the team describes as the first system to combine long-range cross-modal consistency, real-time inference, conversational interactivity, and high-resolution output
The system generates synchronized audio-visual narratives of up to 5 minutes at 1280×736 resolution using a cross-modal memory bank, and achieves a 7.5× speedup through distribution matching distillation
Human evaluation showed 81.7% preference for audio quality and 80.6% for prompt following compared to HappyOyster across 100 benchmark stories
The framework requires 46-50 GB GPU memory (H100/A100) and is released under academic/research-only licensing with code and weights on HuggingFace
Two features remain unreleased: Echo-SR super-resolution and Director Agent for enhanced control
