JD.com Releases JoyAI-Echo: 5-Minute Coherent Audio-Video Generation with 7.5× Speedup

JD.com's Joy Future Academy has released JoyAI-Echo, an open-source audio-visual generation framework that produces synchronized 5-minute videos with consistent character appearance and voice across the entire sequence. The system achieves a 7.5× speedup over baseline pipelines through Distribution Matching Distillation. It is presented as the first framework to simultaneously deliver long-range cross-modal consistency, real-time inference for minute-long content, conversational interactivity, and high-resolution output.

Cross-Modal Memory Bank Maintains Character Consistency

The core innovation in JoyAI-Echo is its Cross-Modal Audio-Visual Memory Bank, which preserves both visual character appearance and voice timbre throughout extended video sequences. This addresses a fundamental challenge in long-form generative AI, where models typically struggle to maintain consistency beyond short clips. Each shot consists of 241 frames at 25 fps (about 9.6 seconds) at 1280 × 736 resolution, and the full sequence is generated from a single JSON prompt specification.
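To put the per-shot figures in context, the short sketch below works out how long one shot lasts and roughly how many shots a 5-minute video would take. The frame count and frame rate come from the figures above; the assumption that shots are joined back to back with no overlapping frames is ours, as are the function names, and the actual pipeline may stitch shots differently.

```python
import math

FRAMES_PER_SHOT = 241      # per-shot frame count reported for JoyAI-Echo
FPS = 25                   # reported frame rate
TARGET_SECONDS = 5 * 60    # a 5-minute video


def shot_duration_seconds(frames=FRAMES_PER_SHOT, fps=FPS):
    """Length of a single shot in seconds."""
    return frames / fps


def shots_needed(target_seconds=TARGET_SECONDS, frames=FRAMES_PER_SHOT, fps=FPS):
    """Shots required to cover the target duration.

    Assumption (not from the source): shots are concatenated
    back to back with no shared or overlapping frames.
    """
    total_frames = target_seconds * fps
    return math.ceil(total_frames / frames)


if __name__ == "__main__":
    print(f"One shot lasts {shot_duration_seconds():.2f} s")
    print(f"A 5-minute video needs about {shots_needed()} shots")
```

Under that assumption, character appearance and voice have to stay consistent across a few dozen shot boundaries, which is the job the memory bank is described as doing.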

Human Evaluation Shows Strong Performance Against Competitors

In human evaluation studies, JoyAI-Echo demonstrated significant advantages over existing systems:

63.6% preference over HappyOyster on visual aesthetics for long-form content
81.7% preference on audio quality
80.6% on prompt adherence
59.4% on character consistency
Superior performance to Wan 2.6 on visual aesthetics for short-form human-centric tasks

The framework includes an interactive conversational agent that enables real-time editing during the generation process, allowing users to refine outputs without starting from scratch.

Technical Architecture Based on Modified LTX-2

JoyAI-Echo builds on a modified version of LTX-2 from Lightricks, optimized for text-to-video generation. The system requires 46-50 GB of GPU memory at peak usage, so it fits on 80 GB-class GPUs such as the H100 and A100 (the 40 GB A100 variant is below that peak). Currently, the framework supports text-to-video as the primary generation mode, with image-to-video support planned for future releases.
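As a rough illustration of the memory requirement, the sketch below compares the reported 46-50 GB peak against the nominal capacity of a few common data-center GPU variants. The capacity table and the helper name are ours, not part of the JoyAI-Echo release, and usable memory on real hardware is somewhat lower than the nominal figure.

```python
# Reported peak GPU memory range for JoyAI-Echo inference, in GB.
PEAK_GB_LOW, PEAK_GB_HIGH = 46, 50

# Nominal memory per GPU variant, in GB (illustrative table, not from the release).
GPU_MEMORY_GB = {
    "A100 40GB": 40,
    "A100 80GB": 80,
    "H100 80GB": 80,
}


def fits_peak(gpu_memory_gb, peak_gb=PEAK_GB_HIGH):
    """True if the card's nominal memory covers the worst-case reported peak."""
    return gpu_memory_gb >= peak_gb


if __name__ == "__main__":
    for name, memory in GPU_MEMORY_GB.items():
        verdict = "fits" if fits_peak(memory) else "too small"
        print(f"{name}: {verdict} for a {PEAK_GB_HIGH} GB peak")
```

Checking against the upper end of the range is the conservative choice here, since a run that peaks at 50 GB would fail on a card that only clears 46 GB.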

The project's inference-only code and model weights are available on Hugging Face under the jdopensource/JoyAI-Echo repository. Released under an academic and research-use license, the framework is a significant open-source contribution to generative AI from JD.com's e-commerce research division.

Key Takeaways

JoyAI-Echo generates synchronized 5-minute videos at 1280 × 736 resolution with consistent characters and voices throughout
Distribution Matching Distillation achieves 7.5× speedup over baseline audio-visual generation pipelines
Human evaluators preferred JoyAI-Echo over HappyOyster in 63.6% of comparisons on visual aesthetics and 81.7% on audio quality
The system requires 46-50 GB GPU memory and is compatible with H100/A100 hardware
Inference code is released as open source on GitHub (673 stars), with model weights on Hugging Face, for academic and research use
