JD.com's Joy Future Academy has released JoyAI-Echo, an open-source audio-visual generation framework that produces synchronized 5-minute videos with consistent character appearance and voice across the entire sequence. Distribution Matching Distillation gives the system a 7.5× speedup over baseline pipelines. JoyAI-Echo is presented as the first framework to combine long-range cross-modal consistency, real-time inference for minute-long content, conversational interactivity, and high-resolution output.
Cross-Modal Memory Bank Maintains Character Consistency
The core innovation in JoyAI-Echo is its Cross-Modal Audio-Visual Memory Bank, which preserves both visual character appearance and voice timbre throughout extended video sequences. This addresses a fundamental challenge in long-form generation, where models typically struggle to maintain consistency beyond short clips. Each shot is generated as 241 frames at 25 fps and 1280 × 736 pixel resolution from a single JSON prompt specification.
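The per-shot figures make the scale of the consistency problem concrete. The sketch below works out how long one shot lasts and how many shots a 5-minute video would need. It assumes shots are simply concatenated with no overlapping frames, which the article does not confirm; the function names are illustrative only.

```python
from math import ceil

FRAMES_PER_SHOT = 241      # from the article
FPS = 25                   # from the article
TARGET_SECONDS = 5 * 60    # a 5-minute video


def shot_seconds(frames: int = FRAMES_PER_SHOT, fps: int = FPS) -> float:
    """Duration of a single shot in seconds."""
    return frames / fps


def shots_needed(target_seconds: int = TARGET_SECONDS,
                 frames: int = FRAMES_PER_SHOT,
                 fps: int = FPS) -> int:
    """Shots required to cover the target duration.

    Assumption: shots are concatenated back to back without overlap.
    """
    total_frames = target_seconds * fps
    return ceil(total_frames / frames)


if __name__ == "__main__":
    print(f"one shot lasts {shot_seconds():.2f} s")      # 9.64 s
    print(f"shots for 5 minutes: {shots_needed()}")      # 32
```

Under that assumption, one shot covers under 10 seconds, so a 5-minute video spans roughly 32 shots across which the memory bank has to keep appearance and voice stable.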
Human Evaluation Shows Strong Performance Against Competitors
In human evaluation studies, raters favored JoyAI-Echo over existing systems:
- 63.6% preference over HappyOyster on visual aesthetics for long-form content
- 81.7% preference on audio quality
- 80.6% on prompt adherence
- 59.4% on character consistency
- Superior performance to Wan 2.6 on visual aesthetics for short-form human-centric tasks
The framework includes an interactive conversational agent that enables real-time editing during the generation process, allowing users to refine outputs without starting from scratch.
Technical Architecture Based on Modified LTX-2
JoyAI-Echo builds on a modified version of LTX-2 from Lightricks, optimized for text-to-video generation. Peak GPU memory usage is 46-50 GB, which fits on 80 GB H100 and A100 GPUs but exceeds the capacity of the 40 GB A100 variant. The framework currently supports text-to-video as its primary generation mode, with image-to-video support planned for future releases.
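A quick way to read the memory requirement is to compare the reported peak against nominal card capacities. The following is a minimal sketch: the 46-50 GB range comes from the article, the capacities are the cards' nominal sizes, and the helper names are hypothetical. It ignores any memory reserved by the driver or other processes.

```python
PEAK_GB_LOW, PEAK_GB_HIGH = 46, 50   # peak usage range reported in the article

# Nominal memory capacities in GB for common data-center cards.
GPU_MEMORY_GB = {
    "A100 40GB": 40,
    "A100 80GB": 80,
    "H100 80GB": 80,
}


def fits(capacity_gb: float, peak_gb: float = PEAK_GB_HIGH) -> bool:
    """True if the card's nominal capacity covers the worst-case peak."""
    return capacity_gb >= peak_gb


def usable_cards(cards: dict = GPU_MEMORY_GB,
                 peak_gb: float = PEAK_GB_HIGH) -> list:
    """Names of cards whose nominal capacity covers the peak."""
    return [name for name, gb in cards.items() if fits(gb, peak_gb)]


if __name__ == "__main__":
    print(usable_cards())   # ['A100 80GB', 'H100 80GB']
```

Checking against the upper end of the range is the conservative choice: a 40 GB A100 falls short even at the 46 GB lower bound.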
The project's inference-only code and model weights are available on HuggingFace in the jdopensource/JoyAI-Echo repository. The framework is released under an academic and research-use license and marks a notable open-source contribution to generative AI from JD.com's Joy Future Academy.
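For readers who want to fetch the release, the sketch below uses `snapshot_download` from the `huggingface_hub` package with the repository name given above. The local directory is an arbitrary choice, and the article does not say whether the repository requires accepting license terms or logging in first, so treat this as a starting point rather than the project's documented setup.

```python
REPO_ID = "jdopensource/JoyAI-Echo"   # repository name given in the article


def download_weights(local_dir: str = "./joyai-echo") -> str:
    """Download a snapshot of the repository and return its local path.

    Requires `pip install huggingface_hub`. The import is done lazily so
    that this module can be loaded without the package installed.
    Assumption: the repository is reachable without extra authentication.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


if __name__ == "__main__":
    path = download_weights()
    print(f"files downloaded to {path}")
```

Given the 46-50 GB peak memory figure, the download is likely to be large, so check available disk space before running it.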
Key Takeaways
- JoyAI-Echo generates synchronized 5-minute videos at 1280 × 736 resolution with consistent characters and voices throughout
- Distribution Matching Distillation achieves 7.5× speedup over baseline audio-visual generation pipelines
- Human evaluators preferred JoyAI-Echo over HappyOyster in 63.6% of comparisons on visual aesthetics and 81.7% on audio quality
- The system requires 46-50 GB of GPU memory at peak and runs on 80 GB H100/A100 hardware
- Inference code is open-source on GitHub (673 stars) and model weights are on HuggingFace, both released for academic and research use