Researchers have developed V2M-Zero, a video-to-music generation system that achieves substantial improvements in temporal synchronization without requiring paired video-music training data. Published on arXiv, the method delivers 21-52% better temporal synchronization, 13-15% improved semantic alignment, and 5-21% higher audio quality compared to paired-data baselines. The approach validates a key insight: temporal alignment requires matching when and how much change occurs, not what changes.
Event Curves Enable Cross-Modal Alignment Without Paired Data
The breakthrough observation driving V2M-Zero is that while musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. The system captures this temporal structure through event curves computed from intra-modal similarity using pretrained music and video encoders.
- Event curves measure temporal change within each modality independently
- Curves provide comparable representations across modalities
- No cross-modal training or expensive paired datasets required
- Method fine-tunes text-to-music model on music-event curves, then substitutes video-event curves at inference
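To make the event-curve idea concrete, here is a minimal sketch of how a temporal-change curve could be computed from per-frame features of a pretrained encoder. The exact similarity definition used in the paper is not specified here, so this adjacent-frame cosine-dissimilarity formulation (and the random stand-in features) are assumptions for illustration only.

```python
import numpy as np

def event_curve(embeddings: np.ndarray) -> np.ndarray:
    """Intra-modal temporal-change curve: 1 - cosine similarity
    between consecutive frame embeddings (higher = more change).

    This adjacent-frame formulation is an illustrative assumption,
    not necessarily the paper's exact definition.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cos = np.sum(normed[:-1] * normed[1:], axis=1)  # pairwise cosine similarity
    return 1.0 - cos

# Stand-in for pretrained video/music encoder output: T frames, D-dim features.
rng = np.random.default_rng(0)
video_feats = rng.normal(size=(120, 512))
curve = event_curve(video_feats)  # length T-1 curve of temporal change
```

Because the same computation applies to either modality's features, the resulting curves are directly comparable: a spike in a video curve (a cut, a dance move) plays the same structural role as a spike in a music curve (an onset, a downbeat).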
Substantial Performance Gains Across Multiple Benchmarks
Researchers Yan-Bo Lin, Jonah Casebeer, Long Mai, Aniruddha Mahapatra, Gedas Bertasius, and Nicholas J. Bryan evaluated V2M-Zero across three datasets: OES-Pub, MovieGenBench-Music, and AIST++ (dance videos). Results, confirmed via a large crowd-sourced subjective listening test, showed consistent improvements:
- 21-52% improved temporal synchronization
- 13-15% better semantic alignment
- 5-21% higher audio quality
- 28% higher beat alignment on AIST++ dance videos
Within-Modality Features Outperform Cross-Modal Supervision
The research demonstrates that temporal alignment through within-modality features is more effective than paired cross-modal supervision for video-to-music generation. This finding has significant implications since paired video-music data is expensive to collect and limits the scale and diversity of training datasets.
The training strategy involves two simple steps: fine-tune a text-to-music model on music-event curves during training, then substitute video-event curves at inference time. This zero-shot transfer approach eliminates the need for aligned video-music pairs while achieving superior temporal synchronization.
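The curve-swapping step above implies that music- and video-event curves must live on a common scale and length before they can be used interchangeably as conditioning. The following sketch shows one plausible preprocessing recipe; the resampling and min-max normalization choices are assumptions, not the paper's documented pipeline.

```python
import numpy as np

def resample_curve(curve: np.ndarray, target_len: int) -> np.ndarray:
    """Linearly resample an event curve so curves from different
    modalities (and frame rates) share one conditioning length."""
    x_old = np.linspace(0.0, 1.0, len(curve))
    x_new = np.linspace(0.0, 1.0, target_len)
    return np.interp(x_new, x_old, curve)

def normalize(curve: np.ndarray) -> np.ndarray:
    """Min-max normalize to [0, 1] so music- and video-event
    curves are numerically comparable."""
    lo, hi = curve.min(), curve.max()
    return (curve - lo) / (hi - lo + 1e-8)

# Training: condition the text-to-music model on a music-event curve.
music_curve = np.abs(np.sin(np.linspace(0.0, 6.0, 200)))
train_cond = normalize(resample_curve(music_curve, 128))

# Inference (zero-shot transfer): substitute a video-event curve
# prepared with the identical preprocessing.
video_curve = np.abs(np.cos(np.linspace(0.0, 4.0, 90)))
infer_cond = normalize(resample_curve(video_curve, 128))
```

The key design point is that nothing in the conditioning pathway ever sees both modalities at once, which is why no paired video-music data is needed.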
Broader Applications Enabled by Data-Free Approach
By removing the requirement for paired training data, V2M-Zero could enable much broader applications in video editing, content creation, and multimedia production. The method's success suggests that temporal structure can be learned and transferred across modalities more effectively through intra-modal self-supervision than through expensive cross-modal alignment.
Project page: https://genjib.github.io/v2m_zero/
Key Takeaways
- V2M-Zero achieves 21-52% better temporal sync without paired video-music training data
- Method uses event curves to capture temporal structure independently within each modality
- Performance improvements include 13-15% better semantic alignment and 5-21% higher audio quality
- Approach validated across OES-Pub, MovieGenBench-Music, and AIST++ datasets via crowd-sourced testing
- Within-modality features prove more effective than cross-modal supervision for temporal alignment