Researchers have developed V2M-Zero, a video-to-music generation system that achieves substantial improvements in temporal synchronization without requiring paired video-music training data. Published on arXiv, the method delivers 21-52% better temporal synchronization, 13-15% improved semantic alignment, and 5-21% higher audio quality compared to paired-data baselines. The approach validates a key insight: temporal alignment requires matching when and how much change occurs, not what changes.
Event Curves Enable Cross-Modal Alignment Without Paired Data
The breakthrough observation driving V2M-Zero is that while musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. The system captures this temporal structure through event curves computed from intra-modal similarity using pretrained music and video encoders.
- Event curves measure temporal change within each modality independently
- Curves provide comparable representations across modalities
- No cross-modal training or expensive paired datasets required
- Method fine-tunes text-to-music model on music-event curves, then substitutes video-event curves at inference
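To make the event-curve idea concrete, here is a minimal sketch of how a temporal-change curve could be computed from per-frame features of a pretrained encoder. The exact similarity definition used in the paper is not specified here, so this adjacent-frame cosine-dissimilarity formulation (and the random stand-in features) are assumptions for illustration only.

```python
import numpy as np

def event_curve(embeddings: np.ndarray) -> np.ndarray:
    """Intra-modal temporal-change curve: 1 - cosine similarity
    between consecutive frame embeddings (higher = more change).

    This adjacent-frame formulation is an illustrative assumption,
    not necessarily the paper's exact definition.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cos = np.sum(normed[:-1] * normed[1:], axis=1)  # pairwise cosine similarity
    return 1.0 - cos

# Stand-in for pretrained video/music encoder output: T frames, D-dim features.
rng = np.random.default_rng(0)
video_feats = rng.normal(size=(120, 512))
curve = event_curve(video_feats)  # length T-1 curve of temporal change
```

Because the same computation applies to either modality's features, the resulting curves are directly comparable: a spike in a video curve (a cut, a dance move) plays the same structural role as a spike in a music curve (an onset, a downbeat).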
Substantial Performance Gains Across Multiple Benchmarks
Researchers Yan-Bo Lin, Jonah Casebeer, Long Mai, Aniruddha Mahapatra, Gedas Bertasius, and Nicholas J. Bryan evaluated V2M-Zero across three datasets: OES-Pub, MovieGenBench-Music, and AIST++ (dance videos). Results, confirmed via a large crowd-sourced subjective listening test, showed consistent improvements:
- 21-52% improved temporal synchronization
- 13-15% better semantic alignment
- 5-21% higher audio quality
- 28% higher beat alignment on AIST++ dance videos
Within-Modality Features Outperform Cross-Modal Supervision
The research demonstrates that temporal alignment through within-modality features is more effective than paired cross-modal supervision for video-to-music generation. This finding has significant implications since paired video-music data is expensive to collect and limits the scale and diversity of training datasets.
The training strategy involves two simple steps: fine-tune a text-to-music model on music-event curves during training, then substitute video-event curves at inference time. This zero-shot transfer approach eliminates the need for aligned video-music pairs while achieving superior temporal synchronization.
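The curve-swapping step above implies that music- and video-event curves must live on a common scale and length before they can be used interchangeably as conditioning. The following sketch shows one plausible preprocessing recipe; the resampling and min-max normalization choices are assumptions, not the paper's documented pipeline.

```python
import numpy as np

def resample_curve(curve: np.ndarray, target_len: int) -> np.ndarray:
    """Linearly resample an event curve so curves from different
    modalities (and frame rates) share one conditioning length."""
    x_old = np.linspace(0.0, 1.0, len(curve))
    x_new = np.linspace(0.0, 1.0, target_len)
    return np.interp(x_new, x_old, curve)

def normalize(curve: np.ndarray) -> np.ndarray:
    """Min-max normalize to [0, 1] so music- and video-event
    curves are numerically comparable."""
    lo, hi = curve.min(), curve.max()
    return (curve - lo) / (hi - lo + 1e-8)

# Training: condition the text-to-music model on a music-event curve.
music_curve = np.abs(np.sin(np.linspace(0.0, 6.0, 200)))
train_cond = normalize(resample_curve(music_curve, 128))

# Inference (zero-shot transfer): substitute a video-event curve
# prepared with the identical preprocessing.
video_curve = np.abs(np.cos(np.linspace(0.0, 4.0, 90)))
infer_cond = normalize(resample_curve(video_curve, 128))
```

The key design point is that nothing in the conditioning pathway ever sees both modalities at once, which is why no paired video-music data is needed.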
Broader Applications Enabled by Data-Free Approach
By removing the requirement for paired training data, V2M-Zero could enable much broader applications in video editing, content creation, and multimedia production. The method's success suggests that temporal structure can be learned and transferred across modalities more effectively through intra-modal self-supervision than through expensive cross-modal alignment.
Project page: https://genjib.github.io/v2m_zero/
Key Takeaways
- V2M-Zero achieves 21-52% better temporal sync without paired video-music training data
- Method uses event curves to capture temporal structure independently within each modality
- Performance improvements include 13-15% better semantic alignment and 5-21% higher audio quality
- Approach validated across OES-Pub, MovieGenBench-Music, and AIST++ datasets via crowd-sourced testing
- Within-modality features prove more effective than cross-modal supervision for temporal alignment