Researchers from Google and UC Berkeley have developed LoGeR (Long-context Geometric Reconstruction), a neural architecture that achieves dense 3D reconstruction from video sequences up to 19,000 frames long without requiring post-optimization. The system reduces tracking errors by over 74% compared to existing feedforward methods while processing minute-long videos in a single forward pass.
LoGeR Combines Parametric and Non-Parametric Memory for Global Coherence
The core innovation is a hybrid memory module that maintains consistency across extremely long sequences. LoGeR processes videos in chunks while preserving global coherence through complementary mechanisms:
- Test-Time Training (TTT) memory: A parametric component that anchors the global coordinate frame and prevents scale drift across the entire sequence
- Sliding Window Attention (SWA): A non-parametric mechanism that preserves uncompressed context for high-precision alignment between adjacent chunks
- Bidirectional priors: Strong intra-chunk reasoning that ensures high-fidelity reconstruction within each video segment
This pairing of parametric and non-parametric memory enables training on sequences of just 128 frames while generalizing to thousands of frames at inference, with no architectural modifications.
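The chunked scheme above can be sketched in a toy form. This is an illustrative reconstruction of the idea, not the paper's implementation: `ttt_update` stands in for the test-time-training loss (here, a simple feature-reconstruction objective), and the sliding-window buffer stands in for SWA's uncompressed local context. All names, dimensions, and the loss itself are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def ttt_update(W_mem, chunk, lr=0.1, steps=2):
    """Toy test-time-training step: fit the memory weights to reconstruct
    the chunk's features (a stand-in for the paper's actual TTT objective)."""
    for _ in range(steps):
        pred = chunk @ W_mem                    # (frames, d)
        grad = chunk.T @ (pred - chunk) / len(chunk)
        W_mem = W_mem - lr * grad
    return W_mem

d, chunk_len, window = 8, 16, 32
W_mem = np.zeros((d, d))          # parametric TTT memory: fixed-size global anchor
swa_buffer = np.empty((0, d))     # non-parametric buffer of recent raw features

video = rng.normal(size=(128, d))               # toy "video" feature sequence
for start in range(0, len(video), chunk_len):
    chunk = video[start:start + chunk_len]
    context = np.concatenate([swa_buffer, chunk])  # uncompressed local context
    # ... a real model would attend over `context` and read the global
    # state from W_mem here to reconstruct the chunk in 3D ...
    W_mem = ttt_update(W_mem, chunk)            # compress the chunk into memory
    swa_buffer = context[-window:]              # keep only the last `window` frames

print(W_mem.shape, swa_buffer.shape)
```

The key property the sketch illustrates is that both states stay fixed-size no matter how long the video gets: the parametric memory compresses everything seen so far, while the sliding window keeps only recent frames at full fidelity.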
Performance Results Demonstrate Substantial Improvements Over Prior Methods
LoGeR achieves robust performance across standard benchmarks and extended video sequences:
- 74% reduction in Absolute Trajectory Error (ATE) on the KITTI dataset compared to state-of-the-art feedforward methods
- 19,000 frames successfully processed on the repurposed VBR dataset, representing unprecedented sequence length
- No post-optimization required: Eliminates need for traditional bundle adjustment or iterative refinement
- Globally consistent reconstruction maintained throughout entire sequences without drift
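For context on the headline metric: ATE compares an estimated camera trajectory to ground truth after aligning the two, then reports the RMSE of the remaining positional differences. The sketch below uses a simplified translation-only (centroid) alignment for brevity; the standard benchmark protocol uses a full SE(3) or Sim(3) alignment such as the Umeyama method. The toy trajectories are invented for illustration.

```python
import numpy as np

def ate_rmse(gt, est):
    """RMSE Absolute Trajectory Error after centroid (translation-only) alignment."""
    gt = gt - gt.mean(axis=0)
    est = est - est.mean(axis=0)
    return float(np.sqrt(np.mean(np.sum((gt - est) ** 2, axis=1))))

t = np.linspace(0, 1, 100)[:, None]
gt = np.hstack([t, t ** 2, np.zeros_like(t)])   # toy ground-truth path
est = gt + 0.05                                 # constant offset: removed by alignment
assert ate_rmse(gt, est) < 1e-9                 # pure offset -> zero error

drifting = gt + 0.05 * t                        # error that grows with time
print(ate_rmse(gt, drifting))                   # drift survives alignment
```

This is why ATE is the standard yardstick for long sequences: a constant offset is forgiven by the alignment step, but accumulated drift, the failure mode of feedforward methods, shows up directly.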
The research team includes Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, and Deqing Sun.
Implications for Real-Time 3D Applications
Traditional 3D reconstruction methods face a fundamental tradeoff: feedforward approaches are fast but drift over long sequences, while optimization-based methods are accurate but computationally expensive. LoGeR demonstrates that learned architectures can handle minute-long videos in a single forward pass while maintaining geometric consistency.
The breakthrough addresses the quadratic attention complexity and limited effective memory that plague existing geometric foundation models on long sequences. By combining parametric memory for global anchoring with non-parametric attention for local precision, LoGeR provides a scalable path for real-time applications in robotics, AR/VR, and autonomous navigation systems.
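A back-of-envelope comparison shows why the quadratic cost matters at these lengths. Full self-attention computes on the order of T² query-key scores for T frames, while sliding-window attention computes roughly T·W for a window of W keys. The window size of 128 below is illustrative, not a value from the paper.

```python
# Rough score-pair counts: full self-attention vs. sliding-window attention.

def score_pairs_full(T):
    # every query attends to every key
    return T * T

def score_pairs_swa(T, W):
    # each query attends to at most W recent keys
    return T * min(T, W)

for T in (1_000, 19_000):
    full = score_pairs_full(T)
    swa = score_pairs_swa(T, W=128)
    print(f"T={T:>6}: full={full:.2e}  swa={swa:.2e}  ratio={full / swa:.0f}x")
```

At 19,000 frames the full-attention cost is over two orders of magnitude larger, which is why chunked processing with bounded-size memory is essential for sequences of this length.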
Key Takeaways
- LoGeR scales dense 3D reconstruction to 19,000-frame video sequences without post-optimization, trained on just 128-frame clips
- The hybrid memory architecture combines Test-Time Training for global consistency and Sliding Window Attention for local precision
- The system achieves a 74% reduction in trajectory error on KITTI compared to prior feedforward methods
- Successfully processes minute-long videos in a single forward pass, eliminating need for bundle adjustment
- The collaboration between Google and UC Berkeley demonstrates that learned architectures can match optimization-based methods for long-sequence reconstruction