Researchers from Google and UC Berkeley have developed LoGeR (Long-context Geometric Reconstruction), a neural architecture that achieves dense 3D reconstruction from video sequences up to 19,000 frames long without requiring post-optimization. The system reduces tracking errors by over 74% compared to existing feedforward methods while processing minute-long videos in a single forward pass.
LoGeR Combines Parametric and Non-Parametric Memory for Global Coherence
The core innovation is a hybrid memory module that maintains consistency across extremely long sequences. LoGeR processes videos in chunks while preserving global coherence through complementary mechanisms:
- Test-Time Training (TTT) memory: A parametric component that anchors the global coordinate frame and prevents scale drift across the entire sequence
- Sliding Window Attention (SWA): A non-parametric mechanism that preserves uncompressed context for high-precision alignment between adjacent chunks
- Bidirectional priors: Strong intra-chunk reasoning that ensures high-fidelity reconstruction within each video segment
This pairing of parametric and non-parametric memory enables training on sequences of just 128 frames while generalizing to thousands of frames at inference, with no architectural modifications.
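The chunked scheme above can be sketched in a toy form. This is an illustrative reconstruction of the idea, not the paper's implementation: `ttt_update` stands in for the test-time-training loss (here, a simple feature-reconstruction objective), and the sliding-window buffer stands in for SWA's uncompressed local context. All names, dimensions, and the loss itself are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def ttt_update(W_mem, chunk, lr=0.1, steps=2):
    """Toy test-time-training step: fit the memory weights to reconstruct
    the chunk's features (a stand-in for the paper's actual TTT objective)."""
    for _ in range(steps):
        pred = chunk @ W_mem                    # (frames, d)
        grad = chunk.T @ (pred - chunk) / len(chunk)
        W_mem = W_mem - lr * grad
    return W_mem

d, chunk_len, window = 8, 16, 32
W_mem = np.zeros((d, d))          # parametric TTT memory: fixed-size global anchor
swa_buffer = np.empty((0, d))     # non-parametric buffer of recent raw features

video = rng.normal(size=(128, d))               # toy "video" feature sequence
for start in range(0, len(video), chunk_len):
    chunk = video[start:start + chunk_len]
    context = np.concatenate([swa_buffer, chunk])  # uncompressed local context
    # ... a real model would attend over `context` and read the global
    # state from W_mem here to reconstruct the chunk in 3D ...
    W_mem = ttt_update(W_mem, chunk)            # compress the chunk into memory
    swa_buffer = context[-window:]              # keep only the last `window` frames

print(W_mem.shape, swa_buffer.shape)
```

The key property the sketch illustrates is that both states stay fixed-size no matter how long the video gets: the parametric memory compresses everything seen so far, while the sliding window keeps only recent frames at full fidelity.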
Performance Results Demonstrate Substantial Improvements Over Prior Methods
LoGeR achieves robust performance across standard benchmarks and extended video sequences:
- 74% reduction in Absolute Trajectory Error (ATE) on the KITTI dataset compared to state-of-the-art feedforward methods
- 19,000 frames successfully processed on the repurposed VBR dataset, representing unprecedented sequence length
- No post-optimization required: Eliminates need for traditional bundle adjustment or iterative refinement
- Globally consistent reconstruction maintained throughout entire sequences without drift
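For context on the headline metric: ATE compares an estimated camera trajectory to ground truth after aligning the two, then reports the RMSE of the remaining positional differences. The sketch below uses a simplified translation-only (centroid) alignment for brevity; the standard benchmark protocol uses a full SE(3) or Sim(3) alignment such as the Umeyama method. The toy trajectories are invented for illustration.

```python
import numpy as np

def ate_rmse(gt, est):
    """RMSE Absolute Trajectory Error after centroid (translation-only) alignment."""
    gt = gt - gt.mean(axis=0)
    est = est - est.mean(axis=0)
    return float(np.sqrt(np.mean(np.sum((gt - est) ** 2, axis=1))))

t = np.linspace(0, 1, 100)[:, None]
gt = np.hstack([t, t ** 2, np.zeros_like(t)])   # toy ground-truth path
est = gt + 0.05                                 # constant offset: removed by alignment
assert ate_rmse(gt, est) < 1e-9                 # pure offset -> zero error

drifting = gt + 0.05 * t                        # error that grows with time
print(ate_rmse(gt, drifting))                   # drift survives alignment
```

This is why ATE is the standard yardstick for long sequences: a constant offset is forgiven by the alignment step, but accumulated drift, the failure mode of feedforward methods, shows up directly.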
The research team includes Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, and Deqing Sun.
Implications for Real-Time 3D Applications
Traditional 3D reconstruction methods face a fundamental tradeoff: feedforward approaches are fast but drift over long sequences, while optimization-based methods are accurate but computationally expensive. LoGeR demonstrates that learned architectures can handle minute-long videos in a single forward pass while maintaining geometric consistency.
The breakthrough addresses the quadratic attention complexity and limited effective memory that plague existing geometric foundation models on long sequences. By combining parametric memory for global anchoring with non-parametric attention for local precision, LoGeR provides a scalable path for real-time applications in robotics, AR/VR, and autonomous navigation systems.
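A back-of-envelope comparison shows why the quadratic cost matters at these lengths. Full self-attention computes on the order of T² query-key scores for T frames, while sliding-window attention computes roughly T·W for a window of W keys. The window size of 128 below is illustrative, not a value from the paper.

```python
# Rough score-pair counts: full self-attention vs. sliding-window attention.

def score_pairs_full(T):
    # every query attends to every key
    return T * T

def score_pairs_swa(T, W):
    # each query attends to at most W recent keys
    return T * min(T, W)

for T in (1_000, 19_000):
    full = score_pairs_full(T)
    swa = score_pairs_swa(T, W=128)
    print(f"T={T:>6}: full={full:.2e}  swa={swa:.2e}  ratio={full / swa:.0f}x")
```

At 19,000 frames the full-attention cost is over two orders of magnitude larger, which is why chunked processing with bounded-size memory is essential for sequences of this length.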
Key Takeaways
- LoGeR scales dense 3D reconstruction to 19,000-frame video sequences without post-optimization, trained on just 128-frame clips
- The hybrid memory architecture combines Test-Time Training for global consistency and Sliding Window Attention for local precision
- The system achieves a 74% reduction in trajectory error on KITTI compared to prior feedforward methods
- Successfully processes minute-long videos in a single forward pass, eliminating need for bundle adjustment
- The collaboration between Google and UC Berkeley demonstrates that learned architectures can match optimization-based methods for long-sequence reconstruction