Researchers have introduced MemDreamer, a framework that enables Vision-Language Models to understand hours-long videos while keeping the reasoning context to about 2% of what ingesting the full video would require. Published on arXiv by Cong Chen and colleagues on June 5, 2026, the system achieves state-of-the-art results across four benchmarks and narrows the gap with human experts to 3.7 points.
Hierarchical Graph Memory Solves Token Explosion Problem
Current Vision-Language Models face a critical limitation when processing long videos: the token explosion problem causes prohibitive computational costs and attention dilution. MemDreamer addresses this by decoupling perception from reasoning and constructing a Hierarchical Graph Memory as videos stream incrementally. The framework uses a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph that captures spatiotemporal and causal relations across the video content.
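To make the idea of an incrementally built, tiered memory more concrete, here is a minimal Python sketch. It is not the authors' implementation: the class and method names (`HierarchicalGraphMemory`, `add_clip`, `link`), the tier numbering, the `group_size` parameter, and the string-joining "summary" are all assumptions made for illustration. The article describes the architecture as top-down, whereas this toy version builds abstractions bottom-up as clips arrive, purely because that is the simplest thing to show.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    tier: int        # 0 = foundational graph, 1 = middle, 2 = top (assumed numbering)
    summary: str
    span: tuple      # (start_seconds, end_seconds) covered by this node
    children: list = field(default_factory=list)


class HierarchicalGraphMemory:
    """Toy three-tier memory that grows as clips stream in (hypothetical API)."""

    TOP_TIER = 2

    def __init__(self, group_size=2):
        self.nodes = {}                  # node_id -> Node
        self.edges = []                  # (src_id, dst_id, relation)
        self.group_size = group_size     # how many nodes get merged into one parent
        self._pending = {0: [], 1: []}   # nodes not yet abstracted, per tier
        self._last_clip = None

    def _new_node(self, tier, summary, span, children=()):
        node = Node(f"t{tier}-{len(self.nodes)}", tier, summary, span, list(children))
        self.nodes[node.node_id] = node
        return node

    def add_clip(self, summary, start, end):
        """Ingest one streamed clip as a tier-0 node and link it to the previous clip."""
        node = self._new_node(0, summary, (start, end))
        if self._last_clip is not None:
            self.edges.append((self._last_clip, node.node_id, "temporal"))
        self._last_clip = node.node_id
        self._pending[0].append(node.node_id)
        self._abstract(0)
        return node.node_id

    def link(self, src_id, dst_id, relation):
        """Record an extra relation between existing nodes, e.g. 'spatial' or 'causal'."""
        self.edges.append((src_id, dst_id, relation))

    def _abstract(self, tier):
        """Merge a full group of pending nodes into one parent on the next tier."""
        if tier >= self.TOP_TIER:
            return
        pending = self._pending[tier]
        if len(pending) < self.group_size:
            return
        group = [self.nodes[i] for i in pending]
        self._pending[tier] = []
        parent = self._new_node(
            tier + 1,
            " | ".join(n.summary for n in group),  # stand-in for a model-written summary
            (group[0].span[0], group[-1].span[1]),
            [n.node_id for n in group],
        )
        if tier + 1 < self.TOP_TIER:
            self._pending[tier + 1].append(parent.node_id)
            self._abstract(tier + 1)


# Usage with made-up clip descriptions (10 seconds each).
memory = HierarchicalGraphMemory(group_size=2)
clips = ["chef chops onions", "chef heats pan", "onions go into pan", "dish is plated"]
ids = [memory.add_clip(text, 10.0 * i, 10.0 * (i + 1)) for i, text in enumerate(clips)]
memory.link(ids[1], ids[2], "causal")
tiers = sorted(n.tier for n in memory.nodes.values())
top = [n for n in memory.nodes.values() if n.tier == 2][0]
```

With four clips and a group size of two, the sketch ends up with four foundational nodes, two middle-tier nodes, and one top-tier node spanning the whole video. A real system would presumably replace the string join with model-generated summaries and infer the spatial and causal edges rather than adding them by hand.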
The system employs two specialized subsystems working in tandem:
- DEFLEXPRESSOR: A lambda-calculus symbolic regression model that generates higher-order formulas
- DEFLEXFORMER: A decomposable deep energy model that learns unified representations across scales
DEFLEXPRESSOR generates synthetic data to pre-train DEFLEXFORMER, which then guides formula discovery by decoupling multiscale latent relationships.
Agentic Retrieval Enables Efficient Long-Context Processing
During inference, MemDreamer's reasoning model employs agentic tool-augmented retrieval to navigate the hierarchical memory structure. The system uses an Observation-Reason-Action loop to search nodes and traverse logical edges, effectively exploring the video content without processing the entire sequence at once. This approach constrains the reasoning context window to merely 2% of what full-context ingestion would require.
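The loop described above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not MemDreamer's actual retrieval code: the graph layout, the function names `run_ora_loop` and `keyword_policy`, and the action names (`answer`, `traverse`, `stop`) are all hypothetical, and a simple keyword match stands in for the reasoning model.

```python
def keyword_policy(observation, question_terms, visited):
    """Stand-in for the reasoning model: decide the next action from one observation."""
    text = observation["text"].lower()
    if all(term in text for term in question_terms):
        return "answer", observation["text"]
    for _relation, target in observation["edges"]:
        if target not in visited:
            return "traverse", target
    return "stop", None


def run_ora_loop(graph, start_id, question_terms, policy, max_steps=8):
    """Explore the memory graph one node at a time instead of reading all of it."""
    current, visited = start_id, []
    for _ in range(max_steps):
        # Observation: only the current node and its outgoing edges enter the context.
        node = graph[current]
        observation = {"id": current, "text": node["text"], "edges": node["edges"]}
        visited.append(current)
        # Reason: the policy chooses what to do with this observation.
        action, argument = policy(observation, question_terms, visited)
        # Action: answer, follow an edge, or give up.
        if action == "answer":
            return argument, visited
        if action == "stop":
            break
        current = argument
    return None, visited


# Made-up memory graph: each node lists (relation, target_id) edges.
graph = {
    "video": {"text": "cooking video overview",
              "edges": [("contains", "scene-1"), ("contains", "scene-2")]},
    "scene-1": {"text": "chef chops onions", "edges": [("next", "scene-2")]},
    "scene-2": {"text": "onions burn because the pan is too hot", "edges": []},
}
answer, path = run_ora_loop(graph, "video", ["onions", "burn"], keyword_policy)
```

Here the loop starts at the top-level node, follows edges through `scene-1` to `scene-2`, and stops once a node contains every question term. The point of the design is that each step puts only a single node and its edges in front of the model, which is how the context can stay small relative to the full video.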
The results demonstrate substantial improvements over existing approaches:
- Achieves state-of-the-art performance across four mainstream video understanding benchmarks
- Delivers a 12.5 point absolute accuracy gain compared to baseline methods
- Closes the performance gap with human experts to only 3.7 points
- Processes hours of video while keeping the reasoning context to about 2% of what full-context ingestion would require
Strong Correlation Between Logic Reasoning and Video Understanding
The research team's statistical analysis revealed a strong positive linear correlation between a VLM's performance on logic reasoning tasks and its performance on long-video understanding benchmarks. The finding suggests that scaling agentic capability could be a new paradigm for multimodal comprehension: models that reason better also tend to understand long videos better.
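The article does not say which statistic the authors used. A common way to quantify a linear relationship of this kind is the Pearson correlation coefficient, and the sketch below shows how it is computed. The function name `pearson_r` and the input values are assumptions for illustration only; the numbers are toy data, not scores from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means (unnormalized covariance).
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return covariance / (spread_x * spread_y)


# Toy inputs only (not benchmark scores): y is an exact linear function of x.
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(toy_x, toy_y)
```

A value near 1 indicates a strong positive linear relationship, a value near -1 a strong negative one, and a value near 0 little linear relationship. A high coefficient on its own does not show that one ability causes the other.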
Keeping the reasoning context to 2% of full-context ingestion targets the scaling problem that has kept previous models from handling extended video content. By treating long-video understanding as an agentic exploration process rather than processing the entire sequence at once, MemDreamer indicates that hierarchical memory combined with targeted retrieval can match or exceed full-context approaches at a fraction of the computational cost.
Key Takeaways
- MemDreamer keeps its reasoning context to about 2% of what full-context ingestion would require when processing hours-long videos, addressing the token explosion problem that limits current Vision-Language Models
- The system achieves state-of-the-art results across four video understanding benchmarks with a 12.5 point accuracy gain over baselines
- Performance gap with human experts has been narrowed to just 3.7 points through hierarchical graph memory and agentic retrieval
- The research reveals a strong positive linear correlation between logic reasoning ability and long-video understanding, suggesting agentic capability scaling as a new paradigm
- The framework uses a plug-and-play architecture with DEFLEXPRESSOR and DEFLEXFORMER subsystems working together to build semantic abstractions