MemDreamer Achieves Near-Human Video Understanding Using Just 2% of Full Context

Researchers have introduced MemDreamer, a framework that enables Vision-Language Models to understand hours-long videos while using only 2% of the full context window. Published on arXiv by Cong Chen and colleagues on June 5, 2026, the system achieves state-of-the-art results across four benchmarks and narrows the gap with human experts to just 3.7 points.

Hierarchical Graph Memory Solves Token Explosion Problem

Current Vision-Language Models face a critical limitation when processing long videos: the token explosion problem causes prohibitive computational costs and attention dilution. MemDreamer addresses this by decoupling perception from reasoning and constructing a Hierarchical Graph Memory as videos stream incrementally. The framework uses a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph that captures spatiotemporal and causal relations across the video content.
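To make the idea concrete, here is a minimal sketch of a graph memory that grows as clips arrive. It is not the authors' implementation: the class names, the integer tier numbering, the relation labels ("before", "causes"), and the example clips are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    tier: int        # 0 = foundational graph, higher = more abstract (assumed numbering)
    summary: str     # short text description of this span of video
    start_s: float
    end_s: float
    children: list = field(default_factory=list)  # ids of nodes one tier below


@dataclass
class HierarchicalGraphMemory:
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: list = field(default_factory=list)  # (src_id, relation, dst_id)

    def add_node(self, tier, summary, start_s, end_s, children=()):
        node_id = len(self.nodes)
        self.nodes[node_id] = Node(node_id, tier, summary, start_s, end_s, list(children))
        return node_id

    def add_edge(self, src_id, relation, dst_id):
        # relation is a free-form label such as "before" or "causes" (hypothetical)
        self.edges.append((src_id, relation, dst_id))

    def neighbors(self, node_id, relation=None):
        # Follow outgoing edges, optionally filtered by relation label.
        return [dst for src, rel, dst in self.edges
                if src == node_id and (relation is None or rel == relation)]


memory = HierarchicalGraphMemory()
# Clips are added one at a time, as they would be while the video streams.
a = memory.add_node(0, "person fills a kettle", 0.0, 12.0)
b = memory.add_node(0, "kettle starts to boil", 12.0, 95.0)
memory.add_edge(a, "before", b)   # temporal relation
memory.add_edge(a, "causes", b)   # causal relation
# A higher-tier node abstracts over the two foundational nodes.
scene = memory.add_node(1, "making tea in the kitchen", 0.0, 95.0, children=[a, b])
```

In this sketch the reasoning model never needs raw frames: it sees only short summaries and the edges between them, which is what keeps the context small.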

The system employs two specialized subsystems working in tandem:

DEFLEXPRESSOR: A lambda-calculus symbolic regression model that generates higher-order formulas
DEFLEXFORMER: A decomposable deep energy model that learns unified representations across scales
DEFLEXPRESSOR generates synthetic data to pre-train DEFLEXFORMER, which then guides formula discovery by decoupling multiscale latent relationships.

Agentic Retrieval Enables Efficient Long-Context Processing

During inference, MemDreamer's reasoning model employs agentic tool-augmented retrieval to navigate the hierarchical memory structure. The system uses an Observation-Reason-Action loop to search nodes and traverse logical edges, effectively exploring the video content without processing the entire sequence at once. This approach constrains the reasoning context window to merely 2% of what full-context ingestion would require.
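The loop described above can be sketched in a few lines. This is an illustrative toy, not MemDreamer's code: the memory contents, the `search_children` tool, and the keyword-overlap policy standing in for the reasoning model are all hypothetical.

```python
# Toy memory: node id -> (summary, child ids). Contents are invented for illustration.
MEMORY = {
    "root": ("cooking video", ["prep", "cook"]),
    "prep": ("chopping vegetables", ["prep_1"]),
    "cook": ("frying in a pan", ["cook_1", "cook_2"]),
    "prep_1": ("onion is diced on a board", []),
    "cook_1": ("oil is heated", []),
    "cook_2": ("onion is added to the pan and browned", []),
}


def search_children(node_id):
    """Tool call: return (child_id, summary) pairs for one node."""
    return [(child, MEMORY[child][0]) for child in MEMORY[node_id][1]]


def keyword_policy(question, observation):
    """Stand-in for the reasoning model: pick the child sharing the most words with the question."""
    q_words = set(question.lower().split())
    scored = [(len(q_words & set(text.split())), child) for child, text in observation]
    _, best_child = max(scored)
    return best_child


def answer_by_exploration(question, max_steps=5):
    current, visited = "root", ["root"]
    for _ in range(max_steps):
        observation = search_children(current)            # Observation
        if not observation:                                # leaf reached, stop exploring
            break
        current = keyword_policy(question, observation)   # Reason
        visited.append(current)                            # Action: move to the chosen node
    return MEMORY[current][0], visited


evidence, path = answer_by_exploration("when is the onion added to the pan")
```

At each step the loop reads only the summaries of one node's children, so the text passed to the model stays small no matter how long the video is.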

The results demonstrate substantial improvements over existing approaches:

Achieves state-of-the-art performance across four mainstream video understanding benchmarks
Delivers a 12.5-point absolute accuracy gain over baseline methods
Narrows the performance gap with human experts to only 3.7 points
Processes hours of video content while maintaining a minimal memory footprint

Strong Correlation Between Logic Reasoning and Video Understanding

The research team's statistical analysis revealed a strong positive linear correlation between a VLM's performance on logic reasoning tasks and its performance on long-video understanding benchmarks. This finding suggests that scaling agentic capability may be a new paradigm for multimodal comprehension, in which stronger reasoning carries over to better video understanding.
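A linear correlation of this kind is usually measured with the Pearson coefficient. The sketch below shows the calculation; the two score lists are placeholder values made up for this example, not results from the paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Placeholder scores for five hypothetical models (NOT data from the paper).
logic_scores = [40, 50, 60, 70, 80]
video_scores = [45, 52, 58, 69, 76]

r = pearson_r(logic_scores, video_scores)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 indicates a strong positive linear relationship. Correlation alone does not show that better reasoning causes better video understanding.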

The 2% context budget is the framework's main efficiency result: it sidesteps the scaling problem that has kept earlier models from handling extended video. By recasting long-video understanding as agentic exploration rather than single-pass processing of the entire sequence, MemDreamer shows that hierarchical memory combined with targeted retrieval can match or exceed full-context approaches at a fraction of the computational cost.
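For a sense of scale, the arithmetic behind a 2% budget can be worked through directly. Only the 2% figure comes from the paper; the video length, sampling rate, and tokens per frame below are hypothetical values chosen for illustration.

```python
def context_budget(hours, frames_per_second, tokens_per_frame, fraction_percent=2):
    """Return (full_context_tokens, reduced_budget_tokens) for a video."""
    frames = int(hours * 3600 * frames_per_second)
    full_tokens = frames * tokens_per_frame
    return full_tokens, full_tokens * fraction_percent // 100


# Hypothetical setup: a 2-hour video sampled at 1 frame/s, 256 visual tokens per frame.
full, budget = context_budget(hours=2, frames_per_second=1, tokens_per_frame=256)
print(full, budget)  # 1843200 36864
```

Under these assumed settings, full-context ingestion would need about 1.8 million tokens, while a 2% budget comes to roughly 37 thousand.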

Key Takeaways

MemDreamer uses only 2% of the full context window to process hours-long videos, solving the token explosion problem that limits current Vision-Language Models
The system achieves state-of-the-art results across four video understanding benchmarks with a 12.5-point accuracy gain over baselines
The performance gap with human experts has been narrowed to just 3.7 points through hierarchical graph memory and agentic retrieval
The research reveals a strong positive correlation between logic reasoning ability and long-video understanding, pointing to agentic capability scaling as a new paradigm
The framework uses a plug-and-play architecture with DEFLEXPRESSOR and DEFLEXFORMER subsystems working together to build semantic abstractions
