Researchers have identified an architectural flaw in multimodal Mixture-of-Experts models that causes them to accurately perceive visual content but fail at subsequent reasoning tasks, even when they can correctly solve identical problems presented as pure text. The study, published on arXiv on April 9, 2026, proposes a routing-guided intervention method that improves performance by up to 3.17% on complex visual reasoning benchmarks.
Cross-Modal Routing Divergence Disrupts Expert Activation
The research team led by Haolei Xu analyzed multiple multimodal MoE architectures and discovered that the problem stems from routing distraction rather than semantic misalignment. Their systematic analysis revealed three key findings: cross-modal semantic sharing exists in MoE architectures, visual experts and domain experts exhibit layer-wise separation, and image inputs induce significant routing divergence from text inputs in middle layers where domain experts concentrate.
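The middle-layer divergence the authors describe can be made concrete by comparing per-layer expert usage distributions for text versus image inputs. The sketch below is illustrative, not the paper's method: the activation frequencies are simulated with NumPy (in practice they would be collected from the router's top-k selections), and the symmetric KL divergence is one plausible choice of divergence measure.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions, with smoothing."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def routing_divergence(text_freqs, image_freqs):
    """Symmetric KL divergence between per-layer expert usage distributions."""
    return [0.5 * (kl(t, i) + kl(i, t)) for t, i in zip(text_freqs, image_freqs)]

rng = np.random.default_rng(0)
num_layers, num_experts = 24, 64
text = rng.dirichlet(np.ones(num_experts), size=num_layers)
# Perturb only the middle layers to mimic the reported middle-layer divergence
image = text.copy()
image[8:16] = rng.dirichlet(np.ones(num_experts), size=8)

div = routing_divergence(text, image)
print(max(range(num_layers), key=lambda l: div[l]))  # a middle layer (8..15)
```

Layers with identical routing yield zero divergence, so the peak falls where image inputs reroute tokens away from the experts that text inputs would select.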
Routing-Guided Intervention Shows Consistent Improvements
The proposed solution enhances domain expert activation when processing visual inputs. Testing across three multimodal MoE models and six benchmarks demonstrated consistent improvements over baselines, with gains of up to 3.17% on complex visual reasoning tasks. The intervention identifies domain experts that represent cognitive functions rather than sample-specific solutions, enabling it to transfer effectively across different tasks.
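One simple way to realize such an intervention is to bias the router logits of pre-identified domain experts before top-k selection. The sketch below is a hedged illustration of that idea, not the paper's implementation: the expert indices, the bias strength `alpha` (set deliberately large here so the effect is visible), and the 16-expert layer are all assumptions.

```python
import numpy as np

def topk_route(logits, k=2):
    """Standard top-k routing: pick k experts, softmax their logits."""
    idx = np.argsort(logits)[-k:][::-1]
    weights = np.exp(logits[idx] - logits[idx].max())
    return idx, weights / weights.sum()

def guided_route(logits, domain_experts, alpha, k=2):
    """Routing-guided intervention: nudge domain-expert logits before top-k."""
    biased = logits.copy()
    biased[domain_experts] += alpha
    return topk_route(biased, k)

rng = np.random.default_rng(1)
logits = rng.normal(size=16)      # one visual token's router logits, 16 experts
domain = [3, 7]                   # assumed indices of reasoning/domain experts
base_idx, _ = topk_route(logits)
guided_idx, _ = guided_route(logits, domain, alpha=10.0)
print(sorted(base_idx.tolist()), sorted(guided_idx.tolist()))
```

With a large enough bias the guided router selects the domain experts regardless of the token's original preferences; a practical system would tune `alpha` so it reinforces, rather than overrides, the learned routing.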
Architectural Issue Rather Than Fundamental Capability Deficit
The research demonstrates that the reasoning capabilities exist within these models but are not being properly activated for visual inputs. When the routing mechanism fails to adequately activate task-relevant reasoning experts, the models can perceive what they see but struggle to think about it effectively. This finding suggests that architectural improvements to routing mechanisms could unlock existing capabilities without requiring fundamental model retraining.
Key Takeaways
- Multimodal MoE models accurately perceive images yet fail at reasoning tasks they can solve when the same problems are presented as pure text
- The problem stems from routing distraction where visual inputs fail to activate task-relevant reasoning experts in middle layers
- Routing-guided intervention method improves performance by up to 3.17% across six benchmarks
- The issue is architectural rather than a fundamental capability deficit—reasoning ability exists but isn't properly activated
- Domain expert identification enables effective transfer of intervention methods across different tasks