Researchers have identified an architectural flaw in multimodal Mixture-of-Experts models that causes them to accurately perceive visual content but fail at subsequent reasoning tasks, even when they can correctly solve identical problems presented as pure text. The study, published on arXiv on April 9, 2026, proposes a routing-guided intervention method that improves performance by up to 3.17% on complex visual reasoning benchmarks.
Cross-Modal Routing Divergence Disrupts Expert Activation
The research team led by Haolei Xu analyzed multiple multimodal MoE architectures and discovered that the problem stems from routing distraction rather than semantic misalignment. Their systematic analysis revealed three key findings: cross-modal semantic sharing exists in MoE architectures, visual experts and domain experts exhibit layer-wise separation, and image inputs induce significant routing divergence from text inputs in middle layers where domain experts concentrate.
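The middle-layer divergence the authors describe can be made concrete by comparing per-layer expert usage distributions for text versus image inputs. The sketch below is illustrative, not the paper's method: the activation frequencies are simulated with NumPy (in practice they would be collected from the router's top-k selections), and the symmetric KL divergence is one plausible choice of divergence measure.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions, with smoothing."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def routing_divergence(text_freqs, image_freqs):
    """Symmetric KL divergence between per-layer expert usage distributions."""
    return [0.5 * (kl(t, i) + kl(i, t)) for t, i in zip(text_freqs, image_freqs)]

rng = np.random.default_rng(0)
num_layers, num_experts = 24, 64
text = rng.dirichlet(np.ones(num_experts), size=num_layers)
# Perturb only the middle layers to mimic the reported middle-layer divergence
image = text.copy()
image[8:16] = rng.dirichlet(np.ones(num_experts), size=8)

div = routing_divergence(text, image)
print(max(range(num_layers), key=lambda l: div[l]))  # a middle layer (8..15)
```

Layers with identical routing yield zero divergence, so the peak falls where image inputs reroute tokens away from the experts that text inputs would select.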
Routing-Guided Intervention Shows Consistent Improvements
The proposed solution enhances domain expert activation when processing visual inputs. Testing across three multimodal MoE models and six benchmarks demonstrated consistent improvements over baselines, with gains of up to 3.17% on complex visual reasoning tasks. The intervention identifies domain experts that represent cognitive functions rather than sample-specific solutions, enabling it to transfer effectively across different tasks.
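One simple way to realize such an intervention is to bias the router logits of pre-identified domain experts before top-k selection. The sketch below is a hedged illustration of that idea, not the paper's implementation: the expert indices, the bias strength `alpha` (set deliberately large here so the effect is visible), and the 16-expert layer are all assumptions.

```python
import numpy as np

def topk_route(logits, k=2):
    """Standard top-k routing: pick k experts, softmax their logits."""
    idx = np.argsort(logits)[-k:][::-1]
    weights = np.exp(logits[idx] - logits[idx].max())
    return idx, weights / weights.sum()

def guided_route(logits, domain_experts, alpha, k=2):
    """Routing-guided intervention: nudge domain-expert logits before top-k."""
    biased = logits.copy()
    biased[domain_experts] += alpha
    return topk_route(biased, k)

rng = np.random.default_rng(1)
logits = rng.normal(size=16)      # one visual token's router logits, 16 experts
domain = [3, 7]                   # assumed indices of reasoning/domain experts
base_idx, _ = topk_route(logits)
guided_idx, _ = guided_route(logits, domain, alpha=10.0)
print(sorted(base_idx.tolist()), sorted(guided_idx.tolist()))
```

With a large enough bias the guided router selects the domain experts regardless of the token's original preferences; a practical system would tune `alpha` so it reinforces, rather than overrides, the learned routing.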
Architectural Issue Rather Than Fundamental Capability Deficit
The research demonstrates that the reasoning capabilities exist within these models but are not being properly activated for visual inputs. When the routing mechanism fails to adequately activate task-relevant reasoning experts, the models can perceive what they see but struggle to think about it effectively. This finding suggests that architectural improvements to routing mechanisms could unlock existing capabilities without requiring fundamental model retraining.
Key Takeaways
- Multimodal MoE models accurately perceive images yet fail at reasoning tasks they can solve when the same problems are presented as pure text
- The problem stems from routing distraction where visual inputs fail to activate task-relevant reasoning experts in middle layers
- Routing-guided intervention method improves performance by up to 3.17% across six benchmarks
- The issue is architectural rather than a fundamental capability deficit—reasoning ability exists but isn't properly activated
- Domain expert identification enables effective transfer of intervention methods across different tasks