OpenGVLab and Shanghai AI Lab released InternVL 3.5, an open-source multimodal large language model that matches GPT-4o's vision-language reasoning capabilities while remaining fully deployable on-premises. The flagship InternVL3.5-241B-A28B model contains 241 billion total parameters, of which 28 billion are activated per token. It achieves state-of-the-art results among open-source MLLMs, surpassing GPT-4o on RealWorldQA and outperforming it by a wider margin on MME-RealWorld.
Cascade Reinforcement Learning Drives 16% Reasoning Improvement
InternVL 3.5 introduces Cascade RL, a two-stage reinforcement learning process that improves reasoning performance by 16.0% over InternVL3. The approach combines an offline RL stage for stable convergence with a subsequent online RL stage for refined alignment, addressing the challenge of improving reasoning capabilities in large multimodal models. This training strategy narrows the performance gap with leading commercial models such as GPT-5 while maintaining open-source accessibility.
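The offline-then-online structure can be illustrated with a toy policy-gradient sketch. Everything below is hypothetical: a three-action softmax policy stands in for the model, the offline stage applies a crude preference update on a fixed dataset (a stand-in for the paper's preference-based offline optimization), and the online stage runs a REINFORCE-style update on fresh rollouts (a stand-in for its online policy optimization).

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

class ToyPolicy:
    """A softmax policy over a handful of discrete actions."""
    def __init__(self, n_actions):
        self.logits = [0.0] * n_actions

    def probs(self):
        return softmax(self.logits)

    def sample(self, rng):
        r, acc = rng.random(), 0.0
        for a, p in enumerate(self.probs()):
            acc += p
            if r <= acc:
                return a
        return len(self.logits) - 1

def offline_stage(policy, preferences, lr=0.5):
    # Stage 1 (offline RL): push probability toward the preferred response
    # and away from the rejected one, using only a fixed dataset of
    # (chosen, rejected) pairs -- no fresh rollouts, so updates are stable.
    for chosen, rejected in preferences:
        policy.logits[chosen] += lr
        policy.logits[rejected] -= lr

def online_stage(policy, reward_fn, rng, steps=200, lr=0.1):
    # Stage 2 (online RL): sample rollouts from the current policy and
    # apply a REINFORCE-style update against a running reward baseline.
    baseline = 0.0
    for _ in range(steps):
        a = policy.sample(rng)
        r = reward_fn(a)
        baseline += 0.05 * (r - baseline)
        adv = r - baseline
        probs = policy.probs()
        for i in range(len(policy.logits)):
            grad = (1.0 if i == a else 0.0) - probs[i]
            policy.logits[i] += lr * adv * grad
```

The cascade ordering matters: the offline stage warm-starts the policy near a good region from static data, so the online stage begins sampling useful rollouts immediately instead of exploring from scratch.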
The model family includes specialized variants available on Hugging Face, including OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview and other configurations optimized for different deployment scenarios. The ArXiv paper (2508.18265) "InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency" provides detailed technical documentation.
Visual Resolution Router and Decoupled Deployment Enable Efficient Scaling
Two key innovations enable practical deployment of the massive model. The Visual Resolution Router (ViR) dynamically adjusts the resolution of visual tokens, maintaining performance while optimizing compute resources for efficient processing of high-resolution images. Decoupled Vision-Language Deployment (DvD) separates the vision encoder and language model across different GPUs, effectively balancing computational load and enabling deployment of models that would otherwise require impractical hardware configurations.
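The routing idea behind ViR can be sketched as a per-tile token budget: low-detail image tiles get a compressed token allocation while detailed tiles keep the full budget. The scoring function, threshold, and budgets below are invented for illustration; the actual ViR is a learned router inside the model, not a hand-written rule.

```python
def route_tile(detail_score, budget_high=256, budget_low=64, threshold=0.5):
    # Hypothetical router: a tile whose detail score clears the threshold
    # keeps the full visual-token budget; otherwise it is compressed 4x.
    return budget_high if detail_score >= threshold else budget_low

def total_visual_tokens(detail_scores, **kwargs):
    # Sum the per-tile budgets for one high-resolution image.
    return sum(route_tile(s, **kwargs) for s in detail_scores)
```

For an image split into four tiles with detail scores [0.9, 0.1, 0.6, 0.2], this sketch spends 640 tokens instead of the uniform 1024, which is the kind of compute saving that makes high-resolution inputs cheaper without discarding detailed regions.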
Together, these optimizations deliver a 4.05x inference speedup while maintaining or improving accuracy across benchmarks. The efficiency gains make on-premises deployment viable for organizations that previously required cloud-based commercial APIs for equivalent capabilities.
Novel Agentic Capabilities Extend Beyond Traditional Vision-Language Tasks
InternVL 3.5 introduces capabilities extending into GUI interaction and embodied agency, enabling the model to operate graphical interfaces and act as an agent in physical environments. These agentic workflows represent an expansion beyond traditional vision-language tasks like image captioning and visual question answering, positioning the model for robotics and autonomous-systems applications.
The GitHub repository (OpenGVLab/InternVL) is tagged as "CVPR 2024 Oral" and described as "A Pioneering Open-Source Alternative to GPT-4o." Integration with the Axolotl training framework is documented, facilitating fine-tuning for domain-specific applications. The sparse MoE architecture enables deployment without data center-scale infrastructure, addressing a major barrier to on-premises multimodal AI deployment.
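The 241B-total versus 28B-active split follows from sparse top-k expert routing: each token's gate selects only a few experts, so per-token compute touches the shared layers plus those experts rather than the whole network. The layer sizes below (in billions) are invented purely so the totals mirror the article's figures; the real expert count and sizes are not given here.

```python
def topk_experts(gate_scores, k=2):
    # Select the k highest-scoring experts for one token;
    # only these experts execute during the forward pass.
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    return sorted(ranked[:k])

def total_params(n_experts, params_per_expert, shared_params):
    # Parameters stored on disk / in memory: shared layers + all experts.
    return shared_params + n_experts * params_per_expert

def active_params(params_per_expert, shared_params, k=2):
    # Parameters actually used per token: shared layers + k experts only.
    return shared_params + k * params_per_expert
```

With hypothetical figures of 22B shared parameters, 73 experts of 3B each, and k=2, the model stores 241B parameters but activates only 28B per token, which is why inference fits on far smaller hardware than a dense 241B model would need.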
Key Takeaways
- InternVL3.5-241B-A28B surpasses GPT-4o on RealWorldQA and considerably outperforms it on MME-RealWorld while matching performance on R-Bench
- Cascade RL delivers a 16.0% gain in overall reasoning performance over InternVL3 through two-stage offline and online reinforcement learning
- The Visual Resolution Router and Decoupled Vision-Language Deployment achieve 4.05x inference speedup while maintaining accuracy
- Sparse MoE architecture activates 28B of 241B parameters, enabling on-premises deployment without data center infrastructure
- Novel capabilities include GUI interaction and embodied agency for robotics and autonomous systems applications beyond traditional vision-language tasks