Vortex Enables AI Agents to Auto-Generate Sparse Attention Algorithms, With Up to 4.7x Higher Throughput

Researchers have introduced Vortex, a system that lets AI agents automatically generate and optimize sparse attention algorithms for large language models. It achieves up to 4.7x higher throughput on the MLA-based GLM-4.7-Flash model and scales to a model with 229 billion parameters. The system targets a growing bottleneck in LLM serving: as generation lengths increase, sparse attention becomes increasingly necessary for practical deployment.

Vortex Combines Programming Language With Efficient Backend

Vortex integrates a Python-embedded frontend language built on a page-centric tensor abstraction with an efficient backend tightly integrated into modern LLM serving stacks. This architecture enables researchers and AI agents to rapidly prototype, deploy, and evaluate a broad range of sparse attention algorithms without the typical engineering overhead. The system effectively translates theoretical efficiency gains from sparse attention into real-world throughput improvements.
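To make the page-centric idea concrete, here is a minimal sketch of page-level sparse attention in plain Python. It is not Vortex's frontend language or API, which this article does not show. The function names (`select_pages`, `page_sparse_attention`) and the scoring rule (query dotted with each page's mean key) are assumptions chosen for illustration only.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(q, keys, values):
    """Softmax attention for a single query vector over the given keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, k) * scale for k in keys]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def select_pages(q, keys, page_size, top_k):
    """Score each page by q . (mean key of the page); return the top_k page indices, ascending.

    Hypothetical scoring rule, used here only as an illustration.
    """
    scores = []
    for start in range(0, len(keys), page_size):
        page = keys[start:start + page_size]
        mean_key = [sum(col) / len(page) for col in zip(*page)]
        scores.append((dot(q, mean_key), start // page_size))
    best = sorted(scores, reverse=True)[:top_k]
    return sorted(p for _, p in best)


def page_sparse_attention(q, keys, values, page_size, top_k):
    """Attend only over the tokens that live in the selected pages."""
    pages = select_pages(q, keys, page_size, top_k)
    idx = [t for p in pages
           for t in range(p * page_size, min((p + 1) * page_size, len(keys)))]
    return attend(q, [keys[t] for t in idx], [values[t] for t in idx]), pages


# Toy example: three pages of two tokens each; only the best-scoring page is read.
q = [1.0, 0.0]
keys = [[0.0, 1.0], [0.0, 1.0], [2.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]]
values = [[1.0], [1.0], [5.0], [5.0], [9.0], [9.0]]
out, pages = page_sparse_attention(q, keys, values, page_size=2, top_k=1)
print(pages, out)  # [1] [5.0]
```

With `top_k` equal to the number of pages, the sketch reduces to full attention. Smaller values read fewer pages of the KV cache, which is the source of the theoretical savings that a serving backend then has to realize in practice.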

AI Agents Automatically Generate High-Performance Algorithms

Using Vortex, AI agents autonomously generated and refined diverse sparse attention algorithms. The best-performing agent-generated algorithm achieved up to 3.46x higher throughput compared to full attention while preserving model accuracy. This demonstrates Vortex's ability to substantially accelerate the design and iteration cycle for sparse attention research.

System Extends to Emerging Architectures and Large-Scale Models

Vortex has also been applied to emerging architectures and very large models that are otherwise difficult to experiment with. On the MLA-based GLM-4.7-Flash model, the system achieved up to 4.7x higher throughput. On the 229B-parameter MiniMax-M2.7 model running on NVIDIA B200 GPUs, Vortex delivered a 1.37x throughput improvement. These results indicate that the system scales across different model architectures and sizes.

Key Takeaways

Vortex combines a Python-embedded frontend language with an efficient backend integrated into LLM serving stacks for rapid sparse attention algorithm development
AI agents using Vortex automatically generated algorithms achieving up to 3.46x higher throughput than full attention while maintaining accuracy
The system achieved up to 4.7x higher throughput on the MLA-based GLM-4.7-Flash model
On the 229B-parameter MiniMax-M2.7 model with NVIDIA B200 GPUs, Vortex delivered a 1.37x throughput gain
The research was published on arXiv (paper 2606.06453) on June 4, 2026, by authors including Zhuoming Chen, Xinrui Zhong, and Beidi Chen
