Researchers have introduced Vortex, a system that lets AI agents automatically generate and optimize sparse attention algorithms for large language models. It reports throughput gains of up to 4.7x and has been applied to models with as many as 229 billion parameters. The system targets a growing bottleneck in LLM serving: as generation lengths increase, sparse attention becomes increasingly necessary for practical deployment.
Vortex Combines Programming Language With Efficient Backend
Vortex integrates a Python-embedded frontend language built on a page-centric tensor abstraction with an efficient backend tightly integrated into modern LLM serving stacks. This architecture enables researchers and AI agents to rapidly prototype, deploy, and evaluate a broad range of sparse attention algorithms without the typical engineering overhead. The system effectively translates theoretical efficiency gains from sparse attention into real-world throughput improvements.
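To make the page-centric idea concrete, here is a minimal NumPy sketch of page-level sparse attention for a single query. It is not Vortex's frontend language or API, which the summary does not show. The function name `page_sparse_attention`, the parameters `page_size` and `top_k`, and the mean-key scoring heuristic are all assumptions chosen for illustration. The sketch groups the key/value cache into fixed-size pages, scores each page cheaply, and runs softmax attention over only the selected pages.

```python
import numpy as np


def page_sparse_attention(q, K, V, page_size=16, top_k=4):
    """Attend over only the top_k highest-scoring KV pages (illustrative only).

    q: (d,) query vector
    K: (n, d) cached keys, V: (n, d_v) cached values
    n must be a multiple of page_size.
    Returns (output of shape (d_v,), sorted indices of the selected pages).
    """
    n, d = K.shape
    if n % page_size != 0:
        raise ValueError("n must be a multiple of page_size")
    num_pages = n // page_size
    top_k = min(top_k, num_pages)

    # View the cache as pages: (num_pages, page_size, feature_dim).
    K_pages = K.reshape(num_pages, page_size, d)
    V_pages = V.reshape(num_pages, page_size, V.shape[1])

    # Hypothetical cheap page score: query dotted with each page's mean key.
    page_scores = K_pages.mean(axis=1) @ q  # shape (num_pages,)

    # Keep the top_k pages, sorted so tokens stay in their original order.
    selected = np.sort(np.argsort(page_scores)[-top_k:])

    # Gather only the selected pages and flatten them back to token rows.
    K_sel = K_pages[selected].reshape(-1, d)
    V_sel = V_pages[selected].reshape(-1, V.shape[1])

    # Standard scaled dot-product softmax attention over the reduced set.
    logits = K_sel @ q / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V_sel, selected
```

In this sketch, setting `top_k` to the total number of pages reproduces full attention exactly, while smaller values read only `top_k * page_size` cached tokens per query. That reduction in memory traffic is the kind of theoretical saving a serving backend has to turn into real throughput.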
AI Agents Automatically Generate High-Performance Algorithms
Using Vortex, AI agents autonomously generated and refined diverse sparse attention algorithms. The best-performing agent-generated algorithm achieved up to 3.46x higher throughput compared to full attention while preserving model accuracy. This demonstrates Vortex's ability to substantially accelerate the design and iteration cycle for sparse attention research.
System Extends to Emerging Architectures and Large-Scale Models
Vortex has also been applied to emerging architectures and very large models that are otherwise difficult to experiment with. On the MLA-based GLM-4.7-Flash model, the system achieved up to 4.7x higher throughput. On the 229B-parameter MiniMax-M2.7 model running on NVIDIA B200 GPUs, Vortex delivered a 1.37x throughput improvement. These results indicate that the system scales across different model architectures and sizes.
Key Takeaways
- Vortex combines a Python-embedded frontend language with an efficient backend integrated into LLM serving stacks for rapid sparse attention algorithm development
- AI agents using Vortex automatically generated algorithms achieving up to 3.46x higher throughput than full attention while maintaining accuracy
- The system achieved up to 4.7x higher throughput on the MLA-based GLM-4.7-Flash model
- On the 229B-parameter MiniMax-M2.7 model running on NVIDIA B200 GPUs, Vortex delivered a 1.37x throughput improvement
- The research was published on arXiv (paper 2606.06453) on June 4, 2026, by authors including Zhuoming Chen, Xinrui Zhong, and Beidi Chen