Researchers have introduced Vortex, a programmable sparse attention serving system that lets AI agents automatically generate and evaluate attention algorithms. Published June 4, 2026 on arXiv, the system achieved up to 4.7x higher throughput on the MLA-based GLM-4.7-Flash model and 1.37x higher throughput on the 229B-parameter MiniMax-M2.7 model. The best agent-generated algorithm reached 3.46x higher throughput than full attention while preserving accuracy.
Programmable Frontend Accelerates Algorithm Discovery
Vortex consists of a Python-embedded language built atop a page-centric tensor abstraction for expressing sparse attention algorithms, tightly integrated with modern LLM serving stacks. The system addresses a critical bottleneck: deploying and evaluating new sparse attention algorithms at scale remains highly engineering-intensive, slowing both human researchers and AI agents in exploring the design space.
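To make the page-centric idea concrete, here is a minimal pure-Python sketch of one common family of sparse attention: group the cached keys into fixed-size pages, score each page against the query, and attend only over the tokens in the top-scoring pages. This is not Vortex's language or API, which the summary above does not show. Every name here (`split_into_pages`, `page_summary`, `select_pages`, `sparse_attention`) and the choice of a mean-key page summary are illustrative assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_into_pages(n_tokens, page_size):
    """Return one list of token indices per KV-cache page."""
    return [list(range(start, min(start + page_size, n_tokens)))
            for start in range(0, n_tokens, page_size)]


def page_summary(keys, page):
    """Mean key vector of a page (an assumed, hypothetical page-level summary)."""
    dim = len(keys[0])
    return [sum(keys[i][d] for i in page) / len(page) for d in range(dim)]


def select_pages(query, keys, pages, top_k):
    """Pick the top_k pages whose summary scores highest against the query."""
    scored = [(dot(query, page_summary(keys, p)), idx)
              for idx, p in enumerate(pages)]
    scored.sort(key=lambda t: (-t[0], t[1]))  # ties go to the earlier page
    return sorted(idx for _, idx in scored[:top_k])


def attend(query, keys, values, token_ids):
    """Scaled dot-product attention restricted to the given token indices."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in token_ids])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, token_ids))
            for d in range(dim)]


def sparse_attention(query, keys, values, page_size, top_k):
    pages = split_into_pages(len(keys), page_size)
    chosen = select_pages(query, keys, pages, top_k)
    token_ids = [i for p in chosen for i in pages[p]]
    return attend(query, keys, values, token_ids), chosen


if __name__ == "__main__":
    # Toy data: only the third page (tokens 4 and 5) aligns with the query.
    keys = [[0.0, 1.0]] * 4 + [[5.0, 0.0]] * 2 + [[0.0, 1.0]] * 2
    values = [[float(i)] for i in range(8)]
    out, chosen = sparse_attention([1.0, 0.0], keys, values, page_size=2, top_k=1)
    print(out, chosen)  # [4.5] [2]
```

Setting `top_k` to the total number of pages recovers full attention, which gives a convenient correctness check for any page-selection rule.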
The programmable frontend makes Vortex accessible to researchers without deep systems expertise, enabling rapid prototyping, deployment, and evaluation of sparse attention variants. AI agents use Vortex to automatically generate diverse algorithms, rapidly evaluate their performance, and refine approaches based on results.
Performance Across Frontier-Scale Models
Vortex delivers substantial throughput improvements across multiple architectures and scales:
- 4.7x higher throughput on MLA-based GLM-4.7-Flash on NVIDIA B200 GPUs
- 1.37x throughput improvement on the 229B-parameter MiniMax-M2.7 model on NVIDIA B200 GPUs
- 3.46x higher throughput than full attention for the best agent-generated algorithm, with accuracy preserved
The system extends sparse attention to emerging architectures like MLA (Multi-head Latent Attention) and very large models that are otherwise difficult to experiment with due to engineering complexity.
AI-Driven Algorithm Design Paradigm
Vortex represents a shift toward making AI systems programmable and explorable by other AI systems. Rather than humans hand-crafting attention algorithms, AI agents can now rapidly iterate through the design space, discovering novel attention patterns that humans might not consider.
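The generate, evaluate, and refine cycle can be sketched as a small search loop. The version below is a hedged illustration only: the candidate representation (`page_size` and `top_k`), the `mutate` step, and especially `toy_evaluate` are hypothetical stand-ins. The toy evaluator is a made-up cost model, not measured data, and a real agent would replace it with a benchmark run on an actual serving stack.

```python
import random


def toy_evaluate(candidate):
    """Hypothetical stand-in for a benchmark run (NOT measured data).

    Assumes a 4096-token context: attending to fewer tokens raises
    relative throughput and lowers accuracy.
    """
    attended = candidate["page_size"] * candidate["top_k"]
    throughput = 4096 / attended  # relative to full attention
    accuracy = min(1.0, 0.90 + 0.10 * attended / 1024)
    return throughput, accuracy


def mutate(candidate, rng):
    """Propose a neighbor by halving or doubling one parameter."""
    new = dict(candidate)
    key = rng.choice(["page_size", "top_k"])
    factor = rng.choice([0.5, 2])
    new[key] = max(1, int(new[key] * factor))
    return new


def search(evaluate, start, min_accuracy, steps, seed=0):
    """Hill-climb on throughput, rejecting candidates below min_accuracy."""
    rng = random.Random(seed)
    best = start
    best_tp, best_acc = evaluate(best)
    for _ in range(steps):
        cand = mutate(best, rng)
        tp, acc = evaluate(cand)
        if acc >= min_accuracy and tp > best_tp:
            best, best_tp, best_acc = cand, tp, acc
    return best, best_tp, best_acc


if __name__ == "__main__":
    start = {"page_size": 16, "top_k": 256}  # attends to all 4096 tokens
    print(search(toy_evaluate, start, min_accuracy=0.99, steps=50))
```

The accuracy floor is what keeps the loop honest: without it, the search would simply shrink the attended set until throughput peaked and quality collapsed.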
Because it is tightly integrated with modern LLM serving stacks, Vortex is not just a research prototype: it is designed for production deployment, where sparse attention's efficiency gains translate directly into lower cost and latency. Theoretical efficiency gains show up as measured throughput improvements.
Bridging Research and Production
By making sparse attention algorithms programmable and quickly deployable, Vortex enables a new paradigm of AI-driven algorithm discovery. Algorithms can be expressed concisely in the Python-embedded frontend and deployed immediately for evaluation on production-scale models.
The system's ability to handle 229B-parameter models and deliver measurable throughput improvements demonstrates that programmable sparse attention can work at frontier scale, not just in controlled research settings. This makes Vortex relevant for organizations running large-scale LLM inference where even modest throughput gains translate to significant cost reductions.
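A back-of-the-envelope calculation shows how the reported speedups map to cost. The sketch assumes a fixed workload on the same hardware, with cost proportional to GPU time, so a throughput gain of s cuts GPU time to 1/s of the baseline. Those assumptions are ours, not claims from the paper, and the function name is illustrative.

```python
def cost_reduction_percent(speedup):
    """Percent of GPU time saved for a fixed workload at the given speedup."""
    return (1.0 - 1.0 / speedup) * 100.0


if __name__ == "__main__":
    reported = [
        ("MiniMax-M2.7 (229B)", 1.37),
        ("best agent-generated algorithm", 3.46),
        ("GLM-4.7-Flash", 4.7),
    ]
    for name, speedup in reported:
        print(f"{name}: {cost_reduction_percent(speedup):.1f}% less GPU time")
    # MiniMax-M2.7 (229B): 27.0% less GPU time
    # best agent-generated algorithm: 71.1% less GPU time
    # GLM-4.7-Flash: 78.7% less GPU time
```

Under these assumptions, even the smallest reported gain of 1.37x removes roughly a quarter of the GPU time needed to serve the same traffic.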
Key Takeaways
- Vortex is a programmable sparse attention serving system that enables AI agents to automatically generate, deploy, and evaluate attention algorithms
- The system achieved 4.7x throughput improvement on GLM-4.7-Flash and 1.37x improvement on the 229B-parameter MiniMax-M2.7 model on NVIDIA B200 GPUs
- AI agent-generated algorithms reached up to 3.46x higher throughput than full attention while preserving model accuracy
- Vortex uses a Python-embedded language atop a page-centric tensor abstraction, making sparse attention research accessible without deep systems expertise
- The system bridges the gap between research ideas and production deployment, translating theoretical efficiency gains into real-world throughput improvements on frontier-scale models