vLLM released version 0.17.0 on March 7, 2026, integrating the FlashAttention-4 attention kernel into the widely-used LLM inference library. The release comprised 699 commits from 272 contributors, including 48 new contributors, making it one of the project's largest updates.
FlashAttention-4 Achieves 1613 TFLOPs/s on NVIDIA Blackwell GPUs
The integration comes immediately after the release of FlashAttention-4, which was designed specifically for NVIDIA Blackwell B200 GPUs. The kernel achieves 71% GPU utilization and up to 1613 TFLOPs/s, approaching the throughput of pure matrix multiplication.
Ted Zadouri, one of the FlashAttention-4 authors, explained the breakthrough: "Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast, exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed!"
The technical innovations include a redesigned pipeline that exploits fully asynchronous MMA operations with larger tile sizes, software-emulated exponentials and conditional softmax rescaling that cut non-matmul work, and the use of tensor memory and the 2-CTA MMA mode to reduce shared-memory traffic.
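The rescaling idea at the heart of this design can be illustrated in a few lines of plain Python. The sketch below is a didactic single-query version of online softmax with a conditional-rescaling guard; it is not the FlashAttention-4 kernel itself, and the function name, tile size, and list-based math are illustrative only:

```python
import math

def attention_one_query(q, ks, vs, tile=2):
    """Toy single-query attention computed tile by tile with online softmax.

    A minimal sketch of the rescaling trick behind FlashAttention-style
    kernels: the full score row is never materialized, and the running
    accumulator is rescaled only when the running max actually grows
    (the "conditional rescaling" that reduces non-matmul work).
    """
    d = len(q)
    m = float("-inf")          # running max of scores seen so far
    l = 0.0                    # running sum of exp(score - m)
    acc = [0.0] * len(vs[0])   # running weighted sum of value rows
    for start in range(0, len(ks), tile):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in ks[start:start + tile]]
        m_new = max(m, max(scores))
        if m_new > m:                     # conditional rescaling guard
            scale = math.exp(m - m_new)   # exp(-inf) == 0.0 on the first tile
            l *= scale
            acc = [a * scale for a in acc]
            m = m_new
        for s, v in zip(scores, vs[start:start + tile]):
            w = math.exp(s - m)
            l += w
            acc = [a + w * x for a, x in zip(acc, v)]
    return [a / l for a in acc]
```

Running it on a small example matches the result of a dense softmax over all keys to machine precision, which is the point of the online formulation: the same answer without ever holding the full score row.

```python
q = [1.0, 0.0]
ks = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
vs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
out = attention_one_query(q, ks, vs, tile=2)
```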
PyTorch Adds FlashAttention-4 Backend to FlexAttention
PyTorch announced support for FlashAttention-4 as a backend for its FlexAttention API at the same time. The integration lets researchers rapidly prototype custom attention mechanisms while benefiting from the performance optimizations designed for the Blackwell architecture.
FlexAttention has become a key tool for researchers experimenting with novel attention patterns, and the FlashAttention-4 backend provides production-grade performance for these experiments.
vLLM Makes Cutting-Edge Optimization Available to Production Users
The vLLM integration makes FlashAttention-4 optimizations immediately available to the production inference community. vLLM is one of the most popular open-source libraries for LLM serving and inference optimization, widely deployed in both research and production environments.
The 0.17.0 release represents significant community collaboration, with 48 new contributors joining the project. This level of engagement demonstrates vLLM's central role in the LLM infrastructure ecosystem.
The announcement drew a strong community response, with 60 retweets on the vLLM project account, while Zadouri's technical explanation garnered 756 likes, 128 retweets, 393 bookmarks, and 198,822 impressions, indicating substantial interest among ML practitioners.
Key Takeaways
- vLLM version 0.17.0 was released on March 7, 2026, with 699 commits from 272 contributors including 48 new contributors
- The release integrates FlashAttention-4, which achieves 71% GPU utilization and up to 1613 TFLOPs/s on NVIDIA Blackwell B200 GPUs
- FlashAttention-4 uses redesigned pipelines with asynchronous MMA operations, software-emulated exponential operations, and tensor memory optimization to reduce bottlenecks
- PyTorch added FlashAttention-4 backend to FlexAttention, enabling rapid prototyping of custom attention mechanisms with production-grade performance
- The integration makes cutting-edge optimization algorithms immediately available to production LLM inference deployments through vLLM's widely-adopted library