Researchers have released FlashAttention-4, a major optimization of the transformer attention mechanism that achieves up to 1,613 TFLOPS (71% utilization) on NVIDIA's new Blackwell B200 GPUs. The breakthrough addresses a critical bottleneck as the industry transitions from Hopper to Blackwell architectures, where existing attention kernels fail to exploit the hardware's full potential.
Blackwell's Asymmetric Hardware Scaling Creates New Bottlenecks
The Blackwell architecture presents a fundamental challenge: tensor core throughput doubled compared to Hopper H100 GPUs, but other functional units like shared memory bandwidth, exponential operations, and conditional operations scaled more slowly or remained unchanged. This asymmetric scaling means that operations like softmax and memory transfers now dominate execution time, preventing attention mechanisms from reaching theoretical peak performance.
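A back-of-the-envelope Amdahl-style estimate makes the asymmetry concrete. The per-cycle throughput numbers below are placeholders chosen purely for illustration, not measured Hopper or Blackwell figures:

```python
# Illustrative estimate of how doubling only matmul throughput shifts the
# bottleneck toward softmax. The ops-per-cycle numbers are made up for
# illustration; they are not real Hopper/Blackwell specs.

def attention_cycles(matmul_per_cycle, exp_per_cycle, seq=8192, d=128):
    matmul_flops = 4 * seq * d   # QK^T and PV matmuls for one query row
    exps = seq                   # one exponential per attention score
    return matmul_flops / matmul_per_cycle + exps / exp_per_cycle

hopper_like = attention_cycles(256, 4)     # baseline throughput ratios
blackwell_like = attention_cycles(512, 4)  # matmul 2x, exp unchanged
speedup = hopper_like / blackwell_like     # well below 2x
print(round(speedup, 2))
```

With these toy numbers, doubling matmul throughput yields only a 1.8x speedup: the fixed exponential cost claims a growing share of the runtime, which is exactly the effect FlashAttention-4 targets.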
FlashAttention-4 redesigns the algorithmic pipeline to work around these limitations. It introduces software-emulated exponentials and conditional softmax rescaling to cut non-matrix-multiplication work, uses tensor memory and the 2-CTA MMA mode to reduce shared-memory traffic, and exploits fully asynchronous MMA operations with larger tile sizes.
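The conditional-rescaling idea can be sketched in plain numpy: a streaming (online) softmax only rescales its running accumulators when a new key block actually raises the running row maximum, skipping that non-matmul work otherwise. This is a single-query illustration of the general trick, not the paper's kernel; the function and block-size names are invented here.

```python
import numpy as np

# Streaming softmax attention for one query vector, processing keys/values
# in blocks. The running output is rescaled only when the row max changes
# (the "conditional rescaling" idea); otherwise the rescale is skipped.

def streaming_attention(q, k, v, block=64):
    seq, d = k.shape
    m = -np.inf           # running row max
    l = 0.0               # running softmax denominator
    out = np.zeros(d)     # running (unnormalized) output
    for start in range(0, seq, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = kb @ q                      # scores for this block
        m_new = max(m, s.max())
        if m_new > m:                   # conditional rescale
            scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
            l *= scale
            out *= scale
            m = m_new
        p = np.exp(s - m)               # block softmax numerators
        l += p.sum()
        out += p @ vb
    return out / l
```

The result matches a full-sequence softmax exactly; the payoff on hardware is that the rescale branch, which costs extra exponentials and multiplies, fires only on blocks that raise the maximum.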
Performance Gains Reach 2.7x Over Existing Implementations
The new implementation delivers substantial speedups across the board:
- Up to 1.3x faster than cuDNN 9.13
- Up to 2.7x faster than Triton
- Achieves near-matrix-multiplication speed for attention operations
- Reduces compile times by 20-30x compared to traditional C++ template approaches
PyTorch Integration Removes Performance Ceiling for FlexAttention
PyTorch announced that FlexAttention now includes a FlashAttention-4 backend, addressing a longstanding performance limitation. FlexAttention has been adopted by over 1,000 repositories and cited in dozens of papers for prototyping custom attention variants, but users consistently encountered performance ceilings when moving from research to production.
The implementation was built entirely in CuTe-DSL, NVIDIA's kernel DSL from the CUTLASS project embedded in Python, achieving significantly faster compile times while retaining full expressivity. The paper, authored by Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar, and Tri Dao, was published on arXiv on March 5, 2026.
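FlexAttention's appeal is that custom attention variants are expressed as a score-modification callback rather than a hand-written kernel; the real entry point is `torch.nn.attention.flex_attention` (whose `score_mod` also receives batch and head indices). The framework-free numpy sketch below only illustrates that programming model; the function names here are invented for the example.

```python
import numpy as np

# Toy version of the FlexAttention programming model: the caller supplies a
# score_mod callback that rewrites each raw attention score before softmax.
# This is an illustration, not the torch.nn.attention.flex_attention API.

def attention_with_score_mod(q, k, v, score_mod):
    s = q @ k.T / np.sqrt(q.shape[-1])        # raw scaled scores
    q_idx = np.arange(s.shape[0])[:, None]
    kv_idx = np.arange(s.shape[1])[None, :]
    s = score_mod(s, q_idx, kv_idx)           # user-defined variant
    s = s - s.max(axis=-1, keepdims=True)     # numerically stable softmax
    p = np.exp(s)
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

# Example variant: causal masking written as a score_mod
def causal(score, q_idx, kv_idx):
    return np.where(kv_idx <= q_idx, score, -np.inf)
```

A FlashAttention-4 backend means callbacks like `causal` no longer pay the performance ceiling the article describes when such prototypes move to production.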
Key Takeaways
- FlashAttention-4 achieves 1,613 TFLOPS (71% utilization) on NVIDIA B200 GPUs, up to 2.7x faster than Triton
- Blackwell's doubled tensor core throughput creates asymmetric scaling where exponential operations and shared memory become bottlenecks
- Software-emulated softmax operations and a redesigned memory pipeline sharply reduce non-matmul performance limitations
- PyTorch's FlexAttention now includes FlashAttention-4 backend, removing performance ceilings for 1,000+ repositories
- Implementation uses CuTe-DSL in Python, achieving 20-30x faster compile times than traditional C++ approaches