Researchers have released FlashAttention-4, a major optimization of the transformer attention mechanism that achieves up to 1,613 TFLOPS (71% utilization) on NVIDIA's new Blackwell B200 GPUs. The breakthrough addresses a critical bottleneck as the industry transitions from Hopper to Blackwell architectures, where existing attention kernels fail to exploit the hardware's full potential.
Blackwell's Asymmetric Hardware Scaling Creates New Bottlenecks
The Blackwell architecture presents a fundamental challenge: tensor core throughput doubled compared to Hopper H100 GPUs, but other functional units like shared memory bandwidth, exponential operations, and conditional operations scaled more slowly or remained unchanged. This asymmetric scaling means that operations like softmax and memory transfers now dominate execution time, preventing attention mechanisms from reaching theoretical peak performance.
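A back-of-the-envelope Amdahl-style estimate makes the asymmetry concrete. The per-cycle throughput numbers below are placeholders chosen purely for illustration, not measured Hopper or Blackwell figures:

```python
# Illustrative estimate of how doubling only matmul throughput shifts the
# bottleneck toward softmax. The ops-per-cycle numbers are made up for
# illustration; they are not real Hopper/Blackwell specs.

def attention_cycles(matmul_per_cycle, exp_per_cycle, seq=8192, d=128):
    matmul_flops = 4 * seq * d   # QK^T and PV matmuls for one query row
    exps = seq                   # one exponential per attention score
    return matmul_flops / matmul_per_cycle + exps / exp_per_cycle

hopper_like = attention_cycles(256, 4)     # baseline throughput ratios
blackwell_like = attention_cycles(512, 4)  # matmul 2x, exp unchanged
speedup = hopper_like / blackwell_like     # well below 2x
print(round(speedup, 2))
```

With these toy numbers, doubling matmul throughput yields only a 1.8x speedup: the fixed exponential cost claims a growing share of the runtime, which is exactly the effect FlashAttention-4 targets.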
FlashAttention-4 redesigns the algorithmic pipeline to work around these limitations. It introduces software-emulated exponentials and conditional softmax rescaling to cut non-matrix-multiplication work, uses tensor memory and the 2-CTA MMA mode to reduce shared-memory traffic, and exploits fully asynchronous MMA operations with larger tile sizes.
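The conditional-rescaling idea can be sketched in plain numpy: a streaming (online) softmax only rescales its running accumulators when a new key block actually raises the running row maximum, skipping that non-matmul work otherwise. This is a single-query illustration of the general trick, not the paper's kernel; the function and block-size names are invented here.

```python
import numpy as np

# Streaming softmax attention for one query vector, processing keys/values
# in blocks. The running output is rescaled only when the row max changes
# (the "conditional rescaling" idea); otherwise the rescale is skipped.

def streaming_attention(q, k, v, block=64):
    seq, d = k.shape
    m = -np.inf           # running row max
    l = 0.0               # running softmax denominator
    out = np.zeros(d)     # running (unnormalized) output
    for start in range(0, seq, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = kb @ q                      # scores for this block
        m_new = max(m, s.max())
        if m_new > m:                   # conditional rescale
            scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
            l *= scale
            out *= scale
            m = m_new
        p = np.exp(s - m)               # block softmax numerators
        l += p.sum()
        out += p @ vb
    return out / l
```

The result matches a full-sequence softmax exactly; the payoff on hardware is that the rescale branch, which costs extra exponentials and multiplies, fires only on blocks that raise the maximum.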
Performance Gains Reach 2.7x Over Existing Implementations
The new implementation delivers substantial speedups across the board:
- Up to 1.3x faster than cuDNN 9.13
- Up to 2.7x faster than Triton
- Achieves near-matrix-multiplication speed for attention operations
- Reduces compile times by 20-30x compared to traditional C++ template approaches
PyTorch Integration Removes Performance Ceiling for FlexAttention
PyTorch announced that FlexAttention now includes a FlashAttention-4 backend, addressing a longstanding performance limitation. FlexAttention has been adopted by over 1,000 repositories and cited in dozens of papers for prototyping custom attention variants, but users consistently encountered performance ceilings when moving from research to production.
The implementation was built entirely in CuTe-DSL, NVIDIA's kernel DSL from the CUTLASS project embedded in Python, achieving significantly faster compile times while retaining full expressivity. The paper, authored by Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar, and Tri Dao, was published on arXiv on March 5, 2026.
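FlexAttention's appeal is that custom attention variants are expressed as a score-modification callback rather than a hand-written kernel; the real entry point is `torch.nn.attention.flex_attention` (whose `score_mod` also receives batch and head indices). The framework-free numpy sketch below only illustrates that programming model; the function names here are invented for the example.

```python
import numpy as np

# Toy version of the FlexAttention programming model: the caller supplies a
# score_mod callback that rewrites each raw attention score before softmax.
# This is an illustration, not the torch.nn.attention.flex_attention API.

def attention_with_score_mod(q, k, v, score_mod):
    s = q @ k.T / np.sqrt(q.shape[-1])        # raw scaled scores
    q_idx = np.arange(s.shape[0])[:, None]
    kv_idx = np.arange(s.shape[1])[None, :]
    s = score_mod(s, q_idx, kv_idx)           # user-defined variant
    s = s - s.max(axis=-1, keepdims=True)     # numerically stable softmax
    p = np.exp(s)
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

# Example variant: causal masking written as a score_mod
def causal(score, q_idx, kv_idx):
    return np.where(kv_idx <= q_idx, score, -np.inf)
```

A FlashAttention-4 backend means callbacks like `causal` no longer pay the performance ceiling the article describes when such prototypes move to production.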
Key Takeaways
- FlashAttention-4 achieves 1,613 TFLOPS (71% utilization) on NVIDIA B200 GPUs, up to 2.7x faster than Triton
- Blackwell's doubled tensor core throughput creates asymmetric scaling where exponential operations and shared memory become bottlenecks
- Software-emulated softmax operations and a redesigned memory pipeline sharply reduce non-matmul performance limitations
- PyTorch's FlexAttention now includes FlashAttention-4 backend, removing performance ceilings for 1,000+ repositories
- Implementation uses CuTe-DSL in Python, achieving 20-30x faster compile times than traditional C++ approaches