Cross-Layer Sparse Attention Achieves 7.6x Decoding Speedup at 128K Context

Researchers from Microsoft and Tsinghua University have introduced Cross-Layer Sparse Attention (CLSA), a new method that delivers up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context length. The approach, detailed in a paper published June 4, 2026 on arXiv, addresses a critical bottleneck in long-context language model inference by sharing routing indices across decoder layers.

CLSA Builds on KV-Sharing Architectures

CLSA extends KV-sharing architectures like YOCO (You Only Cache Once) by sharing not only the key-value cache across layers, but also the routing index itself. A single indexer computes token-level top-k selection once, and the resulting index is reused across the entire decoder stack. This design preserves the fine-grained selectivity of token sparse attention while amortizing the expensive routing overhead across all layers.
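The shared-index idea can be illustrated with a minimal sketch in plain Python. This is not the paper's implementation: the dot-product indexer score, the function names (top_k_indices, sparse_attention, decode_step), and the single-head layout are all assumptions made for illustration. It shows only the control flow described above, where one top-k selection per decoding step is reused by every layer reading from a shared KV cache.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k_indices(scores, k):
    """Return the positions of the k largest scores, in ascending position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def sparse_attention(query, keys, values, index):
    """Softmax attention restricted to the cached positions listed in index."""
    scale = math.sqrt(len(query))
    logits = [dot(query, keys[i]) / scale for i in index]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, index):
        for d, v in enumerate(values[i]):
            out[d] += (w / total) * v
    return out

def decode_step(layer_queries, indexer_query, indexer_keys, keys, values, k):
    """One decoding step: route once, then reuse the index in every layer.

    Assumption: the indexer scores each cached token with a dot product.
    The article does not specify the scoring function.
    """
    scores = [dot(indexer_query, ik) for ik in indexer_keys]
    shared_index = top_k_indices(scores, k)  # computed once per step
    outputs = [sparse_attention(q, keys, values, shared_index)
               for q in layer_queries]       # every layer reuses the same index
    return shared_index, outputs
```

In a per-layer routing scheme, the call to top_k_indices would sit inside the loop over layer_queries; moving it outside the loop is the whole change this sketch is meant to show.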

Traditional sparse attention methods face an efficiency-quality trade-off: structured block sparse methods provide stronger acceleration but lose quality, while token sparse methods maintain accuracy but deliver limited speedup because top-k routing over the full cache remains computationally expensive at each layer.

Performance Across Multiple Inference Bottlenecks

The paper reports the following results and properties for CLSA:

7.6x decoding speedup at 128K context length
17.1x overall throughput improvement at 128K context
Jointly optimizes pre-filling, KV-cache storage, and long-context decoding
Maintains model quality across short-context and long-context benchmarks

The shared-indexer design removes redundancy by computing the top-k result once rather than independently at each layer, making sparse retrieval practically useful for decoding-heavy workloads.
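A rough operation count makes the removed redundancy concrete. The sketch below counts how many cached positions are scored in one decoding step, treating an indexer score and an attention score as one unit of work each. That equal weighting, the layer count of 32, and the top-k budget of 2,048 are hypothetical values chosen for illustration; they do not come from the paper and are not the source of its reported speedups.

```python
def positions_scored_per_step(context_len, num_layers, top_k, shared_index):
    """Count cached positions scored in one decoding step.

    Routing scans the full cache; attention then touches top_k positions
    in each layer. With a shared index, routing runs once instead of once
    per layer.
    """
    routing_passes = 1 if shared_index else num_layers
    routing = routing_passes * context_len
    attention = num_layers * top_k
    return routing + attention

context_len = 128 * 1024   # 131,072 cached tokens
num_layers = 32            # hypothetical number of layers sharing the index
top_k = 2048               # hypothetical sparsity budget

per_layer = positions_scored_per_step(context_len, num_layers, top_k, shared_index=False)
shared = positions_scored_per_step(context_len, num_layers, top_k, shared_index=True)

print(per_layer)           # 4259840
print(shared)              # 196608
print(round(per_layer / shared, 2))  # 21.67
```

Under these assumptions the full-cache scan dominates per-layer routing, which is why computing it once changes the total so sharply. Real speedups depend on kernel efficiency, memory bandwidth, and the actual cost of the indexer, none of which this count models.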

Implications for Long-Context Applications

The work changes where routing happens in sparse attention for decoder-only transformers. Instead of each layer computing its own top-k selection over the full cache, a single routing decision is shared across layers, which cuts routing cost roughly in proportion to the number of layers that reuse the index while keeping the accuracy benefits of token-level sparsity.

The technique makes long-context reasoning with 128K+ tokens more practical for production deployment by reducing both memory and compute requirements. This is particularly relevant for AI coding assistants, document analysis, and other applications that need large context windows, especially when the model generates long chains of thought and decoding dominates the cost.

Key Takeaways

Cross-Layer Sparse Attention (CLSA) achieves 7.6x decoding speedup and 17.1x overall throughput at 128K context by sharing routing indices across decoder layers
The method builds on KV-sharing architectures like YOCO, computing token-level top-k selection once and reusing the index across all layers
CLSA jointly optimizes pre-filling, KV-cache storage, and long-context decoding while maintaining model quality on benchmarks
The approach makes 128K+ token context windows practical for production deployment in applications like AI coding assistants and document analysis
By amortizing routing overhead across the decoder stack, CLSA preserves token-sparse attention accuracy while delivering block-sparse efficiency gains
