Researchers from Microsoft and Tsinghua University have introduced Cross-Layer Sparse Attention (CLSA), a method that delivers up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context length. The approach, detailed in a paper published on arXiv on June 4, 2026, targets a key bottleneck in long-context language model inference, the cost of recomputing top-k routing at every layer, by sharing routing indices across decoder layers.
CLSA Builds on KV-Sharing Architectures
CLSA extends KV-sharing architectures like YOCO (You Only Cache Once) by sharing not only the key-value cache across layers, but also the routing index itself. A single indexer computes token-level top-k selection once, and the resulting index is reused across the entire decoder stack. This design preserves the fine-grained selectivity of token sparse attention while amortizing the expensive routing overhead across all layers.
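The route-once, reuse-everywhere idea can be illustrated with a minimal sketch. This is not the paper's implementation: the dot-product scoring in the indexer, the single-head layout, and all names (`shared_topk_index`, `sparse_attention`, `decode_step`) are assumptions chosen for readability, and the code uses plain Python lists rather than a tensor library.

```python
import heapq
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def shared_topk_index(index_query, index_keys, k):
    """Score every cached token once and return the top-k positions.

    Assumption: the indexer scores tokens with a plain dot product.
    """
    scores = [dot(index_query, key) for key in index_keys]
    k = min(k, len(scores))
    top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(top)


def sparse_attention(query, keys, values, index):
    """Softmax attention restricted to the token positions in `index`."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in index]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[i][d] for w, i in zip(weights, index)) / total
        for d in range(dim)
    ]


def decode_step(layer_queries, index_query, index_keys, keys, values, k):
    """One decoding step: route once, then reuse the index in every layer.

    `keys` and `values` stand in for the KV cache shared across layers.
    """
    index = shared_topk_index(index_query, index_keys, k)
    outputs = [sparse_attention(q, keys, values, index) for q in layer_queries]
    return index, outputs
```

In this sketch the top-k selection runs once per decoded token, and each layer only gathers the selected keys and values, so the per-layer work scales with `k` rather than with the full cache length.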
Traditional sparse attention methods face an efficiency-quality trade-off. Structured block-sparse methods provide stronger acceleration but lose quality. Token-sparse methods maintain accuracy but deliver limited speedup, because top-k routing over the full cache remains computationally expensive at each layer.
Performance Across Multiple Inference Bottlenecks
CLSA reports gains across several inference bottlenecks:
- Up to 7.6x decoding speedup at 128K context length
- Up to 17.1x overall throughput improvement at 128K context length
- Jointly optimizes pre-filling, KV-cache storage, and long-context decoding
- Maintains model quality across short-context and long-context benchmarks
The shared-indexer design removes redundancy by computing the top-k result once rather than independently at each layer, making sparse retrieval practically useful for decoding-heavy workloads.
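A rough operation count shows where the redundancy goes. The sketch below is a back-of-the-envelope model, not a reproduction of the paper's measurements: it only counts token-level routing scores per decoded token, and the function names and the example layer count are hypothetical.

```python
def routing_scores_per_step(context_len, num_layers, shared):
    """Routing scores computed for one decoded token.

    `num_layers` is the number of layers that need a top-k index.
    With a shared indexer the cache is scored once; otherwise each
    layer scores the full cache independently.
    """
    return context_len if shared else context_len * num_layers


def routing_reduction(context_len, num_layers):
    """Ratio of per-layer routing work to shared routing work."""
    per_layer = routing_scores_per_step(context_len, num_layers, shared=False)
    shared = routing_scores_per_step(context_len, num_layers, shared=True)
    return per_layer / shared
```

For a 128K-token cache (131,072 tokens) and an assumed 32 layers, per-layer routing computes 4,194,304 scores per decoded token while shared routing computes 131,072, a 32x reduction in routing work alone. End-to-end speedups such as the reported 7.6x also depend on attention, memory traffic, and kernel efficiency, which this count ignores.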
Implications for Long-Context Applications
This research rethinks how sparse attention is implemented in decoder-only transformers. Rather than having each layer compute its own expensive routing decisions, CLSA shares one routing result across layers, cutting routing cost without sacrificing the accuracy benefits of token-level sparsity.
The technique makes long-context reasoning with 128K+ tokens more practical for production deployment by reducing both memory and compute requirements. This is particularly relevant for AI coding assistants, document analysis, and other applications that require extensive context windows and in which models often generate long chains of thought during reasoning.
Key Takeaways
- Cross-Layer Sparse Attention (CLSA) achieves up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context by sharing routing indices across decoder layers
- The method builds on KV-sharing architectures like YOCO, computing token-level top-k selection once and reusing the index across all layers
- CLSA jointly optimizes pre-filling, KV-cache storage, and long-context decoding while maintaining model quality on benchmarks
- The approach makes 128K+ token context windows practical for production deployment in applications like AI coding assistants and document analysis
- By amortizing routing overhead across the decoder stack, CLSA preserves token-sparse attention accuracy while delivering block-sparse efficiency gains