Researchers from Tsinghua University have developed IndexCache, a technique that eliminates up to 75% of indexer computations in DeepSeek Sparse Attention (DSA) models while maintaining output quality. On a 30B parameter DSA model, the method achieves up to 1.82× prefill speedup and 1.48× decode speedup compared to a standard DSA implementation. The paper, published on arXiv on March 12, 2026, introduces both training-free and training-aware variants of the approach.
DeepSeek Sparse Attention's Redundant Indexer Problem
DeepSeek Sparse Attention reduces core attention complexity from O(L²) to O(Lk) by using a lightweight lightning indexer to select the top-k most relevant tokens per query at each layer. However, the indexer itself retains O(L²) complexity and must run independently at every layer. The key insight: adjacent layers share 70-100% of their selected tokens, meaning most indexer computations are redundant.
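The overlap insight is easy to illustrate. The toy sketch below uses random correlated scores as a stand-in for per-layer lightning-indexer scores (the real indexer, its dimensions, and the noise model here are all illustrative assumptions, not the paper's implementation) and measures how much two adjacent layers' top-k selections agree:

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_indices(scores: np.ndarray, k: int) -> set:
    """Return the indices of the k highest-scoring tokens."""
    return set(np.argpartition(scores, -k)[-k:])

# Toy stand-in for lightning-indexer scores over L tokens at two adjacent
# layers: a shared component plus small per-layer noise mimics the
# 70-100% overlap the paper reports (purely illustrative numbers).
L, k = 1024, 64
base = rng.normal(size=L)
layer_a = base + 0.1 * rng.normal(size=L)
layer_b = base + 0.1 * rng.normal(size=L)

sel_a, sel_b = topk_indices(layer_a, k), topk_indices(layer_b, k)
overlap = len(sel_a & sel_b) / k
print(f"top-{k} overlap between adjacent layers: {overlap:.0%}")
```

When the selected sets agree this closely, recomputing the full O(L²) indexer at every layer buys little: reusing a nearby layer's indices recovers almost the same top-k set.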
Two IndexCache Approaches Deliver Different Trade-offs
IndexCache partitions layers into Full layers that run their own indexers and Shared layers that reuse the nearest Full layer's top-k indices. The training-free variant applies a greedy search algorithm that selects which layers retain their indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. The training-aware variant introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy.
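A minimal sketch of the training-free variant, under stated assumptions: `greedy_select`, `nearest_full_layer`, and the toy calibration loss are hypothetical names invented here, and a real implementation would evaluate actual language modeling loss on calibration data rather than the distance-based stand-in below.

```python
def greedy_select(num_layers: int, budget: int, calib_loss) -> set:
    """Greedily choose which layers keep their own indexers (Full layers).

    Starts with only layer 0 as Full, then repeatedly restores the one
    indexer whose retention most reduces calibration loss, until the
    compute budget (number of Full layers) is spent.
    """
    full = {0}  # the first layer must run its own indexer
    while len(full) < budget:
        best_layer, best_loss = None, float("inf")
        for layer in range(num_layers):
            if layer in full:
                continue
            loss = calib_loss(full | {layer})  # stand-in for LM loss
            if loss < best_loss:
                best_layer, best_loss = layer, loss
        full.add(best_layer)
    return full

def nearest_full_layer(layer: int, full: set) -> int:
    """Shared layers reuse top-k indices from the nearest preceding Full layer."""
    return max(l for l in full if l <= layer)

# Toy calibration loss (illustrative only): penalize each layer by its
# distance to the closest Full layer, so the search spreads Full layers out.
toy_loss = lambda full: sum(min(abs(l - f) for f in full) for l in range(12))
full_layers = greedy_select(num_layers=12, budget=3, calib_loss=toy_loss)
print(sorted(full_layers), nearest_full_layer(7, full_layers))
```

At inference, a Shared layer simply looks up `nearest_full_layer` and reads the cached top-k indices instead of running its own O(L²) indexer, which is where the eliminated computation comes from.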
Preliminary Results Extend to the Production-Scale GLM-5 Model
The researchers validated their approach on a 30B DSA model and ran preliminary experiments on the production-scale GLM-5 model. The technique proves particularly valuable for long-context agentic workflows, which have emerged as a defining use case for large language models. By reducing indexer overhead, IndexCache addresses a critical bottleneck in both inference speed and serving cost for long-context applications.
Key Takeaways
- IndexCache eliminates up to 75% of indexer computations in DeepSeek Sparse Attention models with negligible quality loss
- Adjacent layers in DSA models share 70-100% of their selected tokens, revealing massive redundancy in standard implementations
- The technique achieves up to 1.82× prefill speedup and 1.48× decode speedup on a 30B parameter model
- Both training-free and training-aware variants are available, offering different trade-offs between implementation complexity and performance
- Preliminary experiments on the production-scale GLM-5 model support real-world applicability