Researchers from Tsinghua University have developed IndexCache, a technique that eliminates up to 75% of indexer computations in DeepSeek Sparse Attention (DSA) models while maintaining output quality. On a 30B parameter DSA model, the method achieves up to 1.82× prefill speedup and 1.48× decode speedup compared to a standard DSA implementation. The paper, published on arXiv on March 12, 2026, introduces both training-free and training-aware variants of the approach.
DeepSeek Sparse Attention's Redundant Indexer Problem
DeepSeek Sparse Attention reduces core attention complexity from O(L²) to O(Lk) by using a lightweight lightning indexer to select the top-k most relevant tokens per query at each layer. However, the indexer itself retains O(L²) complexity and must run independently at every layer. The key insight: adjacent layers share 70-100% of their selected tokens, meaning most indexer computations are redundant.
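The overlap insight is easy to illustrate. The toy sketch below uses random correlated scores as a stand-in for per-layer lightning-indexer scores (the real indexer, its dimensions, and the noise model here are all illustrative assumptions, not the paper's implementation) and measures how much two adjacent layers' top-k selections agree:

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_indices(scores: np.ndarray, k: int) -> set:
    """Return the indices of the k highest-scoring tokens."""
    return set(np.argpartition(scores, -k)[-k:])

# Toy stand-in for lightning-indexer scores over L tokens at two adjacent
# layers: a shared component plus small per-layer noise mimics the
# 70-100% overlap the paper reports (purely illustrative numbers).
L, k = 1024, 64
base = rng.normal(size=L)
layer_a = base + 0.1 * rng.normal(size=L)
layer_b = base + 0.1 * rng.normal(size=L)

sel_a, sel_b = topk_indices(layer_a, k), topk_indices(layer_b, k)
overlap = len(sel_a & sel_b) / k
print(f"top-{k} overlap between adjacent layers: {overlap:.0%}")
```

When the selected sets agree this closely, recomputing the full O(L²) indexer at every layer buys little: reusing a nearby layer's indices recovers almost the same top-k set.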
Two IndexCache Approaches Deliver Different Trade-offs
IndexCache partitions layers into Full layers that run their own indexers and Shared layers that reuse the nearest Full layer's top-k indices. The training-free variant applies a greedy search algorithm that selects which layers retain their indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. The training-aware variant introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy.
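A minimal sketch of the training-free variant, under stated assumptions: `greedy_select`, `nearest_full_layer`, and the toy calibration loss are hypothetical names invented here, and a real implementation would evaluate actual language modeling loss on calibration data rather than the distance-based stand-in below.

```python
def greedy_select(num_layers: int, budget: int, calib_loss) -> set:
    """Greedily choose which layers keep their own indexers (Full layers).

    Starts with only layer 0 as Full, then repeatedly restores the one
    indexer whose retention most reduces calibration loss, until the
    compute budget (number of Full layers) is spent.
    """
    full = {0}  # the first layer must run its own indexer
    while len(full) < budget:
        best_layer, best_loss = None, float("inf")
        for layer in range(num_layers):
            if layer in full:
                continue
            loss = calib_loss(full | {layer})  # stand-in for LM loss
            if loss < best_loss:
                best_layer, best_loss = layer, loss
        full.add(best_layer)
    return full

def nearest_full_layer(layer: int, full: set) -> int:
    """Shared layers reuse top-k indices from the nearest preceding Full layer."""
    return max(l for l in full if l <= layer)

# Toy calibration loss (illustrative only): penalize each layer by its
# distance to the closest Full layer, so the search spreads Full layers out.
toy_loss = lambda full: sum(min(abs(l - f) for f in full) for l in range(12))
full_layers = greedy_select(num_layers=12, budget=3, calib_loss=toy_loss)
print(sorted(full_layers), nearest_full_layer(7, full_layers))
```

At inference, a Shared layer simply looks up `nearest_full_layer` and reads the cached top-k indices instead of running its own O(L²) indexer, which is where the eliminated computation comes from.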
Preliminary Results Extend to the Production-Scale GLM-5 Model
The researchers validated their approach on a 30B DSA model and ran preliminary experiments on the production-scale GLM-5 model. The technique proves particularly valuable for long-context agentic workflows, which have emerged as a defining use case for large language models. By reducing indexer overhead, IndexCache addresses a critical bottleneck in both inference speed and serving cost for long-context applications.
Key Takeaways
- IndexCache eliminates up to 75% of indexer computations in DeepSeek Sparse Attention models with negligible quality loss
- Adjacent layers in DSA models share 70-100% of their selected tokens, revealing massive redundancy in standard implementations
- The technique achieves up to 1.82× prefill speedup and 1.48× decode speedup on a 30B parameter model
- Both training-free and training-aware variants are available, offering different trade-offs between implementation complexity and performance
- Preliminary experiments on the production-scale GLM-5 model support real-world applicability