IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
IndexCache accelerates sparse attention by reusing top-k indices across layers, removing 75% of indexer computations and achieving up to a 1.82x speedup.
Key Findings
Methodology
IndexCache partitions layers into Full layers, which retain their indexers, and Shared layers, which reuse the top-k indices from the nearest Full layer. Two complementary approaches are proposed: training-free IndexCache uses a greedy search algorithm to directly minimize language modeling loss, while training-aware IndexCache introduces a multi-layer distillation loss to train each retained indexer against the averaged attention distributions of all layers it serves.
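To make the reuse pattern concrete, here is a minimal PyTorch-style sketch of a Full/Shared forward pass. The names `FULL_LAYERS`, `indexer_score`, and `sparse_attention` are illustrative placeholders rather than the authors' implementation, and Shared layers here reuse indices from the most recent preceding Full layer.

```python
# Minimal sketch of cross-layer index reuse with a Full/Shared layer split.
# FULL_LAYERS, indexer_score, and sparse_attention are illustrative placeholders,
# not the authors' implementation.
import torch

NUM_LAYERS, TOP_K = 8, 4
FULL_LAYERS = {0, 4}  # layers that keep their indexers (assumed interleaved pattern)

def indexer_score(q, keys):
    """Lightweight indexer: score every preceding token for one query."""
    return q @ keys.T  # [L]

def sparse_attention(q, keys, values, idx):
    """Core attention restricted to the top-k selected tokens."""
    k_sel, v_sel = keys[idx], values[idx]
    weights = torch.softmax(q @ k_sel.T / k_sel.shape[-1] ** 0.5, dim=-1)
    return weights @ v_sel

def forward_one_query(q_per_layer, kv_per_layer):
    """Run all layers for a single query position, reusing cached indices."""
    cached_idx, outputs = None, []
    for layer in range(NUM_LAYERS):
        q = q_per_layer[layer]
        keys, values = kv_per_layer[layer]
        if layer in FULL_LAYERS:
            # Full layer: run its own indexer and refresh the cached top-k indices.
            cached_idx = torch.topk(indexer_score(q, keys), TOP_K).indices
        # Shared layers skip the indexer and reuse the most recent Full layer's indices.
        outputs.append(sparse_attention(q, keys, values, cached_idx))
    return outputs

# Example: L = 16 context tokens, d = 32 features, random tensors per layer.
L, d = 16, 32
qs = [torch.randn(d) for _ in range(NUM_LAYERS)]
kvs = [(torch.randn(L, d), torch.randn(L, d)) for _ in range(NUM_LAYERS)]
outs = forward_one_query(qs, kvs)  # only layers 0 and 4 invoke the indexer
```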
Key Results
- On a 30B DSA model, IndexCache removes 75% of indexer computations with negligible quality degradation, achieving up to 1.82x prefill speedup and 1.48x decode speedup compared to standard DSA.
- Preliminary experiments on the production-scale GLM-5 model further confirm these positive results, with IndexCache removing 50% of indexer computations while maintaining comparable performance across both long-context and reasoning tasks.
- Experiments show that IndexCache improves decode throughput by 22-51% at a 200K context length, with larger gains at longer contexts.
Significance
IndexCache significantly reduces computational overhead in long-context inference, particularly in scenarios requiring efficient processing of large-scale data. By reducing the computational complexity of indexers, it provides a new solution for improving inference efficiency in large-scale language models, addressing the bottleneck of sparse attention in long-context applications.
Technical Contribution
IndexCache significantly reduces the computational overhead of sparse attention through cross-layer index reuse. Unlike existing methods that rely on full attention layers as anchors, it achieves efficient top-k selection with a small set of lightweight indexers whose selections are shared by the remaining layers. Its training-aware multi-layer distillation loss also offers a new way to train indexers that serve several layers at once.
Novelty
IndexCache is the first to achieve cross-layer index reuse in sparse attention, significantly reducing computational overhead. Unlike existing methods, it achieves efficient top-k selection through lightweight indexers without relying on full attention layers as anchors.
Limitations
- At extreme sparsity (retaining only 1/8 of indexer layers), there is a significant drop in long-context performance, indicating that indexer reuse may lead to quality degradation in some cases.
- While IndexCache performs well in most cases, further tuning may be required for specific tasks to ensure performance is not affected.
- Current experiments focus primarily on specific models and tasks, and its generalizability to broader application scenarios has yet to be verified.
Future Work
Future research directions include applying training-aware IndexCache to larger-scale models and exploring its performance across different tasks and datasets. Additionally, further optimization of indexer selection and reuse strategies could be studied to enhance model adaptability and performance.
AI Executive Summary
Long-context inference is a critical application scenario for modern large-scale language models, and sparse attention is an effective method to address this challenge. Traditional sparse attention mechanisms, such as DeepSeek Sparse Attention (DSA), use a lightweight indexer to select the top-k most relevant tokens for each query, reducing core attention computation from O(L^2) to O(Lk). However, the indexer itself retains O(L^2) complexity and must run independently at every layer, despite the high similarity of top-k selections across layers.
IndexCache addresses this redundancy by reusing cross-layer indices. It partitions layers into a small set of Full layers that retain their indexers and a majority of Shared layers that reuse the top-k indices from the nearest Full layer. Two complementary approaches are proposed to determine and optimize this configuration: training-free IndexCache applies a greedy search algorithm to directly minimize language modeling loss, while training-aware IndexCache introduces a multi-layer distillation loss to train each retained indexer against the averaged attention distributions of all layers it serves.
Experimental results show that on a 30B parameter DSA model, IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82x prefill speedup and 1.48x decode speedup compared to standard DSA. These positive results are further confirmed by preliminary experiments on the production-scale GLM-5 model, where IndexCache removes 50% of indexer computations while maintaining comparable performance across both long-context and reasoning tasks.
The significance of IndexCache lies in its ability to significantly reduce computational overhead in long-context inference, particularly in scenarios requiring efficient processing of large-scale data. By reducing the computational complexity of indexers, it provides a new solution for improving inference efficiency in large-scale language models, addressing the bottleneck of sparse attention in long-context applications.
However, at extreme sparsity (retaining only 1/8 of indexer layers), there is a significant drop in long-context performance, indicating that indexer reuse may lead to quality degradation in some cases. Future research directions include applying training-aware IndexCache to larger-scale models and exploring its performance across different tasks and datasets. Additionally, further optimization of indexer selection and reuse strategies could be studied to enhance model adaptability and performance.
Deep Analysis
Background
The self-attention mechanism is a cornerstone of modern large-scale language models, yet its quadratic complexity in sequence length presents a fundamental bottleneck for long-context inference. As large-scale language models are increasingly deployed in settings that demand extended contexts, such as long chain-of-thought reasoning, multi-step agentic workflows, and retrieval-augmented generation over web-scale sources, reducing attention cost without sacrificing model quality has become a critical research problem. Sparse attention offers a principled solution: instead of attending to all preceding tokens, each query selects only the most relevant subset. Among recent approaches, DeepSeek Sparse Attention (DSA) stands out as a production-grade trainable sparse attention mechanism. For sparse token selection, DSA introduces an additional lightweight indexer that scores all preceding tokens and selects the top-k for the subsequent core attention. This reduces per-layer core attention from O(L^2) to O(Lk) while preserving model quality through continued pre-training. However, the indexer itself still operates at O(L^2) complexity and must independently score all preceding tokens at every layer, which becomes a significant fraction of the total attention budget at long context lengths.
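To make the cost argument concrete, a rough per-layer accounting is sketched below. The symbols d and d_I (core and indexer feature dimensions) and the retained-layer fraction r are our notation, not the paper's.

```latex
% Rough per-layer cost of DSA-style sparse attention (illustrative accounting only).
% d = core attention feature dimension, d_I = indexer feature dimension (our notation).
\[
C_{\text{layer}} \approx
\underbrace{O(L^{2} d_I)}_{\text{indexer scores all preceding tokens}}
+ \underbrace{O(L k \, d)}_{\text{core attention over the top-}k\text{ tokens}}
\]
% If only a fraction r of layers retain their indexers (e.g. r = 1/4, i.e. 75% of
% indexer computations removed), the average indexer term per layer shrinks to
% roughly r * O(L^2 d_I), while the core-attention term is unchanged.
\[
C_{\text{layer}}^{\text{IndexCache}} \approx r \, O(L^{2} d_I) + O(L k \, d)
\]
```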
Core Problem
In long-context inference, the computational cost of sparse attention's indexer becomes a bottleneck. Despite the high similarity of top-k selections across layers, each layer's indexer must run independently, leading to substantial redundant computations. How to effectively leverage the cross-layer stability of index selections to reduce unnecessary indexer computations while maintaining model quality is a pressing issue.
Innovation
IndexCache significantly reduces the computational complexity of sparse attention through cross-layer index reuse. Key innovations include:
1. Layer partitioning strategy: layers are divided into Full layers, which retain their indexers, and Shared layers, which reuse the top-k indices from the nearest Full layer.
2. Training-free optimization: a greedy search algorithm is proposed to select which layers retain indexers by directly minimizing language modeling loss (a sketch of this search follows the list).
3. Training-aware optimization: a multi-layer distillation loss is introduced to train each retained indexer against the averaged attention distributions of all layers it serves.
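Below is one plausible formulation of the training-free greedy search, as we understand it. The callback `eval_lm_loss` is a hypothetical placeholder, not the authors' API: given a set of Full layers, it is assumed to run the model with only those layers keeping their indexers (all others reuse indices) and return the language-modeling loss on a calibration set.

```python
# Illustrative greedy search for training-free IndexCache (a sketch, not the paper's exact procedure).
def greedy_select_full_layers(num_layers, budget, eval_lm_loss):
    """Greedily add the layer whose retention most reduces calibration LM loss."""
    full_layers = set()
    while len(full_layers) < budget:
        best_layer, best_loss = None, float("inf")
        for layer in range(num_layers):
            if layer in full_layers:
                continue
            # Hypothetical callback: LM loss with this candidate set of Full layers.
            loss = eval_lm_loss(full_layers | {layer})
            if loss < best_loss:
                best_layer, best_loss = layer, loss
        full_layers.add(best_layer)
    return sorted(full_layers)

# Example: keep 1/4 of the indexers in a 32-layer model (75% of indexer work removed).
# full = greedy_select_full_layers(32, 32 // 4, eval_lm_loss)
```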
Methodology
The methodology of IndexCache includes the following steps:
- Layer partitioning: divide model layers into Full layers, which retain their indexers, and Shared layers, which reuse the top-k indices from the nearest Full layer.
- Training-free optimization: apply a greedy search algorithm to select which layers retain indexers by directly minimizing language modeling loss.
- Training-aware optimization: introduce a multi-layer distillation loss to train each retained indexer against the averaged attention distributions of all layers it serves (a sketch of this loss follows the list).
- Experimental validation: conduct experiments on a 30B parameter DSA model and a production-scale GLM-5 model to validate the effectiveness of IndexCache.
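A minimal sketch of such a distillation objective is given below. The use of a KL divergence and the tensor names are our assumptions about one plausible formulation, not the authors' exact loss.

```python
# Sketch of a multi-layer distillation loss for one retained indexer.
# The KL formulation and tensor names are assumptions, not the paper's exact loss.
import torch
import torch.nn.functional as F

def multi_layer_distill_loss(indexer_logits, served_layer_attn):
    """Distill one Full-layer indexer toward the averaged attention of the layers it serves.

    indexer_logits:    [L] scores from the retained indexer for one query.
    served_layer_attn: [num_served, L] attention distributions of the layers
                       that will reuse this indexer's top-k indices.
    """
    target = served_layer_attn.mean(dim=0)            # average over the served layers
    log_pred = F.log_softmax(indexer_logits, dim=-1)  # indexer distribution, log space
    # KL(target || pred): pushes the indexer to rank tokens like the served layers.
    return F.kl_div(log_pred, target, reduction="sum")
```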
Experiments
The experimental design includes validation on a 30B parameter DSA model and a production-scale GLM-5 model. Benchmark datasets used include OpenAI's GraphWalks, LongBench v2, RULER, and AA-LCR, as well as four general and reasoning benchmarks: AIME 2025, GPQA-Diamond, LiveCodeBench v6, and IFBench. The experiments compare the original DSA baseline against IndexCache at two retention ratios: 1/2 (half of the indexer layers retained) and 1/4 (a quarter retained).
Results
Experimental results show that on a 30B DSA model, IndexCache removes 75% of indexer computations with negligible quality degradation, achieving up to 1.82x prefill speedup and 1.48x decode speedup compared to standard DSA. Preliminary experiments on the production-scale GLM-5 model further confirm these positive results, with IndexCache removing 50% of indexer computations while maintaining comparable performance across both long-context and reasoning tasks. Additionally, IndexCache improves decode throughput by 22-51% at a 200K context length, with larger gains at longer contexts.
Applications
Application scenarios for IndexCache include:
1. Long-context inference in large-scale language models, particularly in scenarios requiring efficient processing of large-scale data.
2. Improving inference efficiency and reducing computational costs in real-time online services.
3. Deploying large-scale models in resource-constrained environments, reducing computational resource consumption.
Limitations & Outlook
While IndexCache performs well in most cases, at extreme sparsity (retaining only 1/8 of indexer layers), there is a significant drop in long-context performance. Additionally, current experiments focus primarily on specific models and tasks, and its generalizability to broader application scenarios has yet to be verified. Future research directions include applying training-aware IndexCache to larger-scale models and exploring its performance across different tasks and datasets.
Plain Language (accessible to non-experts)
Imagine you're shopping in a large supermarket. There are thousands of products, and each time you shop, you need to find the items you need on the shelves. The traditional method is to browse all the shelves each time to find what you need, which is like the full attention mechanism that processes all possible options. However, this is inefficient, especially when the supermarket is large.
Now, suppose the supermarket offers a smart shopping assistant that, based on your shopping list and history, pre-selects the most relevant products for you and tells you their locations when you arrive. This is like the sparse attention mechanism, which focuses only on the most relevant options, saving a lot of time and effort.
However, this smart assistant recalculates the locations of all products each time you shop, even if their locations haven't changed much. IndexCache is like a memory function that remembers the product locations from the last shopping trip and reuses this information the next time, updating only when necessary. This greatly reduces the assistant's computation and improves shopping efficiency.
In this way, IndexCache helps save computational resources when processing large amounts of data, significantly improving efficiency, especially in scenarios requiring quick responses.
ELI14 (explained like you're 14)
Imagine you're playing a massive multiplayer online game. There are many quests and challenges, each with many steps you need to complete to earn rewards. The traditional method is to start from scratch and complete each step every time, which is like the full attention mechanism that processes all possible options.
But this is inefficient, especially when the quests are complex. So, the game developers introduce a smart assistant that, based on your game history and current quest, selects the most relevant steps for you and guides you through the quest. This is like the sparse attention mechanism, which focuses only on the most relevant options, saving a lot of time and effort.
However, this assistant recalculates all the steps each time, even if they haven't changed much. IndexCache is like a memory function that remembers the steps from the last quest and reuses this information the next time, updating only when necessary. This greatly reduces the assistant's computation and improves game efficiency.
In this way, IndexCache helps save computational resources when handling many tasks, significantly improving efficiency, especially in scenarios requiring quick responses.
Glossary
Sparse Attention
Sparse attention is a mechanism in which each query attends only to the most relevant subset of tokens, reducing computational complexity.
Used in IndexCache to select the top-k most relevant tokens for each query.
Indexer
An indexer is a component used to score and select the most relevant tokens, determining the computational efficiency of sparse attention.
In DSA, the indexer scores all preceding tokens and selects the top-k.
Cross-Layer Index Reuse
Cross-layer index reuse refers to sharing index results between different layers to reduce redundant computations.
IndexCache removes 75% of indexer computations through cross-layer index reuse.
Greedy Search Algorithm
A greedy search algorithm is an optimization method that builds a solution incrementally, making the locally best choice at each step to minimize loss.
Used in training-free IndexCache to select which layers retain indexers.
Multi-Layer Distillation Loss
A multi-layer distillation loss is a training strategy that trains indexers to match the averaged attention distributions of all layers they serve.
Used in training-aware IndexCache to train retained indexers.
Full Layer
A Full layer is a layer that retains its indexer and computes fresh top-k indices.
In IndexCache, Full layers retain their own indexers.
Shared Layer
A Shared layer is a layer that reuses the top-k indices from the nearest Full layer, reducing indexer computations.
In IndexCache, Shared layers reuse the top-k indices from the nearest Full layer.
Long-Context Inference
Long-context inference refers to running a model over very long input sequences, where attention cost becomes the dominant bottleneck.
IndexCache improves long-context inference efficiency by reducing indexer computations.
DeepSeek Sparse Attention (DSA)
DSA is a production-grade trainable sparse attention mechanism that uses a lightweight indexer to select the top-k most relevant tokens.
IndexCache is validated on a DSA model.
GLM-5 Model
The GLM-5 model is a production-scale large language model on which preliminary experiments of IndexCache were conducted.
Used to verify the effectiveness of IndexCache in production environments.
Open Questions (unanswered questions from this research)
1. How can IndexCache be applied to larger-scale models and verified across different tasks and datasets? Current experiments focus primarily on specific models and tasks, and its generalizability to broader application scenarios has yet to be verified.
2. How can indexer selection and reuse strategies be further optimized under extreme sparsity to enhance model adaptability and performance?
3. Can the cross-layer index reuse method of IndexCache be applied to other types of attention mechanisms to improve computational efficiency?
4. How can the computational complexity of indexers be further reduced without affecting model quality?
5. How do hardware and software environments impact the performance and efficiency of IndexCache in practical applications?
Applications
Immediate Applications
Real-Time Online Services
By reducing computational overhead, IndexCache can improve inference efficiency and reduce computational costs in real-time online services requiring quick responses.
Large-Scale Data Processing
In scenarios requiring the processing of large-scale data, IndexCache improves long-context inference efficiency by reducing indexer computational complexity.
Model Deployment in Resource-Constrained Environments
In environments with limited computational resources, IndexCache can reduce resource consumption, making large-scale model deployment more feasible.
Long-term Vision
Proliferation of Large-Scale Language Models
By improving inference efficiency, IndexCache is expected to promote the proliferation of large-scale language models in more application scenarios.
Optimization of Smart Assistants
The cross-layer index reuse method of IndexCache can be used to optimize the computational efficiency of smart assistants, improving user experience.
Abstract
Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from $O(L^2)$ to $O(Lk)$. However, the indexer itself retains $O(L^2)$ complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers to retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82$\times$ prefill speedup and 1.48$\times$ decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).
References (20)
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
DeepSeek-AI, A. Liu, Aoxue Mei et al.
FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
Xunhao Lai, Jianqiao Lu, Yao Luo et al.
Improving Model Representation and Reducing KV Cache via Skip Connections with First Value Heads
Zhoutong Wu, Yuan Zhang, Yiming Dong et al.
DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning
Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang et al.
Kimi Linear: An Expressive, Efficient Attention Architecture
Yu Zhang, Zongyu Lin, Xingcheng Yao et al.
LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
Shang Yang, Junxian Guo, Haotian Tang et al.
SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference
Jintao Zhang, Chendong Xiang, Haofeng Huang et al.
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
Zhenyu (Allen) Zhang, Ying Sheng, Tianyi Zhou et al.
GLM-5: from Vibe Coding to Agentic Engineering
GLM-4.5 Team: Aohan Zeng, Xin Lv, Zhenyu Hou et al.
MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding
Z. M. K. Zuhri, Farid Adilazuarda, Ayu Purwarianti et al.
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Tri Dao, Albert Gu
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
Yizhao Gao, Zhichen Zeng, Dayou Du et al.
InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation
Weilin Zhao, Zihan Zhou, Zhou Su et al.
OmniKV: Dynamic Context Selection for Efficient Long-Context LLMs
Jitai Hao, Yuke Zhu, Tianjian Wang et al.
MiniMax-01: Scaling Foundation Models with Lightning Attention
MiniMax, Aonian Li, Bangwei Gong et al.
XAttention: Block Sparse Attention with Antidiagonal Scoring
Ruyi Xu, Guangxuan Xiao, Haofeng Huang et al.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Naman Jain, King Han, Alex Gu et al.
Gemma 3 Technical Report
Gemma Team: Aishwarya Kamath, Johan Ferret, Shreya Pathak et al.
RULER: What's the Real Context Size of Your Long-Context Language Models?
Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman et al.
TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
Lijie Yang, Zhihao Zhang, Zhuofu Chen et al.