IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

TL;DR

IndexCache accelerates sparse attention by reusing top-k indices across layers, removing 75% of indexer computations and achieving up to a 1.82x prefill speedup.

cs.CL 2026-03-13
Yushi Bai Qian Dong Ting Jiang Xin Lv Zhengxiao Du Aohan Zeng Jie Tang Juanzi Li
sparse attention cross-layer index deep learning efficient computation long-context processing

Key Findings

Methodology

IndexCache partitions layers into Full layers, which retain their indexers, and Shared layers, which reuse the top-k indices from the nearest Full layer. Two complementary approaches are proposed: training-free IndexCache uses a greedy search algorithm to directly minimize language modeling loss, while training-aware IndexCache introduces a multi-layer distillation loss to train each retained indexer against the averaged attention distributions of all layers it serves.
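The control flow of this partitioning can be sketched in a few lines. The layer interfaces (`indexer`, `attend`) and the toy layer class below are assumptions for illustration, not the paper's actual API; layer 0 is assumed to be a Full layer, since it has no earlier layer to borrow indices from:

```python
import numpy as np

class SparseAttentionLayer:
    """Toy stand-in for a DSA-style layer; these interfaces are assumptions."""
    def indexer(self, x):
        # Lightweight indexer: score every key for every query (toy: random scores).
        L = x.shape[0]
        return np.random.default_rng(0).normal(size=(L, L))

    def attend(self, x, indices):
        # Core attention restricted to the selected indices (toy: identity + 1).
        assert indices is not None, "a Shared layer needs cached indices"
        return x + 1.0

def forward_with_index_cache(layers, x, full_layer_ids, top_k):
    """Run a layer stack, refreshing top-k indices only at Full layers."""
    cached_indices = None
    for i, layer in enumerate(layers):
        if i in full_layer_ids:
            # Full layer: run its own indexer and refresh the cached top-k.
            scores = layer.indexer(x)                        # [q_len, kv_len]
            cached_indices = np.argsort(-scores, axis=-1)[:, :top_k]
        # Every layer attends over the cached indices; Shared layers simply
        # reuse the nearest preceding Full layer's picks.
        x = layer.attend(x, cached_indices)
    return x
```

The indexer runs only at the Full layers (here, a subset of layer ids), so its cost scales with the number of Full layers rather than with depth.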

Key Results

  • On a 30B DSA model, IndexCache removes 75% of indexer computations with negligible quality degradation, achieving up to 1.82x prefill speedup and 1.48x decode speedup compared to standard DSA.
  • Preliminary experiments on the production-scale GLM-5 model further confirm these positive results, with IndexCache removing 50% of indexer computations while maintaining comparable performance across both long-context and reasoning tasks.
  • Experiments show that IndexCache improves decode throughput by 22-51% at a 200K context length, with significant gains at longer contexts.

Significance

IndexCache significantly reduces computational overhead in long-context inference, particularly in scenarios requiring efficient processing of large-scale data. By reducing the computational complexity of indexers, it provides a new solution for improving inference efficiency in large-scale language models, addressing the bottleneck of sparse attention in long-context applications.

Technical Contribution

IndexCache reduces the computational cost of sparse attention through cross-layer index reuse. Unlike methods that anchor token selection on full attention layers, it keeps DSA's lightweight indexers only at a small set of Full layers and shares their top-k selections with the remaining layers. Its training-aware multi-layer distillation loss further allows even simple interleaved retention patterns to match full-indexer accuracy.

Novelty

IndexCache is the first method to reuse top-k indices across layers in sparse attention, substantially reducing indexer overhead. Unlike approaches that rely on full attention layers as anchors, it performs efficient top-k selection with lightweight indexers retained at only a fraction of the layers.

Limitations

  • At extreme sparsity (retaining only 1/8 of indexer layers), there is a significant drop in long-context performance, indicating that indexer reuse may lead to quality degradation in some cases.
  • While IndexCache performs well in most settings, specific tasks may require additional tuning (e.g., of the retention ratio or layer partition) to avoid performance loss.
  • Current experiments focus primarily on specific models and tasks, and its generalizability to broader application scenarios has yet to be verified.

Future Work

Future research directions include applying training-aware IndexCache to larger-scale models and exploring its performance across different tasks and datasets. Additionally, further optimization of indexer selection and reuse strategies could be studied to enhance model adaptability and performance.

AI Executive Summary

Long-context inference is a critical application scenario for modern large-scale language models, and sparse attention is an effective method to address this challenge. Traditional sparse attention mechanisms, such as DeepSeek Sparse Attention (DSA), use a lightweight indexer to select the top-k most relevant tokens for each query, reducing core attention computation from O(L^2) to O(Lk). However, the indexer itself retains O(L^2) complexity and must run independently at every layer, despite the high similarity of top-k selections across layers.

IndexCache addresses this redundancy by reusing cross-layer indices. It partitions layers into a small set of Full layers that retain their indexers and a majority of Shared layers that reuse the top-k indices from the nearest Full layer. Two complementary approaches are proposed to determine and optimize this configuration: training-free IndexCache applies a greedy search algorithm to directly minimize language modeling loss, while training-aware IndexCache introduces a multi-layer distillation loss to train each retained indexer against the averaged attention distributions of all layers it serves.

Experimental results show that on a 30B parameter DSA model, IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82x prefill speedup and 1.48x decode speedup compared to standard DSA. These positive results are further confirmed by preliminary experiments on the production-scale GLM-5 model, where IndexCache removes 50% of indexer computations while maintaining comparable performance across both long-context and reasoning tasks.

The significance of IndexCache lies in its ability to significantly reduce computational overhead in long-context inference, particularly in scenarios requiring efficient processing of large-scale data. By reducing the computational complexity of indexers, it provides a new solution for improving inference efficiency in large-scale language models, addressing the bottleneck of sparse attention in long-context applications.

However, at extreme sparsity (retaining only 1/8 of indexer layers), there is a significant drop in long-context performance, indicating that indexer reuse may lead to quality degradation in some cases. Future research directions include applying training-aware IndexCache to larger-scale models and exploring its performance across different tasks and datasets. Additionally, further optimization of indexer selection and reuse strategies could be studied to enhance model adaptability and performance.

Deep Analysis

Background

The self-attention mechanism is a cornerstone of modern large-scale language models, yet its quadratic complexity in sequence length presents a fundamental bottleneck for long-context inference. As large-scale language models are increasingly deployed in settings that demand extended contexts, such as long chain-of-thought reasoning, multi-step agentic workflows, and retrieval-augmented generation over web-scale sources, reducing attention cost without sacrificing model quality has become a critical research problem. Sparse attention offers a principled solution: instead of attending to all preceding tokens, each query selects only the most relevant subset. Among recent approaches, DeepSeek Sparse Attention (DSA) stands out as a production-grade trainable sparse attention mechanism. For sparse token selection, DSA introduces an additional lightweight indexer that scores all preceding tokens and selects the top-k for the subsequent core attention. This reduces per-layer core attention from O(L^2) to O(Lk) while preserving model quality through continued pre-training. However, the indexer itself still operates at O(L^2) complexity and must independently score all preceding tokens at every layer, which becomes a significant fraction of the total attention budget at long context lengths.
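A minimal single-head sketch of the indexer-plus-sparse-attention pattern DSA uses may help make the complexity argument concrete. The projection shapes are assumptions, and the per-query loop is for clarity only; a real kernel would be batched and fused:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dsa_style_attention(q, k, v, w_qi, w_ki, top_k):
    """Single-head sketch of indexer-guided sparse attention.

    A lightweight indexer scores all (query, key) pairs with cheap
    low-dimensional projections (still O(L^2) pairs, but cheap per pair);
    core attention then runs over each query's top-k keys only, O(L*k)
    instead of O(L^2)."""
    L, d = q.shape
    # Indexer: low-dimensional dot-product scores, causally masked.
    idx_scores = (q @ w_qi) @ (k @ w_ki).T                  # [L, L]
    idx_scores[np.triu_indices(L, k=1)] = -np.inf
    top = np.argsort(-idx_scores, axis=-1)[:, :top_k]       # [L, top_k]

    # Core attention restricted to each query's selected keys.
    out = np.zeros((L, v.shape[-1]))
    for i in range(L):
        sel = top[i][top[i] <= i]      # drop masked picks on short prefixes
        attn = softmax(q[i] @ k[sel].T / np.sqrt(d))
        out[i] = attn @ v[sel]
    return out, top
```

With `top_k` equal to the sequence length this reduces to full causal attention; the speedup comes from choosing `top_k` much smaller than `L`.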

Core Problem

In long-context inference, the computational cost of sparse attention's indexer becomes a bottleneck. Despite the high similarity of top-k selections across layers, each layer's indexer must run independently, leading to substantial redundant computations. How to effectively leverage the cross-layer stability of index selections to reduce unnecessary indexer computations while maintaining model quality is a pressing issue.

Innovation

IndexCache significantly reduces the computational complexity of sparse attention through cross-layer index reuse. Key innovations include:

1. Layer partitioning strategy: layers are divided into Full layers, which retain their indexers, and Shared layers, which reuse the top-k indices from the nearest Full layer.

2. Training-free optimization: a greedy search algorithm is proposed to select which layers retain indexers by directly minimizing language modeling loss.

3. Training-aware optimization: a multi-layer distillation loss is introduced to train each retained indexer against the averaged attention distributions of all layers it serves.
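The training-free search in point 2 can be sketched as follows. Here `eval_loss` is a hypothetical callable that evaluates language modeling loss on a calibration set for a given retained set, and the add-one-layer-at-a-time direction is one plausible reading of the greedy procedure, not a confirmed detail:

```python
def greedy_select_full_layers(n_layers, n_full, eval_loss):
    """Greedily choose which layers retain their indexers (Full layers).

    eval_loss(full_set) is assumed to run the model with indexers kept only
    at `full_set` (all other layers reuse the nearest Full layer's indices)
    and return language modeling loss on a calibration set.
    """
    # Layer 0 has no earlier Full layer to borrow from, so retain it.
    full = {0}
    while len(full) < n_full:
        candidates = [l for l in range(n_layers) if l not in full]
        # Retain the layer whose addition lowers calibration loss the most.
        best = min(candidates, key=lambda l: eval_loss(full | {l}))
        full.add(best)
    return sorted(full)
```

Because the search only evaluates losses, it needs no weight updates, matching the "training-free" framing in the paper.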

Methodology

The methodology of IndexCache includes the following steps:

  • Layer partitioning: divide model layers into Full layers, which retain their indexers, and Shared layers, which reuse the top-k indices from the nearest Full layer.
  • Training-free optimization: apply a greedy search algorithm to select which layers retain indexers by directly minimizing language modeling loss.
  • Training-aware optimization: introduce a multi-layer distillation loss to train each retained indexer against the averaged attention distributions of all layers it serves.
  • Experimental validation: conduct experiments on a 30B parameter DSA model and a production-scale GLM-5 model to validate the effectiveness of IndexCache.
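The training-aware distillation step can be illustrated for a single query position. The KL-divergence form and the tensor shapes below are assumptions for illustration, not confirmed details of the paper's loss:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multilayer_distillation_loss(indexer_logits, served_attn):
    """Illustrative multi-layer distillation loss for one query.

    served_attn stacks the attention distributions (over keys) of every
    layer this retained indexer serves; the indexer is trained toward
    their average. KL divergence is an assumed choice of matching loss."""
    target = served_attn.mean(axis=0)          # average over served layers
    p = softmax(indexer_logits)
    eps = 1e-9                                 # numerical safety for log
    return float(np.sum(target * (np.log(target + eps) - np.log(p + eps))))
```

The loss is zero when the indexer's distribution matches the averaged target, which is the training objective for each retained (Full-layer) indexer.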

Experiments

The experimental design includes validation on a 30B parameter DSA model and a production-scale GLM-5 model. Benchmark datasets used include OpenAI's GraphWalks, LongBench v2, RULER, and AA-LCR, as well as four general and reasoning benchmarks: AIME 2025, GPQA-Diamond, LiveCodeBench v6, and IFBench. The experiments compare the original DSA baseline against IndexCache at two retention ratios: 1/2 (half of the indexer layers retained) and 1/4 (a quarter retained).

Results

Experimental results show that on a 30B DSA model, IndexCache removes 75% of indexer computations with negligible quality degradation, achieving up to 1.82x prefill speedup and 1.48x decode speedup compared to standard DSA. Preliminary experiments on the production-scale GLM-5 model further confirm these positive results, with IndexCache removing 50% of indexer computations while maintaining comparable performance across both long-context and reasoning tasks. Additionally, IndexCache improves decode throughput by 22-51% at a 200K context length, with significant gains at longer contexts.

Applications

Application scenarios for IndexCache include:

1. Long-context inference in large-scale language models, particularly in scenarios requiring efficient processing of large-scale data.

2. Improving inference efficiency and reducing computational costs in real-time online services.

3. Deploying large-scale models in resource-constrained environments, reducing computational resource consumption.

Limitations & Outlook

While IndexCache performs well in most cases, at extreme sparsity (retaining only 1/8 of indexer layers), there is a significant drop in long-context performance. Additionally, current experiments focus primarily on specific models and tasks, and its generalizability to broader application scenarios has yet to be verified. Future research directions include applying training-aware IndexCache to larger-scale models and exploring its performance across different tasks and datasets.

Plain Language (accessible to non-experts)

Imagine you're shopping in a large supermarket. There are thousands of products, and each time you shop, you need to find the items you need on the shelves. The traditional method is to browse all the shelves each time to find what you need, which is like the full attention mechanism that processes all possible options. However, this is inefficient, especially when the supermarket is large.

Now, suppose the supermarket offers a smart shopping assistant that, based on your shopping list and history, pre-selects the most relevant products for you and tells you their locations when you arrive. This is like the sparse attention mechanism, which focuses only on the most relevant options, saving a lot of time and effort.

However, this smart assistant recalculates the locations of all products each time you shop, even if their locations haven't changed much. IndexCache is like a memory function that remembers the product locations from the last shopping trip and reuses this information the next time, updating only when necessary. This greatly reduces the assistant's computation and improves shopping efficiency.

In this way, IndexCache helps save computational resources when processing large amounts of data, significantly improving efficiency, especially in scenarios requiring quick responses.

ELI14 (explained like you're 14)

Imagine you're playing a massive multiplayer online game. There are many quests and challenges, each with many steps you need to complete to earn rewards. The traditional method is to start from scratch and complete each step every time, which is like the full attention mechanism that processes all possible options.

But this is inefficient, especially when the quests are complex. So, the game developers introduce a smart assistant that, based on your game history and current quest, selects the most relevant steps for you and guides you through the quest. This is like the sparse attention mechanism, which focuses only on the most relevant options, saving a lot of time and effort.

However, this assistant recalculates all the steps each time, even if they haven't changed much. IndexCache is like a memory function that remembers the steps from the last quest and reuses this information the next time, updating only when necessary. This greatly reduces the assistant's computation and improves game efficiency.

In this way, IndexCache helps save computational resources when handling many tasks, significantly improving efficiency, especially in scenarios requiring quick responses.

Glossary

Sparse Attention

Sparse attention is a mechanism in which each query attends only to the most relevant subset of tokens, reducing computational complexity.

Used in IndexCache to select the top-k most relevant tokens for each query.

Indexer

An indexer is a component used to score and select the most relevant tokens, determining the computational efficiency of sparse attention.

In DSA, the indexer scores all preceding tokens and selects the top-k.

Cross-Layer Index Reuse

Cross-layer index reuse refers to sharing index results between different layers to reduce redundant computations.

IndexCache removes 75% of indexer computations through cross-layer index reuse.

Greedy Search Algorithm

A greedy search algorithm is an optimization method that incrementally selects the optimal solution to minimize loss.

Used in training-free IndexCache to select which layers retain indexers.

Multi-Layer Distillation Loss

A multi-layer distillation loss is a training strategy that trains indexers to match the averaged attention distributions of all layers they serve.

Used in training-aware IndexCache to train retained indexers.

Full Layer

A Full layer is a layer that retains its indexer and computes fresh top-k indices.

In IndexCache, Full layers retain their own indexers.

Shared Layer

A Shared layer is a layer that reuses the top-k indices from the nearest Full layer, reducing indexer computations.

In IndexCache, Shared layers reuse the top-k indices from the nearest Full layer.

Long-Context Inference

Long-context inference is model inference over very long input sequences, where attention cost grows rapidly with context length.

IndexCache improves long-context inference efficiency by reducing indexer computations.

DeepSeek Sparse Attention (DSA)

DSA is a production-grade trainable sparse attention mechanism that uses a lightweight indexer to select the top-k most relevant tokens.

IndexCache is validated on a DSA model.

GLM-5 Model

The GLM-5 model is a production-scale large language model on which preliminary experiments of IndexCache were conducted.

Used to verify the effectiveness of IndexCache in production environments.

Open Questions (unanswered questions from this research)

  1. How can IndexCache be applied to larger-scale models and verified across different tasks and datasets? Current experiments focus primarily on specific models and tasks, and its generalizability to broader application scenarios has yet to be verified.
  2. How can indexer selection and reuse strategies be further optimized under extreme sparsity to enhance model adaptability and performance?
  3. Can the cross-layer index reuse method of IndexCache be applied to other types of attention mechanisms to improve computational efficiency?
  4. How can the computational complexity of indexers be further reduced without affecting model quality?
  5. How do hardware and software environments impact the performance and efficiency of IndexCache in practical applications?

Applications

Immediate Applications

Real-Time Online Services

By reducing computational overhead, IndexCache can improve inference efficiency and reduce computational costs in real-time online services requiring quick responses.

Large-Scale Data Processing

In scenarios requiring the processing of large-scale data, IndexCache improves long-context inference efficiency by reducing indexer computational complexity.

Model Deployment in Resource-Constrained Environments

In environments with limited computational resources, IndexCache can reduce resource consumption, making large-scale model deployment more feasible.

Long-term Vision

Proliferation of Large-Scale Language Models

By improving inference efficiency, IndexCache is expected to promote the proliferation of large-scale language models in more application scenarios.

Optimization of Smart Assistants

The cross-layer index reuse method of IndexCache can be used to optimize the computational efficiency of smart assistants, improving user experience.

Abstract

Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from $O(L^2)$ to $O(Lk)$. However, the indexer itself retains $O(L^2)$ complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers to retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82$\times$ prefill speedup and 1.48$\times$ decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).

cs.CL cs.LG

References (20)

  • DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. DeepSeek-AI, A. Liu, Aoxue Mei et al., 2025.
  • FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference. Xunhao Lai, Jianqiao Lu, Yao Luo et al., 2025.
  • Improving Model Representation and Reducing KV Cache via Skip Connections with First Value Heads. Zhoutong Wu, Yuan Zhang, Yiming Dong et al., 2025.
  • DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning. Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang et al., 2025.
  • Kimi Linear: An Expressive, Efficient Attention Architecture. Yu Zhang, Zongyu Lin, Xingcheng Yao et al., 2025.
  • LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention. Shang Yang, Junxian Guo, Haotian Tang et al., 2025.
  • SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference. Jintao Zhang, Chendong Xiang, Haofeng Huang et al., 2025.
  • H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. Zhenyu (Allen) Zhang, Ying Sheng, Tianyi Zhou et al., 2023.
  • GLM-5: from Vibe Coding to Agentic Engineering. GLM-4.5 Team: Aohan Zeng, Xin Lv, Zhenyu Hou et al., 2026.
  • MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding. Z. M. K. Zuhri, Farid Adilazuarda, Ayu Purwarianti et al., 2024.
  • Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. Tri Dao, Albert Gu, 2024.
  • SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs. Yizhao Gao, Zhichen Zeng, Dayou Du et al., 2024.
  • InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation. Weilin Zhao, Zihan Zhou, Zhou Su et al., 2025.
  • OmniKV: Dynamic Context Selection for Efficient Long-Context LLMs. Jitai Hao, Yuke Zhu, Tianjian Wang et al., 2025.
  • MiniMax-01: Scaling Foundation Models with Lightning Attention. MiniMax: Aonian Li, Bangwei Gong et al., 2025.
  • XAttention: Block Sparse Attention with Antidiagonal Scoring. Ruyi Xu, Guangxuan Xiao, Haofeng Huang et al., 2025.
  • LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. Naman Jain, King Han, Alex Gu et al., 2024.
  • Gemma 3 Technical Report. Gemma Team: Aishwarya Kamath, Johan Ferret, Shreya Pathak et al., 2025.
  • RULER: What's the Real Context Size of Your Long-Context Language Models? Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman et al., 2024.
  • TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention. Lijie Yang, Zhihao Zhang, Zhuofu Chen et al., 2024.