Mixture-of-Depths Attention
Mixture-of-Depths Attention (MoDA) improves downstream task performance by 2.11% on a 1.5B-parameter model with only a 3.7% increase in FLOPs.
Key Findings
Methodology
This paper introduces a novel attention mechanism called Mixture-of-Depths Attention (MoDA), which allows each attention head to attend to both sequence KV pairs at the current layer and depth KV pairs from preceding layers. This approach mitigates information dilution without significant computational overhead. To enhance hardware efficiency, the authors developed a hardware-aware implementation that achieves 97.3% of FlashAttention-2's efficiency at a sequence length of 64K.
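As a rough illustration (not the authors' released implementation), the PyTorch sketch below shows a single attention head whose queries score against both the current layer's sequence keys and the same token's keys cached from earlier layers, with one shared softmax deciding how much weight each source receives. The tensor layout, the per-token depth cache, and the omission of causal masking and multi-head/GQA details are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def moda_head(q, k_seq, v_seq, depth_k, depth_v):
    """Minimal single-head sketch of Mixture-of-Depths Attention.

    q        : (T, d)         queries at the current layer
    k_seq    : (T, d)         sequence keys at the current layer
    v_seq    : (T, d)         sequence values at the current layer
    depth_k  : (L_prev, T, d) keys cached from preceding layers
    depth_v  : (L_prev, T, d) values cached from preceding layers

    Each query attends jointly to the current layer's sequence KV pairs
    and to its own token's KV states from earlier layers, under a single
    softmax (one "unified operation").
    """
    T, d = q.shape
    scale = d ** -0.5

    # Scores against the current layer's sequence keys: (T, T)
    seq_scores = (q @ k_seq.transpose(0, 1)) * scale

    # Scores against the same token's depth keys from earlier layers: (T, L_prev)
    depth_scores = torch.einsum("td,ltd->tl", q, depth_k) * scale

    # One softmax over both groups -> data-dependent mixing of sequence vs. depth
    probs = F.softmax(torch.cat([seq_scores, depth_scores], dim=-1), dim=-1)
    p_seq, p_depth = probs[:, :T], probs[:, T:]

    # Weighted sum over sequence values plus weighted sum over depth values
    return p_seq @ v_seq + torch.einsum("tl,ltd->td", p_depth, depth_v)
```

In a full model the depth KV states would presumably be cached layer by layer as the forward pass proceeds, and the usual causal mask would be applied to the sequence scores; both are omitted here for brevity.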
Key Results
- MoDA on a 1.5B-parameter model reduces average perplexity by 0.2 across 10 validation benchmarks and increases average downstream performance by 2.11%, with only a 3.7% increase in FLOPs.
- Experiments show that combining MoDA with post-norm yields better performance than using it with pre-norm, indicating its potential for depth scaling.
- On the C4 validation set, models using MoDA outperform the OLMo2 baseline, achieving lower validation loss and better downstream performance on tasks like HellaSwag and ARC-Challenge.
Significance
The introduction of MoDA provides a new approach to depth scaling, addressing the common problem of information dilution in modern Transformer models. By allowing the attention mechanism to access deeper historical information, MoDA improves model quality at negligible computational cost. Beyond its academic interest, the method opens new possibilities for training and deploying large-scale language models in industry.
Technical Contribution
MoDA offers a novel attention mechanism that integrates sequence and depth attention into a unified operation, addressing information dilution in modern large language models. Compared with existing residual and dense connections, MoDA retrieves information from earlier layers more efficiently while remaining hardware-friendly, and its hardware-aware implementation delivers high efficiency on GPUs.
Novelty
MoDA is the first to combine sequence and depth attention in a unified mechanism, allowing each layer to adaptively read useful states from earlier layers. This approach differs from traditional fixed-pattern aggregation, offering a data-dependent dynamic mixing method that effectively addresses information dilution.
Limitations
- MoDA may still face information overload issues in extremely deep models, as effectively integrating information in very deep networks remains challenging despite its design to reduce information dilution.
- While MoDA has been optimized for hardware efficiency, further adjustments may be needed to achieve optimal performance on specific hardware architectures.
- The complexity of implementing MoDA, particularly its hardware-aware implementation, may pose a barrier for researchers new to the area.
Future Work
Future research could focus on further optimizing MoDA's hardware implementation to accommodate different hardware architectures. Additionally, exploring MoDA's application in other types of neural networks, particularly those requiring long sequences or deep structures, could be beneficial. Researchers might also consider combining MoDA with other advanced attention mechanisms to further enhance model performance.
AI Executive Summary
In recent years, large language models (LLMs) have made significant strides in the field of natural language processing. However, as these models grow deeper, the problem of information dilution becomes increasingly severe. This phenomenon is particularly evident in modern Transformer architectures, where informative features formed in earlier layers are gradually diluted by repeated residual updates, leading to a decline in model performance.
To address this issue, the paper introduces a novel attention mechanism called Mixture-of-Depths Attention (MoDA). MoDA allows each attention head to attend to both sequence KV pairs at the current layer and depth KV pairs from preceding layers, thereby mitigating information dilution without significant computational overhead. The authors also developed a hardware-aware implementation that achieves 97.3% of FlashAttention-2's efficiency at a sequence length of 64K.
The core technical principle of MoDA lies in its data-dependent dynamic mixing approach. By integrating sequence and depth attention into a unified operation, MoDA can adaptively read useful states from earlier layers, effectively addressing the issue of information dilution. This method differs from traditional fixed-pattern aggregation, offering a more flexible and efficient mechanism.
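Written in our own notation (the paper may use a different formulation), the unified operation for a query q_t at layer ℓ can be expressed as a single softmax over concatenated sequence and depth keys:

```latex
\mathrm{MoDA}(q_t) =
\operatorname{softmax}\!\left(
  \frac{q_t \,[\, K^{(\ell)} \,\Vert\, K^{\mathrm{depth}}_t \,]^{\top}}{\sqrt{d}}
\right)
[\, V^{(\ell)} \,\Vert\, V^{\mathrm{depth}}_t \,]
```

where K^(ℓ), V^(ℓ) are the sequence KV pairs at the current layer, K_t^depth, V_t^depth collect token t's KV states from preceding layers, and ∥ denotes concatenation along the key/value axis. Because one softmax spans both groups, the balance between sequence and depth information is learned and data-dependent rather than fixed.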
Experimental results demonstrate that MoDA, on a 1.5B-parameter model, reduces average perplexity by 0.2 across 10 validation benchmarks and increases average downstream performance by 2.11%, with only a 3.7% increase in FLOPs. Additionally, combining MoDA with post-norm yields better performance than using it with pre-norm, indicating its potential for depth scaling.
The introduction of MoDA provides a new approach to depth scaling, addressing the common problem of information dilution in modern Transformer models. This method holds significant academic importance and offers new possibilities for training and deploying large-scale language models in the industry.
However, MoDA may still face information overload issues in extremely deep models. Future research could focus on further optimizing MoDA's hardware implementation to accommodate different hardware architectures and exploring its application in other types of neural networks.
Deep Analysis
Background
Large language models (LLMs) have recently achieved remarkable progress in natural language processing, driven by the continuous expansion of model scale, including context length, training data, model width, and depth. However, as model depth increases, the problem of information dilution becomes more severe. This phenomenon is particularly evident in modern Transformer architectures, where informative features formed in earlier layers are gradually diluted by repeated residual updates, leading to a decline in model performance. Traditional residual pathways, such as ResNet-style connections, improve optimization stability in deep networks to some extent but fail to effectively address information dilution. To tackle this challenge, researchers have attempted various methods, such as dense cross-layer connections (DenseNet-style), but their substantial parameter growth limits their application in large-scale language models.
Core Problem
As the depth of large language models increases, the issue of information dilution becomes increasingly severe. This phenomenon is particularly evident in modern Transformer architectures, where informative features formed in earlier layers are gradually diluted by repeated residual updates, leading to a decline in model performance. The core problem lies in how to maintain optimization stability while preventing information dilution, thereby fully utilizing the representational capacity of deep models.
Innovation
This paper introduces a novel attention mechanism called Mixture-of-Depths Attention (MoDA) to address the problem of information dilution in modern Transformer models. The core innovation of MoDA lies in its data-dependent dynamic mixing approach, allowing each attention head to attend to both sequence KV pairs at the current layer and depth KV pairs from preceding layers. This method differs from traditional fixed-pattern aggregation, offering a more flexible and efficient mechanism. Additionally, the authors developed a hardware-aware implementation that achieves 97.3% of FlashAttention-2's efficiency at a sequence length of 64K.
Methodology
- MoDA Mechanism: Allows each attention head to attend to both sequence KV pairs at the current layer and depth KV pairs from preceding layers, thereby mitigating information dilution.
- Hardware-aware Implementation: Resolves the non-contiguous memory-access patterns introduced by depth attention, reaching 97.3% of FlashAttention-2's efficiency at a sequence length of 64K.
- Data-dependent Dynamic Mixing: By integrating sequence and depth attention into a unified operation, MoDA adaptively reads useful states from earlier layers.
- Post-norm Combination: Experiments show that combining MoDA with post-norm yields better performance than using it with pre-norm (see the sketch after this list).
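For reference, the sketch below contrasts the two normalization placements named in the last bullet. It is a generic textbook formulation rather than code from the paper, and `attn` stands in for any attention operator, MoDA included; the only difference between the two blocks is where LayerNorm sits relative to the residual addition.

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm: normalize the input, then add the residual: x + attn(norm(x))."""
    def __init__(self, dim, attn):
        super().__init__()
        self.norm, self.attn = nn.LayerNorm(dim), attn

    def forward(self, x):
        return x + self.attn(self.norm(x))


class PostNormBlock(nn.Module):
    """Post-norm: add the residual first, then normalize: norm(x + attn(x))."""
    def __init__(self, dim, attn):
        super().__init__()
        self.norm, self.attn = nn.LayerNorm(dim), attn

    def forward(self, x):
        return self.norm(x + self.attn(x))
```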
Experiments
Experiments were conducted on a 1.5B-parameter model using the 400B-token OLMo2 dataset. The model was tested on 10 validation benchmarks, including C4, HellaSwag, WinoGrande, and ARC-Challenge. The experiments also included comparisons with the baseline model OLMo2 and ablation studies to verify the effectiveness of MoDA. Key hyperparameters included a sequence length of 64K, model width of 1024, and GQA group size of 2.
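To make the quoted setup easier to scan, here is a hypothetical configuration summary that simply restates the hyperparameters mentioned above; the field names are illustrative and do not come from the released code.

```python
# Hypothetical restatement of the experimental setup described in the text;
# field names are illustrative, not taken from the MoDA repository.
moda_experiment = {
    "model_params": "1.5B",
    "training_tokens": "400B (OLMo2 data)",
    "sequence_length": 64 * 1024,      # 64K, as used in the efficiency comparison
    "model_width": 1024,
    "gqa_group_size": 2,
    "baseline": "OLMo2",
    "num_validation_benchmarks": 10,   # includes C4, HellaSwag, WinoGrande, ARC-Challenge
}
```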
Results
Experimental results demonstrate that MoDA, on a 1.5B-parameter model, reduces average perplexity by 0.2 across 10 validation benchmarks and increases average downstream performance by 2.11%, with only a 3.7% increase in FLOPs. Additionally, combining MoDA with post-norm yields better performance than using it with pre-norm, indicating its potential for depth scaling. On the C4 validation set, models using MoDA outperform the OLMo2 baseline, achieving lower validation loss and better downstream performance on tasks like HellaSwag and ARC-Challenge.
Applications
MoDA holds significant value for training and deploying large-scale language models. Its hardware-aware implementation improves efficiency on GPUs, making it suitable for tasks requiring long sequences and deep structures. Additionally, MoDA's dynamic mixing approach can enhance performance in information retrieval systems without significant computational cost, making it applicable to scenarios requiring complex information processing.
Limitations & Outlook
While MoDA has been optimized for hardware efficiency, further adjustments may be needed to achieve optimal performance on specific hardware architectures. Additionally, MoDA may still face information overload issues in extremely deep models. Future research could focus on further optimizing MoDA's hardware implementation to accommodate different hardware architectures and exploring its application in other types of neural networks.
Plain Language (Accessible to Non-Experts)
Imagine searching a massive library for one specific book. Working through it shelf by shelf, it is easy to lose track of useful details picked up along the way. MoDA is like a smart assistant that not only searches the current shelf but can also look back at notes taken on the shelves already visited, so nothing important gets forgotten. That is essentially how it counters information dilution: each layer of the model can reach back and reuse what earlier layers discovered instead of relying on a gradually fading summary. And the assistant is efficient, adding only a few percent of extra work while making the search noticeably more reliable.
ELI14 (Explained Like You're 14)
Think of playing a long, complicated video game. By the time you reach level 30, it is easy to forget a clue you picked up back in level 3. Deep language models have the same problem: as information passes through many layers, early clues slowly get watered down. MoDA works like a helper that keeps notes on what every earlier level revealed and lets the current level look up exactly the notes it needs, instead of relying on a blurry memory. Because the helper only consults a small set of saved notes, it barely slows the game down, yet it makes it much easier to spot the detail that wins the level.
Glossary
Mixture-of-Depths Attention (MoDA)
A novel attention mechanism that allows each attention head to attend to both sequence KV pairs at the current layer and depth KV pairs from preceding layers, mitigating information dilution.
MoDA is used in this paper to address the issue of information dilution in large language models.
Large Language Models (LLMs)
Deep learning-based natural language processing models with billions of parameters capable of handling complex language tasks.
The paper investigates the problem of information dilution in LLMs during depth scaling.
Information Dilution
As model depth increases, informative features formed in earlier layers are gradually diluted by repeated residual updates, leading to a decline in model performance.
MoDA addresses information dilution by allowing the attention mechanism to access deeper historical information.
Residual Connection
A network structure that adds direct skip connections between layers to help alleviate the vanishing gradient problem in deep networks.
Traditional residual connections improve optimization stability in deep networks to some extent.
Hardware-aware Implementation
A method of optimizing an algorithm's efficiency on specific hardware architectures, typically by adjusting memory access patterns and computation order.
MoDA's hardware-aware implementation achieves high efficiency on GPUs.
Sequence KV Pairs
In attention mechanisms, sequence key-value pairs are used to compute attention weights, determining the importance of each input element.
MoDA allows attention heads to attend to both sequence KV pairs at the current layer and depth KV pairs from preceding layers.
Depth KV Pairs
In MoDA, depth KV pairs are key-value pairs extracted from preceding layers to mitigate information dilution.
MoDA accesses depth KV pairs to mitigate information dilution.
Post-norm
A normalization placement in which layer normalization is applied after the residual addition that follows an attention or feed-forward sub-layer.
Experiments show that MoDA combined with post-norm outperforms pre-norm.
Pre-norm
A normalization placement in which layer normalization is applied to the input of an attention or feed-forward sub-layer, before the sub-layer computation and residual addition.
MoDA combined with post-norm outperforms pre-norm.
FlashAttention-2
An efficient attention mechanism implementation designed to improve computational efficiency for long sequence processing.
MoDA achieves 97.3% of FlashAttention-2's efficiency at a sequence length of 64K.
Open Questions (Unanswered Questions from This Research)
1. How can information be integrated effectively in extremely deep models without overload? Although MoDA reduces information dilution, integrating information across very deep networks remains challenging.
2. How can MoDA's hardware implementation be further optimized for different hardware architectures? It performs well on GPUs, but other platforms may require additional tuning.
3. Can MoDA be applied to other types of neural networks, particularly those requiring long sequences or deep structures?
4. Would combining MoDA with other advanced attention mechanisms further enhance model performance?
5. How can the implementation complexity of MoDA be reduced so that it is more accessible to newcomers?
Applications
Immediate Applications
Large-scale Language Model Training
MoDA can enhance the performance of large-scale language models without significant computational cost, making it suitable for scenarios requiring complex information processing.
Long Sequence Processing
MoDA's hardware-aware implementation improves efficiency on GPUs, making it suitable for tasks requiring long sequences and deep structures.
Information Retrieval Systems
MoDA's dynamic mixing approach can enhance the performance of information retrieval systems without significant computational cost.
Long-term Vision
Intelligent Assistants
MoDA could give intelligent assistants more efficient information processing, enabling faster and more accurate responses to users.
Autonomous Driving Systems
MoDA could give autonomous driving systems more efficient information processing, enabling faster and more accurate reactions to the environment.
Abstract
Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs computational overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at https://github.com/hustvl/MoDA .
References (20)
Deep Residual Learning for Image Recognition
Kaiming He, X. Zhang, Shaoqing Ren et al.
DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging
Matteo Pagliardini, Amirkeivan Mohtashami, F. Fleuret et al.
Densely Connected Convolutional Networks
Gao Huang, Zhuang Liu, Kilian Q. Weinberger
2 OLMo 2 Furious
Team OLMo, Pete Walsh, Luca Soldaini et al.
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI, Daya Guo et al.
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni et al.
DeepNet: Scaling Transformers to 1,000 Layers
Hongyu Wang, Shuming Ma, Li Dong et al.
Very Deep Convolutional Networks for Large-Scale Image Recognition
K. Simonyan, Andrew Zisserman
The SciQA Scientific Question Answering Benchmark for Scholarly Knowledge
S. Auer, D. Barone, Cassiano Bartz et al.
Efficient Streaming Language Models with Attention Sinks
Guangxuan Xiao, Yuandong Tian, Beidi Chen et al.
Scaling Laws for Neural Language Models
J. Kaplan, Sam McCandlish, T. Henighan et al.
Transformer-XL: Attentive Language Models beyond a Fixed-Length Context
Zihang Dai, Zhilin Yang, Yiming Yang et al.
Post-LayerNorm Is Back: Stable, ExpressivE, and Deep
Chen Chen, Lai Wei
mHC: Manifold-Constrained Hyper-Connections
Zhenda Xie, Yixuan Wei, Huan Cao et al.
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Luca Soldaini, Rodney Kinney, Akshita Bhagia et al.
Deep Learning Scaling is Predictable, Empirically
Joel Hestness, Sharan Narang, Newsha Ardalani et al.
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Christopher Clark, Kenton Lee, Ming-Wei Chang et al.
Gated Delta Networks: Improving Mamba2 with Delta Rule
Songlin Yang, Jan Kautz, Ali Hatamizadeh