On Subquadratic Architectures: From Applications to Principles

TL;DR

This study compares xLSTM, Mamba-2, and Gated DeltaNet architectures, demonstrating xLSTM's superior performance in complex sequence tasks due to its robust state tracking and memory accumulation.

cs.LG Advanced 2026-06-11
Anamaria-Roberta Hartl, Levente Zólyomi, David Stap, Pieter-Jan Hoedt, Niklas Schmidinger, Lukas Hauzenberger, Sebastian Böck, Günter Klambauer, Sepp Hochreiter
subquadratic architectures, sequence modeling, memory mechanisms, deep learning, model comparison

Key Findings

Methodology

The paper introduces a unified formulation of xLSTM, Mamba-2, and Gated DeltaNet, expressing their state update and memory mechanisms in a common framework. Extensive experiments across code pretraining, model distillation, and time-series forecasting evaluate their performance on tasks with complex dependencies, including synthetic length generalization tasks. The methodology involves:

  • Deriving a unified mathematical representation of the architectures’ gating and memory update rules;
  • Conducting end-to-end training and evaluation on datasets like HumanEval, PIQA, ARC, and GIFT-Eval;
  • Analyzing the architectures’ ability to handle long-range dependencies and structured information.

The evaluation compares metrics such as pass@k for code generation, accuracy on reasoning benchmarks, and forecasting errors, providing insights into how architectural differences influence performance in complex scenarios.
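The two metric families named above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: it assumes the widely used unbiased pass@k estimator (n samples per problem, c of them correct) and the standard MASE definition scaled by an in-sample seasonal-naive forecast. The function names and the `season` parameter are hypothetical, and the summary does not say which exact variants the authors used.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn per problem, c of them correct.

    Computes 1 - C(n - c, k) / C(n, k), the probability that at least one of
    k samples chosen without replacement from the n is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mase(train, actual, forecast, season=1):
    """Mean absolute scaled error.

    The forecast's mean absolute error is divided by the in-sample mean
    absolute error of a seasonal-naive forecast (lag = season) on `train`.
    """
    if len(actual) == 0 or len(actual) != len(forecast):
        raise ValueError("actual and forecast must be non-empty and equally long")
    if len(train) <= season:
        raise ValueError("train must be longer than the seasonal lag")
    naive_errors = [abs(train[t] - train[t - season]) for t in range(season, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("seasonal-naive error is zero; MASE is undefined")
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale
```

A benchmark-level pass@k is then the mean of `pass_at_k` over all problems, and a MASE below 1 means the forecast beats the seasonal-naive baseline on the chosen scale.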

Key Results

  • In code generation, xLSTM[7:1] outperformed Mamba-2 and Gated DeltaNet across all pass@k metrics on HumanEval, with a 1.81 percentage point improvement at pass@64 (from approximately 80% to 81.81%) after training on 100B tokens. Its advantage persisted across different data scales and training configurations, demonstrating robustness in handling long-distance code dependencies.
  • In model distillation experiments, xLSTM[1:0] as a plug-in operator achieved an average pass@1 score of 0.768 across four code benchmarks, surpassing Gated DeltaNet’s 0.755, indicating superior transferability of learned representations.
  • In time-series forecasting, xLSTM[3:1] achieved the lowest MASE and CRPS scores on GIFT-Eval across multiple parameter scales, especially for small to mid-sized models, confirming its effectiveness in modeling long temporal dependencies.
  • Synthetic tasks designed for length generalization and state tracking validated the core hypothesis: xLSTM’s gating mechanism enables more flexible and stable memory correction, allowing it to accurately count and track states beyond training sequence lengths. Mamba-2 and Gated DeltaNet showed limited ability in these tasks, highlighting their weaker accumulation and tracking capabilities.
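To make the length-generalization setup concrete, here is a minimal sketch of a counting task with a train/test length split. The paper's actual synthetic tasks are not specified in this summary, so the task, the function names, and the split parameters are all hypothetical. The reference solver shows only that a non-decaying additive state solves the task at any length; it is not a model of any of the three architectures.

```python
import random


def make_counting_example(length, rng):
    """Binary tokens paired with the running count of ones at each position."""
    tokens = [rng.randint(0, 1) for _ in range(length)]
    targets, count = [], 0
    for tok in tokens:
        count += tok
        targets.append(count)
    return tokens, targets


def make_length_generalization_split(train_max_len, test_len, n_train, n_test, seed=0):
    """Training sequences are short; test sequences are strictly longer (assumed setup)."""
    if test_len <= train_max_len:
        raise ValueError("test_len must exceed train_max_len")
    rng = random.Random(seed)
    train = [make_counting_example(rng.randint(1, train_max_len), rng) for _ in range(n_train)]
    test = [make_counting_example(test_len, rng) for _ in range(n_test)]
    return train, test


def exact_match(predict, examples):
    """Fraction of sequences for which predict(tokens) reproduces every target."""
    hits = sum(1 for tokens, targets in examples if predict(tokens) == targets)
    return hits / len(examples)


def accumulating_reference(tokens):
    """Reference solver: additive state s_t = s_{t-1} + x_t with no decay."""
    state, outputs = 0, []
    for tok in tokens:
        state += tok
        outputs.append(state)
    return outputs
```

A trained model would be wrapped as a `predict(tokens)` callable and scored with `exact_match` on the longer test split. A gap between train-length and test-length accuracy is the quantity such tasks are designed to expose.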

Significance

This research underscores the potential of subquadratic architectures, particularly xLSTM, to address the computational challenges faced by transformers in long sequence modeling. It attributes xLSTM’s superior performance to effective state tracking and memory accumulation, which points toward scalable, efficient models capable of handling complex structured data such as code and time series. For both academic research and industrial applications, the findings suggest a route to deploying high-performance sequence models at lower computational cost.

Technical Contribution

The paper’s main technical contribution is the development of a unified mathematical framework that captures the core differences among xLSTM, Mamba-2, and Gated DeltaNet in their gating and memory update mechanisms. This framework clarifies how these architectures differ primarily in their ability to accumulate information and track states over sequences. The authors also introduce synthetic length generalization tasks to empirically validate the hypothesis that robust state tracking and accumulation are key to performance in complex dependencies. Additionally, the study demonstrates the effectiveness of xLSTM as a plug-in operator in both pretraining and distillation scenarios, broadening its applicability in scalable sequence modeling.
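One way to picture such a common framework is as a single gated matrix-memory recurrence in which the three models differ only in the transition applied to the previous state and in the input scaling. The block below is a sketch based on the update rules as published in the original Mamba-2, xLSTM (mLSTM cell), and Gated DeltaNet papers. The symbols and shapes are assumptions chosen for illustration, and the paper's own notation may differ. The mLSTM normalizer state and the sLSTM cell of xLSTM are deliberately omitted.

```latex
% Assumed shapes: S_t \in \mathbb{R}^{d_v \times d_k}, \; q_t, k_t \in \mathbb{R}^{d_k}, \; v_t \in \mathbb{R}^{d_v}.
\begin{align*}
\text{Generic update:} \quad & S_t = S_{t-1} A_t + i_t \, v_t k_t^{\top}, \qquad y_t = S_t q_t \\
\text{Mamba-2:} \quad & A_t = a_t I, \quad a_t \in (0, 1), \quad i_t = \Delta_t \\
\text{mLSTM (xLSTM):} \quad & A_t = f_t I, \quad i_t = \exp(\tilde{\imath}_t) \\
\text{Gated DeltaNet:} \quad & A_t = \alpha_t \left( I - \beta_t k_t k_t^{\top} \right), \quad i_t = \beta_t
\end{align*}
```

Read this way, Mamba-2 and mLSTM apply a scalar decay to the whole memory and differ in how the write strength $i_t$ is parameterized, while Gated DeltaNet additionally erases the component of the memory along the current key before writing.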

Novelty

This work is the first comprehensive head-to-head comparison of xLSTM, Mamba-2, and Gated DeltaNet across multiple complex tasks, providing a clear performance hierarchy and mechanistic insights. The introduction of a unified formulation to explain their differences, especially highlighting the importance of accumulation and state tracking, represents a significant advancement in understanding subquadratic sequence architectures. The synthetic tasks designed for length generalization and state tracking further contribute novel empirical evidence supporting the central hypothesis, filling a critical gap in the literature on scalable sequence models.

Limitations

  • While xLSTM demonstrates superior performance in many tasks, its gating mechanism may still suffer from vanishing gradients or information loss on extremely long sequences or with high-dimensional states. Further regularization or architectural enhancements may be needed to address these issues.
  • Although xLSTM is computationally cheaper than transformers, its hardware utilization still lags behind that of workloads that map directly onto optimized matrix-multiplication hardware, especially for very large models. This limits its immediate deployment in resource-constrained environments.
  • Current evaluations focus on specific datasets and synthetic tasks; the generalization to broader real-world applications, such as multimodal data or highly noisy environments, remains to be thoroughly tested.

Future Work

Future research will explore integrating multi-scale memory modules and hierarchical gating mechanisms to further improve long sequence handling. Extending the architecture to multimodal and multi-task learning settings is also promising. Hardware optimization for efficient implementation of xLSTM on modern accelerators could significantly enhance practical deployment. Moreover, theoretical analysis of information retention and forgetting in these models will deepen understanding, guiding further architectural innovations. Expanding evaluations to more diverse datasets and real-world scenarios will be crucial for establishing broader applicability.

AI Executive Summary

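Despite their outstanding performance in sequence modeling, Transformers face a significant computational bottleneck because of their quadratic-complexity attention mechanism. As model sizes keep growing, reducing computational cost while preserving performance has become a central question in deep learning research. Subquadratic architectures, as a potential solution, have attracted broad attention from academia and industry in recent years. This paper systematically compares three representative architectures, xLSTM, Mamba-2, and Gated DeltaNet, aiming to reveal how their performance differs on tasks with complex dependencies and what mechanisms underlie those differences.

First, the authors propose a unified formulation that abstracts the state update and memory mechanisms of the three models as gated memory units. End-to-end evaluation across several settings, including code pretraining, model distillation, and time-series forecasting, shows that xLSTM performs best on most tasks, with a clear advantage in handling long-range dependencies and structured information. The core of this advantage is the robust state tracking and memory accumulation enabled by its gating mechanism, which lets it cope effectively with complex dependencies.

To test this mechanistic hypothesis, the study designs synthetic length-generalization and state-tracking tasks. The results show that xLSTM performs well on counting and state-update tasks beyond the training length, far exceeding Mamba-2 and Gated DeltaNet, which confirms its mechanistic advantage in modeling complex dependencies. This finding not only enriches the theoretical understanding of subquadratic architectures but also offers guidance for future designs.

In practical applications, the authors also apply these architectures to model distillation and time-series forecasting. The results show that xLSTM, used as a sub-architecture, outperforms the other architectures both when distilled from a teacher model and on time-series tasks across multiple parameter scales, demonstrating its potential for transfer learning and real-world scenarios. Overall, the study provides empirical evidence on architecture performance and also reveals the underlying mechanisms, offering new ideas for developing low-complexity, high-performance sequence models.

Future research will focus on incorporating multi-scale memory mechanisms, optimizing hardware implementations, and extending to multimodal and multi-task settings, to promote broad industrial adoption of subquadratic architectures. This work lays a solid foundation for long-sequence modeling in deep learning and opens a new chapter in the theoretical exploration of future model design.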

Deep Dive

Abstract

Transformers dominate modern sequence modeling, but their quadratic attention incurs substantial computational cost. Subquadratic architectures offer a scalable alternative. However, it remains unclear which designs yield the most effective sequence models. We compare three leading approaches: xLSTM, Mamba-2, and Gated DeltaNet. We evaluate these models on tasks with complex dependencies: (1) code-model pre-training, (2) distillation of code models from large language models, and (3) pre-training of time-series foundation models. Across these settings, xLSTM delivers the strongest overall performance. To explain xLSTM's advantage, we present a unified formulation and analyze the underlying architectural mechanisms, focusing on state tracking and memory dynamics. Our results show that xLSTM enables more flexible and stable memory correction via its gating scheme. We corroborate these findings on controlled synthetic length-generalization tasks. Overall, our findings indicate that xLSTM's gains on complex tasks stem from robust state tracking and accumulation.
