Sessa: Selective State Space Attention

TL;DR

Sessa enhances long-range memory by embedding selective attention in feedback paths.

cs.LG · Advanced · 2026-04-21
Liubomyr Horbatko
selective attention state-space models long-range memory feedback path sequence modeling

Key Findings

Methodology

Sessa is a decoder architecture that embeds selective attention mechanisms within feedback paths. Placing attention inside the feedback loop enables multi-path aggregation within a single layer, which strengthens long-range memory. Under stated assumptions, Sessa admits regimes with a power-law memory tail in lag ℓ of order Θ(ℓ^{-β}), where 0 < β < 1; this rate is shown to be tight in an explicit diffuse uniform-routing setting. The same mechanism also supports flexible selective retrieval, including non-decaying influence profiles.
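The summary gives no reference implementation, so the sketch below is only a rough, hypothetical illustration of the core idea: attention computed inside the recurrent feedback path, so that routing over the carried state is input-dependent. The state layout, the projections `Wq`, `Wk`, `Wv`, and the update rule are all assumptions, not the authors' design.

```python
import torch
import torch.nn.functional as F

def sessa_style_step(state, x_t, Wq, Wk, Wv):
    """One hypothetical recurrent step with attention inside the feedback path.

    state: (S, d) bank of latent slots carried across time (the feedback path)
    x_t:   (d,)   representation of the current token
    Routing is input-dependent: the query comes from the current input and the
    keys/values from the carried state, so the feedback dynamics depend on x_t.
    """
    q = x_t @ Wq                              # (d,)   query from the current input
    k = state @ Wk                            # (S, d) keys from the carried state
    v = state @ Wv                            # (S, d) values from the carried state
    scores = k @ q / (q.shape[-1] ** 0.5)     # (S,)   attention logits
    w = F.softmax(scores, dim=-1)             # (S,)   input-dependent routing weights
    read = w.unsqueeze(-1) * v                # (S, d) attention-weighted readout
    # The readout re-enters the loop: information can be aggregated over many
    # recurrent steps (many temporal paths) rather than in a single direct read.
    return state + read + x_t                 # x_t broadcasts over the S slots
```

Calling such a step once per position while carrying `state` forward yields a recurrence whose mixing coefficients are recomputed from the input at every step; that is the sense in which the attention sits inside, rather than in front of, the feedback path.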

Key Results

  • Sessa outperforms Transformer and Mamba baselines on long-context benchmarks: under matched architectures and training budgets, it achieves the best results on long-context tasks while remaining competitive on short-context language modeling.
  • Experiments confirm the long-range-sensitivity advantage: under diffuse routing, Sessa's influence on distant tokens decays more slowly than the O(1/ℓ) dilution of attention or the exponential forgetting of state-space baselines.
  • Sessa realizes flexible selective retrieval, including non-decaying influence profiles, which the compared model classes do not achieve under the same conditions.

Significance

The introduction of Sessa provides a novel solution for long-context sequence modeling, particularly for tasks that require long-range memory. By embedding attention mechanisms within feedback paths, Sessa addresses the attention-dilution problem that Transformers face at long contexts while also mitigating the rapid information decay of state-space models. This makes it relevant both as a research direction and as a practical route to more capable long-context processing.

Technical Contribution

Sessa's technical contribution is its architectural design, which combines selective attention with a recurrent feedback path to give a new construction for sequence mixing. Compared with existing Transformer and Mamba models, Sessa is stronger in long-range memory and selective retrieval. Its power-law memory-tail analysis provides a theoretical characterization of long-range sensitivity that neither single-read attention nor single-chain recurrence offers.

Novelty

Sessa is the first model to introduce selective attention mechanisms within feedback paths; its innovation is enabling multi-path aggregation within a single layer to enhance long-range memory. Unlike architectures that retrieve from the past in a single read or propagate information through a single feedback chain, Sessa offers a flexible selective retrieval mechanism.

Limitations

  • Sessa may incur increased computational cost from its more complex feedback path, especially on long-context tasks.
  • Performance on extreme long-range tasks still requires further validation and may face bottlenecks.
  • In some application scenarios, Sessa's added flexibility may lead to overfitting.

Future Work

Future research directions include optimizing Sessa's computational efficiency, exploring its behavior in a broader range of application scenarios, and further validating it on extreme long-range tasks. Combining Sessa with other architectures to improve adaptability across tasks is another avenue worth exploring.

AI Executive Summary

Modern sequence models are predominantly led by Transformers, where self-attention mixes information from the visible context in an input-dependent manner. However, when retrieval is not sharp and attention remains diffuse over an effective support, the influence of any individual token is diluted, especially in full-prefix settings where the influence of old tokens scales as O(1/ℓ). Structured state-space models process sequences recurrently through an explicit feedback path; selective variants like Mamba make this feedback input-dependent, yet when freeze time cannot be sustained over long intervals, their long-range sensitivity decays exponentially with lag. Existing architectures either retrieve from the past in a single read or propagate information through a single feedback chain. We introduce Sessa, a decoder that places attention inside a feedback path, enabling recurrent many-path aggregation within a layer. Under stated assumptions, Sessa admits regimes with a power-law memory tail in lag ℓ of order Θ(ℓ^{-β}), where 0 < β < 1, which is asymptotically slower than 1/ℓ; moreover, this rate is tight in an explicit diffuse uniform-routing setting where the influence is Θ(ℓ^{-β}). Under the same conditions, only Sessa among the compared model classes realizes flexible selective retrieval, including non-decaying profiles. Empirically, under matched architectures and training budgets, Sessa achieves the strongest performance on our long-context benchmarks while remaining competitive with Transformer and Mamba style baselines on short-context language modeling.

Long-context sequence modeling is central to modern foundation models across language, vision, speech, time series, and genomics. Despite the architectural flexibility of the foundation-model paradigm, state-of-the-art systems are still overwhelmingly based on the Transformer and its self-attention mechanism.

A useful lens is to describe modern sequence mixers by how they route information from the past and how they maintain memory over time. In many modern architectures, routing decisions are input-dependent: the model uses the current token and its context to decide which parts of the visible history to consult. Under this view, self-attention implements an input-dependent direct-read mechanism: at each position, it computes a query-dependent pattern of relevance over the visible context and uses it to read out information from selected past positions. This framing highlights attention’s key strength, a selection mechanism over variable support length, but also a structural limitation: the retrieval is performed in a single pass, without an internal feedback loop that would repeatedly incorporate past readouts into an evolving state. Separately, standard implementations are also computationally expensive at long contexts due to quadratic time/memory scaling.
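To make the dilution point concrete, the toy snippet below (not from the paper; the sizes and noise scale are arbitrary) shows that when attention logits over the visible prefix are nearly tied, the largest attention weight, and hence any single token's contribution, tracks 1/S_eff:

```python
import torch

# Diffuse routing: near-tied logits over S_eff competing tokens.
# The maximum softmax weight then stays close to 1 / S_eff, so each token's
# contribution to the readout is diluted roughly as O(1 / S_eff).
for s_eff in (16, 256, 4096):
    logits = 0.01 * torch.randn(s_eff)        # no sharp retrieval signal
    weights = torch.softmax(logits, dim=-1)
    print(f"S_eff={s_eff:5d}  max weight={weights.max().item():.5f}  1/S_eff={1 / s_eff:.5f}")
```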

In parallel, structured recurrent sequence models, especially state-space models (SSMs), which realize long-range dynamics through a latent state and an explicit feedback path, have re-emerged as a compelling alternative for long-context modeling. SSMs can be interpreted as modern descendants of classical dynamical systems and admit linear (or near-linear) scaling in sequence length. However, for information-dense discrete data, a persistent challenge is that stable feedback dynamics often exhibit rapid attenuation of distant information (commonly exponential forgetting), which can hinder integrating multiple far-apart evidence snippets under heavy distractors. Selective SSMs (e.g., Mamba) can conditionally slow this attenuation by modulating the effective transition (e.g., holding it close to the identity on selected steps, "freeze time"), but this mechanism is input-dependent and can fail when relevant and irrelevant positions induce similar local representations, leading to preserving or overwriting the wrong content.
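As a toy illustration of this failure mode (a scalar recurrence, not Mamba's actual parameterization), the sensitivity of the state to an input at lag ℓ is the product of the gates along the feedback chain, so a fixed gate below one decays exponentially and even occasional failures to "freeze time" reintroduce fast decay:

```python
import torch

def lag_sensitivity(gates):
    # For h_t = a_t * h_{t-1} + x_t, the derivative d h_T / d x_{T-ell} is the
    # product of the ell gates traversed along the single feedback chain.
    return torch.prod(gates)

ell = 512
# Stable, non-selective dynamics: a fixed gate < 1 gives exponential decay a**ell.
print(lag_sensitivity(torch.full((ell,), 0.99)).item())   # 0.99**512, about 6e-3
# Selective gating can hold a_t near 1 ("freeze time") on chosen steps, but the
# freeze must be sustained: a few mis-gated steps collapse the sensitivity again.
gates = torch.full((ell,), 0.999)
gates[::8] = 0.9
print(lag_sensitivity(gates).item())
```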

These perspectives suggest complementary long-context failure modes. Stable feedback dynamics can suffer from exponential forgetting. Attention, while input-dependent, can suffer from dilution: when attention mass is spread across a large effective support of competing tokens (e.g., many near-tied logits), individual weights, and thus per-token contributions and sensitivities, decrease roughly inversely with that support (often behaving like O(1/S_eff(t)), and in the worst case like O(1/ℓ) when the effective support grows proportionally with context length). In practice, both effects can limit reliable long-range evidence integration.
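Numerically, the three lag profiles discussed here separate quickly; in the sketch below the rate constant and β = 0.5 are purely illustrative, not values from the paper:

```python
import math

# Illustrative lag profiles: exponential forgetting (stable feedback),
# 1/ell dilution (diffuse attention whose effective support grows with context),
# and the slower power-law tail ell**(-beta) with 0 < beta < 1.
beta = 0.5
print(f"{'lag':>6} {'exp(-0.01*l)':>14} {'1/l':>10} {'l**-beta':>10}")
for ell in (10, 100, 1_000, 10_000):
    print(f"{ell:>6} {math.exp(-0.01 * ell):>14.2e} {1 / ell:>10.2e} {ell ** -beta:>10.2e}")
```

At long lags the power-law tail dominates both the exponential and the 1/ℓ profile, which is exactly the regime the paper targets.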

We introduce Sessa, a decoder architecture that injects input-dependent attention into a feedback (recurrent) path, combining direct-read input-dependent routing with stateful aggregation through the feedback channel. Viewed through a temporal routing lens, for a fixed source token and target position (lag ℓ = t − source), a single self-attention layer routes influence via a single routing step (a direct edge source → t), while chain-structured state-space recurrences propagate along the unique length-ℓ temporal chain. Sessa introduces route diversity within a single layer: its attention-induced feedback operator aggregates contributions over multiple internal routing depths (and, in dense patterns, many temporal paths), which can help sustain long-range sensitivity when routing is diffuse. Concretely, while self-attention corresponds to an input-dependent direct-read system (in the values), Sessa realizes an input-dependent feedback system: it maintains a latent state over unbounded horizons, while the feedback dynamics remain input-dependent via attention-based routing inside the loop (potentially over variable-support patterns). Intuitively, Sessa retains the representational benefits of recurrence for long-range accumulation while leveraging attention as an input-dependent mechanism within the feedback pathway.
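The route-diversity claim can be illustrated with a toy linear composition (again an illustration, not the paper's operator): composing diffuse, row-stochastic mixing matrices keeps the influence of the oldest position at roughly 1/T independently of depth, whereas a single decaying feedback chain loses it exponentially:

```python
import torch

T, L = 64, 64                                  # positions and routing depth (toy sizes)
uniform = torch.full((T, T), 1.0 / T)          # diffuse routing: many paths per step
chain = torch.diag(torch.full((T,), 0.9))      # single feedback chain with decay 0.9

# Influence of the oldest position on the newest after L composition steps.
multi_path = torch.linalg.matrix_power(uniform, L)[T - 1, 0]
single_chain = torch.linalg.matrix_power(chain, L)[T - 1, T - 1]
print(multi_path.item(), single_chain.item())  # about 1/T = 0.0156 vs 0.9**64 ≈ 0.0012
```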

Related architectural ideas have introduced recurrence or feedback into sequence modeling. These approaches span a variety of feedback constructions and are typically presented in architecture-specific terms. Our contribution is complementary but mathematically different: we propose a routing-induced systems perspective that separates how context produces routing/mixing coefficients from how those coefficients are composed over time, and we use this lens to relate input-dependent routing directly to long-context sensitivity and memory-decay behavior.

Deep Dive

Abstract

Modern sequence models are dominated by Transformers, where self-attention mixes information from the visible context in an input-dependent way. However, when retrieval is not sharp and attention remains diffuse over an effective support $S_{\mathrm{eff}}(t)$, the influence of any individual token is diluted, typically scaling as $O(1/S_{\mathrm{eff}}(t))$ and reaching $O(1/\ell)$ for old tokens in full-prefix settings. Structured state-space models process sequences recurrently through an explicit feedback path; selective variants such as Mamba make this feedback input-dependent, yet when freeze time cannot be sustained over long intervals, their long-range sensitivity decays exponentially with lag. Existing architectures therefore either retrieve from the past in a single read or propagate information through a single feedback chain. We introduce Sessa, a decoder that places attention inside a feedback path, enabling recurrent many-path aggregation within a layer. Under stated assumptions, Sessa admits regimes with a power-law memory tail in lag $\ell$ of order $O(\ell^{-\beta})$ for $0<\beta<1$, which is asymptotically slower than $1/\ell$; moreover, this rate is tight in an explicit diffuse uniform-routing setting where the influence is $\Theta(\ell^{-\beta})$. Under the same conditions, only Sessa among the compared model classes realizes flexible selective retrieval, including non-decaying profiles. Empirically, under matched architectures and training budgets, Sessa achieves the strongest performance on our long-context benchmarks while remaining competitive with Transformer- and Mamba-style baselines on short-context language modeling.

cs.LG cs.AI cs.CL