Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning

TL;DR

This paper establishes the theoretical foundation for linear recurrent memory units (ALF) in partially observable reinforcement learning, constructing two linear filters that precisely replicate belief dynamics.

cs.LG 🔴 Advanced 2026-05-29 80 views

Yike Zhao Onno Eberhard Malek Khammassi Ali H. Sayed Michael Muehlebach

AI Reader Arxiv Page Download PDF

Reinforcement Learning Linear RNN Partial Observability Bayesian Filtering HMM Theoretical Analysis

Key Findings

Methodology

The authors develop two classes of linear filters to analyze their performance in hidden Markov models (HMMs) and action-controlled HMMs. The first filter exactly reproduces the log-belief dynamics under deterministic transition matrices, serving as a sufficient statistic for optimal policy learning. The second filter achieves vanishing state decoding error in nearly deterministic environments, significantly reducing ambiguity. The analysis extends to action-dependent, time-varying filters, accommodating environment dynamics. Theoretical proofs leverage spectral properties of matrices, eigenvalues, and matrix decompositions, complemented by numerical simulations validating the filters’ effectiveness in simulated environments. These filters are shown to serve as powerful feature extractors, improving policy learning in small RL tasks.

Key Results

In deterministic transition matrices, the proposed linear filter precisely reproduces the belief logit, with errors approaching zero, validated by simulations where the error was below 0.5%. In nearly deterministic models, the decoding error decays exponentially with the perturbation parameter ε, reaching near-zero levels as ε decreases to 0.001. When applied to small RL environments, the feature representations derived from these filters led to faster convergence and higher rewards compared to baseline methods. The filters demonstrated robustness across different environment complexities and noise levels, maintaining high accuracy in state estimation. The experimental results confirm the theoretical guarantees, establishing the filters as practical tools for RL in partially observable settings.
The experiments highlight that the filters outperform traditional nonlinear approaches in terms of computational efficiency and interpretability. They enable accurate state tracking with low computational cost, making them suitable for real-time applications. The feature extraction capability enhances policy learning, leading to improved sample efficiency and generalization. These findings suggest that linear filters, under certain conditions, can serve as reliable building blocks for scalable RL systems, especially in environments with high-dimensional or partially observable states.

Significance

This work provides a rigorous theoretical basis for the use of linear recurrent units in partially observable RL, challenging the prevailing notion that nonlinear models are necessary for effective memory. By demonstrating that simple linear filters can approximate belief dynamics with provable guarantees, it opens new avenues for designing efficient, interpretable, and scalable RL algorithms. The results bridge the gap between classical filtering theory and modern deep RL, offering insights into how environment determinism influences memory capacity. The practical implications are substantial, enabling deployment in resource-constrained systems such as robotics, autonomous vehicles, and financial decision-making, where computational efficiency and interpretability are critical. Long-term, this research paves the way for integrating classical control principles with deep learning, fostering the development of hybrid models that combine theoretical rigor with practical performance.

Technical Contribution

The core technical contribution lies in the systematic construction and analysis of linear filters that replicate the belief state dynamics in HMMs and action-controlled environments. The authors prove that under deterministic transition matrices, the filters exactly match the log-belief evolution, serving as sufficient statistics for optimal policies. They extend these results to nearly deterministic matrices, showing exponential decay of decoding errors as the perturbation parameter diminishes. The analysis involves spectral properties, eigenvalue placement, and matrix decompositions, providing explicit conditions under which these filters perform optimally. Additionally, the authors introduce time-varying filters for action-dependent models, demonstrating their adaptability. The theoretical guarantees are supported by numerical experiments, establishing a solid foundation for linear RNNs in RL, with potential for broad application and further development.

Novelty

This research is the first to rigorously establish the theoretical foundations for linear recurrent memory units in the context of partially observable RL. Unlike prior work that primarily relied on nonlinear models or heuristic approaches, this study constructs explicit linear filters with provable optimality properties. The key novelty includes the exact reproduction of belief dynamics in deterministic environments and the exponential error decay in near-deterministic settings. The extension to action-dependent, time-varying filters further broadens the applicability. These contributions challenge the conventional wisdom that nonlinear models are indispensable for effective memory, offering a new paradigm grounded in classical linear algebra and filtering theory. The work thus represents a significant step forward in understanding the fundamental capabilities of linear RNNs in complex decision-making tasks.

Limitations

The theoretical guarantees rely heavily on assumptions of environment determinism or near-determinism, limiting applicability in highly stochastic or nonlinear environments where these conditions do not hold. In such cases, the filters may not perform optimally, and the error bounds may not be valid.
The models assume accurate knowledge of the environment’s transition matrices, which may not be available in real-world scenarios. Estimating these matrices introduces additional complexity and potential errors, affecting filter performance.
The approach primarily addresses discrete, finite state spaces; extending these results to continuous or high-dimensional spaces remains challenging. Moreover, the filters’ performance in large-scale, real-world RL tasks needs further empirical validation.

Future Work

Future research should focus on relaxing the environment assumptions, developing adaptive algorithms that learn transition models online, and extending the theory to continuous and high-dimensional state spaces. Integrating these linear filters with deep neural networks to handle complex, nonlinear dynamics is another promising direction. Additionally, exploring robustness against model misspecification and environmental noise will be crucial for practical deployment. The development of scalable algorithms for parameter estimation and online adaptation, as well as empirical validation in real-world robotics and autonomous systems, will further advance this line of research.

AI Executive Summary

Partially observable environments pose a fundamental challenge in reinforcement learning, where agents must infer hidden states from incomplete and noisy observations. Traditional approaches rely heavily on nonlinear recurrent neural networks such as LSTM and GRU, which, while effective, are computationally intensive and lack transparent theoretical guarantees. This paper introduces a novel perspective by establishing a rigorous theoretical foundation for linear recurrent memory units, termed ALF, in the context of partial observability.

The authors construct two classes of linear filters rooted in classical filtering theory. The first filter precisely reproduces the log-belief vector dynamics in hidden Markov models (HMMs) with deterministic transition matrices, effectively serving as a sufficient statistic for optimal policy learning. The second filter extends this to environments with nearly deterministic transitions, achieving exponential decay in state decoding error as the perturbation parameter ε approaches zero. These filters are analytically shown to leverage the spectral properties of the transition matrices, particularly eigenvalues near the unit circle, to support long-range dependencies.

The theoretical analysis is complemented by extensive numerical simulations. In controlled environments, the filters demonstrate near-perfect state reconstruction with errors below 0.5%. When applied to small RL tasks, the features extracted by these filters significantly improve policy learning efficiency, outperforming traditional nonlinear methods in both accuracy and computational cost. The filters’ robustness across different environment complexities underscores their practical potential.

This work fundamentally shifts the understanding of memory mechanisms in RL, revealing that under certain conditions, simple linear models can match the performance of complex nonlinear filters. It bridges classical filtering theory with modern RL, offering interpretable, scalable, and theoretically grounded solutions for high-dimensional, partially observable problems. The implications extend to robotics, autonomous systems, and financial modeling, where resource constraints and interpretability are paramount. Looking ahead, integrating these filters with deep learning frameworks and exploring their applicability in more complex, stochastic environments represent promising avenues for future research.

Deep Analysis

Background

Reinforcement learning (RL) has achieved remarkable success in various domains, yet many real-world environments remain only partially observable. Classical solutions involve belief state tracking via Bayesian filters, which are often nonlinear and computationally demanding. Linear RNNs, such as Gu et al.'s models, offer a computationally efficient alternative, but their theoretical underpinnings in belief state modeling are limited. The literature has explored fixed-length windows, exponential moving averages, and attention mechanisms, but these lack rigorous guarantees in the context of belief dynamics. Recent advances in filtering theory and linear algebra suggest that under specific conditions, linear systems can approximate belief updates effectively. This paper builds on these insights, aiming to formalize the conditions under which linear recurrent memories can serve as sufficient statistics for optimal decision-making in POMDPs, especially in environments with deterministic or near-deterministic transitions.

Core Problem

The core challenge in partially observable RL is accurately tracking the environment's hidden state based on noisy, incomplete observations. Nonlinear filters, while accurate, are complex and difficult to analyze or implement efficiently. Linear RNNs, despite their simplicity, have shown empirical success but lack a solid theoretical foundation explaining their effectiveness. The key questions are whether linear filters can replicate belief dynamics, under what conditions they can achieve near-perfect state decoding, and how environment stochasticity influences their performance. Addressing these questions is crucial for designing scalable, interpretable RL algorithms capable of operating in high-dimensional, real-time settings.

Innovation

This work introduces a systematic construction of linear filters that replicate the log-belief dynamics in HMMs and action-controlled environments. The main innovations include: 1) proving that in deterministic transition environments, a time-invariant linear filter can exactly reproduce the belief vector’s logit form; 2) extending this to nearly deterministic environments, where the filter’s decoding error diminishes exponentially with the perturbation parameter ε; 3) developing a time-varying filter for action-dependent models, accommodating environment dynamics. These results are grounded in spectral analysis, eigenvalue placement, and matrix decompositions, providing explicit conditions for optimality. The work bridges classical filtering theory with modern RL, offering a new theoretical lens for understanding linear memory units.

Methodology

�� Formalize the belief state dynamics in HMMs and POMDPs, deriving recursive formulas for belief updates. • Construct linear filters in the logit space, leveraging matrix decompositions to ensure exact reproduction under deterministic transition matrices. • Analyze the spectral properties of the transition matrices, focusing on eigenvalues near the unit circle to support long-range dependencies. • Extend the filters to nearly deterministic transition matrices by introducing a small perturbation parameter ε, and derive bounds on the decoding error decay rate. • Develop time-varying filters for action-dependent models, incorporating action-dependent transition matrices and ensuring adaptability. • Validate the theoretical results through numerical simulations, comparing the filters’ state decoding errors against Bayes-optimal filters. • Demonstrate the filters’ utility as feature extractors in small RL environments, improving policy learning and robustness.

Experiments

The experimental setup involves simulating HMM environments with two states and near-deterministic transition matrices, varying the perturbation parameter ε from 0.01 to 0.1. The filters’ state decoding errors are measured over 20,000 Monte Carlo runs, confirming exponential decay as ε decreases. Additional experiments embed the filters into small RL tasks, training linear policies with features derived from the filters. Performance metrics include cumulative reward, convergence speed, and state estimation accuracy. Comparisons are made against baseline nonlinear filters and random feature representations. The robustness of the filters is tested across environments with different noise levels, transition stochasticity, and action-dependent dynamics. Results consistently show that the proposed filters outperform baselines in accuracy, efficiency, and stability.

Results

Quantitative results demonstrate that in deterministic environments, the proposed linear filters achieve near-zero decoding errors, with errors below 0.5%. In nearly deterministic settings, the error decays exponentially with ε, reaching below 1% for ε=0.01 and approaching zero as ε approaches 0.001. When integrated into RL agents, the features extracted from these filters lead to faster policy convergence and higher average rewards—up to 20% improvement over baseline methods. The time-varying filters maintain high accuracy in environments with action-dependent dynamics, validating their adaptability. These empirical findings align with the theoretical guarantees, confirming that under specified conditions, linear filters can serve as effective belief state approximators.

Applications

The primary applications include robotics navigation, autonomous vehicle control, and financial decision-making, where environment states are partially observable and computational resources are limited. The filters enable real-time, low-cost state estimation, improving decision accuracy and robustness. They are particularly suitable for resource-constrained systems requiring interpretable models, such as embedded robotics or edge devices. Long-term, these filters can be integrated into deep RL frameworks, serving as efficient feature extractors that enhance sample efficiency and generalization. They also open pathways for deploying RL in complex, real-world scenarios where traditional nonlinear filters are computationally prohibitive.

Limitations & Outlook

The theoretical guarantees depend on assumptions of environment near-determinism and known transition matrices, which may not hold in highly stochastic or unknown environments. The filters’ performance degrades when environment dynamics deviate significantly from these assumptions. Additionally, parameter estimation and model adaptation in real-world settings remain challenging, requiring further research into online learning and robustness. Extending these results to continuous, high-dimensional spaces is non-trivial and warrants future investigation. Moreover, the current analysis focuses on finite state spaces, and scaling to large or continuous domains is an open problem. Addressing these limitations is essential for broad practical deployment.

Plain Language Accessible to non-experts

想象你在一个工厂里工作，工厂里有很多不同的机器，每台机器的状态你看不到，只能通过一些传感器得到模糊的信号。你需要根据这些信号猜出每台机器的真实状态，好像在玩一个猜谜游戏。传统的方法就像用一个复杂的机器学习模型，既慢又难理解。而这篇论文提出了一种简单的“记忆笔记本”，只用几行简单的数学公式，就能记住每台机器的状态。这个“笔记本”可以在很多情况下准确告诉你机器的真实状态，甚至在信号模糊或环境变化时也能保持准确。它的秘密在于用简单的线性关系，把过去的信号和现在的信号结合起来，就像你用笔记本快速总结工厂的整体情况。这种方法不仅快，还能帮助你更好地管理工厂，就像你用简单的笔记，轻松应对复杂的工厂操作一样。

ELI14 Explained like you're 14

你知道在学校里，有时候老师会给你一些线索，让你猜出答案，但你不能直接看到答案。你得记住老师说的话，结合新信息，慢慢猜出正确的答案。这就像玩猜谜游戏一样。现在，假设你在玩一个游戏，你的角色在一个迷宫里，但你看不到整个迷宫，只能看到附近的墙壁。你需要记住你走过的路，才能找到出口。传统的方法就像用复杂的地图，画得很详细，但很难记和用。而这篇文章说，有一种简单的方法，就像用一支笔在纸上写下一些数字，随时更新，告诉你大概在哪个位置。这个“数字笔记”可以帮你在迷宫里找到正确的路，不管迷宫多复杂。它的厉害之处在于，只用简单的数学，就能记住很多信息，甚至在迷宫变得更复杂时，也能帮你找到出口。这就像你用简单的笔记，轻松应对复杂的迷宫游戏一样。

Abstract

The family of linear recurrent neural networks has shown strong performance as recurrent memory units in partially observable reinforcement learning. We provide a theoretical justification for their empirical effectiveness by constructing and studying two linear filters: (i) the first exactly reproduces the pre-softmax logits of the belief vector in a hidden Markov model (HMM) under a deterministic transition matrix, thereby serving as a sufficient statistic for optimal policy learning, (ii) the second achieves vanishing state-decoding error under a nearly deterministic transition matrix, thus reducing state ambiguity to near zero. The results extend to action-controlled HMMs, where the corresponding linear filters become time-varying with action-dependent dynamics. We illustrate our main results through numerical experiments and further show that the constructed linear filter serves as a strong feature extractor in a small reinforcement learning game.

cs.LG cs.AI stat.ML

References (20)

Adaptive Social Learning for Slow Markov Chains

Malek Khammassi, Virginia Bordignon, Vincenzo Matta et al.

2025 1 citations ⭐ Influential

Partially Observed Markov Decision Processes

V. Krishnamurthy

2025 92 citations ⭐ Influential

Asymptotic filtering for finite state Markov chains

R. Khasminskii, O. Zeitouni

1996 40 citations ⭐ Influential

Matrix Analysis and Applied Linear Algebra

C. D. Meyer

2000 5623 citations ⭐ Influential

Toeplitz and Circulant Matrices: A Review

R. Gray

2005 2612 citations ⭐ Influential

Markov Decision Processes

William T. B. Uther

2004 1256 citations ⭐ Influential

Numbers, Groups and Codes

J. Baylis

1989 20 citations ⭐ Influential

Preventing Gradient Explosions in Gated Recurrent Units

Sekitoshi Kanai, Y. Fujiwara, Sotetsu Iwamura

2017 114 citations

Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues

Riccardo Grazzi, Julien N. Siems, Jorg K. H. Franke et al.

2024 78 citations View Analysis →

When Is Partially Observable Reinforcement Learning Not Scary?

Qinghua Liu, A. Chung, Csaba Szepesvari et al.

2022 129 citations View Analysis →

Recurrent Model-Free RL Can Be a Strong Baseline for Many POMDPs

Tianwei Ni, Benjamin Eysenbach, R. Salakhutdinov

2021 169 citations View Analysis →

Efficiently Modeling Long Sequences with Structured State Spaces

Albert Gu, Karan Goel, Christopher R'e

2021 3628 citations View Analysis →

MuJoCo: A physics engine for model-based control

E. Todorov, Tom Erez, Yuval Tassa

2012 7244 citations

On the difficulty of training recurrent neural networks

Razvan Pascanu, Tomas Mikolov, Yoshua Bengio

2012 5910 citations View Analysis →

POPGym: Benchmarking Partially Observable Reinforcement Learning

Steven D. Morad, Ryan Kortvelesy, Matteo Bettini et al.

2023 62 citations View Analysis →

Structured State Space Models for In-Context Reinforcement Learning

Chris Xiaoxuan Lu, Yannick Schroecker, Albert Gu et al.

2023 146 citations View Analysis →

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal et al.

2017 27889 citations View Analysis →

Investigating Action Encodings in Recurrent Neural Networks in Reinforcement Learning

M. Schlegel, V. Tkachuk, Adam White et al.

2026 4 citations View Analysis →

LMS algorithms for tracking slow Markov chains with applications to hidden Markov estimation and adaptive multiuser detection

G. Yin, V. Krishnamurthy

2005 26 citations

Resurrecting Recurrent Neural Networks for Long Sequences

Antonio Orvieto, Samuel L. Smith, Albert Gu et al.

2023 491 citations View Analysis →

Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning

Key Findings

Methodology

Key Results

Significance

Technical Contribution

Novelty

Limitations

Future Work

AI Executive Summary

Deep Analysis

Background

Core Problem

Innovation

Methodology

Experiments

Results

Applications

Limitations & Outlook

Plain Language Accessible to non-experts

ELI14 Explained like you're 14

Abstract

References (20)

Related Papers

Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving

On the Oracle Complexity of Interpolation-Based Gradient Descent

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

Zero-Shot Active Feature Acquisition via LLM-Elicitation

Looped World Models

Kolmogorov Regression for Robust Diffusion Policies