Solve the Loop: Attractor Models for Language and Reasoning
Attractor Models enhance language modeling and reasoning via fixed-point solving, improving perplexity by up to 46.6% and downstream accuracy by up to 19.7% while reducing training cost.
Key Findings
Methodology
The paper introduces Attractor Models, a novel architecture that treats latent refinement as a fixed-point problem in the output embedding space. The model first proposes an initial guess embedding using a non-recurrent backbone module (implemented as a Transformer). A separate, typically smaller, recurrent network then refines this guess. Gradients are obtained through implicit differentiation, keeping training memory constant while iterations are adaptively chosen based on convergence.
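The propose-then-refine loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `refine`, `toy_attractor`, the tolerance, and the two starting guesses are all hypothetical, and the "attractor module" is replaced by a simple contraction with a known equilibrium.

```python
import math

def refine(z0, step, tol=1e-8, max_iters=100):
    """Iterate z <- step(z) from the initial guess z0.

    Stops adaptively once two successive iterates differ by less than tol,
    and returns the final iterate together with the number of steps used.
    """
    z = list(z0)
    for k in range(1, max_iters + 1):
        z_next = step(z)
        change = math.sqrt(sum((a - b) ** 2 for a, b in zip(z_next, z)))
        z = z_next
        if change < tol:
            return z, k
    return z, max_iters

# Hypothetical stand-in for the attractor module: a contraction whose
# unique fixed point is the vector (1.0, -2.0).
def toy_attractor(z):
    target = (1.0, -2.0)
    return [0.5 * zi + 0.5 * ti for zi, ti in zip(z, target)]

# Two hypothetical "backbone" guesses: one far from equilibrium, one near it.
z_far, iters_far = refine([10.0, 10.0], toy_attractor)
z_near, iters_near = refine([1.001, -2.001], toy_attractor)
```

In this toy setting the near guess needs fewer iterations than the far one, which is the intuition behind choosing the iteration count by convergence rather than fixing it in advance.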
Key Results
- In large-scale language modeling, Attractor Models outperform parameter-matched Transformers and stable looped models across sizes (140M, 370M, 770M), achieving better validation perplexity, Lambada perplexity, and downstream benchmark accuracy while using significantly less training compute. Notably, a 770M Attractor Model surpasses a 1.3B Transformer trained on twice as many tokens.
- In challenging reasoning tasks, Attractor Models achieve 91.4% accuracy on Sudoku-Extreme and 93.1% on Maze-Hard with only 27M parameters and approximately 1000 training examples, outperforming standard Transformers and frontier LLMs like DeepSeek R1, Claude, and o3-mini, which fail completely.
- Attractor Models exhibit a novel phenomenon called equilibrium internalization: fixed-point training places the model's initial output embedding near equilibrium, allowing the solver to be removed at inference time with little degradation.
Significance
Attractor Models hold significant implications for language modeling and reasoning. They not only achieve substantial improvements in perplexity and accuracy in large-scale language modeling but also excel in small-data reasoning tasks, addressing long-standing challenges of training stability and computational cost in recurrent architectures. By turning recurrence into a computation the model can learn to internalize, Attractor Models make iterative refinement scalable.
Technical Contribution
The technical contribution of Attractor Models lies in treating latent refinement as a fixed-point problem and obtaining gradients through implicit differentiation. This approach differs from traditional recurrent architectures by avoiding unstable training processes and linearly growing memory requirements. By adaptively choosing the number of iterations, Attractor Models achieve efficient computation during both training and inference, significantly reducing computational costs.
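To make the implicit-differentiation idea concrete, here is a small scalar sketch. It assumes a hypothetical refinement map `f(z, theta) = 0.5 * tanh(z) + theta` (not taken from the paper) and uses the implicit function theorem to get the derivative of the fixed point with respect to `theta` from the converged value alone, then checks it against a finite-difference estimate.

```python
import math

def f(z, theta):
    # Hypothetical refinement map; a contraction in z because |df/dz| <= 0.5.
    return 0.5 * math.tanh(z) + theta

def solve_fixed_point(theta, z0=0.0, tol=1e-12, max_iters=200):
    """Plain fixed-point iteration for z* = f(z*, theta)."""
    z = z0
    for _ in range(max_iters):
        z_next = f(z, theta)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    return z

def implicit_grad(theta):
    """dz*/dtheta = (df/dtheta) / (1 - df/dz), evaluated at the fixed point.

    Only z* is needed; none of the intermediate iterates are stored.
    """
    z_star = solve_fixed_point(theta)
    df_dz = 0.5 * (1.0 - math.tanh(z_star) ** 2)
    df_dtheta = 1.0
    return df_dtheta / (1.0 - df_dz)

theta = 0.3
g_implicit = implicit_grad(theta)

# Central finite difference through the solver, used only as a sanity check.
eps = 1e-6
g_numeric = (solve_fixed_point(theta + eps) - solve_fixed_point(theta - eps)) / (2 * eps)
```

The two estimates should agree closely, even though the implicit version never differentiates through the loop itself. That is the property that keeps memory independent of how many iterations the solver takes.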
Novelty
The innovation of Attractor Models lies in recasting the latent refinement process of recurrent architectures as a fixed-point solving problem. According to the paper, this yields significant performance improvements in language modeling and reasoning tasks while reducing training cost. Compared to existing recurrent architectures, Attractor Models achieve more stable and efficient training through implicit differentiation and adaptive iteration selection.
Limitations
- Attractor Models may require longer convergence times on certain complex tasks, especially when the initial embedding is far from the equilibrium point.
- The computational complexity of implicit differentiation may limit the model's application in some scenarios due to resource constraints.
- The model's performance on specific tasks may be limited by the diversity and scale of the training data.
Future Work
Future research directions include exploring the application of Attractor Models to more tasks and datasets, further optimizing their convergence speed and computational efficiency across different tasks. Additionally, integrating other advanced architectures and techniques, such as graph neural networks and self-supervised learning, could further enhance the model's performance and adaptability.
AI Executive Summary
Transformer models dominate modern language modeling, yet they rely on fixed feed-forward computation: each token is produced by a single pass, with no ability to refine latent predictions before committing to an output. Attractor Models introduce a novel approach by incorporating fixed-point solving into recurrent architectures for language modeling and reasoning.
Attractor Models consist of two modules: a non-recurrent backbone module and a recurrent attractor module. The backbone module first proposes an initial output embedding, which the attractor module refines through fixed-point iteration. Gradients are obtained via implicit differentiation, keeping training memory constant while iterations are adaptively chosen based on convergence. This method not only enhances training efficiency but also significantly improves performance in language modeling and reasoning tasks.
In experiments, Attractor Models excel in both large-scale language modeling and challenging reasoning tasks. Notably, in Sudoku-Extreme and Maze-Hard tasks, Attractor Models achieve 91.4% and 93.1% accuracy, respectively, with only 27M parameters and approximately 1000 training examples, outperforming frontier models like DeepSeek R1, Claude, and o3-mini.
A key contribution of Attractor Models is the phenomenon of equilibrium internalization: fixed-point training places the model's initial output embedding near equilibrium, allowing the solver to be removed at inference time with little degradation. This indicates that Attractor Models can turn recurrence into a computation the model can learn to internalize, making iterative refinement scalable.
Despite their impressive performance across various tasks, Attractor Models may require longer convergence times on certain complex tasks. Additionally, the computational complexity of implicit differentiation may pose challenges in resource-constrained environments. Future research directions include exploring the application of Attractor Models to more tasks and datasets, further optimizing their convergence speed and computational efficiency.
Deep Analysis
Background
In recent years, Transformer models have become mainstream due to their outstanding performance in language modeling. However, their fixed feed-forward computation limits their ability to refine latent predictions before generating each token. As the demands of language modeling and reasoning tasks increase, researchers have begun exploring recurrent architectures, which iteratively refine latent representations before generating outputs. Traditional recurrent architectures, however, face challenges in training stability and computational cost, prompting the search for new methods to overcome these limitations.
Core Problem
The success of Transformer models in language modeling has masked their shortcomings in latent refinement capabilities. The fixed feed-forward computation means that each token generation relies on a single pass of computation, without refining latent predictions before output. This limitation is particularly evident in complex reasoning tasks, which often require multiple iterative computations for precise results. Additionally, traditional recurrent architectures face challenges in training stability and computational cost, limiting their widespread adoption in practical applications.
Innovation
Attractor Models innovate by treating the latent refinement process as a fixed-point problem in the output embedding space. First, the model uses a non-recurrent backbone module to propose an initial guess embedding, which is then refined by a separate, typically smaller, recurrent network. Gradients are obtained through implicit differentiation, keeping training memory constant while iterations are adaptively chosen based on convergence. This approach not only enhances training efficiency but also significantly improves performance in language modeling and reasoning tasks.
Methodology
- Attractor Models consist of two modules: a non-recurrent backbone module and a recurrent attractor module.
- The backbone module first proposes an initial output embedding, which the attractor module refines through fixed-point iteration.
- Gradients are obtained via implicit differentiation, keeping training memory constant while iterations are adaptively chosen based on convergence.
- During inference, the solver can be removed with little performance degradation, thanks to equilibrium internalization.
Experiments
The experimental design includes evaluating Attractor Models on large-scale language modeling and challenging reasoning tasks. In language modeling, models with 140M, 370M, and 770M parameters are compared for their performance on validation perplexity, Lambada perplexity, and downstream benchmark accuracy. In reasoning tasks, Sudoku-Extreme and Maze-Hard are chosen as benchmarks to assess the performance of Attractor Models on small datasets. The experiments also include comparisons with standard Transformers and frontier LLMs to validate the superiority of Attractor Models.
Results
The experimental results show that Attractor Models excel in both large-scale language modeling and challenging reasoning tasks. In language modeling, Attractor Models outperform parameter-matched Transformers and stable looped models across sizes (140M, 370M, 770M), achieving better validation perplexity, Lambada perplexity, and downstream benchmark accuracy while using significantly less training compute. In reasoning tasks, Attractor Models achieve 91.4% accuracy on Sudoku-Extreme and 93.1% on Maze-Hard with only 27M parameters and approximately 1000 training examples.
Applications
Attractor Models have wide application potential in language modeling and reasoning tasks. In language modeling, they can be used to improve the quality and efficiency of text generation, especially in long-text generation and complex context understanding. In reasoning tasks, they can be used to solve complex logical reasoning problems, such as Sudoku and maze solving. Additionally, the efficient computation and stability of Attractor Models make them advantageous in resource-constrained environments.
Limitations & Outlook
Despite their impressive performance across various tasks, Attractor Models may require longer convergence times on certain complex tasks. Additionally, the computational complexity of implicit differentiation may pose challenges in resource-constrained environments. The model's performance on specific tasks may be limited by the diversity and scale of the training data. Future research directions include exploring the application of Attractor Models to more tasks and datasets, further optimizing their convergence speed and computational efficiency.
Plain Language (accessible to non-experts)
Imagine you're cooking in a kitchen. Traditional Transformer models are like a chef who follows a recipe step by step without thinking, completing each step mechanically. Attractor Models, on the other hand, are like an experienced chef who tastes and adjusts at each step until the dish is perfect. This experienced chef pauses at each step, tastes, and adjusts based on the feedback. This method not only improves the quality of the dish but also reduces waste, as the chef can correct mistakes in time. Attractor Models treat this iterative adjustment process as a fixed-point problem, allowing the model to reach the optimal state for each token generation. In this way, Attractor Models not only improve performance in language modeling and reasoning tasks but also significantly reduce computational costs.
ELI14 (explained like you're 14)
Hey there! Have you ever played a game where you need to keep adjusting your strategy? Like, you have to keep trying different paths until you find the best one. Attractor Models are like a super-smart gamer who stops to think after each attempt and adjusts their strategy based on past experiences. This way, they always find the fastest and easiest way to win! Traditional models, on the other hand, are like a player who only follows fixed steps and might get stuck on some levels because they don't adjust their strategy. Attractor Models excel in language modeling and can easily handle complex reasoning tasks with this iterative adjustment method. Isn't that cool?
Glossary
Transformer
A deep learning model used for natural language processing that employs self-attention mechanisms to process input data.
In the paper, Transformers are used as the backbone module of Attractor Models.
Fixed Point
In mathematics, a fixed point of a function is a point that is mapped to itself by the function.
Attractor Models treat the latent refinement process as a fixed-point problem in the output embedding space.
Implicit Differentiation
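A method of computing derivatives of a quantity that is defined implicitly by an equation rather than by an explicit formula. Applied to a fixed point, it yields gradients from the converged solution alone, without backpropagating through each solver iteration.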
In Attractor Models, implicit differentiation is used to obtain gradients, keeping training memory constant.
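For reference, the standard implicit-function-theorem identity used by deep equilibrium models can be written as follows. The notation is an assumption made for illustration and may differ from the paper's: f_θ is the refinement map, x the input, z* the equilibrium, and L the training loss.

```latex
z^{*} = f_{\theta}(z^{*}, x),
\qquad
\frac{\partial \mathcal{L}}{\partial \theta}
  = \frac{\partial \mathcal{L}}{\partial z^{*}}
    \left( I - \left.\frac{\partial f_{\theta}}{\partial z}\right|_{z^{*}} \right)^{-1}
    \left.\frac{\partial f_{\theta}}{\partial \theta}\right|_{z^{*}}
```

The right-hand side depends only on quantities evaluated at z*, so no intermediate iterates have to be stored, which is why memory does not grow with the number of solver steps.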
Perplexity
A metric used to evaluate the performance of language models; lower values indicate more accurate predictions.
In experiments, perplexity is used to assess the performance of Attractor Models in language modeling tasks.
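As a quick illustration of the metric, perplexity is the exponential of the average negative log-likelihood the model assigns to the correct tokens. The helper name `perplexity` and the probability lists below are made up for this example.

```python
import math

def perplexity(token_probs):
    """Exponential of the mean negative log-probability of the correct tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that spreads probability evenly over 4 choices at every step.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# A model that is usually confident about the correct token.
confident = perplexity([0.9, 0.8, 0.95, 0.85])
```

The uniform case gives a perplexity of about 4, matching the intuition of "choosing among 4 equally likely options", while the confident model scores much closer to 1.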
Sudoku-Extreme
A complex variant of Sudoku, typically used to test the reasoning capabilities of models.
In the paper, Sudoku-Extreme is used as a benchmark task to evaluate the reasoning capabilities of Attractor Models.
Maze-Hard
A complex maze-solving task used to test the reasoning capabilities of models.
In the paper, Maze-Hard is used as a benchmark task to evaluate the reasoning capabilities of Attractor Models.
Equilibrium Internalization
A phenomenon where fixed-point training places the model's initial output embedding near equilibrium.
In Attractor Models, equilibrium internalization allows the solver to be removed at inference time with little performance degradation.
Anderson Acceleration
A technique used to accelerate fixed-point iteration by combining past iterates and residuals to reach the fixed point faster.
In Attractor Models, Anderson acceleration is used to improve the convergence speed of the solver.
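The sketch below shows the simplest form of the idea, Anderson acceleration with a history of one previous iterate, next to plain fixed-point iteration. It is a generic textbook variant and not the solver configuration used in the paper. The function names and the cosine test map are hypothetical, and the demo is one-dimensional, where depth-1 Anderson reduces to the secant method.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def plain_iteration(g, x0, tol=1e-10, max_iters=500):
    """x <- g(x) until the residual g(x) - x is below tol."""
    x = list(x0)
    for k in range(1, max_iters + 1):
        gx = g(x)
        res = [a - b for a, b in zip(gx, x)]
        if math.sqrt(dot(res, res)) < tol:
            return gx, k
        x = gx
    return x, max_iters

def anderson_depth1(g, x0, tol=1e-10, max_iters=500):
    """Anderson acceleration mixing the current and one previous iterate."""
    x = list(x0)
    prev_gx = prev_res = None
    for k in range(1, max_iters + 1):
        gx = g(x)
        res = [a - b for a, b in zip(gx, x)]
        if math.sqrt(dot(res, res)) < tol:
            return gx, k
        if prev_res is None:
            x_new = gx  # first step: no history yet, take a plain step
        else:
            # Least-squares mixing weight that minimises the combined residual.
            d_res = [a - b for a, b in zip(res, prev_res)]
            denom = dot(d_res, d_res)
            gamma = dot(res, d_res) / denom if denom > 0.0 else 0.0
            x_new = [a - gamma * (a - b) for a, b in zip(gx, prev_gx)]
        prev_gx, prev_res = gx, res
        x = x_new
    return x, max_iters

def g(x):
    # Hypothetical test map: componentwise cosine, fixed point near 0.739.
    return [math.cos(xi) for xi in x]

x_plain, n_plain = plain_iteration(g, [1.0])
x_fast, n_fast = anderson_depth1(g, [1.0])
```

Both routines reach the same fixed point, but the accelerated version needs far fewer evaluations of `g` on this example.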
Deep Equilibrium Model
A model that predicts by solving for a fixed point of hidden states.
Attractor Models are inspired by Deep Equilibrium Models but solve for fixed points in the output embedding space.
Root Finder
An algorithm used to find the zeros of a function.
In Attractor Models, a root finder is used to compute the equilibrium of the output embedding.
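The link between fixed points and roots is that z* = f(z*) holds exactly when h(z) = f(z) - z is zero. The following sketch applies Newton's method to that h for a hypothetical scalar map f = cos. It only illustrates the reformulation and makes no claim about which root finder the paper uses.

```python
import math

def newton_root(h, dh, z0, tol=1e-12, max_iters=50):
    """Newton's method: repeatedly move z by -h(z) / h'(z)."""
    z = z0
    for _ in range(max_iters):
        step = h(z) / dh(z)
        z -= step
        if abs(step) < tol:
            break
    return z

f = math.cos                           # hypothetical map whose fixed point we want
h = lambda z: f(z) - z                 # root of h  <=>  fixed point of f
dh = lambda z: -math.sin(z) - 1.0      # derivative of h

z_star = newton_root(h, dh, z0=1.0)
```

After the call, `f(z_star)` and `z_star` agree to within floating-point tolerance, confirming that the root of h is a fixed point of f.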
Open Questions
- The convergence speed of Attractor Models on certain complex tasks remains an open question. Although the model performs well on most tasks, it may require longer convergence times when the initial embedding is far from the equilibrium point. Future research could explore how to optimize the model's initial embedding to accelerate the convergence process.
- The computational complexity of implicit differentiation may limit the application of Attractor Models in some scenarios. Although implicit differentiation keeps training memory constant, its computational cost may pose challenges in resource-constrained environments. Research on simplifying the computation of implicit differentiation could be an important future direction.
- The performance of Attractor Models on specific tasks may be limited by the diversity and scale of the training data. Although the model performs well on small datasets, it may not fully realize its potential when data diversity is insufficient. Future research could explore how to enhance the model's generalization ability through data augmentation and transfer learning.
- Although equilibrium internalization is observed in the reported experiments, its applicability across different tasks and datasets requires further validation. Research on reliably achieving equilibrium internalization in different tasks could be an important future direction.
- Despite the excellent performance of Attractor Models in reasoning tasks, their application potential in other fields remains to be explored. Research on applying Attractor Models to image processing, speech recognition, and other fields could provide new application scenarios and development directions.
Applications
Immediate Applications
Text Generation
Attractor Models can be used to improve the quality and efficiency of text generation, especially in long-text generation and complex context understanding.
Logical Reasoning
In reasoning tasks, Attractor Models can be used to solve complex logical reasoning problems, such as Sudoku and maze solving.
Resource-Constrained Environments
The efficient computation and stability of Attractor Models make them advantageous in resource-constrained environments, suitable for mobile devices and embedded systems.
Long-term Vision
Cross-Domain Applications
Research on applying Attractor Models to image processing, speech recognition, and other fields could provide new application scenarios and development directions.
Automated Decision Systems
The reasoning capabilities of Attractor Models can be used to develop automated decision systems, improving decision accuracy and efficiency.
Abstract
Looped Transformers offer a promising alternative to purely feed-forward computation by iteratively refining latent representations, improving language modeling and reasoning. Yet recurrent architectures remain unstable to train, costly to optimize and deploy, and constrained to small, fixed recurrence depths. We introduce Attractor Models, in which a backbone module first proposes output embeddings, then an attractor module refines them by solving for the fixed point, with gradients obtained through implicit differentiation. Thus, training memory remains constant in effective depth, and iterations are chosen adaptively by convergence. Empirically, Attractor Models outperform existing models across two regimes, large-scale language-model pretraining and reasoning with tiny models. In language modeling, Attractor Models deliver a Pareto improvement over standard Transformers and stable looped models across sizes, improving perplexity by up to 46.6% and downstream accuracy by up to 19.7% while reducing training cost. Notably, a 770M Attractor Model outperforms a 1.3B Transformer trained on twice as many tokens. On challenging reasoning tasks, we show that our model with only 27M parameters and approximately 1000 examples achieves 91.4% accuracy on Sudoku-Extreme and 93.1% on Maze-Hard, scaling favorably where frontier models like Claude and GPT o3 fail completely, and specialized recursive reasoners collapse at larger sizes. Lastly, we show that Attractor Models exhibit a novel phenomenon, which we call equilibrium internalization: fixed-point training places the model's initial output embedding near equilibrium, allowing the solver to be removed at inference time with little degradation. Together, these results suggest that Attractor Models make iterative refinement scalable by turning recurrence into a computation the model can learn to internalize.